Harvesting big text data for under-resourced languages

Department 58 – International Relations

Published 10-11-2014

Updated 23-12-2014

Project
Region	National coverage
Title of the Programme	Czech-Norwegian Research Programme
Title of the Project	Harvesting big text data for under-resourced languages
Number of the Project	7F14047
Project Promoter	Masaryk University, Brno www.muni.cz
Name of Norwegian Partner(s)	Norges teknisk-naturvitenskapelige universitet
Objective of the Project	The main goal of the project is to harvest large-scale textual data from the Web for under-resourced languages (Norwegian, partly Czech, and the four major languages of Ethiopia: Amharic, Afaan Oromo, Tigrinya and Somali) and to build shallow processing applications for them. The data will be annotated and parsed to make it usable in various language processing applications, such as information extraction and retrieval and machine translation. The project results will also be used in external cooperation with the University of Oslo and two Ethiopian universities, in a Norad-funded project supporting linguistic resource building in Ethiopia. The developed applications will be used to investigate and separate multiple senses of words in the corpora, to perform word sense induction, and to create multi-sense vector spaces and parallel multilingual vector spaces for word translation disambiguation.
Approved grant	923 321 EUR
Project Duration	Start date: 15th July 2014, End date: 30th April 2017