Harvesting big text data for under-resourced languages
Published
Updated 23-12-2014
- region changed
Region | National coverage |
---|---|
Title of the Programme | Czech-Norwegian Research Programme |
Title of the Project | Harvesting big text data for under-resourced languages |
Number of the Project | 7F14047 |
Project Promoter | Masaryk University, Brno |
Name of Norwegian Partner(s) | Norges teknisk-naturvitenskapelige universitet |
Objective of the Project | The main goal of the project is to harvest large-scale textual data from the Web for under-resourced languages (Norwegian, partly Czech, and the four major languages of Ethiopia: Amharic, Afaan Oromo, Tigrinya and Somali) and to build shallow processing applications for them. The data will be annotated and parsed to make it usable in various language processing applications, such as information extraction and retrieval, machine translation, etc. The project results will also be used in external cooperation with the University of Oslo and two Ethiopian universities, in a Norad-funded project to support linguistic resource building in Ethiopia. The applications developed will be used to investigate and separate the multiple senses of words in the corpora, to perform word sense induction, and to create multi-sense vector spaces and parallel multilingual vector spaces for word translation disambiguation. |
Approved grant | 923 321 EUR |
Project Duration | Start date: 15th July 2014, End date: 30th April 2017 |