Monday, June 6, 2016

Finding Relevant Data in a Sea of Languages

MIT News (05/27/16) Ariana Tantillo; Dorothy Ryan 

Researchers in the Human Language Technology (HLT) Group at the Massachusetts Institute of Technology's Lincoln Laboratory are working to provide multilingual content analysis despite a shortage of analysts with the necessary language skills. Their work could benefit law enforcement, the U.S. Department of Defense, and the intelligence community. The HLT team draws on advances in language recognition, speaker recognition, speech recognition, machine translation, and information retrieval to automate language-processing tasks, so that the available linguists who analyze foreign-language text and speech can be used more efficiently. The team is concentrating on cross-language information retrieval (CLIR) using the Cross-LAnguage Search Engine (CLASE), which lets monolingual English-speaking analysts help search for and filter foreign-language documents. The researchers use probabilistic CLIR based on machine-translation lattices: each document is machine-translated into English as a lattice that contains all possible translations along with their respective probabilities. When an analyst queries a document collection, the documents containing the most likely matching translations are extracted for analysis. CLIR results are assessed by precision, recall, and their harmonic mean, the F-measure. Meanwhile, HLT's Jennifer Williams is developing algorithms that identify the language of text data so CLASE can select the appropriate machine-translation models, and others are working on automatic multilingual text-translation systems.
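To make the retrieval and evaluation steps concrete, here is a minimal Python sketch. It is not CLASE or the Lincoln Laboratory method: it assumes each document's translation lattice has already been reduced to a map from English terms to probabilities, and it scores a query by multiplying those probabilities as if the terms were independent. The document names, probabilities, threshold, and function names are all hypothetical.

```python
from math import prod

# Hypothetical toy data: each lattice is flattened to
# English term -> probability that the term appears in the translation.
LATTICES = {
    "doc_a": {"border": 0.9, "crossing": 0.6, "market": 0.1},
    "doc_b": {"border": 0.2, "harvest": 0.8},
    "doc_c": {"crossing": 0.7, "border": 0.5},
}


def score(query_terms, term_probs):
    """Product of per-term probabilities (independence is an assumption).

    A query term absent from the lattice contributes probability 0.
    """
    return prod(term_probs.get(term, 0.0) for term in query_terms)


def retrieve(query_terms, lattices, threshold):
    """Return the set of documents whose score meets the threshold."""
    return {
        doc
        for doc, term_probs in lattices.items()
        if score(query_terms, term_probs) >= threshold
    }


def precision_recall_f(retrieved, relevant):
    """Precision, recall, and their harmonic mean (the F-measure)."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    retrieved = retrieve(["border", "crossing"], LATTICES, threshold=0.3)
    # Hypothetical ground truth: only doc_a is actually relevant.
    precision, recall, f_measure = precision_recall_f(retrieved, {"doc_a"})
    print(sorted(retrieved), precision, recall, round(f_measure, 3))
```

With this toy data the query retrieves two documents, only one of which is relevant, so recall is perfect while precision is one half. The F-measure sits between the two and is pulled toward the lower value, which is why it is a useful single-number summary.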