CLIRch, an extensible open source framework for query translation: evaluated for use on the Norwegian/Spanish language pair.
CLIR, Cross-Lingual Information Retrieval, is a field of research that can be
highly useful in web search and for several other applications. Extensive
research has been done on possible CLIR implementations, but as yet no open
source frameworks or applications are readily available. This thesis
focuses on building such a framework and evaluating it for use on the
Norwegian/Spanish language pair.
The framework implemented uses query translation to submit queries to existing
information retrieval (IR) implementations, and the framework itself holds no
low-level IR algorithms. Experiments were performed on a small parallel corpus
of Norwegian and Spanish texts, using the Xapian and PostgreSQL IR
implementations. A comprehensive comparison of possible configurations was
carried out, and certain measures were shown to be effective when searching for
documents in either language.
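To make the query translation step concrete, the following is a minimal sketch and not the thesis implementation. It assumes a simple word-by-word dictionary lookup; the dictionary `NO_ES`, the functions `translate_query` and `cross_lingual_search`, and the in-memory `toy_backend` are all hypothetical names introduced for illustration. In the framework described above, the backend role would be filled by an existing IR implementation such as Xapian or PostgreSQL.

```python
from typing import Callable, Dict, List

# Hypothetical toy Norwegian -> Spanish dictionary (illustrative only,
# not data from the thesis).
NO_ES: Dict[str, List[str]] = {
    "hus": ["casa"],
    "bil": ["coche", "carro"],
    "rød": ["rojo"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each source term with all of its candidate translations.

    Terms missing from the dictionary are passed through unchanged,
    which keeps names and other untranslatable tokens searchable.
    """
    translated: List[str] = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def cross_lingual_search(
    query: str,
    dictionary: Dict[str, List[str]],
    backend: Callable[[List[str]], List[int]],
) -> List[int]:
    """Translate the query, then delegate retrieval to an existing backend."""
    return backend(translate_query(query, dictionary))


def toy_backend(terms: List[str]) -> List[int]:
    """Stand-in for a real IR engine: match any term against tiny documents."""
    docs = {1: "la casa roja", 2: "un coche rojo", 3: "el carro azul"}
    return [
        doc_id
        for doc_id, text in docs.items()
        if any(term in text.split() for term in terms)
    ]


if __name__ == "__main__":
    print(cross_lingual_search("rød bil", NO_ES, toy_backend))
```

The sketch keeps translation and retrieval separate, mirroring the statement that the framework itself holds no low-level IR algorithms. Note that exact matching misses the inflected form "roja", which hints at why language-specific processing matters.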
The framework has a modular architecture, so the suggested additions and
amendments can be implemented as add-on components. This extensibility is the
main intent of the framework, and it also eases the process of building support
for additional languages. Additional components and data may help ease adoption
of the framework.
Some improvements are also possible for the tested language pair, through
obtaining larger data sets or implementing certain language-specific
algorithms. Of particular interest are effective decompounding of Norwegian
compound words and support for phrase translation. Suggestions are also
made for how the system can be used to perform CLIR tasks in other languages.
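As an illustration of what decompounding involves, here is a minimal sketch of a generic lexicon-based splitter. The abstract does not specify an algorithm, so this is an assumption and not the approach proposed in the thesis. The tiny `LEXICON`, the linking elements in `LINKERS`, and the function `decompound` are hypothetical. The idea is to split a compound into known words, optionally skipping a linking "s" or "e" between parts, so that each part can be translated separately.

```python
from typing import List, Optional, Set

# Hypothetical toy lexicon of Norwegian base words (illustrative only).
LEXICON: Set[str] = {
    "arbeid", "plass", "barn", "hage", "fotball", "fot", "ball", "kamp",
}

# Linking elements that may appear between compound parts; the empty
# string means the parts are joined directly.
LINKERS = ("", "s", "e")


def decompound(word: str, lexicon: Set[str], min_len: int = 3) -> Optional[List[str]]:
    """Split `word` into lexicon words, or return None if no split is found.

    Longer heads are tried first, so "fotballkamp" becomes
    ["fotball", "kamp"] and not ["fot", "ball", "kamp"].
    """
    if word in lexicon:
        return [word]
    for i in range(len(word) - min_len, min_len - 1, -1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            # Skip the linker only if enough characters remain for a word.
            if rest.startswith(linker) and len(rest) - len(linker) >= min_len:
                tail = decompound(rest[len(linker):], lexicon, min_len)
                if tail is not None:
                    return [head] + tail
    return None


if __name__ == "__main__":
    for compound in ("arbeidsplass", "barnehage", "fotballkamp"):
        print(compound, decompound(compound, LEXICON))
```

A realistic splitter would need a full lexicon and some way to rank competing splits; the longest-head preference here is only a simple heuristic.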
Place, publisher, year, edition, pages
Institutt for datateknikk og informasjonsvitenskap, 2012. 64 p.
ntnudaim:5665, MIT informatikk, Kunstig intelligens og læring
Identifiers
URN: urn:nbn:no:ntnu:diva-18357
Local ID: ntnudaim:5665
OAI: oai:DiVA.org:ntnu-18357
DiVA: diva2:565861
Gamback, Bjørn, Professor