I. Alegria (1), A. Gurrutxaga (2), P. Lizaso (2), X. Saralegi (2), S. Ugartetxea (2), R. Urizar (1)
(1) Ixa taldea, University of the Basque Country
acpalloi@si.ehu.es
(2) Elhuyar Fundazioa
agurrutxaga@elhuyar.com
ERAUZTERM: an XML-based term extraction tool for Basque
1. Introduction
The aim of this project is to develop a modular term extraction tool for Basque, based on linguistic and statistical techniques and on XML technology. Since Basque is a highly inflected language, morphosyntactic information is vital. Furthermore, as the unification process of the language is not yet finished, texts show greater term dispersion than in a highly normalized language.
This project is part of the Hizking21 project ( www.hizking21.org ), which was set up to support basic research in language engineering and to pursue technical and scientific goals in the field of language technologies.
The specialized corpus used consists of 2,706,809 words and includes all the articles published by the Elhuyar Foundation on the Zientzia.net site ( www.zientzia.net ) up to 2003. A 13,756-word sample, composed exclusively of popular-science articles on computer science, has been processed manually. In the second phase of the project (2004), a larger hand-tagged corpus will be available.
2. Extractor design
For the selection of terminological units, only NPs are taken into account. To select the most common NP structures of Basque terms, we took as a starting point previous work by the IXA group (Urizar et al. 2000). In addition to morphosyntactic pattern detection, term normalization is applied to reduce term variation. After the linguistic process, a raw list of term candidates is generated. The statistical module then ranks this list and selects the most probable combinations.
2.1 Linguistic process
The system uses Euslem for morphological tagging and disambiguation, a step that is essential for detecting and marking up the word combinations corresponding to term structures. As a result, different word combinations sharing the same canonical form can be related. The selection of morphosyntactic patterns is based on shallow syntactic analysis (using the information provided by the tagger) and is carried out by a finite-state transducer. To improve the accuracy of the extractor, the results of this process have been checked against the hand-tagged corpus in a feedback process.
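As a rough illustration of the pattern-selection step, the sketch below matches POS-tag bigrams against a small set of NP patterns. The tags, patterns and example tokens are purely illustrative; the actual system uses Euslem's tagset and a finite-state transducer rather than this simplified lookup.

```python
# Illustrative NP patterns over POS tags (noun-noun, noun-adjective).
# These are stand-ins, not the real grammar used by ERAUZTERM.
NP_PATTERNS = {("N", "N"), ("N", "ADJ")}

def extract_candidates(tagged):
    """Return lemma bigrams whose POS sequence matches an NP pattern.

    `tagged` is a list of (lemma, POS) pairs, as a tagger such as
    Euslem might produce after disambiguation.
    """
    candidates = []
    for (lemma1, tag1), (lemma2, tag2) in zip(tagged, tagged[1:]):
        if (tag1, tag2) in NP_PATTERNS:
            candidates.append((lemma1, lemma2))
    return candidates

# Hypothetical tagged sentence fragment ("datu base erlazional ...").
tagged = [("datu", "N"), ("base", "N"), ("erlazional", "ADJ"), ("izan", "V")]
print(extract_candidates(tagged))  # [('datu', 'base'), ('base', 'erlazional')]
```

Working on lemmas rather than surface forms is what lets different inflected variants of the same combination be related to one canonical form.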
To deal with terms embedded in longer combinations (nested terms), the linguistic module decomposes maximal NPs into sub-structures. The syntactic constituents (head and expansion) with the highest probability of containing a term are kept.
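The decomposition step can be sketched by enumerating the sub-spans of a maximal NP that themselves match a term pattern; this brute-force version is a simplified stand-in for the head/expansion selection described above, and the patterns and tags are invented for illustration.

```python
# Illustrative term patterns, including single nouns; not the real grammar.
NP_PATTERNS = {("N",), ("N", "N"), ("N", "ADJ"), ("N", "N", "ADJ")}

def decompose(maximal_np):
    """Yield proper sub-spans of a maximal NP that match a term pattern.

    `maximal_np` is a list of (lemma, POS) pairs; the full span itself
    is excluded, since it is already a candidate.
    """
    nested = []
    for i in range(len(maximal_np)):
        for j in range(i + 1, len(maximal_np) + 1):
            span = maximal_np[i:j]
            if span != maximal_np and tuple(t for _, t in span) in NP_PATTERNS:
                nested.append([lemma for lemma, _ in span])
    return nested

# Hypothetical maximal NP: "datu base erlazional" (relational database).
maximal = [("datu", "N"), ("base", "N"), ("erlazional", "ADJ")]
print(decompose(maximal))
```

In the real system only the constituents most likely to contain a term are kept, rather than every matching sub-span.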
2.2 Statistical process
Detecting single-word terms is an important challenge for a real system. To achieve this, the relative frequency of each noun in the specialized corpus is contrasted with its relative frequency in a general corpus.
Relative frequency, combined with word-association measures, governs the ranking of multiword terms. Our approach here is based on Mutual Information (MI), the Log-likelihood Ratio (LR), Mutual Expectation (ME) and log-linear measures. So far, the best results have been obtained with the LR measure on noun-noun (NN) combinations.
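As an illustration of the best-performing measure, here is a minimal implementation of Dunning's log-likelihood ratio for a bigram. The counts are invented; in the system this would be computed over the candidate list produced by the linguistic module.

```python
import math

def llr(c12, c1, c2, n):
    """Dunning's log-likelihood ratio for a bigram (w1, w2).

    c12: bigram count, c1: count of w1, c2: count of w2,
    n: total number of bigrams in the corpus.
    Assumes 0 < c12 < c1 and c12 < c2 < n (no degenerate counts).
    """
    def ll(k, m, p):
        # Binomial log-likelihood of k successes in m trials.
        return k * math.log(p) + (m - k) * math.log(1 - p)

    p = c2 / n                   # P(w2) under independence
    p1 = c12 / c1                # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)   # P(w2 | not w1)
    return 2 * (ll(c12, c1, p1) + ll(c2 - c12, n - c1, p2)
                - ll(c12, c1, p) - ll(c2 - c12, n - c1, p))

# A hypothetical NN bigram seen 20 times, with unigram counts 50 and 60,
# in a corpus of 10,000 bigrams: a strongly associated pair.
print(round(llr(20, 50, 60, 10_000), 1))
```

A higher score indicates a stronger association between the two words; ranking candidates by this score is what pushes true multiword terms towards the top of the list.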
The result is a ranked list, which can be displayed in full for semi-automatic term extraction or cut off at a threshold to increase precision in fully automatic processing (e.g. information retrieval). In the first case, a tool helps the user manage the terminology.
3. The tool
The application has been designed to accept various document formats. The context of each extracted term will also be available to the user. A three-tier logical architecture has been defined: user interface, process logic and data management. The physical design relies on a web browser, a web server and a native XML database (Berkeley DB XML).
4. Evaluation
The state of the art in this area shows that high recall and high precision cannot be achieved at the same time. However, since the extracted terms are checked manually (the first requirement of our system), high recall and a good interface are the main aims. Results for LR on bigrams:
                                     Recall  Precision  Automatic  Manual  Guessed
With respect to hand-tagged corpus    78.43      27.78        720     255      200

These results are close to those reported in related work, and we believe they can be improved in the near future. The analysis of errors reveals limitations of the linguistic tools in two main areas: a) identifying foreign words; b) postpositions. Postpositions produce a non-negligible amount of noise; to reduce it, new rules are being added to the grammar. Work on improving the tagger is in progress to address the first problem.
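The recall and precision figures follow directly from the raw counts reported above (720 automatically extracted candidates, 255 manually marked terms, 200 correct guesses):

```python
# Raw evaluation counts from the LR-bigram experiment.
automatic, manual, guessed = 720, 255, 200

recall = 100 * guessed / manual        # share of manual terms found
precision = 100 * guessed / automatic  # share of candidates that are terms

print(f"{recall:.2f} {precision:.2f}")  # 78.43 27.78
```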
5. Conclusion
The results are satisfactory, and we plan to improve them in the second phase of the project, in which machine-learning techniques and semantic information will be used. Moreover, this project is integrative, as it continues previous research in this field, and it is a key project because tools developed for other languages yield poor results for Basque.