Online dual-language dictionaries are among the most essential components of the foreign language learning process. How efficiently learners can obtain the lexical data they need from such dictionaries translates directly into the efficiency of learning. Among other aspects, the order in which retrieved translations are presented is of great importance. Developers of online dictionaries often merge different source dictionaries into a single dataset, which leaves translations in no systematic order.
Ranking by word frequencies obtained from national corpora does not solve the problem, since the position of a translation in the list should depend not on how often the word occurs in texts, but on how often it is used in the relevant sense. Moreover, ranking must be done separately for translations belonging to different part-of-speech groups (nouns, verbs, etc.). To gauge the magnitude of the ranking effort, the scope of the work should be determined beforehand.
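The distinction between word frequency and sense frequency can be illustrated with a minimal sketch. The field names (`pos`, `word_freq`, `sense_freq`), the function `rank_translations`, and all numeric values below are invented for illustration; they are not taken from the LexSite dataset and do not represent the ranking method used there.

```python
from collections import defaultdict

def rank_translations(translations):
    """Group translations by part of speech, then order each group by
    descending sense frequency rather than by overall word frequency."""
    groups = defaultdict(list)
    for t in translations:
        groups[t["pos"]].append(t)
    return {
        pos: [t["target"] for t in sorted(items, key=lambda t: t["sense_freq"], reverse=True)]
        for pos, items in groups.items()
    }

# Hypothetical translations of English "bank"; every number is made up.
entries = [
    {"target": "берег", "pos": "noun", "word_freq": 1500, "sense_freq": 0.25},
    {"target": "банк", "pos": "noun", "word_freq": 900, "sense_freq": 0.70},
    {"target": "насыпь", "pos": "noun", "word_freq": 50, "sense_freq": 0.05},
    {"target": "накренять", "pos": "verb", "word_freq": 10, "sense_freq": 0.40},
    {"target": "полагаться", "pos": "verb", "word_freq": 400, "sense_freq": 0.60},
]

ranked = rank_translations(entries)
```

In this toy input, "берег" has the highest word frequency among the nouns, yet "банк" is placed first because the sense it translates is assumed to be the more frequent one. Nouns and verbs are ranked independently of each other.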
This paper describes the analysis of lexical units to be ranked in the lexical dataset used in the online dictionary LexSite. The subset compiled for this analysis includes 45,625 English words and 24,628 Russian words with corpus-based ranks below 60,000 in four part-of-speech categories (nouns, verbs, adjectives and adverbs). The study found that words with a large number of lexical senses make up about 3% of all words in the subset. Although words with a small number of senses are abundant (e.g. 3,955 English words in the 5-10 senses category), these groups do not require a large effort, because short lists of translations are normally easy to understand.
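A scope estimate of this kind amounts to counting words per sense-count category. The following is a minimal sketch under stated assumptions: only the 5-10 category is named in the text, so the other bucket boundaries, the function names, and the toy sense counts are hypothetical.

```python
from collections import Counter

# Assumed inclusive bucket boundaries; only "5-10" appears in the text.
BUCKETS = [("1", 1, 1), ("2-4", 2, 4), ("5-10", 5, 10), ("11+", 11, None)]

def bucket_label(n_senses):
    """Return the label of the bucket that contains n_senses."""
    for label, low, high in BUCKETS:
        if n_senses >= low and (high is None or n_senses <= high):
            return label
    raise ValueError(f"sense count must be at least 1, got {n_senses}")

def sense_distribution(sense_counts):
    """Count how many words fall into each sense-count bucket."""
    return Counter(bucket_label(n) for n in sense_counts.values())

# Toy input: word -> number of senses (invented values).
toy = {"set": 14, "run": 12, "bank": 6, "light": 7, "cat": 2, "oxygen": 1}

dist = sense_distribution(toy)
share_large = dist["11+"] / len(toy)  # fraction of words in the largest bucket
```

Run once per language and per part of speech, such a count shows which categories contain enough words with long translation lists to justify the ranking effort.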
The paper discusses the detailed results of this analysis and substantiates both the need for translation ranking in online dual-language dictionaries and the techniques for carrying it out.
Keywords: lexicography, word frequency, ranking, online dictionary