1 Corpus
2 Dictionary draft
3 Postediting
The Dictionary Express method of building dictionaries and other lexicographic works draws primarily on the automatic generation of dictionary entries, which are then validated and post-edited by humans. This approach is far more effective than traditional methods.
Current state-of-the-art corpus systems such as Sketch Engine, combined with large corpora, can automatically retrieve various types of linguistic data for the different components of a dictionary entry. Data that can be retrieved automatically include, for example:
- headwords
- word frequency
- word forms
- word senses
- example sentences
- collocations
- multiword expressions
- synonyms and antonyms (thesaurus)
- translations
- usage flags (e.g. formal/informal, written/spoken, domains and topics)
The challenges of working with authentic language
A dictionary should provide information about how language is used by its users. This is why the Dictionary Express method generates dictionary entries from corpora of authentic language. However, authentic language is naturally noisy, so we developed tools and a methodology that allow us to generate good-quality dictionary data even if the corpus contains a certain level of noise.
Headwords, headword frequency, word forms
We work only with corpora that are part-of-speech tagged and lemmatized. This gives us a reliable frequency for each part of speech in which a word is used. In addition, lemmatization ensures that the different forms of the same word (plural/singular, cases, genders, tenses) are correctly included in the count for the headword.
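As a rough illustration of why tagging and lemmatization matter for these counts, the following minimal sketch tallies frequencies per headword and part of speech. The token list, the tag names and both helper functions are hypothetical and do not reflect the actual Dictionary Express pipeline.

```python
from collections import Counter

# Hypothetical tagged corpus: one (word form, lemma, part-of-speech tag) per token.
tokens = [
    ("runs", "run", "VERB"),
    ("ran", "run", "VERB"),
    ("running", "run", "VERB"),
    ("run", "run", "NOUN"),
    ("runs", "run", "NOUN"),
    ("dogs", "dog", "NOUN"),
]


def headword_frequencies(tagged_tokens):
    """Count occurrences per (lemma, POS) pair, regardless of surface form."""
    return Counter((lemma, pos) for _form, lemma, pos in tagged_tokens)


def word_forms(tagged_tokens):
    """Collect the attested surface forms for each (lemma, POS) headword."""
    forms = {}
    for form, lemma, pos in tagged_tokens:
        forms.setdefault((lemma, pos), set()).add(form.lower())
    return forms


freqs = headword_frequencies(tokens)
forms = word_forms(tokens)
print(freqs[("run", "VERB")], freqs[("run", "NOUN")])
print(sorted(forms[("run", "VERB")]))
```

Without the part-of-speech tag, the verb and the noun "run" would collapse into one count. Without the lemma, "ran" and "running" would be counted as separate headwords.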
Word senses
We developed technology that uses collocations to identify the different senses of a word. The output is then presented to human editors who name, validate, merge and edit the suggested senses.
Example sentences
The GDEX technology was developed to automatically identify illustrative sentences which serve well as Good Dictionary EXamples. GDEX has been available to Sketch Engine users for many years, and Dictionary Express uses the same technology. Example sentences are selected for each of the word senses identified in the previous step.
The sentences preselected by GDEX are then validated and edited by human editors.
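The idea of preselecting candidate sentences for an editor can be sketched as a simple scoring function. The heuristics below (sentence length, share of common words, looking like a complete sentence) are illustrative assumptions only, not the actual GDEX configuration. The function name, thresholds and word list are hypothetical.

```python
def example_score(sentence, headword, common_words, min_len=6, max_len=20):
    """Return a score in [0, 1]; higher means a more promising dictionary example.

    Illustrative heuristics only, not the real GDEX classifiers.
    """
    raw_tokens = sentence.rstrip(".!?").split()
    words = [t.strip(",;:\"'()").lower() for t in raw_tokens]
    target = headword.lower()
    if target not in words:
        return 0.0  # the sentence does not illustrate the headword at all

    # Prefer sentences that are neither too short nor too long.
    length_factor = 1.0 if min_len <= len(words) <= max_len else 0.5

    # Prefer sentences made up mostly of common, easily understood words.
    known = sum(1 for w in words if w == target or w in common_words)
    known_factor = known / len(words)

    # Prefer complete sentences: initial capital and final punctuation.
    looks_complete = sentence[:1].isupper() and sentence.rstrip().endswith((".", "!", "?"))
    complete_factor = 1.0 if looks_complete else 0.5

    return length_factor * known_factor * complete_factor


# Hypothetical list of common words and hypothetical candidate sentences.
common = {"the", "a", "she", "to", "every", "morning", "before", "work",
          "likes", "in", "park"}
candidates = [
    "run xq7 zzz",
    "She likes to run in the park every morning before work.",
    "The dog sleeps.",
]
ranked = sorted(candidates, key=lambda s: example_score(s, "run", common), reverse=True)
print(ranked[0])
```

A ranking like this only narrows the field. The final choice and any editing remain with the human editor.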
Collocations
Collocations, or typical word combinations, are identified using the Word Sketch technology. It has been continuously developed since 2003, and Sketch Engine users take advantage of it every day. Word Sketch focusses on typicality rather than raw frequency, which is much more valuable: a very frequent combination is not necessarily characteristic of the headword.
Image on the right: A screenshot from the Sketch Engine interface showing part of the Word Sketch screen with collocations, their examples and frequency counts. More statistics and view options are available, including the possibility to view the authentic source sentences.
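The difference between frequency and typicality can be illustrated with logDice, the association score that the Sketch Engine documentation describes for word sketches. Whether Dictionary Express applies it in exactly this form is an assumption, and all counts below are invented for illustration.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence count of headword and collocate
    f_x:  corpus frequency of the headword
    f_y:  corpus frequency of the collocate
    """
    if f_xy <= 0:
        return float("-inf")  # never co-occur: no association
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented counts for the headword "tea" (f_x = 1000).
the_score = log_dice(f_xy=400, f_x=1000, f_y=1_000_000)  # frequent but untypical
strong_score = log_dice(f_xy=60, f_x=1000, f_y=5000)     # rarer but typical

print(round(the_score, 2), round(strong_score, 2))
```

With these invented counts, "the" co-occurs with "tea" far more often than "strong" does, yet "strong" receives the higher score because a much larger share of its occurrences are next to "tea".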