A corpus of authentic texts serves as the main source of information that will be included in the dictionary. Where possible, we work with corpora of billions of words.

The corpus size is vital, especially for mid- or low-frequency words and for multiword expressions, collocations, and phraseology. A large corpus ensures that their number of occurrences is large enough to serve as a reliable source of information for creating a dictionary.

The words in the corpus must also be tagged for parts of speech and lemmatized so that the dictionary draft generation can take into account the morphology, conjugations, and declensions.

Corpus data collection

The only viable option to collect the desired volume of text is by downloading texts from the web. At Lexical Computing, we developed tools and procedures to collect only linguistically valuable content. The web content is downloaded, automatically filtered, cleaned and deduplicated, and finally, manually checked.
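The filtering and deduplication steps can be illustrated with a minimal sketch. It is not the Lexical Computing pipeline: the function names, the minimum-length threshold and the exact-match hashing are all assumptions chosen for illustration, and a production system would also need near-duplicate detection and language filtering.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def clean_and_deduplicate(paragraphs, min_words=5):
    """Drop very short paragraphs and exact duplicates (after normalization).

    min_words is a hypothetical threshold: short fragments such as menus
    or button labels rarely contain useful running text.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        norm = normalize(paragraph)
        if len(norm.split()) < min_words:
            continue  # filtered out as likely navigation or boilerplate
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical paragraph was already kept
        seen.add(digest)
        kept.append(paragraph)
    return kept
```

Calling `clean_and_deduplicate` on a list of downloaded paragraphs returns the first occurrence of each distinct paragraph in its original order, which keeps the remaining text readable for the manual check.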

What is a corpus?

A corpus is a massive collection of authentic texts. Together with corpus analysis software, such as Sketch Engine, it is used by linguists to learn how language is used by its speakers.

In the past, linguists relied mainly on introspection – on what they know about language. This sometimes led to subjective judgements. Modern linguistics relies on corpora, i.e. on examples of authentic use, for objective conclusions reflecting the real use of the language.

How we build our corpora

At Sketch Engine, we have more than 20 years of corpus building experience. Over the years, we have developed our own technology, infrastructure and methodology for using the web as a source of linguistically valuable corpora, which in turn serve as data sources for lexicography and dictionaries.

Please read about how we do it.

The Dictionary Express method of building dictionaries and other lexicographic works primarily draws on automatic generation of dictionary entries which are then validated and post-edited by humans. This approach is far more effective than the traditional methods.

The current state-of-the-art corpus systems, such as Sketch Engine, together with large corpora can automatically retrieve various types of linguistic data and use them for different components of dictionary entries. Data which can be retrieved automatically are, for example:

  • headwords
  • word frequency
  • word forms
  • word senses
  • example sentences
  • collocations
  • multiword expressions
  • synonyms and antonyms (thesaurus)
  • translations
  • audio (pronunciation)
  • multimedia (images, videos)
  • usage flags (e.g. formal/informal, written/spoken, domains and topics)

The challenges of working with authentic language

A dictionary should provide information about how language is used by its users. This is why the Dictionary Express method generates dictionary entries from corpora of authentic language. However, authentic language is naturally noisy, and this is why we developed tools and methodology that allow us to create dictionary content of high quality even if the corpus contains a certain level of noise.

Headwords
headword frequency
word forms

We only work with corpora which are tagged for part of speech and which are lemmatized. This way we can get reliable frequency for each part of speech a word is used in. In addition, lemmatization ensures that different forms of the same word (plural/singular, cases, genders, tenses) are correctly included in the count for the headword.
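As a rough illustration of why tagging and lemmatization matter for frequency, the sketch below counts occurrences per (lemma, part of speech) pair and collects the word forms seen for each. The token triples and tag names are invented sample data, not output from any particular tagger.

```python
from collections import Counter

# Hypothetical tagged corpus: one (word form, lemma, part of speech) triple per token.
tokens = [
    ("bats", "bat", "NOUN"),
    ("bat", "bat", "NOUN"),
    ("batted", "bat", "VERB"),
    ("batting", "bat", "VERB"),
    ("bat", "bat", "VERB"),
    ("caves", "cave", "NOUN"),
]


def headword_frequencies(tagged_tokens):
    """Count occurrences per (lemma, POS), so every inflected form adds to its headword."""
    counts = Counter()
    for _form, lemma, pos in tagged_tokens:
        counts[(lemma, pos)] += 1
    return counts


def word_forms(tagged_tokens):
    """Collect the distinct word forms observed for each (lemma, POS) headword."""
    forms = {}
    for form, lemma, pos in tagged_tokens:
        forms.setdefault((lemma, pos), set()).add(form.lower())
    return forms
```

With this sample, the noun "bat" and the verb "bat" receive separate counts, which would not be possible from raw word forms alone because the form "bat" occurs in both.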

Word senses

We developed a sophisticated word sense induction technology which uses collocations to identify the different senses of the word. The output is then presented to human editors who name, validate, merge and edit the suggested senses.

Example sentences

The GDEX technology was developed to automatically identify illustrative sentences which serve well as Good Dictionary EXamples. GDEX has been available to the Sketch Engine users for many years. The same technology is used by Dictionary Express. Example sentences are identified based on word senses identified in the previous step.

The sentences preselected by GDEX are then validated and edited by human editors.
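The idea of scoring candidate sentences can be shown with a toy example. This is not the GDEX implementation: the heuristics, weights and length limits below are assumptions made up for illustration, and the headword is matched on the plain word form rather than on the lemma or the word sense.

```python
import re


def example_score(sentence: str, headword: str, min_len=6, max_len=20) -> float:
    """Return a score between 0 and 1; higher means a more promising example."""
    words = re.findall(r"[A-Za-z']+", sentence)
    lowered = [w.lower() for w in words]
    if headword.lower() not in lowered:
        return 0.0  # the sentence does not illustrate the headword at all
    score = 1.0
    if not (min_len <= len(words) <= max_len):
        score -= 0.4  # too short to give context or too long to read quickly
    if not sentence[:1].isupper() or not sentence.rstrip().endswith((".", "!", "?")):
        score -= 0.3  # probably a fragment rather than a full sentence
    if lowered[0] in {"this", "that", "it", "he", "she", "they"}:
        score -= 0.2  # likely depends on a preceding sentence for its meaning
    return max(score, 0.0)


def rank_examples(sentences, headword):
    """Sort candidate sentences from the best score to the worst."""
    return sorted(sentences, key=lambda s: example_score(s, headword), reverse=True)
```

Only the top-ranked candidates would then be passed on to the editors, which is the point of preselection: humans review a short list instead of every corpus sentence.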

Collocations

Collocations, or typical word combinations, are identified using the Word Sketch technology. It has been continuously developed since 2003 and the users of Sketch Engine take advantage of it every day. The Word Sketch focuses on typicality, which is much more valuable than raw frequency.

Image on the right: A screenshot from the Sketch Engine interface showing part of the Word Sketch screen with collocations, their examples and frequency counts. More statistics and view options are available, including the possibility to view the authentic source sentences.
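The difference between frequency and typicality can be made concrete with an association score. This page does not say how typicality is computed; the sketch below uses logDice, a widely used collocation measure defined as 14 + log2(2 * f_xy / (f_x + f_y)), purely as an illustration. All counts in the usage note are invented.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice association score.

    f_xy: how often the headword and the collocate occur together
    f_x:  total frequency of the headword
    f_y:  total frequency of the collocate
    The theoretical maximum is 14, reached when the two words only ever co-occur.
    """
    if f_xy <= 0:
        return float("-inf")  # the words never co-occur, so there is no association
    return 14 + math.log2(2 * f_xy / (f_x + f_y))
```

Suppose "bat" occurs 3,000 times, co-occurs 900 times with "the" (5,000,000 occurrences overall) and 120 times with "vampire" (800 occurrences overall). "The" is the more frequent neighbour, yet `log_dice(120, 3000, 800)` is far higher than `log_dice(900, 3000, 5000000)`, so "vampire" ranks as the more typical collocate.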

An automatically generated dictionary draft requires a team of human editors. With the Dictionary Express method, editors never edit the complete entry. Instead, they edit one entry component, for example, word senses, of all entries in the dictionary. This component-by-component approach (rather than the traditional entry-by-entry approach) makes it possible to quickly train the editors in one clearly defined task.

As long as the editors are native speakers, they are not required to be expert lexicographers. This makes it easier to build the editorial team. The benefit is even greater for less-resourced languages, where trained lexicographers or linguists may be scarce or unavailable.

Headword list postediting

The dictionary postediting starts with the headword list. The editors assign flags to headwords to distinguish items which qualify for dictionary headwords. Keyboard shortcuts and advancing to the next entry automatically make the task even more efficient.

Image on the right: An example of the headword list editing interface. The concrete flags differ between dictionary projects based on the requirements of the customer.

Word sense postediting

With Dictionary Express, the word senses of a headword are suggested automatically based on collocations. The editors name the senses, and validate or reassign the collocations to the correct senses. The sense names can be included in the entries as sense flags or labels (also called disambiguating glosses).

The process also leads to merging word senses or splitting senses as necessary.

The postedited word sense information is then fed back to the source corpus and used for generating the remaining entry components, e.g. examples, which are identified taking word senses into account.

Image on the right: Word sense postediting interface. Automatically suggested word senses of the English word bat. Normally, a higher number of collocations is presented to the editor for each sense. This is only an example.

1

Editors name the senses and add new ones if required. The sense names can be included in the dictionary as sense labels or flags.

2

Editors validate the collocations, reassign them to different senses, mark them as MIXED (non-disambiguating collocations) or exclude them completely.

3

A link to sentences from which this collocation was extracted can be used by the editors to refer to the source corpus data to better understand the usage and context.

Example sentences

Example sentences extracted using the GDEX method are validated and edited by the editors in an easy-to-use interface. Editors mark the examples as good or bad, or edit them. For example, they change a question to an affirmative sentence or remove wording which does not contribute to understanding the usage.

Image on the right: Interface for editing example sentences. The sentences are grouped by word senses identified in the previous step. The editors may also be required to pick one best example or to only mark a specified number of examples as good. The interface will then contain the corresponding controls and will check the conditions are met.

Thesaurus

synonyms, antonyms and similar words (semantic field)

Dictionary entries can also contain thesaurus items. The candidate words are generated automatically using the thesaurus tool in Sketch Engine. The editors then sort the suggestions into three groups:

  • synonyms – words with a similar meaning
  • antonyms – words with an opposite meaning
  • similar words – words which belong to the same semantic group (semantic field) but are neither synonyms nor antonyms

The finalized content of the custom dictionary, glossary or the database is supplied in a format that allows publishing or deployment in various formats:

  • print material
  • online dictionary
  • software/app integration

In the case of an online dictionary, we offer a dictionary publishing interface which can be customized.
