2 Dictionary draft
The corpus serves as the main source of information to be included in the dictionary. We aim for corpora of billions of words wherever possible.
Corpus size is vital, especially for mid- and low-frequency words and for multiword expressions, collocations, and phraseology. A large corpus ensures that even these items occur often enough to serve as a reliable source of information.
The words in the corpus must also be tagged for part of speech and lemmatized, so that dictionary draft generation can take morphology, i.e. conjugations and declensions, into account.
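To illustrate why tagging and lemmatization matter, here is a minimal sketch (with made-up example data, not the actual corpus format) of counting occurrences by lemma and part of speech rather than by surface form, so that inflected forms such as "runs" and "ran" pool together under one headword:

```python
from collections import Counter

# (surface form, POS tag, lemma) triples, as a tagger/lemmatizer might emit
tagged = [
    ("She", "PRON", "she"),
    ("runs", "VERB", "run"),
    ("daily", "ADV", "daily"),
    ("he", "PRON", "he"),
    ("ran", "VERB", "run"),
    ("yesterday", "ADV", "yesterday"),
]

# Key frequencies by (lemma, POS) so all inflections of a word count together
freq = Counter((lemma, pos) for _, pos, lemma in tagged)

print(freq[("run", "VERB")])  # → 2: "runs" and "ran" both count toward "run"
```

Without lemmatization, "runs" and "ran" would be counted as two unrelated items, fragmenting the frequency evidence the draft is built on.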
Corpus data collection
The only viable way to collect the desired volume of text is to download it from the web. At Lexical Computing, we developed tools and procedures to collect only linguistically valuable content. The web content is downloaded, automatically filtered, cleaned and deduplicated, and finally checked manually.
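The deduplication step can be sketched as follows. This is a simplified illustration of the general technique, not Lexical Computing's actual pipeline: duplicate paragraphs are dropped by hashing a normalized form, so boilerplate repeated across pages survives only once.

```python
import hashlib

def dedupe(paragraphs):
    """Keep the first occurrence of each paragraph, comparing them
    after normalizing whitespace and case."""
    seen = set()
    unique = []
    for p in paragraphs:
        # Normalize before hashing so trivial variants collapse together
        key = hashlib.sha1(" ".join(p.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

pages = [
    "Welcome to our site.",
    "A corpus is a collection of authentic texts.",
    "Welcome   to our site.",  # duplicate up to whitespace
]
print(len(dedupe(pages)))  # → 2
```

Production systems typically go further, detecting near-duplicates at the document level as well, but the principle of hashing normalized text is the same.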
A corpus is a massive collection of authentic texts. Together with corpus analysis software, such as Sketch Engine, it is used by linguists to learn how language is used by its speakers.
In the past, linguists relied mainly on introspection – on what they know about language. This sometimes led to subjective judgements. Modern linguistics relies on corpora, i.e. on examples of authentic use, for objective conclusions reflecting the real use of the language.
At Sketch Engine, we have more than 20 years of corpus-building experience. Over the years, we developed our own technology, infrastructure and methodology for using the web as a source of linguistically valuable corpora, which in turn serve as sources of data for lexicography and dictionaries.
Please read about how we do it.