Working Group 2: Lexicon-corpus interface


In the quest for diversity, electronic lexica complement corpora: they aim at holistic language modelling and can describe a large number of linguistic objects, whereas many phenomena occur rarely or never in corpora. Lexica can also help unify terminologies, e.g., when a category can be described as a closed word list. In this context, WG2 is dedicated to:

  1. Cross-language unification of lexical features:
    • harmonizing the definition of a “syntactic word” across languages,
    • harmonizing lemmatization rules (for words and MWEs) and lexical features across languages,
    • standardizing lists of lexemes for auxiliaries, pronouns and determiners;
  2. Design of a lexicon-corpus interface aiming at:
    • interlinking MWE lexicon entries with their occurrences in corpora,
    • cross-lingually unified lexicography of idiosyncratic constructions;
  3. Proof-of-concept lexical encoding of MWEs following the above design.
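To make the idea of interlinking in item 2 concrete, here is a minimal Python sketch of how an MWE lexicon entry could point to its occurrences in a corpus. It is an illustration only, not the design adopted by WG2: the names `MweEntry`, `Occurrence` and `link_occurrences`, the identifiers, and the matching of occurrences to entries by their set of lemmas are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Occurrence:
    """Pointer from a lexicon entry to one annotated occurrence in a corpus."""
    sent_id: str                  # hypothetical sentence identifier
    token_ids: Tuple[int, ...]    # 1-based token positions, may be discontinuous


@dataclass
class MweEntry:
    """Hypothetical lexicon entry for a multiword expression (MWE)."""
    entry_id: str
    lemmas: Tuple[str, ...]       # lemmatized components of the MWE
    occurrences: List[Occurrence] = field(default_factory=list)


def link_occurrences(lexicon, annotations):
    """Attach each annotated occurrence to the entry with the same lemma set.

    `annotations` holds (sent_id, [(token_id, lemma), ...]) pairs, one per
    MWE occurrence already annotated in the corpus. Matching on the set of
    lemmas is a simplifying assumption. Returns the unmatched annotations.
    """
    index = {frozenset(entry.lemmas): entry for entry in lexicon}
    unmatched = []
    for sent_id, tokens in annotations:
        entry = index.get(frozenset(lemma for _, lemma in tokens))
        if entry is None:
            unmatched.append((sent_id, tokens))
            continue
        token_ids = tuple(sorted(tid for tid, _ in tokens))
        entry.occurrences.append(Occurrence(sent_id, token_ids))
    return unmatched


# Toy data, invented for this example.
lexicon = [
    MweEntry("mwe-001", ("take", "decision")),
    MweEntry("mwe-002", ("kick", "bucket")),
]
annotations = [
    ("s1", [(2, "take"), (5, "decision")]),   # "She took a hard decision"
    ("s2", [(3, "decision"), (5, "take")]),   # "The final decision was taken"
    ("s3", [(1, "spill"), (3, "bean")]),      # no matching lexicon entry
]
unmatched = link_occurrences(lexicon, annotations)
```

Storing token positions rather than surface strings lets one entry cover discontinuous and reordered occurrences, such as the passive in `s2`. Annotations left in `unmatched` are candidates for new lexicon entries.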


The monthly online meetings of WG2 take place on the first Thursday of each month at 13:00 CEST and last one hour. See the list of past and upcoming WG meetings.

Current Subtasks

  • Task 2.1: Cross-language unification of lexical features [co-leaders: Kilian Evang, Dan Zeman, Petya Osenova]
  • Task 2.2: Design of a lexicon-corpus interface [co-leaders: Simon Krek, Carole Tiberius, Jaka Čibej]
  • Task 2.3: Proof-of-concept lexicon encoding of MWEs [co-leaders: Stella Markantonatou, Ivelina Stoyanova, Christian Chiarcos, Ranka Stanković]



Jan Odijk, Canonical form of MWEs - short presentation, long presentation

wg2/wg2.txt · Last modified: 2024/06/07 14:17 by verginica.mititelu