====== PARSEME/UniDive annotation campaign on multiword expressions ======

  * **Event title**: PARSEME/UniDive annotation campaign (UniDive WG1 task 1.2)
  * **Dates**: September 2023 -- September 2025
  * **Co-leaders**:
    * [[https://people.auth.gr/pgiouli/?lang=en|Voula Giouli]], Aristotle University of Thessaloniki, Greece
    * [[https://www.ilsp.gr/en/members/markantonatou-stella-2/|Stella Markantonatou]], Language and Speech Processing/ATHENA RC, Athens, Greece
    * Takuya Nakamura, Université Paris-Saclay, France
    * [[https://pageperso.lis-lab.fr/carlos.ramisch/|Carlos Ramisch]], Aix-Marseille Université, France
    * [[https://perso.lisn.upsaclay.fr/savary/|Agata Savary]], Université Paris-Saclay, France
    * [[https://www2.lingfil.uu.se/cl/sara/|Sara Stymne]], Uppsala University, Sweden

|[[https://www.cost.eu/|{{ :cost_logo_rgb_lowresolution-cropped.jpg?100 |}}]]|{{ :en-funded_by_the_eu-pos.png?200 |}}|[[other-events:logo-tessaloniki.png|{{:wg1:wg1:task1.2:logo-tessaloniki.png?100|}}]]|[[https://www.athenarc.gr/|{{:wg1:wg1:task1.2:logo-athena.png?100|}}]]|[[https://www.universite-paris-saclay.fr/|{{:other-events:logo-univ-saclay.png?100|}}]]|[[https://www.univ-amu.fr/|{{:other-events:logo-amu.png?100|}}]]|[[https://www.uu.se/|{{:wg1:wg1:task1.2:logo-uppsala.png?40|}}]]|

The [[https://unidive.lisn.upsaclay.fr/|UniDive]] COST action (task 1.2) and the [[https://gitlab.com/parseme/corpora/-/wikis|PARSEME]] community are carrying on a **multilingual corpus annotation campaign** dedicated to multiword expressions (MWEs). Three past PARSEME annotation campaigns were dedicated exclusively to __verbal__ MWEs (VMWEs) and resulted in 4 editions of the [[https://gitlab.com/parseme/corpora/-/wikis/home|PARSEME corpus]], which jointly covers **26 languages**. Three [[https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks|PARSEME shared tasks]] on automatic identification of VMWEs have been organized on the basis of this corpus and set the state of the art in the task.
The current annotation campaign covers MWEs of **all syntactic types** (including nominal, adjectival, adverbial and functional MWEs). It follows the spirit of **universality**: the [[https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0|annotation guidelines]] are unified across all participating languages whenever possible, while still leaving room for truly language-specific phenomena. This approach is expected to promote meaningful cross-language comparisons. The resulting corpus will be used in a [[other-events:parseme-st|PARSEME/UniDive shared task]] on identifying and understanding MWEs, submitted as a proposal for [[https://semeval.github.io/SemEval2026/cft|SemEval 2026]].

===== Teams =====

Each language should be annotated by a team of **native annotators** (except when this is not possible, e.g. for extinct languages such as Ancient Greek or Egyptian). A language team should consist of **at least 2 annotators** (including the Language Leader), so that inter-annotator agreement can be estimated. It is possible to start annotating alone and recruit more annotators at a later stage (May 2025 at the latest). See the [[https://gitlab.com/parseme/corpora/-/wikis/home#language-teams|language teams]] from past and present annotation campaigns.

Each language team should have at least one **Language Leader**. See the [[wg1:wg1:task1.2:call-for-language-leaders|call for Language Leaders]].

===== Annotation work =====

For the [[https://gitlab.com/parseme/corpora/-/wikis/home#languages|languages already present]] in the PARSEME corpus, the agenda is to:

  * Re-annotate the existing corpus with MWEs other than verbal. Annotating only part of the existing corpus is an option. In this case we recommend a **minimum of 2000 annotated MWEs** (so that each selected text is exhaustively annotated for all syntactic types of MWEs). A lower number of annotations is acceptable, but system results obtained on such a corpus are not expected to be representative.
  * Add some **new texts** annotated from scratch (to counterbalance language model contamination from previously published data).

For [[https://gitlab.com/parseme/corpora/-/wikis/home#upcoming-languages|new languages]], corpora will be annotated for all syntactic types at once. Conversions from other MWE annotation schemes are fine, provided they are curated to fit the PARSEME guidelines.

===== Timeline =====

  * **[task leaders: 14 February]** [[wg1:wg1:task1.2:call-for-language-leaders|Call for Language Leaders]]
  * **[language leaders: 27 February]** Expression of interest from Language Leaders
  * **[task leaders: late February]** Creating FLAT accounts
  * **[language leaders: mid-March]**
    * Reading the [[https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0/|annotation guidelines 2.0]]
    * Reading the [[https://gitlab.com/parseme/corpora/-/wikis/PARSEME-Language-Leader-Guide|Language Leader's guide]]
    * Filling in MWE examples in the guidelines
    * Recruiting annotators
    * Selecting corpora
  * **[all: 28 March]** Pilot annotation, submitting [[https://gitlab.com/parseme/sharedtask-guidelines/-/issues|issues]]
  * **[shared task leaders: 31 March]** SemEval 2026 shared task proposal
  * **[SemEval: 19 May]** Notification from SemEval about the selected shared tasks => rejected
  * **[language teams: April -- 1 September]** Annotation for subtask 1 (PARSEME corpus)
    * Annotating the PARSEME corpus with all syntactic types of MWEs
    * Double-annotating a sample for inter-annotator agreement estimation
    * Consistency checks
  * **[task leaders: 15 September]** Preparing the data for subtask 2
  * **[language teams: 1 October]** MWE paraphrasing for subtask 2
  * **[task leaders: 30 October]** Consolidating and splitting the corpora for both subtasks
  * **[task leaders: autumn]** Shared task proposal

===== Documents and tools =====

  * PARSEME/UniDive annotation campaign [[https://docs.google.com/document/d/1u_ycAUIB8Fw7kYI3M_Xkj5_ZWftlXcIpSa42pHGaBdc/edit?usp=sharing|master document]]
  * [[https://gitlab.com/parseme/corpora/-/wikis/|PARSEME corpus wiki]]
  * Annotation guidelines
    * [[https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0/|PARSEME annotation guidelines 2.0]]
    * [[https://docs.google.com/document/d/1meuelqTYyTeIEW3ezqNydEZTYhXhh_8jKcv9r93y1mU/edit?usp=sharing|what’s new in version 2.0]]
    * [[https://gitlab.com/parseme/sharedtask-guidelines/-/issues|GitLab issues]] from the guidelines
  * [[https://gitlab.com/parseme/corpora/-/wikis/PARSEME-Language-Leader-Guide|Language Leader's guide]]
  * [[https://flat.lisn.upsaclay.fr|FLAT]] annotation platform and [[https://docs.google.com/document/d/1gAQ1yC0xR-nkJVbVNMgtN6gCRrizjfoqH-z6SQ_pDSk/edit?usp=sharing|FLAT User's Guide]]
  * Minutes from [[https://docs.google.com/document/d/1jvOGO2Q_pJpm1rB0B6sAprKzEh2n95Jc-VTktaAW_j8/edit?usp=sharing|task 1.2 co-leaders’ meetings]]
  * Minutes from [[https://docs.google.com/document/d/1r-OcsGUOMFZFewTigj9arGhDQ4ePzbnfYYSOhFxMQ_A/edit?usp=sharing|Language Leaders' meetings]]

===== Language Leaders' meetings =====

Language Leaders meet regularly online during the annotation campaign. The schedule is as follows:

  * Tuesday 8 April 6 p.m. CEST
  * Friday 18 April 9 a.m. CEST
  * Friday 2 May 9 a.m. CEST
  * Tuesday 6 May 6 p.m. CEST
  * Friday 16 May 9 a.m. CEST
  * Friday 30 May 9 a.m. CEST
  * Tuesday 3 June 6 p.m. CEST
  * Friday 13 June 9 a.m. CEST
  * Friday 27 June 6 p.m. CEST
  * Tuesday 1 July 6 p.m. CEST
  * Friday 25 July 9 a.m. CEST

We are using the recurring [[https://cnrs.zoom.us/j/92794488497?pwd=Smtmdm4rTCs1S3hFdjZsUk1rZlU1dz09|Zoom link]].