====== PARSEME/UniDive annotation campaign on multiword expressions ======
* **Event title**: PARSEME/UniDive annotation campaign (UniDive WG1 task 1.2)
* **Dates**: September 2023 -- September 2025
* **Co-leaders**:
* [[https://people.auth.gr/pgiouli/?lang=en|Voula Giouli]], Aristotle University of Thessaloniki, Greece
* [[https://www.ilsp.gr/en/members/markantonatou-stella-2/|Stella Markantonatou]], Language and Speech Processing/ATHENA RC, Athens, Greece
* Takuya Nakamura, Université Paris-Saclay, France
* [[https://pageperso.lis-lab.fr/carlos.ramisch/|Carlos Ramisch]], Aix-Marseille Université, France
* [[https://perso.lisn.upsaclay.fr/savary/|Agata Savary]], Université Paris-Saclay, France
* [[https://www2.lingfil.uu.se/cl/sara/|Sara Stymne]], Uppsala University, Sweden
|[[https://www.cost.eu/|{{ :cost_logo_rgb_lowresolution-cropped.jpg?100 |}}]]|{{ :en-funded_by_the_eu-pos.png?200 |}}|[[other-events:logo-tessaloniki.png|{{:wg1:wg1:task1.2:logo-tessaloniki.png?100|}}]]|[[https://www.athenarc.gr/|{{:wg1:wg1:task1.2:logo-athena.png?100|}}]]|[[https://www.universite-paris-saclay.fr/|{{:other-events:logo-univ-saclay.png?100|}}]]|[[https://www.univ-amu.fr/|{{:other-events:logo-amu.png?100|}}]]|[[https://www.uu.se/|{{:wg1:wg1:task1.2:logo-uppsala.png?40|}}]]|
The [[https://unidive.lisn.upsaclay.fr/|UniDive]] COST action (task 1.2) and the [[https://gitlab.com/parseme/corpora/-/wikis|PARSEME]] community are carrying out a **multilingual corpus annotation campaign** dedicated to multiword expressions (MWEs).
Three past PARSEME annotation campaigns were dedicated exclusively to __verbal__ MWEs (VMWEs) and resulted in 4 editions of the [[https://gitlab.com/parseme/corpora/-/wikis/home|PARSEME corpus]], which jointly covers **26 languages**. Three [[https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks|PARSEME shared tasks]] on automatic identification of VMWEs have been organized on the basis of this corpus and set the state of the art in the task.
The current annotation campaign covers MWEs of **all syntactic types** (including nominal, adjectival, adverbial and functional MWEs). It follows the spirit of **universality**: the [[https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0|annotation guidelines]] are unified across all participating languages whenever possible, while still leaving room for truly language-specific phenomena. This approach is expected to promote meaningful cross-language comparisons. The resulting corpus will be used in a [[other-events:parseme-st|PARSEME/UniDive shared task]] on identifying and understanding MWEs, submitted as a proposal for [[https://semeval.github.io/SemEval2026/cft|SemEval 2026]].
===== Teams =====
Each language should be annotated by a team of **native annotators** (except when this is not possible, e.g. in the case of extinct languages like Ancient Greek or Egyptian). A language team should consist of **at least 2 annotators** (including the Language Leader), so that inter-annotator agreement can be estimated. It is possible to start annotating alone and recruit more annotators at a later stage (May 2025 at the latest). See the [[https://gitlab.com/parseme/corpora/-/wikis/home#language-teams|language teams]] from past and present annotation campaigns.
Each language team should have at least one **Language Leader**. See the [[wg1:wg1:task1.2:call-for-language-leaders|call for Language Leaders]].
===== Annotation work =====
For the [[https://gitlab.com/parseme/corpora/-/wikis/home#languages|languages already present]] in the PARSEME corpus, the agenda is to:
* Re-annotate the existing corpus with MWEs other than verbal ones. Annotating only part of the existing corpus is an option. In this case we recommend a **minimum of 2000 annotated MWEs** (so that each selected text is exhaustively annotated for all syntactic types of MWEs). A lower number of annotations is acceptable, but system results obtained on such data are not expected to be representative.
* Add some **new texts** annotated from scratch (to counterbalance language model contamination from previously published data)
For [[https://gitlab.com/parseme/corpora/-/wikis/home#upcoming-languages|new languages]], corpora will be annotated for all syntactic types at once.
Conversions from other MWE annotation schemes are fine, if curated so as to fit the PARSEME guidelines.
===== Timeline =====
* **[task leaders: 14 February]** [[wg1:wg1:task1.2:call-for-language-leaders|Call for Language Leaders]]
* **[language leaders: 27 February]** Expression of interest from Language Leaders
* **[task leaders: late February]** Creating FLAT accounts
* **[language leaders: mid-March]**
* Reading the [[https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0/|annotation guidelines 2.0]]
* Reading the [[https://gitlab.com/parseme/corpora/-/wikis/PARSEME-Language-Leader-Guide|Language Leader's guide]]
* Filling in MWE examples in the guidelines
* Recruiting annotators
* Selecting corpora
* **[all: 28 March]** Pilot annotation, submitting [[https://gitlab.com/parseme/sharedtask-guidelines/-/issues|issues]]
* **[shared task leaders: 31 March]** SemEval 2026 shared task proposal
* **[SemEval: 19 May]** Notification from SemEval about the selected shared tasks => rejected
* **[language teams: April-1 September]** Annotation for subtask 1 (PARSEME corpus)
* Annotating the PARSEME corpus with all syntactic types of MWEs
* Double-annotating a sample for inter-annotator agreement estimation
* Consistency checks
* **[task leaders: 15 September]** Preparing the data for subtask 2
* **[language teams: 1 October]** MWE paraphrasing for subtask 2
* **[task leaders: 30 October]** Consolidating and splitting the corpora for both subtasks
* **[task leaders: autumn]** Shared task proposal
===== Documents and tools =====
* PARSEME/UniDive annotation campaign [[https://docs.google.com/document/d/1u_ycAUIB8Fw7kYI3M_Xkj5_ZWftlXcIpSa42pHGaBdc/edit?usp=sharing|master document]]
* [[https://gitlab.com/parseme/corpora/-/wikis/|PARSEME corpus wiki]]
* Annotation guidelines
* [[https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0/|PARSEME annotation guidelines 2.0]]
* [[https://docs.google.com/document/d/1meuelqTYyTeIEW3ezqNydEZTYhXhh_8jKcv9r93y1mU/edit?usp=sharing|what’s new in version 2.0]]
* [[https://gitlab.com/parseme/sharedtask-guidelines/-/issues|Gitlab issues]] from the guidelines
* [[https://gitlab.com/parseme/corpora/-/wikis/PARSEME-Language-Leader-Guide|Language Leader's guide]]
* [[https://flat.lisn.upsaclay.fr|FLAT]] annotation platform and [[https://docs.google.com/document/d/1gAQ1yC0xR-nkJVbVNMgtN6gCRrizjfoqH-z6SQ_pDSk/edit?usp=sharing|FLAT User's Guide]]
* Minutes from [[https://docs.google.com/document/d/1jvOGO2Q_pJpm1rB0B6sAprKzEh2n95Jc-VTktaAW_j8/edit?usp=sharing|task 1.2 co-leaders’ meetings]]
* Minutes from [[https://docs.google.com/document/d/1r-OcsGUOMFZFewTigj9arGhDQ4ePzbnfYYSOhFxMQ_A/edit?usp=sharing|Language Leaders' meetings]]
===== Language Leaders' meetings =====
Language Leaders meet weekly online during the annotation campaign. The schedule is as follows:
* Tuesday 8 April 6 p.m. CEST
* Friday 18 April 9 a.m. CEST
* Friday 2 May 9 a.m. CEST
* Tuesday 6 May 6 p.m. CEST
* Friday 16 May 9 a.m. CEST
* Friday 30 May 9 a.m. CEST
* Tuesday 3 June 6 p.m. CEST
* Friday 13 June 9 a.m. CEST
* Friday 27 June 6 p.m. CEST
* Tuesday 1 July 6 p.m. CEST
* Friday 25 July 9 a.m. CEST
We are using the recurring [[https://cnrs.zoom.us/j/92794488497?pwd=Smtmdm4rTCs1S3hFdjZsUk1rZlU1dz09|Zoom link]].