subtask 1 (PARSEME 2.0)
a fairly well-established framework
novelty: non-verbal MWEs, diversity measures
subtask 2 (MWE generation)
Given a context from which an MWE has been removed, restore that MWE (a data-format and scoring sketch follows this block)
Problems: how to evaluate the system
[ALINE] Consider taking into account the level of difficulty of the items? For example, some items will be more ambiguous and harder to resolve.
[JOAKIM] It is unclear which capability of the models we are testing.
[TOM] Very difficult to evaluate, even manually.
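A minimal sketch of what a subtask 2 instance and its scoring could look like; the field names, the multi-reference format and the exact-match metric are assumptions for illustration, not anything agreed in the meeting:

```python
# Hypothetical instance format for the MWE-restoration subtask (subtask 2).
# Field names and the multi-reference exact-match metric are assumptions.
instance = {
    "id": "en-0001",
    "context": "She finally ___ and told the police everything.",  # MWE removed from the context
    "references": ["spilled the beans", "came clean"],             # acceptable restorations
}

def exact_match(prediction: str, references: list[str]) -> bool:
    """Count a prediction as correct if it matches any reference (case-insensitive)."""
    return prediction.strip().lower() in {r.lower() for r in references}

print(exact_match("Spilled the beans", instance["references"]))  # True
```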
subtask 3 (MWE comprehension/disambiguation)
Given a sentence and the span of a potentially idiomatic expression, classify it as idiomatic, literal or coincidental (see the sketch after this block)
[GULSEN] There are some datasets for this task. Maybe the third category complicates things.
[JOAKIM]
[TOM] The same as SemEval 2022 (EN, PT, Galician). There are artefact issues (the models don’t really pay attention to the context).
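A rough sketch of how subtask 3 instances and scoring might look; representing the span as a surface string and using macro-F1 over the three classes are assumptions, not decided choices:

```python
# Hypothetical instances for the three-way disambiguation subtask (subtask 3).
from sklearn.metrics import f1_score

instances = [
    {"sentence": "He kicked the bucket last night.",            "span": "kicked the bucket", "label": "idiomatic"},
    {"sentence": "She kicked the bucket down the stairs.",      "span": "kicked the bucket", "label": "literal"},
    {"sentence": "He walked by and large crowds followed him.", "span": "by and large",      "label": "coincidental"},
]

gold = [ex["label"] for ex in instances]
pred = ["idiomatic", "literal", "coincidental"]  # a perfect dummy output, just to show the scoring call

print(f1_score(gold, pred, average="macro"))  # macro-F1 over the three classes -> 1.0
```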
subtask 4 (paraphrasing)
Given a sentence, rephrase it so that there are no MWEs
[AGATA] The input should be raw text, without a span. Objective: text simplification.
[JOAKIM] The most natural task among 2, 3 and 4. Close to what people do with LLMs.
Can we avoid doing manual evaluation? (LLM as judge; a sketch follows below)
[TOM] His favorite
[ALINE] They work with human questionnaires for this problem. There is a synonym dataset. Another possible task: collect sentences with synonyms of MWEs.
[ALINE] Sometimes the simplest way to express a meaning is with an MWE.
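To make the LLM-as-judge idea concrete, a minimal sketch of a judging step for subtask 4; the rubric, the answer format and the call_llm function are placeholders, not an agreed protocol:

```python
# Sketch of an LLM-as-judge check for MWE-free paraphrases (subtask 4).
# `call_llm` is a placeholder for whatever model API would actually be used.
from typing import Callable

JUDGE_PROMPT = """You are evaluating a paraphrase of a sentence containing a multiword expression (MWE).
Original: {original}
Paraphrase: {paraphrase}

Answer two questions:
1. Does the paraphrase preserve the meaning of the original? (yes/no)
2. Does the paraphrase still contain any MWE? (yes/no)
Reply exactly in the form: meaning=<yes|no>, mwe=<yes|no>"""

def judge(original: str, paraphrase: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the judge model whether meaning is preserved and whether MWEs were removed."""
    reply = call_llm(JUDGE_PROMPT.format(original=original, paraphrase=paraphrase))
    parts = dict(p.strip().split("=") for p in reply.split(","))
    return {"meaning_preserved": parts.get("meaning") == "yes",
            "mwe_free": parts.get("mwe") == "no"}
```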
Questions:
Which subtasks to choose?
How to evaluate them?