wg3:wg3_meeting_2023-03-17 [Universality, diversity and idiosyncrasy in language technology, CA21167 COST Action]

WG3 1st Meeting Minutes -- 2023-03-17

Session 1

10.45–11.00 Introduction to WG3 (slides)

11.00–11.30 Brainstorming on ideas and expectations

Discussion questions:

What is most important for you in multilingual and cross-lingual NLP?
What activities do you think we should prioritize?
How can we work together to make progress towards our goals?

Points raised:

Large language models are most important
Articulating linguistic theories underlying tools
Defining idiosyncrasy and diversity
The user perspective is important
Supporting low-resource languages through cross-lingual technology
Supporting low-resource languages through annotation tools
Supporting low-resource languages through data collection
Supporting low-resource languages with semantics
Tools for all languages – start with morphology
Low-resource language is not a homogeneous concept
Building resources for specific languages (Serbian)
Linking corpus resources between languages
Standardized tools applicable to different languages
Evaluation of tools – coordinate with other WGs
Tracking evaluation status for different types of tools
Improved benchmarking and experimental design
Organize shared tasks

11.30–12.00 Initial discussion on documentation of tools

Discussion questions:

Which types of tools do we want to include?
Where do we want to keep the documentation?
How do we create this documentation/inventory?

Points raised:

A huge multidimensional matrix
A shared repository
Tools shared between typologically similar languages
Consider end users
Too many languages have nothing – document what is missing rather than what exists
Connect to CLARIN
Flagship project on MWE
Include all tools or be selective?
What about commercial tools?
What about tools without documentation?

WG tasks emerging from the discussion:

Define multidimensional taxonomy of tools for documentation
Define infrastructure and procedure for creating documentation

Session 2

13.30–13.35 Recap of Session 1 (for new participants)

13.35–14.20 Initial discussion on evaluation campaigns

Background on goals and previous shared tasks

Brainstorming – define a novel shared task/evaluation campaign:

How is the task defined?
What are the evaluation metrics?
What kind of data is needed?
Which languages should be included?

Ideas:

Task = provide resources for shared tasks (evaluation metrics, test sets)
Instead of a shared task, build a dynamic leaderboard for LMs
Compare “traditional methods” to LMs on UD and MWE data
UD parsing with only surprise test languages and minimal training data
NLP tasks on top of UD data using linguistically defined embeddings
Distinguish similar languages or dialects (for example, using MWEs)
Objective: make every language appear at the center of the world
Collect idiom data using LLMs, evaluate on gold data

14.20–14.30 Next steps

Next WG3 meeting in Istanbul, September 8, 2023
We will focus on documentation of tools
Two tasks in preparation for the meeting:

A taxonomy of multi- and cross-lingual language technology
An infrastructure for multi- and cross-lingual language technology

Volunteers for these tasks are encouraged to contact WG leaders by email

14.30–14.45 Presentation of the European Language Equality project (slides)

Last modified: 2023/09/20 10:28 by joakim.nivre