Frequently Asked Questions (FAQ)

Are there other observatories ?

Yes there are, we are not the only ones providing analysis of AI models. In particular one can cite :

Open LLM Leaderboard comparing open-source Large Langage Models, developed by and running on HuggingFace platform
Trust LLM benchmark comparing both open-source and proprietary Large Langage Models, developped by a consortium of North American research and private institutions
Toloka's LLM leaderboard comparing the 5 most popular LLM models based on human generated and curated prompts, developed by Toloka an EU company specialized on AI
The LLM leaderboard from streamlit an open-source community-based LLM leaderboard aggregating the scores of other leaderboards
Stanford's LLM leaderboard comparing 30 LLM on 10 tests, developed by the Center for Reseach on Foundational Models (CRFM) of Stanford University

Beyond large language models, generic benchmarks of AI models do exists like:

ML Perf focusing on the assessment of the performance of AI models and developed by MLCommons, an Artificial Intelligence engineering consortium based in USA.

What are the current tests your provide ?

Until now we have been focusing on one main testing suites for Large Language Models :

LangBiTe: a tool for testing biases in large language models developped by the SOM Research Lab, a research team at the Open University of Catalonia (UOC)

We plan to implement soon :

langtest a tool to generate and execute test cases for a variety of LLM and NLP models developed by the American company John Snow Labs
part of the tests from Beyond Imitation Game tests suite initially developed by Google research team, and now open to contributions from other researchers

We are also open to suggestions from actors in the Luxembourg ecosystems.

Why do you focus on those measures and those models ?

While LIST has a background on general AI models development (for prediction, classification, images processing, etc.), the very fast development and adoption of Large Language Models since 2022 has led us to focus on these specific models to make the public aware of the performance, limitations and evolution of those specific models over the time.

Developers or integrators of AI models do not necessarilly have accurate expectations of how the technology will be used, especially when deployed outside of its original intent. Biases negatively impact both individuals and society by amplifying and reinforcing discrimination at a speed and scale far beyond the traditional discriminatory practices, the reason why we focus on bias measurement as a starting point.

Nevertheless, we plan to implement benchmarks on other types of models as part of our roadmap.

Can I suggest more tests/measures/models ?

Yes we are open to incorporate new tests, new measures and new models as part of our development roadmap. The suggestions and contributions will be assessed case by case.

Either fill this form, or drop us an email to ai-sandbox@list.lu

Is there a formal definition of the measures you test ?

Unfortunately, at the moment there is no formal definition that is commonly shared or agreed. Standardisation bodies are active in this field to define the concepts, the metrics and the measurement processes, in particular for AI models testing and quality characteristics of AI models.

In Luxembourg, ILNAS has published a white paper on AI and technical standardisation in 2021.

At EU level, as part of the development of the AI act, on December 5th 2022, the European Commission has issued a request to the European Standardisation agencies to develop drafts of such standards by 2025.

At international level, works of the AI standardisation working group (ISO/IEC Joint Technical committe 1 /sub committee 42) can be followed.