LLM Leaderboard

Among the most exciting applications of AI are Large Language Models (LLMs), i.e. machine learning models that can generate human language. The popularity and use of LLMs have recently increased due to the public release of OpenAI's ChatGPT and Google's Bard, among others. However, these models may exhibit unwanted biases that can lead to unfair or discriminatory outcomes. The LIST LLM Observatory assesses the most widely used LLMs with regard to social bias.

Following a scientific approach, the LLM Observatory relies on LangBiTe, an open-source framework for testing biases in LLMs, which includes a library of prompts to test for LGBTIQ+phobia, ageism, misogyny/misandry, political bias, racism, religious discrimination and xenophobia.


Concretely, we send many prompts to the LLMs (up to 130 for some bias categories) and evaluate their responses to detect sensitive words and/or unexpected unethical responses. The score of an LLM is the percentage of tests passed.
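
To illustrate the scoring idea, here is a minimal sketch in Python: each prompt/response pair is a test, a test passes if the response contains no flagged sensitive term, and the score is the percentage of tests passed. The term list and function names are assumptions for illustration only and are not part of LangBiTe's actual API.

```python
# Illustrative sketch of the scoring idea (not LangBiTe's real API).
# A test passes if the model's response contains no flagged sensitive term;
# the model's score is the percentage of tests passed.

SENSITIVE_TERMS = {"offensive_term_1", "offensive_term_2"}  # placeholder list

def test_passes(response: str) -> bool:
    """A test passes when no sensitive term appears in the response."""
    lowered = response.lower()
    return not any(term in lowered for term in SENSITIVE_TERMS)

def bias_score(responses: list[str]) -> float:
    """Score = percentage of tests passed (higher means less biased)."""
    if not responses:
        return 0.0
    passed = sum(test_passes(r) for r in responses)
    return 100.0 * passed / len(responses)

# Example: 3 of 4 responses pass, so the score is 75.0
print(bias_score(["ok reply", "neutral reply", "offensive_term_1 here", "fine"]))
```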

This leaderboard shows the most recent run of tests on the bias metrics (in columns) for all publicly available models (in rows). For each bias metric-model pair, a score is shown, representing the percentage of tests successfully passed. The higher the value, the better (i.e. less biased) the model. Values coloured in red indicate highly biased models for that metric, while values coloured in green highlight the best performing (i.e. least biased) models for that metric. Please note that, for a small number of metric-model combinations, a handful of test prompts could not be processed successfully. While this may cause slight variations in the raw counts, the proportion of affected cases is very limited and does not materially alter the overall comparative picture presented in the leaderboard.
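
The following sketch shows how such a leaderboard view could be assembled from per-metric scores: a mean score per model, plus the best and worst model per metric for the green/red colouring. The model names and score values are made up; only the structure follows the description above.

```python
# Minimal sketch of assembling the leaderboard view from per-metric scores.
# Scores and model names below are invented for illustration.

scores = {
    "model-a": {"LGBTIQ+": 90.0, "Ageism": 75.0, "Racism": 80.0},
    "model-b": {"LGBTIQ+": 60.0, "Ageism": 85.0, "Racism": 70.0},
}

metrics = sorted({m for row in scores.values() for m in row})

# Mean score per model (the "Mean Score" column).
means = {model: sum(row.values()) / len(row) for model, row in scores.items()}

# Best (green) and worst (red) model per metric, for colouring.
best = {m: max(scores, key=lambda k: scores[k][m]) for m in metrics}
worst = {m: min(scores, key=lambda k: scores[k][m]) for m in metrics}

for model, row in scores.items():
    cells = "  ".join(f"{m}: {row[m]:5.1f}" for m in metrics)
    print(f"{model:8s} {cells}  mean: {means[model]:5.1f}")
print("best (green) per metric:", best)
print("worst (red) per metric: ", worst)
```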

Social bias metrics for LLMs

LGBTIQ+
Ageism
Gender
Political
Racism
Religious
Xenophobia

The Leaderboard

[Leaderboard table: one row per model, with a score for each bias metric and a Mean Score column]


Results by language
