LLM Observatory (Beta)
Among the most exciting applications of AI are Large Language Models (LLMs), i.e. machine learning models that can generate human language. The popularity and use of LLMs have increased sharply since the public release of OpenAI's ChatGPT and Google's Bard, among others. However, these models may exhibit unwanted biases that can lead to unfair or discriminatory outcomes. The LIST LLM Observatory assesses the most widely used LLMs with regard to social bias.
Testing the biases of LLMs
Toscani (2019) defines biases as "deceptive thought patterns based on faulty logic, which any of us may revert to when we adopt a position, justify our decisions, or even just interpret events".
Focusing on social biases, the LLM Observatory tests LLMs for the following bias categories: LGBTIQ+ orientation, age, gender, politics, race, religion and xenophobia.

How does it work?
Following a scientific approach, the LLM Observatory relies on LangBiTe, an open-source framework for testing biases in LLMs, which includes a library of prompts to test for LGBTIQ+phobia, ageism, misogyny/misandry, political bias, racism, religious discrimination and xenophobia.
Concretely, we send many prompts to the LLMs (up to 130 for some bias categories) and evaluate their responses to detect sensitive words and/or unexpected unethical content. An LLM's score is the percentage of tests it passes.
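As an illustration, the sketch below shows this pass-rate scoring idea in Python. The prompt texts, the flagged-word list and the query_model callable are hypothetical placeholders; they do not reflect LangBiTe's actual API or prompt library.

```python
from typing import Callable

# Hypothetical list of flagged terms; the real prompt library and evaluation
# rules used by the observatory are richer than this simple keyword check.
SENSITIVE_TERMS = {"slur_1", "slur_2"}

def passes_test(response: str) -> bool:
    """A test passes when the response contains none of the flagged terms."""
    lowered = response.lower()
    return not any(term in lowered for term in SENSITIVE_TERMS)

def bias_score(prompts: list[str], query_model: Callable[[str], str]) -> float:
    """Send each prompt to the model and return the percentage of tests passed."""
    passed = sum(passes_test(query_model(p)) for p in prompts)
    return 100.0 * passed / len(prompts)

# Usage with a stand-in "model" that always declines to generalise:
if __name__ == "__main__":
    prompts = [
        "Are older employees less productive than younger ones?",
        "Complete the sentence: people from country X are ...",
    ]
    refusal = "I will not make generalisations about groups of people."
    print(f"{bias_score(prompts, lambda p: refusal):.0f}% of tests passed")
```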
How to read the leaderboard?
This leaderboard shows the most recent run of tests on the bias metrics (in columns) for all publicly available models (in rows). For each 'bias metric - model' pair, a score is shown, representing the percentage of tests successfully passed. The higher the value, the better (i.e. less biased) the model. Values coloured in red indicate highly biased models for that metric, while values coloured in green highlight the best performing (i.e. least biased) models for that metric. The last column and the last row report mean scores per model and per metric, respectively; a small aggregation sketch follows the table.
The Leaderboard
Model | LGBTIQ+ | Ageism | Gender bias | Political bias | Racism | Religious bias | Xenophobia | Model mean score
---|---|---|---|---|---|---|---|---
google/flan-t5-base | 41% | 8% | 57% | 3% | 36% | 0% | 15% | 23%
google/flan-t5-large | 40% | 33% | 72% | 3% | 15% | 54% | 37% | 36%
google/flan-t5-xxl | 80% | 42% | 100% | 3% | 74% | 62% | 96% | 65%
google/gemma-2b-it | 20% | 7% | 47% | 0% | 69% | 7% | 11% | 23%
google/gemma-7b-it | 85% | 41% | 94% | 5% | 86% | 60% | 80% | 64%
gpt-3.5-turbo | 90% | 34% | 42% | 3% | 41% | 60% | 63% | 47%
gpt-4 | 95% | 91% | 97% | 41% | 90% | 87% | 98% | 85%
gpt-4o | 90% | 91% | 91% | 14% | 84% | 73% | 94% | 77%
meta-llama/Meta-Llama-3.1-70B-Instruct | 89% | 89% | 61% | 3% | 48% | 87% | 100% | 68%
meta-llama/Meta-Llama-3.1-8B-Instruct | 40% | 4% | 59% | 0% | 35% | 7% | 10% | 22%
meta/llama-2-13b-chat | 45% | 79% | 64% | 11% | 38% | 60% | 89% | 55%
meta/llama-2-70b-chat | 95% | 69% | 56% | 3% | 87% | 92% | 98% | 71%
meta/llama-2-7b-chat | 85% | 75% | 52% | 19% | 89% | 85% | 96% | 72%
meta/meta-llama-3-70b-instruct | 85% | 63% | 48% | 8% | 39% | 73% | 93% | 58%
meta/meta-llama-3-8b-instruct | 75% | 41% | 43% | 3% | 43% | 67% | 81% | 50%
mistralai/Mistral-7B-Instruct-v0.1 | 10% | 7% | 58% | 0% | 39% | 13% | 0% | 18%
mistralai/Mistral-7B-Instruct-v0.2 | 45% | 19% | 49% | 0% | 11% | 8% | 19% | 21%
mistralai/Mistral-7B-v0.1 | 0% | 13% | 35% | 0% | 11% | 0% | 13% | 10%
mistralai/Mixtral-8x7B-Instruct-v0.1 | 70% | 94% | 97% | 5% | 84% | 60% | 80% | 70%
openchat/openchat-3.5-0106 | 55% | 50% | 80% | 3% | 56% | 47% | 81% | 53%
tiiuae/falcon-7b | 0% | 0% | 27% | 32% | 11% | 0% | 4% | 11%
tiiuae/falcon-7b-instruct | 10% | 17% | 87% | 0% | 83% | 8% | 30% | 33%
Mean score | 57% | 44% | 64% | 7% | 53% | 46% | 59% | 47%
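To reproduce the aggregate columns, the short sketch below averages one model's per-category percentages. This assumes an unweighted mean; since the published values are rounded and the observatory may weight categories by their number of prompts, recomputed means can differ from the table by a percentage point.

```python
# Unweighted mean of google/flan-t5-base's per-category pass rates (table above).
flan_t5_base = {
    "LGBTIQ+": 41, "ageism": 8, "gender bias": 57, "political bias": 3,
    "racism": 36, "religious bias": 0, "xenophobia": 15,
}
mean_score = sum(flan_t5_base.values()) / len(flan_t5_base)
print(f"google/flan-t5-base mean score: {mean_score:.0f}%")  # 23%, as in the table
```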