
LLM Observatory (Beta)

Large Language Models (LLMs), i.e. machine learning models that can generate human language, are among the most exciting applications of AI. Their popularity and use have grown rapidly since the public release of OpenAI's ChatGPT and Google's Bard, among others. However, these models may exhibit unwanted biases that can lead to unfair or discriminatory outcomes. The LIST LLM Observatory assesses the most widely used LLMs with regard to social bias.

Testing the biases of LLMs

Toscani (2019) defines biases as "deceptive thought patterns based on faulty logic, which any of us may revert to when we adopt a position, justify our decisions, or even just interpret events".

Focusing on social biases, the LLM Observatory tests LLMs with regard to the following bias dimensions: LGBTIQ+ orientation, age, gender, politics, race, religion and xenophobia.


How does it work?

Following a scientific approach, the LLM Observatory relies on LangBiTe, an open-source framework for testing biases in LLMs, which includes a library of prompts to test for LGBTIQ+-phobia, ageism, misogyny/misandry, political bias, racism, religious discrimination and xenophobia.


Concretely, we send many prompts to the LLMs (up to 130 for some bias categories) and evaluate their responses to detect sensitive words and/or unexpectedly unethical answers. The score of an LLM for a category is the percentage of tests it passes.
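As an illustration of this scoring procedure, the Python sketch below computes a bias score as the percentage of prompts whose responses pass a check. The names (`query_llm`, `passes`, `FLAGGED_WORDS`) are hypothetical placeholders for illustration, not the actual LangBiTe API.

```python
# Minimal sketch of the scoring loop described above. `query_llm` and
# FLAGGED_WORDS are illustrative placeholders, not the LangBiTe API.

FLAGGED_WORDS = {"offensive_term_a", "offensive_term_b"}  # hypothetical list

def query_llm(model: str, prompt: str) -> str:
    """Placeholder for an actual call to the model under test."""
    raise NotImplementedError

def passes(response: str) -> bool:
    # A test passes when the response contains no flagged term.
    return not any(word in response.lower() for word in FLAGGED_WORDS)

def bias_score(model: str, prompts: list[str]) -> float:
    """Score for one bias category = percentage of prompts passed."""
    passed = sum(passes(query_llm(model, p)) for p in prompts)
    return 100 * passed / len(prompts)
```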

How to read the leaderboard?

This leaderboard shows the most recent run of tests on the bias metrics (in columns) for all publicly available models (in rows). For each 'bias metric - model' pair, the score shown is the percentage of tests passed: the higher the value, the better (i.e. less biased) the model. Values coloured in red flag models that are highly biased on a given metric, while values coloured in green highlight the best-performing (i.e. least biased) models on that metric.
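As a minimal sketch of this aggregation, the following computes a model's mean score (the plain average of its per-metric scores) and a colour bucket. The red/green cut-off values are assumptions for demonstration, not the Observatory's actual thresholds.

```python
# Illustrative aggregation and colouring logic; the cut-off values
# below are assumptions, not the Observatory's actual thresholds.

def mean_score(scores: dict[str, float]) -> float:
    """Model mean score = average of the per-metric scores."""
    return sum(scores.values()) / len(scores)

def colour(score: float) -> str:
    if score < 25:    # assumed cut-off for "highly biased" (red)
        return "red"
    if score >= 75:   # assumed cut-off for "least biased" (green)
        return "green"
    return "neutral"

# Example: google/flan-t5-base's row from the leaderboard below.
flan_t5_base = {"LGBTIQ+": 41, "Ageism": 8, "Gender": 57, "Political": 3,
                "Racism": 36, "Religious": 0, "Xenophobia": 15}
print(round(mean_score(flan_t5_base)))  # -> 23, matching the table
```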

Social bias metrics for LLMs

LGBTIQ+
Ageism
Gender
Political
Racism
Religious
Xenophobia

The Leaderboard

| Model | LGBTIQ+ | Ageism | Gender | Political | Racism | Religious | Xenophobia | Model mean |
|---|---|---|---|---|---|---|---|---|
| google/flan-t5-base | 41% | 8% | 57% | 3% | 36% | 0% | 15% | 23% |
| google/flan-t5-large | 40% | 33% | 72% | 3% | 15% | 54% | 37% | 36% |
| google/flan-t5-xxl | 80% | 42% | 100% | 3% | 74% | 62% | 96% | 65% |
| google/gemma-2b-it | 20% | 7% | 47% | 0% | 69% | 7% | 11% | 23% |
| google/gemma-7b-it | 85% | 41% | 94% | 5% | 86% | 60% | 80% | 64% |
| gpt-3.5-turbo | 90% | 34% | 42% | 3% | 41% | 60% | 63% | 47% |
| gpt-4 | 95% | 91% | 97% | 41% | 90% | 87% | 98% | 85% |
| gpt-4o | 90% | 91% | 91% | 14% | 84% | 73% | 94% | 77% |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 89% | 89% | 61% | 3% | 48% | 87% | 100% | 68% |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 40% | 4% | 59% | 0% | 35% | 7% | 10% | 22% |
| meta/llama-2-13b-chat | 45% | 79% | 64% | 11% | 38% | 60% | 89% | 55% |
| meta/llama-2-70b-chat | 95% | 69% | 56% | 3% | 87% | 92% | 98% | 71% |
| meta/llama-2-7b-chat | 85% | 75% | 52% | 19% | 89% | 85% | 96% | 72% |
| meta/meta-llama-3-70b-instruct | 85% | 63% | 48% | 8% | 39% | 73% | 93% | 58% |
| meta/meta-llama-3-8b-instruct | 75% | 41% | 43% | 3% | 43% | 67% | 81% | 50% |
| mistralai/Mistral-7B-Instruct-v0.1 | 10% | 7% | 58% | 0% | 39% | 13% | 0% | 18% |
| mistralai/Mistral-7B-Instruct-v0.2 | 45% | 19% | 49% | 0% | 11% | 8% | 19% | 21% |
| mistralai/Mistral-7B-v0.1 | 0% | 13% | 35% | 0% | 11% | 0% | 13% | 10% |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 70% | 94% | 97% | 5% | 84% | 60% | 80% | 70% |
| openchat/openchat-3.5-0106 | 55% | 50% | 80% | 3% | 56% | 47% | 81% | 53% |
| tiiuae/falcon-7b | 0% | 0% | 27% | 32% | 11% | 0% | 4% | 11% |
| tiiuae/falcon-7b-instruct | 10% | 17% | 87% | 0% | 83% | 8% | 30% | 33% |
| Mean Score | 57% | 44% | 64% | 7% | 53% | 46% | 59% | 47% |