LLM Cooperativeness Benchmark
As multilingual AI systems become increasingly integrated into decision-making contexts worldwide, understanding how linguistic and cultural diversity influences AI behavior is becoming crucial. To explore this, we simulated the decision-making behavior of two AI agents in a classic game-theory scenario, the Prisoner's Dilemma, using FAIRGAME, a framework we developed to systematically evaluate cooperation and betrayal tendencies across different languages.
In the Prisoner's Dilemma, each agent must choose between two options: cooperate with the other agent for a mutually beneficial outcome, or betray the other in pursuit of a higher individual reward, at the risk of mutual loss if both choose betrayal. This dilemma encapsulates the fundamental tension between individual rationality and collective benefit, making it a cornerstone model for studying social behavior and strategic decision-making.
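The payoff structure described above can be sketched in a few lines of Python. The specific numeric values below are illustrative assumptions, not the ones used in FAIRGAME; they simply follow the classic ordering (temptation > reward > punishment > sucker's payoff) that defines the dilemma.

```python
# Hypothetical payoffs for illustration only; FAIRGAME's actual values may differ.
# The ordering T > R > P > S is what makes betrayal individually tempting
# while mutual cooperation remains collectively better.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # R: mutual reward
    ("cooperate", "betray"):    (0, 5),  # S vs T: sucker's payoff vs temptation
    ("betray",    "cooperate"): (5, 0),
    ("betray",    "betray"):    (1, 1),  # P: mutual punishment
}

def play_round(choice_a: str, choice_b: str) -> tuple[int, int]:
    """Return (payoff_a, payoff_b) for one round of the dilemma."""
    return PAYOFFS[(choice_a, choice_b)]
```

Note that mutual betrayal (1, 1) leaves both agents worse off than mutual cooperation (3, 3), which is exactly the tension between individual rationality and collective benefit described above.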

About the project
The simulations were conducted in five widely spoken languages from diverse linguistic families: English, French, Arabic, Chinese, and Vietnamese. Several large language models (LLMs) were used as the AI agents, allowing us to compare their decision-making behaviors across languages. For each model and each language, we ran several thousand simulations, using carefully designed variations of the core Prisoner's Dilemma scenario. These variations preserved the essential structure of the dilemma while altering surface details, ensuring that results reflected generalizable trends rather than scenario-specific biases.
The findings revealed that both the language and the underlying LLM significantly influenced the agents' decision-making. Specifically, the linguistic framing shaped the agents' likelihood of choosing cooperation over betrayal, while differences across LLMs highlighted variability in how models handle strategic social reasoning. A higher benchmark score in our results corresponds to a greater propensity for an AI agent to cooperate when prompted in that language.
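Since the benchmark score reflects an agent's propensity to cooperate, it can be understood as a cooperation rate aggregated over simulations. The exact aggregation FAIRGAME uses is not specified here, so the sketch below assumes a simple fraction of cooperative choices:

```python
def cooperativeness_score(choices: list[str]) -> float:
    """Fraction of simulations in which the agent chose to cooperate.

    `choices` is a list of decision labels ("cooperate" or "betray")
    recorded across simulation runs for one model/language pair.
    This simple rate is an illustrative assumption, not FAIRGAME's
    documented scoring formula.
    """
    if not choices:
        return 0.0
    return sum(c == "cooperate" for c in choices) / len(choices)
```

Under this reading, a score of 0.75 for a given language would mean the agent cooperated in three quarters of the simulated dilemmas when prompted in that language.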
These results open important avenues for research into how language, culture, and model-specific characteristics might shape AI behavior, with significant implications for the design and deployment of multilingual AI systems across the globe.
The Cooperativeness Leaderboard was developed by the AI Readiness and Assessment (AIRA) research group. Relevant scientific literature on this project:
Buscemi, Alessio, et al. "FAIRGAME: A Framework for AI Agents Bias Recognition Using Game Theory." arXiv preprint arXiv:2504.14325 (2025).
For further information, feel free to contact us.