MLA-Reject

MLA-Reject (MultiLingual Augmented Reject) is a research tool designed to evaluate the security robustness of Large Language Models (LLMs) against adversarial prompting techniques. The system enables researchers to develop and assess prompt-based attacks targeting sensitive content generation, while measuring how model defenses vary across different languages through automated translation capabilities.
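
To make this concrete, the sketch below shows the general shape of such a translation-based evaluation loop. It is a minimal illustration under stated assumptions, not the MLA-Reject API: the helpers `translate`, `query_model`, and `is_refusal` are hypothetical placeholders standing in for a real translation backend, the model under test, and a response scorer.

```python
# Minimal sketch of a multilingual robustness check.
# translate(), query_model(), and is_refusal() are hypothetical
# placeholders, not part of the MLA-Reject codebase.

LANGUAGES = ["en", "de", "fr", "tr"]  # example target languages


def translate(prompt: str, lang: str) -> str:
    """Placeholder: route the prompt through a translation backend."""
    return prompt if lang == "en" else f"[{lang}] {prompt}"


def query_model(prompt: str) -> str:
    """Placeholder: send the prompt to the LLM under test."""
    return "I cannot help with that."


def is_refusal(response: str) -> bool:
    """Placeholder: a real evaluator would score the response more
    carefully, e.g. with a fine-tuned classifier as in StrongREJECT."""
    return "cannot" in response.lower()


def evaluate(prompt: str) -> dict[str, bool]:
    """Return, per language, whether the model refused the prompt."""
    return {lang: is_refusal(query_model(translate(prompt, lang)))
            for lang in LANGUAGES}


if __name__ == "__main__":
    print(evaluate("example test prompt"))
```

Comparing the per-language refusal map then reveals whether a model's defenses hold uniformly or weaken once the same request is expressed in another language.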

The tool incorporates paraphrasing functionality to generate multiple semantic variations of test prompts, ensuring that evaluations focus on meaning rather than specific wording or syntax. This approach allows for comprehensive testing of both the original prompts and any adversarial techniques that modify or extend them.
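
The sketch below illustrates this idea under similar assumptions: `paraphrase` is a hypothetical placeholder (a real implementation might ask an auxiliary LLM for rewordings that preserve intent), and each variant can optionally be passed through an adversarial transformation, so both the original prompt and its modified forms are covered.

```python
# Sketch of paraphrase-augmented testing: generate several semantic
# variants of a test prompt and evaluate each one, so that a refusal
# (or bypass) is attributed to the prompt's meaning rather than its
# exact wording. paraphrase() is a hypothetical placeholder.
from typing import Callable, Optional


def paraphrase(prompt: str, n: int = 3) -> list[str]:
    """Placeholder: produce n rewordings that preserve the prompt's
    intent. A real implementation would use an auxiliary LLM."""
    templates = [
        "{p}",
        "In other words: {p}",
        "Rephrased, the question is: {p}",
    ]
    return [t.format(p=prompt) for t in templates[:n]]


def build_variants(prompt: str,
                   attack: Optional[Callable[[str], str]] = None) -> list[str]:
    """Expand one test prompt into its paraphrases, optionally applying
    an adversarial transformation to every variant."""
    variants = paraphrase(prompt)
    if attack is not None:
        variants = [attack(v) for v in variants]
    return variants
```

Each variant produced by `build_variants` can then be fed through the same multilingual evaluation loop shown above.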

By providing systematic analysis of model vulnerabilities across linguistic and syntactic variations, MLA-Reject offers valuable insights into LLM security weaknesses. These findings support researchers and organizations in developing more robust AI safety measures and making evidence-based decisions regarding secure model deployment in multilingual environments.


About the project

MLA-Reject is a multilingual, enhanced testing extension of StrongREJECT from UC Berkeley, developed by the AI Readiness and Assessment (AIRA) research group. Relevant scientific literature on this project:

Souly, Alexandra, et al. "A StrongREJECT for Empty Jailbreaks." arXiv preprint arXiv:2402.10260 (2024).

For further information, feel free to contact us.
