Introducing Mezura: The Premier Benchmark for Evaluating Turkish and Multilingual Large Language Models
We are thrilled to announce the launch of Mezura, our comprehensive and open benchmarking platform designed to rigorously evaluate large language models (LLMs) on Turkish and multilingual tasks. Mezura offers systematic, multi-dimensional, and transparent evaluation tools built to serve academic researchers, industry practitioners, and the open-source community.

What is Mezura?
Mezura is a leaderboard and evaluation framework that benchmarks language models on a diverse set of realistic, challenging tasks covering the Turkish language, the Turkish legal domain, and multilingual capabilities. It shows where each model is strong or weak and enables fair, scalable comparisons across the latest LLMs.
Evaluation Categories
Our benchmarking suite currently features six core evaluation categories, each designed to target specific capabilities of modern LLMs, with a seventh (Structured Outputs) on the way:
- ⚔️ Auto Arena: A cutting-edge automatic tournament-style evaluation framework that ranks models head-to-head using an Elo rating system. It benchmarks model performance on 11 legal tasks drawn from an extensive dataset of Turkish legal question-answer pairs, incorporating automated judging with dynamic prompt adaptation.
- 👥 Human Arena: A community-driven evaluation platform where models are compared via blind human preference voting. This captures nuanced qualitative judgments on response quality, clarity, helpfulness, creativity, and safety from expert reviewers.
- 📚 Retrieval: A benchmark focused on Retrieval-Augmented Generation (RAG) capabilities, evaluating how accurately models retrieve and incorporate information from Turkish legal knowledge bases, alongside metrics for factual reliability and response coherence.
- ⚡ Light Eval: A fast and modular framework for quick benchmarking across academic, logical, scientific, and mathematical reasoning tasks, including professional law-related questions.
- 🔄 EvalMix: A comprehensive multi-dimensional evaluation pipeline combining LLM-based judges, traditional lexical metrics (BLEU, ROUGE), and semantic similarity measures. EvalMix supports both Turkish and multilingual evaluation scenarios.
- 🐍 Snake Bench: A unique evaluation where models play the classic Snake game using step-by-step reasoning, testing spatial awareness, problem-solving, and logical thinking.
- 🧩 Structured Outputs: Coming soon!
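To make the Elo ranking used by Auto Arena more concrete, here is a minimal sketch of the standard Elo update applied to a list of head-to-head results. It is an illustration only: the K-factor of 32, the starting rating of 1000, the function names, and the model names are all assumptions chosen for the example, not Mezura's actual settings or code.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.5 for a tie, and 0.0 if A lost.
    k is an assumed K-factor, not a value taken from Mezura.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def rank_models(matches: Iterable[Tuple[str, str, float]],
                initial: float = 1000.0,
                k: float = 32.0) -> List[Tuple[str, float]]:
    """Process (model_a, model_b, score_a) results in order and
    return models sorted from highest to lowest rating."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in matches:
        ra = ratings.get(model_a, initial)
        rb = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judged outcomes between three placeholder models.
    results = [
        ("model-x", "model-y", 1.0),
        ("model-y", "model-z", 0.5),
        ("model-x", "model-z", 0.0),
    ]
    for name, rating in rank_models(results):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on the ratings at the time of the match, the order in which results are processed affects the final numbers; tournament frameworks often shuffle or repeat passes over the matches to reduce that effect.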
Why Mezura on Hugging Face?
Hosting Mezura on Hugging Face Spaces allows us to offer:
- Transparent, unbiased, and reproducible benchmarking accessible to researchers and developers worldwide.
- Robust, scalable infrastructure enabling deep, continuous comparative insights.
- Specialized focus on Turkish language and culture alongside multilingual benchmarks.
- An open ecosystem fostering collaboration among academia, startups, and industry leaders.
Join the Mezura Leaderboard
We invite researchers, developers, and model creators to submit their models to Mezura. Showcase your AI’s capabilities on real-world tasks, benefit from research-grade evaluations, and stay updated with ongoing task additions and performance analyses.
Join Us on This Journey
Together, we’re building a transparent, rigorous, and dynamic ecosystem for large language model benchmarking that empowers Türkiye's AI community and beyond. Follow our progress and help redefine LLM evaluation for Turkish and multilingual AI.
LinkedIn: https://www.linkedin.com/company/newmind-ai
Mezura on Hugging Face Spaces: https://huggingface.co/spaces/newmindai/Mezura