LLM Techniques & Metrics · 27 Feb, 2025

Evaluation of Large Language Models on Turkish Reasoning Datasets

Recent advancements in large language models (LLMs) have enhanced their ability to tackle complex reasoning tasks across varied datasets.

  • Recent breakthroughs in large language models (LLMs) have improved their reasoning capabilities, but evaluating performance across non-English datasets remains a challenge.

  • This study benchmarks two advanced LLMs—Qwen/QwQ-32B-Preview and DeepSeek-R1-Distill-Qwen-32B—on three machine-translated Turkish reasoning datasets: MMLU-TR, GPQA-TR, and ARC-TR.

  • Using GPT-4o-mini as an automated judge model, the evaluation measures model accuracy, token usage, and latency to assess trade-offs between speed and reasoning precision.

  • The results reveal important differences in model strengths across domains such as law, science, and general knowledge—insights that can inform future LLM development and deployment in Turkish NLP tasks.

Figure 1: Overview of the Reasoning Large Language Model

Structured Evaluation Pipeline for Machine-Translated Turkish Reasoning Datasets

Pipeline Overview

To effectively assess LLM performance on Turkish reasoning tasks, we designed a three-part evaluation pipeline that ensures consistent processing, accurate scoring, and structured output across all datasets.

Inference Models

For this study, we selected two high-performing LLMs with strong reasoning capabilities, accessed through modern inference APIs, to generate answers across all datasets.

  • DeepSeek-R1-Distill-Qwen-32B (accessed via Groq API)

  • Qwen/QwQ-32B-Preview (accessed via Nebius AI API)

Scoring Mechanism

To maintain objectivity and consistency, GPT-4o-mini was used as an automated judge model, evaluating the accuracy of responses based on ground truth comparisons.

Key Features:

  • Automated Response Parsing: Extracts the selected answer from text-based model outputs.

  • Comparison with Ground Truth: Checks whether the selected answer matches the expected label.

  • JSON-Formatted Evaluation Output: Ensures structured and interpretable evaluation results.
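The three features above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual judge: it swaps the GPT-4o-mini judge for a deterministic regular expression, and it assumes the model writes its chosen option index inside \boxed{...}. The names `extract_choice` and `score_response` and the JSON keys are hypothetical.

```python
import json
import re
from typing import Optional

# Assumption: the model wraps its chosen option index in \boxed{...}.
BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")


def extract_choice(response: str) -> Optional[int]:
    """Return the last boxed index in the response, or None if none is found."""
    matches = BOXED.findall(response)
    return int(matches[-1]) if matches else None


def score_response(response: str, expected: int) -> str:
    """Compare the extracted choice with the ground-truth label; return JSON."""
    choice = extract_choice(response)
    result = {
        "selected": choice,
        "expected": expected,
        "correct": choice is not None and choice == expected,
    }
    return json.dumps(result)


print(score_response(r"Adım adım düşünelim... Cevap: \boxed{2}", expected=2))
```

A regex parser like this only works when the output follows the expected format; a judge model can also handle free-form answers, which is presumably why one was used here.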

Dataset Processing & Evaluation Execution

The pipeline utilizes machine-translated Turkish reasoning datasets for model evaluation. The dataset processing component includes:

  • Dataset Loading: Retrieves reasoning datasets (MMLU-TR, GPQA-TR, ARC-TR) via the dataset library.

  • Standardized Formatting: Converts each dataset sample into a uniform multiple-choice question format.

  • Evaluation Execution: Calls the inference models and scoring mechanism to generate structured evaluation reports.
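As a rough sketch of the standardized formatting step, the function below maps a raw record onto one uniform multiple-choice schema. The record keys ("question", "choices", "answer"), the `MCQSample` type, and the 0-based index convention are assumptions made for illustration; the actual field names in the three datasets may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class MCQSample:
    question: str
    choices: List[str]
    answer_index: int  # 0-based position of the correct choice (assumed convention)


def normalize(record: Dict[str, Any]) -> MCQSample:
    """Map one raw record onto the uniform schema.

    The keys used here are hypothetical. The answer may be an integer
    index, a digit string, or a letter label such as "B".
    """
    answer = record["answer"]
    if isinstance(answer, str):
        answer = answer.strip()
        # Digit strings are read as indices; letters map A=0, B=1, ...
        answer = int(answer) if answer.isdigit() else ord(answer.upper()) - ord("A")
    choices = [str(c) for c in record["choices"]]
    if not 0 <= answer < len(choices):
        raise ValueError(f"answer index {answer} out of range")
    return MCQSample(str(record["question"]), choices, int(answer))


sample = normalize({
    "question": "Türkiye'nin başkenti neresidir?",
    "choices": ["İstanbul", "Ankara", "İzmir", "Bursa"],
    "answer": "B",
})
print(sample.answer_index)
```

Normalizing to one schema up front means the prompt builder and the scorer only need to handle a single record shape, regardless of which dataset a sample came from.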

Dataset Details

MMLU-TR (Massive Multitask Language Understanding - Turkish Version)

  • Source: MMLU-TR Dataset

  • Description: The MMLU-TR dataset is a Turkish adaptation of the MMLU benchmark, which covers diverse topics. The subset selected for this study focuses on professional law questions, evaluating LLMs’ ability to process and reason about legal concepts, principles, and case-based scenarios.

  • Task Type: Multiple-choice reasoning

Table 1: MMLU-TR-v0.2 Dataset Details

GPQA-TR (Graduate-Level Google-Proof Q&A - Turkish Version)

  • Source: GPQA-Formatted-TR Dataset

  • Description: The GPQA-TR dataset is a Turkish adaptation of the Diamond subset of GPQA (from Jegger/GPQA). It contains scientific and general-knowledge multiple-choice questions.

  • Task Type: Open-domain reasoning questions

Table 2: GPQA-formatted-TR Dataset Details

ARC-TR (AI2 Reasoning Challenge - Turkish Version)

  • Source: ARC-TR-v0.2 Dataset

  • Description: The ARC-TR dataset is a Turkish version of the ARC, a dataset designed to test complex reasoning skills in AI models. It focuses on scientific question answering using structured multiple-choice formats.

  • Task Type: Complex reasoning in science-related questions

Table 3: ARC-TR-v0.2 Dataset Details

Model & Inference Platform Summary

In this evaluation pipeline, large language models (LLMs) serve as generators, producing responses for Turkish reasoning datasets. Below is a summary of the inference platforms used:

  • Nebius AI Studio (Qwen/QwQ-32B-Preview)

  • Groq Cloud (DeepSeek-R1-Distill-Qwen-32B)

  • Qwen/QwQ-32B-Preview Model Parameters

  • DeepSeek-R1-Distill-Qwen-32B Model Parameters

Evaluation Insights

Key Findings:

DeepSeek-R1-Distill-Qwen-32B outperformed Qwen/QwQ-32B-Preview on GPQA-formatted-TR but trailed it on MMLU-TR-v0.2 and ARC-TR-v0.2.

  • GPQA-formatted-TR: DeepSeek-R1-Distill-Qwen-32B (46.6%) outperformed Qwen/QwQ-32B-Preview (40.6%).

  • MMLU-TR-v0.2: Qwen/QwQ-32B-Preview achieved a higher score (45%) compared to DeepSeek-R1-Distill-Qwen-32B (44.1%).

  • ARC-TR-v0.2: Qwen/QwQ-32B-Preview led significantly with 87.2%, while DeepSeek-R1-Distill-Qwen-32B scored 81.1%.

Latency Comparison:

  • DeepSeek-R1-Distill-Qwen-32B exhibited significantly lower latency across all datasets.

  • On GPQA-formatted-TR, DeepSeek completed evaluations in 1h 58m, whereas Qwen took 3h 24m.

  • In MMLU-TR-v0.2 and ARC-TR-v0.2, DeepSeek was nearly twice as fast.

Token Consumption:

  • DeepSeek processed ARC-TR-v0.2 with fewer tokens (650K vs. 1.07M), showing better token efficiency.

  • In MMLU-TR-v0.2, Qwen used more tokens (1.75M) than DeepSeek and achieved a marginally better score.
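The reported figures can be turned into simple ratios to make the gap concrete. The short calculation below uses only the numbers quoted above (GPQA-formatted-TR run times and ARC-TR-v0.2 token counts); the token counts are the rounded values from the text, so the ratio is approximate.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an h/m duration to total minutes."""
    return hours * 60 + minutes


deepseek_min = to_minutes(1, 58)  # GPQA-formatted-TR, DeepSeek
qwen_min = to_minutes(3, 24)      # GPQA-formatted-TR, Qwen

latency_ratio = qwen_min / deepseek_min
token_ratio = 1_070_000 / 650_000  # ARC-TR-v0.2, Qwen vs. DeepSeek (rounded counts)

print(f"GPQA wall-clock ratio (Qwen/DeepSeek): {latency_ratio:.2f}")
print(f"ARC token ratio (Qwen/DeepSeek): {token_ratio:.2f}")
```

On these figures, Qwen took roughly 1.7 times as long as DeepSeek on GPQA-formatted-TR and used roughly 1.6 times as many tokens on ARC-TR-v0.2.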

Prompt Formatting and Answer Selection

To ensure model responses aligned with the prompt's multiple-choice format (four options with indexed answers), structured formatting was used. Indices were placed before each answer choice, and the correct index was highlighted within \boxed{} notation. The model was instructed to select an answer based on these indices, improving consistency and preventing format deviations.

  • MMLU-TR-v0.2 Data

  • GPQA-formatted-TR Data

  • ARC-TR-v0.2 Data
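A prompt in this style might be assembled as in the sketch below. The exact wording of the instruction, the "Soru:" prefix, and the use of 0-based indices are assumptions for illustration; only the general pattern (an index before each choice, with the answer requested inside \boxed{}) comes from the description above.

```python
from typing import List


def build_prompt(question: str, choices: List[str]) -> str:
    """Render a question with an index before each answer choice.

    The instruction text and 0-based numbering are hypothetical.
    """
    lines = [f"Soru: {question}", ""]
    lines += [f"{i}. {choice}" for i, choice in enumerate(choices)]
    lines += [
        "",
        "Answer with the index of the correct choice only, "
        "written as \\boxed{index}.",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "Su hangi sıcaklıkta kaynar (deniz seviyesinde)?",
    ["50 °C", "100 °C", "150 °C", "200 °C"],
)
print(prompt)
```

Asking for a single boxed index keeps the final answer easy to locate, even when the model produces a long chain of reasoning before it.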

Our Take

In our evaluation, DeepSeek-R1-Distill-Qwen-32B stood out for its speed and token efficiency, particularly in the GPQA-formatted-TR dataset. Its high throughput and reduced latency make it an attractive option for scenarios that demand fast, large-scale inference. On the other hand, Qwen/QwQ-32B-Preview demonstrated stronger reasoning performance on the MMLU and ARC datasets, indicating its suitability for more structured, domain-specific tasks like legal or scientific reasoning.

These results point to a trade-off between speed and accuracy. DeepSeek was faster on every dataset and more accurate on GPQA-formatted-TR, while Qwen was more accurate on MMLU-TR-v0.2 (by under one percentage point) and on ARC-TR-v0.2 (by about six points). Model selection should therefore be guided by the specific requirements of the task: whether the priority is inference speed or higher accuracy in knowledge-heavy domains.

Additionally, we found that prompt formatting played a critical role in stabilizing model behavior. By structuring the prompts with clearly indexed answer options and highlighting the correct choice using \boxed{}, we reduced format drift and improved response consistency across both models. This small but important design choice proved especially effective in maintaining evaluation standards when working with machine-translated Turkish datasets.

Looking ahead, both models show promise, but further fine-tuning and domain-specific adaptation could enhance their performance even more. As the demand for high-performing multilingual LLMs continues to grow, refining model alignment for languages like Turkish will be key to unlocking broader, more inclusive AI capabilities.

Key Takeaways

  • Qwen/QwQ-32B-Preview demonstrated stronger reasoning performance on legal and scientific datasets (MMLU-TR and ARC-TR), making it a better fit for domain-specific tasks requiring higher accuracy.
  • DeepSeek-R1-Distill-Qwen-32B excelled in efficiency, delivering faster inference speeds and lower token consumption—especially notable on the GPQA-formatted-TR dataset.
  • There is a trade-off between speed and accuracy. DeepSeek offers significant latency and throughput advantages, while Qwen was more accurate on two of the three datasets (MMLU-TR and ARC-TR).
  • Prompt formatting using structured indices and \boxed{} notation improved answer consistency, helping models stay aligned with multiple-choice formats during evaluation.
  • Machine translation introduces challenges in Turkish NLP, particularly around specialized terminology. These datasets require further refinement or domain adaptation for optimal performance.
  • Using GPT-4o-mini as a judge model gave a standardized, automated evaluation, so both models' outputs were scored the same way across all tasks.
  • Overall, both models show promise for Turkish-language reasoning, but task type, latency needs, and token costs should guide model selection based on use case.
