
EXAONE-Deep Performance Evaluation

Performance Overview

EXAONE-Deep models demonstrate exceptional performance across multiple benchmarks, particularly excelling in tasks that require complex logical thinking such as mathematical reasoning, scientific understanding, and code writing. Below is a detailed assessment of the models' performance on these benchmarks.

Mathematical Capability

In the MATH-500 test, the EXAONE-Deep 32B model achieved a score of 95.7, the 7.8B model reached 94.8, and the 2.4B model scored 92.3, each leading its respective model-size category.

Notably, in the CSAT 2025 Math test, the 32B model scored 94.5, the highest result among all models in the comparison.

Scientific Reasoning

In the GPQA Diamond test, which evaluates PhD-level problem-solving capabilities in physics, chemistry, and biology, EXAONE-Deep models achieved excellent scores of 66.1 (32B), 62.6 (7.8B), and 54.3 (2.4B).

The 7.8B and 2.4B models ranked first in their respective size categories, demonstrating exceptional scientific reasoning capabilities.

Code Generation

In the LiveCodeBench test, EXAONE-Deep models achieved scores of 59.5 (32B), 55.2 (7.8B), and 46.6 (2.4B).

Compared with other models of similar size, the EXAONE-Deep series leads in code writing and understanding, with the 7.8B model outperforming the OpenAI o1-mini model.
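The scores above come from sampling model responses to benchmark problems and grading them. As a rough illustration only (not the official evaluation harness), the sketch below generates a single response from an EXAONE-Deep checkpoint with Hugging Face transformers; the model ID, prompt, and generation settings are assumptions and should be adapted to the evaluation you want to reproduce.

```python
# Illustrative sketch: generate one response from an EXAONE-Deep checkpoint.
# The model ID and sampling settings below are assumptions, not the benchmark setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# A MATH-style prompt; the \boxed{} convention here is an assumed answer format.
prompt = "Solve: If 3x + 5 = 20, what is x? Put the final answer in \\boxed{}."
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

In a benchmark run, this generation step would be repeated for every problem (and, for metrics such as cons@64, repeated many times per problem) before scoring.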

Mathematical Reasoning Performance Comparison

| Model | MATH-500 (pass@1) | AIME 2024 (pass@1 / cons@64) | AIME 2025 (pass@1 / cons@64) | CSAT Math 2025 (pass@1) |
| --- | --- | --- | --- | --- |
| EXAONE Deep 32B | 95.7 | 72.1 / 90.0 | 65.8 / 80.0 | 94.5 |
| DeepSeek-R1-Distill-Qwen-32B | 94.3 | 72.6 / 83.3 | 55.2 / 73.3 | 84.1 |
| QwQ-32B | 95.5 | 79.5 / 86.7 | 67.1 / 76.7 | 94.4 |
| DeepSeek-R1-Distill-Llama-70B | 94.5 | 70.0 / 86.7 | 53.9 / 66.7 | 88.8 |
| DeepSeek-R1 (671B) | 97.3 | 79.8 / 86.7 | 66.8 / 80.0 | 89.9 |
| EXAONE Deep 7.8B | 94.8 | 70.0 / 83.3 | 59.6 / 76.7 | 89.9 |
| DeepSeek-R1-Distill-Qwen-7B | 92.8 | 55.5 / 83.3 | 38.5 / 56.7 | 79.7 |
| DeepSeek-R1-Distill-Llama-8B | 89.1 | 50.4 / 80.0 | 33.6 / 53.3 | 74.1 |
| OpenAI o1-mini | 90.0 | 63.6 / 80.0 | 54.8 / 66.7 | 84.4 |
| EXAONE Deep 2.4B | 92.3 | 52.5 / 76.7 | 47.9 / 73.3 | 79.2 |
| DeepSeek-R1-Distill-Qwen-1.5B | 83.9 | 28.9 / 52.7 | 23.9 / 36.7 | 65.6 |

In terms of mathematical reasoning, the EXAONE-Deep series models demonstrate exceptional performance:

• EXAONE-Deep 32B achieved the highest score of 94.5 in the CSAT 2025 Math test, and leads all models with a score of 90.0 in AIME 2024 cons@64.

• EXAONE-Deep 7.8B scored 94.8 in the MATH-500 test, far exceeding other models of similar size and even outperforming OpenAI o1-mini.

• EXAONE-Deep 2.4B significantly outperforms the DeepSeek-R1-Distill-Qwen-1.5B model in all tests, demonstrating its superiority among lightweight models.


Benchmark Descriptions

MATH-500

A collection of 500 high-quality math problems covering algebra, geometry, combinatorics, probability theory, and other fields, used to evaluate a model's mathematical reasoning ability.
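Scoring a set like this typically means extracting the model's final answer from its response and comparing it with the reference. The sketch below is an assumed, minimal illustration of that step; the \boxed{} convention and the flat regex are simplifications, not the official MATH-500 grader.

```python
# Illustrative sketch: pull the final \boxed{...} answer out of a model response.
import re

def extract_boxed_answer(response: str) -> str | None:
    # Take the last \boxed{...} occurrence as the final answer (assumed convention;
    # nested braces are not handled in this simplified version).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

print(extract_boxed_answer(r"... so the answer is \boxed{42}."))  # prints "42"
```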

AIME

The American Invitational Mathematics Examination, a high-difficulty contest that serves as a qualifying stage for the USA Mathematical Olympiad. AIME 2024/2025 refers to the most recent contests; pass@1 denotes single-sample accuracy, while cons@64 denotes majority-vote (consensus) accuracy over 64 sampled responses per problem.
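To make the two metrics concrete, here is a minimal sketch of how they could be computed from sampled answers. The data layout and helper names are illustrative assumptions, not the official evaluation code.

```python
# Illustrative sketch: estimate pass@1 and cons@k (majority vote) from sampled answers.
# `samples` maps each problem ID to a list of extracted final answers (assumed layout).
from collections import Counter

def pass_at_1(samples: dict, gold: dict) -> float:
    # pass@1 estimated as the average per-sample accuracy on each problem.
    per_problem = [
        sum(ans == gold[pid] for ans in samples[pid]) / len(samples[pid])
        for pid in gold
    ]
    return sum(per_problem) / len(per_problem)

def cons_at_k(samples: dict, gold: dict) -> float:
    # Fraction of problems whose majority (consensus) answer matches the reference.
    correct = 0
    for pid in gold:
        majority_answer, _ = Counter(samples[pid]).most_common(1)[0]
        correct += (majority_answer == gold[pid])
    return correct / len(gold)

# Toy usage with 3 samples per problem instead of 64:
samples = {"p1": ["42", "42", "41"], "p2": ["7", "8", "8"]}
gold = {"p1": "42", "p2": "7"}
print(pass_at_1(samples, gold))  # 0.5
print(cons_at_k(samples, gold))  # 0.5
```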

CSAT Math

The mathematics section of the Korean College Scholastic Ability Test, used to assess the model's ability to solve high school-level math problems.

GPQA Diamond

A test that evaluates PhD-level problem-solving capabilities, covering difficult problems in physics, chemistry, and biology.

LiveCodeBench

A continuously updated code-generation benchmark that tests the model's performance on solving programming problems, writing functional code, and debugging.
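As a rough illustration of the kind of functional-correctness check such a benchmark performs (this is not the LiveCodeBench harness), the sketch below runs a generated solution against example input/output test cases in a subprocess; the test data and timeout are placeholder assumptions.

```python
# Illustrative sketch: check a generated program against stdin/stdout test cases.
import subprocess
import sys

generated_code = """
a, b = map(int, input().split())
print(a + b)
"""

# (stdin, expected stdout) pairs; placeholder data, not real benchmark cases.
test_cases = [("1 2", "3"), ("10 -4", "6")]

def passes_all(code: str, cases) -> bool:
    for stdin_data, expected in cases:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_data, capture_output=True, text=True, timeout=5,
        )
        if result.stdout.strip() != expected:
            return False
    return True

print(passes_all(generated_code, test_cases))  # True if every test case passes
```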

MMLU

Massive Multitask Language Understanding, a comprehensive assessment of the model's knowledge mastery across various academic disciplines.