Term

GSM-8K

Overview

Last updated: July 9, 2026

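GSM-8K (Grade School Math 8K) is a benchmark dataset of roughly 8,500 grade-school-level math word problems. The problems require multi-step reasoning, so the benchmark measures whether an AI model can work through a chain of logical steps to reach the correct answer. In recent years, many frontier AI models have come to score 90% or higher on this test, and the benchmark is increasingly described as saturated.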
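
As a rough illustration of how such a score could be computed, the following Python sketch does exact-match scoring on final answers. It assumes that both the reference solution and the model output end with a `#### <number>` marker, which is the convention used for reference solutions in the public GSM8K release; real model outputs often need more lenient parsing. The function names `extract_answer` and `accuracy` and the sample problem are hypothetical and are not taken from any official evaluation harness.

```python
import re
from typing import Optional

# Assumption: the final answer follows a "####" marker, e.g. "#### 72".
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker, without thousands separators."""
    matches = FINAL_ANSWER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose final answer string equals the reference's."""
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        gold = extract_answer(ref)
        if gold is not None and extract_answer(pred) == gold:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    # Made-up word problem in the GSM8K style (not from the dataset).
    reference = "3 bags * 4 apples = 12 apples. 12 - 2 eaten = 10 left.\n#### 10"
    good = "First 3 * 4 = 12, then 12 - 2 = 10.\n#### 10"
    bad = "3 + 4 = 7, then 7 - 2 = 5.\n#### 5"
    print(accuracy([good, bad], [reference, reference]))
```

The comparison is on strings, so `10` and `10.0` would count as different answers; a stricter or looser normalization is a design choice that this sketch leaves open.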

Mentioned Articles

1 article

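Technology
Is frontier AI's actual math ability not that high? On the new FrontierMath benchmark, models solve fewer than 2% of problems, bringing the challenges on the road to AGI into sharp focus
As artificial intelligence (AI) advances at an accelerating pace and achieves results approaching human ability in image generation and natural language processing, a new measure has emerged that sharply exposes its limits. The advanced math benchmark test developed by AI research organization Epoch AI, "Fron […]
November 12, 2024 | approx. 6 min read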

External Mentions

6 items

Hacker News: Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training
▲ 265 | xlayn | March 18, 2026
arXiv: Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance
▲ 0 | Rudransh Agnihotri | June 6, 2025
arXiv: DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning
▲ 0 | Atharva Pandey | April 9, 2025
arXiv: GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
▲ 0 | Yang Zhou | February 7, 2025
arXiv: State Stream Transformer (SST) : Emergent Metacognitive Behaviours Through Latent State Persistence
▲ 0 | Thea Aviss | January 30, 2025
arXiv: Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks
▲ 0 | Mahmood Hegazy | October 10, 2024
