Poetiq AI reports a remarkable 75% accuracy on ARC-AGI-2

25/12/2025

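AI startup Poetiq AI reports that it scored 75% on ARC-AGI-2, a notoriously difficult intelligence benchmark, using the latest GPT-5.2 X-High. According to the company, the result comes not from a single model solving the tasks on its own but from a proprietary "meta-system" (harness) in which multiple agents repeatedly generate programs and self-correct.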
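To give a feel for what "generate programs and self-correct" can mean, here is a minimal propose-and-verify sketch on ARC-style grids. It is not Poetiq's harness, whose details are not given in this post: the fixed `CANDIDATES` list is a hypothetical stand-in for programs an LLM would write, and `solve` and `count_failures` are names invented for this illustration.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)


# Hypothetical candidate programs. In a real harness these would be
# generated by a language model rather than written by hand.
def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


CANDIDATES: List[Tuple[str, Program]] = [
    ("identity", identity),
    ("flip_horizontal", flip_horizontal),
    ("flip_vertical", flip_vertical),
    ("transpose", transpose),
]


def count_failures(program: Program, examples: List[Example]) -> int:
    """Number of training examples the program gets wrong."""
    return sum(1 for inp, out in examples if program(inp) != out)


def solve(examples: List[Example], max_rounds: int = 10) -> Optional[Tuple[str, Program]]:
    """Try candidates in order; return the first that passes every example."""
    for name, program in CANDIDATES[:max_rounds]:
        if count_failures(program, examples) == 0:
            return name, program
        # Assumption: a real system would feed the failing cases back to
        # the model here so that the next candidate is a corrected one.
    return None


train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
result = solve(train)
if result is not None:
    name, program = result
    print(name, program([[5, 6, 7]]))
```

The point of the design is that each candidate is checked against the training examples before it is trusted on the test input, so a wrong guess costs compute rather than correctness.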
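This score is well above the 60% human average. There is precedent for a gap between public data and private tests, however: in its previous release (November 20, 2025), Poetiq AI reported 65.32%, while the official evaluation came in at 54.0%, a drop of 11.32 percentage points. The official score may well fall by 10 points or more this time too.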
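The arithmetic behind that concern can be worked through in a few lines. This is only an extrapolation under the assumption that the previous gap repeats, either as the same number of percentage points or as the same ratio; it is not a prediction, and the variable names are made up for this example.

```python
# Figures taken from the post.
PREVIOUS_SELF_REPORTED = 65.32  # Poetiq's reported score, November 20, 2025
PREVIOUS_OFFICIAL = 54.0        # officially evaluated score for that release
NEW_SELF_REPORTED = 75.0        # newly reported score with GPT-5.2 X-High
HUMAN_AVERAGE = 60.0            # human average cited in the post

# Gap in percentage points between self-reported and official results.
point_drop = round(PREVIOUS_SELF_REPORTED - PREVIOUS_OFFICIAL, 2)

# Assumption 1: the same absolute gap (in points) appears again.
projected_same_points = round(NEW_SELF_REPORTED - point_drop, 2)

# Assumption 2: the same ratio of official to self-reported score holds.
retention = PREVIOUS_OFFICIAL / PREVIOUS_SELF_REPORTED
projected_same_ratio = NEW_SELF_REPORTED * retention

print(f"previous drop: {point_drop} points")
print(f"projection (same points): {projected_same_points}%")
print(f"projection (same ratio): {projected_same_ratio:.1f}%")
print("above human average:", projected_same_points > HUMAN_AVERAGE)
```

Under either assumption the projected official score lands in the low 60s, only slightly above the 60% human average, which is why the official verification matters so much.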
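The effectiveness of "long-thinking" modes that spend massive compute, together with algorithmic ingenuity that does not depend on ever-larger models, appears to be an important key to artificial general intelligence (AGI). If a better-than-human score ultimately holds up under official verification, it could mark a historic turning point in AI reasoning capability.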
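I had generative AI research this situation broadly and explain in detail where things stand and what may come next. I then used NotebookLM to turn the results into an infographic and a slide deck.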
Please note that the generative AI's research and analysis draw only on publicly available information, do not necessarily reflect the actual situation, and may contain errors.

We finally had a moment to run our system with GPT-5.2 X-High on ARC-AGI-2!
December 24, 2025
https://x.com/poetiq_ai/status/2003546910427361402

ARC-AGI-2 Leaderboard
https://arcprize.org/leaderboard

Traversing the Frontier of Superintelligence (November 20, 2025)
https://poetiq.ai/posts/arcagi_announcement/


よろず知財コンサルティング (Yorozu IP Consulting) Blog