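"80% of the code is already AI-made": The tipping point of autonomy that Anthropic is confronting, and the design of how to stop it
According to an Anthropic report, more than 80% of the company's code is generated by AI, and control of development is slipping away from humans. The company warns of the arrival of "recursive self-improvement," in which AI autonomously improves itself, and argues that the industry as a whole should build a control framework.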
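Also known as: Claude Opus 4.6
Anthropic's latest flagship AI model. It has a 1-million-token context window, and its autonomous agentic capabilities in specialized domains such as coding, financial analysis, and research and development have been greatly strengthened. It records high scores on benchmarks such as MRCR v2 and Terminal-Bench and is specialized for actually completing work in business settings.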
The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performances in response to the Radiology’s Diagnosis Please quiz questions. In this study, the AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology’s “Diagnosis Please” cases, which included cases 1 to 322, published from 1998 to 2023. The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone. (2) Condition 2: submitter-provided clinical history and imaging findings (text). (3) Condition 3: clinical history (text) and key images (PNG file). We applied McNemar’s test to evaluate differences in the correct response rates for the overall accuracy under Conditions 1, 2, and 3 for each model and between the models. The correct diagnosis rates were 58/322 (18.0%) and 69/322 (21.4%), 201/322 (62.4%) and 209/322 (64.9%), and 80/322 (24.8%) and 97/322 (30.1%) for Conditions 1, 2, and 3 for Claude 3 Opus and Claude 3.5 Sonnet, respectively. The models provided the correct answer as a differential diagnosis in up to 26/322 (8.1%) for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences were observed in the correct response rates among all combinations of Conditions 1, 2, and 3 for each model (p < 0.01). Claude 3.5 Sonnet outperformed in all conditions, but a statistically significant difference was observed only in the comparison for Condition 3 (30.1% vs. 24.8%, p = 0.028). Two AI models demonstrated a significantly improved diagnostic performance when inputting both key images and clinical history. The models’ ability to identify important differential diagnoses under these conditions was also confirmed.
Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology Diagnosis Please Cases, a monthly diagnostic quiz series for radiology experts. Clinical history and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology Diagnosis Please cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, using their respective application programming interfaces. A comparative analysis of diagnostic performance among these three LLMs was conducted using Cochran’s Q and post hoc McNemar’s tests. The respective diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for primary diagnosis were 41.0%, 54.0%, and 33.9%, which further improved to 49.4%, 62.0%, and 41.0%, when considering the accuracy of any of the top three differential diagnoses. Significant differences in the diagnostic performance were observed among all pairs of models. Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate evaluations and worded descriptions of imaging findings.
In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.
Purpose: Large language models (LLMs) are increasingly employed across various fields, including medicine and dentistry. In the field of dental anesthesiology, LLMs are expected to enhance the efficiency of information gathering, patient outcomes, and education. This study evaluates the performance of different LLMs in answering questions from the Japanese Dental Society of Anesthesiology Board Certification Examination (JDSABCE) to determine their utility in dental anesthesiology. Methods: The study assessed three LLMs, ChatGPT-4 (OpenAI, San Francisco, California, United States), Gemini 1.0 (Google, Mountain View, California, United States), and Claude 3 Opus (Anthropic, San Francisco, California, United States), using multiple-choice questions from the 2020 to 2022 JDSABCE exams. Each LLM answered these questions three times. The study excluded questions involving figures or deemed inappropriate. The primary outcome was the accuracy rate of each LLM, with secondary analysis focusing on six subgroups: (1) basic physiology necessary for general anesthesia, (2) local anesthesia, (3) sedation and general anesthesia, (4) diseases and patient management methods that pose challenges in systemic management, (5) pain management, and (6) shock and cardiopulmonary resuscitation. Statistical analysis was performed using one-way ANOVA with Dunnett's multiple comparisons, with a significance threshold of p<0.05. Results: ChatGPT-4 achieved a correct answer rate of 51.2% (95% CI: 42.78-60.56, p=0.003) and Claude 3 Opus 47.4% (95% CI: 43.45-51.44, p<0.001), both significantly higher than Gemini 1.0, which had a rate of 30.3% (95% CI: 26.53-34.14). In subgroup analyses, ChatGPT-4 and Claude 3 Opus demonstrated superior performance in basic physiology, sedation and general anesthesia, and systemic management challenges compared to Gemini 1.0. Notably, ChatGPT-4 excelled in questions related to systemic management (62.5%) and Claude 3 Opus in pain management (61.53%).
Conclusions: ChatGPT-4 and Claude 3 Opus exhibit potential for use in dental anesthesiology, outperforming Gemini 1.0. However, their current accuracy rates are insufficient for reliable clinical use. These findings have significant implications for dental anesthesiology practice and education, including educational support, clinical decision support, and continuing education. To enhance LLM utility in dental anesthesiology, it is crucial to increase the availability of high-quality information online and refine prompt engineering to better guide LLM responses.
Background/Objectives: Multimodal large language models (LLMs) are increasingly used in radiology. However, their ability to recognize fundamental imaging features, including modality, anatomical region, imaging plane, contrast-enhancement status, and particularly specific magnetic resonance imaging (MRI) sequences, remains underexplored. This study aims to evaluate and compare the performance of three advanced multimodal LLMs (ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro) in classifying brain MRI sequences. Methods: A total of 130 brain MRI images from adult patients without pathological findings were used, representing 13 standard MRI series. Models were tested using zero-shot prompts for identifying modality, anatomical region, imaging plane, contrast-enhancement status, and MRI sequence. Accuracy was calculated, and differences among models were analyzed using Cochran’s Q test and McNemar test with Bonferroni correction. Results: ChatGPT-4o and Gemini 2.5 Pro achieved 100% accuracy in identifying the imaging plane and 98.46% in identifying contrast-enhancement status. MRI sequence classification accuracy was 97.7% for ChatGPT-4o, 93.1% for Gemini 2.5 Pro, and 73.1% for Claude 4 Opus (p < 0.001). The most frequent misclassifications involved fluid-attenuated inversion recovery (FLAIR) sequences, often misclassified as T1-weighted or diffusion-weighted sequences. Claude 4 Opus showed lower accuracy in susceptibility-weighted imaging (SWI) and apparent diffusion coefficient (ADC) sequences. Gemini 2.5 Pro exhibited occasional hallucinations, including irrelevant clinical details such as “hypoglycemia” and “Susac syndrome.” Conclusions: Multimodal LLMs demonstrate high accuracy in basic MRI recognition tasks but vary significantly in specific sequence classification tasks. Hallucinations emphasize caution in clinical use, underlining the need for validation, transparency, and expert oversight.
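Mozilla introduced Anthropic's AI "Mythos" into the verification of Firefox 150 and found 271 vulnerabilities, though none were of a kind that humans could not have found. This shows that AI, through dense exploration based on reading code, which is difficult for conventional fuzzing, makes large numbers of unaddressed bugs visible, and that defenders' capacity to fix them and a redesign of operations matter. The falling cost of AI vulnerability detection has the potential to erode attackers' advantage and raise defenders' staying power.
Anthropic's Claude Opus 4.7 keeps the same per-token price as Opus 4.6, but because it introduces a new tokenizer, the same text may consume up to about 1.35 times as many tokens. For workloads centered on English or code in particular, this leads to a real cost increase and faster consumption of rate limits, so developers need to compare token counts on their own prompts before migrating.
Anthropic has made the generative AI model "Claude Opus 4.7" generally available. It is the top model among those generally available, with improved performance in coding, knowledge work, and GUI understanding, but the limited-release "Claude Mythos Preview" shows higher results on related evaluations, so it does not advance the capability frontier. Migration requires accounting for changes in API behavior and recalculating token accounting.
NVIDIA announced "NVIDIA Ising," a family of open-source AI models, to use AI to solve error correction, the biggest challenge for quantum computers. The models have AI handle automatic calibration of quantum processors and real-time error correction, aiming to accelerate the practical use of quantum computing by dramatically reducing error rates.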
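The comparison described above comes down to simple arithmetic: with the per-token price unchanged, the cost ratio equals the token-count ratio. The following is a minimal sketch of that calculation, not an official tool. The function names are hypothetical, and the token counts are assumed to be measured separately on your own prompts under each model; the sample numbers in the usage note are illustrative only.

```python
from typing import Iterable, Tuple


def effective_cost_ratio(old_tokens: int, new_tokens: int) -> float:
    """Cost multiplier when the per-token price is unchanged.

    old_tokens / new_tokens are token counts for the same text under the
    old and new tokenizer (assumed to be measured by the caller).
    """
    if old_tokens <= 0:
        raise ValueError("old_tokens must be positive")
    return new_tokens / old_tokens


def workload_cost_ratio(samples: Iterable[Tuple[int, int]]) -> float:
    """Aggregate multiplier over many prompts.

    samples: (old_tokens, new_tokens) pairs, one per prompt. Totals are
    summed first so that long prompts weigh more than short ones.
    """
    pairs = list(samples)
    total_old = sum(old for old, _ in pairs)
    total_new = sum(new for _, new in pairs)
    return effective_cost_ratio(total_old, total_new)


def projected_monthly_cost(current_cost: float, ratio: float) -> float:
    """Scale a current bill by the measured token-count ratio."""
    return current_cost * ratio
```

For example, `workload_cost_ratio([(1000, 1350), (2000, 2200)])` divides 3,550 by 3,000 tokens, a multiplier of roughly 1.18. The same multiplier applies to how quickly token-based rate limits are consumed.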
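AI agents are expected to be able to dynamically draw on expertise for specific tasks by using structured text files called "skills." However, UC Santa Barbara, MIT CSAIL, MIT-IBM Wat […]
AI finds a 27-year-old OS bug and digs up a vulnerability that was buried 16 years ago. Claude Mythos Preview, which Anthropic announced on April 7, 2026, is, precisely because its cybersecurity capabilities are so far ahead, […]
On March 13, 2026, Anthropic announced general availability (GA) of a context window of up to 1 million (1M) tokens at standard pricing for Claude Opus 4.6 and Sonnet 4.6. Beyond 200K tokens […]
The era in which AI chatbot responses fill the screen as long blocks of text is quietly coming to an end. On March 12, 2026, Anthropic, for its AI assistant Claude, within conversation responses, charts, diag […]
Do you remember the world in February 2020? News had begun to circulate that a strange virus was spreading in Wuhan, China, but most people were still enjoying meals at restaurants, planning business trips, and not doubting everyday life. "Toilet paper […]
In February 2026, a new page was written in the history of artificial intelligence (AI). In an experiment using Anthropic's latest model, "Claude Opus 4.6," 16 AI agents coordinated with one another and, from scratch, a roughly 100,000-line C com […]
At 10:00 a.m. on February 5, 2026 (US Pacific Standard Time), a strange 15-minute battle that will go down in AI history unfolded in Silicon Valley. Initially, OpenAI and Anthropic had their latest engineering-focused AI models slated for the same time […]
In the evolution of AI, from late 2025 into early 2026, the biggest concern shifted from mere "answer accuracy" to "the ability to complete complex work." On February 5, 2026, Anthropic's newly announced flagship model "Cl […]
On September 30, AI company Anthropic announced its latest model, Claude Sonnet 4.5. The company states outright that it is "the best coding model in the world" and positions it as the strongest model for building complex agents and operating computers […]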