Tech Product

Claude Opus 4.6

別名: Claude Opus 4.6

Overview

Anthropicの最新フラッグシップAIモデル。100万トークンのコンテキストウィンドウを搭載し、コーディング、財務分析、研究開発などの専門領域における自律的なエージェント能力が大幅に強化されている。MRCR v2やTerminal-Benchなどのベンチマークで高いスコアを記録し、ビジネス現場での実質的な業務完遂能力に特化している。

Research Papers

5 件
  • Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases

    Ryo Kurokawa, Yuji Ohizumi, Jun Kanzawa, M. Kurokawa, Yuki Sonoda, Yuta Nakamura, T. Kiguchi, W. Gonoi, Osamu Abe

    2024 79 件引用 Semantic Scholar

    The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performances in response to the Radiology’s Diagnosis Please quiz questions. In this study, the AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology’s “Diagnosis Please” cases, which included cases 1 to 322, published from 1998 to 2023. The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone. (2) Condition 2: submitter-provided clinical history and imaging findings (text). (3) Condition 3: clinical history (text) and key images (PNG file). We applied McNemar’s test to evaluate differences in the correct response rates for the overall accuracy under Conditions 1, 2, and 3 for each model and between the models. The correct diagnosis rates were 58/322 (18.0%) and 69/322 (21.4%), 201/322 (62.4%) and 209/322 (64.9%), and 80/322 (24.8%) and 97/322 (30.1%) for Conditions 1, 2, and 3 for Claude 3 Opus and Claude 3.5 Sonnet, respectively. The models provided the correct answer as a differential diagnosis in up to 26/322 (8.1%) for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences were observed in the correct response rates among all combinations of Conditions 1, 2, and 3 for each model (p < 0.01). Claude 3.5 Sonnet outperformed in all conditions, but a statistically significant difference was observed only in the comparison for Condition 3 (30.1% vs. 24.8%, p = 0.028). Two AI models demonstrated a significantly improved diagnostic performance when inputting both key images and clinical history. The models’ ability to identify important differential diagnoses under these conditions was also confirmed.

  • Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases

    Yuki Sonoda, Ryo Kurokawa, Yuta Nakamura, Jun Kanzawa, M. Kurokawa, Yuji Ohizumi, W. Gonoi, O. Abe

    2024 75 件引用 Semantic Scholar

    Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology Diagnosis Please Cases, a monthly diagnostic quiz series for radiology experts. Clinical history and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology Diagnosis Please cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, using their respective application programming interfaces. A comparative analysis of diagnostic performance among these three LLMs was conducted using Cochrane’s Q and post hoc McNemar’s tests. The respective diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for primary diagnosis were 41.0%, 54.0%, and 33.9%, which further improved to 49.4%, 62.0%, and 41.0%, when considering the accuracy of any of the top three differential diagnoses. Significant differences in the diagnostic performance were observed among all pairs of models. Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate evaluations and worded descriptions of imaging findings.

  • Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra

    Darioush Kevian, U. Syed, Xing-ming Guo, Aaron J. Havens, G. Dullerud, Peter J. Seiler, Lianhui Qin, Bin Hu

    2024 61 件引用 Semantic Scholar

    In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.

  • Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam

    Misaki Fujimoto, Hidetaka Kuroda, Tomomi Katayama, Atsuki Yamaguchi, Norika Katagiri, Keita Kagawa, Shota Tsukimoto, Akito Nakano, Uno Imaizumi, Aiji Sato(Boku), Naotaka Kishimoto, Tomoki Itamiya, Kanta Kido, T. Sanuki

    2024 28 件引用 Semantic Scholar

    Purpose Large language models (LLMs) are increasingly employed across various fields, including medicine and dentistry. In the field of dental anesthesiology, LLM is expected to enhance the efficiency of information gathering, patient outcomes, and education. This study evaluates the performance of different LLMs in answering questions from the Japanese Dental Society of Anesthesiology Board Certification Examination (JDSABCE) to determine their utility in dental anesthesiology. Methods The study assessed three LLMs, ChatGPT-4 (OpenAI, San Francisco, California, United States), Gemini 1.0 (Google, Mountain View, California, United States), and Claude 3 Opus (Anthropic, San Francisco, California, United States), using multiple-choice questions from the 2020 to 2022 JDSABCE exams. Each LLM answered these questions three times. The study excluded questions involving figures or deemed inappropriate. The primary outcome was the accuracy rate of each LLM, with secondary analysis focusing on six subgroups: (1) basic physiology necessary for general anesthesia, (2) local anesthesia, (3) sedation and general anesthesia, (4) diseases and patient management methods that pose challenges in systemic management, (5) pain management, and (6) shock and cardiopulmonary resuscitation. Statistical analysis was performed using one-way ANOVA with Dunnett's multiple comparisons, with a significance threshold of p<0.05. Results ChatGPT-4 achieved a correct answer rate of 51.2% (95% CI: 42.78-60.56, p=0.003) and Claude 3 Opus 47.4% (95% CI: 43.45-51.44, p<0.001), both significantly higher than Gemini 1.0, which had a rate of 30.3% (95% CI: 26.53-34.14). In subgroup analyses, ChatGPT-4 and Claude 3 Opus demonstrated superior performance in basic physiology, sedation and general anesthesia, and systemic management challenges compared to Gemini 1.0. Notably, ChatGPT-4 excelled in questions related to systemic management (62.5%) and Claude 3 Opus in pain management (61.53%). Conclusions ChatGPT-4 and Claude 3 Opus exhibit potential for use in dental anesthesiology, outperforming Gemini 1.0. However, their current accuracy rates are insufficient for reliable clinical use. These findings have significant implications for dental anesthesiology practice and education, including educational support, clinical decision support, and continuing education. To enhance LLM utility in dental anesthesiology, it is crucial to increase the availability of high-quality information online and refine prompt engineering to better guide LLM responses.

  • Performance of Large Language Models in Recognizing Brain MRI Sequences: A Comparative Analysis of ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro

    Ali Şalbaş, R. Buyuktoka

    2025 14 件引用 Semantic Scholar

    Background/Objectives: Multimodal large language models (LLMs) are increasingly used in radiology. However, their ability to recognize fundamental imaging features, including modality, anatomical region, imaging plane, contrast-enhancement status, and particularly specific magnetic resonance imaging (MRI) sequences, remains underexplored. This study aims to evaluate and compare the performance of three advanced multimodal LLMs (ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro) in classifying brain MRI sequences. Methods: A total of 130 brain MRI images from adult patients without pathological findings were used, representing 13 standard MRI series. Models were tested using zero-shot prompts for identifying modality, anatomical region, imaging plane, contrast-enhancement status, and MRI sequence. Accuracy was calculated, and differences among models were analyzed using Cochran’s Q test and McNemar test with Bonferroni correction. Results: ChatGPT-4o and Gemini 2.5 Pro achieved 100% accuracy in identifying the imaging plane and 98.46% in identifying contrast-enhancement status. MRI sequence classification accuracy was 97.7% for ChatGPT-4o, 93.1% for Gemini 2.5 Pro, and 73.1% for Claude 4 Opus (p < 0.001). The most frequent misclassifications involved fluid-attenuated inversion recovery (FLAIR) sequences, often misclassified as T1-weighted or diffusion-weighted sequences. Claude 4 Opus showed lower accuracy in susceptibility-weighted imaging (SWI) and apparent diffusion coefficient (ADC) sequences. Gemini 2.5 Pro exhibited occasional hallucinations, including irrelevant clinical details such as “hypoglycemia” and “Susac syndrome.” Conclusions: Multimodal LLMs demonstrate high accuracy in basic MRI recognition tasks but vary significantly in specific sequence classification tasks. Hallucinations emphasize caution in clinical use, underlining the need for validation, transparency, and expert oversight.

Mentioned Articles

14 件

External Mentions

2 件