Springer Nature

Evaluating the performance of five large language models in answering Delphi consensus questions relating to patellar instability and medial patellofemoral ligament reconstruction

Posted on 2025-11-04 - 04:38
Abstract

Purpose
Artificial intelligence (AI) has grown rapidly in popularity over the past several years, with large language models (LLMs) offering the possibility of transforming the way healthcare information is shared with patients. However, to prevent the spread of misinformation, assessing the accuracy of answers from these LLMs is essential. This study aimed to assess the accuracy of five freely accessible chatbots by evaluating their responses to questions about patellofemoral instability (PFI). The secondary objective was to compare the chatbots with one another to determine which LLM offers the most accurate set of responses.

Methods
Ten questions were selected from a previously published international Delphi consensus study pertaining to patellar instability and posed to ChatGPT4o, Perplexity AI, Bing CoPilot, Claude2, and Google Gemini. Responses were assessed for accuracy using the validated Mika score by eight fellowship-trained orthopedic sports medicine surgeons. Median responses amongst the eight reviewers for each question were compared using the Kruskal-Wallis and Dunn's post-hoc tests. Percentages of each Mika score distribution were compared using Pearson's chi-square test. P-values less than or equal to 0.05 were considered significant. Gwet's AC2 coefficient, corrected for chance and employing quadratic weights, was calculated to assess inter-rater agreement.

Results
ChatGPT4o and Claude2 had the highest percentage of reviews (38/80, 47.5%) considered an "excellent response not requiring clarification", or a Mika score of 1. Google Gemini had the highest percentage of reviews (17/80, 21.3%) considered "unsatisfactory requiring substantial clarification", or a Mika score of 4 (p < 0.001). The median (interquartile range, IQR) Mika scores were 2 (1) for ChatGPT4o and Perplexity AI, 2 (2) for Bing CoPilot and Claude2, and 3 (2) for Google Gemini. Median responses were not significantly different between ChatGPT4o, Perplexity AI, Bing CoPilot, and Claude2; however, all four statistically outperformed Google Gemini (p < 0.05). Inter-rater agreement was classified as moderate (0.40 ≤ AC2 < 0.60) for ChatGPT, Perplexity AI, Bing CoPilot, and Claude2, while there was no agreement for Google Gemini (AC2 < 0).

Conclusion
Current free-access LLMs (ChatGPT4o, Perplexity AI, Bing CoPilot, and Claude2) predominantly provide satisfactory responses requiring minimal clarification to standardized questions relating to patellar instability. Google Gemini underperformed in accuracy relative to the other four LLMs, with most of its answers requiring moderate clarification. Furthermore, inter-rater agreement was moderate for all LLMs apart from Google Gemini, which had no agreement. These findings support the use of existing LLMs as an adjunct to physicians and surgeons in providing patients with information pertaining to patellar instability.

Level of evidence: V
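As a rough illustration of the comparison described in the Methods, the sketch below simulates the Kruskal-Wallis and Dunn's post-hoc workflow in Python. It is not the authors' code: the Mika scores are random placeholders, and the third-party scikit_posthocs package is an assumed convenience for Dunn's test.

    # Minimal sketch (not the authors' analysis) of comparing per-question median
    # Mika scores across the five LLMs. All data below are synthetic placeholders.
    import numpy as np
    from scipy import stats
    import scikit_posthocs as sp  # assumed third-party package providing Dunn's post-hoc test

    models = ["ChatGPT4o", "Perplexity AI", "Bing CoPilot", "Claude2", "Google Gemini"]
    rng = np.random.default_rng(42)

    # Hypothetical 8 reviewers x 10 questions matrix of Mika scores (1-4) per model
    scores = {m: rng.integers(1, 5, size=(8, 10)) for m in models}

    # Median score per question across the eight reviewers
    medians = {m: np.median(s, axis=0) for m, s in scores.items()}

    # Omnibus Kruskal-Wallis test across the five models
    h_stat, p_value = stats.kruskal(*medians.values())
    print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")

    # Dunn's post-hoc pairwise comparisons (Bonferroni-adjusted) if the omnibus test is significant
    if p_value <= 0.05:
        print(sp.posthoc_dunn(list(medians.values()), p_adjust="bonferroni"))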


    BMC Musculoskeletal Disorders

    AUTHORS (16)

    • Prushoth Vivekanantha
    • Dan Cohen
    • David Slawaska-Eng
    • Kanto Nagai
    • Magdalena Tarchala
    • Bogdan Matache
    • Laurie Hiemstra
    • Robert Longstaffe
    • Bryson Lesniak
    • Amit Meena
    • Sachin Tapasvi
    • Petri Sillanpää
    • Patrick Grzela
    • Daniel Lamanna
    • Kristian Samuelsson
    • Darren de SA