


Grant support

We thank two anonymous reviewers for their feedback.

Analysis of institutional authors

Dentella, Vittoria (Corresponding Author)


December 3, 2024

Testing AI on language comprehension tasks reveals insensitivity to underlying meaning

Published in: Scientific Reports, 14(1): 28083, 2024-11-14. DOI: 10.1038/s41598-024-79531-8

Authors: Dentella, Vittoria; Guenther, Fritz; Murphy, Elliot; Marcus, Gary; Leivada, Evelina

Affiliations

Autonomous Univ Barcelona, Barcelona, Spain - Author
Humboldt Univ, Berlin, Germany - Author
Inst Catalana Recerca & Estudis Avancats, Barcelona, Spain - Author
NYU, New York, NY USA - Author
Univ Pavia, Pavia, Italy - Author
Univ Rovira & Virgili, Tarragona, Spain - Author
UTHealth, Houston, TX USA - Author

Abstract

Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n = 26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors in language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.

Keywords

Artificial intelligence; Comprehension; Female; Humans; Language; Linguistics; Semantics

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the journal Scientific Reports, which, given its trajectory and the strong impact it has achieved in recent years according to the agency WoS (JCR), has become a reference in its field. Indicators for the year of publication (2024) have not yet been calculated, but in 2023 the journal ranked 25/135, placing it in Q1 (first quartile) of the Multidisciplinary Sciences category.

Regardless of the expected impact determined by the dissemination channel, it is important to highlight the actual observed impact of the contribution itself.

According to the various indexing agencies, the citations accumulated by this publication as of 2025-08-07 are:

  • Scopus: 1

Impact and social visibility

From the perspective of social influence and adoption, based on the mention and interaction metrics provided by agencies specializing in so-called "Alternative or Social Metrics," we can highlight as of 2025-08-07:

  • Academic use, as evidenced by the Altmetric indicator of saves in the personal bibliographic manager Mendeley: a total of 24.
  • Use of this contribution in bookmarks, code forks, favorites lists for recurrent reading, and general views suggests that others are building on the publication in their current work; this can be a notable leading indicator of future, more formal academic citations. This claim is supported by the "Capture" indicator, which yields a total of 43 (PlumX).

With a more dissemination-oriented intent and targeting more general audiences, we can observe other more global scores such as:

  • The Total Score from Altmetric: 72.5.
  • The number of mentions on the social network Facebook: 1 (Altmetric).
  • The number of mentions on the social network X (formerly Twitter): 62 (Altmetric).
  • The number of mentions in news outlets: 3 (Altmetric).

It is essential to present evidence supporting full alignment with institutional principles and guidelines on Open Science and the Conservation and Dissemination of Intellectual Heritage. A clear example of this is:

  • The work has been submitted to a journal whose editorial policy allows Open Access publication.
  • Assignment of a Handle/URN as an identifier within the deposit in the Institutional Repository: http://hdl.handle.net/20.500.11797/imarina9393177

Leadership analysis of institutional authors

This work has been carried out with international collaboration, specifically with researchers from: Germany; Italy; United States of America.

There is a significant leadership presence, as some of the institution's authors appear as the first or last author, detailed as follows: First Author (Dentella, Vittoria).

The author responsible for correspondence has been Dentella, Vittoria.