Nudge Users to Catch Generative AI Errors

Using large language models to generate text can save time but often results in unpredictable errors. Prompting users to review outputs can improve their quality.

OpenAI’s ChatGPT has generated excitement since its release in November 2022, but it has also created new challenges for managers. On the one hand, business leaders understand that they cannot afford to overlook the potential of generative AI tools built on large language models (LLMs). On the other hand, apprehensions about issues such as bias, inaccuracy, and security breaches loom large, limiting trust in these models.

In such an environment, responsible approaches to using LLMs are critical to the safe adoption of generative AI. Consensus is building that humans must remain in the loop (a scenario in which human oversight and intervention place the algorithm in the role of a learning apprentice) and that responsible AI principles must be codified. Without a proper understanding of AI models and their limitations, users can place too much trust in AI-generated content. Accessible, user-friendly interfaces like ChatGPT, in particular, can present errors with confidence while offering users little transparency, few warnings, and no communication of their own limitations. A more effective approach would help users identify the parts of AI-generated content that require affirmative human choice, fact-checking, and scrutiny.

In a recent field experiment, we explored a way to assist users in this endeavor. We provided global business research professionals at Accenture with a tool developed at Accenture’s Dock innovation center, designed to highlight potential errors and omissions in LLM content. We then measured the extent to which adding this layer of friction had the intended effect of reducing the likelihood of uncritical adoption of LLM content and bolstering the benefits of having humans in the loop.

The findings revealed that consciously adding some friction to the process of reviewing LLM-generated content can lead to increased accuracy — without significantly increasing the time required to complete the task. This has implications for how companies can deploy generative AI applications more responsibly.

Experiment With Friction

Friction has a bad name in the realm of digital customer experience, where companies strive to eliminate any roadblocks to satisfying user needs. But recent research suggests that organizations should embrace beneficial friction in AI systems to improve human decision-making. Our experiment set out to explore this hypothesis in the field by measuring the efficiency and accuracy trade-offs of adding targeted friction, or cognitive and procedural speed bumps, to LLM outputs in the form of error highlighting. We tested whether intentionally embedding structural resistance to the uninterrupted, automatic acceptance of AI output would slow the user process and make potential errors more likely to be noticed. We expected this to encourage participants to engage in what behavioral economics calls System 2 thinking, a more conscious and deliberative mode of cognitive processing than the more intuitive System 1 thinking, akin to accuracy nudges in misinformation research.

The study, a collaborative effort between MIT and Accenture, aimed to explore the integration of an LLM into a task familiar to business research professionals. The objective was to complete and submit two executive summaries of company profiles (Task 1 and Task 2) within a 70-hour time frame, seeking out and referencing any available sources to simulate real work conditions. The research participants were given text output from ChatGPT, along with the corresponding prompts, and were told that they could use as much or as little of the content as they saw fit.

Passages from the provided ChatGPT output and prompts were highlighted in different colors. Participants were informed that the highlighting features were part of a hypothetical tool Accenture could potentially develop and that the highlights conveyed different meanings depending on the color:

  • Text highlighted in purple matched terms used in the prompt as well as terms in internal databases and publicly available information sources.
  • Text highlighted in orange indicated potentially untrue statements that should be considered for removal or replacement.
  • Text that was in the prompt but omitted from the output was listed below the generated output and highlighted in blue.
  • Text that did not fall into any of these categories was left unhighlighted.


Ideally, this hypothetical tool would combine natural language processing (NLP) techniques and an AI model to query all outputs against a predefined source of truth to highlight potential errors or omissions, but for the purposes of this experiment, the highlighting was done using a combination of algorithmic and human inputs. In addition, we purposely baked in some attention-check errors (nonhighlighted) to measure the circumstances under which adding friction in LLM use led to greater error detection (and improved accuracy) by participants.
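To make the mechanism concrete, the sketch below shows, in Python, how term-level comparison against the prompt and a trusted source could drive the three highlight categories. It is only a minimal illustration, not the tool used in the experiment: the function names (tokenize, categorize_output), the bag-of-terms matching, and the sample texts are assumptions made for clarity.

```
# Illustrative sketch only: a highly simplified stand-in for the highlighting
# logic described above. The actual tool combined algorithmic and human inputs;
# everything here (function names, term matching, sample texts) is hypothetical.
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens, ignoring very short words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def categorize_output(prompt: str, output: str, trusted_facts: str) -> dict[str, set[str]]:
    """Sort output terms into rough highlight buckets.

    - "matched" (purple): terms that also appear in the prompt or trusted sources
    - "review" (orange): terms found in neither, i.e., candidates for fact-checking
    - "omitted" (blue): prompt terms that never made it into the output
    """
    prompt_terms = tokenize(prompt)
    output_terms = tokenize(output)
    trusted_terms = tokenize(trusted_facts)

    matched = output_terms & (prompt_terms | trusted_terms)
    review = output_terms - matched
    omitted = prompt_terms - output_terms
    return {"matched": matched, "review": review, "omitted": omitted}

if __name__ == "__main__":
    prompt = "Summarize Acme Corp 2022 revenue growth and the new cloud division."
    output = "Acme Corp grew revenue in 2022 and acquired a robotics startup."
    trusted = "Acme Corp reported revenue growth in fiscal 2022."
    for bucket, terms in categorize_output(prompt, output, trusted).items():
        print(bucket, sorted(terms))
```

A production version would operate on phrases and whole claims rather than isolated terms and would verify statements against curated databases, but the principle is the same: surface what matches, what looks doubtful, and what is missing.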

Participants were randomly assigned to one of three experimental conditions, with varying levels of cognitive speed bumps in the form of highlighting:

  • In the full friction condition, the LLM-generated content contained three kinds of highlighting based on the prompt that indicated that information was likely correct, incorrect, or missing from the output.
  • In the medium friction condition, the LLM-generated content contained two kinds of highlighting based on the prompt that indicated likely errors and omissions of information that should have been in the output.
  • In the no friction control condition, the LLM-generated content contained no highlighting at all, as per the current generative AI user experience.

Our findings revealed that introducing friction that nudges users to scrutinize LLM-generated text more carefully can help them catch inaccuracies and omissions. Participants in the no friction control condition missed more errors than those in either of the conditions with highlighting (31% more in Task 1 and 10% more in Task 2). Moreover, the proportion of omissions detected was 17% in the control condition, compared with 48% in the full friction condition and 54% in the medium friction condition.


As anticipated, these improvements did come with a trade-off: Participants in the full friction group saw a statistically significant increase in the time required to complete the tasks versus the control group (an average of 43% more in Task 1 and 61% more in Task 2). In the medium friction condition, however, the difference in average completion time versus the control was not statistically significant. Considering that each task typically took one to two hours on average without the assistance of generative AI, this trade-off was considered acceptable. The medium friction condition thus demonstrated a way to balance accuracy against efficiency.

Three Behavioral Insights

The results of our field experiment point to actions organizations can take to help employees more effectively incorporate generative AI tools into their work and be more likely to recognize potential errors and biases.

Ensure thoughtfulness in crafting the prompt — a touch point for beneficial friction — given users’ tendency toward cognitive anchoring on generative AI output. Participants’ final submissions were lexically very similar to the LLM-generated content (60% to 80% identical content, as measured by NLP similarity scores). This suggests that the participants anchored on that output, even when they were asked to consider it as merely an input to their own writing. This underscores the importance of being thoughtful about the prompt provided to the LLM, since its output can set the trajectory for the final version of the content. Recent research suggests that anchoring may prove beneficial under some circumstances when generative AI content is perceived as high in quality and can play a compensatory role for an error-prone writer. But, given our findings of high similarity between the LLM-generated text and the final submissions from human participants, it could also lead a user down the wrong path.
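For readers curious about how such lexical similarity can be quantified, the sketch below computes a simple bag-of-words cosine similarity between an LLM draft and a final submission. The study’s exact NLP similarity metric is not specified here, so this particular measure, the function names, and the sample texts are assumptions used purely for illustration.

```
# Illustrative sketch only: a basic bag-of-words cosine similarity, offered as
# one plausible way to measure lexical overlap; the study's actual metric may
# differ. Sample texts and names are hypothetical.
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Word-count vector for a text (lowercased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts' word-count vectors, from 0 to 1."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

llm_draft = "Acme Corp is a global logistics firm that grew revenue 12% in 2022."
submission = "Acme Corp, a global logistics firm, grew its revenue by 12% in 2022."
print(f"Lexical similarity: {cosine_similarity(llm_draft, submission):.2f}")
```

A score near 1.0 indicates that a submission is close to a verbatim copy of the draft; anchoring shows up as persistently high scores even when reviewers are asked to treat the draft as raw input.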

Recognize that confidence is a virtue but overconfidence is a vice. Highlighting errors did indeed draw participants’ attention and improved accuracy via error correction. Yet participants across the three conditions self-reported virtually no difference in response to the follow-up survey item “I am more aware of the types of errors to look for when using GenAI.” This presents a reason to be cautious: Users may overestimate their ability to identify AI-generated errors. A tool that adds friction by making potential errors more conspicuous could help users calibrate their trust in generative AI content by mitigating overconfidence.

Additionally, our findings suggest that highlighting errors had no significant impact on participants’ self-reported trust in LLM tools or their willingness to use them.

Experiment, experiment, experiment. Before AI tools and models are deployed, it is imperative to test how humans interact with them and how they impact accuracy, speed, and trust. As indicated above, we observed a difference in self-reported attitudes and actual error detection. We urge organizations to adopt experiments as a means of understanding how best to elevate the role of employees in human-in-the-loop systems and to measure the impact on their understanding, behaviors, and biases.


The ease of use and broad availability of LLMs have enabled their rapid spread through many organizations, even as issues with their accuracy remain unresolved. We must seek ways to enhance humans’ ability to improve accuracy and efficiency when working with AI-generated outputs. Our study suggests that humans in the loop can play an important interventional role in AI-enabled systems and that beneficial friction can nudge users to exercise their responsibility for the quality of their organization’s content.
