How an AI tool for fighting hospital deaths actually worked in the real world

In November of 2018, a new deep-learning tool went online in the emergency department of the Duke University Health System. Called Sepsis Watch, it was designed to help doctors spot early signs of one of the leading causes of hospital deaths globally.

Sepsis occurs when an infection triggers full-body inflammation and ultimately causes organs to shut down. It can be treated if diagnosed early enough, but that’s a notoriously hard task because its symptoms are easily mistaken for signs of something else.

Sepsis Watch promised to change that. The product of three and a half years of development (which included digitizing health records, analyzing 32 million data points, and designing a simple interface in the form of an iPad app), it scores patients on an hourly basis for their likelihood of developing the condition. It then flags those who are medium or high risk and those who already meet the criteria. Once a doctor confirms the diagnosis, the patients get immediate attention.

In the two years since the tool’s introduction, anecdotal evidence from Duke Health’s hospital managers and clinicians has suggested that Sepsis Watch really works. It has dramatically reduced sepsis-induced patient deaths and is now part of a federally registered clinical trial expected to share its results in 2021.

At first glance, this is an example of a major technical victory. Through careful development and testing, an AI model successfully augmented doctors’ ability to diagnose disease. But a new report from the Data & Society research institute says this is only half the story. The other half is the amount of skilled social labor that the clinicians leading the project needed to perform in order to integrate the tool into their daily workflows. This included not only designing new communication protocols and creating new training materials but also navigating workplace politics and power dynamics.

The case study is an honest reflection of what it really takes for AI tools to succeed in the real world. “It was really complex,” says coauthor Madeleine Clare Elish, a cultural anthropologist who examines the impact of AI.

Repairing innovation

Innovation is supposed to be disruptive. It shakes up old ways of doing things to achieve better outcomes. But rarely in conversations about technological disruption is there an acknowledgment that disruption is also a form of “breakage.” Existing protocols turn obsolete; social hierarchies get scrambled. Making the innovations work within existing systems requires what Elish and her coauthor Elizabeth Anne Watkins call “repair work.”

During the researchers’ two-year study of Sepsis Watch at Duke Health, they documented numerous examples of this disruption and repair. One major issue was the way the tool challenged the medical world’s deeply ingrained power dynamics between doctors and nurses.

In the early stages of tool design, it became clear that rapid response team (RRT) nurses would need to be the primary users. Though attending physicians are typically in charge of evaluating patients and making sepsis diagnoses, they don’t have time to continuously monitor another app on top of their existing duties in the emergency department. In contrast, the main responsibility of an RRT nurse is to continuously monitor patient well-being and provide extra assistance where needed. Checking the Sepsis Watch app fitted naturally into their workflow.

But here came the challenge. Once the app flagged a patient as high risk, a nurse would need to call one of the attending physicians (known in medical speak as “ED attendings”). Not only did these nurses and attendings often have no prior relationship because they spent their days in entirely different sections of the hospital, but the protocol represented a complete reversal of the typical chain of command in any hospital. “Are you kidding me?” one nurse recalled thinking after learning how things would work. “We are going to call ED attendings?”

But this was indeed the best solution. So the project team went about repairing the “disruption” in various big and small ways. The head nurses hosted informal pizza parties to build excitement about Sepsis Watch, and trust in it, among their fellow nurses. They also developed communication tactics to smooth over their calls with the attendings. For example, they decided to make only one call per day to discuss multiple high-risk patients at once, timed for when the physicians were least busy.

On top of that, the project leads began regularly reporting the impact of Sepsis Watch to the clinical leadership. The project team discovered that not every hospital staffer believed sepsis-induced death was a problem at Duke Health. Doctors, especially, who didn’t have a bird’s-eye view of the hospital’s statistics, were far more occupied with the emergencies they were dealing with day to day, like broken bones and severe mental illness. As a result, some found Sepsis Watch a nuisance. But for the clinical leadership, sepsis was a huge priority, and the more they saw Sepsis Watch working, the more they helped grease the gears of the operation.

Changing norms

Elish identifies two main factors that ultimately helped Sepsis Watch succeed. First, the tool was adapted for a hyper-local, hyper-specific context: it was developed for the emergency department at Duke Health and nowhere else. “This really bespoke development was key to the success,” she says. This flies in the face of typical AI norms.

Second, throughout the development process, the team regularly sought feedback from nurses, doctors, and other staff up and down the hospital hierarchy. This not only made the tool more user friendly but also cultivated a small group of committed staff members to help champion its success. It also made a difference that the project was led by Duke Health’s own clinicians, says Elish, rather than by technologists who had parachuted in from a software company. “If you don’t have an explainable algorithm,” she says, “you need to build trust in other ways.”

These lessons are very familiar to Marzyeh Ghassemi, an incoming assistant professor at MIT who studies machine-learning applications for health care. “All machine-learning systems that are ever intended to be evaluated on or used by humans must have socio-technical constraints at front of mind,” she says. Especially in clinical settings, which are run by human decision makers and involve caring for humans at their most vulnerable, “the constraints that people need to be aware of are really human and logistical constraints,” she adds.

Elish hopes her case study of Sepsis Watch convinces researchers to rethink how to approach medical AI research and AI development at large. So much of the work being done right now focuses on “what AI might be or could do in theory,” she says. “There’s too little information about what actually happens on the ground.” But for AI to live up to its promise, people need to think as much about social integration as technical development.

Her work also raises serious questions. “Responsible AI must require attention to local and specific context,” she says. “My reading and training teaches me you can’t just develop one thing in one place and then roll it out somewhere else.”

“So the challenge is actually to figure how we keep that local specificity while trying to work at scale,” she adds. That’s the next frontier for AI research.