Posted on 3 March 2022
In recent years, excitement has grown around the use of machine learning (ML) algorithms in healthcare. The strength of ML in this context is that, when fed with enough training data, it can detect subtle relationships within that data that humans aren’t even aware of. The potential of ML in healthcare is enormous: soon, we are promised, ML will be routinely used to detect cancer from a blood sample, to detect Alzheimer’s disease from an MRI scan, or to predict who will develop severe COVID-19 symptoms with greater accuracy than any human-designed test.
Some ML algorithms are already seeing use in real-world settings – to predict the probability that a patient will develop sepsis (a potentially deadly consequence of infection) for example. But an investigation by STAT and the Massachusetts Institute of Technology (MIT) suggests that such algorithms may have a fatal flaw: their accuracy can plummet over time.
STAT and MIT conducted a month-long experiment in which they traced the performance of three ML algorithms they built, incorporating the most common factors used in proprietary ML products sold to hospitals. Such algorithms have been shown to perform well initially, but the investigators wanted to see whether they remained reliable over longer periods of time. To this end, the algorithms were tested on a database of 40 000 patients admitted to the ICU at Beth Israel Deaconess Medical Centre in Boston between 2008 and 2019. They were tested at three-year intervals, and their predictions were compared to the real patient data in order to score them on the AUC (area under the ROC curve) scale. AUC runs from 0 to 1: a score of 1 means the model always rates a patient who developed the condition as higher-risk than a patient who did not, while a score of 0.5 is no better than chance.
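To make the AUC scale concrete, here is a minimal Python sketch of how such a score can be computed. It is not the code used by STAT and MIT. The function name `auc` and the six example patients are made up for illustration. It relies on a standard property of AUC: it equals the probability that a randomly chosen patient who developed the condition was given a higher risk score than a randomly chosen patient who did not.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive case with every negative case (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case correctly ranked higher
            elif p == n:
                wins += 0.5   # tie: counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = patient developed sepsis, 0 = did not.
labels = [1, 1, 0, 0, 1, 0]
well_ranked = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]   # every positive scored above every negative
uninformative = [0.5] * 6                      # same score for everyone

print(auc(labels, well_ranked))    # 1.0
print(auc(labels, uninformative))  # 0.5
```

A model that gives every patient the same score carries no information and lands at 0.5, which is why an AUC close to that value is compared to flipping a coin.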
What they found was that the AUC scores of the three algorithms did not remain stable over time. One of them, a sepsis prediction algorithm modelled on a widely used product by Epic Systems, saw its AUC drop to 0.53 – barely more accurate than flipping a coin.
Many ML algorithms are built on artificial neural networks, which consist of interconnected artificial neurons. These neurons take some input and, based on the activation of different neurons within the network, produce some output. If the output is correct, the neurons involved in producing that output have their connections strengthened. This is the basic principle by which such algorithms are ‘trained’ to perform a desired task. One consequence is that an algorithm trained on one set of data won’t necessarily work well, or at all, when applied to a different set of data. As an extreme example, an algorithm trained on health data from the United States isn’t going to work well in a hospital in Niger, because the hospital environments, health challenges and patient populations in these two countries are very different. Even an algorithm trained in one hospital in New York won’t perform optimally when used in another hospital in the same city.
In the experiment by STAT and MIT, the three algorithms were all fed data from the same medical centre. So, what happened? Why did the accuracy of the sepsis algorithm decline so much over time? The answer is simple – the environment in which the algorithm was being used evolved over time, but the algorithm itself remained stuck in the time period in which it was trained.
Specifically, two major changes occurred. In 2015, clinicians began using an updated version of the International Classification of Diseases (ICD), a coding system in which each illness is described by an ICD code. This introduced thousands of new, more precise ICD codes, which threw the sepsis algorithm off, as it had been trained with the old ICD codes. When it was retrained with all the codes removed, its score improved by around 0.15, but it was still less accurate than at its debut. Why? Because the patients being treated at Beth Israel Deaconess Medical Centre had changed.
Beth Israel Deaconess had acquired or signed affiliations with several suburban hospitals around Massachusetts. Consequently, a new subset of patients began to be admitted to the ICU. The average time between admission to the ICU and the onset of sepsis increased and, for reasons unknown, microbiology tests to detect sepsis stopped being ordered as frequently or as quickly. In some situations, such changes in the dataset can make an algorithm more accurate as new associations are incorporated into its predictions. However, these disturbances can also lead to the creation of spurious associations and a drop in accuracy, which is what happened to the sepsis algorithm.
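The first of these changes, the switch to new ICD codes, is the kind of shift that can be spotted with a very simple check. The Python sketch below is an illustration only, not part of the STAT and MIT experiment. The function name `unseen_code_rate` and the short code lists are assumptions made for the example. It measures what fraction of incoming diagnosis codes the model never saw during training.

```python
def unseen_code_rate(training_codes, incoming_codes):
    """Fraction of incoming diagnosis codes that did not appear
    anywhere in the training data."""
    known = set(training_codes)
    if not incoming_codes:
        return 0.0
    unseen = sum(1 for code in incoming_codes if code not in known)
    return unseen / len(incoming_codes)


# Illustrative lists only: a model trained on older-style codes...
training_codes = ["038.9", "995.91", "785.52", "486"]

# ...sees familiar codes before the coding change...
before_change = ["038.9", "486", "995.91", "038.9"]

# ...and mostly unfamiliar ones afterwards.
after_change = ["A41.9", "R65.21", "J18.9", "038.9"]

print(unseen_code_rate(training_codes, before_change))  # 0.0
print(unseen_code_rate(training_codes, after_change))   # 0.75
```

A sudden jump in this rate would not say how far accuracy has fallen, but it would signal that the model is being fed inputs it was never trained to interpret.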
This study highlights an important problem: health algorithms don’t always retain their advertised accuracy over time, and failing to recognise this could mean the difference between life and death for some patients. However, there are ways to mitigate this problem. Algorithms can be trained on data points that are less susceptible to change – for example, by omitting ICD codes, which have been updated 10 times since their creation. This makes the algorithm more resilient over time, with the tradeoff being lower accuracy during its early use.
Another option is to retrain the algorithm on a regular basis. Unfortunately, vendors generally aren’t upfront about their algorithms’ vulnerabilities, which makes mitigating them harder for users. For example, it would be helpful if vendors gave an estimate of how quickly their algorithms’ AUC degrades and how often the algorithms should be retrained.
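In the absence of such guidance, a hospital could track the AUC itself and use a drop as the trigger for retraining. The Python sketch below shows one way this might look. It is a hedged illustration, not a description of any vendor's product or of the STAT and MIT code. The function names, the `max_drop` threshold of 0.05 and the example records are all assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def periods_needing_retraining(records, baseline_auc, max_drop=0.05):
    """Group (period, label, score) records by period and return the
    periods whose AUC fell more than `max_drop` below the baseline."""
    by_period = {}
    for period, label, score in records:
        labels, scores = by_period.setdefault(period, ([], []))
        labels.append(label)
        scores.append(score)

    flagged = []
    for period in sorted(by_period):
        labels, scores = by_period[period]
        period_auc = auc(labels, scores)
        if baseline_auc - period_auc > max_drop:
            flagged.append((period, period_auc))
    return flagged


# Made-up records: (period, did the patient develop sepsis?, model's risk score)
records = [
    ("2008-2010", 1, 0.9), ("2008-2010", 1, 0.7),
    ("2008-2010", 0, 0.3), ("2008-2010", 0, 0.2),
    ("2017-2019", 1, 0.6), ("2017-2019", 1, 0.3),
    ("2017-2019", 0, 0.5), ("2017-2019", 0, 0.4),
]

print(periods_needing_retraining(records, baseline_auc=1.0))
# [('2017-2019', 0.5)]
```

The right threshold and review interval would depend on the algorithm and the clinical setting, which is exactly the information vendors are best placed to supply.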
Machine learning remains important to the future of healthcare, but the hype and the drive to develop new models have led to the practical challenges of implementing ML being neglected. Health algorithms are useful, but standards for monitoring and maintaining them need to be improved.
AI gone astray: How subtle shifts in patient data send popular algorithms reeling, undermining patient safety: https://www.statnews.com/2022/02/28/sepsis-hospital-algorithms-data-shift/