Medicine's Machine Learning Problem

January 4, 2021

Data science is remaking countless aspects of society, and medicine is no exception. The range of potential applications is already large and only growing by the day. Machine learning is now being used to determine which patients are at high risk of disease and need greater support (sometimes with racial bias), to discover which molecules may lead to promising new drugs, to search for cancer in X-rays (sometimes with gender bias), and to classify tissue on pathology slides. Last year MIT researchers trained an algorithm that was more accurate at predicting the presence of cancer within five years of a mammogram than techniques typically used in clinics, and a 2018 survey found that 84 percent of radiology clinics in the United States are using or plan to use machine learning software. The sense of excitement has been captured in popular books such as Eric Topol’s Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again (2019). But despite the promise of these data-based innovations, proponents often overlook the special risks of datafying medicine in the age of artificial intelligence.

Consider one striking example that has unfolded during the pandemic. Numerous studies from around the world have found that significant numbers of COVID-19 patients, known as “long haulers,” experience symptoms that last for months. Good estimates range from 20 to 40 percent of all patients, and perhaps higher, depending on the study design. Yet a recent study from King’s College (picked up by CNN and the Wall Street Journal) gives a much lower estimate, claiming that only 2 percent of patients have symptoms for more than 12 weeks and only 4 percent have symptoms for more than 8 weeks. What explains the serious discrepancy? It turns out that the King’s College study relies on data from a symptom-tracking app that many long haulers quit using because it didn’t take their symptoms or needs into account, resulting in a time-consuming and frustrating user experience. Long haulers are already dealing with disbelief from doctors, and the inaccurate results of this study may cause further harm by casting doubt on the reality of their condition.
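The arithmetic behind this kind of discrepancy can be sketched in a few lines. The figures below are hypothetical, chosen only for illustration; they are not taken from the study described above or from any other, and the function name and parameters are assumptions of this sketch. It shows how an estimate drawn only from people who keep using an app shrinks when the people with lasting symptoms are the ones most likely to quit.

```python
def observed_prevalence(true_rate, retention_long, retention_other):
    """Share of long haulers among users who are still reporting.

    All three inputs are hypothetical illustration values between 0 and 1:
    true_rate       -- fraction of all patients with long-lasting symptoms
    retention_long  -- fraction of long haulers who keep using the app
    retention_other -- fraction of other patients who keep using the app
    """
    for value in (true_rate, retention_long, retention_other):
        if not 0 <= value <= 1:
            raise ValueError("all arguments must be between 0 and 1")
    remaining_long = true_rate * retention_long
    remaining_other = (1 - true_rate) * retention_other
    total = remaining_long + remaining_other
    if total == 0:
        raise ValueError("no users remain, so no estimate is possible")
    return remaining_long / total


# Hypothetical scenario: 30% of patients are long haulers, but only 5% of
# them keep using the app, compared with 90% of everyone else.
estimate = observed_prevalence(0.30, 0.05, 0.90)
print(f"true rate: 30.0%, rate seen in app data: {estimate:.1%}")
```

Under these made-up retention rates, a true rate of 30 percent appears in the app data as roughly 2 percent. Nothing about the remaining users' reports has to be wrong for the overall estimate to be badly off; who stopped reporting is enough.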

This case is not an isolated exception, and it is not just an object lesson in bad data collection. It reflects a much deeper and more fundamental issue that all applications of data science and machine learning must reckon with: the way these technologies exacerbate imbalances of power. Data is not inert; it causes a doctor to mistakenly tell a patient that her dementia-like symptoms must just be due to a vitamin deficiency or stress. A software bug is not just an error in a line of code; it is a woman with cerebral palsy losing the home health aide she relies on in daily life. As others have argued, the ethics of AI turn crucially on whose voices are listened to and whose are sidelined. These problems are not easily fixed, for the same reason they exist in the first place: the people most impacted—those whose lives are changed by the outcome of an algorithm—have no power, just as they are so often ignored when the tech is being built. Anyone excited about the promise of machine learning for medicine must wrestle seriously with the perils.

As a starting point, we can take five principles to heart. First, it is crucial to acknowledge that medical data—like all data—can be incomplete, incorrect, missing, and biased. Second, we must recognize how machine learning systems can contribute to the centralization of power at the expense of patients and health care providers alike. Third, machine learning designers and adopters must not take new systems onboard without considering how they will interface with a medical system that is already disempowering and often traumatic for patients. Fourth, machine learning must not dispense with domain expertise—and we must recognize that patients have their own expertise distinct from that of doctors. Finally, we need to move the conversation around bias and fairness to focus on power and participation.

• • •

Flaws in medical data

Bias is endemic in medicine. One recent example concerns pulse oximeters, a crucial tool in clinical practice and an essential tool in the pandemic. Prompted by an essay in these pages, which detailed the way most oximeters are calibrated on patients with light skin, a recent study in the New England Journal of Medicine found that Black patients are three times as likely as white patients to get misleading readings, which may impact clinical outcomes. Fitbit heart rate monitors, currently used in over 300 clinical trials, are also less accurate on people of color. Scores of studies show that women and people of color receive less pain medication, lower quality of care, and longer delays to treatment. Doctors often misattribute women’s pain to psychological causes, with the result that women who report pain are prescribed antidepressants (when they haven’t reported symptoms of depression) rather than painkillers. And similar findings have been observed for race. As a BBC article puts it, “A 2012 meta-analysis of 20 years of published research found that Black patients were 22 percent less likely than whites to get any pain medication and 29 percent less likely to be treated with opioids.”

These biases—among many others—can result in deeply flawed medical data. The observations, diagnoses, and decisions made by doctors are often treated as objective, but they are fallible, and flawed judgments result in flawed data. In most cases, we do not have data directly recording what patients experience; instead those reports are filtered through a doctor’s perception of their state. Any machine learning model relying on this data is at risk of replicating these biases, delays, and errors.

To take another important example of the way medical datasets may systematically misrepresent reality, diagnosis delays are common for many illnesses, leading to incomplete and incorrect data at any one snapshot in time. On average, it takes five years and five doctors for patients with autoimmune diseases such as multiple sclerosis and lupus to get a diagnosis; three-quarters of these patients are women, and half report being labeled as chronic complainers in the early stages of disease. Diagnosis of Crohn’s disease takes twelve months for men and twenty months for women, while diagnosis for Ehlers-Danlos syndrome takes four years for men and sixteen years for women. Consider how many patients have not received an accurate diagnosis yet or who give up before ever finding one. This leads to incomplete and missing data.

There is also a pernicious cycle around missing medical data for poorly understood diseases: doctors disbelieve patients about their symptoms and dismiss them as anxious or complaining too much. This leads to undercounting how many people are impacted by certain symptoms or diseases, which in turn makes it harder to make a case for increased funding; the diseases may remain poorly understood and patients continue to be disbelieved.

These are just some of the ways the quality of medical data can be misleading or biased. In working with medical data, it is imperative to consider tests that aren’t ordered and notes that aren’t recorded. And it is essential to listen to patients about the ways their data is incomplete, incorrect, or missing. Despite the rise of so-called “debiasing” algorithms, all datasets are biased, and the only way we can understand how is through qualitative work: inspecting how the dataset was gathered, listening to those who have first-hand experience and will be most impacted, and examining the relevant history and social structures.

• • •

The centralization of power

Too often machine learning has the effect of further concentrating and centralizing power away from those most affected by the technology. The risk arises because machine learning can be used cheaply at massive scale; it can amplify biases across entire systems where they would otherwise be more localized; it can be used to evade responsibility; it can be implemented with no system for recourse and no way to identify mistakes; and it can create pernicious feedback loops. We have seen this across a range of domains already: governments using facial recognition to identify protesters, corporations using surveillance to track and even fire employees, the “great decoupling” of income and productivity (in which record corporate profits go to an ever smaller slice of executives), U.S. Immigration and Customs Enforcement misappropriating datasets, and job seekers subjected to degrading, time-consuming algorithms. These impacts are not experienced uniformly by all. We need to consider carefully how this dynamic could play out as machine learning is implemented within medicine.

In 2018 The Verge investigated an algorithm used in over half of U.S. states to determine how much health care people receive. When it was implemented in Arkansas, there was an error in the code that incorrectly and drastically cut health care for people with cerebral palsy. No explanations were given, and there was no easy way to appeal the cuts. For instance, Tammy Dobbs, a woman with cerebral palsy who needs an aide to help her carry out daily tasks such as getting out of bed, had her hours of help suddenly reduced by twenty-four hours a week. Eventually, a court case revealed that there were mistakes in the software implementation of the algorithm, negatively impacting people with diabetes or cerebral palsy. Dobbs and many others who rely on these health care benefits live in fear that they could again be cut suddenly and inexplicably.

We are still in the early days of widespread implementation of machine learning in medicine, and it is likely only a matter of time until we see examples of centralization of power and potential harms due to opaque algorithms and the lack of recourse that often accompanies machine learning implementations—particularly when they are used as cost-cutting measures, wrongly assumed to be error-free, and implemented without clear mechanisms for validation, correction, and ongoing oversight.

• • •

How machine learning fits into an already distressing system

We can’t understand how data and machine learning will impact medicine without first understanding how patients experience the medical system now. Professionally, I study how machine learning can amplify harms in other complex systems with big power differences. I have also studied the research on medical bias in particular, read hundreds of patient accounts, and am familiar with the medical system first-hand. I once went to the ER after several days of the worst pain of my life. No tests were given; I was discharged and told to take aspirin. I remember sobbing at the gap between my excruciating pain and the doctor’s assessment. A few days later I went to a different ER, where an MRI was ordered. As soon as the results came back I was transferred to the neuro-ICU, and I had brain surgery the next week.

Since then, I have had a life-threatening brain infection and a second brain surgery, and I continue to live with the long-term effects. My initial ER visit is just one of many times that I have been dismissed by medical professionals. Scores of patient accounts and research studies confirm that my experience is not unique. In a powerful comic, Aubrey Hirsch shares her experience waiting six years to get an accurate diagnosis for Graves’ disease; she developed permanent damage to her bones, eyes, and heart during that time. She experienced debilitating symptoms, yet numerous doctors dismissed her as just an “anxious” young woman.

Such accounts are not exceptional in the history of medicine but more like the rule, including everything from dismissing women’s illnesses as hysteria to the Tuskegee syphilis trials, in which Black men were denied a well-proven treatment for decades. The threads of history wind through to the present. How many patients still don’t have accurate diagnoses? How many are still being dismissed and disbelieved? And how many patients don’t have the resources to go to that additional ER, to keep seeking out doctors after years of dismissal? All these forms of ignorance and bias are reflected in medical data. Machine learning systems must consider how they interface with an already flawed and distressing medical system.

• • •

What counts as medical expertise

Domain expertise is crucial for any applied machine learning project. Radiologists working in deep learning, for example, have discovered dataset issues with incorrectly labeled chest X-rays that those without a medical background would not have recognized on their own. However, it is often assumed that for medicine, the knowledge and experience of doctors is the only domain expertise there is. This is false. While the knowledge of doctors is of course essential, patients have a set of skills and expertise that is distinct but just as essential. Patients know what they are experiencing: what it is like to feel pain, what it is like to navigate a demoralizing health care system. As a patient, it is not your pain or symptoms that matter on their own but the extent to which you can make them legible to providers and the tools they use. Patients must often strategize to try to avoid having their symptoms dismissed: to appear sick, but not in a way that a doctor may think they are faking it. Race, gender, class, weight, sexuality, and many other factors impact how patients are perceived and what contortions may be required to try to be taken seriously.

Many patients, particularly with rare or not widely understood illnesses, actively read medical papers, and in some cases will be more familiar with recent, relevant medical literature than some doctors. COVID-19 long-haulers, for example, many of whom have experience as data scientists and researchers, self-organized their own research study in April that made discoveries that mainstream medical research did not uncover until six months later. Doctors may inadvertently give inaccurate information due to being less familiar with recent developments outside their specialization and because unreasonable time constraints often make it impossible to adequately synthesize details of a medical history. Medical machine learning runs the risk of encoding assumptions and current ways of knowing into systems that will be significantly harder to change later. We are at a crucial inflection point with the machine learning revolution, where decisions made now will reverberate for decades to come.

• • •

From bias and fairness to power and participation

Even when problems with machine learning are brought to light, developers often propose “solutions” that involve mere tweaks to code, with no reckoning with the power dynamics at play and no inclusion of the people most impacted.

Fortunately, these concepts of power and participation are gaining more attention through the efforts of researchers, reporters, and activists, including Khari Johnson’s reporting and the Participatory Approaches to Machine Learning (PAML) workshop held this summer at the International Conference on Machine Learning, one of the premier academic machine learning conferences. As the PAML workshop organizers wrote:

The fields of algorithmic fairness and human-centered [machine learning] often focus on centralized solutions, lending increasing power to system designers and operators, and less to users and affected populations. . . We wish to consider a new set of technical formulations for the machine learning community on the subject of more democratic, cooperative, and participatory [machine learning] systems.

Researchers in this area have talked about the need to move beyond explainability (seeking explanations for how an algorithm made a decision) to recourse (giving those impacted concrete actions they could take to change the outcome) and to move beyond transparency (insight into how an algorithm works) to contestability (allowing people to challenge it). In a recent op-ed for Nature, AI researcher Pratyusha Kalluri urges that we replace the question “Is this AI fair?” with the question, “How does this shift power?”

These issues are especially crucial in the domain of medicine, where so many patients are already disempowered, and the risk of further centralizing power could lead to great harm. While machine learning may indeed help to bring huge benefits to medicine, patients must be centered and their expertise must be closely listened to. As AI researcher Inioluwa Deborah Raji powerfully wrote in July, “Data are not bricks to be stacked, oil to be drilled, gold to be mined, opportunities to be harvested. Data are humans to be seen, maybe loved, hopefully taken care of.” We must insist on mechanisms for ensuring power and participation now in order to ensure the human side of health care is not further eroded in medicine’s machine learning revolution.
