
Data Science and Artificial Intelligence for Chest Radiology

By Jillian Kunze

In recent years, a large number of companies have attempted to develop artificial intelligence (AI) tools to interpret radiological images. However, the adoption of this technology by healthcare providers has been much slower. The available AI tools can only provide coarse-grained reports due to their limited coverage of radiological findings, which is insufficient for clinical workflows — but is it possible for AI to do better?

During a technical vision talk at the Women in Data Science Worldwide Conference 2022, Tanveer Syeda-Mahmood (an IBM Fellow) described outcomes from IBM’s Medical Sieve Radiology Grand Challenge. The challenge was to build a provably good artificial radiology assistant, which required building an entire data science application from end to end. Doing so necessitated collaboration within a large team of clinicians, software engineers, and researchers in medical imaging, machine learning, text analytics, and healthcare informatics. 

“This problem was posed to us: Can you pick one modality, such as a chest X-ray, and do a full preliminary read at a level such that I couldn’t distinguish between a human read and a machine read?” Syeda-Mahmood said. There were several reasons that the team investigated chest X-rays in particular. About 60 percent of the X-rays taken in the U.S. are of the chest, so there is a clear business interest, though that was not really the team’s motivation. They were more interested in the fact that no one had completely catalogued chest radiology findings before, and were curious whether AI could be as accurate as radiologists (which of course required figuring out how accurate the radiologists themselves are). There was also the interesting problem of whether the AI could capture fine-grained descriptions in language that sounded natural enough to pass a Turing test.

However, chest X-rays are one of the most difficult types of medical imaging to interpret. The images may not be ready for interpretation because of technical quality issues, and X-rays may show tubes and lines from medical equipment, especially if they were taken in a hospital setting. A number of additional ambiguities can make it difficult to tell whether something in the image is a nodule or another kind of opacity.

Figure 1. An example of an automated preliminary read with a fine-grained description that was created by the artificial intelligence platform from a chest X-ray image. Figure courtesy of Tanveer Syeda-Mahmood.

Such difficulties led to research questions across multiple fields. “In order to build something at scale, you want to address many different problems,” Syeda-Mahmood said. “Not just for AI, which we know we have to do, but also on the radiology side.” For the medical aspect of the project, researchers needed to catalog all of the possible findings from chest X-rays, gather a benchmark dataset, establish the ground truth, record radiologist reads of X-rays, establish evaluation metrics, and benchmark radiologist performances; meanwhile, researchers on the machine side had to assemble and label training datasets, build machine learning models, record machine reads of X-rays, and compare performances.

The first step in developing the artificial radiology assistant was to determine all of the possible findings from chest X-rays. This intensive cataloguing effort required clinicians to curate findings from textbooks, more than 200,000 radiology reports, and other sources. The result was the largest collection of fine-grained finding labels for chest X-rays ever assembled, with more than 2,000 findings at the finest level of detail.

The researchers gathered multi-institutional datasets of chest X-rays from a variety of hospitals in the U.S. and abroad. However, much of the data was unlabeled or only sparsely labeled, certainly not enough to cover 2,000 findings. The team therefore had to attach labels to these images before using them to train a deep learning model. Doing so manually would be nearly impossible given the volume of data, so the researchers developed methods to interpret the companion written radiology report for each image using a lexicon of chest X-ray terminology and natural language parsing. This approach extracted fine-grained findings from the reports and labeled the images with modifiers such as the affected anatomy, location, severity, size, and shape.
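To illustrate what such report-driven labeling might look like, here is a minimal sketch of lexicon-based finding extraction from free-text reports. The lexicon entries, modifier lists, negation cues, and function names below are hypothetical placeholders for illustration, not the team’s actual chest X-ray lexicon or parser:

```python
# Minimal sketch of lexicon-driven label extraction from a radiology report.
# The lexicon, modifiers, and negation handling are illustrative stand-ins only.
import re

FINDING_LEXICON = {
    "pleural effusion": ["pleural effusion", "effusion"],
    "pulmonary nodule": ["nodule", "nodular opacity"],
    "cardiomegaly": ["cardiomegaly", "enlarged cardiac silhouette"],
}
LATERALITY = ["left", "right", "bilateral"]
SEVERITY = ["mild", "moderate", "severe", "small", "large"]
NEGATION_CUES = ["no ", "without ", "negative for "]


def extract_findings(report_text: str):
    """Return (finding, modifiers) pairs mentioned affirmatively in the report."""
    labels = []
    # Split the free-text report into rough sentences.
    for sentence in re.split(r"[.\n]", report_text.lower()):
        if not sentence.strip():
            continue
        negated = any(cue in sentence for cue in NEGATION_CUES)
        for finding, synonyms in FINDING_LEXICON.items():
            if any(term in sentence for term in synonyms) and not negated:
                modifiers = [w for w in LATERALITY + SEVERITY if w in sentence]
                labels.append((finding, modifiers))
    return labels


report = "Small left pleural effusion. No focal nodule is seen."
print(extract_findings(report))  # [('pleural effusion', ['left', 'small'])]
```

Handling negation matters here because radiology reports frequently state what is absent (“no focal nodule”), and a naive keyword match would label those images incorrectly.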

Using these data and labels, Syeda-Mahmood described how the research team built a single deep learning model and tuned it for the sensitivities and specificities desired in radiology reads by exploiting unique combinations of the latest developments in computer vision. In order for the machine to automatically generate a readable report, the team extracted a database of prior human-written reports and indexed them against the findings recognized in chest X-rays, which allowed the system to generate sentences that sounded natural (see Figure 1).
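One generic way to tune a trained classifier toward a target sensitivity or specificity is to adjust each finding’s decision threshold on a validation set. The sketch below shows that post-hoc step with made-up labels and scores; it is not IBM’s actual tuning procedure, only an illustration of the idea:

```python
# Minimal sketch: choose a per-finding decision threshold that meets a target
# sensitivity on validation data. Labels, scores, and the target are hypothetical.
import numpy as np
from sklearn.metrics import roc_curve


def threshold_for_sensitivity(y_true, y_score, target_sensitivity=0.90):
    """Pick the highest score threshold whose sensitivity (TPR) meets the target."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ok = tpr >= target_sensitivity
    idx = np.argmax(ok)  # first (highest) threshold reaching the target sensitivity
    return thresholds[idx], tpr[idx], 1.0 - fpr[idx]  # threshold, sensitivity, specificity


# Hypothetical validation labels and model scores for one finding.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])
thr, sens, spec = threshold_for_sensitivity(y_true, y_score)
print(f"threshold={thr:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```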

Syeda-Mahmood then explained how the research team demonstrated that their AI approach performed comparably to actual radiologists. First, they had to conduct radiology studies to build benchmark datasets and establish the ground truth against which the automatically generated results could be compared. Comparing the sensitivity and specificity of entry-level radiologists to those of the artificial radiology assistant yielded promising results. “For some findings radiologists are better than machines, and for others, machines are better than radiologists,” Syeda-Mahmood said.
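The underlying comparison reduces to computing per-finding sensitivity and specificity for both the machine and the radiologists against the same ground truth. Here is a minimal sketch with hypothetical binary reads (the vectors below are made-up placeholders, not study data):

```python
# Minimal sketch: per-finding sensitivity and specificity for machine reads
# versus radiologist reads against a common ground truth (hypothetical data).
def sensitivity_specificity(truth, reads):
    tp = sum(1 for t, r in zip(truth, reads) if t == 1 and r == 1)
    tn = sum(1 for t, r in zip(truth, reads) if t == 0 and r == 0)
    fp = sum(1 for t, r in zip(truth, reads) if t == 0 and r == 1)
    fn = sum(1 for t, r in zip(truth, reads) if t == 1 and r == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Ground truth and reads for one finding across a handful of studies (hypothetical).
ground_truth     = [1, 0, 1, 1, 0, 0, 1, 0]
machine_read     = [1, 0, 1, 0, 0, 0, 1, 0]
radiologist_read = [1, 0, 0, 1, 0, 1, 1, 0]

for name, reads in [("machine", machine_read), ("radiologist", radiologist_read)]:
    sens, spec = sensitivity_specificity(ground_truth, reads)
    print(f"{name:12s} sensitivity={sens:.2f} specificity={spec:.2f}")
```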

Figure 2. The interface for the Turing test in which senior radiologists evaluated reports created by the artificial intelligence platform and radiology residents. Figure courtesy of Tanveer Syeda-Mahmood.

Finally, the researchers performed a Turing test to compare their machine’s output with that of medical residents in radiology (see Figure 2). They gave reports to three senior radiologists without telling them which process had created each one; one radiologist received only machine reports, one received only reports written by radiology residents, and the third got a mixture of both. The senior radiologists then evaluated the reports in terms of their quality for diagnostic purposes and overall satisfaction. Excitingly, the machine produced more reports that were rated as excellent quality than the residents did.

Syeda-Mahmood concluded her talk by reiterating the grand nature of this challenge, which explored an entire application from end to end. Ultimately, the research team was able to create a specialized model with best-in-class performance that provided automated reports in natural language. Syeda-Mahmood also looked towards the future. “A lesson learned in data science is that the next wave that you will see in AI models is going after fine-grained findings,” she said. “I think the coarse grain labels have already been addressed now, and in the future, you will see machines going after things that are at a much finer level of detail.” 


A recording of Tanveer Syeda-Mahmood’s technical vision talk at the Women in Data Science Worldwide Conference 2022 is available on YouTube.

Jillian Kunze is the associate editor of SIAM News.