In 2016, JAMA published research demonstrating
the efficacy of a deep learning algorithm. We were able to train a deep learning neural
network to recapitulate the majority decision of 7 or 8 US board-certified ophthalmologists in
the task of grading for diabetic retinopathy. The type of deep learning algorithm
used to detect diabetic retinopathy in that study is called a
Convolutional Neural Network, or CNN. CNNs enable computer systems
to analyze and classify data. When applied to images, CNNs can recognize
that an image shows a dog rather than a cat. They can recognize the dog whether it occupies a
small or a large part of the picture; size doesn't matter for this technique. They can also classify the dog by breed. CNN systems have also been developed to help clinicians do their work,
including selecting cellular elements on pathology slides, correctly identifying
the spatial orientation of chest radiographs, and, as Dr. Peng mentioned, automatically
grading retinal images for diabetic retinopathy. So let's open the deep learning
black box to understand how this works. First, a CNN is not one process. It's actually a complex network of
interconnected processes, organized in layers. With each layer, the CNN can detect
higher-level, more abstract features. When the CNN is identifying these
features, it uses something called a filter. Here's how Larry Carin, one of the authors of
a JAMA Guide to Statistics and Methods article on CNNs, describes a filter: So,
if we think about a medical image, a medical image in radiology or ophthalmology or
dermatology is characterized by local structure. It could be textures, it could be
edges, it could be curves, corners, etc. And what these filters are doing is
constituting little miniature versions of each of these little building blocks. The way the CNN looks
for these building blocks is the C in CNN; it stands for convolution. It's a mathematical operation
that looks pretty complex, but it's actually a very simple concept. It's kind of like you've got this filter, and
you're walking it to every part of the image, and you're just asking the question: how
much does this part of the image look like that filter? Think of it like this: you
have a drawing, that's the image, and you have a stencil, that's the filter. You pass that
stencil over the drawing, and as you do, you will see that some
parts of the drawing become more visible than others. That process of sliding the stencil across the drawing is essentially the process of convolution.
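To make the stencil picture concrete, here is a minimal sketch of that slide-and-match operation in Python. This is our own illustration, not code from the study; real CNN libraries compute the same thing far more efficiently:

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` (the stencil) over `image` (the drawing) and record,
    at each position, how strongly the underlying patch matches it."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            # Elementwise multiply-and-sum: large when the patch looks like the kernel
            out[y, x] = np.sum(patch * kernel)
    return out
```

Each output value answers exactly the question in the quote: how much does this part of the image look like that filter? (Strictly speaking, CNN layers compute cross-correlation, but the name convolution is used by convention.)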
Now that we've explained what a filter is and introduced the concept of convolution, let's use an analogy of written language to
understand the relationship between the filters and the hierarchical structure
of the layers in a CNN. The analogy is a written document. To communicate through writing,
we organize text as a series of paragraphs; paragraphs are composed of sentences, sentences
of words, and words of letters. So reading a document requires assessing
the relationship of letters to one another in increasing layers of complexity,
which is a kind of "deep" hierarchy, like the hierarchy in image analysis. Continuing with our analogy,
let's say we're looking for the phrase Ada Lovelace in a paragraph. Ada Lovelace was a mathematician and
writer who lived in the 19th century. And she holds the honor of having published
the very first algorithm intended to be used by a machine to perform calculations, which
makes her the first ever computer programmer. In the first layer of the network, a CNN
looks for the basic building blocks of an image. The basic building blocks of
written language are letters. So in this analogy, the filters the CNN
uses in the first layer would be letters. Let's zoom in on the word "Ada." Here is what the convolution process
would look like for the letter A. When the "A" filter overlies the letter
"A" in the original image, the convolution output would
generate a strong signal. This signal would then be mapped
onto something called a feature map. The feature map represents how well
elements in the image align with the filter. Where the filter matches, the signal outputs white; where nothing matches, it outputs black. CNNs generate a feature map for every filter, so in our analogy, there would be a feature map for every letter.
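As a toy illustration of such a feature map (our own construction, not from the video), suppose each letter position is encoded as a one-hot column of "pixels"; sliding an "A" stencil over the text then lights up exactly where an A sits:

```python
import numpy as np
from scipy.signal import correlate2d

# Toy "text" image: each column is one letter slot, one-hot over
# a tiny alphabet (row 0 = "A", row 1 = "D", row 2 = anything else).
#                A  D  A  .  L
text = np.array([[1, 0, 1, 0, 0],
                 [0, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1]])

a_filter = np.array([[1], [0], [0]])  # the "A" stencil

feature_map = correlate2d(text, a_filter, mode="valid")
print(feature_map)  # [[1 0 1 0 0]]: strong (white) wherever an "A" appears
```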
These feature maps would then become the input for the second layer. In this layer, the CNN would spatially align and
"stack" all those maps from the previous layer. This would allow the CNN to look
for short, specific sequences of letters in all the feature maps simultaneously. So the CNN would use a new set of filters to
look for specific letters that are adjacent to one another in particular sequences. In our analogy, the second layer would
look for places where the letters A, D, and A are in sequence, making the word "ADA." It would also look for places where
the letters A, C, E, L, O, and V are adjacent to one another, using filters for LOVE and LACE. The output of the second layer
would be the feature maps for those three sequences of letters. In other words, in those feature
maps, strong signals would be present where the sequences ADA, LOVE, and LACE are located in the original paragraph.
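Here is a sketch of what that second layer might compute (again, our own toy construction): the per-letter feature maps are stacked as channels, and one multi-channel "ADA" filter spanning three positions is slid across the stack:

```python
import numpy as np

# Layer-1 outputs, stacked as channels (row 0 = "A" map, row 1 = "D" map).
#                 A  D  A  .  A
a_map = np.array([1, 0, 1, 0, 1])
d_map = np.array([0, 1, 0, 0, 0])
stack = np.stack([a_map, d_map])  # shape: (channels, positions)

# The "ADA" filter expects A, then D, then A in three consecutive slots.
ada_filter = np.array([[1, 0, 1],
                       [0, 1, 0]])

scores = np.array([np.sum(stack[:, i:i + 3] * ada_filter)
                   for i in range(stack.shape[1] - 2)])
print(scores)  # [3 0 2]: the full "ADA" at the start scores highest
```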
In the third layer, the CNN would stack and align these three new maps and perform more convolutions, this time identifying where longer words and groups of words are located. So the CNN could at this point identify where in
the original paragraph the sequences of letters and words making the phrase
"ADA LOVELACE" are located. In our analogy, we were looking for
a phrase consisting of only two words. Had we been looking for a longer sentence
or even a paragraph, the CNN would deal with the greater complexity
by having more layers. We've omitted quite a few
details about CNNs for simplicity, but this captures the essence of the model. But what does this look like for actual images, like identifying diabetic
retinopathy from an ocular photograph? Images are made out of
pixels rather than letters. In a digital context, a pixel is the smallest controllable unit of an image
represented on a display. Each pixel is a representation
of a tiny portion of the original image. Think about pixels like creating
a drawing with dots where every dot has a color value and an intensity. The more dots used, the clearer the
The filters a CNN uses in that first layer are small squares of pixels that correspond to things like textures, contrast between two colors, or edges.
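For instance, here is a classic vertical-edge kernel responding to a boundary between a dark and a bright region. This is our example; a trained CNN learns its own filter values, though early-layer filters often end up resembling kernels like this one:

```python
import numpy as np
from scipy.signal import correlate2d

# A hand-crafted vertical-edge filter (a Sobel kernel).
edge_filter = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])

# Toy 6x6 image: dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

feature_map = correlate2d(image, edge_filter, mode="valid")
print(feature_map)  # nonzero only in the columns where dark meets bright
```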
These are the image-analysis equivalents of the letters used in our analogy. And as a CNN goes up in the hierarchy,
it looks for combinations of these filters, getting more and more complex with each layer. As the complexity increases, the CNN gets
closer to identifying what it's looking for. So the specific features analyzed at each
layer help put the whole thing together. So, for example, some of the earlier work
showed that some layers tend to be better at extracting edge-like information. Meaning that, for example, if you combine
different kinds of horizontal edges, we might get a continuous line that
resembles the retinal blood vessels. And as you combine more of those, you start
to encode higher-level concepts such as: is there a microaneurysm here, is
there bleeding over here, are there other lesions in the image? And right at the very end,
after these multiple layers, the network will try to condense all of that information down into a final prediction. In this case, severe diabetic retinopathy.
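Putting the whole pipeline together, here is a deliberately small sketch of such an architecture in Keras. To be clear, this is not the network from the JAMA study; the layer sizes, input resolution, and five-grade output are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(224, 224, 3)),        # the fundus photograph, as pixels
    layers.Conv2D(16, 3, activation="relu"),  # layer 1: edges, textures, contrast
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),  # layer 2: combinations, e.g. vessel-like lines
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),  # deeper layers: lesion-like patterns
    layers.GlobalAveragePooling2D(),          # condense all the feature maps
    layers.Dense(5, activation="softmax"),    # final prediction: 5 severity grades
])
model.summary()
```

Each Conv2D layer applies the same slide-and-match operation shown earlier, just with many learned filters at once.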
Developing a CNN to help identify diabetic retinopathy was motivated by the fact that many patients with diabetes are not being screened frequently enough. We have to screen diabetic
patients once a year, or we should, and there are some barriers
to getting that done. Some of it is just, you know, not having
enough trained professionals to do that task. It's also not having that expertise
available where the patient is. It's not that, you know, there
aren't retina specialists in a metropolitan city four hours away, it's that there isn't a retina
specialist at your grocery store. And CNNs could facilitate the
integration of diabetic retinopathy and other screening programs into primary care. But before that happens, more research, especially prospective clinical
trials, is needed. The way we approach these things is
really the way that medicine usually works, which is to say, "let's do validations
of the method again and again and again until we're reasonably confident
that it really works on many kinds of images, in many settings, for many
different patient populations." And so from my perspective that's really at
the end of the day what's most important: does it work on real patients
and is it reliable? The excitement generated by early results has
already spurred several research groups to study the efficacy of CNNs in clinical practice, which could finally move
CNNs from the bench to the bedside. I think we're on the third or
fourth technological revolution where neural networks are
coming to the forefront, and I really hope that this
time we'll get it right. But there were failures in the past where
people used the technology in suboptimal ways, and we don't want that to happen again. One has to make sure that we have appropriate
and sufficient data for development, validation and testing, and that we're
solving actual clinical problems. At the end of the day, one thing to
take away is that even if, as a clinician, you find it hard to understand
exactly how a CNN arrives at its diagnosis, it can still be a useful tool. And this is similar to how many clinicians
use other widely adopted technologies. Consider antibodies: You know, as a
clinician I may not know exactly where that part of an antibody binds, but
I'm comfortable, after looking at some of this clinical validation, using
Lucentis, for example, for an injection, right. This is kind of like any new
breakthrough technology: it needs validation and it needs transparency, but I think,
you know, the medical community in general responds very well to new
technologies that have been validated. This video is meant to be an
introduction to the topic of CNNs. To further understand how machine
learning works in a clinical context, be sure to read the JAMA Guide to
Statistics and Methods article by Drs. Carin and Pencina in the
September 18, 2018, issue of JAMA.