Hello, I'm Eric Chow. I'm an Assistant Professor at the University of California, San Francisco, and I'm gonna talk to you today about next-generation sequencing. The outline of our talk today will cover traditional sequencing first, and then we'll spend most of our time talking about Illumina Sequencing by Synthesis, which most next-gen sequencing is being performed on today. At the end of the talk, I'll also touch upon two other competing platforms from Oxford Nanopore and Pacific Biosciences. So, the Human Genome Project really spurred the development of cheaper sequencing. This was a 20-year effort that cost 3 billion dollars that was completed in the year 2001. And this lowered the cost of sequencing a human genome to about 100 million dollars for a single genome. And this used traditional Sanger sequencing. Well, before we go into Sanger sequencing, I want to tell you a little bit about DNA, just to bring everybody up to speed. So, what I have depicted over here is a sequence of DNA. It's made of two different strands that run anti-parallel. So, one strand runs in one direction; the other runs in the opposite direction. And your C's hybridize to G's, and your A's hybridize to T's. Sometimes DNA is depicted as... as an arrow, shown here, just to simplify images. And we'll be using this quite a bit today. Now, DNA is a double-stranded molecule, but it can be denatured and separated into individual strands. And it can be re-natured back together, or we can add on different sequences of DNA and hybridize those on, such as this red piece of DNA, over here, that's now bound to this bottom template strand. And DNA can be copied by polymerases. And these DNA polymerases, shown here in orange, can bind to short pieces of DNA that'll hybridize onto a template strand. And they can polymerize DNA with building blocks to extend DNA into a newly synthesized strand. And these building blocks of DNAs are called deoxyribonucleotide triphosphates. 
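The pairing rules described above (C's with G's, A's with T's, with the two strands running anti-parallel) can be sketched in a few lines of Python. This is only an illustration: the function names `reverse_complement` and `hybridizes` are made up for this example, and real hybridization also depends on temperature, length, and mismatches, which this sketch ignores.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction.

    Because the two strands are anti-parallel, the partner is the
    complement of each base, taken in reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hybridizes(primer: str, template: str) -> bool:
    """True if the primer pairs perfectly somewhere along the template.

    Simplifying assumption: only exact, full-length matches count.
    """
    return reverse_complement(primer) in template


if __name__ == "__main__":
    print(reverse_complement("AGCCT"))
    print(hybridizes("AGGCT", "TTAGCCTGG"))
```

A short piece of DNA that passes this check is the kind of primer a polymerase could bind to and extend along the template.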
And over here are the four different versions, your A's, C's, G's, and T's. And these all share several common aspects. First, there's a triphosphate group that allows the growing strand to attach onto the building block. They have a 3' hydroxyl group which is used to add on additional building blocks of DNA. And each has one of the four different bases attached to it. So, with traditional Sanger sequencing, if you wanted to sequence a piece of DNA like this, you would have to first denature it and add on a primer so that DNA polymerase can bind on, as depicted here. And now, instead of adding just the traditional DNA bases, which would allow the polymerase to extend this DNA molecule, we actually use fluorescent terminators. And fluorescent terminators are very similar to the DNA building blocks depicted here, but there are a couple differences. First, you might notice that there are different colors. And so, there's a different fluorescent group attached to each of the four bases: a yellow for A, a blue for G, a red for C, and a green for T. Additionally, the 3' hydroxyl group that's present on the building blocks to allow DNA to continue extending is now terminated or removed. This is why these are called fluorescent terminators. They have a fluorescent group and a terminator that prevents DNA polymerase from further extending the DNA strand. And so, in a test tube, we actually have billions and billions of copies of this template. And we put in the normal DNA building blocks, and we spike in a low concentration of these fluorescent terminators. What results is you have a bunch of fragments of newly synthesized DNA that are all different sizes, because they randomly incorporated a fluorescent terminator. And these molecules in the test tube are then put onto a DNA sequencer, which will separate these and then allow us to determine the sequence. What happens in the sequencer is these molecules are separated by size, from largest to smallest. 
The smallest ones come out of the sequencer first and get detected, and then the next piece of DNA gets detected, and the next. And the trace that comes out of the DNA sequencer is a chromatogram, down here at the bottom. And if you follow the color changes as each of the different pieces of DNA come out of the sequencer, you can build up your DNA sequence. These machines can run up to 384 samples at a time and generate about 700 bases of sequence for each of those samples. And you can sequence up to 1 million pieces of DNA, or 1 million bases of DNA, in a single day. Now, this might sound like a lot, but when you consider the human genome is actually 6 billion bases... and to sequence the human genome, you actually sample it many, many times, sometimes on the average of 7 times. This means that if you were to try to sequence a single human genome on one sequencing machine, it would take about 100 years. And so, clearly, the Human Genome Project was completed in much under 100 years, and this was made possible by factories of these sequencing instruments. Over here is a single sequencing instrument in this huge factory, where we have sometimes dozens or even hundreds of these machines running 24/7, 365 days a year. So, the human genome was a really great undertaking. It generated a lot of useful data. But it was only... it came from the genetic material of several people. And so, we still don't understand most of what this genome does. You know, we have the sequence, but again it's only from a small subset. But to really understand the function of the genome, we need to sequence thousands to millions of different people to really sample the variety of genetic material present in the human population. And it's obvious that we can't do this with traditional Sanger sequencing. So, luckily, over the past two decades, the cost of sequencing has dropped dramatically. Over here is a chart showing you how much it costs to sequence 1 million bases of DNA. 
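The "about 100 years" estimate mentioned above can be checked with quick arithmetic. The sketch below uses only the round numbers from the talk (6 billion bases, roughly 7-fold sampling, 1 million bases per day); the function name `years_on_one_machine` is made up for this example, and it assumes the machine runs every day without downtime.

```python
GENOME_BASES = 6_000_000_000   # size of the human genome quoted in the talk
COVERAGE = 7                   # average number of times the genome is sampled
BASES_PER_DAY = 1_000_000      # output of one Sanger instrument per day


def years_on_one_machine(genome_bases: int, coverage: int, bases_per_day: int) -> float:
    """Years needed to sequence a genome at the given coverage on one machine."""
    days = genome_bases * coverage / bases_per_day
    return days / 365


if __name__ == "__main__":
    years = years_on_one_machine(GENOME_BASES, COVERAGE, BASES_PER_DAY)
    print(f"{years:.0f} years")
```

With these inputs the result comes out a little over a century, which is consistent with the rough figure given in the talk.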
When the Human Genome Project was completed in 2001, it cost almost 10,000 dollars. Today, we're approaching 1 cent to sequence a million bases of DNA, which represents a 1 million-fold drop in the cost of sequencing. And if you follow this chart, you can see that there are several inflection points in 2007, 2010, and 2015, where prices had some fairly steep drops. And these were largely driven by new sequencing systems introduced by Illumina, the dominant player in the next-gen sequencing market. On this slide are three different types of sequencers from Illumina. They actually have several more, but they really span the scale of sequencing that's being offered. On the left-hand side is the Illumina MiSeq instrument, in the middle is the HiSeq, and on the right is the NovaSeq, the newest high-output sequencer from Illumina. And if we compare the output of these machines, and compare them with the Sanger sequencing platforms, we can see just how much more sequence we can generate from these instruments. If you just look at how many reads you can get from a single run, the MiSeq generates 30 million reads, the HiSeq generates 3 billion reads, the NovaSeq generates 13 billion reads, while the Sanger sequencing system generates about 400 reads. And so, it's a huge difference. And if you just look at this in terms of how much sequence you can generate in a single day, you can generate up to 4 trillion bases on the NovaSeq in one single day, compared to 1 million bases on the traditional Sanger instrument. Illumina sequencing is the dominant player in the market. It's an imaging-based method and generates many, many reads, millions to billions of reads per run. And from each of these reads, we can generate 300-600 bases of sequence. It's really, really accurate. The error rate is roughly 1 in 1,000 bases. And with the new machines, we can actually sequence a human genome for $1,000 in less than 48 hours. And the sequencing on these Illumina platforms happens in flow cells. 
These are essentially microscope slides with channels on the inside. What I have shown here is a MiSeq flow cell next to a standard Eppendorf tube that's used for a lot of lab work. And these tubes are pretty small, about 1.5 inches in height. And on this MiSeq flow cell, we can generate about 1-30 million reads per run. Next up is the HiSeq flow cell. You can see it's quite a bit larger than the MiSeq flow cell, and this larger real estate allows us to generate more reads in a single run, approaching 3 billion reads. And lastly is the NovaSeq flow cell, which is even larger. And on this flow cell, we can generate 13 billion reads in a single run. So, you can't just put DNA into these flow cells and get sequences out. You actually have to prepare your sample. And so, there are several steps that occur, but basically what happens is you take your sample that you want to sequence, which is DNA... if you have something like RNA, you can convert it with enzymes into DNA. You take that DNA, and then you have to add on these adapter sequences at the end. So, in blue are primer binding sites that allow the sequencing reaction to occur. This is similar to Sanger sequencing, where we needed to have a primer bind for DNA polymerase to move along the template DNA. And then we also have these capture sequences, in green and orange. And these allow your sequencing sample to be captured onto the flow cell for sequencing. Once you have your DNA sample prepared... it's a double-stranded molecule which we denature and put into the flow cell. And on the left-hand side, again, is a picture of the HiSeq flow cell. So, there are 8 channels, and samples actually go into the glass slide. There are 8 different channels within there. And inside of the slide is a mixture of short DNA molecules, in orange and green. And these orange and green molecules will bind to both ends of your sequencing library. 
So, we denature our sample, flow those into the flow cell, and they can get captured by these primers -- or these short DNA molecules -- on the surface of the slide. And once that occurs, a DNA polymerase is added, and the DNA building blocks are introduced. And we copy that template. And so, we have a newly synthesized strand that's now physically constrained to the bottom of the flow cell. We wash out the original template strand and then allow the newly synthesized strand to now bind onto the other DNA sequence present on the surface, in this case the orange piece. And then we flow in some DNA polymerase and building blocks, and we get another strand formed. And we repeat this process many, many times. And in the end, we get about 1000 copies within a cluster. And remember that these 1000 molecules all came from the same original template strand, so they all have the same sequence. We denature the strands, and then we can selectively cleave off one of the oligos, or the primers, in this case the green one. And we've washed those away, so now all the 1000 molecules present are all the identical strand, because we removed the other strand. We flow in a sequencing primer, and now the clustering process is complete. And this is the instrument that performs clustering. So, the first thing we do is we take a brand new flow cell -- again, this is the size of a standard microscope slide -- and then we put it onto the stage of the instrument. And the stage is a... is a region of the instrument that can actually heat and cool to perform a lot of enzymatic reactions. Next, we load a blue plate of the reagents that perform the clustering. And lastly, a strip tube with the 8 different samples that are gonna be loaded onto that flow cell. Next, a manifold is mounted on top of the flow cell and connected to the cBot. 
And in the back of the cBot, or this clustering instrument, are a series of pumps, that will now pull reagents from the blue reagent plate, or from our sample strip tube, and pass them through the flow cell. And this is how reagents and liquids are delivered. And at this point, the clustering procedure takes about 3-4 hours. And once that's done, we take that flow cell and move it on to the actual sequencer, in this case a HiSeq instrument. And on the HiSeq instrument, we have a refrigerated compartment for all the sequencing reagents. A lot of enzymes in there that need to be kept cool during the 3-4-day sequencing run. Above the refrigerated section is a series of pumps that pull reagents from that refrigerated compartment and send them to the flow cell, which we will now load onto the stage. And so that is the flow cell from the cBot that just completed clustering. This gets mounted onto a stage and locked in place. And then we can begin the sequencing run. And what that does is that will trigger the pumps to start flowing reagents from the refrigerated compartment into the flow cell. And behind the flow cell is actually a really powerful microscope that is used to actually sequence each of those molecules of DNA. And so, remember that when we put the flow cell onto the sequencer, this is what it looks like. We have a lot of clusters that each have a sequencing primer bound to them. And the chemistry for sequencing is very similar to the Sanger sequencing terminators that were used. But the one difference is that these are reversible. And so, they're reversible in two ways. So, first, these can get incorporated into the clusters by DNA polymerase, and those clusters will light up in four different colors, depending on which base gets incorporated. A picture is taken, and after that picture is taken, we can actually remove these terminators and the fluorescent groups with some chemicals. 
And what results is we regenerate the 3' hydroxyl group, so now that cluster can have another round of bases added to it. And so, this is what it looks like for a single molecule. On this end over here, we have our template strand bound to a sequencing primer. And this allows DNA polymerase to bind and then add the base. In this case, it's a yellow one. So, once this gets added, a picture is taken. And then we add chemicals to remove the fluorescent group and the terminator so that DNA polymerase can add a second base. An image is taken, chemistry is performed to remove that base and to regenerate the 3' hydroxyl, and then we can add another base. And so, over time, the cycle just repeats over and over again, with multiple cycles of base addition, imaging, and then chemistry to remove the blocks. And so, if we look at... take a look at this depiction of 5 different cycles... so, each of these represents a different image taken. And if we follow the top left and the bottom right clusters, and see how those colors change over time, we can actually build the sequence. So, for top cluster, the sequence is gonna be AGCCT, because it goes from yellow, blue, red, red, green. And the bottom one is GTAAC. And so, again, the power of this system is that you can sequence up to billions of sequences at the same time, in parallel. And this is what gives these instruments their throughput. So, some people ask me, why can't we increase the amount of bases that we get in a single read by just repeating the chemistry over and over again? Do this 1000, 2000, 10000 times? And the main reason why is that the enzymes and chemistries aren't perfect. So, over here I have a cluster, one single cluster, made out of the same template molecules. But what happens is, because the chemistry isn't perfect, some of the strands lag behind. So, this one is a yellow instead of a green. And some clusters jump ahead. So, this one is now a red instead of a green, because it hopped ahead. 
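The color-to-base readout described above for the five cycles can be sketched as a simple lookup. The color assignments are the ones given earlier in the talk (yellow for A, blue for G, red for C, green for T). The color list for the bottom right cluster is not spelled out in the talk and is inferred here from its stated sequence, and the name `call_bases` is made up for this example. A real base caller works on fluorescence intensities rather than clean color names.

```python
# Four-color chemistry: one fluorescent color per base.
COLOR_TO_BASE = {"yellow": "A", "blue": "G", "red": "C", "green": "T"}


def call_bases(colors_per_cycle: list[str]) -> str:
    """Turn the color a cluster shows in each cycle into a base sequence."""
    return "".join(COLOR_TO_BASE[color] for color in colors_per_cycle)


# Colors observed for one cluster across five imaging cycles.
top_left = ["yellow", "blue", "red", "red", "green"]
# Inferred from the stated sequence for the second cluster (an assumption).
bottom_right = ["blue", "green", "yellow", "yellow", "red"]

if __name__ == "__main__":
    print(call_bases(top_left))
    print(call_bases(bottom_right))
```

Each cluster is decoded independently, which is why the same loop can in principle run over billions of clusters from the same set of images.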
And what happens is these errors aren't too bad. But over time, and many, many cycles -- over the course of 100, 200, 300 cycles -- more and more strands start to lag and more start to jump ahead, so that your true signal starts to disappear, and gets weaker and weaker. And this really limits Illumina sequencing to about 300 bases for each read. So, with the Illumina sequencer, four different images are taken during each cycle: one for each of the four color bases. And so, the images actually look like this. They're not very clear. They're not perfect circles. And they're very difficult to see by eye, to figure out where one cluster ends and another begins. You know, one example is here. Let's say in the first cycle we have two clusters that are both G. These just look like a single blob that's a single color, and this is really hard for the sequencer to pick out. But because we're sequencing many, many sequences at a time, in parallel, chances are these aren't identical sequences. And if we go through another cycle, you know, chances are that they'll have a difference. So, now this cluster on the right is an A while the cluster on the left is still a G. But because of this, the sequencer can now compare these images and determine where cluster A has a very pure signal, where there's a mixture between cluster A and B, and where cluster B has a pure signal. And it'll decide to just take imaging information from just these clean areas, where they have very pure signal. So, again, this is four-color chemistry from Illumina. This is what had been used for many, many years, and it was very clear and obvious: one color for each of the four bases. But there are issues with this. So, the colors depicted here... you know, you can tell the colors are very, very different. However, if you actually look at the emission spectra, you can see there's significant overlap between the four colors. 
And this requires the instruments to undergo a lot of color compensation, and this can contribute to errors. So, a few years ago, Illumina introduced two-color chemistry. So, instead of using four colors to represent four bases, they're now using two colors to represent four bases. In this case, they're using red and green. And the way this works is that you have T's, which are green, C's, which are red, A's, which are actually a mix of both colors, so you'll see them in both images, and then G's actually have no color or no signal. And so, this is how you can encode four different bases with only two colors. And so, initially the quality of the two-color chemistry wasn't as great as the four-color chemistry, but with recent developments on the NovaSeq platform, one of Illumina's newer sequencers that uses two colors... the data quality actually rivals the four-color instruments. And so, there have definitely been some improvements in this area. The other benefit of two-color sequencing is that these reagents are generally cheaper to make, because you don't have to have as many colors, and the instruments are also less expensive, because you only have to capture two images instead of four. So, how far can you take this? Can you do sequencing with only a single color? And the answer is yes. A newer platform just released by Illumina uses a single color to encode four different bases. And they do this by taking two images that are separated by a chemistry step in the middle. So, I'll walk you through this. In the first image, they add the four different bases, and in this case the A's and T's have a color on there. And again, it's the same color. The C's and G's don't have any color. One thing to note is the A base has a cleavable linkage between the base and the color. And the C base has a molecule that we can use to attach something later on. So, we take an image first. So, A's will have color and T's will have color. C's and G's will not have any color. 
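The two-color encoding described above (T's green, C's red, A's in both images, G's dark) amounts to a two-bit lookup per cluster. The sketch below is illustrative only: it assumes each image has already been reduced to a clean on/off call per cluster, and the names `TWO_COLOR_TO_BASE` and `call_two_color` are made up for this example.

```python
# Key is (signal in red image, signal in green image) for one cluster.
TWO_COLOR_TO_BASE = {
    (False, True): "T",   # green only
    (True, False): "C",   # red only
    (True, True): "A",    # visible in both images
    (False, False): "G",  # no signal in either image
}


def call_two_color(observations: list[tuple[bool, bool]]) -> str:
    """Decode one cluster from its (red, green) on/off calls, one pair per cycle."""
    return "".join(TWO_COLOR_TO_BASE[pair] for pair in observations)


if __name__ == "__main__":
    cycles = [(True, True), (False, False), (True, False), (False, True)]
    print(call_two_color(cycles))
```

Two on/off images give four possible combinations, which is exactly enough to tell four bases apart.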
Next, we go through a chemistry step that will cleave the green color from the A bases, and it'll add a molecule that will bind to the C bases and make them, now, colored. And so, we take a second image. So, in the second image, A's won't have any color, because any color they had has been cleaved off. The G's still won't have any color. But both the C's and T's will have color. And what this looks like if you break this down is... if you compare a single cluster and look at its color -- whether it's on or off between image 1 and 2 -- you can get four different bases out. So, these are the different flavors of Illumina sequencing. We started off with four-color chemistry, moved to two-color chemistry, and now we have one-color chemistry. All these chemistries are still used currently on existing platforms, and each of them has its benefits. So, now we're going to move over to the long-read sequencers. We're gonna start with Oxford nanopore sequencing. And with this sequencing technology, it uses these nanopores, which are extremely small pores with really small gaps in them -- so, in this case, 1.8 nanometers -- that are embedded in lipid membranes. These lipid membranes act as a barrier that prevents current from going back and forth, so current has to pass through the pore. And the way the nanopore sequencer works is single-stranded DNA -- or even RNA, in this case, now -- is threaded through the pore. And depending on which bases are in the pore at a given moment, that changes the current that's read out, and detectors can measure this change in current. And over here is just a depiction of what this looks like. So, for instance, you might have one level of current for your T's, a different one for G's, something else for C's, and then another level of current for A's. And you can just measure these current traces over time and build a sequence. 
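The idealized picture above, with one current level per base, can be sketched as a nearest-level lookup. Everything numeric here is a placeholder: the current values in `LEVEL_FOR_BASE` are arbitrary units invented for illustration, not measured values, and the sketch assumes the trace has already been cut into one measurement per base.

```python
# Made-up current levels in arbitrary units, one per base (illustration only).
LEVEL_FOR_BASE = {"T": 1.0, "G": 2.0, "C": 3.0, "A": 4.0}


def call_base(current: float) -> str:
    """Pick the base whose idealized level is closest to the measured current."""
    return min(LEVEL_FOR_BASE, key=lambda base: abs(LEVEL_FOR_BASE[base] - current))


def decode_trace(trace: list[float]) -> str:
    """Decode a trace that has one current measurement per base."""
    return "".join(call_base(current) for current in trace)


if __name__ == "__main__":
    print(decode_trace([1.1, 1.9, 3.2, 3.9]))
```

This only works because the sketch pretends a single base sets the current at any moment.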
In reality, the traces don't look this clean, because in the pore there are actually up to six bases at a time present. And with six bases in the pore, each of which can be one of four possible bases, your A's, C's, G's and T's, this generates about 4,000 different possible states (4 to the 6th power, or 4,096). And so, it was really a technical tour de force to get this to work. But today, you can get this to work. You can buy these systems. And they do have high error rates, anywhere from 10-15%, and these errors tend to be biased. But you can get really, really long reads. The current record is a two-megabase, or two million base, read from a single piece of DNA. And again, as I mentioned, you can now directly sequence RNA, so you don't have to convert that to DNA. And there are even possibilities of sequencing proteins in the future, which will be very exciting. The other benefit of the Oxford nanopore system is that it's extremely portable. You can see this scientist is actually doing some sequencing out in the field. This looks like a jungle somewhere in the world. You can see the foliage in the back. So, there's a laptop running the sequencer. And the sequencer is actually the small device that the scientist is pipetting into. It's really the size of a small remote control, and it's powered by the computer that it's attached to. So, it means you can really take sequencing anywhere in the world. These systems have even been taken up to the International Space Station for sequencing in space. So, the last technology I'm gonna talk to you about today is Pacific Bio sequencing. And this is another sequencing by synthesis method. And it uses building blocks similar to traditional Sanger sequencing and Illumina sequencing, but they're slightly different. So, down here I'm showing you an A base from the Pacific Bio sequencer. And you'll notice first there's no block on the 3' hydroxyl end. This means that once this gets incorporated another base can get incorporated right away. 
And then the fluorescent group is actually attached to the phosphates. And these phosphates actually are removed once a base is incorporated. This means that this fluorescent group, once it gets incorporated into a DNA strand, will float away from the DNA strand. So, this means that we don't have to do separate chemistry to enable the reaction to proceed. This happens in real-time. And on the instrument, there's a really, really tiny array of wells into... in a plate. And these wells are only about 100 nanometers in height. At the bottom of each of these wells is a DNA polymerase. And what happens is a template molecule is bound to the polymerase, and then it starts incorporating those modified bases. And there's a camera at the bottom that's taking the video that's monitoring this reaction in real-time. So, remember, we have four different bases that have four different colors. And these bases are flowing in and out of this well really, really quickly. And so, you get a signal that's just kind of very noisy. You don't really see anything happening. But when the proper base is bound to the polymerase and matches the template, it actually dwells there for a certain amount of time. It's really a split second. But that split second gets recorded by the video, and you see this bump up in the fluorescent signal, in this case blue, which is the G base. And so, once the incorporation occurs, the phosphates leave, and the fluorescent group that was attached to the phosphates also leaves, resulting in the signal dropping back down to this... this background level. And this will continue until the next base that's correct binds, and you'll see another spike. And so, by looking at these spikes and signals across anywhere from 1-8 million wells at a time, we can build up DNA sequences. And PacBio sequencing generates long reads, anywhere from 1-200 kilobases in length. So, this is shorter than the nanopore sequencing, but it's still much longer than Illumina sequencing. 
It also has a high error rate, similar to Oxford nanopore sequencing, but its errors are random. This is actually a good thing. Because the errors are random, if you sample the same DNA molecule several times, you can generate a very accurate consensus sequence. And on the top is a model of what a PacBio library looks like. There's a DNA sequence, and the adapters, in green, actually cause the DNA to become this dumbbell shape. So, it's one circular structure. And when a polymerase binds and starts to sequence, it can actually go around and around this molecule many, many times, generating a series of reads that are all attached together that came from the same identical template. So, for instance, if we have a mutation present in a sample and wanted to detect this, this mutation should be present in all the copies. And so, if we match up all of the different reads together that came from the same molecule, we can see that the true mutation is actually detected. We see that in every single read. And then other random errors that show up don't match up with each other. And this allows us to generate a very accurate consensus that has an error rate of anywhere from 1 in 1,000 to 1 in 10,000, actually, so even exceeding raw Illumina sequencing accuracy. So, why do you want long reads? Because there are certain downsides. They're harder to prepare. In general, they cost more. But there are certain benefits. And one example is if you want to assemble a new genome that hasn't been sequenced before. And so, this is kind of like solving a puzzle that's been either split into 10,000 pieces or into 4 pieces. Illumina sequencing in this analogy is the 10,000 pieces, because you have many, many short reads that you have to stitch together, and this can be very, very difficult. However, if you have very long reads, and not that many of them, it's really easy to align them to rebuild this new genome that you want to assemble. 
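The consensus idea described above for PacBio circular reads, where random errors disagree with each other while the true sequence shows up in most passes, can be sketched as a per-position majority vote. This is a simplified illustration, not PacBio's actual algorithm: it assumes the passes are already aligned, are all the same length, and contain only substitution errors. The name `consensus` and the example reads are made up.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority vote at each position across aligned, equal-length passes."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*passes)
    )


# Five passes over the same molecule, each with one random error (made-up data).
passes = [
    "ACGTTGCA",
    "ACCTTGCA",
    "ACGTTGAA",
    "TCGTTGCA",
    "ACGTAGCA",
]

if __name__ == "__main__":
    print(consensus(passes))
```

Because each random error appears in only one pass, it is outvoted at its position, while a real mutation in the molecule would appear in every pass and survive the vote.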
It's also useful if you want to identify structural variations. So, for instance, in a lot of human cancer samples, there aren't just single base changes and mutations. There are some that are structural variations, where there are huge chunks of DNA that have been moved to different areas of the genome, here or there, or flipped around. And these can be very difficult to detect by short-read Illumina sequencing. But if you have really long-read sequencing, you can span these changes and then really easily identify the structural variations. Lastly, you might want to identify where mutations come from. For instance, in humans we get one set of our chromosomes from mom, one set from dad, and there can be mutations on either one. Sometimes mutations might be on the same set of chromosomes. Some might be on either one. And to be able to do this with Illumina sequencing, again, is very difficult, because the short reads make it hard to determine which chromosome they came from. But with long-read sequencing, it's much easier to do so. So, there are lots of applications for next-generation sequencing. I'm gonna talk about a couple of examples. One is prenatal testing. This used to be done using a very invasive amniocentesis sampling, which had certain rates of complications associated with it. But with next-generation sequencing, we can actually take a blood sample from the mom, nowhere near the fetus, and actually sequence all the DNA that's present in the blood. And the reason why this works is that fetal DNA actually makes its way into the mother’s bloodstream. And so, we can use that to detect any type of chromosomal abnormalities in a fetus. And the same thing happens with cancer and transplant rejection patients. So, in cancer, the cancer cells are constantly shedding their DNA out into the bloodstream. And because the cancer genome is gonna be slightly different than the normal genome, we can identify those through sequencing. 
And the same thing with transplant rejection. If a donated heart, for instance, is undergoing rejection, its cells are dying and shedding DNA. And those donor DNA sequences are very different from the recipient's DNA sequences, and we can detect that by sequencing that material. Another example of using next-gen sequencing is to detect pathogens. Pathogens have their own set of DNA and RNA that are very different from human sequences. And so, if there's a pathogen present in a human sample, we can sequence this human sample, ignore all the reads or the sequences that are human, and ask, what's left? And typically, if there's a pathogen present, we'll be able to detect its sequence in that sample. And lastly, in the context of cancer treatment, cancer again is driven by mutations. And there are lots of different therapies for cancer, but they're very specific for certain cancers. And so, by doing next-generation sequencing on cancer samples, we can determine the best types of treatments for patients and improve their outcomes. So, the future of sequencing is very exciting. Currently, it's 1000 dollars to sequence a genome, but in the next 5-10 years this cost will probably come down to 100 dollars. And it'll be interesting to see what happens once this cost is reached. For instance, will genome sequencing just become a routine part of your medical record? And if so, this would be a great resource for researchers trying to study genetic diseases. But at the same time, you have to understand and figure out ways of protecting a patient's genetic information, if all of this is out there. But it's gonna be very exciting to see what new applications are developed for next-gen sequencing as the technology continues to develop and mature. With that, I'd like to thank you for watching and joining us today.