Getting Started with Whole Genome Sequencing - #ResearchersAtWork Webinar Series

Captions
[Music] Hello everyone, thank you for taking the time to join our webinar. Today's topic is getting started with whole genome sequencing. Our webinars are designed to provide helpful guidelines and information that will keep you up to date on important areas of life sciences research. Today we'll begin with a brief history of genetics and an intro to next-generation sequencing. Next, we'll discuss important considerations for whole genome sequencing and go over the general workflow. Finally, we'll cover the important areas of NGS in genomics research, with a focus on the Human Genome Project.

Now I'd like to take a moment to introduce my colleague, Dr. Christopher [surname unclear], an NGS specialist at ABM. After completing his PhD at the University of British Columbia, where he studied stem cell regulation, he joined ABM to help scientists achieve their goals. With nearly 10 years of experience in research and experimental design, he can assist with nearly any project, from initial setup to post-sequencing data analysis. Alongside him you have me, Nancy Yao, a product specialist with over seven years of R&D experience in molecular biology and biochemistry. Our goal is to provide researchers with the support that ABM has been renowned for.

For those of you who may be new to ABM, I'd like to provide a bit of background. Our company, ABM, was founded in 2004 in Vancouver, Canada, and we have been working hard to highlight scientific discoveries in the fields of life sciences and drug development for 15 years. We're consistently one of the fastest-growing biotech companies in the region, and we have worked hard to become known as a reliable source for researchers all around the world. Due to our success, we have been able to expand our facilities, beginning with a branch in China in 2013 and a new facility in the United States in the near future. These expansions help put us in a position to better work with each of you and to provide world-class support wherever you may be. With our team of dedicated scientists, ABM is committed to
empowering researchers with the latest innovations for all of their scientific needs.

Before we jump into this topic, let's first go over genetics research and a brief history of genetics to better understand the impact of next-generation sequencing. In the mid-19th century, Gregor Mendel's experiments on pea plants demonstrated that many characteristics could be inherited, passed on from parents to their offspring. This led people to search for the materials that contributed to this inheritance. Many years later, Hugo de Vries proposed the existence of genes, with the inheritance of specific traits in the organism carried in particles. In 1913, Thomas Hunt Morgan and his students developed the first genetic map. The molecular structure of DNA was not discovered until 1953, when Rosalind Franklin's work was used by James Watson and Francis Crick to describe the double helix model of DNA. This finding led to Frederick Sanger developing a method to sequence DNA nearly 20 years later. Nowadays, next-generation sequencing is used worldwide, especially for large-scale automated DNA sequencing, allowing a scale of sequencing that Frederick Sanger could only have imagined.

When studying the roles of genes in development and disease, there are three levels that we can examine. First, we can study genes at the DNA level, investigating mutations to determine effects on phenotypes. Next, at the RNA level, by studying changes in gene expression. And finally, at the protein level, by investigating protein expression and activity. With the latest sequencing technologies, we can easily connect a gene's DNA sequence to its function within a cell or tissue. This is particularly important for trying to connect a phenotype to the actual genes that may be responsible.

Sanger sequencing has been the gold standard for many years and offers a number of advantages, including a relatively low cost for short DNA sequences and easy sample preparation. With Sanger sequencing you can often get the sequencing result the same day. On the other hand,
there are a number of important things to consider, including the fact that, while superb for low-throughput samples, Sanger sequencing may not be ideal for high-throughput sequencing, and it requires primer design, validation, or custom cloning. For heterozygous or variable samples, Sanger sequencing does not provide clear results, and when performing whole genome sequencing, the process is very labor-intensive and time-consuming.

Next-generation sequencing using the Illumina platform, by contrast, offers scientists a powerful option for high-throughput sequencing. Researchers can sequence an entire genome without having to do primer design, cloning, and troubleshooting. It's possible to sequence hundreds of samples and thousands of genes at one time, and NGS also offers the ability to study small variations between samples, including single nucleotide polymorphisms, even from heterozygous samples.

While NGS is an excellent tool for genomics research, there are many important things to consider. It requires very expensive equipment as well as technicians for sample preparation and for actually running the sequencer. Due to the reagent cost and the use of multi-million-dollar machines, NGS has a higher cost than traditional Sanger sequencing. Finally, while NGS can generate a lot of data, it's often challenging to sift through this data to find the information that you're looking for without a dedicated bioinformatics team. As an Illumina certified service provider, ABM uses a number of Illumina platforms, including the MiSeq, NextSeq, HiSeq, and very soon NovaSeq platforms.

The first genome was sequenced in 1977, from a bacteriophage, with Sanger sequencing, and many years passed before whole genome sequencing took off in the nineties, firstly in E. coli and then yeast. Next, genomes were released for multicellular eukaryotes, with models like C. elegans, Arabidopsis, fruit flies, and finally mice. The Human Genome Project was a 13-year effort
costing nearly three billion dollars. With Illumina sequencing and other NGS technologies, there has been an explosion in published studies using whole genome sequencing as the cost of sequencing has decreased and more researchers have realized the benefits of this technology. It is now easier than ever to go from a genomic DNA sample to a sequenced genome within only a number of weeks. Now I would like to hand things to Christopher, who will go over whole genome sequencing and a brief discussion of both plasmid verification and mitochondrial DNA sequencing.

Thank you, Nancy, for that lovely introduction. Next I'll take things over and go a bit into next-generation sequencing to give you an idea of how it works before moving on to whole genome sequencing. Next-generation sequencing can be used for a number of different approaches, whether it's sequencing an entire genome, as we'll discuss today, studying changes in gene expression, such as with RNA-seq, or doing metagenomic studies for environmental samples, as we'll cover in our next webinar. All next-generation sequencing follows the same basic workflow, where you input the starting material, in this case DNA, fragment the DNA to a uniform size, ligate sequencing adapters, and then perform sequencing, in this case on the Illumina sequencing platform.

Now I want to go over a couple of important terms to know for next-generation sequencing, and whole genome sequencing in particular. The first is the read. This is the sequence of nucleotides that is sequenced from the DNA molecule. In this example you can see that we have a double-stranded DNA molecule. When the sequencing is performed, it'll read the molecule along one strand. This is called single-end sequencing, where only a single strand of the DNA molecule is read. The alternative to this, paired-end sequencing, is where the fragment is read from both strands, from the 5 prime end to the 3 prime end. Next, when choosing what to pick for your project, you need to choose between single-end
sequencing versus paired-end. Single-end sequencing is more appropriate for studying changes in gene expression, such as with RNA-seq, where you're only interested in the total read count, so only single reads are required. Paired-end sequencing is more common for whole genome sequencing projects, particularly because the paired-end sequencing strategy is important for assembling the full sequence and helping with the alignment to the reference genome.

Next, reads can come in different lengths. The read length refers to the number of nucleotides that are sequenced per read, and I'm going to go over a couple of common read lengths for next-generation sequencing, and whole genome sequencing in particular. Here, 75 nucleotides is one of the shortest read lengths, which is suitable for prokaryotic or simpler genomes. 150 nucleotides is more common for eukaryotic or complex genomes. Read lengths in excess of 10,000 nucleotides are some of the longest reads available, and they are ideal for gapless assembly.

Now, I haven't actually discussed this yet, so you're probably wondering: what is gapless assembly? Whenever you do sequencing, you'll end up with a set of overlapping DNA sequences. In the example here you can see that there is one read going from left to right and another going from right to left. When you have enough of these reads that are overlapping, you would eventually form one contiguous sequence, or a contig. With any sequencing project you would then assemble all of these and have multiple contigs present from your sample. Whenever you have gaps between contigs, as I've shown here with this red box, this makes the alignment and eventual assembly of the molecule harder. So ideally what you want to achieve is a gapless assembly, with as few of these gaps as possible. This can be accomplished using shorter reads and deeper sequencing to try to cover every region of the genome. This is associated with some of the lowest costs for whole genome sequencing, so it's very
popular with most researchers. Next, you can choose to have longer reads, such as 10,000 nucleotides, which will generate longer contigs. This has a bit higher of a price, so it is less common. The best solution to try to get a gapless assembly, though, is to combine deep sequencing with shorter reads as well as long-read sequencing, and because this combines the first and second approaches, it has the highest overall cost.

Next, I want to go over reads and coverage a bit. In this graph you can see that there is a read, shown in light blue, which is mapping to a specific reference sequence. Whenever you perform sequencing, there will be many reads per region mapping to a reference sequence. In this specific example you can see the G nucleotide in the reference, and there are two reads in light blue above it that also sequence this nucleotide. Because there are two reads in this region, it would be described as 2x coverage. Now, not every nucleotide is sequenced with the same sequencing depth or coverage. For the C nucleotide over here, you can see that there are four reads that map to it, and then some regions have no coverage, such as this A nucleotide here, which has zero reads and would have 0x coverage. If you take these three nucleotides into consideration, they have varying coverage from 0x to 4x, which would eventually average out to about 2x overall.

But does coverage always vary between samples? Next, I'll go over two different samples and a depiction of the coverage that could be achieved with them. In sample A you can see that there are a number of reads that map to the reference genome. The graph below traces the coverage, ranging from 0x for zero reads up to 4x, representing four reads per nucleotide. Sample B has the exact same amount of reads, but you can see that the coverage per nucleotide varies widely compared to sample A. For this reason, you always want to ensure that you have sufficient sequencing depth to have somewhat uniform coverage even
between different samples. In most samples you want to sequence billions of bases to ensure that as much of the genome as possible gets sequenced. Typically, bigger genomes require more reads or greater sequencing depth. In the examples on the right side, you can see that different organisms, such as bacteria, mammals, or plants, typically have different genome sizes, ranging from 5 million base pairs all the way up to 17 billion base pairs, and for this you would typically have a different sequencing strategy. For bacteria you might sequence 1 billion base pairs, versus 90 billion base pairs for a mouse or rat, and up to 170 billion base pairs for this plant example. Each of these represents a different level of coverage based on the genome size and the amount of data sequenced. For the bacterial sample I described here, this would be 200x coverage; for mammals, this would represent about 30x coverage; and for plants, this would only represent 10x coverage. So it's important to calibrate the sequencing depth relative to the genome size for your species.

Next, I'm going to go over a couple of important considerations for whole genome sequencing. First, most of the material that's present in a cell or tissue isn't actually DNA. It's a combination of proteins, organelles, and lipid membranes, as well as RNA molecules. The actual genetic material typically only represents about one to two copies of the genome per cell. Because of this, you typically need a lot of genomes for sequencing, which would require thousands or even hundreds of thousands of cells. When you're going through the extraction process, though, you need to be careful, because there's always the risk of shearing or fragmenting the DNA, which can compromise the sequencing results. Finally, there's always the risk of RNA contamination, which can lead to degradation of the DNA before you've even begun to prepare your sample.

With sample preparation and extraction, you need to ensure that you have enough DNA that's present in order to
successfully prepare an actual sequencing library. You need to start with enough starting material: unlike with RNA-seq, as we discussed in the last webinar, you simply can't amplify the DNA if the amount is too low. Next, you want to be careful to perform the extraction without fragmenting the DNA. To do so, you want to avoid roughly handling your samples, so this means no vortexing or aggressive pipetting. Additionally, you want to avoid repeated freeze-thaw cycles and ensure that you're eluting your sample in nuclease-free ddH2O or in buffered solutions. Finally, you need to be careful to avoid RNA contamination. You can do this by carefully following the manufacturer's protocol if you're using a kit, or by performing an RNase treatment if there's still RNA present, but the most important factor is to maintain clean work areas, as the RNA contamination present in a sample can lead to fragmenting the DNA and degrading it, leaving you with something that's not suitable for sequencing.

Next, I'll go over a bit of the NGS workflow to give you a better understanding of the process from sample preparation to sequencing output. With any whole genome sequencing project, you need to ask what your goal is. Are you interested in de novo genome assembly? If so, you have to choose the sequencing platform you want to use. For instance, the Illumina platform can provide shorter reads, as I mentioned earlier, of 75 to 150 nucleotides, and provides the most affordable option. PacBio can provide you with much longer reads, such as 10,000 to 50,000 nucleotides. It's much more expensive and has more challenging sample prep, but can provide you with a more gapless assembly. Finally, you can combine both the Illumina and PacBio sequencing platforms to get the best possible result. This typically has the highest overall cost. If, instead of doing de novo sequencing, you're only focused on resequencing a sample for which there's a reference genome available for that species, the considerations are different
and you can typically proceed with the Illumina platform with shorter reads. With this, you would typically sequence the sample and then perform an alignment to the reference genome before beginning your analysis.

Next, I'm going over the NGS workflow a bit briefly for whole genome sequencing. Generally, we begin with sample prep as the first step before loading the samples onto the sequencer. Once this is done, there's an important step called cluster generation that happens prior to sequencing each individual nucleotide, before you get the results delivered.

First, you want to take your input material, assess its quality, and then proceed to library prep. The first question is determining whether or not the DNA in your sample is degraded. If it's not degraded, you can proceed right to library prep. However, if it is degraded, you generally have to stop here and try to obtain a new sample. Next, you take the DNA and fragment it into uniform sizes to make sure each fragment is equally likely to be sequenced. Next, you would ligate sequencing adapters, which are important for having the DNA sequence bind to the sequencer itself.

Following this step, when you load the sample onto the sequencer, there's a step called cluster generation, or bridge PCR. This is a process where the DNA molecule first binds to the sequencing flow cell and then forms a bridge to bind at the other end of the DNA molecule. Once this happens, there's an amplification step which generates another copy of the DNA molecule, as you can see in image 3. After repeated cycles of this, you end up with small clusters of identical DNA sequences, seen in panel 5, which has two clusters. When sequencing proceeds, the sequencer reads the DNA sequence from each of these clusters individually to get the final sequencing result. The sequencing process itself uses Illumina's technology for sequencing by synthesis, where fluorescently labelled nucleotides are individually added, one at a time, to the DNA molecule. Each
time a nucleotide is added, it gives a specific fluorescent emission, which is then imaged by the sequencer. The sequencer combines all of these images to decode the DNA sequence and determine the final nucleotide sequence that's present in the sample.

At every single stage of this process, though, quality control is essential, and there are a couple of different systems we can use to perform this QC. First, you want to assess the quality and quantity of your sample. The most basic thing you can do is to run a DNA gel to determine if your sample is degraded or at low concentration. In this example here, you can see the DNA ladder in the leftmost lane, but you can also see a relatively high molecular weight band in the second lane, indicating a large amount of DNA which is of large size as well. In the next lane, you can see that it is still high molecular weight, but there's a lower overall sample amount present. And in the final lane, you can see that there is a large smear, which generally indicates DNA degradation and that the sample should not proceed to library prep.

You can measure these quantities or concentrations using two different methods. The first is NanoDrop, which many labs have. The next is Qubit, which is less common but is what we use here at ABM. With NanoDrop, you can often have large variations between samples, or even within samples for the same measurement, as NanoDrop can measure both single-stranded and double-stranded DNA and RNA, as well as the salts that are present in your sample. Qubit, on the other hand, is much more accurate, and we prefer it for measuring the actual quantity of your sample that's present, to get a more accurate result.

Next, during library preparation, we use both the Agilent Bioanalyzer to assess fragmentation and qPCR to determine the success of adapter ligation. You use the Agilent Bioanalyzer to get an idea of the overall fragment size present in the sample. In this graph here, you can see on the left-hand side that
there is a peak, and on the right-hand side that there's a peak. These are effectively markers that give you an idea of small fragments on the left versus larger fragments on the right. When you actually run your sample, you would ideally get one uniform peak which is of a larger fragment size. This is highlighted in green here, and this would generally be a good result that would be able to proceed to eventual sequencing. On the other hand, if you have a very fragmented sample that doesn't have a uniform peak distribution, this would generally indicate sample degradation or something that isn't suitable for proceeding to sequencing.

As I mentioned earlier, you can use the Qubit system to tell how much nucleic acid is actually present in your sample, but this doesn't tell you how much of it is actually successfully prepared and can be sequenced by the machine. This is where qPCR comes in, because you can use it to determine how much of your library can actually be sequenced from that sample.

Next, once the sequencing is complete, you generally have to process the data. We would typically go from raw data to a special type of data output called FASTQ before data analysis. Raw data would represent the actual raw sequencing data from the sequencing machine itself. FASTQ would be the alphanumeric information for the sequence, with quality control info as well. This would then be followed by data analysis, which first begins with alignment to the reference genome.

Now you might actually ask, what does the data look like? For all next-generation sequencing, the data generally looks like this, which is a bunch of alphanumeric text with some sequence information buried in it, as well as quality control data. This red box here highlights the actual sequencing read from this particular sample.

But now you're probably asking, how do I use my data? As I mentioned, the first step in this is to align your sample to the reference genome that you're working with. In this process, you first have
to determine if there's a reference genome available for your species. If there is a reference genome available, there are a number of things that you can do, the first of which is single nucleotide polymorphism and short insertion or deletion calling. Next, you can also perform variant detection as well as impact annotation, to describe the impact of a given variant, whether or not it leads to a truncated protein, a premature translational start site, or another modification. You can also use this to perform phylogenetic analysis or even functional enrichment analysis, where you compare differences in gene families between multiple samples.

If you don't have a reference genome, though, this is a bit more challenging, but there are still a number of things that you can do. The first is to perform de novo assembly, which, depending on the genome size, can be quite challenging. Next, you would have to go through a process of gene prediction to try to identify where the genes are. You could then functionally annotate these. Finally, you could do gene family clustering. Once you've gone through this process of de novo assembly and have established a reference genome, you could then repeat the process and go through the SNP and short indel calling, variant detection, and other analyses I mentioned for when there's a reference genome available.

In this example, I'm going to go over SNP detection and indel calling very briefly. We have DNA that's been sequenced, and you can see that the reference sequence in dark blue indicates a G nucleotide, but for all of the reads that were mapped to this reference sequence, in light blue, you can see that there's a T nucleotide. This would indicate a single nucleotide polymorphism at that site in this sample. Alternatively, on the other end of this reference we have a G nucleotide as well, but none of the reads that map to it have that G nucleotide, despite overlapping that site. This would typically indicate a small deletion, which could lead
to truncated proteins, changes in amino acid coding, or, in some cases, silent mutations.

Finally, I'm going to go over other applications for whole genome sequencing using a similar workflow, including plasmid verification and mitochondrial DNA sequencing. With plasmid sequencing, many researchers have materials such as plasmids that are present in their lab, but they're unsure of the sequence. This can be challenging if you're trying to design a cloning project or work with material that you inherited from one of your colleagues. With plasmid sequencing, we would first begin by fragmenting the plasmid into small pieces. We would then ligate sequencing adapters, sequence it using the Illumina platform, and finally perform analysis, where we can compare the sequence we determined for the actual sample to the reference sequence, if this is available. This can then help you design your eventual cloning strategy for your project or perform your next set of experiments.

Mitochondrial DNA sequencing is quite similar, especially if you want to study changes in mitochondrial DNA in response to drug treatments, or mutations that occur in mitochondrial DNA sequences, for instance in cancer cells. This workflow is very similar, except it has an important first step where we would isolate mitochondrial DNA from the sample. Next, we would go through the fragmentation step, followed by adapter ligation and sequencing, and then we could compare the sample that was submitted to the reference sample, or, if you have multiple samples present, we can do analysis in that manner as well. Now I'd like to hand it back to Nancy to discuss a bit more of the Human Genome Project compared to sequencing today. Nancy?

[Music] Thank you, Christopher, for that excellent talk. Now I would like to spend a few moments discussing the jump we have made in DNA sequencing from the beginning of the Human Genome Project to the present day. The Human Genome Project began in 1990 and took more than a decade to complete, with hundreds of
scientists working together, at a cost of approximately three billion dollars, before it was completed in 2003. The primary goal of the project was to discover the complete set of human genes, providing a starting point for additional studies, and to release a complete DNA sequence of the human genome. Once complete, the project identified nearly 22,300 protein-coding genes within the 3.2 billion base pairs of the genome. The assembled genome represented not a single individual but a combined genome from a number of different subjects, to create a reference genome. Even today, it is important to use more than one individual sample when sequencing new genomes to try to generate a representative reference. Shortly after the project was completed, the first personal genome was sequenced, in 2007, belonging to the notable scientist Craig Venter, and with further advances in NGS it is now possible to sequence an individual's genome for only a few thousand dollars in a number of weeks, representing a marked advance from the immense amounts required for the Human Genome Project.

With that, I would like to conclude our webinar, but before we wrap up, we'd like to offer you a special promo code for 30% off bioinformatics analysis for whole genome sequencing, plasmid verification, and mitochondrial DNA sequencing, to help you take advantage of our dedicated bioinformatics team to analyze your data. ABM also offers a number of resources for researchers who are interested in learning more. Our website knowledge base has many articles covering a wide variety of topics. Our YouTube channel has many videos that cover the important content in an accessible way. And don't forget to check out our blog for the latest posts covering tools to help you succeed in your experiments. We also have technical support and customer service teams that can assist you at every stage of your project with ABM. If you have any questions about our materials or services, you can always reach out to us by email or
phone, and if you are ready to begin with your next-generation sequencing project, you can begin by sending an email to Christopher and our NGS team at NGS at ABM dot com. Our next webinar will cover next-generation sequencing for CRISPR studies as well as metagenomics. Thank you for joining us today. If you enjoyed our webinar, please like the video and subscribe to our channel to see more, and if you have any questions or comments, please feel free to leave them in the comments below. Thank you.
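
The coverage figures quoted in the talk (200x for bacteria, about 30x for mammals, 10x for the plant example) follow from dividing the number of bases sequenced by the genome size. The snippet below is a minimal sketch of that arithmetic, not ABM's own tooling. The function names are hypothetical, and the 3 billion base pair mammalian genome size is an assumption implied by the talk's "90 billion base pairs" and "30x" figures rather than a number stated by the speaker.

```python
def mean_coverage(bases_sequenced: float, genome_size_bp: float) -> float:
    """Average coverage = total bases sequenced / genome size in base pairs."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return bases_sequenced / genome_size_bp


def bases_required(genome_size_bp: float, target_coverage: float) -> float:
    """Bases to sequence in order to reach a target average coverage."""
    if genome_size_bp <= 0 or target_coverage < 0:
        raise ValueError("genome size must be positive and coverage non-negative")
    return genome_size_bp * target_coverage


# (bases sequenced, genome size) for the three examples from the talk.
examples = {
    "bacterium": (1e9, 5e6),
    "mouse or rat": (90e9, 3e9),   # genome size assumed, see note above
    "large plant": (170e9, 17e9),
}

for organism, (bases, genome_size) in examples.items():
    print(f"{organism}: {mean_coverage(bases, genome_size):.0f}x average coverage")
```

This gives only the average. As the sample A versus sample B comparison in the talk shows, per-nucleotide coverage can vary widely around the same average, so the result is best treated as a planning estimate.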
Info
Channel: Applied Biological Materials - abm
Views: 34,572
Keywords: abm, applied biological materials, Whole Genome Sequencing, Next Generation Sequencing, NGS, Intro to NGS, genomics, introduction to ngs, abmgood, webinar, #ResearchersAtWork
Id: vIdc0NLQ2ww
Length: 32min 15sec (1935 seconds)
Published: Thu Aug 29 2019