Daniel Wegmann: Tracing the spread of farming into Europe using ancient DNA

So welcome, all, to this SIB virtual computational biology seminar series. Today we have the pleasure to host Daniel Wegmann from the Department of Biology at the University of Fribourg, who is also a member of the SIB. Daniel got his PhD in computational population genetics in 2009 with Laurent Excoffier, who is also a member of the SIB, at the Institute of Ecology and Evolution of the University of Bern. From 2009 to 2011 he pursued his research with postdoctoral training with John Novembre at the Department of Ecology and Evolutionary Biology of the University of California in the US. Then Daniel came back to Europe and did a second postdoc at the School of Life Sciences at EPFL, here next door, in Jeff Jensen's lab. Since 2013 he has been an associate professor at the University of Fribourg, leading the Statistical and Computational Evolutionary Biology group, which is also affiliated with the Swiss Institute of Bioinformatics.

The primary aim of Daniel's group is to better understand the underlying evolutionary and ecological processes that have been shaping the diversity on our planet over the course of evolution. To achieve this, the group designs and evaluates new statistical and computational approaches to infer complex evolutionary histories, and for this they develop and apply machine learning algorithms. They then apply these approaches to the wealth of data currently being generated, mostly in collaboration with experimental groups, and they are further committed to making all their developments available to the scientific community by releasing easy-to-use software packages. I encourage you to have a look at the webpage of Daniel's group on the website of the University of Fribourg if you're interested in knowing more about the packages.

So today Daniel will tell us how bioinformatics and population genetics can help in tracing the spread of farming into Europe using ancient DNA. So Daniel, thanks again for accepting this invitation, and the floor
is yours.

Okay, this is easy. It works for the people online, right? You just tap on a mic. Okay, thank you very much for the introduction, and thanks for hosting me here. It's the first time I'm giving a seminar where half of the audience or so is actually virtually absent from the room, so it's kind of a weird feeling, but we're going to go through. As was said, I'm going to talk about the spread of farming in Europe. But before we start, I just have a picture here. This is for the people listening online, in case you're wondering who that person is that will be talking to you for the next 45 minutes or so: well, it's me, this guy here that you see a picture of.

Anyway, before we head into all the bioinformatics and population genetics, I'm going to start my talk by giving you a very ignorant and super high-level summary of what I think was probably the most dramatic social or cultural revolution in the history of Europe. In order to better understand that revolution, we're going to turn back the wheel of time about 12,000 years. That was a time at which most of the European continent was colonized by societies that were basing their search for food on foraging, that is, hunter-gatherers. By now we know that many of these different groups all over the continent were genetically quite distinct. Very little is known about that, but we know that there was a large diversity of different groups. They were pretty much all doing the same thing, though, which is running after animals and collecting various vegetables and fruits in the forest.

Then, about 11,000 years ago, something quite remarkable happened: we see the first farmers arising in the Middle East, more precisely in an area known as the Fertile Crescent. The change from foraging to farming is quite a dramatic change. It's not only the change in
terms of what you eat; with it comes a whole set of other things that shape culture. First of all, it's becoming sedentary. Farmers tend to live in one place; they're not mobile, they don't follow herds traveling around. That is quite a change if you think about it. It probably even changes the way we look at property, and a lot of questions like that. On top of it, becoming a farmer means adopting a lot of technical things, a lot of new tools, and working with specific domesticated animals and plants. So these cultures, farming culture and hunter-gathering culture, are definitely very, very different.

What we see then is that after farming arose, it didn't take that much time to spread this way of life throughout the continent. By around 8,500 years ago the first farmers were appearing on the European continent, especially down here in the Aegean region, and establishing their first societies there. Then, within a few thousand more years, all of the continent was pretty much covered by farming societies, and the last hunter-gatherers kind of disappeared around that time.

So one of the questions we're interested in is how this cultural revolution actually played out. There are really two different kinds of theories you can put forward about how this could have happened. For the first one, just turn back the wheel of time a little bit again, so we have farmers appearing in the Middle East. The first hypothesis states that the hunter-gatherers themselves were kind of jealous of this awesome lifestyle, tried to copy that way of life, and became farmers themselves. If that hypothesis is true, then modern Europeans are direct descendants of the foragers, the hunter-gatherers, that lived in the same
place. These people became farmers: farming spread to Europe as an idea, as a way of life, and not necessarily by people moving. The alternative hypothesis is that it was instead the farmers themselves who colonized most of Europe, displacing most of the hunter-gathering people (of course apart from this tiny little area up here in Brittany), and pretty much replaced the hunter-gathering societies. If that hypothesis were true, then we as modern Europeans are direct descendants of the first farmers from the Fertile Crescent, and not of the hunter-gathering societies that were living in this area before. Of course, you can imagine every kind of intermediate scenario between these extremes, and one of the big questions of the last decades has been trying to figure out how this actually played out.

Looking at existing ancient DNA, we can get a glimpse that this change in lifestyle was most likely driven by colonization through people. What I show here is an admixture analysis. This is an analysis in which, for all the individuals included, you try to estimate the proportion of the genome they inherited from a couple of groups. Here is an analysis we ran with two groups: it tries to partition individuals into two groups and, for each individual, to estimate how much of their genome they inherited from each of these groups. This is all the ancient DNA data that existed before we started the study. In the lower part here we have hunter-gatherer individuals, followed by Early Neolithic people, so the first farmers that actually colonized Europe, followed by samples taken much later in time, in the Middle and Late Neolithic. What we can see is that genetically it's super easy to distinguish hunter-gatherers from the early farmers; they are really completely different groups, suggesting therefore that it was really
people moving into Europe and bringing agriculture with them, and that it was not the hunter-gatherers picking up the lifestyle of being a farmer. However, very interestingly, in later periods there's kind of a resurgence of hunter-gatherer ancestry, suggesting that before hunter-gatherers disappeared there was quite some gene flow between the two groups, so that modern-day Europeans actually have ancestors from both groups.

However, even if we now know that the spread of people was really important, it was still pretty much unclear where exactly these farmers came from. We know that farming was developed in the Near East, but the question was: were these early farmers that we see in Europe actually the descendants of these people in the Near East or not? One major reason why we didn't know this is that the preservation of DNA in ancient bones is strongly affected by climate. The cooler the climate, the better it's preserved, and it's just a very difficult task to obtain ancient DNA from warmer regions. So when we started the study, the southernmost samples in Europe were all from the Balkans; there weren't any samples available further south. Oops, okay, for some reason I'm going back a slide here.

So what we then did is we started out by analyzing samples from the Aegean region as well as from Anatolia, trying to see whether these samples were in some way related to the modern samples that we have in Europe. This is work that was mostly led by Joachim Burger and his group in Mainz. They have one of these super sterile clean rooms to work with ancient DNA. The issue is that if you touch an ancient bone, there's probably more of your DNA on that bone than original DNA in the bone, so it's a really challenging task, and these guys were working really hard to get libraries out
of that. This is then the data that we got in our lab and that we were trying to analyze to answer some of these questions.

Before I really delve into how we did some of these analyses, I want to walk you through some of the major characteristics of ancient DNA. These characteristics really influence the way we can deal with this data from a bioinformatics point of view. The first thing is that if you work with ancient DNA, most of the datasets have really low coverage, or sequencing depth. There are multiple reasons for this. The first one is that we generally get extremely low DNA content out of the bones. What I show you here on the right is for the five best samples that we analyzed, I should say: they screened almost a hundred samples and then identified the five most promising for sequencing. On the x-axis is the amount of endogenous DNA, that is, the fraction of the DNA retrieved that is actually human, most likely from the person we want to analyze. All the other DNA comes from soil bacteria and all those things; it gets sequenced, but it's not human. In the case of this Rev5 sample, for instance, more than 80% of what you sequence is not human. If that were the only problem, then to get the same sequencing depth as for a modern person you would just have to sequence five times more, right?

But there are more challenges associated with this. The other issue is that because DNA content is very low, when you generate the sequencing library you often get the same fragment many times: you have to amplify the DNA using PCR, and you end up sequencing the same fragment multiple times. What I show on the y-axis for each of these samples is the number of unique fragments we're actually getting. Down here is a bunch of blanks that were run in the lab. So if you
prepare a library in this ultra-sterile clean room, you still sequence about 10 to the 5, so 100,000, unique fragments in those blank libraries anyway. It's very hard to be so clean as to have no DNA at all in your library. If you compare that to the best sample we have in this respect, Bar8, it's about 10 to the 7. That's only about two orders of magnitude more human DNA in a library prepared from an actual sample compared to one that was prepared from what we think was pure water at the start. So we get so little DNA that it's really hard to have a lot of different fragments, and as a result, again, you have to invest quite a lot of money, quite a lot of sequencing lanes, to push up coverage.

The next issue is that because the DNA was lying in the ground for a very long time, it's usually extremely fragmented. This here shows the distribution of fragment lengths that we got for three of our samples. Two of them were sequenced as paired-end libraries with two times 150 base pairs, so we can estimate the fragment lengths exactly. What you see is that even though we could have seen fragments of up to 300 base pairs, the vast majority of fragments were around 40 base pairs in length. So the DNA is already extremely fragmented in the ground, and even if you invest a lot in two times 150 base pair sequencing, you're anyway just getting something around 40 base pairs per read. As a result, the later samples were just sequenced with single-end 100 base pair reads, because it wouldn't make much of a difference.

So, coming back: not only do the majority of reads you sequence not come from the sample you're interested in, those that are from the sample are probably half or even less the length of the other ones, further lowering the amount of interesting DNA you can get out by this
approach.

The last issue that is common when dealing with ancient DNA is the presence of so-called post-mortem damage. Post-mortem damage consists of chemical modifications that occur over time to the DNA molecules and result in certain changes, or mutations, that are not reflective of the person's DNA or the diversity we're interested in, but are just errors that accumulated over the course of time. The most common form of post-mortem damage is the deamination of cytosine. If a cytosine is deaminated, it turns into uracil, and if you do a PCR, that turns into thymine. As a result, you're going to see a lot of Ts instead of Cs when you look at these ancient fragments. Clearly, if we were to believe that these are all real, we would make a lot of mistakes in our genotyping and, as a result, might get completely the wrong impression of the relationships among people, or even of the diversity. It's important to note that all these C-to-T changes appear as G-to-A changes on the opposite strand when you map, so these are the two types of errors we see.

Luckily, these errors do not just occur randomly. I mean, they do occur randomly in the fragments, but what we see is that there's still a certain distribution that comes with the position in the read. What I show here in the picture are the distributions that we inferred for two of our samples. The way we infer these numbers is simply by mapping all the reads against the human reference and then counting how frequently we see a T in one of these reads where the human reference has a C. Now, certainly you do expect real differences between the human you look at and the reference; you do not expect them to always be exactly like the reference. So there's a certain fraction of sites at which you do
expect a true difference, in which the individual has a T but the reference has a C. That's something like one in a thousand or one in 4,000 base pairs that you would actually expect. However, what we see here, with the position from the beginning of the read on the x-axis, is that at the beginning of a read, in certain cases more than 50% of the time that a base maps against a C in the reference, we actually see a T. That's exactly the post-mortem damage I was mentioning before.

The reason why this really accumulates at the beginning of the read is that this chemical modification is much more likely to occur if the DNA is single-stranded. The DNA breaks up into fragments in the ground, losing some nucleotides at the ends, so we get sticky ends with single-stranded DNA, and that's where these chemical modifications appear. This is known because we can replicate that process in the lab: we can just expose DNA to a lot of weird influences and replicate the process, so the chemical way by which this happens is quite well understood.

You see that this is a common pattern in all of the libraries we looked at. There are different types of libraries you can use for ancient DNA. We can have paired-end libraries, here in black for Rev5. You can treat them with a chemical that turns the uracil back into cytosine before sequencing; you apply the chemical first, before doing PCR, and like that you lower the amount of these changes. They're still somewhat present, but to a much lower degree. The downside is that you lose a lot of material, so that's why we stopped doing this. And then for Bar8 we used single-end sequencing. There it's interesting to see that because the reads do not necessarily reach the end of the fragment, they are affected by post-mortem damage at the beginning, but all the reads
that do not reach the end of the fragment are not affected on the other side. That's why you see here in the lower panel, in blue, those reads that are exactly as long as the number of cycles we used on the Illumina machine. If you're sequencing 100 base pairs and your read is exactly 100 base pairs long, then most likely the read did not reach the end of the fragment, and if you look at those, we see a reduced rate. But other than that, this is basically in all the data we analyzed.

So these are the three things: we generally have a very low amount of data, and we have these errors. And now we are interested in working with this data to infer genetic diversity in these ancient samples and compare it to modern samples. Low sequencing depth just means ambiguous genotyping. That's something we're all aware of, and that's why people try to increase depth. But what it means for us is that we cannot simply increase depth. We have samples with less than 1x sequencing depth, and they were already sequenced on many different lanes of Illumina, so money is not really going to solve this problem in the short term. If you just want to say, oh, let's go up to 10x, it becomes unaffordable even for super-rich labs. So we have to deal with the fact that some of these samples have a depth of around 1x.

Clearly, if you're interested in inferring the genetic diversity of a sample that has just 1x coverage, then first calling genotypes and then just counting all the sites that are heterozygous seems to be a pretty stupid idea, right? So we have to do this in a somewhat different way. In population genetics, in recent years a lot of tools have been proposed to infer population genetic parameters or statistics without calling genotypes. This is the whole idea that we can invest money in more biological samples rather than
just in technical replicates that we do for a single sample. One of these commonly used tools is ANGSD, for Analysis of Next Generation Sequencing Data. It offers an algorithm that infers allele frequency distributions, or site frequency spectra (SFS), without ever calling genotypes, and you can show that you can infer these distributions even if you have a coverage of 1 to 2x, for example.

However, ANGSD is not really applicable to our data, for two reasons. First, it requires a priori knowledge of which are the major and minor alleles in the population. Now, clearly we have a lot of knowledge about, say, human populations from Europe, and we could know which ones are most likely going to be the major and minor alleles. But we're very careful in the sense that we don't want to bias the genotypes of our ancient samples through modern knowledge; otherwise they're just going to appear more modern than they probably are. So we want a way to infer diversity that does not require knowledge from modern populations. The second problem is that ANGSD does not handle post-mortem damage, and if we don't handle that, we will of course overestimate diversity. So we needed a different approach.

This is where we came in. Together with a postdoc in the group, Athanasios Kousathanas, and a PhD student, Vivian Link, we developed methods to analyze this data and infer genetic diversity. Now, since this is a bioinformatics seminar series, I'm quite excited because I can actually show you some equations, and I'm sure that at least a fraction of the audience will really appreciate that. That's usually not the case if you give talks to different audiences, so I can benefit from that by having a couple of slides with a few equations. So bear with me, because at least I'm excited about it.

The model we proposed to infer genetic diversity is relatively simple. We're just interested in
the genetic diversity of a single individual. You can call this heterozygosity; the population genetics statistic for that would be theta, which is given here by 2 T mu. So what is this 2 T mu? T is the time to the most recent common ancestor of the two sequences of a single individual. If these are the two sequences in one individual, they do have a common ancestor, let's assume T generations ago. The total length of the tree connecting the two is 2 times this time, and if you multiply by the mutation rate mu, we get the number of mutations accumulating on those branches. So that's the theta that we're interested in estimating. We will never be able to estimate either the mutation rate or T separately. However, it clearly puts a model in place with which we can explain the genetic diversity of single individuals.

In addition, we said we don't want to use any modern information, but we are going to use the sequence composition in the region as information about what type of alleles to expect. If you're in a GC-rich region, you're more likely to see G and C alleles than A and T. So the neighboring bases do tell you something about the probability of seeing a particular allele. That's why we have regional allele frequencies pi here. These frequencies obviously sum to one, and they give us some information about what type of alleles and mutations to expect in the particular region for which we want to infer genetic diversity.

If we have these two, we can build an easy model. Here I just put up the likelihood for this. It's a likelihood in which we have a product across sites, assuming that mutations are independent between sites. That's an assumption commonly made, and I think it also makes sense: just because my granddad had a mutation at one locus does not prevent my grandmother from having a mutation at another locus nearby, right? So we can assume that mutations are independent
between sites, and we can take the product across sites. But then, for each site, because we do not know the genotype, we're going to integrate over the genotype and just ask: what's the probability of the sequencing data d_i given the genotype that I integrate over, times the probability of the genotype given the theta and pi that I have? This way the genotype is a latent variable that I can integrate out, and I'm interested in estimating both theta and pi without having to know anything about the genotype in between. That's the basic idea.

In the next few slides I'm just going to walk you through the models that we employ for these two terms: first, what we call the emission probability, that is, the probability of the sequencing data given the underlying genotype, and then the probability of the genotype given the parameters we want to know.

For the substitution model, that is, the probability of the genotype given pi and theta, we just employed the classic Felsenstein substitution model, so the famous Felsenstein 1981 model, which is very commonly used in phylogenetics. It has some advantages over other models, namely that it allows for recurrent mutations at the same site. The model is very simple in that regard. The formula may look a little bit complicated, but if I walk you through it, you'll see it's actually a very simple model. What we have here is the probability of a genotype given the frequencies pi and the heterozygosity theta. We can see theta as kind of the overall mutation rate in that sequence. If we look at the lower line here, if the two nucleotides of a genotype are different, so k not equal to l, then we do require at least one mutation on the tree. The probability that there's at least one mutation is given by one minus e to the power of minus the rate, and then, if that's the case, we just
multiply by the probability that, if you follow the tree, the first nucleotide on one side is of type k and the last nucleotide on the other side is of type l. So it gives you pi_k times pi_l times the probability of at least one mutation. If the two nucleotides are equal, that is, it's a monomorphic site, then there are two options. One option is that there is no mutation, so that would be e to the power of minus theta. The other is that there is at least one mutation, but the last nucleotide to which you mutate is the same as the first one. So this is why I multiply this again by pi_k. So it's a very simple model that allows for recurrent mutation and has just the parameters theta and pi that we require.

The emission probabilities of our model are best explained with a little example. Consider a case in which one read has the data d_ij at one position; this is just one position in the read. We see here a nucleotide being read as an A, and we ask: what's the probability that the read turns out to be an A if, and for simplicity I assume a haploid case here, the genotype is just a T? In this haploid case, this means that a sequencing error happened at this location. A sequencing error happens with probability epsilon_j, and because a sequencing error can turn a T into an A, a C or a G, it's in just one third of the cases that it actually turns out to be an A. This is just for the haploid case; you can spin this further for the diploid case quite easily.

I just use this example to show where post-mortem damage now comes in. If we again have the haploid case, in which the underlying genotype is a C but the data that we read is a T, then there is now more than one way by which you can end up having a T in your sequence if the underlying genotype was a C. One of the
ways is easy: it's again a sequencing error. That's the second term here: epsilon over three times the probability that there is no post-mortem damage, so that the DNA I get from the bone still contains a C and the T appears by the machine making a mistake. Or I have the probability D that I do have post-mortem damage, so that the C in the DNA molecule was actually turned into a T before I even read it, and the machine is not making any mistake and everything is fine, so I have one minus the probability of a sequencing error. You can spin this further and develop it for all kinds of genotypes that you want; we did this for all the diploid genotypes that exist. If we have this, we can build a model in which we take these emission probabilities into account when integrating over genotypes.

The question is from where we get the probabilities of post-mortem damage, so D here. Well, for this we can use the exact same curves that I showed you before. We see a pattern of exponential decay, so we just fit a generalized exponential decay function to it. We estimate this a priori from the data and then assume it to be known. Once we have figured out these probabilities, we can easily assume that the individual reads at one location are independent, so we just have to take the product across all the individual data points that we get for the same site to get the likelihood.

So how does this actually perform? Does this work? Can we estimate theta? What sequencing depth do we require? These are some results we got from simulations. We first simulated data in a one-megabase window for different sequencing depths, from 0.2 to 3.2, and what you can see is that as
soon as coverage is a little over 1, we seem to get quite accurate estimates if theta is at least 10 to the minus 3, which is the average in the human genome. So we seem to get quite accurate estimates even if coverage is only slightly over 1x, if we combine data from about one megabase. Similarly, we can simulate 1x coverage and then ask how the window size affects things, that is, how many sites are required to accurately infer heterozygosity at a certain coverage. I should say this is not the coverage at each site; it's the average coverage we simulate. In the middle, again, we see that as the window size approaches about one megabase, we get quite accurate inference. If we have much less data than that, there is some bias that we see. So generally, 1x coverage and about one megabase, one million sites, are required to do that.

Interestingly, the window size and the sequencing depth are two factors that seem to work jointly here: you can either have larger windows and less coverage, or more coverage and smaller windows. One way to plot this is here in the last box on the right, where I plot the amount of error we make in the estimation for all combinations of window size and coverage. This is from simulations, so it's kind of a density plot, but what you should see is that for all combinations of window size and coverage that result in about the same number of sites, the same amount of individual read information being there, you get the same performance. That means that whenever we are interested in inferring genetic diversity from an exceptionally low-coverage sample, we can just increase the window size. We may not be able to infer heterozygosity on a one-megabase window scale, but on a ten-megabase scale, say, it should just be fine. If we have a sample with a lot more data, then we might actually zoom in and get estimates on a much finer scale.
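As a rough illustration of the model described up to this point in the talk, the sketch below combines the genotype probability under the Felsenstein-style substitution model with emission probabilities that include sequencing error and C-to-T post-mortem damage, and integrates the genotype out at every site. It is a minimal sketch under stated assumptions, not the group's software: all function names are hypothetical, pi is held fixed rather than estimated, damage is a single rate instead of a position-dependent decay curve, G-to-A damage is omitted, and the simple grid search is a stand-in for whatever optimizer the actual method uses.

```python
import math
import random

BASES = "ACGT"

def genotype_prior(k, l, theta, pi):
    """P(ordered genotype kl | theta, pi), following the model described in the talk."""
    p_mut = 1.0 - math.exp(-theta)            # at least one mutation on the tree
    if k == l:                                # no mutation, or mutated back to k
        return pi[k] * (math.exp(-theta) + pi[k] * p_mut)
    return pi[k] * pi[l] * p_mut

def base_emission(d, allele, eps, pmd=0.0):
    """P(read base d | true allele); eps = sequencing error, pmd = C->T damage rate."""
    def seq(obs, true):                       # plain sequencing-error model
        return 1.0 - eps if obs == true else eps / 3.0
    if allele == "C":                         # the C may have turned into a T before sequencing
        return (1.0 - pmd) * seq(d, "C") + pmd * seq(d, "T")
    return seq(d, allele)

def site_genotype_likelihoods(reads, eps, pmd=0.0):
    """P(all reads at one site | genotype) for the 16 ordered genotypes."""
    liks = {}
    for k in BASES:
        for l in BASES:
            lik = 1.0
            for d in reads:                   # reads at a site are treated as independent
                lik *= 0.5 * (base_emission(d, k, eps, pmd) + base_emission(d, l, eps, pmd))
            liks[(k, l)] = lik
    return liks

def log_likelihood(theta, site_liks, pi):
    """Log-likelihood of theta with the genotype summed out at every site (assumes eps > 0)."""
    priors = {(k, l): genotype_prior(k, l, theta, pi) for k in BASES for l in BASES}
    return sum(math.log(sum(priors[g] * lik[g] for g in priors)) for lik in site_liks)

def estimate_theta(sites, pi, eps, pmd=0.0):
    """Grid search over theta from 1e-5 to 1e-1 (an assumption made for this sketch)."""
    grid = [10 ** (e / 4.0) for e in range(-20, -3)]
    site_liks = [site_genotype_likelihoods(r, eps, pmd) for r in sites]
    return max(grid, key=lambda t: log_likelihood(t, site_liks, pi))

def poisson(lam, rng):
    """Knuth's Poisson sampler, used for the per-site read depth."""
    threshold, count, p = math.exp(-lam), 0, rng.random()
    while p > threshold:
        count += 1
        p *= rng.random()
    return count

def simulate_window(n_sites, theta, depth, eps, pmd, pi, rng):
    """Simulate reads for one window under the same model."""
    genos = [(k, l) for k in BASES for l in BASES]
    weights = [genotype_prior(k, l, theta, pi) for k, l in genos]
    sites = []
    for _site in range(n_sites):
        geno = rng.choices(genos, weights)[0]
        reads = []
        for _read in range(poisson(depth, rng)):
            base = rng.choice(geno)           # sample one of the two alleles
            if base == "C" and rng.random() < pmd:
                base = "T"                    # post-mortem damage
            if rng.random() < eps:
                base = rng.choice([b for b in BASES if b != base])
            reads.append(base)
        sites.append(reads)
    return sites

if __name__ == "__main__":
    rng = random.Random(1)
    pi = {b: 0.25 for b in BASES}             # uniform base frequencies, for illustration only
    for n_sites, depth in [(5000, 1.0), (2500, 2.0)]:
        sites = simulate_window(n_sites, 0.01, depth, 0.01, 0.1, pi, rng)
        print(n_sites, depth, estimate_theta(sites, pi, eps=0.01, pmd=0.1))
```

The two simulated settings hold the product of window size and average depth constant, which is the trade-off discussed above; the windows here are far smaller than the one-megabase windows in the talk, so the printed estimates should be read as a toy demonstration only.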
the performance of this is affected by the actual diversity in the sample the higher the diversity so the higher the theta here the less data is required for proper inference I think that makes kind of sense if in a one megabase window you have a single heterozygous site you need quite a lot of data to confidently actually see that however if every second site is heterozygous you don't need that much data to realize that there's plenty of diversity in that window so we see this relationship and that means that for very low theta values like 10 to the minus 4 we need slightly higher coverage maybe a coverage of 3x if you are using one megabase windows or actually you might just use larger windows if we estimate theta or we expect theta to be relatively small so that seems to work quite fine we were quite excited and then just took our three Greek samples here so these samples are up to about 10,000 years old we had sequencing depth between 0.8x for the worst and 3.5x for the best sample and then we ran our estimator so these are the results we go and compare them to our expectation that the average human genetic diversity is about 10 to the minus 3 as I said before now you see we get super different estimates we get this blue guy up here which has something like 10 to the minus 2 or even more that means one in a hundred base pairs seems to be heterozygous while this red guy is kind of jumping up and down between two values one is one in a hundred thousand and one is I don't know something like one in 10 billion sites being heterozygous now that should appear to be a little bit strange right I mean in the end these are all humans it's just ten thousand years ago even if you go nowadays to the most distinct populations that you can sample and look at them it's hard to find anything like that in diversity so there's a real question why are these estimates so different we don't expect them to be so different in the end they even come from we think from
the same population of the same region at the same time right pretty much the first farmers arriving in the Aegean why should the first farmers differ so much in their genetic diversity ok so the answer is actually that these are not very good estimates but there's something else in the data that prevents us from getting proper estimates and what we figured out quite quickly is that all these tools that try to estimate diversity without calling genotypes require accurate information about sequencing errors at a particular site we do take quality scores into account right and of course if these quality scores are completely off we're gonna get a completely wrong diversity so the point that I want to make here is that we absolutely do need to recalibrate quality scores if we want to estimate genetic diversity if the sequencing machine tells you I'm very very sure that this is the base you're gonna see but it's actually wrong you're gonna have very biased estimates in terms of diversity and as we will see we can explain these differences solely by the issue with recalibration the question is just how do you recalibrate data if you have this much data how do you really recalibrate that if you don't want to use modern information most current recalibration tools require information about which sites are actually polymorphic in the genome which one is the reference site and so forth and we don't want to make our ancient samples more modern just by imposing all that information so we were therefore thinking hard about how could we devise a recalibration strategy that does not require any sort of reference information okay now here's our solution and we're relatively excited about that because it also works of course for all the species for which you do not have that information available anyways right if
you're working on a non-model organism and you're the first one to sequence it there is no dbSNP around there's no perfect reference genome so how are you going to recalibrate with GATK or BQSR tools right you don't know well I think this is one way by which it's feasible so I'll kind of walk you through the thing is that we make a particular assumption namely that we can get one part of the genome for which we may not know the sequence but we do know that it should be monomorphic all over okay and one of these types of sequences is for instance X-linked data in males so it happens that most of our samples were actually males by pure chance because in the screening it was pretty much 50/50 regarding the sex but for some reason we got mostly males in this set and so if we just look at the X-linked data then for all these sites we don't know what their allele was on the X but we know that there was no more than one copy of the X so we should never have a heterozygous site now we can use this information in the following way so we're going to assume that there is some sort of transformation that transforms the machine quality score that we get into the true error rate and this transformation we put down here in these lower equations in two steps so first we're gonna say that there is this eta here okay that is going to be a linear combination of what we call q which is some sort of information that we can turn into what we call this eta which is then transformed with a logistic function into an actual error rate so the logistic function is just there to bound the error rate between zero and one you can't have an error rate below zero or above one right so this is why we use this and the linear function here is just a combination of information that we get for every base such as for instance the actual quality the machine tells me but also the position in the read and for the position just before what type of nucleotide that was and all these
kinds of information that BQSR also uses but we use this here in a linear function okay once we have that we can build a likelihood function for our haploid data this is the first equation here where we express the probability of the data given these coefficients beta of our linear model that we want to estimate okay so for each site we're gonna first integrate over the unknown hidden genotype because we do not know the genotype but then if I pretend let's say the genotype is A then I'm gonna go through all reads that cover that site and I say what's the probability of seeing this data here given that the genotype is A the error rates that I have for all my bases okay and all the PMD probabilities that I have for my bases and then the product of this is multiplied by the probability that the genotype is actually A and then the sum over all possible genotypes is just going to integrate out all the genotypes and then I want to know which are the coefficients beta of my linear model that maximize the likelihood okay so I try to find the transformation of quality scores that best explains the data in a location at which I do not expect to see any polymorphic sites but without actually making any assumption about what the reference base is okay so this is the basic idea and we use an EM algorithm to work that out it works relatively fast does it work so this is again from simulated data so here we simulated data where we distorted the qualities using four coefficients the first one is the coefficient using the actual quality score of the machine one where we take the square of that that's just if you have nonlinear effects you can put that in with the square and the position as well as the position squared okay in practice we also add all ten different contexts that you can imagine as additional factors okay and we then run our tools or our toolbox to actually estimate these coefficients from
simulated data again within a one megabase window for different sequencing depths and it seems we need a little bit more data than for theta but if you have something like 2x coverage in a one megabase window it seems to be possible to estimate these coefficients quite nicely or if you have lower depth at 1x this is data not shown but if you have only 1x you need about 10 megabases of data then you're on the safe side and you can actually estimate these coefficients very very reliably ok so we applied this to our Greek samples and this is what we get so this is the quality transformation table where on the x axis we have the quality reported by the machine and on the y axis the one that we get after our recalibration and the colors here represent density and what you see is that for all these three samples we actually find that the qualities given by the machine are too optimistic the machine always tells you for instance here quality score 40 but then this line here the dashed line is where both qualities are the same and you see all these points are actually below so all the high quality sites in particular are actually not of as high quality as the machine suggests ok so we needed to recalibrate that you also see that the pattern is quite complex it's not just kind of a line that's also because there are lots of different libraries sequenced on many different lanes and even different machines and of course for each library and each lane we had to recalibrate on their own ok so you cannot just recalibrate putting all the data together but we recalibrated for each library and each run independently that's why you get these complicated pictures but anyway we see a difference in the quality scores and if we then use these recalibrated quality scores and now infer genetic diversity for our samples things suddenly start to make a lot of sense ok so here is
just an example we of course downloaded all the other available ancient data as well and started to play with this especially all the males because we know this works for males ok and I should say we have some strategies for females now we can use ultraconserved elements we also tried mtDNA it's just pieces where you believe there's no polymorphism right and that seems to work so far we have mostly males so we're quite happy with this we still have to experiment a little bit how to deal with females but we're quite confident that ultraconserved elements should actually be sufficient data to run the same strategy there too so here I show you in the first panel a trace of heterozygosity or diversity along the genome it's just a piece of chromosome one for different samples we have TSI1 and GBR2 so these are just two male samples from the 1000 Genomes Project and in comparison we have KK1 and Bichon so these are two hunter-gatherer samples Bichon is from a Swiss cave and this one here KK1 is from the Caucasus region okay so what we see is the trace here along the genome interestingly the two modern samples almost fall on top of each other right it's not only true that if you take GBR a British sample and TSI from Tuscany a Tuscan sample they have roughly the same genome-wide average in their heterozygosity shown as a dashed line but even if you just follow one megabase by one megabase through the genome they're almost identical these two samples in their diversity however very interestingly the hunter-gatherer samples here are a little bit different they have genome-wide slightly lower heterozygosity which you would expect given the lower population size and their traces look a little bit different too we then ran this on a bunch of samples so we have here three modern samples from Italy and again from Britain we
have BR2 it's a Bronze Age sample we have the two hunter-gatherers then we have three African modern samples and an ancient African sample here the Mota sample what we find this is showing the distribution of heterozygosity in the windows we see that the Bronze Age sample seems to have pretty much the same diversity as a modern European that's kind of expected the hunter-gatherers have slightly lower diversity we see that Africans have a higher diversity than Europeans that's well characterized interestingly this ancient African also exceeds modern Europeans in diversity we can then ask how correlated are the distributions of diversity in the genome so if you just go along the genome and have the trace that I showed here in the first picture and we then do a correlation among those so here it is a Spearman correlation and then take the pairwise correlations shown here as a tree what we see is that we have all the modern African samples the modern Europeans are much closer together quite nicely basal to the Africans we have the Mota sample here we have the Bronze Age sample basal to the modern Europeans suggesting that they are not only very similar to modern samples in their overall diversity but also in their distribution of diversity in the genome however the two hunter-gatherer samples are really basal to that so they have a very very different pattern of diversity in the genome and interestingly even between them the difference is about as big as between any of them and any of the others suggesting again that these two hunter-gatherers actually come from completely different groups it's not just that they are two old samples they are representing completely different groups it's maybe not surprising one is from Switzerland one from the Caucasus region it's quite far apart right but it's kind of interesting we thought okay so this was kind of our confirmation that things seem to work quite nicely of course we can use the very same kind of model that
we developed for this heterozygosity calling also for genotype calling so we also implemented a genotype caller for ancient DNA that takes post-mortem damage and everything into account if we do that using our MLE caller we see that we can easily outperform GATK even if GATK is actually using mapDamage 2 which is another tool to cut off PMD-affected sites okay as we did here in the simulations so this is the amount of errors that we make here as a solid line okay for ATLAS this is our tool analysis tools for low coverage and ancient samples is how we call it and then this is for GATK we see interestingly our caller does not seem to have a reference bias so if you compare our line here for reference/reference homozygous calls or alternative/alternative we see exactly the same pattern while GATK clearly has a reference bias so you're making many fewer mistakes if the true genotype is reference/reference than if the true genotype is alternative/alternative so the only way by which GATK is beating us here on the reference/reference is actually by calling too many things reference/reference okay you also see this in the number of sites where there is no call being made by the tools so GATK has a kind of large number of sites at which it just does not produce a call while we are willing to call a lot of sites even if that may lead to a slightly higher error rate in the reference/reference case but our error rate is lower in the other cases even for low coverage okay so clearly for ancient samples it's easy to outperform GATK and for those of you using GATK just be aware of the fact and we're not the first ones to notice that that it has quite a strong reference bias actually most of the errors are towards the reference so now that we have genotype calls and we have diversity and everything let's go back to the biological question we started with right now if we apply all these tools to our ancient data what can we learn
about how farming actually spread from the Fertile Crescent to Europe okay so yeah this is the slide that should have been at the very beginning this is because I stitched this together late taking in other slides it's a bad habit but anyway let's jump to the PCA here so this is the PCA we did on our called genotypes at the SNPs for which we have a lot of modern samples available as well so this is a reduced number of sites this is a PCA run on all modern samples so we have different groups you see quite nicely that especially for Europe and most regions if you do a PCA on the genotypes of modern samples you pretty much get kind of a map because isolation by distance is the predominant factor explaining differences between humans okay well you see here we have the Sardinians down here then southern Europe central Europe the Brits Slavic people okay they kind of line up all around here down here we have the Mediterranean and then it goes to the Middle East now the question is where do our ancient samples fall we can project the ancient samples on this we have far fewer SNPs for the ancient samples and all ancient samples have different sets of SNPs so you need a way to actually project them on the map using Procrustes analysis and if we do that okay these are all the samples that existed before our study so we have the European foragers showing up here we have the early farmers down here again confirming that these two groups were really distinct groups right that we knew already the question is now where do our Aegean farmers fall and they actually fall right into the group of early European farmers so we can say that genetically the very first farmers that showed up in Greece from the Middle East are genetically the direct ancestors of the very first farmers in the rest of Europe so we can trace back the genetic ancestry of the farmers at least now to the Aegean region okay we see that there
were hunter-gatherers living in Europe and people moving in these people that we see establishing new farming settlements in Central Europe or Western Europe are genetically identical to the very early farmers we see in the Aegean region okay but now can we trace that even further back to the Fertile Crescent and the answer is no because our first Iranian farmer the one sample that we have from Iran shows up here in a completely different place suggesting that our Iranian first farmers are genetically not related to the Aegean first farmers okay we can further look at that using a different type of analysis so this is an analysis done by Garrett Hellenthal a collaborator of ours and so he's running a tool called ChromoPainter in which what you try to do so these are our two ancient samples this is Bar8 our highest quality sample from the Aegean and WC1 our highest quality sample from Iran and for both of these samples you try to explain the haplotype structure you see in that sample with a model in which you copy haplotypes from modern populations okay so this is for Bar8 here so we say along the genome you copy together the best matching haplotypes at all locations from modern populations ok and it's one of these kinds of analyses and then all these round dots are modern populations and their size and color represent how much of the haplotypes are copied from that population ok and you run this for all populations at once first for Bar8 and then for WC1 ok and then what we see here is the relative contribution of the populations to these samples what we find is that our Aegean sample here seems to be in terms of haplotypes not only individual sites but haplotypes extremely similar to modern Europeans because you can explain the chromosomes best by stitching together haplotypes from all over Europe on the other side our Iranian sample is best explained by stitching together
haplotypes we find further to the east ok so that is mostly in India and Pakistan and modern Iranian populations so what does this suggest this suggests that the Iranian farmers are actually not the direct ancestors of our Aegean and therefore European farmers they may be the direct ancestors of the farmers that spread towards the east but these two groups the Aegean farmers and the Fertile Crescent farmers are genetically distinct they're different so what does that mean we're not really sure does that mean that there were just multiple groups developing farming all together does this mean that these farmers here learned or kind of picked up farming from these other people we're not really sure we think the latter one sounds quite plausible so that's why I think we can conclude that the spread of farming from the Fertile Crescent to Europe involved two different kinds of factors first farming most likely spread in some way by cultural diffusion be it that hunter-gatherers learned to farm from other farmers or be it that there was a bunch of different groups developing farming together exchanging ideas we don't know and then once these people that started to farm reached European shores they were just spreading demically and colonizing the rest of Europe as people okay so as a result ancient DNA tells us that the spread of farming to Europe is complex and it depends on which region we look at from the Aegean to Europe it's mostly carried by the replacement of people but from the Fertile Crescent to Europe it's a much more complicated story that at least to some degree involved cultural diffusion as well okay so let me sum up with some general conclusions that I think we can learn from working with this type of data and apply maybe to other data so first of all errors are an unavoidable part of NGS data all of you working with NGS data know that and we have to find a way
to deal with it just ignoring it seems not to be a very clever strategy as we've seen at least in this case filtering introduces a lot of biases and if you're interested in genetic diversity it usually introduces a bias towards too little diversity if we're only willing to accept heterozygous sites or alternative sites where we're really sure it means we have too few of them and we underestimate our diversity the trick that most people in population genetics are kind of moving to is that we try to avoid calling genotypes altogether and rather use tools that estimate what we're interested in by actually integrating over genotype uncertainty if we do that we have the benefit of actually being able to work with low coverage data and if we work with low coverage data it means we can invest in more biological samples rather than more reads of the same individual so investing in biological replicates rather than technical replicates as I said before however whenever we do this kind of hocus-pocus we require that the quality scores we're working with are actually reliable we can only integrate over our genotype uncertainty if the uncertainty of the genotypes is properly estimated itself okay and we have learned the hard way here that recalibration is super important if we work with low coverage data to a degree to which our diversity estimates were pretty much useless without an accurate recalibration before now applying all this kind of logic to the ancient DNA case here so I presented you some tools that we implemented in ATLAS analysis tools for low coverage and ancient samples that we developed it's a tool specifically designed to work with ancient DNA in particular ancient DNA has two major things to deal with low sequencing depth and post-mortem damage and both have to be properly taken into account I showed you how we developed a reference-free base quality score recalibration I think that
might also be interesting for people working with non-model organisms for instance and that we can infer accurately the genetic diversity of individuals even if you have sequencing depths as low as 1x right which if you think about it is quite remarkable you have on average one single copy per site but you can still tell how many sites are heterozygous in a region just by integrating over all the uncertainty and in the end if we use all these tools we also arrive at unbiased genotype calling so we don't bias our genotypes towards the reference nor towards modern samples and only this allows us to properly compare ancient samples with modern ones if we apply these tools to our samples from Aegean and Iranian first farmers we first found that we can actually trace back the ancestry of the European first farmers directly to the Aegean meaning that people from the Aegean spread into Europe bringing farming however this kind of direct genetic line breaks if you go from the Aegean towards the Fertile Crescent somewhere in between it doesn't extend we don't really know where but suggesting that there was at least some cultural diffusion involved in the spread of farming okay with that I would like to thank a lot of people involved in this work so first of all people from my group Athanasios Kousathanas and Vivian Link a postdoc and a PhD student working a lot on this and my long term collaborator Christoph he is a mathematician I collaborate with a lot then the leading people from the Palaeogenetics group in Mainz who generated all the data so this group is led by Joachim Burger Zuzana prepared all the samples in the lab for the Aegean project okay and also was heavily involved in many of the population genetic analyses Farnaz took the lead on the Iranian sample also two years of lab work to actually get that done and then some analysis there Christian is the embedded bioinformatician in the
group we collaborate with as well and then there is a large number of people involved in these projects archaeologists digging out the samples and helping us with the interpretation lots of people running analyses to make this work okay and then thank you all for your attention and I'm happy to take any questions if you have some [Applause]
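Editor's sketch: the reference-free recalibration idea from the talk can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not the ATLAS implementation. Every name in it (`error_rate`, `site_log_likelihood`, `log_likelihood`) is hypothetical, the covariate vector q is whatever per-base information is chosen (for instance an intercept, the machine quality, its square, the read position), and it assumes a uniform prior over the unknown true base and that a sequencing error is equally likely to produce any of the three other bases. Post-mortem damage, which the talk says is part of the real model, is left out.

```python
import math

BASES = "ACGT"

def error_rate(q, beta):
    """Logistic transform of a linear combination of per-base covariates.

    q    : list of covariates for one sequenced base (assumed layout)
    beta : list of coefficients of the linear model, same length as q
    """
    eta = sum(b * x for b, x in zip(beta, q))
    return 1.0 / (1.0 + math.exp(-eta))  # bounded between 0 and 1

def site_log_likelihood(reads, beta):
    """Log-likelihood of one haploid (monomorphic) site.

    reads : list of (observed_base, covariates) pairs covering the site
    The unknown true base is summed out with a uniform prior (assumption).
    """
    total = 0.0
    for true_base in BASES:
        p = 1.0 / len(BASES)  # prior probability of this true base
        for observed, q in reads:
            eps = error_rate(q, beta)
            # correct read with prob 1 - eps, a specific wrong base with eps / 3
            p *= (1.0 - eps) if observed == true_base else eps / 3.0
        total += p
    return math.log(total)

def log_likelihood(sites, beta):
    """Sum of site log-likelihoods over all sites in the haploid region."""
    return sum(site_log_likelihood(reads, beta) for reads in sites)
```

Maximizing `log_likelihood` over `beta` (with any numerical optimizer or an EM scheme) would give the recalibrated error model. Note that no reference base enters the calculation: only the agreement or disagreement among reads at sites assumed to carry a single allele is used.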
Info
Channel: SIB - Swiss Institute of Bioinformatics
Views: 1,910
Rating: 4.25 out of 5
Id: oJ9pbQsyaUg
Length: 54min 56sec (3296 seconds)
Published: Thu Mar 09 2017