Sriram Sankararaman | Signals of Ghost Archaic DNA in Present-Day West African Populations

Today I'll spend some time talking about this recent work that my group has been pushing for the last couple of years, but also giving you a little bit of context: what this finding implies, what the implications are for our understanding of biology and evolution, and what some of the computational and statistical questions are that it leads us to.

As background, a lot of what we work on is understanding genomic data. Again, feel free to interrupt me if there are questions; if there are terms that are unclear, I'm happy to answer them as we talk. So, the genome: this is the entity that all of us carry, that we inherit from our parents, a three-billion-long sequence of A's, C's, G's, and T's. What I am interested in is methods that allow us to interpret and understand our genomes. This is a complicated question, and the reason why we as computer scientists are in a position to contribute to it is the availability of data from genome sequences. The kinds of questions we are interested in are: how do we take genome sequences and understand mutations, or changes in our genome, that are associated with diseases? Which of these mutations are responsible for certain clinical traits? We are also interested in doing this at scale. As we speak today, we have data sets of genome sequences from hundreds of thousands of individuals, and that poses lots of interesting computational challenges.

In addition to the relevance of the genomic sequence to understanding disease associations, there's another reason why I am interested in the genome, and that is the fact that by looking at these mutations we can say something about human evolution. What I'm going to be focusing on today is this other aspect: by looking at genome sequence data, we'd like to reconstruct human origins and human history. And it turns out that this kind of understanding also has relevance to our understanding of disease biology. So, to give you a
little bit of background about this field: it's been about 20 years since the first human genome was sequenced, and today we have genome sequences from hundreds, or now thousands, of individuals. By looking at these genome sequences, we are building a picture of how human populations are related. The broad outline of this picture is that present-day humans, that's all of us, originated in Africa around 200,000 years ago. Then there was this migration out of Africa, and this migration was the founding of the populations outside of Africa today. There was a small set of individuals that left Africa, and these populations went to different parts of the world and gave rise to the present-day individuals outside of Africa. Here "kya" is thousands of years ago, so this is hundreds of thousands of years ago.

So that's the broad picture, but there are wrinkles in this story. If you talk to people who study human evolution, anthropologists and archaeologists, what they say is surprising about this spread is that around this time modern humans were not the only human-like creatures around. There were other populations, we call them archaic humans, who also existed; this is based on the fossil record. The most well known of these is the population called the Neanderthals. The Neanderthals appear in the fossil record around 300,000 years ago and vanish around 30,000 years ago, and there is clear overlap in space and time between when modern humans existed and when Neanderthals existed. For those of you who haven't seen these fossils, here are the skulls: this is a modern human skull, this is a Neanderthal skull. There are all of these questions in evolution about why one population, the Neanderthals, went extinct. Why did modern humans become so successful? What are the biological differences, if any? For example, there are conjectures that modern humans were successful because of language, because of
abstract thinking that allowed them to organize, or self-organize. All of these are interesting theories, and the reason why we can now begin to answer them is that we have data: we now have genome sequences from not just modern humans but also archaic humans like the Neanderthals.

This begins in 2010, when the first genome sequence from a Neanderthal was made available. This was a cave in Croatia from which the genome sequence was obtained. Actually, it was three of these bones [unclear]; each of them gave a little bit of DNA, and that allowed us to assemble the first Neanderthal genome sequence. It was a technical achievement. It turns out that when you have a bone that lies in the ground for hundreds of thousands of years, there are all kinds of chemical changes that happen to the DNA. It gets broken down into really short fragments, often just hundreds of characters long. Most of the DNA is actually bacterial: at some point bacteria invade the bone, so you no longer have a lot of human DNA. Finally, you have contamination from other humans: people go and touch the bone, they handle it. Each of these presents technical challenges, and all of this had to be overcome to actually get access to the first such archaic human genome.

Yeah? Yeah, so the answer to that is twofold. One is that it's never clear which of these fragments actually retains DNA. A fragment like the skull, which looks big, doesn't necessarily have a lot of DNA left in it. So there's a huge trial-and-error process, and what actually happens is people screen thousands of these bones to find one which has enough DNA that we can work with. The second is that archaeologists are very reluctant to part with skulls; they want to analyze them and make inferences from them. These fragments are things which are not informative from their point of view, so they are happy to have geneticists go and analyze them.

All right, so this presented an opportunity where we can now analyze these genome sequences
and one of the first questions, and this was a big debate in human evolution, was whether there was some kind of mixing between these populations, or whether this was just a process where modern humans come in, completely replace the Neanderthals, and spread to parts outside of Africa. There are many ways we can try to look at this, but one way is to take the Neanderthal genome and compare it to different modern human genomes. In this case you're comparing a Neanderthal genome to an African and a non-African genome, and you can compute a distance measure. What you find is that the Neanderthal genome is consistently a little bit closer to any non-African genome than to an African genome. Why is that? It turns out there are multiple possible reasons, and one reason is that there was mixing between the ancestors of non-Africans and Neanderthals after non-Africans left Africa. In this picture, there was a small group that left Africa and mixed with Neanderthals, and that's why all non-Africans are consistently closer.

It turns out we can say something more: we can actually date when this mixing event happened. For this we use a property of how our genomes are transmitted. When a person transmits their genome to their offspring, they actually end up transmitting a shuffled copy of their genome, due to a process called recombination. The copy that you end up transmitting is a mix of the copies that you inherited from your dad and your mom. This process happens at a certain rate every generation, and that leads to certain patterns that we can exploit to date these mixture events. Here's an example of how we would do this. You have the genome of the modern human and the genome of an archaic human, in this case the Neanderthal. If you look at the first generation of this mixture, you'll have an individual like this, who has one copy of their genome from the archaic and the other copy
of the genome from the modern human. Now, in the next generation, you start seeing recombination come into play, where this individual has the prefix of their genome from the archaic and the suffix from the modern, because there was a recombination that stitched the archaic and the modern human sequences together. This is a random process, but it happens at a certain rate every generation. So if you look at a person today, they're going to have parts of the genome that are modern and parts that are archaic, and the lengths of these segments are going to be indicative of when the mixing happened. If mixing happened just a few generations back, you'll have large segments; if it happened in the distant past, you'll have small segments. We can use this to back out the date of the mixing event. This is a plot, and I'm not getting into the statistical details, but if you have a measure of the length of the archaic segments versus distance along the genome, it has a certain decay, and we can use this decay to actually estimate the date. We get a date of interbreeding between these populations of around 50,000 years ago. So modern humans left Africa around 100,000 years ago, and around 50,000 years ago is when this mixing event happened. We can also estimate how much of our ancestry comes from this mixing, and we estimate around 2%, based again on these kinds of statistical techniques.

All right, so this is all about the global proportion of Neanderthal DNA. But one question that everybody was interested in is: could this have had some impact on biology? It turns out Neanderthal DNA is quite different from modern human DNA; it has mutations that are never seen in modern humans today. So is it possible that all of these mutations that came in because of this Neanderthal gene flow event could have had some impact on human health? We were interested in this, and we were part of a study which was one of the first to look at
specific regions of Neanderthal DNA. This was a study done in Mexican Americans that was trying to associate genetic mutations with risk for type 2 diabetes, and we ended up discovering a novel risk variant for type 2 diabetes in Mexican American populations. When you look at this risk variant, it has an interesting distribution: the risk variant is essentially absent in Africa, it's present outside of Africa, and it's at relatively high frequencies in the Americas, which is why we discovered it in Mexican Americans. Then we compared the risk variant that we had discovered to the Neanderthal genome, and we found that there's nearly an exact match. So this was one of the first examples of a variant that affects risk for a disease that could be tied to this kind of admixture event. Since then there have been several other notable examples.

Yeah? How does this mutation show up in Africa? It does not show up in Africa. What this pie chart is telling you is that there are two possible variants here, the variant that is low risk for type 2 diabetes and the high-risk variant, and what we are showing is the proportion of the high-risk variant. In Africa it's essentially at 0%, whereas outside of Africa is where you start seeing this variant, and when you go to the Americas it's at even higher frequencies. This is exactly the pattern that you would expect if this got introduced into modern humans via interbreeding.

Okay, so since then there have been many other studies which try to figure out, at a given gene that I care about, say some gene that is interesting, whether there is evidence for this gene coming in through Neanderthal interbreeding. There were several such examples, and many of them are genes that are important for immunity-related function; there's an interesting hypothesis about why immunity-related genes might have a lot of Neanderthal DNA in them. But then, given these examples, one of the questions we
were interested in is: can we do this in a more systematic manner? Instead of looking at one gene at a time, can we look across a person's genome and ask whether there is Neanderthal DNA at a given position along that genome? This led us to build what we call maps of Neanderthal DNA. Essentially what we mean is: can we go to a person's genome and color it according to where they have Neanderthal DNA? We don't know these colors, and that's what we'd like to back out given genome sequences from an individual.

The basic idea is straightforward. We have a test genome, we have an archaic genome, in this case the Neanderthal genome, and we also have genomes from African populations, which we know don't have Neanderthal DNA. Our goal is to go along the test genome and color it according to where it might have come from. So it's a pattern matching problem: in this case we would say that this portion of the person's genome is closer to the archaic, and so it likely comes from the archaic population.

The way we did this is we wrote down a statistical model for inferring Neanderthal ancestry. The idea is that this is fundamentally a sequence labeling problem: we're going along a person's sequence and labeling the ancestry. There are a couple of things we'd like to incorporate into this model. The first is the fact that recombination hands down parts of your genome together. What that means is that if I tell you that you have Neanderthal DNA at position one, that makes it pretty likely that you also have Neanderthal DNA at the adjacent position, simply because of the way your genomes are transmitted. So we'd like to build a model which takes this correlation structure into account and tells you where you have Neanderthal DNA. The model uses data that looks like this: you have, say, a European genome, we have the Neanderthal genome, and we have African genomes. As I said, Africans are assumed not to have Neanderthal DNA, and we'd like
to label this person's genome according to whether they have Neanderthal or modern human ancestry. So we are building a probabilistic model of this vector of zeros and ones, which we call the local ancestry vector, given all of this data. There are some technical challenges, and I won't say too much about them. One technical challenge is that if you actually try to model this vector, it turns out to be what we call a non-Markovian process, which means the probability of seeing Neanderthal DNA at position two depends on what you saw at position one, the probability at position three depends on positions one and two, the probability at position four depends on one, two, and three, and so on. So you have this dependency that is extremely hard to capture. The way we handle this is we borrowed ideas from NLP and speech processing, where models called conditional random fields have been very successful at modeling these non-Markov, or long-range, dependencies.

I won't say too much about the technical details, but essentially we are trying to predict the joint probability of the ancestry vector. So z1 through z4 are the ancestry labels, and we are trying to model the probability of this ancestry vector given all of the data that we have observed. One way to do this is to write down certain functions, or statistics, that couple each of these ancestry labels to the data, and then functions that couple the ancestry labels at adjacent positions to each other. This model has certain parameters, and we have to optimize over these parameters to get good predictions.

To give you an example of the kinds of features that would go into this model, here is one. This is a feature that looks at the mutation pattern at one position. Here you have a mutation in the non-African that is also present in the Neanderthal, and this mutation is completely absent in the Africans. This is exactly the pattern that we saw in the type 2 diabetes example. This by itself is very
weak evidence. On the other hand, if you saw many of these mutations next to each other, then you accumulate additional evidence that this is actually Neanderthal DNA. So we build this model, we train the parameters, and the result is that we can now compute maps of Neanderthal ancestry. Here is an example where we apply it to individuals from the 1000 Genomes Project, a big database of publicly available genome sequences. This is a European individual's genome, a Chinese individual's genome, and an African individual's genome. What you see is that there are clearly many places where the model is confident of Neanderthal DNA in the non-African genomes, and relatively few such positions in the African genome.

We did some additional validation. We did simulations to get a handle on the false discovery rate and the sensitivity of this model; typically, at a false discovery rate of about 10%, we recover anywhere between 60 and 80 percent of the Neanderthal DNA. We then applied it to all of the genomes in the 1000 Genomes Project, and we see that we recover substantial amounts of Neanderthal DNA in Europeans and East Asians and relatively little in the African populations.

After this we looked at other data sets. People had been getting genome sequences from across the world, and this was a data set of about 200 genomes from about 100 different populations across the world. We could compute the distribution of Neanderthal DNA across all of these individuals, and what we see based on these methods is again substantial Neanderthal DNA outside of Africa, but we also see variation across different populations. There are now a lot of hypotheses that try to explain this variation. Could it be the case that there were multiple Neanderthal intermixing events that were population specific and that could have led to this variation? That's an ongoing research question in human evolution.

Yeah? So, you're looking at the total proportion of Neanderthal DNA, and the
assumption is that on average this Neanderthal DNA is going to be fluctuating randomly across populations. If there was a single event and the populations then split, then averaged across the genome the proportions should be close, up to statistical variation. On the other hand, if you see large differences, then you need some other explanation. One explanation is that there were other interbreeding events; there could be other explanations like selection, and more complicated possibilities as well.

Yeah? So there are two sources of validation. That's exactly right: all of this is largely unsupervised, so we don't know the true labels, the places in the genome where you are Neanderthal or not. A lot of this is trying to see how much these predictions line up with what we know based on other methods: for example, the fact that you should see a certain trend across different populations, or that certain genes we expect should pop up. The other thing we also do, which I haven't gone into, is that for all of these we have other ways of statistical testing which are not as precise but which we can show mathematically are unbiased, and often we verify these predictions with these unbiased but less powerful statistical estimates. But yeah, in general all of these have to be verified using complementary lines of evidence.

All right, so this was all about Neanderthal DNA. Around this time, or actually several years back, there was another major finding in this field. This is a cave in Siberia, a place called Denisova Cave, and this was around 2011, where archaeologists found this pinky finger; that's what you have here. Again they decided to test it for DNA. There were initial hypotheses about what this could be: possibly it could be modern human, or it could be Neanderthal. Those were the two obvious candidates, but it turned out it was neither modern human nor Neanderthal. It was a new population, called the Denisovans. The Denisovans are a population which is actually a sister group of the Neanderthals. So if
you build a tree, you have a split between modern humans and this branch, and this branch then splits into the Denisovans and the Neanderthals. This arrow here is the interbreeding between Neanderthals and non-African ancestors. Now again we can ask the question: how does this new population relate to other modern human groups? It turns out again that there has been interbreeding between Denisovans and modern human populations, specifically populations living in Oceania. These are populations in the islands around Australia, and they have around three to six percent Denisovan DNA. So these populations have both Neanderthal and Denisovan DNA. It's quite interesting, because this was one of the first discoveries that came entirely from genetics. Previously, with Neanderthals, we had the fossil record, and we could say this is what we expect. This was particularly surprising because it was made entirely based on this pinky finger: there was no evidence that it was unusual, and genetics was saying that this was something quite different.

We can extend these kinds of methods to the study of Denisovan DNA in human populations, and here again we find substantial Denisovan DNA. These are the populations in Oceania and Australia who have substantial Denisovan DNA. It turns out there are also populations in East Asia, particularly Tibetan populations, that have quite a bit of Denisovan DNA.

So everything I've talked about so far is building these maps. We can do more than just look at how much of a person's genome has Neanderthal or Denisovan DNA. Now that we have a map, we can ask how this DNA is distributed along a person's genome. This leads us to looking at fine-scale maps of archaic DNA. Here we have arranged the 22 chromosomes and the sex chromosome in a circle, and we are going along every position and counting up, at a given position, how many people in this database have Neanderthal or Denisovan DNA. So if you see a line, that's
telling you that's a place where large numbers of individuals carry Neanderthal or Denisovan DNA. Looking at this map, one of the things that was interesting is that there's a fairly non-random distribution of this archaic DNA. We built models and asked what it would look like if this were just randomly distributed, so that in some places it went low and in some places it went high. It turns out this is not the pattern that you would expect. There are some places in the genome which we call peaks of archaic DNA, places where many individuals today carry the archaic mutation as opposed to the modern human mutation.

Here's one example; it's an extreme example. This is a gene which has been known to be involved in skin color and pigmentation, and it is a gene where more than half of Europeans today carry the Neanderthal version. Around 50,000 years ago there was two percent Neanderthal ancestry, so two percent of the population would have carried this version of the gene; today it's more than half. Again we can ask: is this random, is this expected? It turns out that a process by which the gene is moving up or down in frequency randomly, which we call neutral evolution, cannot explain this observation. So what we are convinced of is that this is a gene which has risen in frequency because of some positive impact; we think that this must have been adaptive for the modern human population.

Here's another example. This is a gene called EPAS1, a gene where mutations are known to be extremely important for tolerating low-oxygen environments. There's a mutation in this gene that is found at very high frequency in populations that live in high-altitude environments, like the Tibetans. If you examine what this mutation is that confers high-altitude adaptation, it turns out that it is a mutation that was inherited from the Denisovan population. So what we find is that there are peaks of archaic DNA, and at least some of them have a clear measurable effect
on the ability of human populations to survive in different environments. On the other hand, there are other places in the genome which are devoid of archaic DNA; we call them deserts of archaic ancestry. These are places where, as far as we can tell, no modern human individual carries the archaic version of the gene. Some of them again have interesting underlying biology. Here is an example: this is a position on chromosome 7 which is a desert for archaic DNA, both Neanderthal and Denisovan, and it turns out it's about 10 megabases long, so it's a pretty large chunk of the genome. This is a desert that overlaps a gene called FOXP2, and the reason why this is interesting is that FOXP2 is known to be an important gene for speech and language. So the hypothesis, and this is a hypothesis, is that these are places where there is a human mutation and there is an archaic mutation, and the archaic mutation is less fit, or deleterious, compared to the human mutation, and that is why the archaic mutation was removed quickly after it came into the human gene pool. That's just a hypothesis, and there are several groups that are actively testing it.

This is another, quantitative way of seeing this. We asked, if you look across a person's genome, is the archaic DNA adaptive, or is it deleterious? What we did is we looked at how much archaic DNA is present in a person's genome across different parts of the genome, binned by what we call selective constraint. Selective constraint means: here are places in the genome which are highly functionally constrained, known to be important based on biology, and here are places in the genome which have low constraint, which based on biology we think are not necessarily important. What we find is that both the Neanderthal and the Denisovan DNA tend to get lower as you go to regions which are under stronger constraint. So the picture that's coming from looking at these
maps is that on the whole the archaic DNA was deleterious, not necessarily good for us, and that is why it has been removed in parts of the genome which are selectively important. But that's not the full story: there are other places in the genome where it has actually risen in frequency, possibly because it was adaptively beneficial. So where we are right now is that we are building maps, we have these statistical methods that are pulling out these regions of archaic DNA, and a major challenge now is to try to see what biological impact this has had.

All right, so everything we've discussed so far is about populations outside of Africa. The real reason that we've only focused on populations outside of Africa is that we had these archaic genomes to begin with: the fact that we had the Neanderthal genome and the Denisovan genome allowed us to ask questions about out-of-Africa populations. So now the question is: what about populations within Africa? It turns out we don't know much. There's a lot of evidence from the fossil record in Africa that there were archaic populations in Africa as well. However, we haven't been very successful in getting ancient DNA out of these fossils; largely due to environmental conditions, it's really challenging to get ancient DNA within Africa. So while we now know quite a bit about archaic DNA outside of Africa, within Africa our knowledge is fairly limited.

This was what we decided to look at more closely. The goal is: can we say something about archaic ancestry, archaic DNA, within Africa, even if we don't have these archaic genomes to begin with? Again the challenge here is that we don't have labeled data, so every time we have a method we have to worry carefully about the biases inherent in that method. To try to get at this question we had two complementary approaches, and I'll talk a little bit about both of them. The first approach uses some theory
from mathematical population genetics. This is the study of how populations evolve and what we expect to see in genome sequences if populations are evolving in a certain manner. So the first line of evidence is going to come from mathematical population genetics. The second is one where we go back to our statistical approaches and devise a method that can pull out these segments of archaic DNA, but do so without needing an archaic genome. So we're going to develop a reference-free method for building these maps of introgression.

Let's start with the first one. We're going to start with a mathematical object called the site frequency spectrum. The site frequency spectrum basically looks at the genomes of a bunch of individuals and builds a histogram based on how frequently a mutation occurs. The idea is simple. You have genome sequences, let's say from Africa. In this setting every line here is an individual and every column is a position. We go along each position and ask how frequent a mutation is at that position. So you might have a mutation occurring in three out of five individuals, another mutation might occur in all individuals, and so forth. Now we can tabulate this and build a histogram of the number of mutations that occur at a certain frequency.

It turns out this summary of the data, the site frequency spectrum, is extremely informative about how the population evolved. For example, if this population just evolved by itself, the site frequency spectrum has a certain pattern to it; on the other hand, if there was introgression into this population, it has another pattern. So potentially we might be able to use this histogram to say something about how these populations have evolved. The challenge is that the site frequency spectrum by itself has too much information: it is determined not only by the history of this population but also by other
processes that we have less information on, for example mutation rates, selection, and so forth. So we want a summary of the data which is informative about introgression and is robust to everything that we don't care about. To do this we came up with a different summary, which we call the conditional site frequency spectrum. What this summary of the data does is look only at positions where, when we compare the Neanderthal genome to the chimpanzee genome, which we use as the genome of the ancestor, the Neanderthal and the chimpanzee genomes differ in what base they carry. So now we are going to build a histogram based exclusively on those positions where the Neanderthal and the chimpanzee genomes differ. This we call the conditional site frequency spectrum.

Now why is this an interesting summary of the data? This relies on some population genetic theory. Let's say the true history of our populations was something like this: you had an ancestral population, and at some point it split into two; one was the ancestors of the Africans, the other was the ancestors of the Neanderthals. If you compute the conditional site frequency spectrum, we expect it to be uniformly distributed, and we can show mathematically why that's the case. It turns out the mathematical reason is that in the ancestral population, and again there is a lot of population genetic theory for why that's the case, the site frequency spectrum, which tells you what fraction of your mutations have a certain frequency x, has a form that looks like 1/x. In other words, the site frequency spectrum says that a mutation that's present in one individual will be seen twice as frequently as a mutation that's present in two individuals, three times as frequently as a mutation in three individuals, and so forth. This is a well-known model in population genetics. Now we're going to look at the conditional site frequency spectrum. So what is the
conditional site frequency spectrum we are only restricting to those mutations where the Neanderthal carries the mutation so the Neanderthal is what we call derived compared to the chimpanzee so the probability of doing this so if you have a mutation at frequency X and you randomly sample a Neanderthal genome the probability that this Neanderthal genome carries that mutation is exactly equal to X so now we are filtering all sites according to this probability and so the result is what we call the conditional site frequency spectrum so at a frequency X you have 1 over X fraction of sites and the probability of picking a site is X so the conditional site frequency spectrum is uniform so this is theory now what do we see in the data so in the data so this is data from an African population a West African population in the 1000 Genomes Project so if you now compute the same quantity you get something that's quite far from uniform so there's a mismatch between what theory predicts and what the data shows us so what might explain this oh and before this we looked at other West African populations not just the Yoruba and all of them show this characteristic U-shaped pattern of course there's possibilities of technical artifacts so there are errors of all possible kinds errors in the genome that we are looking at errors in the archaic genome errors in how we identify these different mutations so we asked whether some of these could explain this pattern turns out these errors could explain it but the rate of errors that would be required is much higher than what we know from other studies turns out there are other biological processes that might also explain it and I won't get into the details and so to verify this we did simulations under these models of biological processes we filtered the data in different ways to try to figure out which parts of the genome are less likely to be affected by some of these processes and we found that across all of these ways of looking at the
data we could rule out these other explanations we also looked at other models of human history so we asked let's build a complicated model of human history others have also built such models and could this explain the data so here is the blue curve which is the data the orange is what we see based on these models of history that we have and you see that maybe it fits a little bit here on the left but it doesn't explain all of this spectrum so we concluded that current models our current understanding of human history does not explain the data based on this statistical sample so what else might explain this and so this led us to models which are what we call models of structure or introgression in Africa so here is one model that we showed does explain the data so this is a model where there is this population that split off prior to the ancestors of modern humans and then it introgresses or interbreeds again we call it a ghost population because this is not a population which we have identified based on any fossil evidence and what we find is this model which has a ghost introgressing into the African population does explain this conditional site frequency spectrum we also looked at other models other models of introgression maybe this is not really a ghost maybe it's the Neanderthals coming back together and mixing turns out again we can reject this it doesn't fit our data and we can reject this model at a fairly stringent p-value there are other models where there was maybe just within Africa there was a lot of structure populations splitting and merging again those do not explain the data quite as well and we explored a lot of other models which have structure within Africa so in this setting for example we looked at different models of structure within Africa and we are computing the p-value so lower p-value means we reject the model across different choices of parameter settings and we find that none of these parameter settings allow us to explain the data using this model so
taken together this analysis of the data suggests that a model where there is a ghost population introgressing into Africa could explain this conditional site frequency spectrum there's one other question we also asked which is can we say whether this was an African specific signal or was this shared between African and out-of-Africa populations I won't say too much about this but our current estimates suggest that some of the signal is actually prior to the split between Africans and out-of-Africa populations we can also get more quantitative about this so we can ask when did this population split off how much of the ancestry came from this population and at what point of time was this introgression event so at this point we fit a model with all of these parameters and one of the interesting aspects of this is this was a population that split off prior to Neanderthals splitting off from modern humans so it's a fairly old population and almost 11 percent of the ancestry of Africans comes from this ghost archaic population so compared to the 2% or the 3% attributable to Neanderthals and Denisovans this had a fairly big impact in terms of how much ancestry comes from this population so that was one line of evidence we were not quite sure whether we completely believed this so we tried to figure out if there was another way to look at this to do this we now built a map of introgressed DNA and the key difference now is we have a method that does not require an archaic genome clearly we don't have an archaic genome to begin with so that's what we want from this method so the underlying idea behind this reference-free method for archaic DNA is if you have enough modern human individuals so if you look at a collection of modern human genomes simply by looking at the patterns of variation within those modern human individuals there is enough information to pull out these introgressed segments so what we do is we compute a bunch of statistics on these modern human genomes
these are the features that we use to train a model like the conditional random field model that we had previously and we use that to predict the archaic DNA so some examples of the kind of statistics we use for example if you have a target genome so this is a genome in Africa and this individual has archaic DNA at this position then what you expect is a number of mutations will fall on this person's genome that are going to be exclusive to this individual because this is archaic most of these mutations are not going to be seen in other individuals so seeing a signal like this might increase our odds that this is an archaic segment similarly if you take this target genome in Africa and you compare it to another genome sequence if this was actually archaic then the distance to all the other genomes is going to be much bigger than if this actually came from a modern human population so again you can build a collection of features and use a statistical model to be able to figure out which of these features are informative for making predictions of archaic ancestry before applying it to the African setting we had a positive control here which is the Neanderthal introgression that we've already identified so now we apply this method first to the Neanderthal setting and here we compare the predictions from the reference-free method to the predictions from the reference-based method that we had previously and by and large we see concordant predictions for example as we look at the probability that this method assigns to whether a segment is archaic or not now we can ask how likely is that segment to match the Neanderthal genome which we know is what we expect this segment to come from and so as the threshold of the probability increases you're more likely to match the Neanderthal DNA than if you are labeled to be not Neanderthal similarly this was the gene that I identified earlier as having high proportion of Neanderthal DNA identified using the reference based
method the reference-free method also identifies this as an introgressed DNA segment again a lot of the other features also line up so this makes us confident that this method actually can pull out introgressed DNA so then we applied it to Africa so applying it to the West African population one of the populations this is the Yoruba we find that at a false discovery rate so in this case of 20% about 8% of the genome is identified to be archaic according to this method so this is concordant with some of those other estimates that we got from the conditional site frequency spectrum analysis so given this introgressed segment we can now ask is it really closely related to any of the other genome sequences that we have for example maybe this came from some other modern human genome which we misidentified as archaic so we have genome sequences from other populations that are distantly related to African populations we can also compare it to the known archaic genomes like the Neanderthals and Denisovans and so what you see here is a measure of the distance of these introgressed segments so we are comparing the introgressed segments to the non-introgressed segments according to our method and we find that for all of these the segments that we label as archaic are not particularly closely related to any of these populations again this is exactly what you would expect if this was a ghost or an unknown archaic population all right so those are the two lines of evidence and so what that leads us to conclude is present-day Africans trace quite a bit of their ancestry to this unknown population this population split off quite far back in the past so the kind of bigger picture about what this is telling us about human history is this notion that populations that split hundreds of thousands of years back come back and mix and this is a common feature no matter which set of populations we've been looking at so we knew this was the case with out-of-Africa populations but we also know that within
Africa this is common and a week after our paper appeared there was another paper that also showed instances of these kinds of ghost archaic populations introgressing into other populations in the world so the bigger picture is this model of human history which is incredibly complicated where you have branches splitting off and then coming back together and mixing so that leaves us with some questions for the future so the first is how pervasive is the signal of archaic DNA and what were these ghost populations so one way to answer this and that's what we're doing right now is getting genome sequences from other modern human populations that have not been represented yet so even within Africa we have only sampled a small subset of the populations it's an extremely diverse environment the second is ancient DNA so this field has been making huge progress over the last five to ten years so this is a timeline and this shows the different ancient DNA sequences that have been generated across the world so this is in Europe this is in different parts of Asia this is in Africa and this is at different points in time going from most recent to furthest back in time and what we see now is we are actually building a picture of how different human populations were at different points in time the picture in Africa is still fairly sparse so again it's unclear how ancient DNA will progress there but my guess is in parallel to the advances in populations outside of Africa we will soon start having ancient DNA from populations within Africa so once you have those kinds of genome sequences we might be in a position to be able to identify what these ghost populations might have been the other question which I briefly touched upon is what was the impact of this DNA on human health on human biology so this is where there is a confluence between these kinds of evolutionary studies and studies that have been undertaken with the purpose of studying health and human disease so these are
what we call biobank datasets an example of this is a study in the UK called the UK Biobank which has genome sequences from about half a million individuals thousands of traits and one of the things we are doing is now using these maps that we have built to be able to directly test the influence of these mutations in these biobank scale data sets so for example you can look at a position like this which we have identified as a mutation that is inherited from say the Neanderthals and then put it together with disease or other kinds of trait information and this allows us to understand in a natural context so without doing experiments in cell lines which other groups are doing what might these mutations have done why is it that some of these mutations have become as frequent as 50 percent today the other question so this is all about the biological and evolutionary impact of these kinds of analyses the other question is trying to understand how we can actually build a better or a more coherent picture of evolution so what I've talked about today is humans but it turns out that this picture is actually common across different species so people have looked at primates people have looked at mice butterflies and as you get genome sequences and you try to build the history of these populations you soon run into a messy looking graph so this is a computational problem and a challenge so far as a field we have been fairly good at building trees so given genome sequences there's a whole field of phylogenetics which is about inferring trees from these sequences turns out it's a challenging problem but we do have relatively good tools we know what methods work how much data is needed for these methods to work and so forth once you start looking at these graphs we are essentially lacking any good theory so we don't know how do we make inferences starting from all of these genomes how do we actually build these in some systematic manner what methods work how much data we
need for this and just to give you a hint of the kinds of things we are working on so the kinds of methods that we are working on are based on this notion of invariants so the idea is we looked at the African setting and we said there is this statistic this conditional site frequency spectrum this strange-looking statistic that has some nice properties it is flat when there was no introgression and it has this U shape when there is introgression but we had to think through this really hard to come up with that argument so now the question is can we have a method that identifies this statistic automatically for us so given some picture given some model of human evolution can we build these statistics in an automated manner so we call these statistics invariants the reason they are invariants is it's some function of the data which has a certain value under a given topology and it has a different value when this topology changes turns out that these are actually a very powerful technique for studying evolution interesting point some of the first work so this is not a new idea there's been some really old work so some of the first work is actually from James Lake professor here at UCLA but again a lot of that needs to be adapted to the data and the kinds of problems that we are dealing with all right so with that I'd like to acknowledge everyone so the student who worked on the African introgression extremely talented student Arun and this was of course work across many different institutes across many different disciplines I'd like to thank all the collaborators and funding agencies thank you [Applause]
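The conditional site frequency spectrum argument from the talk can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the talk or the underlying paper: the function names, the 0/1 haplotype matrix layout (rows are individuals, columns are positions, 1 means the derived allele relative to chimpanzee), and the toy data. The last part only checks the arithmetic of the flatness argument (a spectrum proportional to 1 over X, filtered with probability X, gives a constant); it is not a simulation of any demographic model.

```python
import numpy as np


def site_frequency_spectrum(haplotypes):
    """Histogram of derived-allele counts across sites.

    haplotypes: (n_individuals, n_sites) array of 0/1 values,
    where 1 marks the derived allele (assumed layout).
    Returns sfs of length n + 1, where sfs[k] is the number of
    sites at which exactly k individuals carry the derived allele.
    """
    haplotypes = np.asarray(haplotypes, dtype=int)
    n = haplotypes.shape[0]
    counts = haplotypes.sum(axis=0)
    return np.bincount(counts, minlength=n + 1)


def conditional_sfs(haplotypes, archaic_derived):
    """SFS restricted to sites where the archaic genome is derived,
    i.e. where it differs from the chimpanzee allele."""
    mask = np.asarray(archaic_derived, dtype=bool)
    return site_frequency_spectrum(np.asarray(haplotypes)[:, mask])


# Toy data: 5 individuals (rows), 4 positions (columns).
haps = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
archaic = [1, 0, 0, 1]  # archaic genome carries the derived allele at sites 0 and 3

sfs = site_frequency_spectrum(haps)
csfs = conditional_sfs(haps, archaic)

# Arithmetic behind the flatness argument: if the fraction of sites
# with derived count k is proportional to 1/k, and a site is kept
# with probability k/n (a randomly sampled archaic lineage carries
# the derived allele), the product does not depend on k.
n = 10
k = np.arange(1, n)
expected_csfs = (1.0 / k) * (k / n)
expected_csfs /= expected_csfs.sum()

print(sfs, csfs, expected_csfs)
```

On real data the interesting comparison is between a spectrum computed this way and the flat expectation; the U shape described in the talk is a departure from that constant.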
Info
Channel: Computational Genomics Summer Institute CGSI
Views: 8,949
Id: 8bqU6QsW-sM
Length: 56min 31sec (3391 seconds)
Published: Mon Apr 06 2020