- Hi, everyone. I'm delighted to welcome you all to our monthly Inference
Colloquium Series. This series, as many of you know, is one of the flagship events of our larger interdisciplinary project that examines the nature of inference, focusing specifically on issues
of correlation and causation across disciplines. It is a very ambitious
interdisciplinary project, and I'm delighted that our speaker today is one of the core members of this team. This project is supported by the John Templeton Foundation, as well as Yale's Franke Program in Science and the Humanities. And I think these epistemic issues are front and center in most academic disciplines
today, and in particular, as we all move into
sort of the big data era in our respective disciplines, the question of how we can infer and produce new reliable knowledge using new tools and techniques is really a very critical
one to explore right now. I am really excited about today's speaker, Dr. Sarah Teichmann, and her
unique take and perspective from the vantage point
of molecular biology. But first I would like
to thank our benefactors, Mr. and Mrs. Richard and Barbara Franke for their generous support
of the Franke Program and other efforts, many other
efforts actually at Yale that bridge disciplines. And I just wanted to take this moment to remind all of you assembled that we are recording this event, and that all participants must
therefore mute their videos. If you wish, you can, as is customary, submit your questions
through the chat feature. We will actually have
a dedicated Q&A session at the end of the talk. So it is my real honor and privilege to introduce our speaker
today, Dr. Sarah Teichmann, who is the Head of the
Cellular Genetics Program at the Wellcome Sanger
Institute in Cambridge, England. And it is also a particular
personal pleasure because she's one of my close friends, and we have known each other through the early stages
of our scientific careers. Sarah Teichmann's research focus has been on understanding
sort of global principles of regulation and gene
expression and protein complexes with a particular focus
on issues of immunity. She earned her doctorate at the
MRC Lab of Molecular Biology in Cambridge, and was a
Beit Memorial fellow at UCL. She started her own group at the MRC Laboratory of
Molecular Biology in 2001 and was also an elected fellow of Trinity College, Cambridge. Her lab focuses on discovering
stereotypical pathways of assembly and evolution
of protein complexes. In 2013, she moved to the
Wellcome Genome Campus in Hinxton, Cambridge jointly with the European
Bioinformatics Institute and the Wellcome Sanger Institute. She's had an incredibly
illustrious career already and has so many prizes, so many honors, that I'm just gonna
cherry-pick a handful of them, because I really don't want
to take up any more time, and I'm dying to hear what she has to say. So in February 2016, she became the head of the Cellular Genetics
program at the Sanger Institute. And co-founded this very,
very exciting initiative called the Human Cell Atlas
International Initiative, which she continues to lead. Sarah is an elected member of EMBO, a fellow of the Academy
of Medical Sciences and a fellow of the Royal Society. So without further ado, Sarah, we are absolutely delighted to
have you speak in the series and really look forward to
today's talk and the discussion that is going to follow tomorrow. So before I hand it over to you, I just wanted to mention that our
discussant for tomorrow is Professor Neil Lawrence, who is the DeepMind
Professor of Machine Learning at the University of Cambridge. So please join us at 3:00 PM EDT tomorrow for a continuation of
today's exciting session. So Sarah? - Thank you so much, Priya,
for that kind introduction and very generous
invitation for me to speak at this exciting
interdisciplinary colloquium. It's an incredible
opportunity to sort of reflect on the field of computation and
theory in molecular biology. And I'm really excited
to be giving this talk. So I've called it "The
Inference of Nature" because what we're inferring
in computational biology, theoretical biology, bioinformatics are molecules and cellular
components of organisms and their interactions. And so it's essentially
prediction of features of nature kind of at that level of
the individual components. And within theoretical
and computational biology at this molecular level, one of the main aims is, of course, to predict causation from correlation. And there are really two
different schools of thought here. One is that you want to get
the mechanistic details right. So if molecule A interacts with molecule B, which interacts with molecule C, and that
causes a pathway or cascade of biological interactions and processes, then you wanna be sure that you're modeling the
sequence in the correct way, with the feed-forward and feedback loops, the individual interactions, and the directions of these
arrows all precisely correct, and ideally with quantitative
kinetics and so on attached as labels
to these interactions. And so even though A may correlate with C, you'd wanna make sure that the model is capturing the correct
sequence of A to B to C and not sort of jumping ahead of itself. And this school of theoretical
and computational biology, this school of modeling
is really concerned with these mechanistic
details and describing them with differential
equations and all sorts of, well, you know, Boolean
models and related methods that sort of fit into
that level of biology. But there's another school of thought that's really more
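As an aside, the mechanistic A-to-B-to-C cascade just described can be sketched as a toy differential-equation model, integrated with a simple forward Euler scheme. The rate constants below are invented purely for illustration, not measured values:

```python
# Toy mechanistic model of a cascade A -> B -> C, integrated
# with forward Euler. Rate constants are invented for illustration.
def simulate_cascade(k1=0.5, k2=0.3, dt=0.01, steps=2000):
    A, B, C = 1.0, 0.0, 0.0
    for _ in range(steps):
        dA = -k1 * A            # A is consumed
        dB = k1 * A - k2 * B    # B is produced from A, consumed into C
        dC = k2 * B             # C accumulates
        A += dA * dt
        B += dB * dt
        C += dC * dt
    return A, B, C

A, B, C = simulate_cascade()
# Mass is conserved: A + B + C stays (approximately) 1.0,
# and by t = 20 most of the material has flowed into C.
```

The point of such a model is exactly what the talk describes: the arrows and their directions are written down explicitly, so A's correlation with C is explained through B rather than modeled directly.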
concerned with correlation rather than those mechanistic details, and for very good reason. And that is that
we can calculate linear, non-linear, all different kinds of
correlations in any number of dimensions. Shown here is a very simple
correlation between A and B. And often we're doing that, not between two simple molecules and a few data points, but
in absolutely huge data sets. And the data sets that I'm showing here are actually quite modest on
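For concreteness, here is the simplest version of the calculation being described, a Pearson correlation between two measured variables standing in for A and B, in plain Python:

```python
import math

# Pearson correlation between two measurement vectors, e.g.
# expression levels of two genes across the same set of cells.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

a = [1.0, 2.0, 3.0, 4.0]     # invented toy measurements
b = [2.1, 3.9, 6.2, 8.0]     # roughly 2x a
# pearson(a, b) is close to +1: a strong linear correlation
```

In real single-cell data sets the same computation runs over thousands of genes and millions of cells, which is what makes the statistical machinery necessary.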
the scale of what's available to biology in this day and age and for many different
types of measurements. And the field that I'm
most active in at the moment is genomics and functional genomics, where we're measuring, for
instance, the mRNA levels, the expression levels of all the genes in the genome in millions of cells at the same time. And so the size of these data
sets are absolutely huge. And we can calculate the
correlations and relationships between individual
genes and represent them as these networks, for instance; that's
one way of mining data. We can also collapse high dimensional data that's in 20,000 dimensions
into a two dimensional space. And what's shown here
on the right-hand side are individual cells, and you can see how there are
hundreds of thousands of them projected into this
little panel on the right-hand side here, where I'm showing my red cursor. And so the only way to really
tackle these data sets is using statistical
and computational tools from data science, as
well as machine learning and, more globally, artificial
intelligence methods, deep learning and so on, because
the magnitude of the data is simply so enormous that it doesn't make sense to tackle it with differential equation-based mechanistic models. And what we're extracting
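As a hedged sketch of what "collapsing 20,000 dimensions down to two" involves: real pipelines use PCA followed by dedicated methods such as UMAP, but the core idea, finding the main axis of variation, can be illustrated with a pure-Python power iteration. The three-gene "cells" below are invented toy data, not a real data set:

```python
import random

random.seed(0)  # deterministic starting vector

# Extract the top principal component of centered data by
# power iteration, then project each "cell" onto it.
def top_component(data, iters=200):
    d = len(data[0])
    means = [sum(col) / len(data) for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        # multiply v by the covariance structure (X^T X v)
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v, means

cells = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.2],   # invented expression
         [3.0, 6.0, 0.1], [4.0, 7.9, 0.3]]    # profiles of 4 cells
v, means = top_component(cells)
embedding = [sum((x - m) * vi for x, m, vi in zip(row, means, v))
             for row in cells]
# `embedding` is each cell's coordinate along the main axis of variation
```

A two-dimensional embedding like the one in the slide repeats this idea with far more sophisticated, non-linear machinery, but the goal is the same: a low-dimensional map of very high-dimensional cells.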
are correlations if you will, in these very large data sets
in order to extract principles of the molecules, their interactions. On the right-hand side, it's actually molecular
fingerprints of cells. So it's basically features of these biological building
blocks from large data. And really what I'm trying
to say is that these correlations,
these relationships in this data space, are informative. And sometimes we don't really care whether they're causal or not. It's just that the correlation is
enough to make a prediction. Take, for instance:
red sky at night, shepherd's delight. Red sky in the morning,
shepherd's warning. That little ditty, the idea that you can predict the weather from the color of the sky, is a very powerful
predictive principle, but of course the color of the sky at night doesn't cause the weather the next day; the correlation is simply good
enough to be very powerful. Sometimes predictions in and of themselves are really what we care about, even if we don't understand the detailed mechanistic reasons, even if we haven't deciphered every
single causal, you know, element in the process. - [Priya] So Sarah, if I may
ask for a quick clarification. - Sure. - In these sort of highly
multidimensional spaces, is what you're doing predicated on the fact that you know all the
variables in question? Or is there an element of also looking for what the potential variables might be? And I ask only because it's such a huge
multidimensional space. - So it's a great question. I think this
comes back to the question: is data mining unbiased? These methods that I'm
showing here for the calculation of graphs between molecules
from high dimensional
single cell genomics data, or the manifold projection,
which is a uniform manifold
approximation and projection (UMAP) that I'm showing on the right-hand side, these are unbiased. But in doing these calculations, we have, as the
scientist, a certain hypothesis or a certain mental framework that drives us to actually
simplify and project the data in this way, even if the
computational methods are unbiased. That's getting a little
theoretical and philosophical. But for instance, what I'm
showing here is a graph calculated through the correlation
of transcription factors. And as such, it's completely unbiased. That's a completely
hypothesis-free method. But actually
I have a sort of hypothesis that I haven't articulated here. And that is that the subset
of genes that I'm showing here are transcriptional regulators. They are the class of genes
that switch other genes on and off. And so the reason that
we did this analysis was that our hypothesis is that these are the key regulatory factors that determine these cell states. And that's why we're calculating the correlation between
their expression levels. - And you have confidence. - So I think there's always
a sort of yin and yang in data mining. There
isn't a contradiction between the hypothesis-driven
mechanistic modeling and the unbiased, quote unquote, data
science kind of approach, because they're not completely different worlds. There's
often a hypothesis or a sort of framework
that you start from, even if that ends up surprising you and being not correct, and
you discover new things using the unbiased machine
learning or data mining tools. Does that make sense? - Yeah, thanks. - So that really, your question
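Connecting back to the transcription-factor graph mentioned a moment ago: the construction is just "nodes are genes, edges join pairs whose expression profiles correlate above a threshold." A toy sketch, where the gene names, expression values, and the 0.8 cutoff are all invented:

```python
import math

# Toy correlation graph over transcription factors: nodes are
# genes; edges join pairs whose expression across cells
# correlates above a threshold. All names/values are invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

expression = {                          # gene -> levels in 5 cells
    "TF1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TF2": [1.1, 2.2, 2.9, 4.1, 5.2],   # tracks TF1 closely
    "TF3": [5.0, 1.0, 4.0, 2.0, 3.0],   # unrelated profile
}

edges = sorted((a, b) for a in expression for b in expression
               if a < b and pearson(expression[a], expression[b]) > 0.8)
# edges == [("TF1", "TF2")]: only the co-varying pair is linked
```

The hypothesis hiding inside the "unbiased" method is visible here: it enters through the choice of which genes to include and what counts as a meaningful correlation.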
really comes back to this sort of supposed dichotomy
between the modelers who do mechanistic modeling
and the data miners. Are they
coming from completely
different ends of the world? No, they're not. Actually, both
of them are scientists who start
from certain hypotheses. Yes, the data mining approach can seem completely hypothesis-free and unbiased, but actually you always start from a certain way of analyzing the data, which in itself implies a hypothesis. You then discover components, which in turn can enter the
mechanistic models and so on. So at the end of the day, there's really a productive symbiosis between these different approaches within the domain of theoretical
and computational biology. And of course, in molecular
and cellular biology, we are in a luxurious situation where we can actually interrogate
the systems experimentally that we're studying. So we're not operating
in a domain like climate change and climate modeling, where we can't tinker with one variable, one factor,
and see what the outcome of that perturbation is. Or let's say the cosmos, which, of course,
my friend Priya works on, where it's very difficult to
eliminate one planet or something like that
and see what changes. In biology, that's not the case. We can actually do experiments. And that in a way means that the
theoretical and
computational approaches, modeling and inference, have
always been part of biology to some extent. And one example is genetics. And what I'm showing
here is Gregor Mendel, the monk who observed
peas and their variation in color, size, shape,
flowers and so on, according to their inheritance. From the patterns that he observed in the offspring's features, he inferred that there
must exist inherited factors, which
came to be known as genes. And so these earliest
genetics experiments, relating changes in features to
genetic crosses, are making an inference. The stroke of genius
was to infer that there are factors
that transmit these inherited features. What I also mentioned
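Mendel's inference step can be made concrete with a toy Punnett-square enumeration, here a hypothetical Aa × Aa cross with "A" dominant, which reproduces the classic 3:1 phenotype ratio he observed:

```python
from itertools import product

# Toy Mendelian cross: Aa x Aa, with "A" dominant over "a".
def cross(parent1, parent2):
    # Each offspring receives one allele from each parent.
    return ["".join(sorted(p)) for p in product(parent1, parent2)]

offspring = cross("Aa", "Aa")          # ['AA', 'Aa', 'Aa', 'aa']
dominant = sum("A" in g for g in offspring)
recessive = sum("A" not in g for g in offspring)
# dominant:recessive comes out 3:1, the ratio from which Mendel
# inferred the existence of discrete inherited factors
```

The inference runs backwards from what the code runs forwards: Mendel saw the 3:1 output and deduced the hidden paired factors that would generate it.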
is that in cell biology, and this is a case of
developmental biology, so embryonic development, there are other kinds of perturbations. In genetics, variation
is basically the perturbation.
Here in developmental biology, it's taking a bunch
of cells or a tissue section and transplanting it into another embryo and showing that you can
control axis formation through that region called the organizer: a region of donor tissue that
acts on a host tissue, so that the host tissue takes on the
features of the donor tissue. And so that's
another kind of perturbation. The third kind, more modern, getting towards more high
throughput experiments in genetics is exemplified by this classic experiment by Nusslein-Volhard and Wieschaus where they did chemical
perturbations in Drosophila embryos to uncover the genes that
control the patterning of the embryo, where you get these beautiful striped-like
patterns that are controlled through a hierarchical set of genes that determine the polarity, the gaps, and the smaller stripes, and ultimately these iconic homeotic genes that control the anterior-posterior
axis. And this was through systematic chemical
mutagenesis and observation by Nusslein-Volhard and Wieschaus. So we have the genetics and
perturbation experiments. And we also have, what's
more, the molecular and biochemical level of experiments on individual molecules. And what I wanna use
to exemplify this here, where inference plays a role
at multiple different levels, is making molecular models. So, making models of the
three-dimensional structure of molecules. And of course the legendary
example is the double helix, where you had two different
approaches to tackle it. On the one hand, the careful crystallographic
experimental approach pursued by the group in London, and the Franklin and
Gosling paper that publishes the x-ray diffraction pattern of crystals of deoxyribonucleic acid. And then of course, the double helix is inferred computationally
from the diffraction pattern. And that in itself is a kind of inference in the sense that you're
calculating or inferring what the three-dimensional
structure of the atoms must be from the x-ray diffraction
pattern measurements. So there you can see that
the experimental measurement really relies on a computation. It's not like Gregor Mendel's peas, where he's simply looking
at the color with his eyes. Here, in order to interpret
this very complex data, this x-ray diffraction pattern, and then sort of predict
the double helical structure from this pattern, what's
required is computations, and that require a computer basically. And on the other hand
then, Watson and Crick, of course, used a combination of
different pieces of evidence to build the model in a more, in a more conventional way
with actual physical models by using the information
from the diffraction pattern and the step, the distance between the steps
and the rungs of the ladder, or the base pairs in the DNA double helix, combined with chemical information about the complementary
(audio blurs) pairing of A with T and G with C, and a few other pieces of evidence. And then they stitched together these orthogonal pieces of evidence to come up with a double helical model. So there are two different
types of inference here. One is the calculations on
the experimental measurement, and the other is basically
putting together different pieces of experimental data. But the point that I
wanna bring across here is that the experimental data
was key to make the models. And at the same time, the
model had to be validated by experimental data. So this molecular
structure prediction, which I'll also talk about later, is very closely intertwined with the observations from experiments, coupled with different
types of inference, computational and more
intellectual sorts of modeling. And this brings us to the challenge of
predicting protein structure.
Priya, just before the talk, raised the question of AlphaFold,
which of course has been very widely publicized as a deep learning approach for systematic structure prediction. This has been a challenge basically from when the first protein
sequences were determined using Sanger sequencing. And the simple
little sequence of insulin for instance, had to wait for many years, until Dorothy Crowfoot Hodgkin
solved the crystal structure. And that is because of
the powerful information that's inherent in these
molecular structures. You just saw the double helix, which is a very clear one, because it basically
immediately gave the clue that the genetic code consists
of this sequence of four letters, A, T, G, and C, coding for the proteins, which consist of a 20-letter alphabet. It's perhaps a bit more difficult to see what's so important about knowing these molecular structures, but basically you'll
just have to bear with me and believe me when I
say that this structure also gives a lot of information about the function of the protein and what its interaction
partners could be, and so on and so forth. And so this exercise in critical assessment of protein structure prediction, which has been
a recurring competition in the
community to assess or benchmark what the best structure
prediction methods are, and which has been going on for
14 or more iterations, has in a way culminated
in this deep learning approach. And the reason for that
is that this approach can now at this juncture in our history draw on such rich experimental data sets, where there are on the order
of hundreds of thousands of individual protein structures that the rules for making
these proteins structures can now be learned in an automatic way. And that wasn't possible in the early days when there was much less data, but there are many areas of
biology for which that's true. So the protein data bank is the repository of crystal structures, the three-dimensional atomic
coordinates of molecules, proteins like I just showed you, and that's been going on for 50 years now, and it's this incredible
big data resource. What I'm showing here are
other databases in biology: the protein sequences in
UniProt, genes and genomes in Ensembl, and then,
down at the bottom, two sort of more functional genomics databases that I'm involved with: the
Human Cell Atlas Data Portal and the EBI Single Cell Expression Atlas, which are portals for single
cell genomics data, in particular the gene expression data at the level of single cells, which has become possible through a resolution revolution in genomics that allows us to measure the
expression of single cells, as I'll come to later in my talk. - [Man] I haven't heard of UniProt. What is that? - UniProt is the protein
sequence database, so the amino acid sequences of proteins. Yeah, so my point here was really that there are a lot of databases now that provide the substrate
for data mining in biology. And this is really a development
that's gone on over decades, but it's really accelerated
over the past few years, so that we're now in the era
of big data in biology. And there's absolutely
no question about that. It's an
exponential increase in data that's happening at the moment. And there are a range of
theoretical numerical approaches that aid inference. And I've said on the one
hand there's modeling. There are also in silico simulations, in silico representations of systems where you can perturb systems
completely theoretically. And so in contrast to these
modeling and simulation approaches, there are data science approaches. Some of these are cases
where the data on its own is self-evident, as with
Gregor Mendel, but there are also these
larger scale approaches, as with AlphaFold, where
it's the structures; and with these big data approaches now, the methods are really
statistical, computational, machine learning and
AI, as I've mentioned. So, given that
there is this very big and important field of
computational biology in molecular biology, there can be a situation
where this field, which wasn't a
traditional part of biology, this field of computational
biology and of using big data to analyze and predict
biological structures, biological interactions,
molecular interactions, and so on, has meant that
some experimentalists might view the theoretical
component of molecular biology with suspicion. Okay, and I'm gonna pick up right here and dive into the main meat of my talk, which is going to be three elements. One is talking about predicting pathways of protein complex assembly, which is in that area of
predicting molecular structure. The second is predicting cell types using single cell genomics data. And the third is the Human Cell Atlas and using Human Cell Atlas data to predict cell
communication and infection by SARS-CoV-2, the virus that causes COVID-19. So if you think about
the inside of a cell, and the reason why it's important to predict how protein
complexes assemble, this is a molecular simulation
of the inside of a bacterium. And you can see that essentially, it's a very, very crowded place. All these proteins and
protein nucleic acid complexes are multi-molecular complexes that are rubbing up against each other in a sort of gel-like, very,
very compact environment where there
are hardly any water molecules between them, although they
are in an aqueous solution. And so understanding how these large multi-subunit
components, these globules that you can see in the
interior of the cell assemble and how they find their partners is a really fundamental
question in biology that kind of goes beyond
those individual components that are predicted by
AlphaFold to the next level of these molecular assemblies. And that's an area that I
worked on for over a dozen years and where we're using the big
data for molecular structure. So what we're
looking at are proteins, the amino acid sequences, and they are really the output of genes. So genes are the code level, the DNA. They're transcribed into messenger RNA, which is the messenger molecule, the intermediate level of information that we'll come back to later. And that's then translated
into the protein level, this amino acid level. And the question that
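The DNA-to-mRNA-to-protein flow just recapped can be sketched with a toy code table; only a handful of the 64 real codons are included here, purely for illustration:

```python
# Toy central dogma: DNA -> mRNA -> protein.
# Only a few codons are included; the real table has 64 entries.
CODONS = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}

def transcribe(dna):
    # Transcription: replace thymine with uracil.
    return dna.replace("T", "U")

def translate(mrna):
    protein = []
    for i in range(0, len(mrna) - 2, 3):   # read codon by codon
        aa = CODONS[mrna[i:i + 3]]
        if aa == "*":                      # stop codon
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGTTTGGCTAA")   # "AUGUUUGGCUAA"
# translate(mrna) == "MFG": a three-residue toy peptide
```

The amino acid string produced at the end is the level at which the assembly questions in this part of the talk are posed.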
we're asking in this work is how do protein complexes assemble? And can we predict
these assembly pathways? And so what we needed to
have in the first place was a big data set that we could build our models from. And we built that data set from over tens of thousands
of crystal structures that were available at the time. We started this work in the early 2000's, and we published this
complex database in 2004. So we made our own database, where we modeled protein complexes, which we first of all predicted to be in their physiological conformation, as graphs of the individual
amino acid sequences, where the edges that are
shown here in the graphs are the physical interfaces of the interacting protein subunits. And so this then allows
us to use graph theory as a theoretical basis
for relating complexes, for matching them to each other, for finding subcomponents of these graphs that are connected to each other. What's important to
understand here is that over half of all protein complexes are assemblies of multiple
subunits of the same type: they're called homomers. They're basically repeats
of the same subunit, and the subunits can be related to each other through axes of rotation. They can be twofold,
as you can see here in a dihedral symmetry where there's a twofold axis of rotation, or they can be cyclic, which
can have any number of elements around the circle, sort
of in this donut shape. We've got four here, but there
could be six, seven, eight and so on. And then all homermeric complexes, because they're closed symmetries, consist of combinations of these dihedral
two-fold axis of rotation and the larger cyclic axis of rotation. And the important difference
between these is that in a, in this kind of flat donut-shaped sort of repetitive structure
with a cyclic axis of rotation, you've got the head of one subunit connecting to the tail of the next, the head of one to tail
the next and so on. And so this interface consists
of two different surfaces. Whereas in the dihedral case, with this two-fold axis of
rotation, it's actually the exact same interface. It's two heads that are
contacting each other. It's exactly the same
surface, not
mirrored, but reused within this dimeric protein structure. And so that basically means that there are different evolutionary pathways
and pressures that make these different
symmetries of protein complexes. And the very simplest, as
I said, to build something, either in evolution or
assembling inside a cell would be for a single protein
subunit to stick to itself. So it encounters another copy of itself, and it sticks to itself
through a dihedral axis, two-fold axis of rotation,
where it's like a handshake. It's like you shaking
your neighbor's hand with your right hand: it's
the palm of your hand contacting the palm of their hand, so it's exactly the same
surface sticking to itself. It's a handshake kind of symmetry with a single two-fold axis of rotation. And what that means in evolution is that if there's one mutation, let's say in the fingers of your hand, that increases the affinity or
stickiness for the other hand in the handshake, then that
mutation will count twice and will increase the stickiness
of the two hands twice, because you have that symmetry. And you can have that process occurring again, with another set of
two-fold axes of rotation, to make a dimer of dimers, in other words, to form this tetramer. So once you've got the dimer, the two dimers can
interact with each other and form the tetramer, again with interfaces that are reusing the same surface. And a different evolutionary, or even kinetic, kind of
assembly scenario in the cell is that three subunits
encounter each other, and form a triangle here, with interactions that are
occurring within the same plane, with a single axis of
rotation down the middle, in a cyclic manner. And here what we need
is, as I said, for the head of one subunit to interact with the bottom of the other subunit, and for this to occur
for all three subunits. One mutation would therefore
count three times amongst those three subunits. And so you get a weaker or
slower kind of increase of the affinity for the three subunits. So one mutation would kind of
weakly increase the affinity across all three
interfaces in the same way. The same would go for four
subunits or five subunits. And this series can kind of
increase in the same way. So you see, you're kind of
building up these rings, and a single mutation
would slowly increase the affinity across the whole ring, across all the different
interfaces at the same time. Now, building up these stacks of rings can occur in different ways: the hexamer over here, with the dihedral three-fold symmetry, could form through trimerization of the dimers, so it can come through three dimers sticking to each other. The octamer can be four dimers, the decamer five dimers, and so on. Or alternatively, we have these other pathways where a trimer can
stack to form a hexamer, or a tetramer can stack to form an octamer, and so on. So there are two different paths to get to, for instance, the hexamer. One is from the monomer to
the dimer to the hexamer; the other is from the monomer to the trimer to the hexamer. And these different intermediate states can be intermediates both in evolution and in the kinetic assembly in a cell, in a biochemical sense. That's what
we showed in this work. And this was a paper that kind
of synthesizes these ideas, but that builds on several other theoretical
and computational analyses. And what we showed here
was the key principle that these pathways are
reflected in evolution, in the sense that you find monomers related to dimers and tetramers, but also monomers related to trimers, and trimers related
to hexamers, while the pathways
that are not connected do not occur: for instance, a tetramer is never, or only very, very
rarely, connected to a hexamer. A trimer is very, very rarely
connected to a tetramer and so on. So the arrows in green and
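One simple way to summarize why some transitions occur and others essentially never do: stacking closed rings means the subunit count can only grow by whole multiples, so an n-mer can be an intermediate for an m-mer only if n divides m. This is my paraphrase of the observed pattern, not the authors' actual algorithm:

```python
# Sketch of the observed transition pattern: a closed, symmetric
# n-mer can serve as an intermediate for an m-mer only when
# n divides m. A paraphrase of the pattern, not the paper's code.
def allowed_transition(n, m):
    return m > n and m % n == 0

assert allowed_transition(2, 6)        # dimer -> hexamer: observed
assert allowed_transition(3, 6)        # trimer -> hexamer: observed
assert not allowed_transition(4, 6)    # tetramer -> hexamer: not seen
assert not allowed_transition(3, 4)    # trimer -> tetramer: not seen
```

Under this rule, the green and red transitions in the slide are exactly the divisor-compatible steps.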
red that you're seeing here are really the main
transitions that we see, and that holds both for
the evolutionary pathways and for assembly pathways. And we think that the reason these principles
of evolution and assembly of protein complexes are
mirrored is that it's the size of the interfaces
that really determines the intermediate forms that
are conserved in evolution, and also the intermediates
that are formed fastest kinetically in a living cell. And so what that means
is that, for instance, if you had the trimer as the
intermediate to the hexamer, then those trimeric interfaces would be larger, would have a higher
affinity for each other, would be more stable; or if it were the dimer
that's the intermediate for the hexamer, it would be that dimeric
interface that's larger than the trimeric interface. And that was really the
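That insight, that the larger interface marks the earlier intermediate, can be phrased as a tiny predictor. This is only a sketch; the subunit names and interface areas below are invented:

```python
# Sketch of the interface-size rule: predict assembly order by
# forming the interface with the largest buried surface area
# first. Subunit names and areas (in square angstroms) are
# purely illustrative, not measured values.
def predict_assembly_order(interfaces):
    return [pair for pair, area in
            sorted(interfaces.items(), key=lambda kv: -kv[1])]

interfaces = {
    ("red", "blue"): 2100.0,
    ("blue", "yellow"): 1400.0,
    ("yellow", "green"): 800.0,
}
order = predict_assembly_order(interfaces)
# The largest interface, red-blue, is predicted to form first
```

The published work extracts this ranking from structural data at scale; the sketch only shows the shape of the prediction.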
insight in this paper that comes through a large
data mining exercise. And we expanded this concept
for heteromeric complexes. These are complexes that consist of subunits of different types. So here you'd have an octamer
that consists of two subtypes of the yellow subunits, two
of the red, two of the green, two of the blue. And if these
assembled in a random manner, you'd still be there
in the middle of the night, trying to find the correct
order for these jigsaw pieces to come together. Whereas if there's an ordered assembly, where the blue and the red
always form together first, and then the yellow, and then the green, then you've got a sort
of click, click, click, Ikea kind of assembly, and it's very rapid and efficient, and there are no
mis-assemblies or sticky aggregates
that form erroneously inside the cell. And so it's really the
speed and the efficiency of that assembly that's driving that, and we show that these
heteromeric assemblies also are reflected in the
evolutionary conservation. And in this case, what we're using is the principle of
gene fusion and fission because, of course, protein
subunits that are fused within the same gene and that are part of the
same polypeptide chain will be covalently linked, and will therefore also be kinetically the most efficient subunits to form interfaces first. And indeed, what we show is that the genetic organization of protein subunits reflects their assembly order. And then these predictions, which are based on a combination of protein structure and of gene structure in the genome sequences of organisms, were verified experimentally through a beautiful collaboration that we had with Carol Robinson's group. Carol Robinson is a physical chemist and was one of the inventors of macromolecular mass spectrometry, where you measure the mass
of intact protein complexes. Through collaboration with her, and based on expressed proteins from many generous collaborators who gave us the reagents, we showed in in vitro biochemical experiments that our pathway predictions were accurate in the vast majority of cases. And so overall in this body of work, we're showing that
protein assembly pathways just like protein folding
itself, are ordered processes. They're fast, they're spontaneous, they're predictable, and they're
also conserved in evolution. And these predictions are validated through these physical
chemistry experiments. And so you can think of it this way: the amino acid sequence doesn't only encode the structure; it also encodes assembly instructions. You can think of it as a set of building blocks. Here we've got two building blocks, and the instruction is that the red interface always binds with the blue interface, while G is neutral and exposed to the solvent. And that set of simple instructions for these building blocks would encode this cross structure: we've got the red connecting to four blues, and it makes this structure. In a way, protein complexes
are kind of like that, and we can encode them using graph theory and complexity theory. This was a fantastic collaboration with Sebastian Ahnert, my collaborator in the physics department. And we show here that we can develop a sort of shorthand for what the individual protein subunits are and what interfaces they will form with their partners. So we've got gray and white here: A interacts with C, C interacts with D, and that basically forms this square structure.
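One way to build intuition for how a small interface rule set fixes the overall topology is a toy geometric model (my own illustrative sketch, not the actual graph-theory formalism of the paper): reduce each interface rule to the turn it imposes between consecutive subunits and walk on a grid. A 90-degree rule closes into a four-subunit ring; a straight rule grows an open chain.

```python
# Toy model: an interface pairing rule is reduced to the turn angle it
# imposes between consecutive subunits. We walk on a 2-D grid and check
# whether the assembly closes into a ring or keeps growing as a chain.

def assemble(turn_quarter_turns, max_subunits=24):
    """turn_quarter_turns: number of 90-degree turns per added subunit."""
    x, y = 0, 0            # position of the first subunit
    dx, dy = 1, 0          # direction in which the next subunit is added
    for n in range(1, max_subunits + 1):
        x, y = x + dx, y + dy
        if (x, y) == (0, 0):          # walked back to the start: closed ring
            return ("ring", n)
        for _ in range(turn_quarter_turns % 4):
            dx, dy = -dy, dx          # rotate the growth direction by 90 deg
    return ("chain", max_subunits)

print(assemble(1))  # 90-degree turns -> ('ring', 4): a square-like closed form
print(assemble(0))  # no turn -> ('chain', 24): an open linear structure
```

The point of the sketch is only that the same building blocks with different interface rules yield qualitatively different architectures, which is what the shorthand notation captures.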
On the other hand, if we've got gray and white but C interacts with B, and A with A, we'll get this linear structure. And with equal numbers of building blocks, which are the general rule in nature, if you take those principles into account, you end up calculating
what we describe as a periodic table of protein complexes. And the power of the periodic table of the elements is similar, in a way, to this periodic table of protein complexes: in the case of the periodic table, the principles of atoms and electrons predict features of the elements, while here, the principles of evolution and assembly predict structures, in terms of the number of repeats and the number of subunits, for cells in this matrix that were not filled in by experiments but that we were able to fill in computationally. And those predictions were then also verified by later releases of the Protein Data Bank. We also showed how this is accommodated in the translation of proteins. And what you're seeing here is
a simulation of two elements of the ribosome. This is from Adrian Alcock
from IO in collaboration with my group, and this simulation, this molecular simulation
shows two polypeptide chains coming out of two adjacent ribosomes sitting on a messenger RNA. And what that tells us
is that the N-terminus that comes out of the ribosomes first, so that's the beginning of the chains, is likely to also interact first. And taking into account these in vivo kinetics of co-translational assembly also gives us much more detailed information
about the constraints on making homomeric proteins, and that the parts of
the amino acid chains that come out first are
interacting with each other before the other parts assemble. And so that's a kind of more
detailed level of the principle of how these polypeptide chains assemble in terms of homomeric relationships, where copies of exactly the same protein are interacting with more copies of themselves. So in summary, we've used
evolutionary relationships and principles of protein biophysics to predict assembly pathways. They're inferences from
thousands of protein structures and evolutionary sequence
relationships between proteins as well as gene structure. And our predictions were
experimentally validated both by macromolecular mass spectrometry and also by structures of proteins. And so that's really the lesson about predicting protein
complex assembly pathways. And you can see from this little story that it was a synergistic
exercise basically that involved bioinformatics
at the structural level, bioinformatics at the sequence level, simulations like the molecular simulation that you just saw of the
ribosome making these proteins and coming out of the ribosomes, coupled with physical chemistry, macromolecular mass
spectrometry experiments, and also by physical experiments that I didn't show. So this science is really a collaborative effort between different disciplines and different types of scientists working together to try to discover the truth, essentially. In the second part of my talk, I'm going to shift gears and go on to predicting a completely
different building block of life, and that is the cell. The basis of this is really the evolution of genomics, from sequencing DNA, as in the Human Genome Project, to sequencing RNA, which gives us a molecular fingerprint of cells in terms of the subset of RNA that's inside each single cell. And it's that subset of messenger RNA that tells us about the molecular features of the cell, and about what proteins would also be expressed inside that cell. And of course the cell is basically the fundamental unit of life. It's a component of the tissues, and the tissues in our body are the individual
micro-environments of organs. So the nose here, upper
respiratory system, the lungs, and lower respiratory system, the thymus, which is where T-cells are
made, the heart, and so on. Each one consists of many
different types of cells, and conventional RNA sequencing, sort of conventional bulk genomics, used to require thousands and thousands of cells to be mashed up together before the nucleic acid was extracted and put on the sequencer. Over the past almost 15 years, genomics has undergone a so-called resolution revolution, where we're now able to sequence
the nucleic acid content from each individual cell in a sample. And that's called single cell genomics. And it's that has really
opened up the ability to interrogate single cells
almost more powerfully than using a microscope. Because what we can do is
isolate individual cells, either in well plates or
using microfluidic robotics, and then extract or
label the messenger RNA, the comprehensive nucleic acid
content of each single cell and sequence that, and then
analyze the vast data sets that tell us the genes that
are active in single cells. And this has been a series
of technological innovations that's gone on from 2009 to date really with many different types
of isolation technologies, genomics protocols and
computational methods that have evolved at pace. Now individual experiments routinely encompass on the order of a million cells. And of course, that advance in technology has been, you know, absolutely
revolutionary for biology; I think it's not an overstatement to say that. And it's also been coupled, slightly behind single cell genomics, with a spatial genomics revolution, which allows us to measure
nucleic acid content of tissue sections, where
the cells are actually in their native tissue context. So you're then taking a slice of a tissue, like a mouse brain that's shown here, and what we see here is the
expression of six genes, each in a different spatial region. And this is from members
of my group together with Omer Bayraktar's group
and Oliver Stegle's group that have developed a statistical probabilistic
inference framework for mapping individual cells
to spatial genomics data. And from the single cell genomics data, one of the key data mining exercises is to infer cell types. And the way we do that is using a set of algorithms called clustering, which, you know, encompasses a vast array
of different approaches, but essentially what the exercise entails is finding individual data points that are similar to each
other and then grouping them. And what you're seeing here are data sets from gut, embryonic and fetal on the left, pediatric on the right. So they're intestinal samples, and each little spot on this
two-dimensional projection consists of a single cell. And that data point
encompasses in and of itself thousands and thousands
of gene expression levels. So it's a vast matrix of
hundreds of thousands of cells, each with thousands of genes expressed. And the total number of genes, of course, in our human genome is somewhere between 25,000 and 30,000, depending on how you count. And so this matrix that we're clustering consists of hundreds of thousands of cells multiplied, you know,
across roughly 25,000 genes. And then the exercise is to cluster those data points that are similar to each other, and that is basically then our interpretation of the cell types that are present in the data set.
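As a minimal illustration of that clustering exercise, here is a plain k-means sketch on invented expression vectors; real pipelines use far richer graph-based methods on roughly 25,000 genes, so treat this only as the shape of the computation.

```python
import random

# Toy version of the clustering step: group "cells" (expression vectors
# over three made-up genes) by similarity, with a minimal k-means.

def kmeans(cells, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(cells, k)          # pick k cells as starting centroids
    labels = [0] * len(cells)
    for _ in range(iters):
        # assign each cell to the nearest centroid (squared distance)
        labels = []
        for cell in cells:
            d = [sum((c - m) ** 2 for c, m in zip(cell, centroids[j]))
                 for j in range(k)]
            labels.append(d.index(min(d)))
        # recompute each centroid as the mean of its assigned cells
        for j in range(k):
            members = [cell for cell, lab in zip(cells, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(g) / len(members)
                                     for g in zip(*members))
    return labels

# Two invented "cell types": high gene A versus high genes B and C.
cells = [(9, 1, 0), (8, 2, 1), (9, 0, 1), (1, 9, 8), (0, 8, 9), (2, 9, 9)]
print(kmeans(cells, k=2))  # cells 0-2 share one label, cells 3-5 the other
```

The grouping itself is hypothesis-free, which is exactly why the annotation step afterwards still needs biological knowledge.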
You can see here that the interpretation is a combination of clustering and data mining, but also, kind of like Priya was asking at the beginning, there is intellectual input. There are hypotheses. There is external data that we're putting into the interpretation. And the cell types
are shown in different colors and annotated in different ways. So you see immune cells,
they're kind of lumped together, neurons of the gut. We have a kind of brain
that surrounds our gut, interestingly, they're labeled in yellow. We've got smooth muscle
that's around the gut and that's responsible for
sort of moving material through the gut, and
that's shown in brown, and the enterocytes that are absorbing material inside the gut are all shown in the blue colors here. And so there are different cell types shown in different colors for the different clusters
based on the similarity in their expression profiles. And so, again, how the
cell clustering works is that we're grouping cells with similar expression patterns. In a way it's a hypothesis free grouping. It's a data mining
grouping on the one hand. On the other hand, as I said, there is in the final
interpretation of the data, there's often external
information that enters it. So there's a kind of intellectual puzzle where the scientist is
also putting in data. This is high dimensional data, and different clustering algorithms can indeed produce different results. So how do we know what result to use, and what's the best algorithm? One of the tests for this is theoretical, and that is self-consistency. And our single cell clustering assessment framework is attempting to do exactly that. (Sorry, the acronym here should have two C's and one A; that was a slight error.) What we're doing is a cross-validation, an assessment of the clustering results: we test whether two clusters can actually model and discover each other, using a logistic regression machine learning method, or whether they are sufficiently distinct that they don't cross-match each other when we model their properties in terms of their weighted levels of gene expression.
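The self-consistency idea can be sketched as follows. A nearest-centroid classifier stands in here for the logistic regression used in the actual framework, and the expression vectors are invented; the shape of the test is the same: train on half the cells of each cluster, then ask how often held-out cells are assigned to the other cluster.

```python
# Sketch of the cluster self-consistency test (simplified stand-in:
# nearest-centroid instead of logistic regression; made-up data).

def centroid(cells):
    return tuple(sum(g) / len(cells) for g in zip(*cells))

def cross_match_rate(cluster_a, cluster_b):
    """Fraction of held-out cells assigned to the *other* cluster."""
    train_a, test_a = cluster_a[::2], cluster_a[1::2]
    train_b, test_b = cluster_b[::2], cluster_b[1::2]
    ca, cb = centroid(train_a), centroid(train_b)

    def dist(cell, m):
        return sum((x - y) ** 2 for x, y in zip(cell, m))

    mismatches = sum(dist(c, cb) < dist(c, ca) for c in test_a)
    mismatches += sum(dist(c, ca) < dist(c, cb) for c in test_b)
    return mismatches / (len(test_a) + len(test_b))

# Two well-separated invented clusters: a low rate suggests genuinely
# distinct cell types; a high rate suggests the clusters should merge.
a = [(9, 1), (8, 2), (9, 2), (8, 1)]
b = [(1, 9), (2, 8), (1, 8), (2, 9)]
print(cross_match_rate(a, b))  # -> 0.0
```

A pair of clusters that constantly "discover each other" in this test is a sign of over-clustering, which is exactly the theoretical check being described.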
So there are ways of assessing the different clustering algorithms from a theoretical and computational point of view. And then there are of course also ways of assessing the results from clustering algorithms experimentally, or using external data from the literature. In any case, we've
predicted cell types here for 20 tissues in the human body. This is almost a million
data points very recently. It's very exciting, you know, to be able to be at this
juncture where we have data that covers a representative
set of tissues from the whole body. It's not the entire body, but it's maybe half of
the tissues in our body. You know, we're getting very close, and it's a very exciting
time now for the community. And as I said, this isn't automatic. You know, a million cells is not something that you can annotate by hand; Gregor Mendel or Watson and Crick, those kinds of approaches
are not gonna work. We need to do this by machine learning. And we use a supervised or semi-supervised approach so that we can then make these models of cell types, which allows us to transfer labels from known data sets to new data sets.
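A minimal sketch of that label-transfer idea follows, with a nearest-centroid model standing in for the regularized classifiers used in practice, and invented expression values and cell-type names.

```python
# Sketch of label transfer from an annotated reference to new cells.
# (Production cell-typing models use regularized logistic regression on
# huge matrices; a nearest-centroid classifier stands in here.)

def build_model(reference):
    """reference: list of (expression_vector, cell_type_label) pairs."""
    groups = {}
    for vec, label in reference:
        groups.setdefault(label, []).append(vec)
    return {label: tuple(sum(g) / len(vecs) for g in zip(*vecs))
            for label, vecs in groups.items()}

def transfer_label(model, vec):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], vec))

reference = [((9, 1), "T cell"), ((8, 2), "T cell"),
             ((1, 9), "enterocyte"), ((2, 8), "enterocyte")]
model = build_model(reference)
print(transfer_label(model, (8, 1)))  # -> T cell
```

Newly annotated cells can then be folded back into the reference, which is the iterative expansion of the training data described next.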
And in general, of course, we're at the point where we're somewhere in between these regimes. When you have little data, you require a lot of knowledge, and the more data that you have, the less knowledge you really need to interpret it, because with enough data you can pin down very accurately, for instance,
basically the, for instance, the clusters in your data sets, or the protein structures in
your AlphaFold predictions. And this is from Carl Henrik Ek, a colleague of Neil Lawrence's in the computer science department in Cambridge, and I think it summarizes the development of single cell genomics and computational biology in this area of cell type annotation really nicely. So we're getting to the era of large-scale single cell expression data sets, where we don't need so much model selection and tuning anymore. And so the motivation is that we can make classifiers that automate the annotation. We compiled this data from many, many data sets, 19 different data sets from
different regions of the body, and asked whether there are tissue-specific cell states or cell states that are shared across tissues. There are many really exciting questions that we can now use this data set for, to interrogate the cells from across the human body. The cells cluster into roughly 100 different cell types, which are subdivisions of 10 broad cell types. And just to sort of give you
an overview of the workflow: you have to basically integrate the data sets; harmonize, assemble, or clean the data; do initial training and model the data; and then you can feed in unannotated new data and interrogate it with the reference model. That can in turn then re-enter the pipeline and contribute to the training data itself. And so you can go through a kind of iterative process where you expand your models with more and more data systematically. And so basically the take-home from this is that we're entering the era of big data in single cell genomics, just the way we did in protein structures in the previous section of the talk on AlphaFold, and the biological interpretation of that data simply needs computational inference because it's so vast. The computational tools and mathematics, on the other hand, need to keep pace with the experimental technology. So there's a kind of to
and fro and a symbiosis between the theory and the computation. I'm coming to the last
part of the talk now, and where I'm going to discuss prediction of cell communication, how
cells talk to each other, how they communicate with each
other and how we can predict which cells get infected
using Human Cell Atlas data. So the Human Cell Atlas is
an international consortium with a mission of creating a
comprehensive reference map of cells using single cell genomics coupled with spatial data and interpreted with
computational methods. And it's, you can think
of it as a Google Maps of the human body, where we're using these new cutting-edge, high-resolution technologies to get from the kind of coarse-grained view of the human body to the Google Street Maps view. And we founded this about five years ago. My co-founder, my partner in crime in this
endeavor is Aviv Regev. And we've now grown to a
community of about 2000 members across, you know, really across the globe, 77 countries across the world. And this is a grassroots, bottom up, scientist-led initiative. We are organized into working
groups and biological networks that focus on the different
organs and tissues in the body, as well as human development, organoids and genetic diversity. And one of the first, my first project on
human cells and tissues as opposed to a mouse
or other model systems was the placenta. And what we set out to do
from about 2015, 2016, was to map this organ. It's a transient organ at the interface between the mother and the fetus, inside the womb. It's of course only present for nine months of your life, but it's absolutely crucial and essential to your development for those nine months. And you wouldn't be
here without this organ. And it's relatively poorly understood, because the human placenta is very different from that of the mouse, and even different from that of the most closely related non-human primates, the chimpanzees. It's shown here in red. And, of course, the really
intriguing conundrum to me here was that there's a mystery about how the maternal
immune system tolerates the paternal antigens. So of course our immune systems are tuned to reject anything that's non-self, and that's really the basis of our health and our homeostasis. And yet, in, when we're pregnant, we are tolerating
something that's non-self that has antigens or proteins
that come from the father. And so how is that actually possible? What we set out to do, therefore, was to study this tissue by dissociation, cell sorting, and single cell genomics, using two different technologies, microfluidic droplets and well plates handled robotically, and then to computationally cluster the cells. And you can see here, we
studied both the decidua, which is the uterus or endometrium, and the placenta; the uterus is the maternal side, and the placenta is the fetal side. We also sampled the maternal blood, in order to distinguish cells that were from the maternal blood. And this gave us the cell phone book, I'll call it, which means: what are the individual cellular components of both the maternal side and the fetal side? And you can see here,
there are immune cells, NK cells and T cells. There are glandular epithelial cells that do the secretions
to support the fetus. There are fibroblasts, which are kind of structural
components of the uterus. And it's really through
the statistical inference that we're able to find
all these cell types. But what we really wanted to understand was how does this immunological
tolerance take place? And to understand that, what
we need to also dig into is how the cells are
talking to each other. And we call this CellPhone: a statistical inference system for deciphering cellular interactions through receptor-ligand complexes. And this links to the
protein complex assembly that I talked about in
the first part of the talk and how those protein complexes on the surfaces of adjacent cells are mediating the
interactions between cells. So this is a kind of interaction that's at a higher
level than the proteins. It's at the level of the individual cells. We developed a statistical
inference framework to look for receptor/ligand
interactions between cell types, between single cell clusters, that are specific to those cell types. So they're not ubiquitous; these aren't molecules that are expressed everywhere. They're molecules that are specifically complementary between individual cell types.
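The statistical idea can be sketched as a permutation test (a simplified stand-in for the actual framework, with invented expression values and cluster names): score a ligand-receptor pair by the product of its mean expressions in the two clusters, then shuffle the cluster labels to ask how specific that score really is.

```python
import random

# Sketch of cluster-specific ligand-receptor scoring: the pair score is
# mean ligand expression in cluster a times mean receptor expression in
# cluster b; specificity is assessed by label permutation. All numbers
# and cluster names are made up for illustration.

def pair_score(ligand_by_cell, receptor_by_cell, labels, a, b):
    lig = [x for x, lab in zip(ligand_by_cell, labels) if lab == a]
    rec = [x for x, lab in zip(receptor_by_cell, labels) if lab == b]
    return (sum(lig) / len(lig)) * (sum(rec) / len(rec))

def permutation_p(ligand, receptor, labels, a, b, n=1000, seed=0):
    rng = random.Random(seed)
    observed = pair_score(ligand, receptor, labels, a, b)
    hits = 0
    for _ in range(n):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if pair_score(ligand, receptor, shuffled, a, b) >= observed:
            hits += 1
    return hits / n   # a small p suggests a cluster-specific interaction

labels   = ["trophoblast"] * 4 + ["NK"] * 4
ligand   = [9, 8, 9, 8, 0, 1, 0, 1]   # ligand high in trophoblast cells
receptor = [1, 0, 1, 0, 9, 8, 9, 8]   # receptor high in NK cells
print(permutation_p(ligand, receptor, labels, "trophoblast", "NK"))
```

Pairs whose score survives the shuffling are the candidate cell-cell communication channels between specific cell types.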
And that allowed us to decipher the cellular interactions. On the placental side, the fetal side that you can see up here, you've got the placental trophoblast cells, the placental immune cells, blood vessels and fibroblasts, and the maternal blood is swishing around here. You've got the extravillous trophoblast, which are invading the uterus, the maternal blood vessels, the maternal glands, and maternal immune cells. And there are a lot of
interactions that are taking place that guarantee that
immunological tolerance that I mentioned earlier, and that we discovered in this work. And so this is really a kind of flavor of a Cell Atlas exercise. It's one of the first organs
that was mapped in this way. We mapped all different
regions of the placenta to get a comprehensive overview, and we published this work in 2018. Then of course, over the intervening years, a lot of tissue and organ datasets became available, including those 20 tissues and almost a million cells that I mentioned earlier in that integrated data set. And so when the pandemic
hit in early 2020, we became aware that there
was this virus circulating. What we leveraged was
the Human Cell Atlas data and also the Human Cell
Atlas scientific community to understand COVID-19 and that endeavor has really continued. But I wanna tell you this
story from the very early days of the pandemic, where we mapped the viral entry receptors from the Human Cell Atlas data. So we asked in all of this
single cell genomics data from around the body, where are the viral
entry receptors expressed that could welcome the
virus into the cells? Because of course the virus is docking onto the surface of cells,
and so we're simply asking, can we predict where the virus is entering in the healthy Human Cell Atlas data? And we assembled data
from all around the body. You can see the different tissues here, and then mapped where ACE2 is expressed, and I'm gonna focus very
briefly on the barrier tissues, where the healthy reference
data is probably most useful and is telling us where the virus can hit. And of course, the nose
is one of the main ones where we have the nasal passages
where you've got inhalation of aerosol droplets. And indeed here, we find epithelial cells, goblet and ciliated cells, that have high expression levels of ACE2 and TMPRSS2. So we pointed towards those specific cells as being potential viral entry points. In the lower airways, the bronchi, you've got club and ciliated cells. Sorry, my internet disconnected. - [Priya] Yeah, no worries, yeah. - So we found cells in the eye, in the gut epithelium, the enterocytes, and also in the placenta, which I just talked about.
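That mapping exercise boils down to a co-expression screen, sketched here with illustrative numbers rather than real atlas values: flag the cell types in which both the entry receptor (ACE2) and the protease (TMPRSS2) are expressed above a threshold.

```python
# Sketch of the viral-entry screen: find cell types co-expressing the
# entry receptor and the protease. The expression table is invented,
# for illustration only.

def candidate_entry_cells(expression, receptor="ACE2",
                          protease="TMPRSS2", threshold=0.5):
    return sorted(cell_type
                  for cell_type, genes in expression.items()
                  if genes.get(receptor, 0) > threshold
                  and genes.get(protease, 0) > threshold)

expression = {
    "nasal goblet":   {"ACE2": 1.8, "TMPRSS2": 2.1},
    "nasal ciliated": {"ACE2": 1.2, "TMPRSS2": 1.7},
    "lung club":      {"ACE2": 0.7, "TMPRSS2": 0.9},
    "immune T cell":  {"ACE2": 0.0, "TMPRSS2": 0.1},
}
print(candidate_entry_cells(expression))
# -> ['lung club', 'nasal ciliated', 'nasal goblet']
```

The screen only predicts where the virus *could* enter; as described next, it was later experimental data that turned these correlations into confirmed infection sites.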
There we hypothesized that there are cells that can be responsible for the vertical transmission from the maternal to the fetal side. Now, while this is rare,
what this data shows is that there's a potential
pathway of transmission. These were all predictions about where the virus could enter. We were trying to get
to truth and information and insights into this
infection as quickly as possible using Cell Atlas data. And what's really exciting
is that since then, we published that work last year, it's been cited about 1300 times. It got a lot of attention also from the public health point of view. What's really exciting
is that those predictions have largely come true
from experimental data, where we can see, for instance from nasal swabs, next generation sequencing reads of the viral mRNA inside individual cells in the nasal epithelium. I should also say that in the mouth, which I haven't discussed,
we predict the salivary glands, specifically the ductal cells at the top of the salivary glands, in the ducts, as sites where the virus could enter. And indeed, in these microscopy images, we can see that the virus is indeed sitting inside those cells, and
we published that this year. So for ACE2 and the SARS-CoV-2 virus, the correlation with the receptor does indeed reflect the infection and the infected cells. And so that's really the
last part of my story. I've told you about protein
complex assembly pathways and data mining from protein structures, inferences from single cell genomics to predict cell types and cell states, the clustering, predicting
cell communication of the maternal fetal interface, and viral entry into cells through the cell surface receptors. And so I'd like to summarize
basically at a high level, what this is telling us about correlation and causation. In the protein complex assembly work, with its data mining approaches for linking evolution and assembly, the correlation was experimentally validated using physical chemistry, mass spectrometry, and other methods. In single cell genomics, for the cell-typing models, the data is of such a scale that the interpretation really relies on machine learning and computational tools like clustering. And for the Human Cell Atlas, what I told you about the cell
interactions in the placenta and the COVID-infected cells is that these correlations
are now coming true based on orthogonal
experimental measurements, and later experimental data. And with that, I'd like to thank
you and take any questions. - Thank you so much, Sarah. First of all for your enormous patience through the disruptions for today, what an exciting set of experiments and conceptual model building that you and collaborators
have been working on, and truly interdisciplinary. So thank you so much for
allowing us a glimpse into this world and the results. So I think we have a couple of questions. We had one from Deepti in the chat, and her question is how do you account for
heterogeneity of time? What is the time that you are using in the spatial clustering analysis? - So these are all snapshots essentially. So in the, yeah, so the
example in the spatial genomics of the mouse brain, for instance, you know, that's a snapshot at one point in time through a tissue section, same for the placenta. You know, it's a snapshot of a tissue that you're measuring
at one point in time. Now within that, there are kind of pseudo
temporal relationships between cells, which I didn't discuss because you have cells that
are progenitors of other cells. So you kind of have stem cells that give rise to other cells,
and those relationships, I mean, inferring those relationships is another whole field in and of itself. And actually in some of the
gut data that I mentioned, there are stem cells
and differentiated cells within the epithelial compartment, and in the data projection
that I showed in 2-D, you can almost see those relationships where the stem cells are
kind of more at the bottom, at the root of a trajectory, and the differentiated cells
are kind of more at the top, but we weren't formally
doing that inference here. It just comes out in the
manifold projection, yeah. - Thank you. If I may, I have a question. So, you know, one of the
arguments that is always made about, you know, comparing
disciplinary approaches, say physics and biology,
is that, you know, in physics we have some
guidance from conservation laws. So, you know, you have
conservation of energy, conservation of angular
momentum and so on. And those really form the bedrock on which we build inference models because these give you constraints. So, and the question
often arises, are there, when there are clearly symmetries, do these symmetries similarly
translate into conservation, new conservation laws or are
there just similar sort of, you know, entropy and energy? I mean, are these the
sorts of conservation laws that operate? - No, I mean, that's a great,
that's a great question. I'd say, so there are some, I mean, at one level biology is physics. And so, you know, in
biophysics, like in the, in the macromolecular assembly, you know, at some level, this is physics. It's biophysics, it's physical chemistry. The laws of physics apply,
and of course they apply to all of biology in that sense, but when you're dealing with
the sort of big data sets, then you're not in that regime, but there are still
like sort of, you know, the central dogma that I showed, which is DNA makes RNA makes protein. Like that central dogma would still enter into any of these data mining assumptions. So for instance, we are predicting
cell types from the RNA, but our kind of unarticulated
assumption is that those, those RNA fingerprints will translate into what proteins the cell is expressing. So there are some fundamental
truths that everybody has as part of their kind of
mental map when they're doing, you know, when they're
doing biological science. - But do they then become
constraints as well? So-- - Can, they can, yes, they can. Yes, and, you know, because the central
dogma may not always hold because you may have RNA that doesn't make protein, you know, and so that can, you know,
that correlation may not hold or that assumption may be misleading, but I would say it's the
exception rather than the rule. - [Priya] Right. - [Sarah] Yeah, but,
yeah, I get your point. - Right, because I mean,
I think that is sort of one of the features
when you make inferences in sort of physical systems,
and if you have, you know, you're looking for these
unmapped correlations and when you find them, you still, if some of them violate the conservation of energy, for example, then you know that you are
probably missing a variable or there is a hidden
variable that, you know, you in your inference
structure you have overlooked, or, and so on. So I guess what you're saying is that it's not quite the same. - Yeah, I mean there are,
so I see what you mean. So there are kind of sanity checks, you know, like that. And one of them would be, so
for instance, that, you know, cell types that are, that have developed from
different progenitors or that are completely different from, let's say epithelial and neuronal. So the, you know, the brain
and the epithelial tissues. For those cell types, let's say, logistic regression models, machine learning models, should never cross-match each other. And so that was kind of the
internal consistency check that I mentioned where
we're looking for, you know, cell clusterings that are separate, versus ones that are hitting each other, that are finding each other. There's that kind of
thing that you can use. - [Priya] Right. - So that's based on a sort
of developmental, you know. - Right, so I guess there is-- (cross talking between Sarah and Priya) So I guess the way this
would kind of connect to a physics kind of
argument that I was making is that one constraint
would be that, you know, these cells cannot
transmute into each other. That's a fundamental constraint. - Yeah. - That they have, they are
independent building blocks that don't transform into each other. - Exactly, exactly. - Okay, great. Thank you. Any other questions? - [Participant] I just
wanted to make a comment on what you said at
the last moment, Priya, about the cells can't become another cell. Well, as long as you're
not looking at stem cells, right, Sarah? - Right, right, yes. I mean, I guess they. - Right, but the progenitors, you know, let's say gut, you know, stem
cell wouldn't become a brain. I mean, naturally, you know, so the neuronal progenitors
will be different from the gut, epithelial progenitors. I mean, that's a kind
of reasonable assumption for any biologist I think. You can force them to do
those things in the dish in a Frankensteinian kind of way, but they wouldn't happen
naturally necessarily. - Right, so I think while
we wait for someone else, I mean, I have. - Well, I think Jenny
wants to ask a question. - [Participant] Yep,
Jenny has her hand up. - Oh, there's a hand up. Okay, sorry, please go ahead, Jenny. - [Jenny] Okay, thanks. I just try to be polite. Thanks for the great talk. Concerning Priya's question as well, I would like to follow up on this. You have single cells, and you
try to cluster them by type, but they are not, you
do not take into account the environment of these cells. This comes only when you take into account the Human Cell Atlas. So would your clustering types then change by taking into account the
environment of each cell type? Will you get different clusters, by taking this, the
environment into account? - I mean, so in theory,
you know. I mean, the molecular fingerprint
based on the RNA content should, you know, we think
that that should be sufficient to determine the identity of a cell. But of course, if you have
additional information about the environment, if
you have, let's say a cluster that's large and very dispersed, and, you know,
in the 20,000 gene space, then you may, you know,
having the precise micro, anatomical, micro environmental
niche information, the tissue, may tell you,
oh, you know, in the, in the crypts of the colon, there's this, let's say, intraepithelial T cell type, whereas in the villi, there's this intraepithelial T cell type. And we can now distinguish them, you know, based on very, very,
very subtle differences within this cluster because we have the micro
environmental information. I mean, that's one of the reasons
that we developed this method called Milo
with John Marioni's group, which basically distinguishes sort of very subtle
neighborhoods in a KNN graph of the single cell genomics data based on external information, the kind of metadata parameters that you can give the algorithm. So, yeah, that's a, that's a good point. Taking into account
metadata about location can resolve clusters in a more, in a more fine-grained way. - [Jenny] Thanks. So you get sub clusters. - Yeah, and sub clusters
that you couldn't distinguish otherwise, where there wouldn't be
sufficient information based on the single cell
genomics data alone, but using the metadata, you then have enough
statistical power and basis to distinguish them. - [Jenny] Mmm hmm, thanks. - [Priya] Wilson? - [Moderator] Hussein? - [Wilson] Yeah, I have a
question more about prediction towards the future. So I'm curious, as I know
that the Human Cell Atlas is in the process of building
the map of human cells. So just like we learned
about the differences between the sequences of
individual human genomes on the bulk level, where do you see this going in terms of this single cell level, in terms of different individuals having, I don't know, different
numbers of cells or cell types or expression profiles and so forth. - Yeah, so the human genetics
level for the Human Cell Atlas is basically coming now
for individual tissues. So where it started off, if you remember for the
human genome project, the equivalent is sort of,
the next level was kind of, you know, population genetics or GWAS, and it started off with individual genes rather than the whole human genome. And I would say for the Human Cell Atlas, the counterpart of that is
genetics of let's say blood or genetics of, let's say, you know, small intestinal biopsies, where you are able to gather
samples from hundreds of donors and can then calculate, like you said, differences in abundances of cell types or differences in patterns
of gene expression at the single cell level of, on the human genetic scale
using single-cell eQTL or dynamic eQTL inference algorithms. And that's an area that's developing both on the experimental level and on the computational level, about what's the best way to do that. And in fact, we're collaborating
with Neil Lawrence, who's going to join me as a
discussant tomorrow, in terms of algorithms for analyzing single cell genomics data
at the human genetics level, using Gaussian process
latent variable models. So I think that's where it's going, yeah. - [Wilson] Thanks. - Thanks. I guess we had, I guess,
Sunny, is that you? - So I think Fred was asking
a question, and he was saying. (cross talking with Priya and Sarah) - [Priya] We had one more
person, William, who had. - Yeah, and William, yep. - [William] Ah, hi, can you hear me? - Yep, hi. - Thanks for your talk. I really enjoyed it. I'm not a real scientist,
I'm a social scientist, so maybe, excuse my ignorance on this if it sounds kind of coming from like man on the street type perspective. But over the past year and a half, there seem to be certain countries that I wouldn't necessarily say are immune that have had a lot greater
chance of controlling the spread and transmission of
like the COVID-19 virus. I was wondering if there's any evidence in sort of the genomic field
of whether certain populations might be like more genetically
adept at overcoming COVID than others, say, like in Southeast Asia, perhaps some previous exposure
to regional waves of SARS has given some sort of
advantageous adaptations to the body's dealing with this new virus. - So advantageous adaptations
from the viral point of view? Sorry, I don't quite
understand the question. - [William] Yeah, I
was wondering, I guess, I guess what I'm wondering is
if you looked at all this data that you've collected, if there's any indications
that you're seeing unfold from certain populations
in parts of the world are different in a way that allows a better
response to viral infections? - Oh, okay, now I
understand what you mean. So, the answer to that is we don't have enough data to really, so the answer to that is, there're a couple of different answers. So in this Gaussian process
latent variable model approach that I mentioned, what we can
do is map the associations with immunity to COVID-19 onto both nasal sort of epithelial data, and also onto blood data of populations and show that the OAS1 mutation in this RNA processing enzyme that's part of the innate immune
response against the virus that's present in every
single cell in our body, that there are indeed
sort of genetic variants that lead to a higher expression with a certain splice
variant versus other variants that lead to a lower expression
with a lower splice variant. And that the ones with the mutation that leads to the splicing
truncation have lower expression. And that may explain the higher propensity to severe COVID-19 of people
who have that genetic variant. So essentially we can sort
of take the association data and interpret it in terms of expression based on these data sets. That's one thing that we can do. And that's one thing that we're showing in this unpublished work
that we'll post online soon. And then the other thing that we can learn by looking at population data of COVID-19 is the difference between
children and adults in terms of their innate and
adaptive immune responses. And we have a publication
on that on medRxiv that I didn't talk about.
innate immune response and a more polyclonal,
adaptive immune response. So children are really
kind of better poised to get rid of the virus quickly. Whereas our innate immune
response is kind of more sluggish, our adult response is, and the kids are also developing a kind of T and B cell response
from scratch, if you will. So they don't rely on
their memory T and B cells, which we do as adults. We rely on our immune memory. And so we have fewer,
larger clones of T and B cells. Whereas the kids are seeing
these viruses for the first time and have a much more kind
of diverse population of T and B cells in their response. So that's a kind of, those
are the bits of things that we've learned from this work. There's a lot more to
learn, but it's, you know, we have learned quite
a lot about the innate, about immune responses
in different genetic, with people with different
genetic variants, kids versus adults and so on. And we'll continue to study
this over the coming years because it's, you know,
incredibly important, I think not just for COVID-19
actually, but, you know, we're actually learning new things about virology and immunity overall. - So there's a question by Fred. Let's take that question. - Yep, so Fred is asking, is there something in common between the three parts of the talk? And I would say the central
take-home message is really,
I wanted to illustrate as a common thread
throughout all these things is really the iterative cycle between having big data in biology, using statistical computational
machine learning approaches to analyze it and make predictions, and then going back to
the cycle of experiments to validate those predictions. Like I said, in biology,
we have the luxury of being able to do that cycle because we can interrogate molecular and cellular systems experimentally. It's not like the cosmos or the climate. And really the central message
that I wanna get across is that that's a very powerful paradigm of iterating from biology to prediction with these big data methods, and then back to experiments. You know, and that holds
across structural biology. It holds across cellular biology. It holds across tissue
biology and virology. It's whatever you, wherever
you look kind of thing. - Yeah, and I think, as
you mentioned, Sarah, this ability to do controlled experiments is really key in validating
these predictions for the kinds of conceptual
computational models that are being built. So if there are no further questions, let's just wait for a minute to see
if anyone raises their hand or types in the chat. Let's all thank Sarah
for an excellent talk. - Yes, thank you everybody
for staying here so late. - [Priya] Thank you very much, yeah. - Very much. Yeah.