- Hi, everyone. I'm delighted to welcome you all to our monthly Inference
Colloquium Series. This series, as many of you know, is one of the flagship events of our larger interdisciplinary project that examines the nature of inference, focusing specifically on issues
of correlation and causation across disciplines. It is a very ambitious
interdisciplinary project, and I'm delighted that our speaker today is one of the core members of this team. This project is supported by the John Templeton Foundation, as well as Yale's Franke Program in Science and the Humanities. And I think these epistemic issues are front and center in most academic disciplines
today, and in particular, as we all move into
sort of the big data era in our respective disciplines, the question of how we can infer and produce new reliable knowledge using new tools and techniques is really a very critical
one to explore right now. I am really excited about today's speaker, Dr. Sarah Teichmann, and her
unique take and perspective from the vantage point
of molecular biology. But first I would like
to thank our benefactors, Mr. and Mrs. Richard and Barbara Franke for their generous support
of the Franke Program and other efforts, many other
efforts actually at Yale that bridge disciplines. And I just wanted to take this moment to remind all of you assembled that we are recording this event, and that all participants must
therefore mute their videos. If you wish, you can, as is customary, submit your questions
through the chat feature. We will actually have
a dedicated Q&A session at the end of the talk. So it is my real honor and privilege to introduce our speaker
today, Dr. Sarah Teichmann, who is the Head of the
Cellular Genetics Program at the Wellcome Sanger
Institute in Cambridge, England. And it is also a particular
personal pleasure because she's one of my close friends, and we have known each other through the early stages
of our scientific careers. Sarah Teichmann's research focus has been on understanding
sort of global principles of regulation and gene
expression and protein complexes with a particular focus
on issues of immunity. She earned her doctorate at the
MRC Lab of Molecular Biology in Cambridge, and was a
Beit Memorial fellow at UCL. She started her own group at the MRC Laboratory of
Molecular Biology in 2001 and was also an elected fellow of Trinity College, Cambridge. Her lab focuses on discovering
stereotypical pathways of assembly and evolution
of protein complexes. In 2013, she moved to the
Wellcome Genome Campus in Hinxton, Cambridge jointly with the European
Bioinformatics Institute and the Wellcome Sanger Institute. She's had an incredibly
illustrious career already and has so many prizes, so many honors, that I'm just gonna
cherry-pick a handful of them, because I really don't want
to take up any more time, and I'm dying to hear what she has to say. So in February 2016, she became the head of the Cellular Genetics
program at the Sanger Institute. And co-founded this very,
very exciting initiative called the Human Cell Atlas
International Initiative, which she continues to lead. Sarah is an elected member of EMBO, a fellow of the Academy
of Medical Sciences and a fellow of the Royal Society. So without further ado, Sarah, we are absolutely delighted to
have you speak in the series and really look forward to
today's talk and the discussion that is going to follow tomorrow. So before I hand it over to you, I just wanted to mention that our
discussant for tomorrow is Professor Neil Lawrence, who is the DeepMind
Professor of Machine Learning at the University of Cambridge. So please join us at 3:00 PM EDT tomorrow for a continuation of
today's exciting session. So Sarah? - Thank you so much, Priya,
for that kind introduction and very generous
invitation for me to speak at this exciting
interdisciplinary colloquium. It's an incredible
opportunity to sort of reflect on the field of computation and
theory in molecular biology. And I'm really excited
to be giving this talk. So I've called it "The
Inference of Nature" because what we're inferring
in computational biology, theoretical biology, bioinformatics are molecules and cellular
components of organisms and their interactions. And so it's essentially
prediction of features of nature kind of at that level of
the individual components. And within theoretical
and computational biology at this molecular level, one of the main aims is, of course, to predict causation from correlation. And there are really two
different schools of thought here. One is that you want to get
the mechanistic details right. So if molecule A interacts with molecule B, which interacts with molecule C, and that
causes a pathway or cascade of biological interactions and processes, then you wanna be sure that you're modeling the
sequence in the correct way, with the feed-forward and feedback loops, the individual interactions, and the directions of these
arrows all precisely correct, and ideally with quantitative
kinetics and so on attached as labels
to these interactions. And so even though A may correlate with C, you'd wanna make sure that the model is capturing the correct
sequence of A to B to C and not sort of jumping ahead of itself. And this school of theoretical
and computational biology, this school of modeling
is really concerned with these mechanistic
details and describing them with differential
equations and all sorts of, well, you know, Boolean
models and related methods that sort of fit into
that level of biology. But there's another school of thought that's really more
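As an aside, the mechanistic A-to-B-to-C cascade just described can be sketched as a toy differential-equation model, integrated with a simple forward Euler scheme. The rate constants below are invented purely for illustration, not measured values:

```python
# Toy mechanistic model of a cascade A -> B -> C, integrated
# with forward Euler. Rate constants are invented for illustration.
def simulate_cascade(k1=0.5, k2=0.3, dt=0.01, steps=2000):
    A, B, C = 1.0, 0.0, 0.0
    for _ in range(steps):
        dA = -k1 * A            # A is consumed
        dB = k1 * A - k2 * B    # B is produced from A, consumed into C
        dC = k2 * B             # C accumulates
        A += dA * dt
        B += dB * dt
        C += dC * dt
    return A, B, C

A, B, C = simulate_cascade()
# Mass is conserved: A + B + C stays (approximately) 1.0,
# and by t = 20 most of the material has flowed into C.
```

The point of such a model is exactly what the talk describes: the arrows and their directions are written down explicitly, so A's correlation with C is explained through B rather than modeled directly.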
concerned with correlation rather than those mechanistic details, and for very good reason. And that is that
we can calculate linear, non-linear, all different kinds of
correlations in any number of dimensions. Shown here is a very simple
correlation between A and B. And often we're doing that, not between two simple molecules and a few data points, but
in absolutely huge data sets. And the data sets that I'm showing here are actually quite modest on
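For concreteness, here is the simplest version of the calculation being described, a Pearson correlation between two measured variables standing in for A and B, in plain Python:

```python
import math

# Pearson correlation between two measurement vectors, e.g.
# expression levels of two genes across the same set of cells.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

a = [1.0, 2.0, 3.0, 4.0]     # invented toy measurements
b = [2.1, 3.9, 6.2, 8.0]     # roughly 2x a
# pearson(a, b) is close to +1: a strong linear correlation
```

In real single-cell data sets the same computation runs over thousands of genes and millions of cells, which is what makes the statistical machinery necessary.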
the scale of what's available to biology in this day and age and for many different
types of measurements. And the field that I'm
most active in at the moment is genomics and functional genomics, where we're measuring, for
instance, the mRNA levels, the expression levels of all the genes in the genome in millions of cells at the same time. And so the size of these data
sets are absolutely huge. And we can calculate the
correlations and relationships between individual
genes and represent them as these networks, for instance; that's
one way of mining data. We can also collapse high dimensional data that's in 20,000 dimensions
into a two dimensional space. And what's shown here
on the right-hand side are individual cells, and you can see how there are
hundreds of thousands of them projected into this
little panel on the right-hand side here, where I'm showing my red cursor. And so the only way to really
tackle these data sets is using statistical
and computational tools from data science, as
well as machine learning and, more globally, artificial
intelligence methods, deep learning and so on, because
the magnitude of the data is simply so enormous that it doesn't make sense to tackle it with differential equation-based mechanistic models. And what we're extracting
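As a hedged sketch of what "collapsing 20,000 dimensions down to two" involves: real pipelines use PCA followed by dedicated methods such as UMAP, but the core idea, finding the main axis of variation, can be illustrated with a pure-Python power iteration. The three-gene "cells" below are invented toy data, not a real data set:

```python
import random

random.seed(0)  # deterministic starting vector

# Extract the top principal component of centered data by
# power iteration, then project each "cell" onto it.
def top_component(data, iters=200):
    d = len(data[0])
    means = [sum(col) / len(data) for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        # multiply v by the covariance structure (X^T X v)
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v, means

cells = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.2],   # invented expression
         [3.0, 6.0, 0.1], [4.0, 7.9, 0.3]]    # profiles of 4 cells
v, means = top_component(cells)
embedding = [sum((x - m) * vi for x, m, vi in zip(row, means, v))
             for row in cells]
# `embedding` is each cell's coordinate along the main axis of variation
```

A two-dimensional embedding like the one in the slide repeats this idea with far more sophisticated, non-linear machinery, but the goal is the same: a low-dimensional map of very high-dimensional cells.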
are correlations if you will, in these very large data sets
in order to extract principles of the molecules, their interactions. On the right-hand side, it's actually molecular
fingerprints of cells. So it's basically features of these biological building
blocks from large data. And really what I'm trying
to say is that these correlations,
these relationships in this data space, are informative. And sometimes we don't really care whether they're causal or not. It's just that the correlation is
enough to make a prediction. Take, for instance:
red sky at night, shepherd's delight. Red sky in the morning,
shepherd's warning. That little ditty, the idea that you can predict the weather from the color of the sky, is a very powerful
predictive principle, but of course the color of the sky at night doesn't cause the weather the next day; the correlation is simply good
enough to be very powerful. Sometimes predictions in and of themselves are really what we care about, even if we don't understand the detailed mechanistic reasons, even if we haven't deciphered every
single causal, you know, element in the process. - [Priya] So Sarah, if I may
ask for a quick clarification. - Sure. - In these sort of highly
multidimensional spaces, is what you're doing predicated on the fact that you know all the
variables in question? Or is there an element of also looking for what the potential variables might be? And I ask only because it's such a huge
multidimensional space. - So it's a great question. I think this
comes back to the question: is data mining unbiased? These methods that I'm
showing here for the calculation of graphs between molecules
from high dimensional
single cell genomics data, or the manifold projection,
which is a uniform manifold
approximation and projection (UMAP) that I'm showing on the right-hand side, these are unbiased. But in doing these calculations, we have, as the
scientist, a certain hypothesis or a certain mental framework that drives us to actually
simplify and project the data in this way, even if the
computational methods are unbiased. That's getting a little
theoretical and philosophical. But for instance, what I'm
showing here is a graph calculated through the correlation
of transcription factors. And as such, it's completely unbiased. That's a completely
hypothesis-free method. But actually
I have a sort of hypothesis that I haven't articulated here. And that is that the subset
of genes that I'm showing here are transcriptional regulators. They are the class of genes
that switch other genes on and off. And so the reason that
we did this analysis was that our hypothesis is that these are the key regulatory factors that determine these cell states. And that's why we're calculating the correlation between
their expression levels. - And you have confidence. - So I think there's always
a sort of yin and yang in data mining. There
isn't a contradiction between the hypothesis-driven
mechanistic modeling and the unbiased, quote unquote, data
science kind of approach, because they're not completely different worlds. There's
often a hypothesis or a sort of framework
that you start from, even if that ends up surprising you and being not correct, and
you discover new things using the unbiased machine
learning or data mining tools. Does that make sense? - Yeah, thanks. - So that really, your question
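Connecting back to the transcription-factor graph mentioned a moment ago: the construction is just "nodes are genes, edges join pairs whose expression profiles correlate above a threshold." A toy sketch, where the gene names, expression values, and the 0.8 cutoff are all invented:

```python
import math

# Toy correlation graph over transcription factors: nodes are
# genes; edges join pairs whose expression across cells
# correlates above a threshold. All names/values are invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

expression = {                          # gene -> levels in 5 cells
    "TF1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TF2": [1.1, 2.2, 2.9, 4.1, 5.2],   # tracks TF1 closely
    "TF3": [5.0, 1.0, 4.0, 2.0, 3.0],   # unrelated profile
}

edges = sorted((a, b) for a in expression for b in expression
               if a < b and pearson(expression[a], expression[b]) > 0.8)
# edges == [("TF1", "TF2")]: only the co-varying pair is linked
```

The hypothesis hiding inside the "unbiased" method is visible here: it enters through the choice of which genes to include and what counts as a meaningful correlation.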
really comes back to this sort of supposed dichotomy
between the modelers who do mechanistic modeling
and the data miners. Are they
coming from completely
different ends of the world? No, they're not. Actually, both
of them are scientists who start
from certain hypotheses. Yes, the data mining approach can seem completely hypothesis-free and unbiased, but actually you always start from a certain way of analyzing the data, which in itself implies a hypothesis. You then discover components, which in turn can enter the
mechanistic models and so on. So at the end of the day, there's really a productive symbiosis between these different approaches within the domain of theoretical
and computational biology. And of course, in molecular
and cellular biology, we are in a luxurious situation where we can actually interrogate
the systems experimentally that we're studying. So we're not operating
in a domain like climate change and climate modeling, where we can't tinker with one variable, one factor,
and see what the outcome of that perturbation is. Or let's say the cosmos, which, of course,
my friend Priya works on, where it's very difficult to
eliminate one planet or something like that
and see what changes. In biology, that's not the case. We can actually do experiments. And that in a way means that the
theoretical and
computational approaches, modeling and inference, have
always been part of biology to some extent. And one example is genetics. And what I'm showing
here is Gregor Mendel, the monk who observed
peas and their variation in color, size, shape,
flowers and so on, according to their inheritance. From the patterns that he observed in the offspring's features, he inferred that there
must exist inherited factors, which
came to be known as genes. And so these earliest
genetics experiments, relating changes in features to
genetic crosses, are making an inference. The stroke of genius
was to infer that there are factors
that transmit these inherited features. What I also mentioned
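Mendel's inference step can be made concrete with a toy Punnett-square enumeration, here a hypothetical Aa × Aa cross with "A" dominant, which reproduces the classic 3:1 phenotype ratio he observed:

```python
from itertools import product

# Toy Mendelian cross: Aa x Aa, with "A" dominant over "a".
def cross(parent1, parent2):
    # Each offspring receives one allele from each parent.
    return ["".join(sorted(p)) for p in product(parent1, parent2)]

offspring = cross("Aa", "Aa")          # ['AA', 'Aa', 'Aa', 'aa']
dominant = sum("A" in g for g in offspring)
recessive = sum("A" not in g for g in offspring)
# dominant:recessive comes out 3:1, the ratio from which Mendel
# inferred the existence of discrete inherited factors
```

The inference runs backwards from what the code runs forwards: Mendel saw the 3:1 output and deduced the hidden paired factors that would generate it.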
is that in cell biology, and this is a case of
developmental biology, so embryonic development, there are other kinds of perturbations. In genetics, variation
is basically the perturbation.
Here in developmental biology, it's taking a bunch
of cells or a tissue section and transplanting it into another embryo and showing that you can
control axis formation through that region called the organizer: a region of donor tissue that
acts on a host tissue, so that the host tissue takes on the
features of the donor tissue. And so that's
another kind of perturbation. The third kind, more modern, getting towards more high
throughput experiments in genetics is exemplified by this classic experiment by Nusslein-Volhard and Wieschaus where they did chemical
perturbations in Drosophila embryos to uncover the genes that
control the patterning of the embryo, where you get these beautiful striped-like
patterns that are controlled through a hierarchical set of genes that determine the polarity, the gaps, and the smaller stripes, and ultimately these iconic homeotic genes that control the anterior-posterior
axis. And this was through systematic chemical
mutagenesis and observation by Nusslein-Volhard and Wieschaus. So we have the genetics and
perturbation experiments. And we also have, what's
more, the molecular and biochemical level of experiments on individual molecules. And what I wanna use
to exemplify this here, where inference plays a role
at multiple different levels, is making molecular models. So, making models of the
three-dimensional structure of molecules. And of course the legendary
example is the double helix, where you had two different
approaches to tackle it. On the one hand, the careful crystallographic
experimental approach pursued by the group in London, and the Franklin and
Gosling paper that publishes the x-ray diffraction pattern of crystals of deoxyribonucleic acid. And then of course, the double helix is inferred computationally
from the diffraction pattern. And that in itself is a kind of inference in the sense that you're
calculating or inferring what the three-dimensional
structure of the atoms must be from the x-ray diffraction
pattern measurements. So there you can see that
the experimental measurement really relies on a computation. It's not like Gregor Mendel's peas, where he's simply looking
at the color with his eyes. Here, in order to interpret
this very complex data, this x-ray diffraction pattern, and then sort of predict
the double helical structure from this pattern, what's
required is computations, and that require a computer basically. And on the other hand
then, Watson and Crick, of course, used a combination of
different pieces of evidence to build the model in a more, in a more conventional way
with actual physical models by using the information
from the diffraction pattern and the step, the distance between the steps
and the rungs of the ladder, or the base pairs in the DNA double helix, combined with chemical information about the complementary
(audio blurs) pairing of A with T and G with C, and a few other pieces of evidence. And then they stitched together these orthogonal pieces of evidence to come up with a double helical model. So there are two different
types of inference here. One is the calculations on
the experimental measurement, and the other is basically
putting together different pieces of experimental data. But the point that I
wanna bring across here is that the experimental data
was key to make the models. And at the same time, the
model had to be validated by experimental data. So this molecular
structure prediction, which I'll also talk about later, is very closely intertwined with the observations from experiments, coupled with different
types of inference, computational and more
intellectual sorts of modeling. And this brings us to the challenge of
predicting protein structure.
Priya, just before the talk, raised the question of AlphaFold,
which of course has been very widely publicized as a deep learning approach for systematic structure prediction. This has been a challenge basically from when the first protein
sequences were determined using Sanger sequencing. And the simple
little sequence of insulin for instance, had to wait for many years, until Dorothy Crowfoot Hodgkin
solved the crystal structure. And that is because of
the powerful information that's inherent in these
molecular structures. You just saw the double helix, which is a very clear one, because it basically
immediately gave the clue that the genetic code consists
of this sequence of four letters, A, T, G, and C, coding for the proteins, which consist of a 20-letter alphabet. It's perhaps a bit more difficult to see what's so important about knowing these molecular structures, but basically you'll
just have to bear with me and believe me when I
say that this structure also gives a lot of information about the function of the protein and what its interaction
partners could be, and so on and so forth. And so this exercise in critical assessment of protein structure prediction, which has been
a recurring competition in the
community to assess or benchmark what the best structure
prediction methods are, and which has been going on for
14 or more iterations, has in a way culminated
in this deep learning approach. And the reason for that
is that this approach can now at this juncture in our history draw on such rich experimental data sets, where there are on the order
of hundreds of thousands of individual protein structures that the rules for making
these proteins structures can now be learned in an automatic way. And that wasn't possible in the early days when there was much less data, but there are many areas of
biology for which that's true. So the protein data bank is the repository of crystal structures, the three-dimensional atomic
coordinates of molecules, proteins like I just showed you, and that's been going on for 50 years now, and it's this incredible
big data resource. What I'm showing here are
other databases in biology: the protein sequences in
UniProt, genes and genomes in Ensembl, and then,
down at the bottom, two sort of more functional genomics databases that I'm involved with: the
Human Cell Atlas Data Portal and the EBI Single Cell Expression Atlas, which are portals for single
cell genomics data, in particular the gene expression data at the level of single cells, which has become possible through a resolution revolution in genomics that allows us to measure the
expression of single cells, as I'll come to later in my talk. - [Man] I haven't heard of UniProt. What is that? - UniProt is the protein
sequence database, so the amino acid sequences of proteins. Yeah, so my point here was really that there are a lot of databases now that provide the substrate
for data mining in biology. And this is really a development
that's gone on over decades, but it's really accelerated
over the past few years, so that we're now in the era
of big data in biology. And there's absolutely
no question about that. It's an
exponential increase in data that's happening at the moment. And there are a range of
theoretical numerical approaches that aid inference. And I've said on the one
hand there's modeling. There are also in silico simulations, in silico representations of systems where you can perturb systems
completely theoretically. And so in contrast to these
modeling and simulation approaches, there are data science approaches. Some of these are cases
where the data on its own is self-evident, as with
Gregor Mendel, but there are also these
larger scale approaches, as with AlphaFold, where
it's the structures; and with these big data approaches now, the methods are really
statistical, computational, machine learning and
AI, as I've mentioned. So, given that
there is this very big and important field of
computational biology in molecular biology, there can be a situation
where this field, which wasn't a
traditional part of biology, this field of computational
biology and of using big data to analyze and predict
biological structures, biological interactions,
molecular interactions, and so on, has meant that
some experimentalists might view the theoretical
component of molecular biology with suspicion. Okay, and I'm gonna pick up right here and dive into the main meat of my talk, which is going to be three elements. One is talking about predicting pathways of protein complex assembly, which is in that area of
predicting molecular structure. The second is predicting cell types using single cell genomics data. And the third is the Human Cell Atlas and using Human Cell Atlas data to predict cell
communication and infection by SARS-CoV-2, the virus that causes COVID-19. So if you think about
the inside of a cell, and the reason why it's important to predict how protein
complexes assemble, this is a molecular simulation
of the inside of a bacterium. And you can see that essentially, it's a very, very crowded place. All these proteins and
protein nucleic acid complexes are multi-molecular complexes that are rubbing up against each other in a sort of gel-like, very,
very compact environment where there
are hardly any water molecules between them, although they
are in an aqueous solution. And so understanding how these large multi-subunit
components, these globules that you can see in the
interior of the cell assemble and how they find their partners is a really fundamental
question in biology that kind of goes beyond
those individual components that are predicted by
AlphaFold to the next level of these molecular assemblies. And that's an area that I
worked on for over a dozen years and where we're using the big
data for molecular structure. So what we're
looking at are proteins, the amino acid sequences, and they are really the output of genes. So genes are the code level, the DNA. They're transcribed into messenger RNA, which is the messenger molecule, the intermediate level of information that we'll come back to later. And that's then translated
into the protein level, this amino acid level. And the question that
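The DNA-to-mRNA-to-protein flow just recapped can be sketched with a toy code table; only a handful of the 64 real codons are included here, purely for illustration:

```python
# Toy central dogma: DNA -> mRNA -> protein.
# Only a few codons are included; the real table has 64 entries.
CODONS = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}

def transcribe(dna):
    # Transcription: replace thymine with uracil.
    return dna.replace("T", "U")

def translate(mrna):
    protein = []
    for i in range(0, len(mrna) - 2, 3):   # read codon by codon
        aa = CODONS[mrna[i:i + 3]]
        if aa == "*":                      # stop codon
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGTTTGGCTAA")   # "AUGUUUGGCUAA"
# translate(mrna) == "MFG": a three-residue toy peptide
```

The amino acid string produced at the end is the level at which the assembly questions in this part of the talk are posed.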
we're asking in this work is how do protein complexes assemble? And can we predict
these assembly pathways? And so what we needed to
have in the first place was a big data set that we could build our models from. And we built that data set from over tens of thousands
of crystal structures that were available at the time. We started this work in the early 2000's, and we published this
complex database in 2004. So we made our own database, where we modeled protein complexes, which we first of all predicted to be in their physiological conformation, as graphs of the individual
amino acid sequences, where the edges that are
shown here in the graphs are the physical interfaces of the interacting protein subunits. And so this then allows
us to use graph theory as a theoretical basis
for relating complexes, for matching them to each other, for finding subcomponents of these graphs that are connected to each other. What's important to
understand here is that over half of all protein complexes are assemblies of multiple
subunits of the same type: they're called homomers. They're basically repeats
of the same subunit, and the subunits can be related to each other through axes of rotation. They can be twofold,
as you can see here in a dihedral symmetry where there's a twofold axis of rotation, or they can be cyclic, which
can have any number of elements around the circle, sort
of in this donut shape. We've got four here, but there
could be six, seven, eight and so on. And then all homermeric complexes, because they're closed symmetries, consist of combinations of these dihedral
two-fold axis of rotation and the larger cyclic axis of rotation. And the important difference
between these is that in a, in this kind of flat donut-shaped sort of repetitive structure
with a cyclic axis of rotation, you've got the head of one subunit connecting to the tail of the next, the head of one to tail
the next and so on. And so this interface consists
of two different surfaces. Whereas in the dihedral case, with this two-fold axis of
rotation, it's actually the exact same interface. It's two heads that are
contacting each other. It's exactly the same
surface, not
mirrored, but reused within this dimeric protein structure. And so that basically means that there are different evolutionary pathways
and pressures that make these different
symmetries of protein complexes. And the very simplest, as
I said, to build something, either in evolution or
assembling inside a cell would be for a single protein
subunit to stick to itself. So it encounters another copy of itself, and it sticks to itself
through a dihedral axis, two-fold axis of rotation,
where it's like a handshake. It's like you shaking
your neighbor's hand with your right hand: it's
the palm of your hand contacting the palm of their hand, so it's exactly the same
surface sticking to itself. It's a handshake kind of symmetry with a single two-fold axis of rotation. And what that means in evolution is that if there's one mutation, let's say in the fingers of your hand, that increases the affinity or
stickiness for the other hand in the handshake, then that
mutation will count twice and will increase the stickiness
of the two hands twice, because you have that symmetry. And you can have that process occurring again, with another set of
two-fold axes of rotation, to make a dimer of dimers, in other words, to form this tetramer. So once you've got the dimer, the two dimers can
interact with each other and form the tetramer, again with interfaces that are reusing the same surface. And a different evolutionary, or even kinetic, kind of
assembly scenario in the cell is that three subunits
encounter each other, and form a triangle here, with interactions that are
occurring within the same plane, with a single axis of
rotation down the middle, in a cyclic manner. And here what we need
is, as I said, for the head of one subunit to interact with the bottom of the other subunit, and for this to occur
for all three subunits. One mutation would therefore
count three times amongst those three subunits. And so you get a weaker or
slower kind of increase of the affinity for the three subunits. So one mutation would kind of
weakly increase the affinity across all three
interfaces in the same way. The same would go for four
subunits or five subunits. And this series can kind of
increase in the same way. So you see, you're kind of
building up these rings, and a single mutation
would slowly increase the affinity across the whole ring, across all the different
interfaces at the same time. Now, building up these stacks of rings can occur in different ways: the hexamer over here, with the dihedral three-fold symmetry, could form through trimerization of the dimers, so it can come through three dimers sticking to each other. The octamer can be four dimers, the decamer five dimers, and so on. Or alternatively, we have these other pathways where a trimer can
stack to form a hexamer, or a tetramer can stack to form an octamer, and so on. So there are two different paths to get to, for instance, the hexamer. One is from the monomer to
the dimer to the hexamer; the other is from the monomer to the trimer to the hexamer. And these different intermediate states can be intermediates both in evolution and in the kinetic assembly in a cell, in a biochemical sense. That's what
we showed in this work. And this was a paper that kind
of synthesizes these ideas, but that builds on several other theoretical
and computational analyses. And what we showed here
was the key principle that these pathways are
reflected in evolution, in the sense that you find monomers related to dimers and tetramers, but also monomers related to trimers, and trimers related
to hexamers, while the pathways
that are not connected do not occur: for instance, a tetramer is never, or only very, very
rarely, connected to a hexamer. A trimer is very, very rarely
connected to a tetramer and so on. So the arrows in green and
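One simple way to summarize why some transitions occur and others essentially never do: stacking closed rings means the subunit count can only grow by whole multiples, so an n-mer can be an intermediate for an m-mer only if n divides m. This is my paraphrase of the observed pattern, not the authors' actual algorithm:

```python
# Sketch of the observed transition pattern: a closed, symmetric
# n-mer can serve as an intermediate for an m-mer only when
# n divides m. A paraphrase of the pattern, not the paper's code.
def allowed_transition(n, m):
    return m > n and m % n == 0

assert allowed_transition(2, 6)        # dimer -> hexamer: observed
assert allowed_transition(3, 6)        # trimer -> hexamer: observed
assert not allowed_transition(4, 6)    # tetramer -> hexamer: not seen
assert not allowed_transition(3, 4)    # trimer -> tetramer: not seen
```

Under this rule, the green and red transitions in the slide are exactly the divisor-compatible steps.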
red that you're seeing here are really the main
transitions that we see, and that holds both for
the evolutionary pathways and for assembly pathways. And we think that the reason these principles
of evolution and assembly of protein complexes are
mirrored is that it's the size of the interfaces
that really determines the intermediate forms that
are conserved in evolution, and also the intermediates
that are formed fastest kinetically in a living cell. And so what that means
is that, for instance, if you had the trimer as the
intermediate to the hexamer, then those trimeric interfaces would be larger, would have a higher
affinity for each other, would be more stable; or if it were the dimer
that's the intermediate for the hexamer, it would be that dimeric
interface that's larger than the trimeric interface. And that was really the
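That insight, that the larger interface marks the earlier intermediate, can be phrased as a tiny predictor. This is only a sketch; the subunit names and interface areas below are invented:

```python
# Sketch of the interface-size rule: predict assembly order by
# forming the interface with the largest buried surface area
# first. Subunit names and areas (in square angstroms) are
# purely illustrative, not measured values.
def predict_assembly_order(interfaces):
    return [pair for pair, area in
            sorted(interfaces.items(), key=lambda kv: -kv[1])]

interfaces = {
    ("red", "blue"): 2100.0,
    ("blue", "yellow"): 1400.0,
    ("yellow", "green"): 800.0,
}
order = predict_assembly_order(interfaces)
# The largest interface, red-blue, is predicted to form first
```

The published work extracts this ranking from structural data at scale; the sketch only shows the shape of the prediction.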
insight in this paper that comes through a large
data mining exercise. And we expanded this concept
for heteromeric complexes. These are complexes that consist of subunits of different types. So here you'd have an octamer
that consists of two subtypes of the yellow subunits, two
of the red, two of the green, two of the blue. And if these
assembled in a random manner, you'd still be there
in the middle of the night, trying to find the correct
order for these jigsaw pieces to come together. Whereas if there's an ordered assembly, where the blue and the red
always form together first, and then the yellow, and then the green, then you've got a sort
of click, click, click, Ikea kind of assembly, and it's very rapid and efficient, and there are no
mis-assemblies or sticky aggregates
that form erroneously inside the cell. And so it's really the
speed and the efficiency of that assembly that's driving that, and we show that these
heteromeric assemblies also are reflected in the
evolutionary conservation. And in this case, what we're using is the principle of
gene fusion and fission because, of course, protein
subunits that are fused within the same gene and that are part of the
same polypeptide chain will be covalently linked, and will therefore also be kinetically the most efficient subunits to form interfaces first. And indeed, what we show is that the genetic organization of protein subunits reflects their assembly order. And then these predictions, which are based on a combination of protein structure and of gene structure in the genome sequences of organisms, were verified experimentally through a beautiful collaboration that we had with Carol Robinson's group. Carol Robinson is a physical chemist and was one of the inventors of macromolecular mass spectrometry, where you measure the mass
of intact protein complexes. Through collaboration with her, and based on expressed proteins from many generous collaborators who gave us the reagents, we showed in in vitro biochemical experiments that our pathway predictions were accurate in the vast majority of cases. And so overall in this body of work, we're showing that
protein assembly pathways just like protein folding
itself, are ordered processes. They're fast, they're spontaneous, they're predictable, and they're
also conserved in evolution. And these predictions are validated through these physical
chemistry experiments. And so you can think of it this way: the amino acid sequence doesn't only encode the structure; it also encodes assembly instructions. You can think of it as a set of building blocks. Here we've got two building blocks, and the instruction is that the red interface always binds with the blue interface, while G is neutral and exposed to the solvent. And that set of simple instructions for these building blocks would encode this cross structure: we've got the red connecting to four blues, and it makes this structure. In a way, protein complexes
are kind of like that, and we can encode them using graph theory and complexity theory. This was a fantastic collaboration with Sebastian Ahnert, my collaborator in the physics department. And we show here that we can develop a sort of shorthand for what the individual protein subunits are and what interfaces they will form with their partners. So we've got gray and white here: A interacts with C, C interacts with D, and that basically forms this square structure.
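One way to build intuition for how a small interface rule set fixes the overall topology is a toy geometric model (my own illustrative sketch, not the actual graph-theory formalism of the paper): reduce each interface rule to the turn it imposes between consecutive subunits and walk on a grid. A 90-degree rule closes into a four-subunit ring; a straight rule grows an open chain.

```python
# Toy model: an interface pairing rule is reduced to the turn angle it
# imposes between consecutive subunits. We walk on a 2-D grid and check
# whether the assembly closes into a ring or keeps growing as a chain.

def assemble(turn_quarter_turns, max_subunits=24):
    """turn_quarter_turns: number of 90-degree turns per added subunit."""
    x, y = 0, 0            # position of the first subunit
    dx, dy = 1, 0          # direction in which the next subunit is added
    for n in range(1, max_subunits + 1):
        x, y = x + dx, y + dy
        if (x, y) == (0, 0):          # walked back to the start: closed ring
            return ("ring", n)
        for _ in range(turn_quarter_turns % 4):
            dx, dy = -dy, dx          # rotate the growth direction by 90 deg
    return ("chain", max_subunits)

print(assemble(1))  # 90-degree turns -> ('ring', 4): a square-like closed form
print(assemble(0))  # no turn -> ('chain', 24): an open linear structure
```

The point of the sketch is only that the same building blocks with different interface rules yield qualitatively different architectures, which is what the shorthand notation captures.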
On the other hand, if we've got gray and white but C interacts with B, and A with A, we'll get this linear structure. And with equal numbers of building blocks, which are the general rule in nature, if you take those principles into account, you end up calculating
what we describe as a periodic table of protein complexes. And the power of the periodic table of the elements is similar, in a way, to this periodic table of protein complexes: in the case of the periodic table, the principles of atoms and electrons predict features of the elements, while here, the principles of evolution and assembly predict structures, in terms of the number of repeats and the number of subunits, for cells in this matrix that were not filled in by experiments but that we were able to fill in computationally. And those predictions were then also verified by later releases of the Protein Data Bank. We also showed how this is accommodated in the translation of proteins. And what you're seeing here is
a simulation of two elements of the ribosome. This is from Adrian Alcock
from IO in collaboration with my group, and this simulation, this molecular simulation
shows two polypeptide chains coming out of two adjacent ribosomes sitting on a messenger RNA. And what that tells us
is that the N-terminus that comes out of the ribosomes first, so that's the beginning of the chains, is likely to also interact first. And taking into account these in vivo kinetics of co-translational assembly also gives us much more detailed information
about the constraints on making homomeric proteins, and that the parts of
the amino acid chains that come out first are
interacting with each other before the other parts assemble. And so that's a kind of more
detailed level of the principle of how these polypeptide chains assemble in terms of homomeric relationships, where copies of exactly the same protein are interacting with more copies of themselves. So in summary, we've used
evolutionary relationships and principles of protein biophysics to predict assembly pathways. They're inferences from
thousands of protein structures and evolutionary sequence
relationships between proteins as well as gene structure. And our predictions were
experimentally validated both by macromolecular mass spectrometry and also by structures of proteins. And so that's really the lesson about predicting protein
complex assembly pathways. And you can see from this little story that it was a synergistic
exercise basically that involved bioinformatics
at the structural level, bioinformatics at the sequence level, simulations like the molecular simulation that you just saw of the
ribosome making these proteins and coming out of the ribosomes, coupled with physical chemistry, macromolecular mass
spectrometry experiments, and also by physical experiments that I didn't show. So this science is really a collaborative effort between different disciplines and different types of scientists working together to try to discover the truth, essentially. In the second part of my talk, I'm going to shift gears and go on to predicting a completely
different building block of life, and that is the cell. The basis of this is really the evolution of genomics, from sequencing DNA, as in the Human Genome Project, to sequencing RNA, which gives us a molecular fingerprint of cells in terms of the subset of RNA that's inside each single cell. And it's that subset of messenger RNA that tells us about the molecular features of the cell, and about what proteins would also be expressed inside that cell. And of course the cell is basically the fundamental unit of life. It's a component of the tissues, and the tissues in our body are the individual
micro-environments of organs. So the nose here, upper
respiratory system, the lungs, and lower respiratory system, the thymus, which is where T-cells are
made, the heart, and so on. Each one consists of many
different types of cells, and conventional RNA sequencing, sort of conventional bulk genomics, used to require thousands and thousands of cells to be mashed up together before the nucleic acid was extracted and put on the sequencer. Over the past almost 15 years, genomics has undergone a so-called resolution revolution, where we're now able to sequence
the nucleic acid content from each individual cell in a sample. And that's called single cell genomics. And it's that has really
opened up the ability to interrogate single cells
almost more powerfully than using a microscope. Because what we can do is
isolate individual cells, either in well plates or
using microfluidic robotics, and then extract or
label the messenger RNA, the comprehensive nucleic acid
content of each single cell and sequence that, and then
analyze the vast data sets that tell us the genes that
are active in single cells. And this has been a series
of technological innovations that's gone on from 2009 to date really with many different types
of isolation technologies, genomics protocols and
computational methods that have evolved at pace. Now individual experiments routinely encompass on the order of a million cells. And of course, that advance in technology has been, you know, absolutely
revolutionary for biology; I think it's not an overstatement to say that. And it's also been coupled, slightly behind single cell genomics, with a spatial genomics revolution, which allows us to measure
nucleic acid content of tissue sections, where
the cells are actually in their native tissue context. So you're then taking a slice of a tissue, like a mouse brain that's shown here, and what we see here is the
expression of six genes, each in a different spatial region. And this is from members
of my group together with Omer Bayraktar's group
and Oliver Stegle's group that have developed a statistical probabilistic
inference framework for mapping individual cells
to spatial genomics data. And from the single cell genomics data, one of the key data mining exercises is to infer cell types. And the way we do that is using a set of algorithms called clustering, which, you know, encompasses a vast array
of different approaches, but essentially what the exercise entails is finding individual data points that are similar to each
other and then grouping them. And what you're seeing here are data sets from gut, embryonic and fetal on the left, pediatric on the right. So they're intestinal samples, and each little spot on this
two-dimensional projection consists of a single cell. And that data point
encompasses in and of itself thousands and thousands
of gene expression levels. So it's a vast matrix of
hundreds of thousands of cells, each with thousands of genes expressed. And the total number of genes, of course, in our human genome is somewhere between 25,000 and 30,000, depending on how you count. And so this matrix that we're clustering consists of hundreds of thousands of cells multiplied, you know,
across roughly 25,000 genes. And then the exercise is to cluster those data points that are similar to each other, and that is basically then our interpretation of the cell types that are present in the data set.
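As a minimal illustration of that clustering exercise, here is a plain k-means sketch on invented expression vectors; real pipelines use far richer graph-based methods on roughly 25,000 genes, so treat this only as the shape of the computation.

```python
import random

# Toy version of the clustering step: group "cells" (expression vectors
# over three made-up genes) by similarity, with a minimal k-means.

def kmeans(cells, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(cells, k)          # pick k cells as starting centroids
    labels = [0] * len(cells)
    for _ in range(iters):
        # assign each cell to the nearest centroid (squared distance)
        labels = []
        for cell in cells:
            d = [sum((c - m) ** 2 for c, m in zip(cell, centroids[j]))
                 for j in range(k)]
            labels.append(d.index(min(d)))
        # recompute each centroid as the mean of its assigned cells
        for j in range(k):
            members = [cell for cell, lab in zip(cells, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(g) / len(members)
                                     for g in zip(*members))
    return labels

# Two invented "cell types": high gene A versus high genes B and C.
cells = [(9, 1, 0), (8, 2, 1), (9, 0, 1), (1, 9, 8), (0, 8, 9), (2, 9, 9)]
print(kmeans(cells, k=2))  # cells 0-2 share one label, cells 3-5 the other
```

The grouping itself is hypothesis-free, which is exactly why the annotation step afterwards still needs biological knowledge.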
You can see here that the interpretation is a combination of clustering and data mining, but also, kind of like Priya was asking at the beginning, there is intellectual input. There are hypotheses. There is external data that we're putting into the interpretation. And the cell types
are shown in different colors and annotated in different ways. So you see immune cells,
they're kind of lumped together, neurons of the gut. We have a kind of brain
that surrounds our gut, interestingly, they're labeled in yellow. We've got smooth muscle
that's around the gut and that's responsible for
sort of moving material through the gut, and
that's shown in brown, and the enterocytes that are absorbing material inside the gut are all shown in the blue colors here. And so there are different cell types shown in different colors for the different clusters
based on the similarity in their expression profiles. And so, again, how the
cell clustering works is that we're grouping cells with similar expression patterns. In a way it's a hypothesis free grouping. It's a data mining
grouping on the one hand. On the other hand, as I said, there is in the final
interpretation of the data, there's often external
information that enters it. So there's a kind of intellectual puzzle where the scientist is
also putting in data. This is high dimensional data, and different clustering algorithms can indeed produce different results. So how do we know what result to use, and what's the best algorithm? One of the tests for this is theoretical, and that is self-consistency. And our single cell clustering assessment framework is attempting to do exactly that. (Sorry, the acronym here should have two C's and one A; that was a slight error.) What we're doing is a cross-validation, an assessment of the clustering results: we test whether two clusters can actually model and discover each other, using a logistic regression machine learning method, or whether they are sufficiently distinct that they don't cross-match each other when we model their properties in terms of their weighted levels of gene expression.
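The self-consistency idea can be sketched as follows. A nearest-centroid classifier stands in here for the logistic regression used in the actual framework, and the expression vectors are invented; the shape of the test is the same: train on half the cells of each cluster, then ask how often held-out cells are assigned to the other cluster.

```python
# Sketch of the cluster self-consistency test (simplified stand-in:
# nearest-centroid instead of logistic regression; made-up data).

def centroid(cells):
    return tuple(sum(g) / len(cells) for g in zip(*cells))

def cross_match_rate(cluster_a, cluster_b):
    """Fraction of held-out cells assigned to the *other* cluster."""
    train_a, test_a = cluster_a[::2], cluster_a[1::2]
    train_b, test_b = cluster_b[::2], cluster_b[1::2]
    ca, cb = centroid(train_a), centroid(train_b)

    def dist(cell, m):
        return sum((x - y) ** 2 for x, y in zip(cell, m))

    mismatches = sum(dist(c, cb) < dist(c, ca) for c in test_a)
    mismatches += sum(dist(c, ca) < dist(c, cb) for c in test_b)
    return mismatches / (len(test_a) + len(test_b))

# Two well-separated invented clusters: a low rate suggests genuinely
# distinct cell types; a high rate suggests the clusters should merge.
a = [(9, 1), (8, 2), (9, 2), (8, 1)]
b = [(1, 9), (2, 8), (1, 8), (2, 9)]
print(cross_match_rate(a, b))  # -> 0.0
```

A pair of clusters that constantly "discover each other" in this test is a sign of over-clustering, which is exactly the theoretical check being described.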
So there are ways of assessing the different clustering algorithms from a theoretical and computational point of view. And then there are of course also ways of assessing the results from clustering algorithms experimentally, or using external data from the literature. In any case, we've
predicted cell types here for 20 tissues in the human body. This is almost a million
data points very recently. It's very exciting, you know, to be able to be at this
juncture where we have data that covers a representative
set of tissues from the whole body. It's not the entire body, but it's maybe half of
the tissues in our body. You know, we're getting very close, and it's a very exciting
time now for the community. And as I said, this isn't automatic. You know, a million cells is not something that you can annotate by hand; Gregor Mendel or Watson and Crick, those kinds of approaches
are not gonna work. We need to do this by machine learning. And we use a supervised or semi-supervised approach so that we can then make these models of cell types, which allows us to transfer labels from known data sets to new data sets.
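A minimal sketch of that label-transfer idea follows, with a nearest-centroid model standing in for the regularized classifiers used in practice, and invented expression values and cell-type names.

```python
# Sketch of label transfer from an annotated reference to new cells.
# (Production cell-typing models use regularized logistic regression on
# huge matrices; a nearest-centroid classifier stands in here.)

def build_model(reference):
    """reference: list of (expression_vector, cell_type_label) pairs."""
    groups = {}
    for vec, label in reference:
        groups.setdefault(label, []).append(vec)
    return {label: tuple(sum(g) / len(vecs) for g in zip(*vecs))
            for label, vecs in groups.items()}

def transfer_label(model, vec):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], vec))

reference = [((9, 1), "T cell"), ((8, 2), "T cell"),
             ((1, 9), "enterocyte"), ((2, 8), "enterocyte")]
model = build_model(reference)
print(transfer_label(model, (8, 1)))  # -> T cell
```

Newly annotated cells can then be folded back into the reference, which is the iterative expansion of the training data described next.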
And in general, of course, we're at the point where we're somewhere in between these regimes. When you have little data, you require a lot of knowledge, and the more data that you have, the less knowledge you really need to interpret it, because with enough data you can pin down very accurately, for instance,
basically the, for instance, the clusters in your data sets, or the protein structures in
your AlphaFold predictions. And this is from Carl Henrik Ek, a colleague of Neil Lawrence's in the computer science department in Cambridge, and I think it summarizes the development of single cell genomics and computational biology in this area of cell type annotation really nicely. So we're getting to the era of large-scale single cell expression data sets, where we don't need so much model selection and tuning anymore. And so the motivation is that we can make classifiers that automate the annotation. We compiled this data from many, many data sets, 19 different data sets from
different regions of the body, and asked whether there are tissue-specific cell states or cell states that are shared across tissues. There are many really exciting questions that we can now use this data set for, to interrogate the cells from across the human body. The cells cluster into roughly 100 different cell types, which are subdivisions of 10 broad cell types. And just to sort of give you
an overview of the workflow: you have to basically integrate the data sets; harmonize, assemble, or clean the data; do initial training and model the data; and then you can feed in unannotated new data and interrogate it with the reference model. That can in turn then re-enter the pipeline and contribute to the training data itself. And so you can go through a kind of iterative process where you expand your models with more and more data systematically. And so basically the take-home from this is that we're entering the era of big data in single cell genomics, just the way we did in protein structures in the previous section of the talk on AlphaFold, and the biological interpretation of that data simply needs computational inference because it's so vast. The computational tools and mathematics, on the other hand, need to keep pace with the experimental technology. So there's a kind of to
and fro and a symbiosis between the theory and the computation. I'm coming to the last
part of the talk now, and where I'm going to discuss prediction of cell communication, how
cells talk to each other, how they communicate with each
other and how we can predict which cells get infected
using Human Cell Atlas data. So the Human Cell Atlas is
an international consortium with a mission of creating a
comprehensive reference map of cells using single cell genomics coupled with spatial data and interpreted with
computational methods. And it's, you can think
of it as a Google Maps of the human body, where we're using these new cutting-edge, high-resolution technologies to get from the kind of coarse-grained view of the human body to the Google Street Maps view. And we founded this about five years ago. My co-founder, my partner in crime in this
endeavor is Aviv Regev. And we've now grown to a
community of about 2000 members across, you know, really across the globe, 77 countries across the world. And this is a grassroots, bottom up, scientist-led initiative. We are organized into working
groups and biological networks that focus on the different
organs and tissues in the body, as well as human development, organoids and genetic diversity. And one of the first, my first project on
human cells and tissues as opposed to a mouse
or other model systems was the placenta. And what we set out to do
from about 2015, 2016, was to map this organ. It's a transient organ at the interface between the mother and the fetus, inside the womb. It's of course only present for nine months of your life, but it's absolutely crucial and essential to your development for those nine months. And you wouldn't be
here without this organ. And it's relatively poorly understood, because the human placenta is very different from that of the mouse, and even different from that of the most closely related non-human primates, the chimpanzees. It's shown here in red. And, of course, the really
intriguing conundrum to me here was that there's a mystery about how the maternal
immune system tolerates the paternal antigens. So of course our immune systems are tuned to reject anything that's non-self, and that's really the basis of our health and our homeostasis. And yet, in, when we're pregnant, we are tolerating
something that's non-self that has antigens or proteins
that come from the father. And so how is that actually possible? What we set out to do, therefore, was to study this tissue by dissociation, cell sorting, and single cell genomics, using two different technologies, microfluidic droplets and well plates handled robotically, and then to computationally cluster the cells. And you can see here, we
studied both the decidua, which is the uterus or endometrium, and the placenta; the uterus is the maternal side, and the placenta is the fetal side. We also sampled the maternal blood, in order to distinguish cells that were from the maternal blood. And this gave us the cell phone book, I'll call it, which means: what are the individual cellular components of both the maternal side and the fetal side? And you can see here,
there are immune cells, NK cells and T cells. There are glandular epithelial cells that do the secretions
to support the fetus. There are fibroblasts, which are kind of structural
components of the uterus. And it's really through
the statistical inference that we're able to find
all these cell types. But what we really wanted to understand was how does this immunological
tolerance take place? And to understand that, what
we need to also dig into is how the cells are
talking to each other. And we call this CellPhone: a statistical inference system for deciphering cellular interactions through receptor-ligand complexes. And this links to the
protein complex assembly that I talked about in
the first part of the talk and how those protein complexes on the surfaces of adjacent cells are mediating the
interactions between cells. So this is a kind of interaction that's at a higher
level than the proteins. It's at the level of the individual cells. We developed a statistical
inference framework to look for receptor/ligand
interactions between cell types, between single cell clusters, that are specific to those cell types. So they're not ubiquitous; these aren't molecules that are expressed everywhere. They're molecules that are specifically complementary between individual cell types.
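The statistical idea can be sketched as a permutation test (a simplified stand-in for the actual framework, with invented expression values and cluster names): score a ligand-receptor pair by the product of its mean expressions in the two clusters, then shuffle the cluster labels to ask how specific that score really is.

```python
import random

# Sketch of cluster-specific ligand-receptor scoring: the pair score is
# mean ligand expression in cluster a times mean receptor expression in
# cluster b; specificity is assessed by label permutation. All numbers
# and cluster names are made up for illustration.

def pair_score(ligand_by_cell, receptor_by_cell, labels, a, b):
    lig = [x for x, lab in zip(ligand_by_cell, labels) if lab == a]
    rec = [x for x, lab in zip(receptor_by_cell, labels) if lab == b]
    return (sum(lig) / len(lig)) * (sum(rec) / len(rec))

def permutation_p(ligand, receptor, labels, a, b, n=1000, seed=0):
    rng = random.Random(seed)
    observed = pair_score(ligand, receptor, labels, a, b)
    hits = 0
    for _ in range(n):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if pair_score(ligand, receptor, shuffled, a, b) >= observed:
            hits += 1
    return hits / n   # a small p suggests a cluster-specific interaction

labels   = ["trophoblast"] * 4 + ["NK"] * 4
ligand   = [9, 8, 9, 8, 0, 1, 0, 1]   # ligand high in trophoblast cells
receptor = [1, 0, 1, 0, 9, 8, 9, 8]   # receptor high in NK cells
print(permutation_p(ligand, receptor, labels, "trophoblast", "NK"))
```

Pairs whose score survives the shuffling are the candidate cell-cell communication channels between specific cell types.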
And that allowed us to decipher the cellular interactions. On the placental side, the fetal side that you can see up here, you've got the placental trophoblast cells, the placental immune cells, blood vessels and fibroblasts, and the maternal blood is swishing around here. You've got the extravillous trophoblast, which are invading the uterus, the maternal blood vessels, the maternal glands, and maternal immune cells. And there are a lot of
interactions that are taking place that guarantee that
immunological tolerance that I mentioned earlier, and that we discovered in this work. And so this is really a kind of flavor of a Cell Atlas exercise. It's one of the first organs
that was mapped in this way. We mapped all different
regions of the placenta to get a comprehensive overview, and we published this work in 2018. Then of course, over the intervening years, a lot of tissue and organ datasets became available, including those 20 tissues and almost a million cells that I mentioned earlier in that integrated data set. And so when the pandemic
hit in early 2020, we became aware that there
was this virus circulating. What we leveraged was
the Human Cell Atlas data and also the Human Cell
Atlas scientific community to understand COVID-19 and that endeavor has really continued. But I wanna tell you this
story from the very early days of the pandemic, where we mapped the viral entry receptors from the Human Cell Atlas data. So we asked in all of this
single cell genomics data from around the body, where are the viral
entry receptors expressed that could welcome the
virus into the cells? Because of course the virus is docking onto the surface of cells,
and so we're simply asking, can we predict where the virus is entering in the healthy Human Cell Atlas data? And we assembled data
from all around the body. You can see the different tissues here, and then mapped where ACE2 is expressed, and I'm gonna focus very
briefly on the barrier tissues, where the healthy reference
data is probably most useful and is telling us where the virus can hit. And of course, the nose
is one of the main ones where we have the nasal passages
where you've got inhalation of aerosol droplets. And indeed here, we find epithelial cells, goblet and ciliated cells, that have high expression levels of ACE2 and TMPRSS2. So we pointed towards those specific cells as being potential viral entry points. In the lower airways, the bronchi, you've got club and ciliated cells. Sorry, my internet disconnected. - [Priya] Yeah, no worries, yeah. - So we found cells in the eye, in the gut epithelium, the enterocytes, and also in the placenta, which I just talked about.
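That mapping exercise boils down to a co-expression screen, sketched here with illustrative numbers rather than real atlas values: flag the cell types in which both the entry receptor (ACE2) and the protease (TMPRSS2) are expressed above a threshold.

```python
# Sketch of the viral-entry screen: find cell types co-expressing the
# entry receptor and the protease. The expression table is invented,
# for illustration only.

def candidate_entry_cells(expression, receptor="ACE2",
                          protease="TMPRSS2", threshold=0.5):
    return sorted(cell_type
                  for cell_type, genes in expression.items()
                  if genes.get(receptor, 0) > threshold
                  and genes.get(protease, 0) > threshold)

expression = {
    "nasal goblet":   {"ACE2": 1.8, "TMPRSS2": 2.1},
    "nasal ciliated": {"ACE2": 1.2, "TMPRSS2": 1.7},
    "lung club":      {"ACE2": 0.7, "TMPRSS2": 0.9},
    "immune T cell":  {"ACE2": 0.0, "TMPRSS2": 0.1},
}
print(candidate_entry_cells(expression))
# -> ['lung club', 'nasal ciliated', 'nasal goblet']
```

The screen only predicts where the virus *could* enter; as described next, it was later experimental data that turned these correlations into confirmed infection sites.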
There we hypothesized that there are cells that can be responsible for the vertical transmission from the maternal to the fetal side. Now, while this is rare,
what this data shows is that there's a potential
pathway of transmission. These were all predictions about where the virus could enter. We were trying to get
to truth and information and insights into this
infection as quickly as possible using Cell Atlas data. And what's really exciting
is that since then, we published that work last year, it's been cited about 1300 times. It got a lot of attention also from the public health point of view. What's really exciting
is that those predictions have largely come true
from experimental data, where we can see, for instance from nasal swabs, next generation sequencing reads of the viral mRNA inside individual cells in the nasal epithelium. I should also say that in the mouth, which I haven't discussed,
we predict the salivary glands, specifically the ductal cells at the top of the salivary glands, in the ducts, as sites where the virus could enter. And indeed, in these microscopy images, we can see that the virus is indeed sitting inside those cells, and
we published that this year. So for ACE2 and the SARS-CoV-2 virus, the correlation with the receptor does indeed reflect the infection and the infected cells. And so that's really the
last part of my story. I've told you about protein
complex assembly pathways and data mining from protein structures, inferences from single cell genomics to predict cell types and cell states, the clustering, predicting
cell communication of the maternal fetal interface, and viral entry into cells through the cell surface receptors. And so I'd like to summarize
basically at a high level, what this is telling us about correlation and causation. In the protein complex assembly work, with its data mining approaches for linking evolution and assembly, the correlation was experimentally validated using physical chemistry, mass spectrometry, and other methods. In single cell genomics, for the cell-typing models, the data is of such a scale that the interpretation really relies on machine learning and computational tools like clustering. And for the Human Cell Atlas, what I told you about the cell
interactions in the placenta and the COVID-infected cells is that these correlations
are now coming true based on orthogonal
experimental measurements, and later experimental data. And with that, I'd like to thank
you and take any questions. - Thank you so much, Sarah. First of all for your enormous patience through the disruptions for today, what an exciting set of experiments and conceptual model building that you and collaborators
have been working on, and truly interdisciplinary. So thank you so much for
allowing us a glimpse into this world and the results. So I think we have a couple of questions. We had one from Deepti in the chat, and her question is how do you account for
heterogeneity of time? What is the time that you are using in the spatial clustering analysis? - So these are all snapshots essentially. So in the, yeah, so the
example in the spatial genomics of the mouse brain, for instance, you know, that's a snapshot at one point in time through a tissue section, same for the placenta. You know, it's a snapshot of a tissue that you're measuring
at one point in time. Now within that, there are kind of pseudo
temporal relationships between cells, which I didn't discuss because you have cells that
are progenitors of other cells. So you kind of have stem cells that give rise to other cells,
and those relationships, I mean, inferring those relationships is another whole field in and of itself. And actually in some of the
gut data that I mentioned, there are stem cells
and differentiated cells within the epithelial compartment, and in the data projection
that I showed in 2-D, you can almost see those relationships where the stem cells are
kind of more at the bottom, at the root of a trajectory, and the differentiated cells
are kind of more at the top, but we weren't formally
doing that inference here. It just comes out in the
manifold projection, yeah. - Thank you. If I may, I have a question. So, you know, one of the
arguments that is always made about, you know, comparing
disciplinary approaches, say physics and biology,
is that, you know, in physics we have some
guidance from conservation laws. So, you know, you have
conservation of energy, conservation of angular
momentum and so on. And those really form the bedrock on which we build inference models because these give you constraints. So, and the question
often arises, are there, when there are clearly symmetries, do these symmetries similarly
translate into conservation, new conservation laws or are
there just similar sort of, you know, entropy and energy? I mean, are these the
sorts of conservation laws that operate? - No, I mean, that's a great,
that's a great question. I'd say, so there are some, I mean, at one level biology is physics. And so, you know, in
biophysics, like in the, in the macromolecular assembly, you know, at some level, this is physics. It's biophysics, it's physical chemistry. The laws of physics apply,
and of course they apply to all of biology in that sense, but when you're dealing with
the sort of big data sets, then you're not in that regime, but there are still
like sort of, you know, the central dogma that I showed, which is DNA makes RNA makes protein. Like that central dogma would still enter into any of these data mining assumptions. So for instance, we are predicting
cell types from the RNA, but our kind of unarticulated
assumption is that those, those RNA fingerprints will translate into what proteins the cell is expressing. So there are some fundamental
truths that everybody has as part of their kind of
mental map when they're doing, you know, when they're
doing biological science. - But do they then become
constraints as well? So-- - Can, they can, yes, they can. Yes, and, you know, because the central
dogma may not always hold because you may have RNA that doesn't make protein, you know, and so that can, you know,
that correlation may not hold or that assumption may be misleading, but I would say it's the
exception rather than the rule. - [Priya] Right. - [Sarah] Yeah, but,
yeah, I get your point. - Right, because I mean,
I think that is sort of one of the features
when you make inferences in sort of physical systems,
and if you have, you know, you're looking for these
unmapped correlations and when you find them, you still, if some of them violate the conservation of energy, for example, then you know that you are
probably missing a variable or there is a hidden
variable that, you know, you in your inference
structure you have overlooked, or, and so on. So I guess what you're saying is that it's not quite the same. - Yeah, I mean there are,
so I see what you mean. So there are kind of sanity checks, you know, like that. And one of them would be, so
for instance, that, you know, cell types that are, that have developed from
different progenitors or that are completely different from, let's say epithelial and neuronal. So the, you know, the brain
and the epithelial tissues. For those cell types, let's say, logistic regression models, machine learning models, should never cross-match each other. And so that was kind of the
internal consistency check that I mentioned where
we're looking for, you know, cell clusterings that are separate, versus ones that are hitting each other, that are finding each other. There's that kind of
thing that you can use. - [Priya] Right. - So that's based on a sort
of developmental, you know. - Right, so I guess there is-- (cross talking between Sarah and Priya) So I guess the way this
would kind of connect to a physics kind of
argument that I was making is that one constraint
would be that, you know, these cells cannot
transmute into each other. That's a fundamental constraint. - Yeah. - That they have, they are
independent building blocks that don't transform into each other. - Exactly, exactly. - Okay, great. Thank you. Any other questions? - [Participant] I just
wanted to make a comment on what you said at
the last moment, Priya, about the cells can't become another cell. Well, as long as you're
not looking at stem cells, right, Sarah? - Right, right, yes. I mean, I guess they. - Right, but the progenitors, you know, let's say gut, you know, stem
cell wouldn't become a brain. I mean, naturally, you know, so the neuronal progenitors
will be different from the gut, epithelial progenitors. I mean, that's a kind
of reasonable assumption for any biologist I think. You can force them to do
those things in the dish in a Frankensteinian kind of way, but they wouldn't happen
naturally necessarily. - Right, so I think while
we wait for someone else, I mean, I have. - Well, I think Jenny
wants to ask a question. - [Participant] Yep,
Jenny has her hand up. - Oh, there's a hand up. Okay, sorry, please go ahead, Jenny. - [Jenny] Okay, thanks. I just try to be polite. Thanks for the great talk. Concerning Priya's question as well, I would like to follow up on this. You have single cells, and you
try to cluster them by type, but they are not, you
do not take into account the environment of these cells. This comes only when you take into account the Human Cell Atlas. So would your clustering types then change by taking into account the
environment of each cell type? Will you get different clusters, by taking this, the
environment into account? - I mean, so in theory,
you know. I mean, the molecular fingerprint
based on the RNA content should, you know, we think
that that should be sufficient to determine the identity of a cell. But of course, if you have
additional information about the environment, if
you have, let's say a cluster that's large and very dispersed, and, you know,
in the 20,000 gene space, then you may, you know,
having the precise micro, anatomical, micro environmental
niche information, the tissue, may tell you,
oh, you know, in the, in the crypts of the colon, there's this, let's say, intraepithelial T cell type, whereas in the villi, there's this intraepithelial T cell type. And we can now distinguish them, you know, based on very, very,
very subtle differences within this cluster because we have the micro
environmental information. I mean, that's one of the reasons
that we developed this method called Milo
with John Marioni's group, which basically distinguishes sort of very subtle
neighborhoods in a KNN graph of the single cell genomics data based on external information, the kind of metadata parameters that you can give the algorithm. So, yeah, that's a, that's a good point. Taking into account
metadata about location can resolve clusters in a more, in a more fine-grained way. - [Jenny] Thanks. So you get sub clusters. - Yeah, and sub clusters
that you couldn't distinguish otherwise, where there wouldn't be
sufficient information based on the single cell
genomics data alone, but using the metadata, you then have enough
statistical power and basis to distinguish them. - [Jenny] Mmm hmm, thanks. - [Priya] Wilson? - [Moderator] Hussein? - [Wilson] Yeah, I have a
question more about prediction towards the future. So I'm curious, as I know
that the Human Cell Atlas is in the process of building
the map of human cells. So just like we learned
about the differences between the sequences of
individual human genomes on the bulk level, where do you see this going in terms of this single cell level, in terms of different individuals having, I don't know, different
numbers of cells or cell types or expression profiles and so forth. - Yeah, so the human genetics
level for the Human Cell Atlas is basically coming now
for individual tissues. So where it started off, if you remember for the
human genome project, the equivalent is sort of,
the next level was kind of, you know, population genetics or GWAS, and it started off with individual genes rather than the whole human genome. And I would say for the Human Cell Atlas, the counterpart of that is
genetics of let's say blood or genetics of, let's say, you know, small intestinal biopsies, where you are able to gather
samples from hundreds of donors and can then calculate, like you said, differences in abundances of cell types or differences in patterns
of gene expression at the single cell level of, on the human genetic scale
using single-cell eQTL or dynamic eQTL inference algorithms. And that's an area that's developing both on the experimental level and on the computational level, about what's the best way to do that. And in fact, we're collaborating
with Neil Lawrence, who's going to join me as a
discussant tomorrow, in terms of algorithms for analyzing single cell genomics data
at the human genetics level, using Gaussian process
latent variable models. So I think that's where it's going, yeah. - [Wilson] Thanks. - Thanks. I guess we had, I guess,
Sunny, is that you? - So I think Fred was asking
a question, and he was saying. (cross talking with Priya and Sarah) - [Priya] We had one more
person, William, who had. - Yeah, and William, yep. - [William] Ah, hi, can you hear me? - Yep, hi. - Thanks for your talk. I really enjoyed it. I'm not a real scientist,
I'm a social scientist, so maybe, excuse my ignorance on this if it sounds kind of coming from like man on the street type perspective. But over the past year and a half, there seem to be certain countries that I wouldn't necessarily say are immune that have had a lot greater
chance of controlling the spread and transmission of
like the COVID-19 virus. I was wondering if there's any evidence in sort of the genomic field
of whether certain populations might be like more genetically
adept at overcoming COVID than others, say, like in Southeast Asia, perhaps some previous exposure
to regional waves of SARS has given some sort of
advantageous adaptations to the body's dealing with this new virus. - So advantageous adaptations
from the viral point of view? Sorry, I don't quite
understand the question. - [William] Yeah, I
was wondering, I guess, I guess what I'm wondering is
if you looked at all this data that you've collected, if there's any indications
that you're seeing unfold from certain populations
in parts of the world are different in a way that allows a better
response to viral infections? - Oh, okay, now I
understand what you mean. So, the answer to that is we don't have enough data to really, so the answer to that is, there're a couple of different answers. So in this Gaussian process
latent variable model approach that I mentioned, what we can
do is map the associations with immunity to COVID-19 onto both nasal sort of epithelial data, and also onto blood data of populations and show that the OAS1 mutation in this RNA processing enzyme that's part of the innate immune
response against the virus that's present in every
single cell in our body, that there are indeed
sort of genetic variants that lead to a higher expression with a certain splice
variant versus other variants that lead to a lower expression
with a lower splice variant. And that the ones with the mutation that leads to the splicing
truncation have lower expression. And that may explain the higher propensity to severe COVID-19 of people
who have that genetic variant. So essentially we can sort
of take the association data and interpret it in terms of expression based on these data sets. That's one thing that we can do. And that's one thing that we're showing in this unpublished work
that we'll post online soon. And then the other thing that we can learn by looking at population data of COVID-19 is the difference between
children and adults in terms of their innate and
adaptive immune responses. And we have a publication
on that on medRxiv that I didn't talk about.
innate immune response and a more polyclonal,
adaptive immune response. So children are really
kind of better poised to get rid of the virus quickly. Whereas our innate immune
response is kind of more sluggish, our adult response is, and the kids are also developing a kind of T and B cell response
from scratch, if you will. So they don't rely on
their memory T and B cells, which we do as adults. We rely on our immune memory. And so we have fewer,
larger clones of T and B cells. Whereas the kids are seeing
these viruses for the first time and have a much more kind
of diverse population of T and B cells in their response. So that's a kind of, those
are the bits of things that we've learned from this work. There's a lot more to
learn, but it's, you know, we have learned quite
a lot about the innate, about immune responses
in different genetic, with people with different
genetic variants, kids versus adults and so on. And we'll continue to study
this over the coming years because it's, you know,
incredibly important, I think not just for COVID-19
actually, but, you know, we're actually learning new things about virology and immunity overall. - So there's a question by Fred. Let's take that question. - Yep, so Fred is asking, is there something in common between the three parts of the talk? And I would say the central
take-home message is really,
I wanted to illustrate as a common thread
throughout all these things is really the iterative cycle between having big data in biology, using statistical computational
machine learning approaches to analyze it and make predictions, and then going back to
the cycle of experiments to validate those predictions. Like I said, in biology,
we have the luxury of being able to do that cycle because we can interrogate molecular and cellular systems experimentally. It's not like the cosmos or the climate. And really the central message
that I wanna get across is that that's a very powerful paradigm of iterating from biology to prediction with these big data methods, and then back to experiments. You know, and that holds
across structural biology. It holds across cellular biology. It holds across tissue
biology and virology. It's whatever you, wherever
you look kind of thing. - Yeah, and I think, as
you mentioned, Sarah, this ability to do controlled experiments is really key in validating
these predictions for the kinds of conceptual
computational models that are being built. So if there are no further questions, let's just wait for a minute to see
if anyone raises their hand or types in the chat. Let's all thank Sarah
for an excellent talk. - Yes, thank you everybody
for staying here so late. - [Priya] Thank you very much, yeah. - Very much. Yeah.