PRESENTER: So welcome
to today's CBMM talk. It's great, really great, to
have Daniela Rus coming here. She's, of course, the director
of CSAIL, a great leader. I think you all know her. And from time to time, she has
these great, wonderful, simple, beautiful ideas
in robotics, which we read in papers and in
the news, in the tech news. And she's also a
great friend of CBMM, has been a great advisor for me. And it's somebody who really
likes the problem of the brain and not just artificial
intelligence, although artificial
intelligence, of course, is also a great problem. DANIELA RUS: Thank you for
this kind introduction. It's really a great
pleasure to be here to share some of our ideas
with the CBMM community. And so today, we will
tell you about a new idea we have been pursuing, together
with Dr. Ramin Hasani, who will present most of the talk. And the basic idea we
want to describe with you aims to bring the natural
world and the engineering world closer together. And Ramin and I are
going at this problem, in part because we have a
general curiosity and desire to understand
intelligence, in part because when I look at the
state of the art in the field of artificial intelligence,
I see a lot of advancements. And I see that these
advancements are really using decades-old
ideas that are enhanced by computation and data. And so a natural question is
whether this is intelligence. Another question is,
are there other ideas? Can we use the natural
world to inspire us to think differently? Because I believe if we
don't come up with new ideas, then our results are going
to become increasingly incremental, because more and more people
will be plowing the same field. And so the field really
desperately needs some new ideas. And the idea that Ramin
will describe today aims to build machine
learned models that are much more compact, much
more sustainable, and much more explainable than
the models that are based on deep neural networks. And so let me just
say that much. And now, it is my great pleasure
to introduce more formally Dr. Ramin Hasani. Ramin is a postdoc in my group. Prior to joining my group,
he was a PhD student at the Technical
University in Vienna. And prior to that, he
did his master's degree at Politecnico di Milano. And so with that, Ramin,
please join us and tell us about your vision and results. RAMIN HASANI: So hi, everyone. Thanks, Daniela, for
the introduction. And thanks, Professor Poggio. All right, I'm very
excited to be here, presenting liquid
neural networks, a class of artificial
intelligence algorithms that tries to bring a
little bit of neuroscience in a structured way
to machine learning. So if you look at neural
activity in brains, in general, on the left side, you see the
brain activity of a mouse, and on the right side, you
see one of the networks that we trained end to end-- a controller for controlling
an autonomous car. We see that, basically,
the activation patterns may, superficially,
look very similar. But in principle, there are
fundamental differences. There are huge gaps
between intelligence as we know it in brains
and in deep models, in particular in representation
learning capacity: how natural brains actually
approach the organization of the world around them
so they can make use of it and control it
to achieve their goals. So we know that natural
brains interact highly with their environments in
order to understand their world. By understanding, I mean
that they can actually interact with the world and
capture causality, basically, the causal structure of the
task that they are performing. And this is one of the reasons
why natural brains can actually go out of distribution,
whereas statistical machine learning, by definition,
stays in the IID setting, right? And this is one area that would
be extremely beneficial if we can explore more and maybe
bring some of those insights from natural brains back
to artificial intelligence. And at the same time,
we know that brains are much more
robust and much more flexible in terms
of perturbations or the environments that
they get into. And finally, there is the efficiency
of the models. The whole network is
not always active; there is always some
part of the network that takes care of the
computation that is on demand. So allow me to
demonstrate this with a typical statistical
end-to-end machine learning system, where you have inputs
that come from a camera. And then you have a
deep neural network that takes care of, let's
say, the steering angle of a car. So in this kind of framework,
what we are seeing is the
activity of the network. And this
network is actually tested in the real world on a real car. And these are
demonstrations from the test set, where they are actually
deployed in the environment. They have been trained
using human data, and they are now deployed. So one of the things
that we actually looked into is, basically,
the attention of this network, like what kind of
representation has been learned? What pixels are the most
important pixels when a driving decision is being made? So this CNN actually
learned to attend to the sides of the road,
where we see lighter regions in this
attention map, in order to make driving decisions. And that's not the
actual causation. When you're driving, you're
not just looking around, right? You're looking at the
road in front of you. So you want to actually have
your focus on that perspective. So the causal structure
here is missing, although the task is being
completed by the network. Now, if you add some
noise on top of the image, like a little bit of noise,
we see that this attention map is not even reliable anymore. Even if this noise is kind of
a small Gaussian perturbation, you can see that it has huge
influence on the decisions and the consistency
of the decisions that the network makes. So how can we improve this
by bringing neuroscience in? Marr and Poggio
set up a framework for us here. Let's say you want to
explain a biological system.
You can look at it
at the system level and find out what
the goals of the system are and what
kind of mechanisms actually get you to those
goals. That's the system level. And then you can
also have this view of looking into building
blocks of these things, going down and looking
into how intelligence emerges from cells. You can go down
and basically use computational models of the
precise mechanisms that exist in biology. So having this kind of framework
in mind, here is what we can do, and that's what we
did. I'm just showing you
a summary of what this research is about. So we looked into the nervous
system of a small species. And we got down to the
neural circuit level. And to understand
neural circuits, we actually went even further, to the
neuron and synapse level, to
really fundamentally figure out what the
building blocks are there. And you know that
you can go even lower than that and build computational
models down to atoms. But there is actually
a level at which you have to be satisfied,
where you decide not to go any lower, so that you can
take this model and see what kind
of capabilities you can get using
the super-advanced machine learning
frameworks that were recently developed. So we stopped at
a certain level, which I'm going to explain
throughout the talk. And we saw that these models
are much more expressive than their counterparts
in deep learning, although the kind of abstraction
that we did is really simple. But in terms of
how much capacity these networks can generate,
they are much more expressive. And I'm going to show
you the math behind it and also the experimental
evidence for it. These systems can handle
memory, through both explicit and
implicit memory mechanisms that I will explain
throughout the talk. More importantly, these systems
can capture the true causal structure of the data. And that's part of the reason
why these systems can actually be helpful in this kind
of closed-loop, real-world decision-making process. The systems are basically
robust to perturbations. And we can use them for
generative modeling. We can even use them
for extrapolation. You can go out of distribution
with this type of network. Because if some process can
capture the causal structure of the data and you can
prove that that's the case, then the system is
able to actually go even out of distribution. And with that in
mind, we actually try to perform decision
making in real-world robotics. We are a distributed
robotics lab, and we want to bring these
insights into the brains. Now, to show you what kind
of change we have done, you can look at this system. This system has now,
on the right-hand side, what you see is the 19
nodes of the system that is sparsely connected together. And this is described
by that model that, actually, we developed. And then you can actually
get into attention maps that are much more focused on
the true causal structure of the task. And this is not
just on this task. But we can actually see
more throughout the talk. Well, how do you get started
with creating a model? Let's look into, let's
say, the interaction of two neurons and the synaptic
propagation of information between the two. So neural dynamics,
unlike in deep learning systems, are typically given by
continuous processes. And they are described by
differential equations. So synaptic release is
not just a scalar weight. Synaptic release
can be modeled with much more sophisticated
mechanisms. You can really get
down to the probability that a neurotransmitter
is actually going to bind to the
receptors of the second neuron. So you can really choose
how much complexity goes into the process. You can really add
nonlinearity to the system. And there is also recurrence in
the structure, there's memory, and there is sparsity all over
the place in neural circuits. So having these
principles in mind, the goal is to
actually incorporate these small principles
that I mentioned into improving representation
learning, improving the robustness of
machine learning models and statistical
models, and, at the same time, improving their
interpretability. So to get into a common ground
between the computational work of neuroscience and the
machine learning systems, I would like to
start by exploring where we have continuous dynamics. So let's start with the
processes that have recently been brought
up in the machine learning community: continuous-time,
or continuous-depth, models. A continuous-time
neural network is basically this: take a
neural network f that has a certain number of
layers, a certain width, and an activation
function of choice. It is a function of its
hidden state and its inputs, and it's parameterized
by parameters theta. If this neural network f
parameterizes the derivative of the hidden state, then you
have a continuous-time process. It's going to be a
continuous-time neural network. With this
representation, you move away from a discrete
computational graph, like in the residual
networks that we have, where you take
a computation step at each layer. Now, if you define your system
the way we show it here, the depth dimension of your
system becomes continuous. And when you have a
continuous-time system, then you would have
a lot of advantages. First of all, the space
of possible functions that you could actually
explore and generate is much larger than that of
the discrete representations. The second advantage is
arbitrary computation. You don't need to perform
computation at every time step. You can have
arbitrary-step-size computation. So your depth becomes
variable, basically. You can have
infinite-depth networks with one process. And
this continuous process would be a natural fit for
modeling sequential behavior. In
the normal recurrent neural networks that you
know, the updated state of the network is
given by this discrete update. If you have a neural
ODE or, basically, a more stable version of it
that has a damping factor, then you can also use it as
a recurrent neural network. On the top row, you see the
interpolation and extrapolation capability of a
recurrent neural network on irregularly sampled data
points that lie around a spiral. And we see that the red
line in between shows the extrapolation capability
of this model, which cannot actually
capture the dynamics very well. But on the bottom row,
you see that the dynamic process
generated by a continuous-time recurrent neural network
actually captures those dynamics properly and
even extrapolates them. So this is nice. Now, how do we
implement these things? I'm just going
through the details of how to implement
this type of model. Because they are ODEs, you want
to use numerical ODE solvers. So you basically
unroll this differential equation. And then you can use any
type of numerical ODE solver. Let's say we use an
explicit Euler solver. Then
you can create the forward
pass of your network based on this unrolled
version of your network. And the choice of
ODE solver will define the complexity
of your map. You can use more
complex solvers that have adaptive
step sizes to have a more accurate forward pass. How do you now do
the backward pass? You can use the
well-known adjoint sensitivity method. Let's say
you have a loss function, and your dynamics are
given by a neural ODE. If
you have the dynamics of your
system starting from t0 up to a given time, and you
have labeled data, you can compare the output
dynamics against the labels to compute a loss. And this loss is
computed by running the ODE solver,
which basically gives you this trajectory. Then the adjoint
method creates a new state, governed by an
auxiliary differential equation, that captures the
dynamics of the gradient of the loss with respect to the state of the system. And then you can run this
ODE backward, one step at a time, to get the gradients
of the loss with respect to the state of the system. And at the same
time, you are also able to get the
gradient of the loss with respect to the
parameters of the system. So this adjoint sensitivity
method on the backward pass gives you
constant-memory backpropagation, because it
forgets the previous states and just does one-step-at-a-time
computation when it does backpropagation. You can also train this
network with gradient-based backpropagation through time. What you do is perform
one forward pass, and then,
based on the chain
rule, you compute your derivatives and update
your parameters. This way, you are
not treating the solver as a black box. You are actually
going through the solver, so the dynamics of the solver
become part of your gradient as well, and you need to be
careful about that. The
memory complexity of this method is really high, but it is much more accurate
than the adjoint method if you use that in
a vanilla sense. So I told you how these models
are implemented, forward and backward. Now, we have this neural ODE. So we said that
continuous-time processes in this representation
can have spatiotemporal data-processing power. And they have
really good potential. But we didn't define any
biological process there. We didn't actually
get any inspiration from the biological insights
that I talked about before. And a really funny fact is
that when you deploy them in the real world,
they're even worse than a simple long short-term
memory network, right? So basically, what's
the point, right? If you define a really fancy
equation that cannot even work in real-world
applications very well, then what are we even doing? So let's improve. For this improvement,
we want to get into biology. I told you that
the activity of neurons is described by
differential equations. And you can actually model
the dynamics of a cell or of a membrane as
a leaky integrator with these simple
linear dynamics. And the more important part is
the conductance-based synapse model, where you have
a nonlinearity included in the synapses of
the system and not in the neurons of the system. So basically, the
interaction between two nodes, or two differential equations,
is given by a nonlinearity. And this is
inspired by the ion-channel modeling of Hodgkin and
Huxley. You can get
this kind of steady-state behavior from the differential
equations of Hodgkin-Huxley. You can reduce them
to this abstract form. And if you look at
it, the nonlinearity looks like a sigmoid
activation function. So you
can, in principle, bring
artificial neural networks into the representation
of a synapse. Now, putting these two systems-- very simple things that have been
there for over a century-- together, you get a
dynamical system like this one. And this dynamical system
has certain properties and certain advantages. It's obviously a neural ODE. It's an ODE-based
neural network. It has a component
neural network f, a nonlinearity that
appears in the coefficient of x of t, the state of your system,
and in the dynamics of the state itself. So there is a coupling
between the state and the time constant of
your differential equation. Now,
let's say I don't have
recurrent connections, so x of t drops out of f. Then f becomes only
a function of I, the inputs of the system. Then the whole system
becomes a linear system. Now, in
that linear system, the coefficient of x of
t is input-dependent. So if the inputs of
the system are changing, then the behavior of the
differential equation changes, because that coefficient defines
the damping factor of the very simple
dynamical
system that you have. So just to show you a
block diagram of what this looks like: in a
standard neural network, the range of
possible connections that you might have is
basically this. Let's say you have two neurons. They have activation functions. You might have
reciprocal connections. You might have feedback. You might have an external
input to the system, and the connections have their
own scalar weights. Now, in a liquid
network, you have the same kind
of structure but, at the same time,
you have a nonlinearity that controls the interaction
of two differential equations. So the difference here is
that activations are changed to differential equations. And their interactions are
given by a nonlinearity that can be a neural network. So in terms of what
it represents, let's say I trained a
neural network for driving, for autonomous driving,
from visual data. I'm showing the visual
data in the middle. I did that with a standard
neural network that has a fixed time constant. And I did that with
a liquid network. What we are seeing on
the x-axis is 1 over tau. That means 1 over the time
constant of the system. And on the y-axis, what we
see is the steering angle of the car. And the color shows
blue for turning left and yellow for turning right. And in the middle, you
have the values in between. So now, we see that
a neuron, without any prior, just by plugging
those very simple building blocks together, actually learned to
associate its timing behavior with the dynamics of the task. So that's one of the
advantages that you get from this type of network. Another property
of these networks is that the state of
these systems is stable. And their time constant and
their behavior is stable. So if you define
the time constant of the system as
the expression that is the coefficient of x of
t, the hidden state, then, relaxing
the setting so that there is no
recurrent connection (let's say x of t is out of f), you are able to bound
the time constant of the system. And these are actually the
bounds that you can have. So the network
cannot go unstable. You can also bound the
state of the system. Let's say a neuron is receiving
many synaptic connections. A, in this representation,
is a synaptic parameter, and it is synapse-specific. So each synapse
that has a
connection to this neuron has a bias, an A. And now, basically, you can say
the maximum of the A parameters is the maximum value that
your state can actually reach. And the minimum of them, the
smallest one, has the least amount
of impact on the activity of your differential equation. We can also show that this
biologically inspired system is actually a universal
approximator. You can use
methods from function approximation
to prove that this expression can
approximate any given dynamics with arbitrary
precision, given enough cells. But to truly
find out how expressive a neural network is from
a theoretical standpoint, we want to get down to a
more fine-grained measure. There
are other measures
that we can use for measuring the
expressivity of a network, for example, the
trajectory length. Imagine I have a
circular trajectory, and I input this
circular trajectory to a deep neural network. I'm just defining what this
trajectory length measure is. You input this to
a neural network. This neural network
is parameterized. And then we can observe that,
at every layer of the network, this trajectory
gets deformed and gets more complex, and the length
of the trajectory gets longer and longer. It actually
increases exponentially. You can measure the
length of this trajectory with an arc-length measure. And you can
find a lower bound for the expressivity
of the neural network. Given its depth,
you can measure the expressivity
of a neural network in terms of
properties of its synaptic
parameterization, the width of the network,
and the depth of the network, basically. So we actually did use
this expressivity measure. Because this actually
draws a boundary between shallow networks
and deep networks. The deeper you get,
the more expressive you can get based
on this measure. Now, in our space, we have
continuous-time processes, let's say, liquid time
constant networks, or LTCs. We have continuous
time neural networks. And we have neural
ODE representations. Now, if we use the
same neural network, that is, we parameterize
the same neural network f for all of these processes,
each with its own differential-equation representation, we see that we consistently
get longer and more complex trajectories out
of the LTC network. We systematically analyzed
this in an empirical fashion. On the x-axis, you see
different types of ODE solvers for these three
types of networks: neural ODEs, CTRNNs, and LTCs. And we see that the
yellow line shows the trajectory length. For these LTC networks,
we see that, even if you change the width of
the network, on the x-axis, the trajectory
length is always higher. And we can see that if the
initialization of your network changes, you also
have a dependency on that. We also derived,
theoretically, a lower bound for
the expressivity of these types of networks,
where the lower bound is a function of the weight
scale, the bias scale, the width of the network,
the depth of the network, and the number of
discretization steps that you're taking for your ODE. And we also implemented
that for LTCs. You cannot compare lower
bounds to say that, yeah, so this network is more
expressive than the other one. But it's just a good
measure to see where we stand in terms
of this type of behavior. Now that we have this type
of measure and have evaluated it theoretically, let's really
put these networks in action, and let's see how good they
are at representation learning. So one of the
things we started with is modeling physical dynamics. I told you that a neural
ODE cannot beat an LSTM network, and you see that here. And you see that we can
actually get better performance using these networks. You can compare them against a
large series of advanced RNNs. And this [INAUDIBLE]
inspired network is actually beating them, even on person
activity, a real example
with irregularly sampled data. We also performed some analysis
on some real-world examples. And we saw that, on most of
these tasks, LTCs are better. There is one task
where the LSTM is better, and that's the task where we
have longer-term dependencies. And that's one of
the issues that you have to solve:
gradient propagation in continuous-time
processes is problematic. So you always have to take care. If you wrap them inside
a kind of well-behaved gradient-propagation mechanism, then you
would also get better performance there. We didn't stop there. We actually scaled
the applications to this end-to-end autonomous
driving that, at the beginning, I showed you. We have human-collected data. And we trained deep
learning models. Typically, a deep
learning pipeline looks like this:
you have a set of convolutional heads, and then you have fully
connected networks; basically, the
over-parameterized part of the network is
there, in the hidden layers. It takes between 5 and 100
million parameters to actually
perform lane-keeping, or this type of task, if you
have this type of network. What we did was say,
let's replace the fully connected networks
with continuous-time processes, and let's see what kind
of behavior we get. So we get four
types of variants. We take a neural circuit policy,
which is the first one, NCP. That has a four-layer
architecture-- again, nature-inspired-- that has
interneurons, command neurons, and motor neurons,
all LTC-based neurons based on the math
I showed you before. You can also replace those
fully connected layers with LSTMs and
CTRNNs, and you have the convolutional
neural network itself. So I'm going to talk about
the differences between these four variants. The first thing is that the
number of parameters required to actually
perform autonomous driving is
significantly reduced when you're using
this type of network. Now, remember the representation
of the network, where I was showing that
the representation learned by a convolutional network with
a fully connected head can get perturbed.
And now, with LTCs, we
are able to have 19 neurons in control. For
what I'm showing
in the attention map, we are not changing the
convolutional neural network structure across the
network variants that I showed you. We see that this architecture
imposes an inductive bias on the convolutional
network that lets it learn a causal structure. Now, even if you add noise,
we see that the explanations are not scattered
as much as they were for the convolutional
neural network. We also took a
real-world measure of this: how many crashes would
you have if you increase the amount of input noise? And you see that
this kind of network is much more robust
to this type of perturbation. And now let's look at the
convolutional neural network attention of these
end-to-end trained networks when their heads
are different-- when they have a CTRNN,
when they have an LSTM, and when they have
our LTC-based model. And we see that
the kind of prior that the recurrent
neural network puts on the convolutional
neural network makes it learn different
types of weights. So the representations that
are learned by these systems are completely different
from each other. And we see that the only ones
that have consistent behavior are the CNN itself
and our solution. But the CNN
focuses consistently on the outside of the road,
so we don't want that. The LSTM is giving you,
most of the time, a good representation. But it is sensitive
to lighting conditions. So if I stop the
video in some parts, you will see that in the
shaded areas the attention of the LSTM
gets scattered. And the CTRNN, or
the neural ODE, basically cannot gain
a nice representation in this task. Now, why is this the case? Let's explore
the why of this. So if you look at the
taxonomy of possible modeling frameworks, at the bottom,
at one end of this-- I don't want to
call it the bottom-- at one end of the spectrum,
we have the statistical models. Statistical models are
amazing at learning from data and, at the same time,
at performing inference in the IID setting, so predicting in IID. So this is what
statistical models can do. On the other side
of the spectrum, we have physical models. Physical models are
usually described by differential equations. When you have differential
equations that describe the dynamics
of your system, they can actually
answer questions. They can account for
interventions in the system. So if you can design
a universal approximator that is closer to the
physical kind of model, then you get
a more causal structure by nature. And also, you are
able to get insights about the system. You can learn from data. You can answer counterfactual
questions and predict both IID and out of distribution. So as I said, physical dynamics
can be modeled by ODEs. And this set of
ODEs can actually predict the future
evolution of your system. They can describe the results
of interventions in the system. And the coupled-time
evolution helps us define averaging
mechanisms for capturing the statistical
dependencies in data. And it enhances
our understanding of the physical phenomena. And because of that, they are
actually causal structures. So now, let me get
more formal about this. Let's say we have a
differential equation given by dx over dt equals g of x, where g is basically the
nonlinearity of the system. We have the
Picard-Lindelöf theorem, which shows that this
kind of differential equation has a unique solution if
the nonlinearity is Lipschitz. Now, if you unroll
this system with Euler, then the
underlying representation, under this uniqueness condition,
is a causal mapping. Why? Because you can
say what happens at the future event,
which is x at t plus dt, based on the previous events. Now, there is a framework within
this spectrum of causal models. It's called the dynamic
causal model. A dynamic causal model
has a nonlinearity of the shape that you're seeing. It takes a
bilinear approximation, or a second-order Taylor
approximation, of that ODE. And it gives you these
coefficients for the system. So coefficient A controls
the internal coupling of the system.
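As a rough illustration of the coefficients being described here, the bilinear dynamic causal model is commonly written as dx/dt = (A + sum_j u_j B_j) x + C u, with x the node states and u the external inputs. The slide is not part of this transcript, so that form, and every name and number in the sketch below (`dcm_step`, the two-node matrices, the step size), is an assumption chosen only to make the roles of A, B, and C concrete. It uses the explicit Euler unrolling mentioned earlier in the talk.

```python
def dcm_step(x, u, A, B, C, dt):
    """One explicit Euler step of dx/dt = (A + sum_j u[j] * B[j]) x + C u."""
    n, m = len(x), len(u)
    # Effective coupling: the fixed coupling A, modulated by each input through B[j].
    J = [[A[i][k] + sum(u[j] * B[j][i][k] for j in range(m)) for k in range(n)]
         for i in range(n)]
    # Right-hand side: coupled state dynamics plus the direct input drive C u.
    dx = [sum(J[i][k] * x[k] for k in range(n)) + sum(C[i][j] * u[j] for j in range(m))
          for i in range(n)]
    return [x[i] + dt * dx[i] for i in range(n)]


# Hypothetical two-node system with one input; all values are made up.
A = [[-1.0, 0.0], [0.5, -1.0]]   # internal coupling: node 0 drives node 1
B = [[[0.0, 0.0], [0.5, 0.0]]]   # input 0 strengthens the 0 -> 1 connection
C = [[1.0], [0.0]]               # input 0 enters the system at node 0 only

x = [0.0, 0.0]
for _ in range(2000):            # integrate to t = 20 with dt = 0.01
    x = dcm_step(x, [1.0], A, B, C, dt=0.01)
```

With the input held at 1, the state settles close to [1.0, 1.0]. Setting B to all zeros instead leaves node 1 near 0.5, which is the sense in which B modulates the coupling between nodes while C only routes the external input.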
Coefficient B controls the coupling sensitivity
among the network's nodes. So it accounts
for internal interactions and interventions. And coefficient C regulates
the external inputs. This framework is
a graphical model that is implemented by ODEs. So you can put these
things together to create this system. They allow for
feedback, as opposed to the kind of Bayesian
network architectures that you would otherwise get. Now, if we look at the
liquid neural networks, or the representation that we
gain from that representation, under two conditions,
that f is C1 mapping-- that means like f is
Lipschitz-continuous, basically, and is bounded-- I didn't write the
bounded, no? no, I didn't write that, so it
has to be, also, bounded-- and tau is positive. And if you have a
strictly positive tau, then this network would
also have a unique solution. Now, let's say I assume that
this f, the nonlinearity, is given by a
hyperbolic tangent. It has recurrent connections. And it has weights
like an input mapping. And then, with
this nonlinearity, I would be able to
compute the coefficients. If you look at the
coefficients for causal models, we can compute the coefficients
of this causal behavior. So that means there are certain
parameters of the system that are responsible for a
certain type of intervention in the system--
internal intervention and external intervention
in the system. Just from the
diagram perspective-- going back to our diagram-- we will actually have
a dynamic causal model that can have the
parameter B that controls the amount
of collaboration of two nodes with each other,
or interactions of two nodes, and coefficient C that controls
the inputs, or external inputs, to the system. You would have the
same type of behavior-- it's a nonlinear version of
that dynamic causal model-- that actually performs
the same thing. And they have more
sophisticated causal structures. Now, with that, we
did some experiments. They are behavioral
cloning kind of experiments where we have drone agents that
are moving in the environment. And they are given-- visually, there is actually
a target in the environment. And we ask the drones--
so actually, we drive the drones
towards that target. And with this visual
demonstration, what we want to do, we want to
learn this behavior and gain agents that are good in closed
loop when they're interacting with the environment. We see that this is
actually a learned behavior of this system, where as soon
as the target becomes apparent, then we see that this neural
network actually learned to focus on that target. Because that's the kind
of important matter in this kind of task process. So basically, the causal
structure of the task is learned by these drone agents. Now, if you compare the
kind of focus, or attention, of these networks to
other neural networks, we see that the only
representation in which we actually see this type of
process is the liquid network-based
solution, whereas this attention is not
persistent in the other ones. So we cannot say that the
other systems actually learned to navigate towards the target
and understood what they were doing. We also did that in a multi-agent setting. Right now, you're
a follower drone. And there is a leader
drone in front of you. And the goal is basically
to follow this drone. And in this type of
environment, also, we observe that the
attention of the network is, actually, always on the
second drone, basically. So that means the causal
structure is actually captured. Now, how can you show this
even more quantitatively? Then we looked into
closed-loop interaction. We trained these
networks in open loop and from training data. Now, we deploy them,
actually, in that environment. And we measure the
success rate that they can have in different
types of tasks in closed loop. So if they do not have
the true causal structure of the task, they wouldn't be
able to perform this task very well. And we did this across a spectrum of different kinds
of perturbations on the system. We see that these
systems are able to perform much
better than the other ones. Of course, there is always
room for improvement, even for these systems. Because we didn't add
any kind of constraint on helping these systems
to learn more and more. So we were just
trying to see what the gap is between these types
of networks and the others. So obviously, these
types of networks come with certain limitations. So the complexity
of the networks is basically tied to the
complexity of their ODE solver. So as a result, you might
have longer training times and longer test time if
you use these networks. You can have a
solution for that. You can use the
fixed-step ODE solvers. You can use the sparse flows. You can use a sparsity-- and the process that optimizes
sparse neural networks-- on, let's say, CPUs or
any kind of hardware that you're running or GPUs. And then you can
use hypersolvers. And these are a class of
solvers that can actually integrate everything together,
and they can actually run much faster when you
have differential equations. You can also use
closed-form variants in these kinds of scenarios. So you can use the closed form-- if you solve these differential
equations in closed form, then you can end up with
a nicer representation. And that's one of the
things that we did and we're very excited about. So there's another limitation
of these ODE-based networks. They might also exhibit the
vanishing gradient problem, because they're continuous
systems, and their memory is given by an
exponential decay. So then, you would face difficulty
learning long-term dependencies. So the solution is that you wrap
it inside a well-behaved kind of process-- for example, a
gating mechanism that you can actually put these
networks together-- for example, if you have
the state of an LSTM network defined by an LTC network. So if you do that, then you
would have a gating mechanism, and you have
gradient propagation that preserves the gradients. Now, in summary,
what I showed you is that you
can acquire knowledge by these flexible
neural models that can perform inference model-free. They can really capture the
temporal aspects of the task that is at hand better than-- the tasks that require temporal
kind of data processing, they can actually infer the-- and these are all thanks
to their causal structure. And they would be able to
perform credit assignment better than the other
models that are out there. So you might use them
for generative modeling. And if you want to
model the world, you basically can use
these representations or also get representation
of your world in order to do further inference
from those kind of models. So there are certain
properties that I mentioned-- the compositionality of,
layer-wise, these networks, you can actually put them
in different architectures. And you can connect them
in a sparse fashion. And the network is
actually differentiable. And you can use this. And if you're dealing with
visual data or video data, you would be adding CNN
heads or perception modules. And then this can act as
your decision-making engine. They're expressive,
they're causal, and they add more
to the interpretability of the networks. So some of the perspectives
that we have is that there is-- I just put two different
hundred-year-old models together, and these are all
kinds of properties that emerge from those kinds of things. And you can see how much
potential is actually in this type of research
that you can put, and you can really explore
what's going on in the brain. And why do you need to do that? Because, basically,
the research space is huge if you just want to
algorithmically implement something intelligent, right? So you would narrow it down if
you actually focus on brains and how they acquire knowledge. And definitely, because we have
these machine learning tools these days, you would be
able to actually do much more than was possible before. We can also work with
the objective functions. In this talk, in this
research that I showed, we just focused on the model
and the properties of the model in a structured fashion. So you can also work with
the objective function of your learning problem. You can also, for
learning processes, you can use
physics-informed kind of learning processes
in order to perform this type of learning. You can do causal entropic
forces, for example. This is like
defining intelligence as a force that maximizes
the future freedom of action. So that would be a new way
of formulating intelligence. And then, from there, you
would be able to actually get into much more. So this is actually an
exciting area of research that could be enabled and
scaled by what we showed today. And as I said, one of the
properties that we showed today is that there are certain
structures that can emerge from these liquid networks. And those structures are good. So you would be able to use
these for more complex tasks. So these are good candidates-- this could be giving
you some candidates for performing decision-making,
better decision-making, based on these selective computations. With that, I would like to
thank you for your attention. And all this technology
is open source. You can actually
get it online.
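
[Editor's sketch, not part of the talk] To make the earlier point about Euler unrolling and the dynamic causal model coefficients concrete, here is a minimal Python sketch. It assumes the commonly cited bilinear form dx/dt = (A + u B) x + C u with a single constant scalar input u, which may not match the notation on the speaker's slide. The function names, the two-node matrices, and the step size are all hypothetical choices for illustration. Because this g is linear in x, it is Lipschitz, so the uniqueness condition mentioned in the talk holds for this toy system.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def make_bilinear_dcm(
    A: Matrix, B: Matrix, C: Sequence[float], u: float
) -> Callable[[Vector], Vector]:
    """Return g(x) = (A + u * B) x + C * u for one constant scalar input u.

    A: internal coupling, B: input-dependent coupling between nodes,
    C: direct effect of the external input (all values are illustrative).
    """
    n = len(C)

    def g(x: Vector) -> Vector:
        return [
            sum((A[i][j] + u * B[i][j]) * x[j] for j in range(n)) + C[i] * u
            for i in range(n)
        ]

    return g


def euler_unroll(
    g: Callable[[Vector], Vector], x0: Sequence[float], dt: float, steps: int
) -> List[Vector]:
    """Unroll dx/dt = g(x) with explicit Euler: x(t + dt) = x(t) + dt * g(x(t)).

    Each new state is computed only from the previous state, which is the
    sense in which the unrolled system maps past events to future events.
    """
    trajectory: List[Vector] = [list(x0)]
    for _ in range(steps):
        x = trajectory[-1]
        dx = g(x)
        trajectory.append([xi + dt * dxi for xi, dxi in zip(x, dx)])
    return trajectory


# Hypothetical two-node system: node 0 receives the input, node 1 is driven
# by node 0 through A, and that coupling is strengthened by the input via B.
A = [[-1.0, 0.0], [0.5, -1.0]]
B = [[0.0, 0.0], [0.5, 0.0]]
C = [1.0, 0.0]

g = make_bilinear_dcm(A, B, C, u=1.0)
trajectory = euler_unroll(g, x0=[0.0, 0.0], dt=0.1, steps=200)

print("state after one step:", trajectory[1])
print("state after 200 steps:", trajectory[-1])
```

With these illustrative values, the trajectory settles close to the fixed point x = [1, 1]. Setting the entry B[1][0] to zero halves the effective coupling from node 0 to node 1, which is one way to see how B acts as an intervention on the interaction between nodes while C only scales the external drive.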