Hi everyone, my name is Fay. I'll be your host
today. So, today's session is "Insights from Kaggle Grandmasters and Experts on Competitive
AI and LLM Frontiers". Make sure that you're in the right session. We'll have time for questions
at the end of the session, so you're welcome to submit your questions in the GTC app or ask
questions using one of the aisle microphones. So yeah, without further ado, I'll give
the floor to the speakers. Thank you. Great, thanks Fay. Hi everybody, my name is David
Austin. I'm a Kaggle Grandmaster and I work at NVIDIA. I'm fortunate enough to spend some
of my time at NVIDIA working on AI competitions and learning a lot of new techniques and methods.
We're here to share a lot of that with you today. A big thing that we like to do is apply what
we learn, whether it's competitions or taking something out of research and putting it into the
application domain. So we're going to talk about a lot of different topics around LLMs, vision,
generative AI, competitive AI, but the real slant today is going to be around how do we take
all these cool things that are happening in the world today and apply them to real problems.
We will leave time at the end for questions, so if you came with a question that we didn't
address, please feel free to ask. There's also a "Meet the Expert" session at 2:00 where you can
come talk to us one-on-one. So one way or another, you can get your questions answered at some
point today. But first, I'd like to introduce my panel of colleagues here, and let's go ahead
and start with G. Hello everyone, my name is G Le. I'm a data scientist and software developer
from the Large Language Model technology team. I'm working on code generation and retrieval-augmented
generation. I'm also a Kaggle Grandmaster. I used to work a lot on competitions before
all the LLM stuff. I also work on RAPIDS, which is a GPU-accelerated data science
framework. Nice to meet you all. Great. And Chris? Hi, I'm Chris. I'm a senior
data scientist at NVIDIA. I have a PhD in mathematics with a specialization in computational
science. I love doing data science competitions, and I'm currently a quadruple Kaggle Grandmaster.
Next, we have Laura. Yeah, hello, I'm Laura. I'm a research manager at NVIDIA. Before, I was a
professor at the Technical University of Munich in Germany. My research group is interested in
perception, dynamics, and understanding. So today, I will talk a lot about LLMs and
their interaction with vision systems. And lastly, we have Kazuki. Hi, I'm Kazuki, and
I'm also a Kaggle Grandmaster. I joined this team four years ago, and my expertise
is recommender systems. Thank you. Thanks, Kazuki, and thanks for
coming in from Japan for this talk. So let's go ahead and get started. You know,
probably the hottest topic that we've heard at the conference and that we're seeing evolve
in the competition space is around LLMs, and specifically the large generative models.
So, G, maybe you could start us off talking a little bit about these generative models, how
they work, how they're trained, how we use them. Yeah, of course. So, training a large language
model like GPT is a very computationally intensive task, and it is a multi-stage
process. The first stage is pre-training a foundation language model. So basically, we
collect massive text data from the internet and train the model to imitate human language
and learn how to complete documents. The second step is what we call supervised
fine-tuning. So basically, we want to create a smaller but high-quality dataset, you know,
by human labelers, for specific use cases like chatbots, QA, creative or professional
writing, or coding. So when we have these smaller, high-quality datasets, we apply the same
language-modeling objective to continue training the model. The third step is called RLHF, which stands
for reinforcement learning from human feedback, or DPO, which stands for direct preference
optimization. The goal is basically the same as the second step, but it uses a cheaper and
easier kind of dataset: user feedback in the form of preferences. It's usually a binary signal:
the chatbot generates two answers for the same question, and a human tells us which one is
more helpful, useful, or better. This preference gives us
feedback, and we continue to train the model. Lastly, we could apply a guardrail to the model
to prevent it from generating any toxic or harmful information. So yeah, that's how we train the
chatbot, right?
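(To make G's pipeline concrete, here is a minimal sketch, assuming the Hugging Face transformers library and a small stand-in checkpoint like gpt2: pre-training and supervised fine-tuning reuse the same next-token-prediction loss, and only the data changes, while RLHF or DPO adds a preference-based objective afterward. This is an illustration, not the actual recipe behind any specific model.)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative stand-ins; real pre-training uses far larger models and corpora.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_loss(text: str) -> torch.Tensor:
    """Standard causal language-modeling loss: predict each next token."""
    batch = tok(text, return_tensors="pt")
    return model(**batch, labels=batch["input_ids"]).loss

# Stage 1: pre-training on massive text collected from the internet.
loss = lm_loss("Some document collected from the internet ...")

# Stage 2: supervised fine-tuning on a smaller, human-curated dataset
# (chatbot demonstrations, QA, coding examples, etc.) with the same loss.
loss = lm_loss("User: What is RAG?\nAssistant: Retrieval-augmented generation ...")

# Stage 3: RLHF or DPO then optimizes the model on preference pairs
# ("answer A was rated better than answer B"), e.g. with trl's DPOTrainer,
# and guardrails can be added on top at deployment time.
loss.backward()
```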
Right. There's a lot going on there, a lot we could do with them. We see them
used a lot in competitions today. But it was not long ago that there was another family of
models that was probably the most prevalently used. And I don't know if anybody used them more
than Chris in competitions. And that's really more the BERT style of models, where you know we
need additional context. So Chris, can you talk a little bit about BERT and how that compares
to some of the LLMs that we're using today? Yeah, certainly. So there are a lot of language
models out there, even more so than the chatbots, and it gets really confusing. They basically fall
into three families. There are models like GPT, which stands for Generative Pre-trained
Transformer. There are models like BERT, which stands for Bidirectional Encoder Representations
from Transformers. And there are models that are full architecture Transformers, like T5. The
difference between the groups, the first major difference, is how they're pre-trained. So before
you fine-tune a model on your specific task, it's pre-trained on billions of texts to get a general
understanding of language. BERT is pre-trained by showing it lots of text, and then randomly words
are hidden. And then BERT needs to use the words before and after the hidden words to try to guess
what the hidden word is. This is an autoencoding task. And as such, BERT understands vocabulary,
structure, and semantics very well. Now, GPT-like models, during their pre-training,
they see a lot of text and they need to predict the next word. So as such, they're very
good at flow and what comes next. And then, in addition to the differences in pre-training,
there are also differences in the architecture. So a full Transformer has an encoder and a decoder,
and this group includes models like T5. Now, BERT is just an encoder. So you input text and it goes
through a series of self-attention layers, and out comes a mathematical vector called an embedding,
which represents the text. Now, GPT is just the decoder. So you put in an embedding, and then
after a series of layers, out comes text. So you can see there are lots of different LLMs, lots of
differences between them, and as such, they all excel at different tasks. So there's constantly going to
be the need for different encoder-decoder types of models, just depending on the application. Kazuki,
could you maybe talk a little bit about what are some of those applications when you would use
the encoder versus decoder type of models? Sure. Speaking of BERT, there are some
Kaggle competitions where BERT was used. One competition's goal was to evaluate
student summaries, and another competition's goal was to evaluate the complexity of passages. So
both tasks require evaluating and classifying sentences. I think these are good examples of
use cases for BERT because BERT is very good at classifying text. But GPT is used for generating
sentences, like a chatbot. For me, I'm using GPT for generating simple code. When I say, "Can
you show me an example of the PyTorch DDP?" GPT returns an example. I often hear people say
they don't want to code without GPT. So I think the roles of BERT and GPT are very different.
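(A minimal sketch of that split, assuming the Hugging Face transformers library with small stand-in checkpoints: an encoder like BERT turns text into an embedding you can classify or compare, while a decoder like GPT-2 continues the text.)

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

text = "Can you show me an example of PyTorch DDP?"

# Encoder (BERT-style): text in -> embedding out, good for classification and scoring.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    hidden = bert(**bert_tok(text, return_tensors="pt")).last_hidden_state
embedding = hidden.mean(dim=1)          # one vector representing the sentence

# Decoder (GPT-style): text in -> continuation out, good for generation.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok(text, return_tensors="pt").input_ids
out = gpt.generate(ids, max_new_tokens=40, do_sample=False)
print(gpt_tok.decode(out[0], skip_special_tokens=True))
```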
Yeah, certainly. There are applications for both. The cool thing is, it's not just limited to
the LLM and NLP space. We can actually apply these LLMs in other areas. My background is in vision,
and I'm seeing some really cool stuff happening in the vision space as we're using language
models. Laura, maybe you could talk a little bit about that. What are you seeing in terms of that?
Yeah, definitely. LLMs have had a huge impact in vision, and in particular, in the way that we
interact with our vision systems. Before LLMs made this big splash, we were not even thinking
about interacting with our vision systems using natural language. This was made possible by CLIP,
which was one of the first algorithms that said, "How about we align the text modality with the
image modality?" Chris explained before how to obtain an embedding from text, and now the idea of
CLIP would be to obtain an embedding from an image and put these two together in the same embedding
space. If they represent the same object, for example, if you have the text "dog" and
you have an image of a dog, you want to put these two embeddings closer together. How do you
train such a system? You need a bunch of images with their corresponding captions, captions that
actually explain the content of the image. Then you train this system to align the embeddings.
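(Here is a rough sketch of that alignment at inference time, using the openly released CLIP checkpoint through the transformers library; the image path and candidate captions are placeholders. Training works the same way in reverse: a contrastive loss pulls matching image-caption embeddings together and pushes mismatched pairs apart.)

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                       # placeholder image
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Text and image are embedded into the same space; matching pairs score higher.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # similarity over the captions
print(dict(zip(captions, probs[0].tolist())))
```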
What is cool now is that you can go from one modality to the other, and you can do really nice
things. You can now talk to your vision system using natural language, which has really allowed
us to think bigger in terms of how to apply our vision systems to much more than just categories
like cars and pedestrians. We're now thinking big in terms of natural language and perception.
The perspective has really changed with LLMs. This idea of bringing embeddings from different
modalities into a common embedding space opens up so many possibilities and is
very powerful. What capabilities are you seeing that this is opening up?
For us, we're interested in perception, as I mentioned before. LLMs have allowed us to do
what we now call open-world scene understanding. For example, let's take the task of semantic
segmentation. Before, what we used to do is grab a certain number of classes that we were interested
in. If you're interested in autonomous vehicles, you want to detect and segment pedestrians,
cars, roads, etc. So there was this fixed set of classes, and we were training our
systems to segment based on that. But now, with LLMs, the perspective has changed. Before,
it was unclear how to scale up such a system to handle the infinite number of objects that we
can find in the world. But now, with LLMs, we actually see a path forward. The idea is that you
use prompts, you use natural language to express what you want to find in the image, and the vision
system needs to segment anything that you prompt, like fire hydrants, dogs, cows, whatever, not
just a set of predefined classes. So this is a way of doing open-world semantic segmentation
or scene understanding, which is a completely different game from what we were doing before.
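(One open-source example of this prompt-driven idea is CLIPSeg; the sketch below, assuming the transformers library and a placeholder street-scene image, segments whatever you name in free-form text rather than a fixed class list. It stands in for the general approach Laura describes, not for any specific system her group uses.)

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("street.jpg")                      # placeholder image
prompts = ["a fire hydrant", "a dog", "a cow"]        # arbitrary text, not predefined classes

inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                   # one low-resolution mask per prompt
masks = torch.sigmoid(logits)                         # shape: (num_prompts, H, W)
```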
And of course, LLMs have also changed the way we do generative AI. We now have things like
DALL-E or Midjourney that leverage the alignment capabilities of CLIP that I mentioned before.
For example, DALL-E takes a text embedding and, using a diffusion model, generates an image that
represents what you describe in the text. Yes, you've probably seen those demos where you
can write a description like "a polar bear on a skateboard in Times Square" and get a
nicely generated image of exactly what you described. This opens up endless possibilities
for designers, artists, and the general public to interact with these vision systems
because now everything is through natural language. It opens up tons of possibilities.
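(As a concrete open-source stand-in for the text-to-image workflow Laura describes, here is a minimal sketch with the diffusers library and a Stable Diffusion checkpoint; the model name and prompt are illustrative.)

```python
import torch
from diffusers import StableDiffusionPipeline

# Text prompt -> image, using a CLIP-style text encoder inside the pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a polar bear on a skateboard in Times Square"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("polar_bear.png")
```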
For those of us working on competitions, we're always looking for what's next, what
is the next edge that we can get. And some of the capabilities you're talking about
are really exciting. I mean, what do you think is next? What are the next frontiers that
we're talking about here with vision and LLMs? Well, we've only started exploring text
and images. But there are tons of other modalities. Without going too much further,
we have videos. We have seen things like Sora, for example, that generates videos from text. But
there's still a lot to explore. There's a question of how temporally coherent those videos are and
the captions used to train these models. It's the same idea as with CLIP, where you want to align a
video with a caption explaining its content. But the question is whether this caption only explains
individual objects or also describes motion and actions. So there's a whole new research field
to explore in terms of what kind of captions we use to train these systems and how temporally
coherent our videos will be. There's a lot of work that will appear in this area, I think.
And then there's also the whole 3D world. We have other sensors, like lidar, and we also want to
align geometric features with language and images. So there's really tons to explore in different
modalities. We have been working, for example, on lidar and trying to prompt objects in the
lidar space using geometric and shape features. So, I think it's going to be super exciting
because now we're going to be able to generate, for example, full objects in 3D using text
prompts. There's tons and tons that is going to appear, I think, in the upcoming years. Yeah,
yeah, really exciting stuff. You know, starting to bring it into the competition space a little
bit. You know, it wasn't that long ago where the things that wowed us a little bit were things
like retrievers, where you can just retrieve images or retrieve text and get commonalities.
But now, with generative AI, we've been able to move far beyond that, and actually, we can
combine the two concepts. And so, there's this thing now called RAG that everybody's
talking about: RAG this, RAG that. Chris, why don't you demystify RAG a little bit?
Tell us what RAG is and how it's used. Okay, so RAG is a really cool technique
that extends the capabilities of LLMs, and it stands for Retrieval-Augmented Generation.
So, if you ask a basic chatbot a question, then it's going to answer that question from its
memory, from what it already knows. When you use RAG, you have an LLM and a set of documents.
So, then you ask a question, and the first step is we search all the documents for chunks
of text that relate to the question. And then we give both the question and all those helping
chunks of text to the LLM. It looks at it all, and then it gives an answer. And this happens all
without us even knowing. But as such, the answer comes back, and it's so much more accurate.
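(A toy sketch of that loop: embed the document chunks, retrieve the closest ones for a question, and hand both the question and the retrieved context to the LLM. It assumes a sentence-transformers model for retrieval; the chunks and the final generate call are placeholders for whatever LLM you deploy.)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# 1) Embed the document chunks once, up front.
chunks = [
    "The mitochondria is the powerhouse of the cell.",
    "RAG retrieves supporting text before the LLM answers.",
    # ... in practice, millions of chunks
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 2) At question time, retrieve the closest chunks by cosine similarity.
question = "What does RAG do before generating an answer?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]
top_k = np.argsort(chunk_vecs @ q_vec)[::-1][:2]
context = "\n".join(chunks[i] for i in top_k)

# 3) Give both the context and the question to the LLM.
prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
# answer = my_llm.generate(prompt)   # placeholder: call whichever LLM you deploy
```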
I had a chance to experience this in a recent Kaggle competition called the LLM Science Exam.
We were challenged to build a system that could answer multiple-choice science exam questions.
And we were limited in how big the language model could be, and there were also time and
resource constraints. So, as such, we couldn't submit a model as big as, say, ChatGPT, which may
already have a lot of the knowledge in its memory. We had to submit smaller models. So,
the solutions that won this competition used RAG, and specifically, people were submitting
models and, at the same time, a set of documents. Specifically, they submitted
all six million Wikipedia articles together, and then what their
code would do is, when it was about to answer a science exam question, it would first scan all
six million articles in the blink of an eye and find any texts that relate to the question. Then,
it would feed that helpful information plus the question to the LLM, and it would give back an
answer. I witnessed this firsthand because on my computer, I would just make challenging questions.
I would make a question about quantum physics, about a specific detail or a number, and think,
"No way would it find it." But sure enough, in the blink of an eye, it would come back
with the answer, and it was something like 97-98% correct. So, it's truly incredible
what these RAG systems can do. And the most impressive thing is that all of this is happening
behind the scenes. You're just asking a question, and answers are coming back. It's doing
the retrieval and all that kind of stuff, and it's just all in the blink
of an eye. It's really amazing. For those of you who might be interested in
finding out more about that or seeing this in action, Chris published some really great
notebooks that were some of the highest voted ones in Kaggle a few months back during this
competition. So, you can go and check those out and see how he trained RAG and how he
did inference with RAG. Really good stuff. Kazuki, Chris talked about a couple of
things there. He talked about retrieval, he talked about LLMs doing some generation.
How do you balance those? Is one more important than the other? How do you view
the trade-off between retrieval and the LLMs? Let me talk about this topic for RAG and
fine-tuning. There are some papers that compare RAG and fine-tuning, and almost all of the
papers show that RAG is better than fine-tuning. This is because fine-tuning is a very difficult
method to apply due to catastrophic forgetting. That means when you want to train new things,
like the latest news, of course, you can do that, but the model often forgets all the previous
knowledge. On top of that, RAG is very cost-effective compared to fine-tuning because
fine-tuning requires a lot of computing resources. That said, I think it's worth trying
fine-tuning when you want specialized understanding. And I think
we should find the sweet spot between saving money and meeting requirements. Yeah, so basically,
RAG is something that can make LLMs even better than the LLM itself. And based on what you're
saying, you know, it could be cheaper as well, not having to fine-tune models and get additional
data. And it can be more efficient. So, that's obviously very powerful.
But, you know, something, of course, we're interested in is the applications of
that. So, G, what are you seeing in terms of different applications for RAG right now?
Yeah, so, um, I think there are two kinds of interesting applications using RAG. The first is
to protect privacy. We all have a lot of private data, either personal or enterprise, which we
don't want to share online. What we can do is bring the LLM into a locally controlled environment:
deploy an open-source LLM, create a vector database with an embedding model, and build a
RAG system connecting our private data to this locally deployed
LLM. This allows you to talk to your data, leveraging the capability of the LLM while
protecting the privacy of the data. We actually have two demos you can interact with on the second
floor, at the demo booth. We have Chat with RTX: basically, it's deployed on a Windows
laptop, and you can talk to PDF files and other kinds of files using large language
models. Another demo is "Talk to Your Data with NeMo Agent." So, whenever you have a
question, there's an agent which can route the question to an unstructured text agent or
to a structured SQL retriever and synthesize the answer and get back to you. So, I think these
are quite interesting privacy-protecting demos. The second kind of applications, I think, is to
enhance the recency of the use cases. For example, a news or finance agent, LLM-powered
search, and also copilots. These process real-time streaming data and help
us accomplish tasks like replying to an email, writing a short summary of the
conference, or writing code. So, yeah, I think those are the interesting applications.
Yeah, yeah, yeah, the applications are just limitless. So, you know, we've been talking about
applications for LLMs and RAG, this common embedding space between vision and LLMs, and some other
hot areas. You know, I'm interested, I know we all are, well, how can you take these things and
actually apply them in the competition space? So, you know, with these new technologies, it seems
like competitions are starting to change a little bit. For example, we're seeing LLM competitions
where there's no data provided or only one data point, and you've got to generate your own data.
We're starting to see changes there. G, what other changes are you seeing in the competition space?
Yeah, so, just like you mentioned, I think a very interesting trend in the Kaggle competitions
is that there are more and more competitions which don't provide any training data at all or
provide very little training data, which is not enough to train a powerful predictive model. So,
the challenge here is to ask all the participants to come up with novel ideas and solutions to
collect their own data, curate their own training data. This is actually a very critical step
for any machine learning task. But previously, on Kaggle at least, the training data is fixed,
and it's very hard or impossible to expand the training data. But now, we are seeing more and
more use cases where participants leverage LLMs to generate training data, which actually creates
a great competitive advantage to win a competition. And this is also very cost-effective
compared to manual labeling. So, I expect more such competitions,
and I think this skill is actually quite useful for other tasks outside competitions.
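(A hypothetical sketch of that data-generation step, using the transformers text-generation pipeline with a small placeholder model; in a real competition you would swap in a much stronger instruction-tuned LLM and vary models, prompts, and temperature to get diverse synthetic examples.)

```python
from transformers import pipeline

# Placeholder model; in practice use one or more strong instruction-tuned LLMs.
generator = pipeline("text-generation", model="gpt2")

topics = ["photosynthesis", "plate tectonics", "the French Revolution"]
synthetic_rows = []
for topic in topics:
    prompt = f"Write a short multiple-choice exam question about {topic}:\n"
    for temp in (0.7, 1.0):  # vary temperature for more diverse samples
        out = generator(prompt, max_new_tokens=80, do_sample=True, temperature=temp)
        synthetic_rows.append({"topic": topic, "text": out[0]["generated_text"]})

# synthetic_rows can then be filtered/labeled and used as extra training data.
```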
Yeah, yeah, yeah, I totally agree. You know, another area where we're seeing the application of
some of these things that we weren't seeing before in the competition space is maybe in recommender
systems. And Chris, I know you've done a lot of work in recommender systems before. Have you had
a chance to use LLMs with recommender problems? Yeah, we have. So, as LLMs are being developed,
we're actually seeing them improve all other areas of AI. And Laura had spoken about how
it's helping with vision. But another example is recommender systems, right? So, recommender
systems are when you go onto an online shopping site and it suggests something you might like, or
a streaming video website and it suggests movies. So, the way recommender systems work is there are
users and items, and it attempts to recommend an item that the user is going to like. Typical
ways of solving this are: you could look at the items that a user previously engaged with and
then find items that are similar to those items, or you could look at a user and find other users
that are similar to that user and then see what items they like. Lastly, you can find patterns
between users and the items they engage with. The way LLMs help is, if you remember,
we had mentioned how a model like BERT can encode a block of text. So, items can
be represented by their text description, and we can take that description and encode
it into an embedding. An embedding is like a point in space, a little dot. And when you
encode all the items, you have all these dots, and then we can find which items are similar
by just finding which dots are the closest. So, it now gives us a new way to find similar items.
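(A minimal sketch of that dots-in-space idea, assuming a sentence-transformers model for the item descriptions; the product texts are made up.)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

items = [
    "black collared cotton shirt",
    "black slim-fit t-shirt",
    "stainless steel frying pan",
    "charcoal polo shirt",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(items, normalize_embeddings=True)    # one "dot" per item

# Items the user engaged with become a query point; nearby dots are candidate recs.
history = ["black collared cotton shirt"]
query = model.encode(history, normalize_embeddings=True).mean(axis=0)
scores = vecs @ query                                    # cosine similarity
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.2f}  {items[idx]}")
```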
Likewise, we can apply that to users. And lastly, by using these embeddings, these dots, we can
actually find patterns between users and items in this embedding space. So, using LLMs is really
helping us make more accurate recommender systems. And I think actually, you were able to
use this in a recent KDD Cup competition, right? Maybe you could tell us about that.
Yeah, we did. So, recently, I teamed up with a bunch of co-workers and we entered the
prestigious annual KDD Cup, which was in 2023. The competition was hosted by Amazon, and the
task was to build three recommender systems. So, when you visit the Amazon website in different
countries and in different languages, the tasks were: we had to build a recommender
system for languages where we have lots of data, then we had to build a recommender system for
underrepresented languages with not a lot of data, and lastly, we had to build a recommender system
which would recommend products that do not exist yet. So, yeah, interesting challenge.
Our solution used large language models, and specifically, we used embeddings to find
similar items. And then, furthermore, embeddings allowed us to do something else: when we
found patterns in the languages which had lots of data, we were able, via transfer learning
and translation, to apply those patterns to the recommender system for the underrepresented
languages, because we're working in a shared embedding space. That gave us a huge edge there. And
then, in the third task, where we had to generate potential items that don't even exist yet, we
used models like GPT, which would start with an embedding of items that users like, and then it
would generate text descriptions of products that don't even exist. So, using language models
allowed us to combine classical techniques and make very accurate models. And the NVIDIA
team actually won first place in every single task. I thought you were getting ready
to clap. So, we were super excited about that, and it was a great demonstration of the power
of LLMs helping out with other forms of AI. Yeah, that's a great example of how some of
these new technologies are coming in and can be applied not only in the real world, like
some of the applications we talked about, but also in competitions. So, clearly,
we're seeing changes in that space. So, Kazuki, I mean, where is this headed? What
do you see as the future of competitions? How might they look different in the future?
Yeah, I think LLMs will be an even more powerful tool for human annotators. They can speed up the
annotation process by taking over augmentation and suggesting labels. In other words, they can
focus on more essential tasks, which is exactly what the organizers are looking for. So, I think,
as LLMs improve, the machine learning models will be more accurate and robust using high-quality
data. Also, I think it makes computer vision and natural language understanding more reliable.
Yeah, which goes back to what G was talking about, about the problem with data, and now we can
use LLMs to do more with data and annotation and generation. So, certainly, that should be a change
that we should be looking out for. So, great. Well, we covered a lot of topics today, you know,
some of the latest technologies, how we're using them, how they could be used in competitions.
But we'd love to hear from you. Any questions that you have for us about any of these topics or
anything beyond? We'd be happy to take questions. Is it working? Oh, cool. First of all, thank you
for the awesome panel. The question I have about the future of machine learning competitions is,
in the past, if you participated in a machine learning competition, there was a chance
you would contribute to the state-of-the-art research. AlexNet would be a perfect example.
And to do that, the barrier to entry was pretty low. You just needed a computer with a GPU, and
you basically had to be smart. That's it. Now, cutting-edge research, state-of-the-art research,
requires you to train large models, which cost at least a few million dollars and require a cluster
of computers. Not everyone in this room has access to those kinds of resources. So, do you think that
in the future, machine learning competitions will still provide a venue for discovering cutting-edge
breakthroughs and state-of-the-art developments? Or will they become marginalized and mostly
serve as a venue for recruitment and a place for people to enjoy their hobbies?
Sure. Yeah, I'll start with that, and maybe somebody else wants to contribute.
So, there's a self-regulating factor involved, which is the amount of compute allowed per entry.
You can go off and train these advanced models, but the way competitions are working today is
mostly through code competitions. You have to submit your code to an inference server that has a
limited compute envelope. So, what we're seeing is a lot of neat innovations on how you can compress
these models, how you can quantize them, how you can get them to run within this limited envelope.
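(For example, one common way competitors squeeze a large model into a fixed compute envelope is weight quantization; a rough sketch below with transformers and bitsandbytes, where the model name is only a placeholder.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder; any causal LM

# 4-bit NF4 quantization roughly quarters the memory of fp16 weights, which is
# often the difference between fitting and not fitting on the evaluation GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
```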
And I think that's the factor that normalizes the playing field a little bit and doesn't
make it just about who has the most compute. Because if it was about that and you just had
to submit a static CSV file with your solution, then I think the premise of your question would
be exactly right. It would just go to whoever has the most compute. But that's not the case,
and we're seeing some really innovative things, even beyond the scope or intent of the actual
competition, that go into this efficiency problem. Because everybody's trying to take advantage
of the latest and greatest in state-of-the-art, but how you can compress that into a limited
compute envelope that everybody has access to becomes almost a challenge in and of itself.
Yeah, I can add. So, I think even now, all machine learning competitions can still
contribute to the state-of-the-art research. I think two examples are, first, the mixture
of experts. So, if you take a look at the Hugging Face Open LLM Leaderboard, many of the top
entries are actually created by mixing several language models in an innovative way. It's not
as computationally intensive as one assumes. It can be done on a laptop or even on a single GPU.
It's possible, and it's like an ensemble of LLMs. A second example is the QLoRA (Quantized Low-Rank
Adaptation) approach. You train a very small adapter, even though the LLMs have billions of
parameters. The adapter itself is just megabytes in size. In some cases, it can greatly enhance
the capability of the LLM in a low-cost way. Thanks. Thank you. We have the next
question. Yeah, great talk, by the way. So, I have a question about the third part of the
competition that you guys were mentioning, that you guys won. I felt like you kind
of skipped a step. You're talking about taking the embeddings and then using them to make
recommendations on new products. I didn't really understand the jump between the embeddings and
the recommendations. Could you expand on that? Yeah, so let's say a user previously
browsed a bunch of black shirts. Basically, a good assumption of what they would like
in the future is maybe more shirts. They're obviously interested in shirts, and
maybe they like the color black. So, you basically pick items that are
similar to their history of items. The process of embedding involves taking
the text description, like "a collared shirt made of this material." You take the text
description, and embedding is essentially a mathematical vector. It's a dot. Then, you
can take every other item on the website and embed them into dots. In this embedding space,
all the dots that are close to the black shirt will most likely be other shirts and things of
similar colors. So, all the dots will cluster. That's what we look at. We look at their
previous history, which is a bunch of dots, and then we pick recommendations that are
close by. But, I'm sorry, I didn't express myself very well. So, how do you come up with
new ideas for new products based on that? Oh, you mean the third task, the
generative AI one? Sorry, yes. Oh, okay. Well, for the generative AI task, once
you have an embedding of products, for example, you can take five of their previous products,
average the embeddings, and get an average embedding. Then, you run a decoder. You put that
embedding in, and it will attempt to convert that embedding into a product. But since you
essentially generated a new embedding, the description it writes is not an existing one.
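(A toy sketch of the mechanism Chris describes, not the team's actual KDD Cup solution: average the embeddings of items a user liked, project that vector into a small decoder's input space, and train the decoder with the usual language-modeling loss to write a description conditioned on it. All model names and product texts are placeholders.)

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # 384-dim item embeddings
tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")  # hidden size 768
project = nn.Linear(384, decoder.config.n_embd)         # map item space -> decoder space

history = ["black cotton collared shirt", "black slim-fit t-shirt", "charcoal polo"]
avg = torch.tensor(encoder.encode(history)).mean(dim=0)         # averaged "taste" vector
prefix = project(avg).reshape(1, 1, -1)                         # one soft prefix token

# One training step: teach the decoder to write a real description given the prefix.
target = tok("black oxford shirt in breathable cotton", return_tensors="pt")
tok_emb = decoder.get_input_embeddings()(target.input_ids)
inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
labels = torch.cat([torch.full((1, 1), -100), target.input_ids], dim=1)  # no loss on prefix
loss = decoder(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()  # after training, a new averaged embedding yields a new description
```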
Okay, I'm sorry. I have so many questions. I apologize for taking up all
the time. How do you go from embedding to description when you average the
embedding? I'm not really sure about that step. I see. Basically, you have to fine-tune the model.
You need a lot of data where you have embeddings and their corresponding text descriptions. Then,
you train the model to convert an embedding to text. The model generalizes by being able to
take a new embedding it has never seen before and attempt to convert it to text. It will come
up with some text that it has not seen before. So, you created a particular model for that?
Yeah, correct. There's not a pre-existing Amazon recommender model from Hugging Face.
Got you. So, we have one more question in the middle, and then we'll go to online questions. And
if we have time, you can ask the experts as well. My question is more about representation
and generation. Specifically, to Laura, you mentioned CLIP, right? And there's CLIP, CLAP,
ImageNet. Do you see these representation models learned separately with some grounding, and then
those embeddings are fixed and used in whatever generative model to generate images, like image
tokens, or in language models, like text tokens, etc.? Or do you see the future as everything
together, where both representation and generation happen in the same model, like Palmyra, where you
feed everything in as a token and then generate? Mhm, yeah, that's actually a great question. Um,
so right now, for research, it's much, much easier to treat the problem separately. Right? So, we
usually take pre-trained models. We don't even touch them. They are frozen, and we just try
to extract the knowledge from there, right? And this relates also to the first question. This
is something that you can do with much fewer resources. Um, so I think this makes sense. Uh,
but also, there's another reason for doing that, and that is because the training data that you use
for CLIP is not the same one that you're going to use, for example, to train a stable diffusion
model that generates the images, right? Um, so I think it's much easier if each system is just
optimized for the task that it has to do, and then you just plug them together, right? So, I think
CLIP is already perfect for its purpose, right? And then you can just extract the information
and do your generation, do your perception task separately, and you don't need to retrain both
models together. This would be a huge overhead. Okay, so let's go through some of the
online questions. Um, so the first one is, how can we get the community more involved
in AI for open-source technologies? And what are the most exciting parts? And how can we
offer this to the community even more? So, G, I know you do a lot of work in the open-source
community. Do you want to tackle that one? Yeah, one thing I can think of is
to lower the hardware requirements of LLMs. So, actually, one of the open-source projects we're
working on, unfortunately, it's not available right now, but it will be soon. We are trying
to reproduce the use case Chris just mentioned, the Kaggle LLM Science Exam, using
RAG. And we want to reproduce that solution on a single GPU, specifically a T4 with 16
gigabytes of GPU memory, so that it can be run on a single GPU. So,
in the process, we made several improvements, like FP8 quantization of
the language model, and we use the IVF-PQ algorithm to create the vector database.
As Chris mentioned, we have 65 million text documents, and that translates to
something like 110 gigabytes. With IVF-PQ, our vector database is just 6 gigabytes.
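(For reference, this is roughly what building an IVF-PQ index looks like with the open-source FAISS library; RAPIDS has GPU-accelerated equivalents, and the dimensions and parameters below are illustrative rather than the ones used in the project G describes.)

```python
import faiss
import numpy as np

d = 384                                              # embedding dimension (illustrative)
xb = np.random.rand(100_000, d).astype("float32")    # stand-in document embeddings
xq = np.random.rand(5, d).astype("float32")          # stand-in query embeddings

# IVF-PQ: cluster vectors into nlist buckets, then product-quantize each vector
# into m sub-codes of nbits each, shrinking the index far below raw fp32 size.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # nlist=1024, m=64, nbits=8
index.train(xb)
index.add(xb)

index.nprobe = 16                                    # clusters to scan per query
distances, ids = index.search(xq, 5)                 # top-5 nearest chunks per query
```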
So, yeah, we apply these optimizations. Hopefully, we can create a demo that users
could experience on an entry-level GPU and reproduce the exact same solution on, you
know, Kaggle kernels or on Google Colab. So, I think that would make it easier for people
to start with large language models. Thank you. Um, so we have one more question from the
online audience. Um, what are the most important data science challenges related to
LMs that are still unsolved, and which ones do you think we will be able to solve? There
are still problems with, I wasn't aware. Well, yeah, so I'll share my thoughts, and maybe
some of the other panelists have theirs. Um, and it goes a little bit to what G was just
talking about around accessibility. I mean, the models are big, they're heavy, they take a
long time to infer. Um, and there have been a lot of innovations over the past six months, and
my gosh, they're coming out every week it seems like now, on how we can compress them, make
them run faster, make them easier to train, cool training techniques. But we've got to
improve the accessibility problem for wider adoption and application. And as you can tell from
the tenor of the talk, we're really interested in application and applying these things. So,
to me, that's the biggest macro challenge, but we're seeing a lot of micro solutions to that,
but still a long way to go. Any other thoughts? Yeah, I'll add. So, one of the things I'm
looking forward to seeing is, currently, one of the weaknesses of LLMs is mathematical
reasoning and logic. They really excel at the humanities and social sciences. So, I'm looking
forward to, and they're constantly doing research in this area. I think a new model was released
recently which actually maybe outperforms ChatGPT on some mathematical tasks. So, I'm looking
forward to development there. Um, I think we have time for questions from the audience.
Yeah, hi. You commented earlier that for the competition, it's very important to come up with a
creative way to prepare the data. Could you share some experiences on what worked well so far and
what didn't work well for you from experience? Yeah, sure. So, I think there was a recent
competition on LLM essay detection. Basically, the task is to detect which essays are written
by students in high school and which essays are written by large language models.
In this competition, most of the training data provided is real student-generated data.
No LLM-generated data is provided, only three data points. So participants have to experiment with
different flavors, different LLM families like the Palmyra family and other open-source models,
to generate essays. And they somehow have to figure out which one has the closest distribution to the
test data. So there's a lot of analysis going on, studying the subtleties of the LLM-generated text
and trying to figure out, "Oh, maybe model A is what generated the test data," using the Kaggle
leaderboard to evaluate. I think that's actually a big factor in the final winning solution.
Yeah, and what I would add to that is, in this case, diversity is king. The more models
you can generate from, with different parameter settings and varying temperature, the better.
Basically, you cannot throw enough generated data at the problem because, to some degree,
you're guessing what the hidden test set or the application set would look like, and you
don't know. And so, when you don't know, the only way to combat it is to flood it with
as much diverse data as you possibly can. Hi, thank you for the talk. It's been very
insightful. I am really interested in what you guys were speaking about in terms of
multimodality. From what we've seen today, text seems to be sort of the gold standard, where
you're either taking an image and creating text from that and then using that as some sort
of embedding, or you're doing it separately. Every time you go from video to image or image
to text, you're losing a lot of information. Now, is text really the gold standard
because we have that as an interface, people typing on keyboards? And do you guys
see a future in which the standard might be asking a question by submitting a video and
getting a better response? Or is it really only going to be text for the foreseeable future?
Oh yeah, maybe I can take that. Um, so I mean, I don't know, there's so much that we could
discuss here. But I think there are systems, right, that, for example, you can imagine that
work on getting not only the text but also a bunch of documents to look into. You can also
look into a bunch of images that are retrieved, for example. So it's not that your system is just
limited to text. It's just that the first step of interacting with a human is so much easier
with text that that's what you start with, right? But for example, we have been working
on aligning brain signals with images and with text, and the alignment with images is just
much easier. So text doesn't really describe everything that is represented in the brain, maybe
because you're actually looking at a movie. You're recording the brain signal, and so the brain
signal is just much more correlated with an image. So I think your systems don't necessarily
need to go through text, but it's the human input that is so much easier with text. I think that
is kind of here to stay, but it doesn't mean that in the middle we cannot have other types of
connections between images and other modalities. It will not necessarily go through text.
And just as a quick follow-up, is there a way that you guys have seen effective to
go from a low-information environment into a high-information modality, such as from text
to voice, as opposed to the other way around? Sorry to interrupt, I think that's all the
time we have. And just a reminder, we have the opportunity to meet the experts in this afternoon
session as well. So if you have more questions, please feel free to ask the panelists. So let's
thank the panel, and thank you all for coming. Thank you for joining this session. Please
remember to fill out the session survey in the GTC app for a chance to win a $50 gift card. If
you are staying in the room for the next session, please remain in your seat and have your
badge ready to be scanned by our team.