Insights from Kaggle Grandmasters and Experts on Competitive AI and LLM Frontiers | NVIDIA GTC 2024

Video Statistics and Information

Captions
Hi everyone, my name is Fay. I'll be your host today. Today's session is "Insights from Kaggle Grandmasters and Experts on Competitive AI and LLM Frontiers", so make sure that you're in the right session. We'll have time for questions at the end of the session, so you're welcome to submit your questions in the GTC app or ask them at one of the aisle microphones. So, without further ado, I'll give the floor to the speakers. Thank you.
Great, thanks Fay. Hi everybody, my name is David Austin. I'm a Kaggle Grandmaster and I work at NVIDIA. I'm fortunate enough to spend some of my time at NVIDIA working on AI competitions and learning a lot of new techniques and methods, and we're here to share a lot of that with you today. A big thing that we like to do is apply what we learn, whether it comes from competitions or from taking something out of research and putting it into the application domain. So we're going to talk about a lot of different topics around LLMs, vision, generative AI, and competitive AI, but the real slant today is going to be how we take all these cool things that are happening in the world and apply them to real problems. We will leave time at the end for questions, so if you came with a question that we didn't address, please feel free to ask. There's also a "Meet the Experts" session at 2:00 where you can come talk to us one-on-one. So one way or another, you can get your questions answered at some point today. But first, I'd like to introduce my panel of colleagues, and let's go ahead and start with G.
Hello everyone, my name is G Le. I'm a data scientist and software developer on the large language model technology team, working on code generation, retrieval, and retrieval-augmented generation. I'm also a Kaggle Grandmaster; I used to work a lot on competitions before all the LLM work. I also work on RAPIDS, which is a GPU-accelerated data science framework. Nice to meet you all.
Great. And Chris? Hi, I'm Chris. I'm a senior data scientist at NVIDIA. I have a PhD in mathematics with a specialization in computational science. I love doing data science competitions, and I'm currently a quadruple Kaggle Grandmaster.
Next, we have Laura. Yeah, hello, I'm Laura. I'm a research manager at NVIDIA. Before that, I was a professor at the Technical University of Munich, Germany. My research group is interested in perception and dynamic scene understanding, so today I will talk a lot about LLMs and their interaction with vision systems.
And lastly, we have Kazuki. Hi, I'm Kazuki, and I'm also a Kaggle Grandmaster. I joined this team four years ago, and my expertise is recommender systems. Thank you.
Thanks, Kazuki, and thanks for coming in from Japan for this talk.
So let's go ahead and get started. Probably the hottest topic that we've heard at the conference, and that we're seeing evolve in the competition space, is LLMs, and specifically the large generative models. So, G, maybe you could start us off talking a little bit about these generative models: how they work, how they're trained, how we use them.
Yeah, of course. Training a large language model like ChatGPT is a very computationally intensive task, and it is a multi-stage process. The first stage is pre-training a foundation language model: we collect massive text data from the internet and train the model to imitate human language and learn how to complete documents.
The second step is what we call supervised fine-tuning. Basically, we create a smaller but high-quality dataset, built by human labelers, for specific use cases like chatbots, QA, creative or professional writing, or coding. With these smaller high-quality datasets, we apply the same language-modeling objective to continue training the model.
The third step is RLHF, which stands for reinforcement learning from human feedback, or DPO, which stands for direct preference optimization. The goal is basically the same as in the second step, but it uses a cheaper and easier dataset: user feedback in the form of preferences. It's usually a binary signal; the chatbot generates two answers for the same question, and the feedback tells us which one is more helpful, useful, or better. This preference signal lets us continue to train the model.
Lastly, we can apply a guardrail to the model to prevent it from generating toxic or harmful information. So that's how we train the chatbot.
Right. There's a lot going on there and a lot we can do with these models, and we see them used a lot in competitions today. But it was not long ago that another family of models was probably the most prevalently used, and I don't know if anybody used them more than Chris in competitions. That's the BERT style of models, where we need additional context. So Chris, can you talk a little bit about BERT and how it compares to some of the LLMs that we're using today?
Yeah, certainly. There are a lot of language models out there, even more so than the chatbots, and it gets really confusing. They basically fall into three families. There are models like GPT, which stands for Generative Pre-trained Transformer. There are models like BERT, which stands for Bidirectional Encoder Representations from Transformers. And there are full-architecture Transformers, like T5. The first major difference between the groups is how they're pre-trained. Before you fine-tune a model on your specific task, it's pre-trained on billions of texts to get a general understanding of language. BERT is pre-trained by showing it lots of text in which words are randomly hidden, and BERT has to use the words before and after the hidden word to guess what it is. This is an autoencoding task, and as such, BERT understands vocabulary, structure, and semantics very well. GPT-like models, during their pre-training, see a lot of text and have to predict the next word, so they're very good at flow and what comes next. In addition to the differences in pre-training, there are also differences in the architecture. A full Transformer has an encoder and a decoder, and this group includes models like T5. BERT is just an encoder: you input text, it goes through a series of self-attention layers, and out comes a mathematical vector called an embedding, which represents the text. GPT is just the decoder: you put in an embedding, and after a series of layers, out comes text. So there are lots of different LLMs with lots of differences, and as such, they all excel at different tasks. There's constantly going to be a need for different encoder and decoder types of models, depending on the application.
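A minimal sketch of the two pre-training behaviors Chris describes (an encoder filling in a hidden word, a decoder predicting what comes next), using Hugging Face transformers with small public checkpoints (bert-base-uncased, gpt2) as illustrative stand-ins rather than the models discussed on stage:

```python
from transformers import pipeline

# BERT-style model (encoder, autoencoding objective): guess a hidden word
# from the context on both sides of it.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Kaggle is a platform for data science [MASK].")[0]["token_str"])

# GPT-style model (decoder, autoregressive objective): predict what comes next.
gen = pipeline("text-generation", model="gpt2")
print(gen("Kaggle is a platform for", max_new_tokens=20)[0]["generated_text"])

# A pure encoder also gives you an embedding vector that represents the text.
feat = pipeline("feature-extraction", model="bert-base-uncased")
vec = feat("a collared shirt made of cotton")[0][0]  # embedding of the first ([CLS]) token
print(len(vec))  # 768 numbers describing the sentence
```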
Kazuki, could you maybe talk a little bit about some of those applications, when you would use the encoder versus the decoder type of model?
Sure. Speaking of BERT, there are some Kaggle competitions where BERT was used. One competition's goal was to evaluate student summaries, and another competition's goal was to evaluate the complexity of passages. Both tasks require evaluating and classifying sentences. I think these are good examples of use cases for BERT because BERT is very good at classifying text. GPT, on the other hand, is used for generating text, like a chatbot. For me, I'm using GPT for generating simple code. When I say, "Can you show me an example of PyTorch DDP?", GPT returns an example. I often hear people say they don't want to code without GPT. So I think the roles of BERT and GPT are very different.
Yeah, certainly. There are applications for both. The cool thing is, it's not just limited to the LLM and NLP space; we can actually apply these LLMs in other areas. My background is in vision, and I'm seeing some really cool stuff happening in the vision space as we're using language models. Laura, maybe you could talk a little bit about that. What are you seeing there?
Yeah, definitely. LLMs have had a huge impact in vision, and in particular in the way that we interact with our vision systems. Before LLMs made this big splash, we were not even thinking about interacting with our vision systems using natural language. This was made possible by CLIP, which was one of the first algorithms that said, "How about we align the text modality with the image modality?" Chris explained before how to obtain an embedding from text, and the idea of CLIP is to obtain an embedding from an image and put these two together in the same embedding space. If they represent the same object, for example if you have the text "dog" and an image of a dog, you want those two embeddings to be close together. How do you train such a system? You need a bunch of images with their corresponding captions, captions that actually explain the content of the image, and then you train the system to align the embeddings. What is cool now is that you can go from one modality to the other, and you can do really nice things. You can now talk to your vision system using natural language, which has really allowed us to think bigger in terms of how to apply our vision systems to much more than just categories like cars and pedestrians. We're now thinking big in terms of natural language and perception. The perspective has really changed with LLMs.
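As an aside, a minimal sketch of the CLIP-style image-text alignment Laura describes, scoring one image against a few free-form prompts; the checkpoint name and the local file dog.jpg are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # hypothetical local image
texts = ["a photo of a dog", "a photo of a car", "a photo of a fire hydrant"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image and text live in the same embedding space; matching pairs score higher.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```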
This idea of bringing embeddings from different modalities into a common embedding space opens up so many possibilities and is very powerful. What capabilities are you seeing this open up?
For us, we're interested in perception, as I mentioned before. LLMs have allowed us to do what we now call open-world scene understanding. For example, take the task of semantic segmentation. Before, what we used to do was pick a certain number of classes that we were interested in. If you're working on autonomous vehicles, you want to detect and segment pedestrians, cars, roads, and so on. So there was this fixed set of classes, and we trained our systems to segment based on that. But now, with LLMs, the perspective has changed. Before, it was unclear how to scale such a system up to handle the infinite number of objects that we can find in the world. But now, with LLMs, we actually see a path forward. The idea is that you use prompts, you use natural language, to express what you want to find in the image, and the vision system needs to segment anything that you prompt, like fire hydrants, dogs, cows, whatever, not just a set of predefined classes. So this is a way of doing open-world semantic segmentation or scene understanding, which is a completely different game from what we were doing before.
And of course, LLMs have also changed the way we do generative AI. We now have things like DALL-E or Midjourney that leverage the alignment capabilities of CLIP that I mentioned before. For example, DALL-E takes a text embedding and, using a diffusion model, generates an image that represents what you describe in the text. You've probably seen those demos where you can write a description like "a polar bear on a skateboard in Times Square" and get a nicely generated image of exactly what you described. This opens up endless possibilities for designers, artists, and the general public to interact with these vision systems, because now everything is through natural language. It opens up tons of possibilities.
For those of us working on competitions, we're always looking for what's next, what the next edge is that we can get, and some of the capabilities you're talking about are really exciting. What do you think is next? What are the next frontiers with vision and LLMs?
Well, we've only started exploring text and images, but there are tons of other modalities. Without going too much further, we have videos. We have seen things like Sora, for example, that generate videos from text. But there's still a lot to explore. There's the question of how temporally coherent those videos are, and of the captions used to train these models. It's the same idea as with CLIP, where you want to align a video with a caption explaining its content, but the question is whether this caption only describes individual objects or also describes motion and actions. So there's a whole new research field to explore in terms of what kind of captions we use to train these systems and how temporally coherent our videos will be. There's a lot of work that will appear in this area, I think.
And then there's also the whole 3D world. We have other sensors, like LiDAR, and we also want to align geometric features with language and images. So there's really tons to explore in different modalities. We have been working, for example, on LiDAR, trying to prompt objects in the LiDAR space using geometric and shape features. So I think it's going to be super exciting, because now we're going to be able to generate, for example, full objects in 3D using text prompts. There's tons and tons that is going to appear, I think, in the upcoming years.
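A minimal sketch of the text-to-image generation Laura described a moment ago ("a polar bear on a skateboard in Times Square"), assuming the Hugging Face diffusers API with a Stable Diffusion checkpoint as a stand-in for DALL-E-style systems:

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; any text-to-image diffusion model with this API works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is encoded by a CLIP-style text encoder and conditions the diffusion process.
prompt = "a polar bear on a skateboard in Times Square"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("polar_bear.png")
```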
Yeah, really exciting stuff. Starting to bring it into the competition space a little bit: it wasn't that long ago that the things that wowed us were retrievers, where you can retrieve images or retrieve text and find commonalities. But now, with generative AI, we've been able to move far beyond that, and we can actually combine the two concepts. And so there's this thing now called RAG that everybody's talking about: RAG this, RAG that. Chris, why don't you demystify RAG a little bit? Tell us what RAG is and how it's used.
Okay, so RAG is a really cool technique that extends the capabilities of LLMs, and it stands for Retrieval-Augmented Generation. If you ask a basic chatbot a question, it's going to answer that question from its memory, from what it already knows. When you use RAG, you have an LLM and a set of documents. You ask a question, and the first step is to search all the documents for chunks of text that relate to the question. Then we give both the question and all those helping chunks of text to the LLM. It looks at it all and gives an answer. This all happens without us even knowing, but as such, the answer that comes back is so much more accurate.
I had a chance to experience this in a recent Kaggle competition called the LLM Science Exam. We were challenged to build a system that could answer multiple-choice science exam questions, and we were limited in how big the language model could be; there were also time constraints and resource constraints. So we couldn't submit a model, say, as big as ChatGPT, which may already have a lot of the knowledge in its memory; we had to submit smaller models. The solutions that won this competition were RAG: people were submitting models and, at the same time, a set of documents. Specifically, they submitted all six million Wikipedia articles. Then, when their code was about to answer a science exam question, it would first scan all six million articles in the blink of an eye and find any texts that relate to the question. Then it would feed that helpful information plus the question to the LLM, and it would give back an answer. I witnessed this firsthand: on my computer, I would just make up challenging questions. I would make a question about quantum physics, about a specific detail or a number, and think, "No way will it find it." But sure enough, in the blink of an eye, it would come back with the answer, and it was something like 97-98% correct. So it's truly incredible what these RAG systems can do. And the most impressive thing is that all of this is happening behind the scenes. You're just asking a question and answers are coming back; it's doing the retrieval and everything else, all in the blink of an eye. It's really amazing.
For those of you who might be interested in finding out more about that or seeing it in action, Chris published some really great notebooks, some of the highest-voted ones on Kaggle, a few months back during this competition. So you can go check those out and see how he trained RAG and how he did inference with RAG. Really good stuff.
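A minimal sketch of the retrieve-then-answer loop Chris describes; the embedding model, the three toy chunks, and the generate_answer call into an LLM are all illustrative assumptions, not the competition's actual setup:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# 1) Embed the document chunks once. In the competition this was millions of
#    Wikipedia passages; three toy chunks here for illustration.
chunks = [
    "The Higgs boson was discovered at CERN in 2012.",
    "Photosynthesis converts light energy into chemical energy.",
    "Quantum entanglement links the states of two particles.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the question's."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q            # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def answer(question: str) -> str:
    # 2) Feed the question plus the retrieved chunks to the LLM.
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate_answer(prompt)     # hypothetical call into your LLM of choice

print(retrieve("Where was the Higgs boson found?"))
```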
Kazuki, Chris talked about a couple of things there: he talked about retrieval, and he talked about LLMs doing some generation. How do you balance those? Is one more important than the other? How do you view the trade-off between retrieval and the LLM?
Let me talk about RAG versus fine-tuning. There are some papers that compare RAG and fine-tuning, and almost all of them show that RAG is better than fine-tuning. This is because fine-tuning is a very difficult method to apply due to catastrophic forgetting: when you want to train in new things, like the latest news, you can do that, but the model often forgets its previous knowledge. More than that, RAG is very cost-effective compared to fine-tuning, because fine-tuning requires a lot of computing resources. But I think it's worth trying fine-tuning when you want specialized understanding, and I think we should find the sweet spot between saving money and meeting requirements.
Yeah, so basically RAG is something that can make an LLM even better than the LLM by itself. And based on what you're saying, it can be cheaper as well, since you don't have to fine-tune models and collect additional data, and it can be more efficient. So that's obviously very powerful. But something we're of course interested in is the applications of that. G, what are you seeing in terms of different applications for RAG right now?
Yeah, I think there are two kinds of interesting applications using RAG. The first is to protect privacy. We all have a lot of private data, either personal or enterprise, which we don't want to share online. What we can do is bring the LLM into a local, controlled environment, deploying an open-source LLM, and create a vector database with an embedding model; essentially a RAG system connecting our private data to this locally deployed LLM. This allows you to talk to your data and leverage the capabilities of the LLM while protecting the privacy of the data. We actually have two demos you can interact with at the demo booth on the second floor. One is Chat with RTX: it's deployed on a Windows laptop, and you can talk to PDF files and other kinds of files using large language models. Another demo is "Talk to Your Data" with NeMo agents: whenever you have a question, there's an agent which can route the question to an unstructured-text agent or to a structured SQL retriever, synthesize the answer, and get back to you. So I think these are quite interesting privacy-protecting demos.
The second kind of application is to enhance recency, for example a news or finance agent, LLM-powered search, or a copilot. It processes real-time streaming data and helps us accomplish tasks like replying to an email, writing a short summary of the conference, or writing code. So those are the interesting applications, I think.
Yeah, the applications are just limitless. So, we've been talking about applications for LLMs and RAG, this common embedding space between vision and LLMs, and some other hot areas. I'm interested, and I know we all are: how can you take these things and actually apply them in the competition space? With these new technologies, it seems like competitions are starting to change a little bit. For example, we're seeing LLM competitions where there's no data provided, or only one data point, and you've got to generate your own data. We're starting to see changes there. G, what other changes are you seeing in the competition space?
Yeah, just like you mentioned, I think a very interesting trend in Kaggle competitions is that there are more and more competitions which don't provide any training data at all, or provide very little training data, not enough to train a powerful predictive model. The challenge is to ask all the participants to come up with novel ideas and solutions to collect and curate their own training data. This is actually a very critical step for any machine learning task, but previously, on Kaggle at least, the training data was fixed, and it was very hard or impossible to expand it. Now we are seeing more and more cases where participants leverage LLMs to generate training data, which actually creates a great competitive edge for winning a competition. I expect this is also very cost-effective compared to manual labeling, so I expect more such competitions, and I think this skill is actually quite useful for tasks outside competitions as well.
Yeah, I totally agree. Another area where we're seeing the application of some of these things that we weren't seeing before in the competition space is recommender systems. Chris, I know you've done a lot of work on recommender systems before. Have you had a chance to use LLMs with recommender problems?
Yeah, we have. As LLMs are being developed, we're actually seeing them improve all other areas of AI. Laura spoke about how they're helping with vision, but another example is recommender systems. Recommender systems are what you see when you go onto an online shopping site and it suggests something you might like, or a streaming video website and it suggests movies. The way recommender systems work is there are users and items, and the system attempts to recommend an item that the user is going to like. Typical ways of solving this are: you could look at the items that a user previously engaged with and then find items that are similar to those items; you could look at a user and find other users that are similar to that user and then see what items they like; or, lastly, you can find patterns between users and the items they engage with.
The way LLMs help is, if you remember, we mentioned how a model like BERT can encode a block of text. Items can be represented by their text description, and we can take that description and encode it into an embedding. An embedding is like a point in space, a little dot. When you encode all the items, you have all these dots, and then we can find which items are similar by just finding which dots are the closest. So it gives us a new way to find similar items. Likewise, we can apply that to users. And lastly, by using these embeddings, these dots, we can actually find patterns between users and items in this embedding space. So using LLMs is really helping us make more accurate recommender systems.
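A minimal sketch of the "encode item descriptions, then find the closest dots" idea Chris outlines; the tiny catalog and the embedding model name are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical item catalog: id -> text description.
catalog = {
    "B001": "black collared shirt, 100% cotton",
    "B002": "black t-shirt, slim fit",
    "B003": "stainless steel kitchen knife",
    "B004": "navy polo shirt, breathable fabric",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
ids = list(catalog)
vecs = model.encode([catalog[i] for i in ids], normalize_embeddings=True)

def recommend(history: list[str], k: int = 2) -> list[str]:
    """Average the embeddings of items the user engaged with, return the nearest items."""
    hist_vec = vecs[[ids.index(i) for i in history]].mean(axis=0)
    scores = vecs @ hist_vec                     # cosine similarity to the history
    ranked = [ids[j] for j in np.argsort(-scores) if ids[j] not in history]
    return ranked[:k]

print(recommend(["B001"]))  # likely other shirts, not the kitchen knife
```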
And I think you were actually able to use this in a recent KDD Cup competition, right? Maybe you could tell us about that.
Yeah, we did. Recently, I teamed up with a bunch of co-workers and we entered the prestigious annual KDD Cup, in 2023. The task was hosted by Amazon, and it was to build three recommender systems. When you visit the Amazon website, it's available in different countries and in different languages. So the tasks were: first, build a recommender system for languages where we have lots of data; second, build a recommender system for underrepresented languages without a lot of data; and lastly, build a recommender system which would recommend products that do not exist yet. So yeah, an interesting challenge.
Our solution used large language models, and specifically, we used embeddings to find similar items. Furthermore, embeddings allowed us to do something else: when we found patterns in the languages which had lots of data, we were able to transfer those patterns, via transfer learning and translation, to the recommender system for the underrepresented languages, because we were working in a shared embedding space, and that gave us a huge edge there. Then, in the third task, where we had to generate potential items that don't even exist yet, we used models like GPT, which would start with an embedding of items that users like and then generate text descriptions of products that don't even exist. So using language models allowed us to combine classical techniques and make very accurate models. And the NVIDIA team actually won first place in every single task. I thought you were getting ready to clap. So we were super excited about that, and it was a great demonstration of the power of LLMs helping out with other forms of AI.
Yeah, that's a great example of how some of these new technologies can be applied not only in the real world, like some of the applications we talked about, but also in competitions. So clearly we're seeing changes in that space. Kazuki, where is this headed? What do you see as the future of competitions? How might they look different in the future?
Yeah, I think LLMs will be an even more powerful tool for human annotators. They can speed up the annotation process by taking over augmentation and suggesting labels; in other words, annotators can focus on more essential tasks, which is exactly what organizers are looking for. So I think, as LLMs improve, machine learning models will become more accurate and robust using high-quality data. I also think it will make computer vision and natural language understanding more reliable.
Yeah, which goes back to what G was talking about, the problem with data, and now we can use LLMs to do more with data, annotation, and generation. So certainly that's a change we should be looking out for. Great. Well, we covered a lot of topics today: some of the latest technologies, how we're using them, how they could be used in competitions. But we'd love to hear from you. Any questions that you have for us about any of these topics, or anything beyond? We'd be happy to take questions.
Is it working? Oh, cool. First of all, thank you for the awesome panel. My question is about the future of machine learning competitions. In the past, if you participated in a machine learning competition, there was a chance you would contribute to state-of-the-art research; AlexNet would be a perfect example. And to do that, the barrier to entry was pretty low. You just needed a computer with a GPU, and you basically had to be smart. That's it.
Now, cutting-edge, state-of-the-art research requires you to train large models, which cost at least a few million dollars and require a cluster of computers. Not everyone in this room has access to those kinds of resources. So, do you think that in the future, machine learning competitions will still provide a venue for discovering cutting-edge breakthroughs and state-of-the-art developments? Or will they become marginalized and mostly serve as a venue for recruitment and a place for people to enjoy their hobbies?
Sure, I'll start with that, and maybe somebody else wants to contribute. There's a self-regulating factor involved, which is the amount of compute you have for entry. You can go off and train these advanced models, but the way competitions work today is mostly through code competitions: you have to commit your code to an inference server that has a limited compute envelope. So what we're seeing is a lot of neat innovation in how you can compress these models, how you can quantize them, how you can get them to run within this limited envelope. And I think that's the factor that levels the playing field a little bit and keeps it from being just about who has the most compute. Because if it were about that, and you just had to submit a static CSV file with your solution, then I think the premise of your question would be exactly right: it would just go to whoever has the most compute. But that's not the case, and we're seeing some really innovative things, even beyond the scope or intent of the actual competition, that go into this efficiency problem. Everybody's trying to take advantage of the latest and greatest state-of-the-art, but how you compress that into a limited compute envelope that everybody has access to becomes almost a challenge in and of itself.
Yeah, I can add to that. I think even now, machine learning competitions can still contribute to state-of-the-art research. Two examples: first, the mixture of experts. If you take a look at the Hugging Face Open LLM Leaderboard, many of the top entries are actually created by combining several language models in an innovative way. It's not as computationally intensive as one assumes; it can be done on a laptop or even on a single GPU. It's possible, and it's like an ensemble of LLMs. A second example is the QLoRA (Quantized Low-Rank Adaptation) approach. You train a very small adapter, and even though the LLM has billions of parameters, the adapter itself is just megabytes in size. In some cases, it can greatly enhance the capability of the LLM in a low-cost way.
Thanks. Thank you.
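A minimal sketch of the QLoRA recipe G mentions (a 4-bit frozen base model plus a small trainable low-rank adapter), with an illustrative base checkpoint and hyperparameters rather than any specific leaderboard entry:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit so it fits on a single GPU.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # illustrative base model
    quantization_config=bnb,
    device_map="auto",
)

# Attach a small low-rank adapter; only these weights (a few megabytes) are trained.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # a tiny fraction of the billions of base weights
```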
We have the next question. Yeah, great talk, by the way. I have a question about the third part of the competition you mentioned, the one you won. I felt like you kind of skipped a step. You're talking about taking the embeddings and then using them to make recommendations on new products. I didn't really understand the jump between the embeddings and the recommendations. Could you expand on that?
Yeah. So let's say a user previously browsed a bunch of black shirts. A good assumption about what they would like in the future is maybe more shirts: they're obviously interested in shirts, and maybe they like the color black. So you basically pick items that are similar to their history of items. The process of embedding involves taking the text description, like "a collared shirt made of this material." You take the text description, and the embedding is essentially a mathematical vector; it's a dot. Then you can take every other item on the website and embed them into dots. In this embedding space, all the dots that are close to the black shirt will most likely be other shirts and things of similar colors, so the dots cluster. That's what we look at: we look at their previous history, which is a bunch of dots, and then we pick recommendations that are close by.
But, I'm sorry, I didn't express myself very well. How do you come up with ideas for new products based on that?
Oh, you mean the third task, the generative AI one? Sorry, yes. Okay. For the generative AI task, once you have embeddings of products, you can take, for example, five of the user's previous products, average the embeddings, and get an average embedding. Then you run a decoder: you put that embedding in, and it attempts to convert that embedding into a product. But since you essentially generated a new embedding, the description it writes is not an existing one.
Okay, I'm sorry, I have so many questions. I apologize for taking up all the time. How do you go from embedding to description when you average the embedding? I'm not really sure about that step.
I see. Basically, you have to fine-tune the model. You need a lot of data where you have embeddings and their corresponding text descriptions. Then you train the model to convert an embedding to text. The model generalizes by being able to take a new embedding it has never seen before and attempt to convert it to text; it will come up with some text that it has not seen before.
So you created a particular model for that? Yeah, correct. There's not a pre-existing Amazon recommender model on Hugging Face.
Got you. So, we have one more question in the middle, and then we'll go to online questions. And if we have time, you can ask the experts as well.
My question is more about representation and generation. Specifically, to Laura: you mentioned CLIP, right? And there's CLIP, CLAP, ImageNet. Do you see these representation models learned separately with some grounding, and then those embeddings are fixed and used in whatever generative model to generate images, like image tokens, or in language models, like text tokens, and so on? Or do you see the future as everything together, where both representation and generation happen in the same model, like Palmyra, where you feed everything in as a token and then generate?
Mhm, yeah, that's actually a great question. Right now, for research, it's much, much easier to treat the problems separately. We usually take pre-trained models, we don't even touch them, they are frozen, and we just try to extract the knowledge from there. This also relates to the first question: it's something you can do with much fewer resources, so I think it makes sense. But there's another reason for doing that, and that is that the training data you use for CLIP is not the same data that you're going to use, for example, to train a stable diffusion model that generates the images. So I think it's much easier if each system is optimized for the task that it has to do, and then you just plug them together.
So I think CLIP is already perfect for its purpose, and then you can just extract the information and do your generation or your perception task separately; you don't need to retrain both models together. That would be a huge overhead.
Okay, so let's go through some of the online questions. The first one is: how can we get the community more involved in AI for open-source technologies? What are the most exciting parts, and how can we offer this to the community even more? G, I know you do a lot of work in the open-source community. Do you want to tackle that one?
Yeah. One thing I can think of is to lower the hardware requirements of LLMs. Actually, one of the open-source projects we're working on, unfortunately, isn't available right now, but it will be soon. We are trying to reproduce the use case Chris just mentioned, the Kaggle LLM Science Exam solved with RAG, and we want to reproduce that solution on a single GPU, specifically something like a T4 with 16 gigabytes of GPU memory, so that it can be run on a single GPU. In the process, we made several improvements, like FP8 quantization of the language model, and we used the IVF-PQ algorithm to create the vector database. Chris mentioned we have 65 million text documents, which translates to something like 110 gigabytes; with IVF-PQ, our vector database is just 6 gigabytes. So we applied these optimizations.
Hopefully we can create a demo that users can experience on an entry-level GPU and reproduce the exact same solution on Kaggle kernels or Google Colab. I think that would make it easier for people to start with large language models. Thank you.
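A minimal sketch of the kind of IVF-PQ index G describes for shrinking the vector database, using FAISS; the corpus size and parameters are illustrative, not the project's actual settings:

```python
import faiss
import numpy as np

d = 384                                              # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")    # stand-in corpus embeddings

# IVF: cluster vectors into coarse buckets so a search only probes a few of them.
# PQ: compress each vector into short codes (here 48 bytes) instead of 4*d bytes.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 48, 8)  # 1024 lists, 48 sub-quantizers, 8 bits each
index.train(xb)
index.add(xb)

index.nprobe = 16                                    # how many buckets to scan per query
distances, neighbors = index.search(xb[:1], 5)       # approximate nearest neighbors
print(neighbors)
```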
So we have one more question from the online audience: what are the most important data science challenges related to LLMs that are still unsolved, and which ones do you think we will be able to solve? There are still problems? I wasn't aware.
Well, I'll share my thoughts, and maybe some of the other panelists have theirs. It goes a little bit to what G was just talking about around accessibility. The models are big, they're heavy, and they take a long time to infer. There have been a lot of innovations over the past six months, and my gosh, they seem to come out every week now, on how we can compress them, make them run faster, make them easier to train, cool training techniques. But we've got to improve the accessibility problem for wider adoption and application. As you can tell from the tenor of the talk, we're really interested in applying these things. So to me, that's the biggest macro challenge, but we're seeing a lot of micro solutions to it, with still a long way to go. Any other thoughts?
Yeah, I'll add. One of the things I'm looking forward to seeing is this: currently, one of the weaknesses of LLMs is mathematical reasoning and logic. They really excel at the humanities and social sciences. There's constant research in this area; I think a new model was released recently which may actually outperform ChatGPT on some mathematical tasks. So I'm looking forward to developments there. I think we have time for questions from the audience.
Yeah, hi. You commented earlier that for a competition, it's very important to come up with a creative way to prepare the data. Could you share some experiences of what worked well and what didn't work well for you?
Yeah, sure. There was a recent competition on LLM essay detection: the task was to detect which essays were written by high school students and which were written by large language models. In this competition, most of the provided training data was real student-generated data; almost no LLM-generated data was provided, only three data points. So participants had to experiment with different flavors, different LLM families like the Palmyra models and other open-source models, to generate essays, and they somehow had to figure out which one had the closest distribution to the test data that Kaggle uses to evaluate. So there was a lot of analysis going on, studying the subtleties of LLM-generated text and trying to figure out, "Oh, maybe I should use model A; maybe that's closest to the test data." I think that was actually a big factor in the final winning solutions.
Yeah, and what I would add to that is, in this case, diversity is king. The more models you can generate with, with different parameter changes and varying temperature, the better. You basically cannot throw enough generated data at the problem, because to some degree you're guessing what the hidden test set, the application set, will look like, and you don't know. And when you don't know, the only way to combat it is to flood the problem with as much diverse data as you possibly can.
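A minimal sketch of the "flood it with diverse generated data" approach David describes: sample essays at several temperatures from an open model (and, in practice, from several different model families); the checkpoint and prompt are illustrative:

```python
from transformers import pipeline

# Any open instruction or completion model can be swapped in here; gpt2 is a placeholder.
gen = pipeline("text-generation", model="gpt2")

prompt = "Write a short essay about the benefits of renewable energy.\n"
synthetic = []
for temperature in (0.7, 1.0, 1.3):                  # vary sampling to diversify style
    for out in gen(prompt, do_sample=True, temperature=temperature,
                   max_new_tokens=200, num_return_sequences=3):
        synthetic.append({"text": out["generated_text"], "label": 1})  # 1 = LLM-written

print(len(synthetic), "generated essays to mix with real student essays (label 0)")
```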
Hi, thank you for the talk, it's been very insightful. I'm really interested in what you were saying about multimodality. From what we've seen today, text seems to be sort of the gold standard: you're either taking an image, creating text from it, and then using that as some sort of embedding, or you're handling the modalities separately. Every time you go from video to image or image to text, you're losing a lot of information. So, is text really the gold standard just because we have it as an interface, people typing on keyboards? Do you see a future in which the standard might be asking a question by submitting a video and getting a better response? Or is it really only going to be text for the foreseeable future?
Oh yeah, maybe I can take that. There's so much that we could discuss here, but there are systems, for example, that work not only on the text but also on a bunch of documents to look into; you can also look at a bunch of retrieved images. So it's not that your system is limited to text; it's just that the first step of interacting with a human is so much easier with text, so that's where you start. For example, we have been working on aligning brain signals with images and with text, and the alignment with images is just much easier. Text doesn't really describe everything that is represented in the brain, maybe because you're actually watching a movie while the brain signal is recorded, so the brain signal is much more correlated with an image. So your systems don't necessarily need to go through text, but the human input is so much easier with text. I think that is here to stay, but it doesn't mean that in the middle we cannot have other types of connections between images and other modalities; it will not necessarily go through text.
And just as a quick follow-up: have you seen an effective way to go from a low-information modality to a high-information modality, such as from text to voice, as opposed to the other way around?
Sorry to interrupt, but I think that's all the time we have. Just a reminder, we have the Meet the Experts session this afternoon as well, so if you have more questions, please feel free to ask the panelists there. Let's thank the panel, and thank you all for coming.
Thank you for joining this session. Please remember to fill out the session survey in the GTC app for a chance to win a $50 gift card. If you are staying in the room for the next session, please remain in your seat and have your badge ready to be scanned by our team.
Info
Channel: NVIDIA Developer
Views: 6,508
Id: k2EcIX0HgzA
Length: 55min 58sec (3358 seconds)
Published: Wed Apr 10 2024