Hi everyone, my name is Fay. I'll be your host
today. So, today's session is "Insights from Kaggle Grandmasters and Experts on Competitive
AI and LLM Frontiers". Make sure that you're in the right session. We'll have time for questions
at the end of the session, so you're welcome to submit your questions in the GTC app or ask
questions using one of the aisle microphones. So yeah, without further ado, I'll give
the floor to the speakers. Thank you. Great, thanks Fay. Hi everybody, my name is David
Austin. I'm a Kaggle Grandmaster and I work at NVIDIA. I'm fortunate enough to spend some
of my time at NVIDIA working on AI competitions and learning a lot of new techniques and methods.
We're here to share a lot of that with you today. A big thing that we like to do is apply what
we learn, whether it's competitions or taking something out of research and putting it into the
application domain. So we're going to talk about a lot of different topics around LLMs, vision,
generative AI, competitive AI, but the real slant today is going to be around how do we take
all these cool things that are happening in the world today and apply them to real problems.
We will leave time at the end for questions, so if you came with a question that we didn't
address, please feel free to ask. There's also a "Meet the Expert" session at 2:00 where you can
come talk to us one-on-one. So one way or another, you can get your questions answered at some
point today. But first, I'd like to introduce my panel of colleagues here, and let's go ahead
and start with G. Hello everyone, my name is G Le. I'm a data scientist and software developer
from the Large Language Model technology team. I'm working on code generation and retrieval-augmented
generation. I'm also a Kaggle Grandmaster. I used to work a lot on competitions before
all the LLM stuff. I also work on RAPIDS, which is a GPU-accelerated data science
framework. Nice to meet you all. Great. And Chris? Hi, I'm Chris. I'm a senior
data scientist at NVIDIA. I have a PhD in mathematics with a specialization in computational
science. I love doing data science competitions, and I'm currently a quadruple Kaggle Grandmaster.
Next, we have Laura. Yeah, hello, I'm Laura. I'm a research manager at NVIDIA. Before, I was a
professor at the Technical University of Munich in Germany. My research group is interested in
perception, dynamics, and understanding. So today, I will talk a lot about LLMs and
their interaction with vision systems. And lastly, we have Kazuki. Hi, I'm Kazuki, and
I'm also a Kaggle Grandmaster. I joined this team four years ago, and my expertise
is recommender systems. Thank you. Thanks, Kazuki, and thanks for
coming in from Japan for this talk. So let's go ahead and get started. You know,
probably the hottest topic that we've heard at the conference and that we're seeing evolve
in the competition space is around LLMs, and specifically the large generative models.
So, G, maybe you could start us off talking a little bit about these generative models, how
they work, how they're trained, how we use them. Yeah, of course. So, training a large language
model like GPT is a very computationally intensive task, and it is a multi-stage
process. The first stage is pre-training a foundation language model. So basically, we
collect massive text data from the internet and train the model to imitate human language
and learn how to complete documents. The second step is what we call supervised
fine-tuning. So basically, we want to create a smaller but high-quality dataset, you know,
by human labelers, for specific use cases like chatbots, QA, creative or professional
writing, or coding. So when we have these smaller, high-quality datasets, we apply the same
language-modeling objective to continue training the model. The third step is called RLHF, which stands
for reinforcement learning from human feedback, or DPO, which stands for direct preference
optimization. The goal is basically the same as the second step, but it uses a cheaper and
easier kind of dataset: user feedback in the form of preferences. It's usually a binary signal:
the chatbot generates two answers for the same question, and a human tells us which one is
more helpful, useful, or better. This preference gives us
feedback, and we continue to train the model. Lastly, we could apply a guardrail to the model
to prevent it from generating any toxic or harmful information. So yeah, that's how we train the
chatbot, right?
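(To make G's pipeline concrete, here is a minimal sketch, assuming the Hugging Face transformers library and a small stand-in checkpoint like gpt2: pre-training and supervised fine-tuning reuse the same next-token-prediction loss, and only the data changes, while RLHF or DPO adds a preference-based objective afterward. This is an illustration, not the actual recipe behind any specific model.)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative stand-ins; real pre-training uses far larger models and corpora.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_loss(text: str) -> torch.Tensor:
    """Standard causal language-modeling loss: predict each next token."""
    batch = tok(text, return_tensors="pt")
    return model(**batch, labels=batch["input_ids"]).loss

# Stage 1: pre-training on massive text collected from the internet.
loss = lm_loss("Some document collected from the internet ...")

# Stage 2: supervised fine-tuning on a smaller, human-curated dataset
# (chatbot demonstrations, QA, coding examples, etc.) with the same loss.
loss = lm_loss("User: What is RAG?\nAssistant: Retrieval-augmented generation ...")

# Stage 3: RLHF or DPO then optimizes the model on preference pairs
# ("answer A was rated better than answer B"), e.g. with trl's DPOTrainer,
# and guardrails can be added on top at deployment time.
loss.backward()
```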
Right. There's a lot going on there, a lot we could do with them. We see them
used a lot in competitions today. But it was not long ago that there was another family of
models that was probably the most prevalently used. And I don't know if anybody used them more
than Chris in competitions. And that's really more the BERT style of models, where you know we
need additional context. So Chris, can you talk a little bit about BERT and how that compares
to some of the LLMs that we're using today? Yeah, certainly. So there are a lot of language
models out there, even more so than the chatbots, and it gets really confusing. They basically fall
into three families. There are models like GPT, which stands for Generative Pre-trained
Transformer. There are models like BERT, which stands for Bidirectional Encoder Representations
from Transformers. And there are models that are full architecture Transformers, like T5. The
difference between the groups, the first major difference, is how they're pre-trained. So before
you fine-tune a model on your specific task, it's pre-trained on billions of texts to get a general
understanding of language. BERT is pre-trained by showing it lots of text, and then randomly words
are hidden. And then BERT needs to use the words before and after the hidden words to try to guess
what the hidden word is. This is an autoencoding task. And as such, BERT understands vocabulary,
structure, and semantics very well. Now, GPT-like models, during their pre-training,
they see a lot of text and they need to predict the next word. So as such, they're very
good at flow and what comes next. And then, in addition to the differences in pre-training,
there are also differences in the architecture. So a full Transformer has an encoder and a decoder,
and this group includes models like T5. Now, BERT is just an encoder. So you input text and it goes
through a series of self-attention layers, and out comes a mathematical vector called an embedding,
which represents the text. Now, GPT is just the decoder. So you put in an embedding, and then
after a series of layers, out comes text. So you can see there are lots of different LLMs, lots of
differences between them, and as such, they all excel at different tasks. So there's constantly going to
be the need for different encoder-decoder types of models, just depending on the application. Kazuki,
could you maybe talk a little bit about what are some of those applications when you would use
the encoder versus decoder type of models? Sure. Speaking of BERT, there are some
Kaggle competitions where BERT was used. One competition's goal was to evaluate
student summaries, and another competition's goal was to evaluate the complexity of passages. So
both tasks require evaluating and classifying sentences. I think these are good examples of
use cases for BERT because BERT is very good at classifying text. But GPT is used for generating
sentences, like a chatbot. For me, I'm using GPT for generating simple code. When I say, "Can
you show me an example of the PyTorch DDP?" GPT returns an example. I often hear people say
they don't want to code without GPT. So I think the roles of BERT and GPT are very different.
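(A minimal sketch of that split, assuming the Hugging Face transformers library with small stand-in checkpoints: an encoder like BERT turns text into an embedding you can classify or compare, while a decoder like GPT-2 continues the text.)

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

text = "Can you show me an example of PyTorch DDP?"

# Encoder (BERT-style): text in -> embedding out, good for classification and scoring.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    hidden = bert(**bert_tok(text, return_tensors="pt")).last_hidden_state
embedding = hidden.mean(dim=1)          # one vector representing the sentence

# Decoder (GPT-style): text in -> continuation out, good for generation.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok(text, return_tensors="pt").input_ids
out = gpt.generate(ids, max_new_tokens=40, do_sample=False)
print(gpt_tok.decode(out[0], skip_special_tokens=True))
```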
Yeah, certainly. There are applications for both. The cool thing is, it's not just limited to
the LLM and NLP space. We can actually apply these LLMs in other areas. My background is in vision,
and I'm seeing some really cool stuff happening in the vision space as we're using language
models. Laura, maybe you could talk a little bit about that. What are you seeing in terms of that?
Yeah, definitely. LLMs have had a huge impact in vision, and in particular, in the way that we
interact with our vision systems. Before LLMs made this big splash, we were not even thinking
about interacting with our vision systems using natural language. This was made possible by CLIP,
which was one of the first algorithms that said, "How about we align the text modality with the
image modality?" Chris explained before how to obtain an embedding from text, and now the idea of
CLIP would be to obtain an embedding from an image and put these two together in the same embedding
space. If they represent the same object, for example, if you have the text "dog" and
you have an image of a dog, you want to put these two embeddings closer together. How do you
train such a system? You need a bunch of images with their corresponding captions, captions that
actually explain the content of the image. Then you train this system to align the embeddings.
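(Here is a rough sketch of that alignment at inference time, using the openly released CLIP checkpoint through the transformers library; the image path and candidate captions are placeholders. Training works the same way in reverse: a contrastive loss pulls matching image-caption embeddings together and pushes mismatched pairs apart.)

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                       # placeholder image
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Text and image are embedded into the same space; matching pairs score higher.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # similarity over the captions
print(dict(zip(captions, probs[0].tolist())))
```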
What is cool now is that you can go from one modality to the other, and you can do really nice
things. You can now talk to your vision system using natural language, which has really allowed
us to think bigger in terms of how to apply our vision systems to much more than just categories
like cars and pedestrians. We're now thinking big in terms of natural language and perception.
The perspective has really changed with LLMs. This idea of bringing embeddings from different
modalities into a common embedding space opens up so many possibilities and is
very powerful. What capabilities are you seeing that this is opening up?
For us, we're interested in perception, as I mentioned before. LLMs have allowed us to do
what we now call open-world scene understanding. For example, let's take the task of semantic
segmentation. Before, what we used to do is grab a certain number of classes that we were interested
in. If you're interested in autonomous vehicles, you want to detect and segment pedestrians,
cars, roads, etc. So there was this fixed set of classes, and we were training our
systems to segment based on that. But now, with LLMs, the perspective has changed. Before,
it was unclear how to scale up such a system to handle the infinite number of objects that we
can find in the world. But now, with LLMs, we actually see a path forward. The idea is that you
use prompts, you use natural language to express what you want to find in the image, and the vision
system needs to segment anything that you prompt, like fire hydrants, dogs, cows, whatever, not
just a set of predefined classes. So this is a way of doing open-world semantic segmentation
or scene understanding, which is a completely different game from what we were doing before.
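(One open-source example of this prompt-driven idea is CLIPSeg; the sketch below, assuming the transformers library and a placeholder street-scene image, segments whatever you name in free-form text rather than a fixed class list. It stands in for the general approach Laura describes, not for any specific system her group uses.)

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("street.jpg")                      # placeholder image
prompts = ["a fire hydrant", "a dog", "a cow"]        # arbitrary text, not predefined classes

inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                   # one low-resolution mask per prompt
masks = torch.sigmoid(logits)                         # shape: (num_prompts, H, W)
```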
And of course, LLMs have also changed the way we do generative AI. We now have things like
DALL-E or Midjourney that leverage the alignment capabilities of CLIP that I mentioned before.
For example, DALL-E takes a text embedding and, using a diffusion model, generates an image that
represents what you describe in the text. Yes, you've probably seen those demos where you
can write a description like "a polar bear on a skateboard in Times Square" and get a
nicely generated image of exactly what you described. This opens up endless possibilities
for designers, artists, and the general public to interact with these vision systems
because now everything is through natural language. It opens up tons of possibilities.
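(As a concrete open-source stand-in for the text-to-image workflow Laura describes, here is a minimal sketch with the diffusers library and a Stable Diffusion checkpoint; the model name and prompt are illustrative.)

```python
import torch
from diffusers import StableDiffusionPipeline

# Text prompt -> image, using a CLIP-style text encoder inside the pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a polar bear on a skateboard in Times Square"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("polar_bear.png")
```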
For those of us working on competitions, we're always looking for what's next, what
is the next edge that we can get. And some of the capabilities you're talking about
are really exciting. I mean, what do you think is next? What are the next frontiers that
we're talking about here with vision and LLMs? Well, we've only started exploring text
and images. But there are tons of other modalities. Without going too much further,
we have videos. We have seen things like Sora, for example, that generates videos from text. But
there's still a lot to explore. There's a question of how temporally coherent those videos are and
the captions used to train these models. It's the same idea as with CLIP, where you want to align a
video with a caption explaining its content. But the question is whether this caption only explains
individual objects or also describes motion and actions. So there's a whole new research field
to explore in terms of what kind of captions we use to train these systems and how temporally
coherent our videos will be. There's a lot of work that will appear in this area, I think.
And then there's also the whole 3D world. We have other sensors, like lidar, and we also want to
align geometric features with language and images. So there's really tons to explore in different
modalities. We have been working, for example, on lidar and trying to prompt objects in the
lidar space using geometric and shape features. So, I think it's going to be super exciting
because now we're going to be able to generate, for example, full objects in 3D using text
prompts. There's tons and tons that is going to appear, I think, in the upcoming years. Yeah,
yeah, really exciting stuff. You know, starting to bring it into the competition space a little
bit. You know, it wasn't that long ago where the things that wowed us a little bit were things
like retrievers, where you can just retrieve images or retrieve text and get commonalities.
But now, with generative AI, we've been able to move far beyond that, and actually, we can
combine the two concepts. And so, there's this thing now called RAG that everybody's
talking about: RAG this, RAG that. Chris, why don't you demystify RAG a little bit?
Tell us what RAG is and how it's used. Okay, so RAG is a really cool technique
that extends the capabilities of LLMs, and it stands for Retrieval-Augmented Generation.
So, if you ask a basic chatbot a question, then it's going to answer that question from its
memory, from what it already knows. When you use RAG, you have an LLM and a set of documents.
So, then you ask a question, and the first step is we search all the documents for chunks
of text that relate to the question. And then we give both the question and all those helping
chunks of text to the LLM. It looks at it all, and then it gives an answer. And this happens all
without us even knowing. But as such, the answer comes back, and it's so much more accurate.
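(A toy sketch of that loop: embed the document chunks, retrieve the closest ones for a question, and hand both the question and the retrieved context to the LLM. It assumes a sentence-transformers model for retrieval; the chunks and the final generate call are placeholders for whatever LLM you deploy.)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# 1) Embed the document chunks once, up front.
chunks = [
    "The mitochondria is the powerhouse of the cell.",
    "RAG retrieves supporting text before the LLM answers.",
    # ... in practice, millions of chunks
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 2) At question time, retrieve the closest chunks by cosine similarity.
question = "What does RAG do before generating an answer?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]
top_k = np.argsort(chunk_vecs @ q_vec)[::-1][:2]
context = "\n".join(chunks[i] for i in top_k)

# 3) Give both the context and the question to the LLM.
prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
# answer = my_llm.generate(prompt)   # placeholder: call whichever LLM you deploy
```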
I had a chance to experience this in a recent Kaggle competition called the LLM Science Exam.
We were challenged to build a system that could answer multiple-choice science exam questions.
And we were limited in how big the language model could be, and there were also time and
resource constraints. So, as such, we couldn't submit a model as big as, say, ChatGPT, which may
already have a lot of the knowledge in its memory. We had to submit smaller models. So,
the solutions that won this competition used RAG, and specifically, people were submitting
models and, at the same time, a set of documents. Specifically, they submitted
all six million Wikipedia articles together, and then what their
code would do is, when it was about to answer a science exam question, it would first scan all
six million articles in the blink of an eye and find any texts that relate to the question. Then,
it would feed that helpful information plus the question to the LLM, and it would give back an
answer. I witnessed this firsthand because on my computer, I would just make challenging questions.
I would make a question about quantum physics, about a specific detail or a number, and think,
"No way would it find it." But sure enough, in the blink of an eye, it would come back
with the answer, and it was something like 97-98% correct. So, it's truly incredible
what these RAG systems can do. And the most impressive thing is that all of this is happening
behind the scenes. You're just asking a question, and answers are coming back. It's doing
the retrieval and all that kind of stuff, and it's just all in the blink
of an eye. It's really amazing. For those of you who might be interested in
finding out more about that or seeing this in action, Chris published some really great
notebooks that were some of the highest voted ones in Kaggle a few months back during this
competition. So, you can go and check those out and see how he trained RAG and how he
did inference with RAG. Really good stuff. Kazuki, Chris talked about a couple of
things there. He talked about retrieval, he talked about LLMs doing some generation.
How do you balance those? Is one more important than the other? How do you view
the trade-off between retrieval and the LLMs? Let me talk about this topic for RAG and
fine-tuning. There are some papers that compare RAG and fine-tuning, and almost all of the
papers show that RAG is better than fine-tuning. This is because fine-tuning is a very difficult
method to apply due to catastrophic forgetting. That means when you want to train new things,
like the latest news, of course, you can do that, but the model often forgets all the previous
knowledge. On top of that, RAG is very cost-effective compared to fine-tuning because
fine-tuning requires a lot of computing resources. That said, I think it's worth trying
fine-tuning when you want specialized understanding. And I think
we should find the sweet spot between saving money and meeting requirements. Yeah, so basically,
RAG is something that can make LLMs even better than the LLM itself. And based on what you're
saying, you know, it could be cheaper as well, not having to fine-tune models and get additional
data. And it can be more efficient. So, that's obviously very powerful.
But, you know, something, of course, we're interested in is the applications of
that. So, G, what are you seeing in terms of different applications for RAG right now?
Yeah, so, um, I think there are two kinds of interesting applications using RAG. The first is
to protect privacy. We all have a lot of private data, either personal or enterprise, which we
don't want to share online. What we can do is bring the LLM into a locally controlled environment:
deploy an open-source LLM, create a vector database with an embedding model, and build a
RAG system connecting our private data to this locally deployed
LLM. This allows you to talk to your data, leveraging the capability of the LLM while
protecting the privacy of the data. We actually have two demos you can interact with on the second
floor, at the demo booth. We have Chat with RTX: basically, it's deployed on a Windows
laptop, and you can talk to PDF files and other kinds of files using large language
models. Another demo is "Talk to Your Data with NeMo Agent." So, whenever you have a
question, there's an agent which can route the question to an unstructured text agent or
to a structured SQL retriever and synthesize the answer and get back to you. So, I think these
are quite interesting privacy-protecting demos. The second kind of applications, I think, is to
enhance the recency of the use cases. For example, a news or finance agent, LLM-powered
search, and also copilots. These process real-time streaming data and help
us accomplish tasks like replying to an email, writing a short summary of the
conference, or writing code. So, yeah, I think those are the interesting applications.
Yeah, yeah, yeah, the applications are just limitless. So, you know, we've been talking about
applications for LLMs and RAG, this common embedding space between vision and LLMs, and some other
hot areas. You know, I'm interested, I know we all are, well, how can you take these things and
actually apply them in the competition space? So, you know, with these new technologies, it seems
like competitions are starting to change a little bit. For example, we're seeing LLM competitions
where there's no data provided or only one data point, and you've got to generate your own data.
We're starting to see changes there. G, what other changes are you seeing in the competition space?
Yeah, so, just like you mentioned, I think a very interesting trend in the Kaggle competitions
is that there are more and more competitions which don't provide any training data at all or
provide very little training data, which is not enough to train a powerful predictive model. So,
the challenge here is to ask all the participants to come up with novel ideas and solutions to
collect their own data, curate their own training data. This is actually a very critical step
for any machine learning task. But previously, on Kaggle at least, the training data is fixed,
and it's very hard or impossible to expand the training data. But now, we are seeing more and
more use cases where participants leverage LLMs to generate training data, which actually creates
a great competitive advantage to win a competition. And this is also very cost-effective
compared to manual labeling. So, I expect more such competitions,
and I think this skill is actually quite useful for other tasks outside competitions.
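(A hypothetical sketch of that data-generation step, using the transformers text-generation pipeline with a small placeholder model; in a real competition you would swap in a much stronger instruction-tuned LLM and vary models, prompts, and temperature to get diverse synthetic examples.)

```python
from transformers import pipeline

# Placeholder model; in practice use one or more strong instruction-tuned LLMs.
generator = pipeline("text-generation", model="gpt2")

topics = ["photosynthesis", "plate tectonics", "the French Revolution"]
synthetic_rows = []
for topic in topics:
    prompt = f"Write a short multiple-choice exam question about {topic}:\n"
    for temp in (0.7, 1.0):  # vary temperature for more diverse samples
        out = generator(prompt, max_new_tokens=80, do_sample=True, temperature=temp)
        synthetic_rows.append({"topic": topic, "text": out[0]["generated_text"]})

# synthetic_rows can then be filtered/labeled and used as extra training data.
```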
Yeah, yeah, yeah, I totally agree. You know, another area where we're seeing the application of
some of these things that we weren't seeing before in the competition space is maybe in recommender
systems. And Chris, I know you've done a lot of work in recommender systems before. Have you had
a chance to use LLMs with recommender problems? Yeah, we have. So, as LLMs are being developed,
we're actually seeing them improve all other areas of AI. And Laura had spoken about how
it's helping with vision. But another example is recommender systems, right? So, recommender
systems are when you go onto an online shopping site and it suggests something you might like, or
a streaming video website and it suggests movies. So, the way recommender systems work is there are
users and items, and it attempts to recommend an item that the user is going to like. Typical
ways of solving this are: you could look at the items that a user previously engaged with and
then find items that are similar to those items, or you could look at a user and find other users
that are similar to that user and then see what items they like. Lastly, you can find patterns
between users and the items they engage with. The way LLMs help is, if you remember,
we had mentioned how a model like BERT can encode a block of text. So, items can
be represented by their text description, and we can take that description and encode
it into an embedding. An embedding is like a point in space, a little dot. And when you
encode all the items, you have all these dots, and then we can find which items are similar
by just finding which dots are the closest. So, it now gives us a new way to find similar items.
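(A minimal sketch of that dots-in-space idea, assuming a sentence-transformers model for the item descriptions; the product texts are made up.)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

items = [
    "black collared cotton shirt",
    "black slim-fit t-shirt",
    "stainless steel frying pan",
    "charcoal polo shirt",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(items, normalize_embeddings=True)    # one "dot" per item

# Items the user engaged with become a query point; nearby dots are candidate recs.
history = ["black collared cotton shirt"]
query = model.encode(history, normalize_embeddings=True).mean(axis=0)
scores = vecs @ query                                    # cosine similarity
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.2f}  {items[idx]}")
```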
Likewise, we can apply that to users. And lastly, by using these embeddings, these dots, we can
actually find patterns between users and items in this embedding space. So, using LLMs is really
helping us make more accurate recommender systems. And I think actually, you were able to
use this in a recent KDD Cup competition, right? Maybe you could tell us about that.
Yeah, we did. So, recently, I teamed up with a bunch of co-workers and we entered the
prestigious annual KDD Cup, which was in 2023. The competition was hosted by Amazon, and the
task was to build three recommender systems. So, when you visit the Amazon website in different
countries and in different languages, the tasks were: we had to build a recommender
system for languages where we have lots of data, then we had to build a recommender system for
underrepresented languages with not a lot of data, and lastly, we had to build a recommender system
which would recommend products that do not exist yet. So, yeah, interesting challenge.
Our solution used large language models, and specifically, we used embeddings to find
similar items. And then, furthermore, embeddings allowed us to do something else: when we
found patterns in the languages which had lots of data, we were able, via transfer learning
and translation, to apply those patterns to the recommender system for the underrepresented
languages, because we're working in a shared embedding space. That gave us a huge edge there. And
then, in the third task, where we had to generate potential items that don't even exist yet, we
used models like GPT, which would start with an embedding of items that users like, and then it
would generate text descriptions of products that don't even exist. So, using language models
allowed us to combine classical techniques and make very accurate models. And the NVIDIA
team actually won first place in every single task. I thought you were getting ready
to clap. So, we were super excited about that, and it was a great demonstration of the power
of LLMs helping out with other forms of AI. Yeah, that's a great example of how some of
these new technologies are coming in and can be applied not only in the real world, like
some of the applications we talked about, but also in competitions. So, clearly,
we're seeing changes in that space. So, Kazuki, I mean, where is this headed? What
do you see as the future of competitions? How might they look different in the future?
Yeah, I think LLMs will be an even more powerful tool for human annotators. They can speed up the
annotation process by taking over augmentation and suggesting labels. In other words, they can
focus on more essential tasks, which is exactly what the organizers are looking for. So, I think,
as LLMs improve, the machine learning models will be more accurate and robust using high-quality
data. Also, I think it makes computer vision and natural language understanding more reliable.
Yeah, which goes back to what G was talking about, about the problem with data, and now we can
use LLMs to do more with data and annotation and generation. So, certainly, that should be a change
that we should be looking out for. So, great. Well, we covered a lot of topics today, you know,
some of the latest technologies, how we're using them, how they could be used in competitions.
But we'd love to hear from you. Any questions that you have for us about any of these topics or
anything beyond? We'd be happy to take questions. Is it working? Oh, cool. First of all, thank you
for the awesome panel. The question I have about the future of machine learning competitions is,
in the past, if you participated in a machine learning competition, there was a chance
you would contribute to the state-of-the-art research. AlexNet would be a perfect example.
And to do that, the barrier to entry was pretty low. You just needed a computer with a GPU, and
you basically had to be smart. That's it. Now, cutting-edge research, state-of-the-art research,
requires you to train large models, which cost at least a few million dollars and require a cluster
of computers. Not everyone in this room has access to those kinds of resources. So, do you think that
in the future, machine learning competitions will still provide a venue for discovering cutting-edge
breakthroughs and state-of-the-art developments? Or will they become marginalized and mostly
serve as a venue for recruitment and a place for people to enjoy their hobbies?
Sure. Yeah, I'll start with that, and maybe somebody else wants to contribute.
So, there's a self-regulating factor involved, which is the amount of compute allowed per entry.
You can go off and train these advanced models, but the way competitions are working today is
mostly through code competitions. You have to submit your code to an inference server that has a
limited compute envelope. So, what we're seeing is a lot of neat innovations on how you can compress
these models, how you can quantize them, how you can get them to run within this limited envelope.
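(For example, one common way competitors squeeze a large model into a fixed compute envelope is weight quantization; a rough sketch below with transformers and bitsandbytes, where the model name is only a placeholder.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder; any causal LM

# 4-bit NF4 quantization roughly quarters the memory of fp16 weights, which is
# often the difference between fitting and not fitting on the evaluation GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
```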
And I think that's the factor that normalizes the playing field a little bit and doesn't
make it just about who has the most compute. Because if it was about that and you just had
to submit a static CSV file with your solution, then I think the premise of your question would
be exactly right. It would just go to whoever has the most compute. But that's not the case,
and we're seeing some really innovative things, even beyond the scope or intent of the actual
competition, that go into this efficiency problem. Because everybody's trying to take advantage
of the latest and greatest in state-of-the-art, but how you can compress that into a limited
compute envelope that everybody has access to becomes almost a challenge in and of itself.
Yeah, I can add. So, I think even now, all machine learning competitions can still
contribute to the state-of-the-art research. I think two examples are, first, the mixture
of experts. So, if you take a look at the Hugging Face Open LLM Leaderboard, many of the top
entries are actually created by mixing several language models in an innovative way. It's not
as computationally intensive as one assumes. It can be done on a laptop or even on a single GPU.
It's possible, and it's like an ensemble of LLMs. A second example is the QLoRA (Quantized Low-Rank
Adaptation) approach. You train a very small adapter, even though the LLMs have billions of
parameters. The adapter itself is just megabytes in size. In some cases, it can greatly enhance
the capability of the LLM in a low-cost way. Thanks. Thank you. We have the next
question. Yeah, great talk, by the way. So, I have a question about the third part of the
competition that you guys were mentioning, that you guys won. I felt like you kind
of skipped a step. You're talking about taking the embeddings and then using them to make
recommendations on new products. I didn't really understand the jump between the embeddings and
the recommendations. Could you expand on that? Yeah, so let's say a user previously
browsed a bunch of black shirts. Basically, a good assumption of what they would like
in the future is maybe more shirts. They're obviously interested in shirts, and
maybe they like the color black. So, you basically pick items that are
similar to their history of items. The process of embedding involves taking
the text description, like "a collared shirt made of this material." You take the text
description, and embedding is essentially a mathematical vector. It's a dot. Then, you
can take every other item on the website and embed them into dots. In this embedding space,
all the dots that are close to the black shirt will most likely be other shirts and things of
similar colors. So, all the dots will cluster. That's what we look at. We look at their
previous history, which is a bunch of dots, and then we pick recommendations that are
close by. But, I'm sorry, I didn't express myself very well. So, how do you come up with
new ideas for new products based on that? Oh, you mean the third task, the
generative AI one? Sorry, yes. Oh, okay. Well, for the generative AI task, once
you have an embedding of products, for example, you can take five of their previous products,
average the embeddings, and get an average embedding. Then, you run a decoder. You put that
embedding in, and it will attempt to convert that embedding into a product. But since you
essentially generated a new embedding, the description it writes is not an existing one.
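(A toy sketch of the mechanism Chris describes, not the team's actual KDD Cup solution: average the embeddings of items a user liked, project that vector into a small decoder's input space, and train the decoder with the usual language-modeling loss to write a description conditioned on it. All model names and product texts are placeholders.)

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # 384-dim item embeddings
tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")  # hidden size 768
project = nn.Linear(384, decoder.config.n_embd)         # map item space -> decoder space

history = ["black cotton collared shirt", "black slim-fit t-shirt", "charcoal polo"]
avg = torch.tensor(encoder.encode(history)).mean(dim=0)         # averaged "taste" vector
prefix = project(avg).reshape(1, 1, -1)                         # one soft prefix token

# One training step: teach the decoder to write a real description given the prefix.
target = tok("black oxford shirt in breathable cotton", return_tensors="pt")
tok_emb = decoder.get_input_embeddings()(target.input_ids)
inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
labels = torch.cat([torch.full((1, 1), -100), target.input_ids], dim=1)  # no loss on prefix
loss = decoder(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()  # after training, a new averaged embedding yields a new description
```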
Okay, I'm sorry. I have so many questions. I apologize for taking up all
the time. How do you go from embedding to description when you average the
embedding? I'm not really sure about that step. I see. Basically, you have to fine-tune the model.
You need a lot of data where you have embeddings and their corresponding text descriptions. Then,
you train the model to convert an embedding to text. The model generalizes by being able to
take a new embedding it has never seen before and attempt to convert it to text. It will come
up with some text that it has not seen before. So, you created a particular model for that?
Yeah, correct. There's not a pre-existing Amazon recommender model from Hugging Face.
Got you. So, we have one more question in the middle, and then we'll go to online questions. And
if we have time, you can ask the experts as well. My question is more about representation
and generation. Specifically, to Laura, you mentioned CLIP, right? And there's CLIP, CLAP,
ImageNet. Do you see these representation models learned separately with some grounding, and then
those embeddings are fixed and used in whatever generative model to generate images, like image
tokens, or in language models, like text tokens, etc.? Or do you see the future as everything
together, where both representation and generation happen in the same model, like Palmyra, where you
feed everything in as a token and then generate? Mhm, yeah, that's actually a great question. Um,
so right now, for research, it's much, much easier to treat the problem separately. Right? So, we
usually take pre-trained models. We don't even touch them. They are frozen, and we just try
to extract the knowledge from there, right? And this relates also to the first question. This
is something that you can do with much fewer resources. Um, so I think this makes sense. Uh,
but also, there's another reason for doing that, and that is because the training data that you use
for CLIP is not the same one that you're going to use, for example, to train a stable diffusion
model that generates the images, right? Um, so I think it's much easier if each system is just
optimized for the task that it has to do, and then you just plug them together, right? So, I think
CLIP is already perfect for its purpose, right? And then you can just extract the information
and do your generation, do your perception task separately, and you don't need to retrain both
models together. This would be a huge overhead. Okay, so let's go through some of the
online questions. Um, so the first one is, how can we get the community more involved
in AI for open-source technologies? And what are the most exciting parts? And how can we
offer this to the community even more? So, G, I know you do a lot of work in the open-source
community. Do you want to tackle that one? Yeah, one thing I can think of is
to lower the hardware requirements of LLMs. So, actually, one of the open-source projects we're
working on, unfortunately, it's not available right now, but it will be soon. We are trying
to reproduce the use case Chris just mentioned, the Kaggle LLM Science Exam, using
RAG. And we want to reproduce that solution on a single GPU, specifically a T4 with 16
gigabytes of GPU memory, so that it can be run on a single GPU. So,
in the process, we made several improvements, like FP8 quantization of
the language model, and we use the IVF-PQ algorithm to create the vector database.
As Chris mentioned, we have 65 million text documents, and that translates to
something like 110 gigabytes. With IVF-PQ, our vector database is just 6 gigabytes.
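(For reference, this is roughly what building an IVF-PQ index looks like with the open-source FAISS library; RAPIDS has GPU-accelerated equivalents, and the dimensions and parameters below are illustrative rather than the ones used in the project G describes.)

```python
import faiss
import numpy as np

d = 384                                              # embedding dimension (illustrative)
xb = np.random.rand(100_000, d).astype("float32")    # stand-in document embeddings
xq = np.random.rand(5, d).astype("float32")          # stand-in query embeddings

# IVF-PQ: cluster vectors into nlist buckets, then product-quantize each vector
# into m sub-codes of nbits each, shrinking the index far below raw fp32 size.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # nlist=1024, m=64, nbits=8
index.train(xb)
index.add(xb)

index.nprobe = 16                                    # clusters to scan per query
distances, ids = index.search(xq, 5)                 # top-5 nearest chunks per query
```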
So, yeah, we apply these optimizations. Hopefully, we can create a demo that users
could experience on an entry-level GPU and reproduce the exact same solution on, you
know, Kaggle kernels or on Google Colab. So, I think that would make it easier for people
to start with large language models. Thank you. Um, so we have one more question from the
online audience. Um, what are the most important data science challenges related to
LMs that are still unsolved, and which ones do you think we will be able to solve? There
are still problems with, I wasn't aware. Well, yeah, so I'll share my thoughts, and maybe
some of the other panelists have theirs. Um, and it goes a little bit to what G was just
talking about around accessibility. I mean, the models are big, they're heavy, they take a
long time to infer. Um, and there have been a lot of innovations over the past six months, and
my gosh, they're coming out every week it seems like now, on how we can compress them, make
them run faster, make them easier to train, cool training techniques. But we've got to
improve the accessibility problem for wider adoption and application. And as you can tell from
the tenor of the talk, we're really interested in application and applying these things. So,
to me, that's the biggest macro challenge, but we're seeing a lot of micro solutions to that,
but still a long way to go. Any other thoughts? Yeah, I'll add. So, one of the things I'm
looking forward to seeing is, currently, one of the weaknesses of LLMs is mathematical
reasoning and logic. They really excel at the humanities and social sciences. So, I'm looking
forward to, and they're constantly doing research in this area. I think a new model was released
recently which actually maybe outperforms ChatGPT on some mathematical tasks. So, I'm looking
forward to development there. Um, I think we have time for questions from the audience.
Yeah, hi. You commented earlier that for the competition, it's very important to come up with a
creative way to prepare the data. Could you share some experiences on what worked well so far and
what didn't work well for you from experience? Yeah, sure. So, I think there was a recent
competition on LLM essay detection. Basically, the task is to detect which essays are written
by students in high school and which essays are written by large language models.
In this competition, most of the training data provided is real student-generated data.
No LLM-generated data is provided, only three data points. So participants have to experiment with
different flavors, different LLM families like the Palmyra family and other open-source models,
to generate essays. And they somehow have to figure out which one has the closest distribution to the
test data. So there's a lot of analysis going on, studying the subtleties of the LLM-generated text
and trying to figure out, "Oh, maybe model A is what generated the test data," using the Kaggle
leaderboard to evaluate. I think that's actually a big factor in the final winning solution.
Yeah, and what I would add to that is, in this case, diversity is king. The more models
you can generate from, with different parameter settings and varying temperature, the better.
Basically, you cannot throw enough generated data at the problem because, to some degree,
you're guessing what the hidden test set or the application set would look like, and you
don't know. And so, when you don't know, the only way to combat it is to flood it with
as much diverse data as you possibly can. Hi, thank you for the talk. It's been very
insightful. I am really interested in what you guys were speaking about in terms of
multimodality. From what we've seen today, text seems to be sort of the gold standard, where
you're either taking an image and creating text from that and then using that as some sort
of embedding, or you're doing it separately. Every time you go from video to image or image
to text, you're losing a lot of information. Now, is text really the gold standard
because we have that as an interface, people typing on keyboards? And do you guys
see a future in which the standard might be asking a question by submitting a video and
getting a better response? Or is it really only going to be text for the foreseeable future?
Oh yeah, maybe I can take that. Um, so I mean, I don't know, there's so much that we could
discuss here. But I think there are systems, right, that, for example, you can imagine that
work on getting not only the text but also a bunch of documents to look into. You can also
look into a bunch of images that are retrieved, for example. So it's not that your system is just
limited to text. It's just that the first step of interacting with a human is so much easier
with text that that's what you start with, right? But for example, we have been working
on aligning brain signals with images and with text, and the alignment with images is just
much easier. So text doesn't really describe everything that is represented in the brain, maybe
because you're actually looking at a movie. You're recording the brain signal, and so the brain
signal is just much more correlated with an image. So I think your systems don't necessarily
need to go through text, but it's the human input that is so much easier with text. I think that
is kind of here to stay, but it doesn't mean that in the middle we cannot have other types of
connections between images and other modalities. It will not necessarily go through text.
And just as a quick follow-up, is there a way that you guys have seen effective to
go from a low-information environment into a high-information modality, such as from text
to voice, as opposed to the other way around? Sorry to interrupt, I think that's all the
time we have. And just a reminder, we have the opportunity to meet the experts in this afternoon
session as well. So if you have more questions, please feel free to ask the panelists. So let's
thank the panel, and thank you all for coming. Thank you for joining this session. Please
remember to fill out the session survey in the GTC app for a chance to win a $50 gift card. If
you are staying in the room for the next session, please remain in your seat and have your
badge ready to be scanned by our team.