AnyGPT: The Any-to-Any Multimodal LLM - Audio, Text, and Image! (Opensource)

Captions

[Music] We haven't covered a research paper in a while, but the one that we're going to cover in today's video deserves a lot of attention. Introducing AnyGPT: a unified multimodal large language model with discrete sequence modeling. Now, what does this all mean? Well, in simple terms, AnyGPT is a new kind of smart computer program that can understand and work with different types of information, like speech, text, images, as well as music. Now, what's cool about AnyGPT is that it can do this without needing big changes to how it's built or even trained. Instead, it learns to handle different types of data all by itself. Just take a look at this demo video.

"Can you draw me a picture of a sunny beach?" "Give me a similar style of music." "I've been too tired recently. Where do you think is suitable to relax?" "How about a serene lakeside?" "Can you give me some relaxing music?" "Enjoy this peaceful setting." "Can you convert the feeling of this music into an image?" "The music feels energetic and triumphant." "What do you see in this picture?" "Unity and the warmth of human connection shining brightly out here." "Want to feel it too? Generate a music for it." [Music] "Write a spring poem." "Spring's beauty, the grass is green, colors abound, a blooming spring will see. Nature's miracle, a new season of grace. Spring's here, it's spring." "As I spoke, I made him a gracious bow." "Generate an angry, dramatic character line." "I'm not going to take this anymore. I know it's a long-running joke, but please let me live in peace." "In the heart of the ancient forest, where the thick canopy" [Music] [Music]

Sorry for being repetitive, but this month we had insane partnerships with big companies giving out subscriptions to AI tools completely for free. These are tools that will streamline your business's growth and improve your efficiency. Just by being a patron this past month, you were given access to six paid subscriptions completely for free. Not only do you access these subscriptions, but you gain the ability for consulting, networking, collaborating with
the community as well as with myself. You get access to daily AI news, resources, giveaways, and so much more. If you're interested, check out the Patreon link in the description below to gain access to these benefits.

Now that was just beautiful to see, and we can see that it's able to handle different types of information, like speech, text, images, as well as music, all in one place, like a simple multitasking program. It is able to process and understand various sorts of information in a structured way, and this is the great part about it: it breaks the information down into smaller parts, or sequences, through its discrete sequence modeling. This is something that we're going to take a look at as we go further into the video. We're going to take a look at the research paper, showcase some more demos, and a lot more, so with that thought, guys, stay tuned and let's get straight into the video.

Hey, what is up guys, welcome back to another YouTube video at the World of AI. In today's video we're going to be taking a look at AnyGPT. Now, this is a special computer program that's able to understand and work with different kinds of information. This is something that we saw at the start of the video: you can work with speech, text, images, as well as music, all at once and in combination. And the great part is that it's designed to handle this without needing any big changes as to how it's built or even trained. It uses a clever method called discrete representation to process the information in a structured way, and this is what allows it to understand and generate content in a variety of formats.

Now, to train AnyGPT, the researchers who were part of this project made a big dataset with lots of different examples of mixed information, like conversations with speech, text, images, as well as music all together in that dataset. They basically use this dataset to teach the AnyGPT model how it can handle
any mix of information that you give it. You can see this with the example over here, which is an overview of AnyGPT's model architecture. If you zoom in a little bit, you're able to see the different ways it tokenizes all types of data: for example, you're able to tokenize speech, text, images, and music, and these are the tokenizers. The data is then turned into discrete tokens, each of which is like a small piece of information, and when these are sent into AnyGPT, it uses the tokens to understand and create content that includes multiple types of data. It does this with an autoregressive approach, step by step. Now, this model structure is basically just showcasing how it's trained, and the model itself doesn't need to change much once the tokens are received. You can see that the only modality-specific work is the data preparation before and after the model, and this is what keeps the overall process simple and efficient when it outputs these different generations after the input is given.

Now let's take a look at the AnyInstruct dataset, which is the dataset that is used to train AnyGPT. It is created through a two-stage process. In the first stage, it works on topics, scenarios, as well as textual dialogues with multimodal elements, and that's how the text is basically generated. The second stage is focused on converting the text-based conversations into fully multimodal dialogues by incorporating various modalities such as images and audio. This process makes sure that the dataset contains rich multimodal content for training AnyGPT. You can see over here from this first segment that it's working from a topic pool, then constructs different scenarios in a textual manner, and it writes up different chat prompts, which can then be used in the second stage, which focuses on text-based conversations
that are converted into fully multimodal dialogues.

Now that we understand how the dataset was created, as well as getting a bit more information about this architecture, let's take a look at some demonstrations. This is one example where they have speech conversations with the voice being fully cloned. This is the voice of the person that they're trying to clone: "As I spoke, I made him a gracious bow." Based off that voice, the spoken request, which is to write a spring poem, is transcribed, and now it generates the poem and it also generates the voice based off of the voice that you gave it. You can see that this is a prompt that was simply given by voice: "Write a spring poem. Write a spring poem." So this is the voice prompt that was given to AnyGPT along with the voice to clone, and then this is the output that you get: "Spring's beauty, the grass is green, colors abound, a blooming spring will see." Now that's just really great to see. This is something that's super easy to use, and it's something that's going to be coming out fairly soon, so stay tuned. I'm going to be posting more about it on Twitter as well as the private Discord, so if you're interested, take a look at the Patreon link in the description below.

Now, this example is possibly my favorite. This is where a transcription is given: "Can you draw me a picture of a sunny beach?" And this is obviously the transcription of the voice: "Can you draw me a picture of a sunny beach?" And we get this beautiful output, and music that relates to that image. [Music] Now that's just really, really cool to see.

Another example is speech instruction plus music, and this outputs text, images, as well as a speech response. This is where the transcription is saying, "Can you convert the feeling of this music into an image?" So let's lower the volume a little bit. [Music] And this is the output that you get, which really relates to that music: it is quite energetic, and you're achieving
some sort of goal as you hike this mountain. "The music feels energetic and triumphant." That's just really, really cool.

Another example is speech instruction, which is something that we saw at the start of the video. Now, if you scroll down a little bit, there are many other examples, and this is one that we haven't seen, which is text plus image as input, where text plus music is output. In this case, this is the prompt that is given: "Can you translate the emotion in this picture into music?" And this is what is output. [Music] That's just really, really cool to see.

Another example is text plus music as input, sorry about that, with text plus image as output. This is where a prompt is given asking, "What instrument is in this piece of music?" [Music] And we can see that it draws a drum set, and it actually describes it as the bass drum, which is prominently featured, and this often indicates a strong beat in the music.

So there are so many possibilities as to what you can do with this. I truly recommend that you check this blog post out, because it gives you a lot of information as to what you can generate with AnyGPT. As soon as they release any sort of application that hosts this model, I'm definitely going to be making another video on it, because this is something that is really, really useful for a lot of people, and I truly believe this will be something that many people actually end up using due to its multimodal capabilities.

Now, if you are interested in getting started with this, they have actually released the code for it. If you want to take a look at it, it's posted on the GitHub repo, so I'll leave the link in the description below so that you can access it, as well as all the links that I used in today's video. But with that thought, guys, thank you so much for watching. I hope you enjoyed this video and got some sort of value from it. Make sure you check out the Patreon page if you want to access amazing giveaways, subscriptions to really
really amazing tools, as well as a lot of resources and consulting. If you haven't followed us on Twitter, definitely do so, so you can stay up to date with the latest AI news. And if you guys haven't subscribed, please do so: turn on the notification bell, like this video, and check out our previous videos so you can stay up to date with the latest AI news. But with that thought, guys, thank you so much for watching. Have an amazing day, spread positivity, and I'll see you guys fairly shortly. Peace out.
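
The discrete sequence modeling idea described in the captions (each modality is tokenized, and the resulting tokens share one sequence that the language model reads step by step) can be sketched roughly as follows. This is a minimal illustration, not the AnyGPT implementation: the vocabulary sizes, the start/end marker scheme, and the helper names `build_offsets`, `encode`, and `decode` are all assumptions made up for this example, and the integer codes stand in for what real image, speech, and music tokenizers would produce.

```python
# Hypothetical sizes, chosen only for illustration (not taken from AnyGPT).
TEXT_VOCAB = 100
CODEBOOK_SIZES = {"image": 16, "speech": 8, "music": 8}


def build_offsets(text_vocab, codebook_sizes):
    """Give every non-text modality its own id range after the text vocabulary."""
    offsets = {}
    next_id = text_vocab
    for modality, size in codebook_sizes.items():
        # Two marker ids (start, end) followed by `size` code ids.
        offsets[modality] = {
            "start": next_id,
            "end": next_id + 1,
            "base": next_id + 2,
        }
        next_id += 2 + size
    return offsets, next_id  # next_id is the size of the shared vocabulary


def encode(segments, offsets):
    """Flatten (modality, codes) segments into one sequence of shared ids."""
    sequence = []
    for modality, codes in segments:
        if modality == "text":
            sequence.extend(codes)
        else:
            slot = offsets[modality]
            sequence.append(slot["start"])
            sequence.extend(slot["base"] + code for code in codes)
            sequence.append(slot["end"])
    return sequence


def decode(sequence, offsets):
    """Split a shared-id sequence back into (modality, codes) segments."""
    starts = {slot["start"]: name for name, slot in offsets.items()}
    segments = []
    i = 0
    while i < len(sequence):
        token = sequence[i]
        if token in starts:
            name = starts[token]
            slot = offsets[name]
            i += 1
            codes = []
            while sequence[i] != slot["end"]:
                codes.append(sequence[i] - slot["base"])
                i += 1
            i += 1  # skip the end marker
            segments.append((name, codes))
        else:
            # Consecutive text ids are merged into one text segment.
            if segments and segments[-1][0] == "text":
                segments[-1][1].append(token)
            else:
                segments.append(("text", [token]))
            i += 1
    return segments


offsets, vocab_size = build_offsets(TEXT_VOCAB, CODEBOOK_SIZES)
example = [("text", [5, 7]), ("image", [0, 15]), ("music", [3])]
flat = encode(example, offsets)
print(vocab_size, flat, decode(flat, offsets))
```

Because everything ends up as one flat list of integer ids, a standard next-token language model could in principle be trained on such sequences without architectural changes, which is the point the video makes about the modality-specific work living in the data preparation before and after the model.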

Info

Channel: WorldofAI

Views: 8,496

Keywords: artificial intelligence, machine learning, computer vision, sora, openai sora, deeplearning, ai, ai video generator, text to video ai, openai, openai text to video, sora text to video, text to video, text to video model, text-to-video, sora openai, anygpt, multimodal, LLaVA, llava, large language and vision assistant, multimodal model, research project, visual understanding, image retrieval, conversational ai, vicuna, vicuna-13b, chatbot, image chatbot, minigpt, minigpt4, deep learning

Id: OyrAU1NHeA8

Length: 11min 31sec (691 seconds)

Published: Wed Feb 21 2024