$0 Embeddings (OpenAI vs. free & open source)

Captions
What is the cheapest way to generate embeddings — and, for that matter, the best way? With OpenAI blowing up lately, many of us are using OpenAI's embedding models for our projects, and for good reason: their latest model, text-embedding-ada-002, performs quite well and it's dirt cheap — $0.0001 per 1,000 tokens as of June 13th, 2023. But are they really the best? There must be other models, ideally open source, right? What if we wanted to self-host the embedding model? You can't really do that with OpenAI. Or what if we don't want to depend on an external API like OpenAI — maybe we don't want vendor lock-in, maybe we need to work completely offline? That's a totally valid use case. Or maybe you're just looking for whichever embedding model actually performs the best. Yes, these models do exist, thankfully, and that's what this video is all about — in case you didn't know, they existed well before OpenAI made embeddings popular. So today we're going to cover a quick background on embeddings; we'll talk about a popular set of open source embedding models you can self-host and even run directly in the browser; we'll go over how to actually use them; and we'll find out there's a whole other world here when it comes to embeddings. Each embedding model has benefits over other models for different use cases: input size limits, output dimension size, and which tasks the model was designed for — and that's right, embeddings can be used for many different purposes. A lot of people today use them for search, but there's also clustering, classification, re-ranking, retrieval. We'll talk about this a little to understand when you might want one model over another, because as you go down the rabbit hole of all these embedding models it can get a bit overwhelming trying to understand when to use each one. We'll also talk about how good these models actually are and how we even rank them, especially compared to OpenAI. And finally, we'll talk about what to expect in the future when it comes to embeddings — there's some really exciting stuff on the horizon, so let's stay ahead of it. OK, here is a blank TypeScript Node.js project. We're using TypeScript for this video for a couple of reasons. Number one, many of you who follow this channel are JavaScript/TypeScript devs and would like to see all this new AI stuff from the TypeScript perspective. Number two, later in this video I'll show you how to do the same thing directly in the browser, which is kind of incredible, and of course that needs to be JavaScript. And number three, I think most everyone knows Python is the dominant ML/AI language — and that's amazing, I love the Python ecosystem — but personally I like to see these concepts and tools brought to other languages like TypeScript. If you're a Python dev watching this, feel free to still follow along if you're interested in learning more about embeddings: everything I'm about to show absolutely translates to Python. In fact it's probably even easier in Python, as you'll find out shortly — all the docs I'll be going through are Python by default.
I've already scaffolded this repository — super basic, nothing fancy at all: just your classic tsconfig, kept as bare-bones as possible; here's our package.json with a couple of scripts to do some TypeScript compiling; and finally we have an index.ts file that acts as our entry point. By the way, I'll upload this code to GitHub as always, so you can check that out in the video description and follow along if you like. If you've gotten this far and you don't know what an embedding is, let me give you a quick refresher. Embeddings give you a way to relate content together. For text embeddings, for example, you can take two different paragraphs and figure out how similar they are to each other — and by similar I don't mean keyword similarity (also known as lexical similarity), I mean how similar their true underlying meanings are, which is something you can only really get with a neural network model. To me, the best way to think about embeddings is on a chart: text with similar meaning gets plotted close together, text that's dissimilar ends up far apart, and the coordinates of those points are the embedding. On a 2D chart, the X and Y values would represent the embedding — just two numbers — though in reality you need way more than two dimensions; most models produce hundreds or thousands of dimensions. If you want a deeper dive on embeddings and how you might use them in a real application, like a custom ChatGPT-style search for your knowledge base, feel free to watch my video on how I built ClippyGPT for the Supabase documentation. With that out of the way, let's address another important point: embeddings aren't just for text. They exist for many different data types, like images or even audio. Just like text embeddings let you figure out how similar two pieces of text are, image embeddings, as you'd expect, let you figure out how similar two images are to each other — all you need is a model that can take an image and generate an embedding from it, and you get the same functionality. If you've ever wondered how Google's reverse image search works, where you give it an image instead of text and it comes back with a bunch of similar images along with their descriptions — that's image embeddings. OK, so once again we're building a Node.js app right now. Why Node.js? For many of you, whether you're already doing this or planning to, quite likely you'll want to do this from the back end: when you're generating embeddings, you're probably going to store them in a database somewhere — a database like Postgres can do it with the pgvector extension, as I've covered in another video; you can also use Pinecone, and there are a ton of other specialized vector databases out there — and to connect to these, you'll be connecting from the back end, so I want to start there. We will also check out a browser version, and you can decide for yourself whether that's something you'd run in a real application; I can definitely see some use cases there too. So we're in our index.ts. Just to prove this is working, let's do a quick console.log. "Hello world" is overrated, so let's do a modern, emoji-based hello world — a little wave — run npm run dev, and there we go, there's our wave. Perfect.
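For anyone following along, here's roughly what the entry point looks like at this stage — a minimal sketch; the exact npm scripts are my assumption, and the real setup is in the GitHub repo linked in the description:

```ts
// index.ts — entry point of the scaffolded TypeScript Node.js project.
// Assumes package.json has something like: "dev": "tsc && node dist/index.js"
// (the actual compile/run scripts may differ — check the repo).

// A modern, emoji-based hello world: just proves the toolchain runs.
console.log('👋');
```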
So where do we start when it comes to embeddings? There are two main places we're going to hang out today. The first is sbert.net — if you search for pretty much anything to do with the open source embeddings I'm about to go through, sbert.net comes up, because it's one of the main sources of truth around this idea of sentence transformers, or sentence embeddings. The site was originally created by Nils Reimers and Iryna Gurevych, who wrote the original paper on Sentence-BERT, which is kind of where this all started. As we'll find out throughout this video, one of the first sentence embeddings was proposed in that paper using the BERT language model, and Sentence-BERT (SBERT) is a modification of that BERT network to output a single embedding that represents an entire sentence — which can be used for exactly what we all use embeddings for: similarity. We'll dive more into that as we go. Now, you might notice right off the bat that this website describes sentence-transformers as a Python framework, so you might be saying, "Greg, what the heck, you told me we're working in TypeScript." Yeah, we are — don't worry. Like I said, most of the industry uses Python for this, so if you are using Python, feel free to follow this as-is: they provide the sentence-transformers framework, which makes it super simple to load these different models and generate embeddings with them. We're going to do something very similar, except in TypeScript. The second site we need to be familiar with is Hugging Face. I'll go about this assuming Hugging Face is somewhat new to you, just to cover all our bases. So what is Hugging Face? Essentially, this is a company that took the hugging face emoji and made a company around it — no, but seriously, you can think of them as the hub for machine learning models, datasets, and tooling. In the same way you might use GitHub to store your source code, quite likely you'd consider using Hugging Face to store your machine learning models and the datasets that contributed to training them. They also have this thing called Spaces, which is kind of a demo playground environment that connects to these models and lets you test them right there in the browser. Those are their three main offerings. So let's go to Models. Models is literally a list of neural network models you can use for various purposes. Right now we're filtering by task, and these show the different types of tasks you might need a model for. Some of these might stand out to you: text-to-image is very popular these days — I'm sure you've heard of Stable Diffusion, or companies like Midjourney; they're using text-to-image models where you give it a prompt and it tries to produce an image. You can also go the other way around: you've got an image and you want to describe it — what is this image all about, what's contained in it? That's image-to-text, and so on. There are lots of really cool things in the computer vision space, like models where you give them an image and they try to predict the physical depth within that image, which can be very useful.
We're not going to have time to go through all of these, but definitely check them out — it's quite amazing what's possible these days, and quite inspiring too. So why do we care about Hugging Face in this video? As you might guess, we're going to be pulling our models from Hugging Face in order to run them, and we'll talk about a couple of different ways to do that. If you're wondering whether we're actually downloading these models to disk and running them there, or running them on a server somewhere — great questions, we'll cover that. OK, back to SBERT. It's worth pointing out that even though the site is called sbert.net, since that original paper there have been a ton of new embedding models, which are mentioned right on the website. My guess is the original authors created the site and kept expanding it with these new models, and honestly, nowadays I don't think we'll even be using the actual SBERT model — but the site contains a ton of really useful information on sentence transformers in general. I'm sure you're dying to see what these models are and how to use them, so let's jump into Pretrained Models. This page goes over the different models we're able to use and which ones are best for different purposes. As you can see, they're hosted on Hugging Face, which again is essentially the de facto place everyone hosts their models. Coming down here, first we have the "all" models, and these are categorized by purpose: the all models are general purpose, so if you don't need anything specialized and you want to use your embeddings for multiple purposes (which we'll get into), you might choose an all-based model. They've actually run some benchmarks on these: how well each model performs on its task, how fast it is, and how big the model is. All of these factors are quite important — especially for the purposes of this video, since quite likely you want to deploy these things and use them in a real application, beyond the theory — so immediately we have to consider how fast these models are and how much disk space they take up when deciding which one to use. They have a little commentary here: the all-mpnet-base-v2 model has the best quality according to their benchmarks, while all-MiniLM-L6-v2 is five times faster and still offers pretty good quality — and, as you can see, it's also quite a bit smaller, roughly five times smaller. As you'll find out shortly, this all-MiniLM model is very popular right now, and for pretty good reason. If we keep scrolling down, we get into models that are specialized for certain tasks — we'll have a better discussion of what these tasks are in a second, but bear with me. There's a specific task called search: given a query, how well does it match a document? On the surface that might seem like exactly what all embeddings do, but no, this is specific to search, also known as a retrieval-type task.
Here you're trying to match queries, which are typically very small — "How big is London?" might be less than 20 characters — with potentially very long documents, or sections within a document. That's not default behavior: you'd need to train the model very specifically to perform well at matching these small queries with long documents; otherwise, by default, it might only match pieces of text of a similar length, which is known as semantic textual similarity, or STS. And you'll see that the way they make these models more specialized is, of course, by training them on different datasets — focusing on very specific datasets to support those tasks. In this case, these multi-QA models for semantic search are trained on 250 million question-answer pairs from various sources, including Stack Exchange, Yahoo Answers, Google and Bing search queries, and more — and as you'd expect, those sources mostly have that nice short-query-to-long-answer pairing. So that's one set. There's another one, MS MARCO, which has 500,000 real queries from Bing search, so it also targets that search use case; there's a set based on the BERT architecture, and down here one using MiniLM as we saw before, but this one is focused purely on search, versus the "all" one we were looking at earlier. Another thing to mention as you go through these: different models produce embeddings of different formats. In the case of these models, you can see you need to use dot product with them. If you're familiar with the different ways to calculate similarity, you'll know there's dot product, cosine similarity, and Euclidean distance — three different ways to calculate that similarity. Any vector database will support these operations, but not all models support all of them: essentially, to support all of them, the output of the model needs to be normalized, meaning it's a unit vector with a length of one. I'm not going to go deep into linear algebra theory right now, but be aware of this aspect of the output vectors. Moving down, we also have multilingual models, which target another task called bitext mining. Bitext mining describes the process of finding translated sentence pairs. Maybe you haven't thought about this one, but it's another perfect use case for embeddings: if I have a sentence in English and another sentence in Portuguese and they supposedly mean the same thing, perhaps it's valuable, depending on whatever application I'm building, to understand that similarity and know when two sentences from two different languages are similar. There are models trained for that exact purpose, because they're trained on datasets that pair two different languages — so there are models specialized for that as well. And finally we have image-and-text models, also known as multimodal models. I'll talk more about these near the end of the video, but they're super cool. Remember how I said earlier that embeddings don't just apply to text — they can also apply to images, telling you how similar two images are to each other?
With a multimodal model, we can take an image or a piece of text and generate an embedding within the same vector space. That means I could take a piece of text and generate an embedding, take an image and generate an embedding, and if I compare those two embeddings I'm actually comparing how similar that text is to that image — which is pretty incredible. This opens up tons of possibilities, like image captioning and what they call zero-shot image classification. What does that mean? Whenever you see "zero-shot", it's talking about the ability to classify — in this case an image — without the model having ever seen it before. Traditionally, if I want my model to be fed an image and classify it — give it a hamburger and it knows it's a hamburger, give it a piece of pizza and it knows it's pizza — I would have needed to train the model with those specific classifiers: pizza, hamburger, and so on, with real pictures of each. Zero-shot means we never came up with those labels specifically during training — it's a more general model — but it's still able to classify, and that's thanks to these image-and-text models. At this point you might be thinking: "OK, I'm somewhat overwhelmed, because you just showed me a wall of models and explained their different purposes, but I still don't know which one to use. Say I go with one of the all models because they're general purpose — how am I really going to pick? You said all-MiniLM is good, so maybe I'll just go with that." Now is a good time to show you the same information from a different perspective. If we head back over to Hugging Face, they've done a lot of really good work in the embedding space with a project called MTEB — the Massive Text Embedding Benchmark. As they say in their blog post, MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks. The entire purpose of this project, which they've written a paper on, is to evaluate embedding models, and that's amazing for people like you and me who aren't interested in going through each model one by one, trying to understand its intended purpose and measuring its performance against some standardized dataset — Hugging Face has already done this for us. Of course, the most interesting part is their leaderboard, so let's open that up. If you don't remember anything else from this video, the MTEB leaderboard is maybe the one thing you should remember: it's a really great reference for different embedding models, and I believe they're continuing to update it, so over time, as new models inevitably come out, they'll score them and place them on the leaderboard. You might notice right off the bat that OpenAI's text-embedding-ada-002 is right in here — so this leaderboard isn't just the open source models I was showing you earlier; it includes essentially all the models the public is aware of, including closed source ones like OpenAI's. That's super handy, because now you can see there are actually five models that potentially perform better than OpenAI's. There are a few things to notice in this table.
First, we're tracking the embedding dimensions: dimensions is the output size of the vector. Those of you with OpenAI experience might find this familiar — text-embedding-ada-002 has 1,536 dimensions — and you'll notice that compared to some of these other models, that's on the high side; the others sit more around 1,024 or 768, and even down to 384 for some of the smaller ones. The next thing to notice is sequence length, which is the number of input tokens. Again, coming at it from the OpenAI perspective, since so many of you have that background: hopefully you've been able to wrap your head around this idea of tokens — they're not one-to-one with a character, but they're also not one-to-one with a word; a token is usually part of a word. If you had a sentence and fed it into any of these models, the model has no understanding of symbols and characters on its own; it understands tokens. Probably one of the best ways to visualize this is OpenAI's tokenizer. If you type in a piece of text — say, "I like ham sandwiches" — you can see that OpenAI's tokenizer has broken the words down into tokens, which, funny enough, here is basically one token per word, so it kind of sounds like I'm lying right now; I'll show you another example. But if we look at the token IDs that map to each of these — "I" maps to 40, "like" maps to 588, and so on — these token IDs are what the embedding models actually understand; we need to tokenize the text. The point I was trying to make is that you're not always going to get that one-to-one mapping, which I got wrong here. Sometimes you'll have a less common word — "I like grilled cheese and bacon"... oh, I'm struggling right now... OK, "Where is Waldo" — close enough. As you can see, "Waldo" is actually broken into two tokens. How did it decide to do that? There are different tokenizers out there, and the idea behind a lot of them is that you take the most commonly seen character sequences and generate a single token for each, because in the world of language that sequence of characters comes up quite often. According to this, "Waldo" on its own isn't seen as often as "Wald" combined with something else, so the algorithm decided it should be broken there. And you can see that if I add two question marks, it considers those two question marks one token: instead of the ID for a single question mark, which was 30 — instead of two 30s in a row — adding another question mark produces a whole new token for the pair. That kind of proves that tokens have nothing to do with specific characters; it's all about a sequence of characters and how often it's seen. OpenAI's model must have seen double question marks quite a bit — if it was trained on the internet, I would expect that. Let's add three: yep, another token. Four: another token. Five: yet another token. Six: OK, as soon as I get to six it's broken up into a four and a two. There's a very specific algorithm behind the scenes that does this. Without going too far down the rabbit hole (I always say this, followed by going down the rabbit hole), the algorithm OpenAI uses is called byte pair encoding, which you might have seen abbreviated as BPE, or digram coding; essentially it uses a lookup table of learned merges to do that mapping.
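To make that idea concrete, here's a toy sketch of how a BPE-style tokenizer applies learned merges — this is purely illustrative and not OpenAI's actual implementation; the merge table below is invented for the example, whereas real tokenizers learn tens of thousands of merges from their training corpus:

```ts
// Toy illustration: start from individual characters, then greedily apply
// merges in order, so frequent sequences like "??" collapse into one token.
const merges = ['?|?', '??|?', 'W|a', 'Wa|l', 'Wal|d'];

function toyBpe(text: string): string[] {
  let tokens = [...text]; // begin with single characters
  for (const merge of merges) {
    const [left, right] = merge.split('|');
    const next: string[] = [];
    for (let i = 0; i < tokens.length; i++) {
      if (tokens[i] === left && tokens[i + 1] === right) {
        next.push(left + right); // frequent pair becomes a single token
        i++; // skip the merged neighbour
      } else {
        next.push(tokens[i]);
      }
    }
    tokens = next;
  }
  return tokens;
}

console.log(toyBpe('Waldo??')); // [ 'Wald', 'o', '??' ]
```

In a real tokenizer each resulting token then maps to an integer ID via a vocabulary lookup, and those IDs are what the model consumes.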
This video isn't going to be about byte pair encoding, but there are lots of great videos that describe it, so feel free to look those up. It's worth noting that BPE isn't the only type of tokenizer: there's another one called WordPiece tokenization, and essentially different models use different tokenizers. As you can see here, WordPiece was the tokenizing algorithm created by Google for the original BERT model, and it's used by BERT, a bunch of its variants, and MPNet. So, back to SBERT: all-mpnet-base-v2, or the mini version, would use that WordPiece tokenizer. It's very similar to BPE in terms of how it's trained, but the actual tokenization is done differently; if you're interested in exactly what those differences are and how the algorithm works, definitely check out Hugging Face's guide on WordPiece tokenization. For our purposes, that's the gist of what I wanted to cover. So, back to the leaderboard: those tokenizers matter when it comes to the input sequence length, and one thing you might notice is that all of these models are around 512, except for text-embedding-ada-002 with its massive 8,191. At first that might make it seem like everyone else is far behind, but in reality, in many embedding-generation use cases, there's a good chance you'll want to keep your input sequence relatively short anyway. I've talked about this in another video, but say you have a massive document — an article, a blog post, a piece of documentation, a PDF — and you want to generate embeddings on it to do some search later on. It's not necessarily best practice to generate embeddings on that entire document as a whole (even though, with a model like this, you might be able to); usually it's better to split the document into smaller pieces before generating embeddings. The reasons: number one, if a subsection within the document matches, you can link the user directly to that subsection rather than just the page as a whole; number two, depending on your use case, maybe you don't care about the whole document and only want to extract parts of it — for example, if you're building a ChatGPT-style bot and you want to provide it context, it's typically better to work with subsections; and number three, I'm not sure there's much data on whether these massive-input models like text-embedding-ada-002 can accurately represent all of that information in a single embedding — there's some debate over whether, even though the model can take a massive input, it's really creating a meaningful representation of a document once you get past, say, 512 tokens anyway. So I would say: don't let that number stop you from considering models other than text-embedding-ada-002.
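As a rough illustration of that chunking idea — this is my own naive sketch, not the approach from the video; real projects often split on headings or count tokens with a proper tokenizer rather than the characters-per-token approximation used here:

```ts
// Naive document chunking: split on blank lines (paragraphs) and pack
// paragraphs into chunks that stay under a rough token budget.
// Approximation: ~4 characters per token for typical English text.
const APPROX_CHARS_PER_TOKEN = 4;

function chunkDocument(doc: string, maxTokens = 512): string[] {
  const maxChars = maxTokens * APPROX_CHARS_PER_TOKEN;
  const paragraphs = doc.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length > maxChars && current) {
      chunks.push(current); // current chunk is full — start a new one
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Each chunk then gets its own embedding, so a search hit can link straight
// to the matching subsection instead of the whole document.
```

Splitting on paragraph boundaries keeps each chunk coherent, which tends to produce more meaningful embeddings than cutting mid-sentence.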
OK, moving on: what are the rest of these columns about? They're the different tasks. We've already talked about some of them — search, which Hugging Face calls retrieval, and semantic textual similarity, or STS — but there are others. We've got clustering: when would clustering be useful? What about related information — if you're already in a blog post and at the bottom you want other posts related to this one, generated automatically, you could generate embeddings on the post title, or a bit of the content, or a summary of the content, and use those embeddings to find other posts with similar content. Again, I find the best way to think about this is on a chart: if a bunch of embeddings are plotted in a cluster together, they probably have similar meaning, and you could pull in any of the nearby ones as related content. There's another one here, summarization: this measures a model's ability to match a long document with a summary of that document. So if for some reason you had both a document and a summary of it and you want to know whether they're related, this is the metric they use to score that scenario. I'm not going to go into detail on every single one, but I will show you how to find more details, because if you're like me, it bothered me that I couldn't figure that out. Up at the top they break the leaderboard down by task — bitext mining (the language-pairing one), retrieval, STS, the same tasks we've been discussing. I don't believe the models are actually re-ranked within those tabs, so don't get confused if all-MiniLM-L6 appears as number four — I don't think that's accurate — and I couldn't find anywhere on the page that explains what each of these tasks really means. Thankfully, if you go back to their blog post, they link the original paper they wrote, the Massive Text Embedding Benchmark, and if we open that up and scroll down, there's a nice section on the tasks and evaluation. They go through each one: bitext mining, classification, clustering, pair classification, re-ranking, retrieval, STS, summarization. So if you want to know what each of these really means, go check that out — actually, I'd recommend it, because say you're most interested in retrieval because you're building search: what you might do is sort the table by the models that scored best for retrieval specifically. Come back over here — unfortunately the rank column doesn't change, it's still based on the overall ranking, so you have to count manually — and now you can see, interestingly (one, two, three, four, five, six), that text-embedding-ada-002 is still sixth, but maybe you had your eye on this e5-small-v2, which only has 384 dimensions (we'll talk about why that's actually a good thing in my opinion in a second): it's now seventh, whereas if we go back and sort by the average across all tasks, which is what they consider the overall rank, e5-small-v2 is eighth. So it's definitely worth understanding exactly what your use case is, coming in here, sorting by that use case, and evaluating which model to use based on that. "OK Greg, you've talked a lot, but can we go ahead and build something already?" I mean, that is what you signed up for when you watch a video from this channel — so yes, let's build something. I mentioned earlier there are potentially two different ways we can generate embeddings. One is using an API, which means we connect to some server somewhere that performs the actual embedding generation and returns the result to us. Or, number two, we can run the model locally on our own machines and generate the embeddings that way.
Let's start with the first, API-based approach. Naturally, Hugging Face, being the hub for all these models, is the one providing these APIs. In the documentation we come down to the Hugging Face JS docs, where they've built a JavaScript library we can use directly: it essentially connects to their hosted API, runs a specific model with whatever inputs and parameters we want, and returns a response. This approach will feel very similar to OpenAI's, because it's essentially the same thing, just more generic. OpenAI is of course limited to the models they support — GPT-3, 3.5, 4, embeddings, and DALL-E, their image generation service in the vein of Stable Diffusion — whereas with Hugging Face you can run essentially any model they support as an operation. They've wrapped all the common tasks — remember the tasks we were looking at earlier under Models? Each of those has a function — so things like translation between languages, or text-to-image: using this, you can actually run Stable Diffusion to generate an image from a prompt, which is pretty awesome. Now, real quick, in case you're not using JavaScript or TypeScript: they do have documentation on the inference endpoints themselves. This is just a regular web API, so using whatever language you want, you can check out their Swagger API docs, which list all the endpoints for the different inference operations. But we're going to come back to huggingface.js. One more thing: there are two libraries here. One is called Hub, which is about connecting to the Hub itself — think of it as a management API for your models and datasets; that's not what we're interested in right now. The other is their Inference API, which we can install through that package (you can even use it with Deno), and from there we can just call these functions — they make it very simple. Now, just like with OpenAI, you'll need to pass in an access token: since this is an API hosted by Hugging Face, naturally they need to rate limit it. It is actually free, though — let's look at the pricing real quick. On their inference-api page, if you scroll down, they mention that the Inference API's shared infrastructure is free. So as of now you can hit this API for free, and for development purposes you'll have no problem. However, as soon as you want to use this in a real production application, you'll definitely want a dedicated instance: shared infrastructure comes with no guarantees. Their new page is all about that — essentially you select your model, choose your cloud provider, and they manage the instance hosting your model for you, which is quite convenient; traditionally you'd have to do all of that yourself, so this is a huge win. Of course, remember we're talking about their Inference API here; we'll also look at another approach where we run the model ourselves, so stay tuned for that. OK, let's start by installing the Hugging Face inference package, along with one more dependency called dotenv.
If you're a JavaScript developer, you've likely heard of dotenv: it lets us save our Hugging Face token into an environment variable file and load it into the code, because we never want to hard-code API keys directly into our code — that's bad practice. OK, installation is complete, so let's do that now. New file — call it literally .env, or actually .env.local, which just indicates it's a local environment file, because in reality, if you built an app that runs in production, you'd have a different set of API keys for that, so let's follow best practices. We'll create an environment variable for our Hugging Face token, and this is where we paste the token. To get one, you'll need to sign up for an account; once you sign in, you'll see your profile picture — click it, go down to Settings, and you'll see a tab for Access Tokens; click that, then click New Token. By the way, if it asks you for a role for your token, just select "read" — this is going to be a read-only token. Once you have it, paste it directly after the equals sign. And real quick, I'm adding .env to our .gitignore, because we don't want to commit those files. Now, from our index, let's import the inference package — we'll copy these two lines of code. The inference package exports an HfInference class; we create a new instance of it, and this is where your access token goes — with Node that will be process.env followed by whatever we called our Hugging Face token variable (you can name it whatever you like). Now, which of these tasks corresponds to embeddings? Is there a function called generateEmbeddings? No — it goes by a different name: Hugging Face calls this feature extraction, and if we come back to the models, you can see it's actually the first task there, under the multimodal category, which is appropriate — it basically means feature extraction can be used for different types of media; in our case we're focusing on text. So let's come back to our MTEB leaderboard and choose a model to generate embeddings with. One that stands out to me is e5-small-v2 — you might remember me mentioning that its small dimension count of 384 is a good thing. Why is that? Well, if you're generating embeddings, you're almost always doing it because you want to perform some sort of similarity search, and to do that search you'll use dot product (sometimes called inner product), cosine distance, or Euclidean distance. You could run those algorithms in whatever language you're using — say TypeScript; we'll actually do that today, just because it's easy — but alternatively, and especially in production, you'll most likely run these calculations in a database, ideally one that supports vector operations. If you followed my other video on this, you'll know there's an extension for Postgres called pgvector; let's take another quick look at it. Quick recap: pgvector is an extension you can install onto a regular Postgres database, and suddenly you get access to a brand new data type called vector. The size of the vector represents the number of dimensions, and it's just another column in your database, another data type for your columns: you'd create a column for your embedding, store all your embeddings in that column, and then perform a similarity search using one of the three operators it supports — Euclidean (L2) distance, inner product, or cosine distance.
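To give a feel for what that looks like from TypeScript, here's a rough sketch using the node-postgres (pg) client — my own illustration, not code from the video; the table name, column size, and connection string are assumptions, and `<#>` is pgvector's negative inner product operator, so ordering ascending returns the most similar rows first:

```ts
import pg from 'pg';

// Hypothetical table: documents(id bigserial, content text, embedding vector(384))
const client = new pg.Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

async function matchDocuments(queryEmbedding: number[], limit = 5) {
  // pgvector accepts vectors as '[0.1,0.2,...]' literals; '<#>' is the negative
  // inner product, so smaller values mean more similar (for normalized vectors).
  const vectorLiteral = `[${queryEmbedding.join(',')}]`;
  const { rows } = await client.query(
    `select id, content, (embedding <#> $1::vector) * -1 as similarity
     from documents
     order by embedding <#> $1::vector
     limit $2`,
    [vectorLiteral, limit]
  );
  return rows;
}
```

If your model's output isn't normalized, you'd typically reach for the cosine distance operator (`<=>`) instead, since it's insensitive to vector length.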
Now, if you've already gone down this rabbit hole, you may or may not be running into performance limitations. Once you store enough data, it might not be practical to perform the similarity search without an index, because without an index it's a sequential scan: you're literally comparing your embedding to every single row in the database to answer the query, whereas with an index you're not necessarily doing that. When it comes to vector indexes, pgvector implements one such index called an IVFFlat index — IVF stands for inverted file index, and it deserves a video of its own. If you're interested in this specific type of index, I'd highly recommend James Briggs: he has a really amazing video describing inverted file indexes with some nice visuals. Anyway, reeling myself in and trying not to go too far down this rabbit hole, the point I want to make is that at scale you're going to face some challenges with embeddings, and indexes aren't necessarily a foolproof solution. First of all, with IVFFlat indexes you're trading recall for speed, which means that as soon as you add the index you'll sometimes get a different (maybe only slightly different) query result than you would without it — and that's by design; it's part of how the index optimizes: it no longer searches every single embedding, it scopes the search to a smaller group. On top of that, once you have enough data, creating the index can be quite a long operation — it can take a very long time to build — and it can also take a lot of memory, so after waiting a long time you might discover you don't have enough memory on your Postgres instance to build the index at all. So there are definitely challenges here: first query speed, which is why we create indexes in the first place, and then the speed and memory footprint of building the index itself — and all of these are proportional to our dimension size. Think about it — forget indexes for a second. If we're just doing a single comparison between two embeddings, say a dot product, the dot product is essentially just multiplying each element of one vector with the corresponding element of the other and summing the results. It's literally that simple. If we have two three-dimensional vectors, say (2, 7, 1) and (8, 2, 8), the dot product is just 2×8 + 7×2 + 1×8 = 16 + 14 + 8 = 38. So you can imagine that the larger these vectors are — in the case of embeddings, the higher the number of dimensions — the more multiplication operations need to happen. Hopefully that's pretty straightforward. From that perspective, picking a model with fewer dimensions should actually be super desirable.
Without an index, those comparisons will be much quicker to calculate, and with an index, it will use less memory and can potentially be built faster. Now, intuition might say that fewer dimensions means a worse model — that it must not perform as well as a model with more dimensions — but as long as we trust Hugging Face's evaluation, a model like e5-small-v2 sits in eighth place, and look how many other models are below it, many with a much larger dimension size. According to these benchmarks, e5-small-v2 still performs quite well, so I would not use dimension size as the way you evaluate performance at all — use a proper benchmark like this one. OK, enough of that rabbit hole; I've talked up e5-small-v2 enough, so let's go ahead and use it. But just before that, I'd like to take a moment to thank the sponsor of this video, Brilliant. Brilliant.org is a website that teaches you computer science and math, and I thought this sponsorship was quite relevant to this video given some of the courses they offer. For me, the course I was most interested in was their Introduction to Neural Networks. With all the AI stuff blowing up these days, I'm personally the type of person who really wants to know what's happening under the hood — I don't like when there's too much magic — and I found the neural network course did a really good job of breaking down neural nets at a very simple level and then expanding on that. I'd say that's one of the number one value-adds Brilliant offers: nice visuals and simplicity. I also find they wrote the lessons in a way that forces you to read them step by step rather than overwhelming you with information, and neural networks specifically aren't the easiest thing to understand, so the fact that they can make it simple is impressive. Worth noting they have more than just AI lessons — data science and tons of math, if you're into that. If you're interested in trying Brilliant, I have a URL you can jump to: brilliant.org/RabbitHoleSyndrome. They said the first 200 people will get 20% off their annual subscription, and if you missed it, they do offer a full 30 days for free, so feel free to give that a shot as well — I'll throw the link in the description. Thanks again, Brilliant; let's get back to embeddings. OK, back to our Inference API: let's copy the feature extraction function and paste it here. Real quick, keep in mind I'm using a top-level await here. Top-level await — awaiting from the root of the file — is only supported when you're using ES modules, so in my package.json I have "type": "module"; just an FYI in case you're getting an error. If you can't use ES modules, just wrap this in a function — traditionally I'd write an async function main, do my awaits inside it, and call main at the end — but we no longer need that in this case.
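If you do need that fallback, the wrapper I'm describing is just this — a minimal sketch:

```ts
// Fallback when top-level await isn't available (i.e. you're not using ES modules):
// put the awaits inside an async function and call it at the end of the file.
async function main() {
  // ... do the awaits in here, e.g. const output = await hf.featureExtraction(...)
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```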
OK, so for the model we're going to use e5-small-v2. Let's open it in a new tab — this is what Hugging Face calls the model card, basically information about the model, like a README — and grab its name. It's just like GitHub: your organization or user, followed by the name of the model. Copy that, paste it in as the model, pass in our inputs, and test it by running npm run dev. Notice this actually takes a little while to load, and the reason is that Hugging Face's Inference API uses shared infrastructure: under the hood it literally has to spin up some sort of container or VM (probably a container) for this model in order to generate the embeddings and return them to us. So there's such a thing as a cold start versus a warm start: a cold start means it has to spin up that environment on the fly, which is why it can sometimes take so long, but subsequent calls are more or less immediate, as you can see here, because it's still running. I don't know the exact timings they use, but if you wait long enough it will spin back down, unless people are constantly using it. Of course, if you're using their API and want it always warm, that's where their paid offerings come in. Let's take a look: we're getting a bunch of numbers, which looks promising, but hold on — we're actually getting multiple arrays of embeddings. Why are we getting so many? I intentionally let this happen, because there's a good chance you'll come across it when using the Inference API, and I want to show you how to get past it. If we look at the output length — it's actually a nested array; let's check, I think the outer length will be one — OK, let's look at the first element. TypeScript doesn't know whether it's a number, an array of numbers, or an array of arrays of numbers. Since I'm just trying to prove a point I could cast it, but let's do things properly: grab the result as the first element of the output, then check it with a type guard, which narrows it down. Let's run that — oops, we didn't save — OK: the array length is seven. Interesting. Where is seven coming from? It almost looks like seven embeddings, and we were expecting one: you give it a sentence or paragraph or whatever, and it should give you a single embedding, in this case with 384 dimensions. So why seven? It's seven because of the way these models work under the hood: they generate an embedding for each token in the input. This input — which I think goes through the WordPiece tokenizer, though it'd be good to double-check — is likely producing seven tokens, and when you run those seven tokens through the model, it generates an embedding for each of them. The single sentence embedding we're used to is actually an aggregation of those seven token embeddings, typically via an operation called mean pooling. If you come over here, you can see they're using this average pooling algorithm, also known as mean pooling, which takes those (in this case seven) token embeddings and produces a single embedding from them.
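In case you ever need to do that aggregation yourself, mean pooling is just averaging the token embeddings element-wise — here's a minimal sketch (it ignores attention-mask weighting, which the real sentence-transformers pooling takes into account):

```ts
// Mean pooling: collapse per-token embeddings (tokens x dimensions)
// into a single sentence embedding by averaging each dimension.
function meanPool(tokenEmbeddings: number[][]): number[] {
  const numTokens = tokenEmbeddings.length;
  const dims = tokenEmbeddings[0].length;
  const pooled = new Array<number>(dims).fill(0);

  for (const tokenEmbedding of tokenEmbeddings) {
    for (let d = 0; d < dims; d++) {
      pooled[d] += tokenEmbedding[d];
    }
  }
  return pooled.map((sum) => sum / numTokens);
}

// e.g. the 7 x 384 output we got back would collapse to a single 384-dim vector:
// const sentenceEmbedding = meanPool(output as number[][]);
```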
So what's going on here — do we actually need to perform some sort of extra step ourselves? Well, yes, another step does need to happen, but I'm going to say it doesn't necessarily need to be done by us: there are things we can do to get Hugging Face to do it for us. Just to show the expected outcome, let me show you another model real quick. Let's come back to the leaderboard and choose the all-MiniLM-L6-v2 we were talking about earlier — notice it's also 384 dimensions. We'll copy that model name, come back, paste it, save, run, and let's remove the type-checking experiment and go back to our regular console.log output — and there we go, pretty much what we expected from the start: a single 384-dimension embedding. Perfect. At this point, maybe that's all you needed — "finally, took you long enough, thanks for showing me, see you later" — and if that's you, have fun, it's been a pleasure. But I'd like to explain real quick why this embedding model produced the output "properly" while the last one didn't, because I had to dig into this a little. What's happening under the hood is this: if we pull up our first model, the one that wasn't behaving as expected, you'll notice a few of what I'll call tags at the top — feature-extraction, which is the task they've tagged the model with (and that's intentional), plus some miscellaneous tags. But notice that, compared to all-MiniLM-L6-v2, which has the sentence-transformers library tag, our e5-small-v2 doesn't — and that small detail is, as far as I've figured out, what changes the behavior. Under the hood, I'm pretty sure the Inference API is designed to look at these tags as hints for how to actually execute the inference — understandably, because what Hugging Face is trying to do is take this huge collection of models, designed for many different purposes and runnable in many different ways, and unify it into a single API, which, to their credit, is a challenging task. So what I think is happening with sentence transformers specifically is that when they see the sentence-transformers library tag, it indicates they should run the model through the sentence-transformers framework — our good old SBERT framework — with a default configuration that produces the single embedding we expected; under the hood it automatically performs that mean pooling for us. Without the tag, my guess (and I am guessing here) is that the model runs through the more generalized Transformers Python library, which produces a set of outputs like the first result we got, and then you'd have to run that average pooling, or mean pooling, on top of it manually. So if you're like me and you're thinking, "dang, I would have liked to use e5-small-v2 without doing that mean pooling myself — is there any way to tell Hugging Face to run this model as if it had the sentence-transformers tag?" — the answer, as far as I'm aware today, is no. However, there is a solution: I've essentially forked this model — cloned it and pushed it up to my own repository — with basically the only difference being that I added the sentence-transformers library tag.
If we copy that, paste it into our code, and run it, you'll see we're now getting a single one-dimensional array, as we would have expected. Awesome. To me, the natural next thing is to test it — to get a feel for whether this embedding is doing what we expect. To test an embedding we basically want two embeddings to compare against each other, so we can run a similarity algorithm and observe how similar two sentences are. Let's call these output1 and output2 — very uncreative names — and, assuming these embeddings are normalized, let's create a dot product function, just because it's the easiest to implement and also the quickest to run. So: function dotProduct, give it a vector a, a number array, and b, also a number array. If you recall, the dot product is literally just multiplying each element of one vector with the corresponding element of the other. We could probably use a reduce here or something, but sometimes the best type of loop is a classic for loop: let result = 0, a good old for loop — let i = 0 — and we add to result the product of a at that index and b at that index, then return the result. That simple. In production we might also check that a.length equals b.length — the vectors need to be the same length for this to work — and throw an error if not. Then our similarity score equals the dot product of our two outputs. Ah yes, once again TypeScript doesn't know what type we're dealing with, so we need a type guard — actually, I'm realizing the type guards here are a little trickier, because if you look at the return type, FeatureExtractionOutput is a number, an array of numbers, or an array of arrays of numbers, and that itself is inside an array, so we need to verify that the elements of the array aren't either of the first two and are only the type we want. I'm going to quickly create a user-defined type guard here — this is definitely on the advanced TypeScript side, so don't dwell on it; if you're curious, check out the code on GitHub to see exactly what's going on. Awesome, there we go. We'll console.log our similarity and see what we get: "Rate limit reached. Please log in or use your API token." Right — after all that work creating our Hugging Face token, I'm realizing we're not actually loading it. If you remember, I had you install the dotenv package, but I forgot to actually use it. Let's do that now: import config from dotenv, and before we do anything else, call config and pass in the path to our env file, which is .env.local. This was our missing piece: as soon as you call it, it loads the environment variables from that file into process.env. Let's run that again — and we get quite a strange similarity score; in fact it's negative, and I expected a similarity of basically one: "this is a happy person" and "this is a happy person" are the exact same input, so in theory we should have had 100% similarity. What's going on? Ah, here it is: I'm not actually computing a dot product — I'm summing the elements together instead of multiplying them. Whoops. Let's correct that and give it another go.
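For reference, here's roughly where the code lands after that fix — a sketch reconstructed from the steps so far, not the exact file from the repo; the type handling is simplified to a cast, and the model id and environment variable name are placeholders:

```ts
import { config } from 'dotenv';
import { HfInference } from '@huggingface/inference';

// Load the Hugging Face token (whatever you named it) from .env.local into process.env.
config({ path: '.env.local' });

const hf = new HfInference(process.env.HUGGING_FACE_TOKEN);

function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let result = 0;
  for (let i = 0; i < a.length; i++) {
    result += a[i] * b[i]; // multiply, don't sum — that was the bug
  }
  return result;
}

// Placeholder for the fork described above (swap in the actual model id).
const model = 'your-username/e5-small-v2';

const output1 = (await hf.featureExtraction({ model, inputs: 'That is a happy person' })) as number[];
const output2 = (await hf.featureExtraction({ model, inputs: 'That is a happy person' })) as number[];

// For normalized embeddings, dot product equals cosine similarity (~1 for identical inputs).
console.log(dotProduct(output1, output2));
```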
Now we get 0.99999, and that's good enough; essentially that's one, with the difference attributable to rounding error. Perfect. This has proven a couple of things: first, that we can truly get embeddings working with one of the models we chose from the MTEB leaderboard, and second, that this API is producing normalized vectors, because if it weren't, our dot product wouldn't behave as a similarity measure here. Now the real test is comparing two different sentences, first ones that are somewhat similar and then ones that are quite different. If we change the second sentence to "This is a sad person", interestingly it still scores quite high, and I think that's fair: similarity isn't just based on sentiment (happy is one, sad is zero, however you might interpret that); you have to take the full context of the sentence into account, and "This is a happy person" and "This is a sad person" are structurally very similar. But if I compare "That is a happy person" with "The roller coaster is red", we get a noticeably lower score. One thing to keep in mind with embeddings: don't let the absolute value of the score trip you up too much. Every model has its own threshold for what you might consider similar. For example, I've noticed I can feed OpenAI's embedding model two very dissimilar sentences and it will still score quite high on an absolute scale, say in the 0.7 range, but it will rarely get above 0.8 unless the sentences are truly similar. So don't let a 0.77 confuse you; do your own testing on content you expect to be similar, find where that threshold sits (is it 0.85? 0.86?), and use that as your measure for whether things are similar. Every model will be different. Just for fun, let's compare with the all-MiniLM model: the roller coaster example drops way down to 0.22, the sad-person example lands around 0.65, and the identical happy-person case goes back to one. So interestingly, all-MiniLM's scores vary quite a bit more. Your mileage may vary: test your own content on different models and see how they behave, because each of these models is trained on different datasets that may or may not be close to your content. That just about covers the Hugging Face Inference API approach to generating embeddings; there's definitely more to explore with the Inference API and the other models, but that can be another video. Now, for the moment I'm sure many of you have been waiting for: let's generate these embeddings locally from Node.js, directly on my machine, so we can run this completely offline. What I'm going to do real quick is convert this project into a monorepo, just so we can keep track of each of these approaches; hold tight for two seconds. All I did here was take the project we had before, put it into a subfolder under apps, and create another root-level package.json that represents the project root and keeps track of our workspaces, which for now all live under apps. This is great because now I can create another folder under apps for the next approach.
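For anyone following along, a minimal root package.json for that monorepo setup might look like the sketch below; this assumes npm (or yarn) workspaces, and the package name is a placeholder rather than the one used in the actual repo:

```json
{
  "name": "embeddings-demo",
  "private": true,
  "workspaces": ["apps/*"]
}
```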
We'll call this one embeddings-transformers, and as always we start with a simple emoji hello world, which is working as expected. We're calling this project transformers because, heading back to the Hugging Face documentation, we're going to find our way over to the transformers.js library. If you've been involved at all in the AI/ML world on the Python side, you're no doubt familiar with the Transformers framework for Python, and transformers.js aims to be, as you'd expect, the JavaScript equivalent. Now, there's a lot of work left to reach feature parity, and I'm not sure it ever fully will, but the fact that this exists at all is quite incredible. So what really is transformers.js? As they say: state-of-the-art machine learning on the web, run directly in your browser with no need for a server. Crazy. That said, we can also run it on a server, in Node.js, as we're about to do right now. The reason it can do both is that, if you scroll down, you'll notice transformers.js uses the ONNX Runtime under the hood. What is the ONNX Runtime? It's an inference engine that works across many different languages, JavaScript included: it has bindings for Node.js, for the web (accomplished with WebAssembly under the hood), and even for React Native, so in theory you could run these models in a mobile React Native application. And on the web side, if you've been following the WebGPU work that's been landing recently, ONNX Runtime is already starting to target WebGPU, which is huge. So that's the engine underneath; transformers.js essentially does the job of bundling the most common tasks and use cases into an easy-to-use API, just like the Transformers library in Python, and it uses ONNX Runtime under the hood to accomplish that. Scrolling down, you can see the parity between Python and JavaScript: the Python Transformers library revolves around the idea of a pipeline, and the JavaScript version mirrors that with equivalent JS syntax. You can also see which tasks are supported today; not all of them are, but thankfully feature extraction is, so let's give it a shot. We'll come to the installation section and install the @xenova/transformers package. Quick shout-out to Joshua, the developer behind this library: my understanding is that it was created earlier this year, 2023, just under six months ago, and Joshua recently joined Hugging Face, which is awesome. Congrats, really great work on this library, and I'm stoked to see future iterations. So we'll copy and install that package, then come down to the pipeline API. The way it works is that you first create a pipeline based on the task you want to accomplish, in our case feature extraction, and it gives you back a function that you can call with different parameters depending on the task. In the docs they show a sentiment analysis classifier as an example, and hopefully this is inspiring: they also show automatic speech recognition using OpenAI's famous Whisper model (which is in fact open), meaning with transformers.js we can literally transcribe audio directly in JavaScript, whether in the browser, Node.js, or wherever we want, which is amazing. As you'll see in the docs, these pipelines can also take additional options.
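To get a feel for the pipeline API, here is a minimal sketch along the lines of the docs' sentiment-analysis example; treat the exact output shape as an assumption, since it varies by task and library version:

```ts
import { pipeline } from "@xenova/transformers";

// Create a pipeline for a task; the model is downloaded (and cached) on first use.
const classifier = await pipeline("sentiment-analysis");

// Run inference; the output shape shown is based on the docs' example.
const result = await classifier("I love transformers.js!");
console.log(result); // e.g. [{ label: "POSITIVE", score: 0.99... }]
```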
So let's do this. We come back to the top, import pipeline, and create a function I'll call generateEmbeddings, which is the pipeline for feature extraction. If you're unsure of the exact spelling of the task, go back to the overview, down to the supported tasks, and there it is: feature-extraction, that's the ID. For actual usage, head to the left-hand side, into the pipelines section, where there are examples for each pipeline. If we search for feature extraction, there are a couple of examples, and the one closest to what we're interested in is for calculating embeddings, and the model in it might look familiar: our good old all-MiniLM-L6-v2. I'll copy that over (note: I got this wrong at first, there should be an await in front of pipeline since it's an asynchronous operation), swap in my generateEmbeddings name, and then there are a couple of parameters here I'll explain in a second, but let's just run it. Amazing, look at that. The output format is slightly different: it's what's called a Tensor, which comes from the ONNX Runtime under the hood, and it's essentially a wrapper around the same embedding array we would have had, with a few extras: the number of dimensions, the type (the data here is actually a Float32Array), and the size. If you're following along, congratulations, you've potentially generated your first embedding locally on your own machine. But there are a couple of things to talk about regarding what's happening under the hood to make this work. Number one: the model. How did we just run it? If we generated these embeddings locally, the model must live on my computer somewhere, right? Are all these models bundled into the package? No, and thankfully no, because that would be insane. What transformers.js does under the hood is fetch the model from Hugging Face on the fly and cache it; take a quick look and you'll see that under the @xenova/transformers node module it creates a .cache folder that is caching our all-MiniLM model. This is important to note, because when it comes to deploying this application you need to think about how you're distributing these models. Ask yourself: am I okay with transformers.js fetching this model from Hugging Face at runtime, or would I prefer to prefetch the model and distribute it with my application? Say you're creating a Docker image for your app; you may want to consider embedding the model directly into the image. Your decision will depend greatly on how big the model really is. Thankfully all-MiniLM is quite tiny, relatively speaking, in the tens of megabytes rather than hundreds of megabytes or gigabytes, so it could well be something you bake right into a Docker image. How do you do that? Their documentation has a section on custom usage that explains it: you can download the models manually and point the library at them.
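Here is a minimal sketch of the embedding flow just described, following the docs' calculating-embeddings example; the pooling and normalize options are explained a bit later in the video:

```ts
import { pipeline } from "@xenova/transformers";

// Build a feature-extraction pipeline; the ONNX model is fetched from
// Hugging Face on first run and cached locally after that.
const generateEmbeddings = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2"
);

// pooling: "mean" collapses per-token vectors into one embedding;
// normalize: true scales it to unit length so dot product == cosine similarity.
const output = await generateEmbeddings("That is a happy person", {
  pooling: "mean",
  normalize: true,
});

console.log(output.data.length); // 384 dimensions (a Float32Array under the hood)
```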
You'd literally go to the Hugging Face model page, come into the files, and download them locally (and which files matter, which I'll talk about in a second). The idea is: download the model yourself, set the special configuration to point at that local model, optionally disable fetching remote models from Hugging Face entirely, and go from there. That's the first thing to note. The second thing: you might notice that this model actually lives under Xenova's Hugging Face organization, and the reason is that, to run models through transformers.js today, they need to exist in a format the ONNX Runtime can execute, and by default that's probably not the case for most models. If we go back to the sentence-transformers model, for example, and check out the files, you'll see it's actually a PyTorch model of around 90 megabytes (they also have a TensorFlow version, but PyTorch is the most common format you see today, and it naturally runs within Python). The E5-small model is the same story: a PyTorch model at about 134 megabytes, and so on. Now look at the Xenova version: in its files there's no PyTorch model; instead there's an onnx folder containing the model built for the ONNX Runtime. By default that model is about 90 megabytes, and if you use the quantized version (which I'll explain in a second) you can get down to around 23 megabytes, which is insanely small. So hopefully it's clear this is a very important detail: you can't just take any model, like we've been doing, and paste it in here. Just to prove the point, if I try one of the original models, I get an error saying it cannot locate the ONNX model file; it's looking for that special onnx folder with the model in the ONNX format. At least today, for your models to run, they will need to exist in that format, and that's the compromise; it may not always be the case, but it is today. Now, if you're thinking "dang, that must severely limit me to models that someone like Xenova has already converted", thankfully you have a bit more control than that. Back in the documentation there's a section on converting your models to ONNX format, and it's relatively straightforward: transformers.js comes with a Python conversion script that can convert a PyTorch, TensorFlow, or JAX model to ONNX, using Hugging Face's Optimum tool under the hood. So just like I forked a repo earlier to add my own tags, you can do something similar with a bit more work: run the model through the conversion script, push the result up to a model repo, and you should be good to go. Also worth noting: these model repos are literally just git under the hood, if that wasn't already obvious; there are branches and commit history. The one exception is that they use Git's Large File Storage (LFS) extension, which lets you store very large files in git, something that was previously considered taboo and bad practice, but LFS solves that problem and makes it actually workable. So you'll need that git extension if you don't have it already; there are instructions for installing it if you need them.
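If you do go the prefetch-and-bundle route, the library exposes an env object for exactly this; here is a hedged sketch based on my reading of the custom-usage docs, so verify the property names against the version you install:

```ts
import { pipeline, env } from "@xenova/transformers";

// Point transformers.js at models you've downloaded yourself
// (property names per the custom-usage docs; double-check them locally).
env.localModelPath = "./models";   // e.g. ./models/Xenova/all-MiniLM-L6-v2/...
env.allowRemoteModels = false;     // never fall back to fetching from Hugging Face

const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
```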
Okay, the last thing here: you might have noticed the quantized flag, or the model and quantized model files sitting side by side. What does quantized mean? Think of quantizing as a kind of compressing of your model, although compressing isn't quite the right word; it's really reducing the precision of the model. If you're familiar with how numbers are stored on disk, most models use 32-bit floating point, meaning 32 bits (4 bytes) per value, and with quantization you can reduce that, most commonly down to 8 bits (1 byte) instead, so you can shrink the size by up to four times. If you're wondering "we're losing precision, so that must be worse, right?", the answer is yes, it will be, but you might be surprised how well quantized models can still perform. Reducing the file size by four times by no means makes the model four times worse; it was still trained on the same data, it's just less precise during inference. So test both models and use whichever works best for your application. You may find quantized models are really great for embedded systems or for running directly in the browser, because at the end of the day you need to load these into the browser somehow, and a 20 MB file versus a 100 MB file makes a big difference. We're covering a lot right now, but that's because we're doing more of this work manually. The last thing to talk about are these parameters: pooling and normalize. Hopefully you can make an educated guess by now at what they do. You'll recall me talking about mean pooling earlier, where we take the separate embeddings generated for each token and aggregate them into a single embedding, most commonly through a process called mean pooling. transformers.js has that algorithm built right in, so all we need to do is tell it to use mean as the pooling method and it does it for us. Second, we're asking it to normalize the result, which means the embeddings come out as unit vectors, so we can use something like dot product to compare them. Without further ado, let's mirror the setup from the other project. These outputs are now tensors, so we compare their .data properties. We change the second input to "That is a happy person" and should expect a similarity of basically one; let's try it. Oh, we forgot to change the model back. Fixed, and boom. And if we run the other test, "This is a sad person", what does that end up as? We're using the all-MiniLM model, and I think it was around 0.6 or 0.7 before. 0.65.
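To make the size math concrete, and to show how you might opt out of the quantized weights, here is a small sketch; the parameter count is a rough figure for all-MiniLM-L6-v2, and the quantized option is per my reading of the docs, so treat both as assumptions to verify:

```ts
import { pipeline } from "@xenova/transformers";

// Rough size math: float32 weights use 4 bytes each, int8 uses 1 byte,
// so quantization can shrink a model by up to ~4x.
const paramCount = 22_700_000; // roughly the parameter count of all-MiniLM-L6-v2
console.log(`fp32 ~${((paramCount * 4) / 1e6).toFixed(0)} MB`); // ~90 MB
console.log(`int8 ~${((paramCount * 1) / 1e6).toFixed(0)} MB`); // ~23 MB

// transformers.js uses the quantized ONNX weights by default; this option
// (name per the docs, verify against your version) requests full precision.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
  quantized: false,
});
```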
I'm pretty sure that's nearly exactly what we had before, give or take a minor precision difference. Real quick, let's see what would happen if we didn't normalize: we get a score of around 15, well above one, so we wouldn't be able to accurately use the dot product here without that normalization. And without the mean pooling, a quick console.log of the first output shows the size is very different; what's happening is that this is the raw underlying data, still 384 dimensions, but times seven, and if you recall, "That is a happy person" was tokenized into seven tokens. Just like in the other demo where mean pooling wasn't happening, we get something similar to that first raw output, and with pooling enabled we get our proper single 384-dimension array. Hopefully that makes it clear what's going on and why those options are necessary. So we have a proof of concept of using transformers.js from Node.js; now let's finally do a quick demo of how to do this right in the browser, and spoiler, it's not much different from what we've already done. Typically in these situations I'd spin up a React project, probably scaffolded with something like Next.js, but today I'd like to go pure vanilla HTML and JavaScript, just to show you literally how simple it is to run this in the browser these days. Of course, everything I'm about to show will absolutely translate if you're using React with Next.js or Vite, or coming from the Vue side, or even Angular, or really anything else. So hold on for two seconds while I quickly scaffold out our third app. Alright, just like before, I've added a new project to our monorepo, this time called embeddings-browser, and it contains only three files: the HTML itself, our main JavaScript file, and some CSS. Again, what I'm about to show translates to any other front-end framework, but my goal is to show just how easy it is in a literally vanilla setup. I've already scaffolded the HTML so we're not wasting time on layout; here's what it looks like. I kept things as simple as possible: two inputs and a button labeled "Calculate similarity", and the idea is similar to what we did earlier: generate embeddings for two separate inputs, then use the dot product to calculate the similarity between them. styles.css is just so it doesn't look ugly (not putting a ton of time into this, but come on, it has to look presentable), and finally main.js, which as always we test with a console.log. Worth pointing out: the way I'm loading this script is important. We're taking advantage of ES modules in the browser, which let us use modern JavaScript, things like imports and top-level await, and no, I'm not using any bundler like webpack right now; this is truly vanilla JavaScript. Another thing worth noting is that I'm using the defer attribute on the script, which basically says "wait until the entire DOM has loaded before running this script". That's important because we're going to grab references to these inputs in the DOM, and that only works if the DOM has already loaded.
In the past you might have used something like the DOMContentLoaded event, where you call document.addEventListener for DOMContentLoaded and only run your script after it fires; now we can just use defer instead. Okay, moving into main.js, let's test our wave emoji. By the way, the way I'm serving this HTML file in my browser is with a VS Code extension called Live Server. Live Server is a super handy tool for quickly spinning up a local web server to serve your HTML; other front-end frameworks like Next.js have this built in when you run npm run dev, but if you're just looking to spin up a quick server like we are right now, this extension is great, so I'd recommend it. I just right-click, say "Open with Live Server", and it spins up on the fly. Refresh the page, check the console, and there's our emoji. Perfect, we have a starting point; let's get going. Making our way back to the transformers.js documentation, in the installation section you may have noticed that, in addition to the npm install method, we can use a modern ES module import, importing directly from a URL. Fun fact: this is exactly how Deno works as well. Deno is a back-end alternative to Node.js that tries to follow the browser spec as closely as possible, so the idea is that if something runs in the browser, hopefully it also works in Deno. We're not using Deno at all today, but there's your fun fact for the day. I'll copy that import, but not the rest of their script, because I've already set this up in my HTML with type="module". Did I mention that? In addition to everything else, you need to add the type="module" attribute to your script tag for browsers to treat it as a modern ES module. And we're not doing it inline like the docs; we're doing it in a separate file, so I'll paste the import there. Just like before, this imports the pipeline function. Before I copy over the same functions from the Transformers project, we need one more thing: references to our HTML elements. We need access to each input (input one and two), to the button, and to a div down below (invisible right now) that will contain our output. Each of those has an ID, from input one through to our generate button and the output div, so let's grab them now. And again, we don't need to check whether the document has loaded first, because we used defer, so we're guaranteed the document is ready when we access them. Next, I'll copy over the same functions from the Transformers project: first, the generateEmbeddings pipeline, just like before (we're using ES modules, so top-level await is fine); unfortunately we no longer have TypeScript, so the types have to go; and I'm copying over the dotProduct function again for now.
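Here is a hedged sketch of that browser-side setup; the CDN URL version and the element IDs are placeholders standing in for whatever your HTML actually uses, so adjust them to match:

```js
// Browser-side setup (vanilla JS, loaded via <script type="module" defer>).
// CDN version and element IDs below are placeholders.
import { pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.6.0";

// Grab DOM references; safe without DOMContentLoaded because of `defer`.
const input1 = document.getElementById("input1");
const input2 = document.getElementById("input2");
const generateButton = document.getElementById("generate");
const outputDiv = document.getElementById("output");

// Top-level await is allowed inside ES modules.
const generateEmbeddings = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2"
);

function dotProduct(a, b) {
  let result = 0;
  for (let i = 0; i < a.length; i++) result += a[i] * b[i];
  return result;
}
```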
Here's where there's a slight difference: instead of generating the embeddings right away, I only want to generate them when someone clicks the generate button, so let's add that click handler. With the handler in place, we can copy our generation logic into it (we're using await, so it needs to become an async function), and instead of hard-coding the sentences, we grab the values from the inputs we referenced earlier. Finally, instead of console.logging, we take the similarity and insert it into our output div. Perfect, let's give it a shot. Oh, I forgot one thing: this button is disabled by default. The reason is that the async logic to construct our embeddings pipeline takes a little time, and as I'll show you, it takes the longest the very first time, because it has to fetch the model from Hugging Face. After that, transformers.js caches the model in the browser, so on every subsequent reload the model is already available, which is amazing and much needed for any machine-learning-related application, because these models can be huge. Even all-MiniLM-L6-v2, which as we discussed is small relative to other models (the quantized version is about 20 MB), is still quite a bit in a browser context; anyone who has worked on the web knows every kilobyte adds up, and depending on the site you're building you really want things to load as quickly as possible. So that caching is important. Because of that, what I do is re-enable the button only after the pipeline has loaded. Of course, in a production application you'd most likely show some sort of loading indicator for a much better user experience, but this is just a demo after all. First thing to note when we run it: we're actually getting some 404s in the browser. What's going on? If you take a closer look, the transformers.js library is first trying to load the model from our localhost, under a /models/Xenova/... path. It's essentially checking whether the model is already available locally; those requests 404 because it isn't, so it goes ahead and fetches the files from Hugging Face instead. (I'm really zoomed in here, so apologies that things are getting cut off; that's the compromise for keeping everything readable.) You don't strictly need to know this, but if you're wondering where the model actually gets cached: open the Application tab in your dev tools, and where you might have gone to Local Storage or Cookies in the past, there's another entry called Cache Storage. Expand that and you'll see transformers.js has created a cache entry containing the config file, the tokenizer file, and the model itself. That means if I refresh again, look how quickly the button becomes enabled: the model now loads from cache almost instantly, which is amazing.
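A hedged sketch of that click handler, continuing from the setup above (same placeholder element IDs; the output formatting is my own choice):

```js
// Re-enable the button only once the pipeline above has finished loading,
// then wire up the click handler.
generateButton.disabled = false;

generateButton.addEventListener("click", async () => {
  // Embed both inputs with mean pooling + normalization, as before.
  const output1 = await generateEmbeddings(input1.value, { pooling: "mean", normalize: true });
  const output2 = await generateEmbeddings(input2.value, { pooling: "mean", normalize: true });

  // Tensors expose their raw values on .data; compare with the dot product.
  const similarity = dotProduct(output1.data, output2.data);
  outputDiv.textContent = similarity.toFixed(4);
});
```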
If I go back in here and delete that cache entry, notice it takes a little longer this time. Alright, now for the moment of truth: let's see what our similarity is with the classic "That is a happy person" in both inputs. Once again, with the MiniLM model we should expect a score of one, or, as we found out, just over one due to the precision difference. Awesome. And just like last time, if we say "a sad person", we know it should be around 0.65. Boom, 0.65. And look how fast that is: every time I click calculate, remember, I'm literally generating embeddings for two pieces of text and performing the dot product, so that is quite speedy. Okay, that's just about it. If you're still here, thanks for following along to the end; I know this was a longer one. Once again I'll be pushing all this code to GitHub so you can reference it, and feel free to clone the repo and try it yourself. There is one last thing to finish with, because at the beginning of this video I promised I'd talk about the future of embeddings and what's coming up on the horizon. So what's next? At the beginning of the video I touched on multimodal image-and-text embedding models, specifically the CLIP model, which stands for Contrastive Language-Image Pre-training. This one is from OpenAI again, but it is open source, and as we talked about earlier, what these models allow us to do is generate embeddings from two or more different media types (in this case images and text) where the resulting embeddings are produced in the same joint vector space. That last piece is, in my opinion, the incredible part. If it wasn't clear until now: every embedding model produces embeddings in a different vector space. I can't compare an embedding from one model with an embedding from another model; it would be meaningless, because the meaning of those vectors is completely different for each model. So the ability to embed two different kinds of things into the same vector space, where a similarity between those two embeddings is actually meaningful, is huge; in my opinion it's massive across two different media types. I don't have time right now to go deep into how they're able to accomplish this, but I will point you to their paper, "Learning Transferable Visual Models From Natural Language Supervision". The paper focuses on the idea that previous vision and image models were trained on a fixed set of predetermined object categories, like my earlier example of training a model on pictures of a hamburger, where you also have to attach the label "hamburger" to teach it what that is. As they say, that restricts generality and usability, since each label is predetermined. What they're trying to do instead is come up with a generic way of captioning images without that manually fixed set of labels, which, as we talked about before, is called zero-shot. If you're like me and wondering how that's actually possible: instead of the traditional approach of training an image feature extractor with a linear classifier to predict a label, they jointly train an image encoder and a text encoder, maximizing the cosine similarity of the image and text embeddings from correct pairs while minimizing the cosine similarity of the incorrect pairings.
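As a rough sketch of that objective (this is the standard way this kind of contrastive loss is usually written, not a formula quoted from the paper): for a batch of $N$ image-text pairs with normalized image embeddings $u_i$ and text embeddings $v_j$, the model is trained with a symmetric cross-entropy over the similarity matrix, something like

$$
\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[
\log \frac{\exp(u_i \cdot v_i / \tau)}{\sum_{j=1}^{N}\exp(u_i \cdot v_j / \tau)}
+ \log \frac{\exp(u_i \cdot v_i / \tau)}{\sum_{j=1}^{N}\exp(u_j \cdot v_i / \tau)}
\right]
$$

where $\tau$ is a learned temperature: the diagonal terms $u_i \cdot v_i$ (matching pairs) get pushed up while the off-diagonal terms (mismatched pairs) get pushed down.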
That, in essence, is how they train the model to produce embeddings in a shared multimodal space. Since that paper there's been more work in this direction; another one is "Embed Everything: A Method for Efficiently Co-Embedding Multi-Modal Spaces", from October 2021. The message these authors are trying to get across is that, with AI blowing up these days and more and more companies using it in their businesses, we're going to need a good way to, as they put it, interpret, operate on, and produce data in a multimodal space, which would include audio, images, text, and anything else. As they mention, there's been lots of success in the unimodal space, but multimodal techniques are still somewhat underdeveloped: it can be expensive and far from straightforward to train these multimodal models. They propose a new approach to help generate these co-embedded multimodal spaces that is more cost-effective, makes use of existing pre-trained models, and so on. If this is something you're interested in, feel free to read that paper as well; really cool stuff. I wanted to add it here because I think it captures quite well what's coming in the future: I believe there's going to be a big focus on these multimodal spaces and on the ability to generate embeddings that work across all of them. It's already very useful to generate embeddings for text alone and determine the similarity between them, but just think of the possibilities once we can compare audio with an image with text, all in the same vector space; the things computers will then be able to understand will open a lot of doors. Alright, you made it to the end, congratulations, give yourself a pat on the back; this was a long one. I hope you learned something new today about embeddings. I'm really excited to see the kinds of things you'll build with embeddings, maybe now with some of these open source models. Thanks so much for watching, and I'll catch you down the next rabbit hole.
Info
Channel: Rabbit Hole Syndrome
Views: 155,730
Id: QdDoFfkVkcw
Length: 84min 41sec (5081 seconds)
Published: Sun Jun 25 2023