Vector Search and Databases at Scale / Ashot Vardanian

Captions
Welcome, let's continue. I really hope you enjoyed the conversations, the coffee, the cookies, and in general the whole atmosphere here in Belgrade. This is our database corner, and we have another talk on the specifics and internals of vector databases and their scalability. I am intrigued to invite on stage Ashot Vardanian from Unum Cloud. The stage is yours, please take it from here.

Thank you all for coming. Today we are going to talk about vector search and databases at scale, a topic that is very dear to me personally. I am the founder of a company called Unum Cloud. Aside from that, I used to do some physics in the past, but for most of my life — the last 15 to 20 years — I have been coding in C-like languages, and I also run the C++ user group of Armenia, so if you ever come to our wonderful country, let me know; we would be happy to host you as speakers or participants. At the bottom of the slide there are two GitHub profiles, my personal one and my company's, and we will cover a lot of open-source projects you can find there, so it might be easiest to follow the links right away. I will warn you that there will be a few memes in this presentation — as any responsible CEO, I had to put some in. Without further ado, let's split the talk into a few parts: I will start with vector search, then talk about scale, jumping a bit towards the end, after that we will get to databases, and last but not least, if we have time, I will share a cool project I have been working on for the last few weeks.

First of all, what is vector search? It is a very simple problem to state. We all know basic coordinates like x, y, z; in machine learning, most representations that neural networks produce are high-dimensional vectors — not three-dimensional but 300-dimensional or a thousand-dimensional. The problem of vector search is finding the k approximately nearest points to a query among the points in your dataset. Lately people have been using it for all kinds of things: for Transformer memory (Transformers being those new AI architectures everyone is talking about, like the ones inside GPT), for recommendation systems, and for semantic search. The last one is the simplest to understand, so let's dive into it.

A semantic search engine, even though it may sound complicated, is something everyone has definitely used: Google and Bing are two famous examples, and many of you are familiar with Yandex, one of the organizers of this conference, which is also a semantic search engine. Semantic search is a search system that does not look for matching words or tokens; it matches on similarity in the meaning of content. Say you have a textual query and you are searching through images. The only way to accomplish that is to have an AI decent enough to compare a textual representation and an image representation and say whether they are similar. But you are not going to run the AI on every pair of text and image in your dataset. Instead you produce an intermediate representation — a vector — for every item, put those into a vector search engine, and afterwards, every time you search, you encode the query and search within the dataset.
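As a point of reference for that problem statement, here is a minimal sketch of the pipeline in plain NumPy, with the encoding step replaced by random vectors standing in for whatever model produces the embeddings. It does exact, brute-force k-nearest-neighbor search — precisely the thing that stops scaling and that the approximate engines discussed below replace.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(10_000, 256)).astype(np.float32)    # pretend: encoded documents
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)  # normalize for cosine similarity

def search(query_vector, k=10):
    # Exact k-NN: one dot product per document. Fine for 10k vectors,
    # hopeless for billions -- which is exactly why ANN indexes exist.
    scores = doc_vectors @ query_vector
    top = np.argpartition(-scores, k)[:k]          # k best, unordered
    return top[np.argsort(-scores[top])]           # ordered by similarity

query = doc_vectors[42] + 0.01 * rng.normal(size=256).astype(np.float32)
print(search(query, k=5))
```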
There are a few kinds of models used to produce those embeddings. Most commonly the encoders are BERT-like architectures and the decoders are GPT-like (GPT as in Generative Pre-trained Transformer). Vector search engines are also numerous; the ones people use most often are called ScaNN and FAISS — ScaNN is from Google and FAISS is from Facebook, or Meta these days. You can compose any neural network with any vector search engine around any dataset to build your own semantically searchable data lake, which is roughly what we do.

But why now? I have been working on these problems for the last seven or eight years and no one really cared until a few months ago. People started talking about AI; the neural networks are getting better and bigger, with size comes an increase in accuracy, and once the accuracy is good enough, people start using vector search engines to navigate the vast datasets of Transformer memories and representations of different forms of content. Just the four companies I could remember off the top of my head have raised a couple of hundred million dollars over the last three months — and that is not a full list; you can probably find more. So if that much money is being poured into this industry, let's understand what vector search is made of and how to build one.

The problem of finding nearest neighbors in some low- or high-dimensional space has multiple solutions. The oldest is space-filling curves; they originate with Peano and Hilbert trying to prove a conjecture, but more recently people have used them in systems like Postgres, as extensions, to search for nearby two-dimensional or three-dimensional points — so the old-school indexes for latitude and longitude could have used space-filling curves, most commonly not the Hilbert or Peano ones but Z-curves and similar. A more recent approach is k-dimensional trees, which people have used a lot since maybe the 80s, when video games started being developed: a special case of a k-d tree, to a certain extent, is the octree, a way of indexing three-dimensional space, and once you try to do basic ray tracing or index the points forming a 3D scene in a game, you need an index like that. A more modern approach is locality-sensitive hashing, but today we are going to talk about navigable small-world graphs — the last, but probably the most popular, approach these days. It was popularized around 2011, but as always, the original implementation is not the one you will find in everyone's gadgets. What happened about five years later is that people designed a hierarchical revision of the original algorithm. It is a common pattern in computer science: when we face a problem, the first thing we do is throw a hierarchy at it, so instead of an array you work with a tree, and here, instead of a single proximity graph, you work with a structure that goes through multiple layers of graphs. To picture it: we put a few points into the dataset, find a few neighbors for each, and connect them into a proximity graph; the hierarchical algorithm builds layers upon layers of such graphs, the bottom one fine-grained and the top one coarse-grained. You start from the top, and once you find a really good, close cluster, you dive into the layer below, and you repeat this until you reach the lowest level.
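A toy sketch of that layered descent, assuming a tiny in-memory representation of the layers (dictionaries from node id to neighbor ids). The real HNSW algorithm keeps a beam of candidates controlled by an ef parameter rather than a single greedy walker, so treat this only as an illustration of the top-down idea, not as the actual algorithm.

```python
import numpy as np

def greedy_layer_search(vectors, neighbors, entry, query):
    """Greedily walk one proximity-graph layer toward the query."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors[current]:
            d = np.linalg.norm(vectors[candidate] - query)
            if d < current_dist:
                current, current_dist, improved = candidate, d, True
    return current

def hierarchical_search(vectors, layers, entry, query):
    """Descend from the coarse top layer to the fine bottom layer,
    reusing the best node found so far as the entry point below."""
    for neighbors in layers:       # layers[0] = coarsest, layers[-1] = finest
        entry = greedy_layer_search(vectors, neighbors, entry, query)
    return entry
```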
Quite simple to understand. It may not be strictly a small-world graph, to be honest, but we are here to talk about engineering rather than the math of it. So let's compare a few vector search engines and understand how they differ. On this slide I have actually outdone myself: I have not one but two mistakes on the same line, and if you have been listening closely you may already have spotted them. There is a vector search engine by Facebook called FAISS; there is a vector search engine by Google, but it is not called SPTAG — that one is by Microsoft, and the Google one is called ScaNN — and neither the Google nor the Microsoft engine is an implementation of the algorithm we have just discussed, so they should not be on the slide at all. I guess that is the artifact of preparing the presentation at the last moment, and sorry, but in AI we have to: the state of the art changes every day, and if I had prepared this talk a week ago, parts of it might already be irrelevant. There is the original implementation of the algorithm, hnswlib, and there is one by me as well, called USearch. We are going to compare them by performance first, and then cover a lot of other topics: functionality, bindings, ease of use, and ease of integration.

So, scale and performance — what does that mean? For vector search, and actually for any search, there are three primary metrics that describe the quality of your system. The first is construction speed: how fast you can index your dataset. The second is search speed. The third is recall. Recall is the measure of quality, because this is an approximate search system: you have some ground truth — the actual closest points to each query in your original dataset — you run the algorithm on some sample queries, compare the results against that ground truth, and the share of results that were successful is your recall.
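A minimal sketch of how that recall number is typically computed, given the engine's results and a brute-force ground truth for the same queries. The array shapes are my own assumption; benchmark harnesses differ in the details.

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids):
    """Share of the true top-k neighbors that the ANN engine returned.
    approx_ids, exact_ids: (num_queries, k) integer arrays."""
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        hits += len(set(approx) & set(exact))   # overlap for this query
    return hits / exact_ids.size

# e.g. recall_at_k(engine_results, brute_force_results) -> 0.992
```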
People then draw charts like the one on the left, which are completely incomprehensible: even after so many years in the industry, when I look at one of those I have no clue how to compare the systems. I am a simple guy, I prefer tables. There is another metric called NDCG, the normalized discounted cumulative gain, and there are two repositories that implement benchmarks for all of this: you run the Docker container, specify which engines you want to compare and on which dataset, and you get charts like these. I am not going to show you any more charts like that in this talk; instead we are going to look at logs.

I took a vector search engine, took a sample dataset, took the perf utility common in the Linux community, ran the binary, and what I saw was a log showing that the vector search problem — or rather its solution — has the same issues as most graph algorithm implementations, which is natural, because we are searching through layers of graphs. The problem is that the log shows 2.4 percent of branch predictions missed. That may not sound too bad, but it is not even remotely good, because a modern CPU is a complicated beast: it has a branch predictor at the very top of its pipeline, so whenever it sees a branch — an if statement — it guesses whether it will be true or false and keeps executing the instructions coming down that path. If the prediction was wrong, it has to invalidate some of the work it has already done, so if 2.4 percent of the time you have to go back and redo computation, that is not good. But probably a much bigger and more significant part of the slowdown comes from back-end cycles being idle: the stalls come from the back end of the CPU, which means we are wasting time waiting for data to be fetched from RAM into the CPU registers.

Okay, let's say we understood the problem — which I am not sure I fully do even now — but if we were to guess, the main bottlenecks would come from three places, especially the memory-oriented ones. First, the vectors are too big. A GPT-4-like neural network produces vectors of around a thousand to fifteen hundred dimensions; if every dimension is a float32, a four-byte value, you end up spending four to six kilobytes just to store the vector for every point, and every time you traverse the graph you have to recompute the distance between your query and every point you visit — a lot of computation and a lot of memory to fetch. Second, the neighbor lists are too fat: whenever you store a proximity graph you have to keep, for every node, the identifiers of its neighbors, and if those identifiers are big — four-byte or eight-byte values, depending on what you choose — you use more memory and, again, spend more time waiting for data to reach the registers. Last but not least, and probably the most generic point I keep promoting: people use memory allocators inefficiently and do too much object-oriented programming. We have been trained for many years to believe that object-oriented programming is good, functional programming is good, abstractions are good, but in my day-to-day work I mostly have to remove abstractions — get down to bare metal, implement things in assembly, avoid every kind of abstraction.

If I were to think about solutions to those problems — and it is hard for me to solve problems unless there are emojis in the solution, so bear with me — there are some things no one does. For example, no one stores vectors as doubles: doubles are eight bytes, and unless you are keeping super-high-resolution latitudes and longitudes, no one in the machine learning community uses double precision. A lot of people still use single-precision floats, which are four-byte values, but in reality there is not a single large-scale AI lab that we know of — and we know many and work closely with almost every one of the big names — that still uses four-byte values; it has been considered bad form for the last four years. Even when training neural networks you almost always enable mixed precision if you use a recent version of PyTorch, and once you get the vector representations out you can almost always downcast them, so almost all the labs we know use bfloat16 or float16.
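A small sketch of that downcasting step and a sanity check on how little the similarities move, in NumPy, with a made-up embedding width; the exact drift obviously depends on your vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors_f32 = rng.normal(size=(1_000, 1536)).astype(np.float32)  # GPT-like embedding width
vectors_f16 = vectors_f32.astype(np.float16)                      # half the memory, half the bandwidth

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# How much does the similarity between two vectors drift after downcasting?
drift = abs(cosine(vectors_f32[0], vectors_f32[1]) -
            cosine(vectors_f16[0].astype(np.float32), vectors_f16[1].astype(np.float32)))
print(f"cosine drift after f32 -> f16: {drift:.6f}")  # small in practice, which is why recall barely moves
```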
I would generally suggest using float16 rather than bf16, because float16 is an IEEE-standardized format. It is sometimes more complicated to use, because some Intel and other x86 CPUs do not support f16 but do support bf16 for specific kinds of operations, like dot products. My favorite and most often forgotten kind of downcasting is going from float32 all the way to int8. An int8 can only encode values from minus 127 to plus 127, give or take one, which does not matter in this context, but as we will see, downcasting to those roughly 250 unique values with equal intervals between them is, most of the time, enough to keep good accuracy. Some people are crazier than we are and go all the way to int4, which is not something I recommend, to be honest.

The next problem was that the neighbor lists are too fat. What do you do about that? Some people store the neighbor identifiers as uint64; the simple, obvious solution is to use uint32. But we always go one step further and do the weird thing: we have a weird type, which takes only ten or twenty lines of code to implement, called uint40. Once you do vector search at sufficient scale, you realize it is almost impossible to fit more than a trillion vectors on a single node. Realistically, almost none of the labs outside FAANG that use our systems ever reach even a billion points; FAANG goes beyond a billion, and so do we — we build systems with billions, tens of billions, hopefully soon hundreds of billions of vectors — but a trillion is out of reach. A uint40 has 256 times the range of a uint32, enough for up to a trillion values, and it costs only about a quarter more memory than the uint32 solution.
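To make the uint40 idea concrete, here is a sketch of packing node identifiers into five bytes each with NumPy. In the real engine this would of course be a tiny C++ type on the hot path, not Python; this only shows the layout and the memory math (5 bytes instead of 8 is a 37.5 percent saving on every stored neighbor id).

```python
import numpy as np

def pack_uint40(ids):
    """Pack non-negative integers < 2**40 into 5 little-endian bytes each."""
    ids = np.asarray(ids, dtype=np.uint64)
    packed = np.empty((len(ids), 5), dtype=np.uint8)
    for byte in range(5):
        packed[:, byte] = (ids >> (8 * byte)) & 0xFF
    return packed

def unpack_uint40(packed):
    """Inverse of pack_uint40: rebuild the 64-bit values."""
    ids = np.zeros(len(packed), dtype=np.uint64)
    for byte in range(5):
        ids |= packed[:, byte].astype(np.uint64) << (8 * byte)
    return ids
```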
And then probably the most obvious thing: once you start refactoring code, you see too many mallocs and too many structures. If someone asked me how to start getting into vector search, I would recommend reading a couple of the existing repositories — the algorithms are fairly simple. When I was implementing USearch — probably the tenth vector search engine of my career, but the first open-source one — I looked at what other people had open-sourced, and I immediately saw that whenever we go down through the layers of the graph, a new priority queue is built for the traversal of every layer, and given the way it is coded, that means new memory allocations every single time, in a number that is hard to predict. Whenever I see lines like that in any code base, I know we can rewrite it and make the performance noticeably better — we will see how much. Another thing that immediately pokes your eye: whenever you have concurrent accesses to the same graph structure, you have to keep locks or mutexes around every node, to disallow concurrent updates that would corrupt the structure. And once I look into the code bases of existing projects, I immediately see arrays of locks and mutexes; the C++ standard library mutex is at least 40 bytes, depending on the implementation, and the code may even call it a linked list of locks while in reality it is just a contiguous array. There is a lot to be refactored, and that is how USearch started: when I wrote it, I threw away almost all the abstractions. Instead of objects I just have tapes — plain regions of memory that I populate myself to achieve maximum data locality. It is very old-school C++; people consider C++ to be C with classes, but that is not how I treat it — for me, C++ is C with templates. And that is how I built USearch.

If you take FAISS and USearch and start benchmarking them, there are two kinds of operations you perform — adding a vector and searching for a vector — and you can do either as single entries or in batches, bulk operations. For bulk operations, the performance difference between us and FAISS is maybe 30 to 40 percent on a dataset of only a million vectors. But once you perform small-batch operations, the implementation of locks in the underlying FAISS library is apparently so inefficient — it may require you to create separate thread pools, there is a lot of additional bookkeeping going on — that the whole system really throttles. On small-batch insertions we are 356 percent faster, roughly 4.6 times. Then we downcast: instead of float32 we go down to float16, and believe it or not, we went from 99.2 percent accuracy down to just 99.1. Obviously it is not that hard to implement float16; the problem is that if you want a compatible library that ships to every computer out there, the question is how you compile it. We do not target only machines with native f16 support, so sometimes we have to backport this functionality through third-party libraries, which honestly sucks — but we will get back to that when I start talking about JIT; there have already been a few discussions about JIT here, and vector search can benefit from it too. Last but not least there is int8, and int8 is the best type: it is supported on every kind of hardware, even the oldest — before computers supported int16, before they supported int32, they supported int8.
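Here is a minimal sketch of that kind of plain int8 downcasting — a single symmetric scale into [-127, 127], deliberately not a learned quantizer — plus a dot product that accumulates in a wider type, which is the part that is easy to get wrong.

```python
import numpy as np

def to_int8(vectors):
    """Symmetric per-dataset scaling into [-127, 127]: plain downcasting,
    with equal intervals between the representable values."""
    scale = np.abs(vectors).max()  # one global scale keeps the intervals uniform
    quantized = np.clip(np.round(vectors / scale * 127), -127, 127).astype(np.int8)
    return quantized, scale

def int8_dot(a, b):
    # Accumulate in a wider integer type; int8 * int8 overflows int8 immediately.
    return int(np.dot(a.astype(np.int32), b.astype(np.int32)))
```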
Int8 is fast and small, so you can build a system that does 300,000 search queries per second, which in this case is 2.6 times faster than the previous configuration, and the accuracy is still fine — in fact, for some odd reason, the accuracy for int8 on this specific benchmark ended up 0.1 percent higher than for float16, which is strange, but you sometimes get artifacts like that when you run approximate benchmarks on really large datasets.

Okay, I have shown you state-of-the-art performance; the question is, can you do better? Of course you can. The USearch library comes with a few standard metrics, or similarity measures, out of the box; most of them are templates, so they can be tuned for your use case. Aside from that, I have a tiny library I use for experimentation and for putting into weird places, called SimSIMD: similarity measures implemented with SIMD — single instruction, multiple data — assembly intrinsics. The reason I wrote it is that compilers do not really support the Scalable Vector Extensions (SVE) that are new in modern Arm CPUs, so if you use something like Graviton3 on AWS, you will most likely not benefit from them automatically. If you cannot ask your compiler to generate those instructions, we can write them by hand — it is not that hard — and on microbenchmarks I sometimes see insane improvements: with SVE I was able to run dot-product similarity at 40 gigabytes per second on a single core, roughly eight times better than the code emitted by the compiler without manual vectorization. Pretty good, if you ask me. There is the QR code; it is easy to find.

But I still have a lot on my list: functionality, bindings, ease of use, ease of integration. Functionality-wise, you can always extend your search with a JIT compiler of your favorite flavor. Most people use LLVM, and there is a system built on LLVM called Numba: it lets you implement custom kernels in Python — you add a decorator on top, the kernel gets JIT-compiled, and you can pass it down to USearch, which is implemented in C++. So even if I compile USearch for old, outdated hardware, the hot path — the similarity measure computation itself — can be injected ad hoc from the user's side, on the target hardware, whenever you use the library. For people who like it a bit more low-level there is cppyy, a cool project built around Cling; Cling is a fork of Clang, the LLVM front end, and it comes out of CERN, a physics project. The idea is that even in your Python code you can define a string, put a C or C++ kernel inside it, use cppyy to compile that kernel, and pass it down to the Python library again. And last but not least, a couple of weeks ago I met a friend who is one of the other low-level developers — there are not too many of us around the globe — and he is responsible for a ton, sorry for my French, a lot of different libraries implementing low-level optimizations; let's put it this way: almost every time you use any machine learning library, you use his code. One of his earlier projects is called PeachPy; I was not familiar with it until he showed it to me — something like a university project of his.
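A sketch of the Numba path described above: a custom distance kernel compiled down to a raw C function pointer that a C++ engine can call on its hot path. The dimensionality constant and the exact way the pointer is wrapped and handed to USearch are assumptions on my part — check the library's documentation for the real metric-wrapper types.

```python
from numba import cfunc, types, carray

NDIM = 256  # assumed to match the index dimensionality

# C-compatible signature: two raw float32 pointers in, one float32 out.
signature = types.float32(types.CPointer(types.float32),
                          types.CPointer(types.float32))

@cfunc(signature)
def negated_dot_product(a_ptr, b_ptr):
    a = carray(a_ptr, NDIM)
    b = carray(b_ptr, NDIM)
    total = 0.0
    for i in range(NDIM):
        total += a[i] * b[i]
    return -total  # smaller value == closer, as a distance-minimizing engine expects

# negated_dot_product.address is a raw function pointer a C++ engine can call;
# how it is wrapped and registered with the index is library-specific.
```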
With PeachPy you can assemble an assembly kernel on the Python side and inject it, again ad hoc, into USearch. So this project supports more forms of JIT compilation — or ad-hoc assembly injection — than probably any other library I have seen. A really fun experiment, and this meme you will not find anywhere else, so try to remember it.

I have talked about features; the next question is the ecosystem. Take FAISS: it is implemented in C++, it is a relatively bulky project, it has bindings for Python — okay, great, what next? In our case the whole core implementation of USearch is just a couple of thousand lines of code; it is super easy to compile, it has no dependencies, it does not need BLAS, it implements everything internally. The cool part is that it turned into a really fun project for practicing, let's call it, programming linguistics. For many years I had not had the pleasure of programming in many other languages, and now I have configured all the CI around this project to emit bindings in different languages. The first release cycle of USearch, which took about two days a couple of months ago, had the C++11 core and bindings for Python 3, JavaScript, and Java; then I had another free weekend and we added Rust, Objective-C, Swift, and Wolfram; and now we are adding a C99 binding, Go, and WebAssembly. WebAssembly is probably the trickiest, because the default runtime assumes a 32-bit environment and this system was designed for 64-bit architectures, but it will not be too hard — hopefully it will be ready next week.

But more is not always better; sometimes more is worse. A lot of the big frameworks bring gimmicky stuff that I do not really appreciate, and people who are new to vector search often adopt it. The most commonly used of those is probably IVF-PQ, a special kind of quantization. The core algorithm has just three hyperparameters, and this quantization solution adds a system on top that makes your vectors smaller before you pass them into the vector search engine. That is a way to solve the problem here and now, but at real scale problems arise, because you have added an extra stage to your pipeline, and it becomes impossible to track issues such as distribution drift in the point clouds entering your search engine. Yes, you get some performance improvement, but you also get a lot of problems, and as we have seen, you can reach the same level of performance without any learned quantization. The result is a tiny implementation — this slide is actually a bit outdated, the core is about 2,000 lines of code by now, not 1,000 — with no dependencies and a lot of functionality: just pip install usearch.

But I promised to talk about databases. "Database" has become a big, broad term: people have been putting JSON into SQL databases, and now they put it into vector databases, which makes me cry. A database is really more than an index, more than a static file: there is generally metadata, there is always the question of whether it is an embedded database or a standalone server, whether you are going to do backups, replication, sharding. And at some point we arrive at the same problem from two different sides.
We have a lot of vector-search companies trying to become databases, and a lot of database companies wanting to become vector-search companies. The first group is companies like Pinecone, Milvus, and Qdrant, which will now be adding all kinds of JSON functionality, locks, mutexes, protocols, APIs — you probably know more of the terms than I do. The second group is databases we already know — Redis, MongoDB, ClickHouse, and many others — which are integrating vector search as one more form of indexing into their existing functionality. My view is that neither of these works at real scale, and by real scale I mean handling tens of billions of entries per node, a petabyte-plus of data managed on a single node, 400, 500, 800 gigabits of networking per node. This is truly FAANG scale, and to design infrastructure that scales that far and is that future-proof, we decided to build everything in a modular way, so that once something is outdated we can replace it with a better solution. It is not a monolith; it is not like a MongoDB where vector search functionality was added last week and you cannot pull it out, replace it, or modernize it on its own. In our case it is four separate libraries, all open source under the Apache 2.0 license. One of them is UStore, a database for throwing all your stuff into a single place. The second is UForm, a family of our pretrained Transformer models — like the GPTs and BERTs of the world, but in our case not really fine-tuned; we pre-train them, and we tune the architecture itself towards search applications, so instead of producing a thousand-dimensional vector representation, our representations are only 256-dimensional; we will get back to that in a second. Then there is USearch, and there is UCall.

UStore I already covered at the previous HighLoad++, a lovely event about half a year ago, so you can learn more about it from the recordings or on GitHub. UForm we have never covered at these conferences. It is meant to replace OpenAI's CLIP. CLIP is a Transformer architecture with two towers — one BERT-like, the other a ViT, a Vision Transformer — and the idea is that you train the network to align two vector spaces, one of images and one of texts. CLIP is great for general-purpose work, but if you want to do search at extreme scale, CLIP is too heavy. Our network is maybe four times smaller, four times cheaper to run, four times faster; its embeddings are three times smaller and much easier to compress further, so we can reduce the size of our embeddings by a factor of about 12 compared to what most other companies use. And last but not least, once you start packaging this into services, you want to add an API, and the first thing Python developers and machine learning practitioners usually do is reach for FastAPI, because FastAPI has "fast" in the name, so people assume it will be a good, future-proof solution. It is wrong. One evening this January, I think, I am sitting in a room in Yerevan building an inference server around our neural networks — and our networks can do, say, 7,000 inferences per second on a single consumer GPU, not enterprise hardware.
Then you take FastAPI and you realize that the system has a latency of about 1,200 microseconds — and when you run it on a MacBook rather than a server, it can go up to 6,000 microseconds, essentially six milliseconds. Once you do the math in your head, you realize that six milliseconds is almost enough for a signal to cross the whole of Armenia, reach a neighboring country, and come back. If that is how long it takes just to talk from one port of the machine to another, there is a problem, and it has to be optimized; for me it is a red flag. So I spent a weekend replacing this interface with something else. Any guesses how fast it can go? FastAPI gives about 3,000 requests per second on a single CPU core on AWS c7g instances, which are Graviton3 64-core bare-metal machines. I got it down to 20 to 40 microseconds and 230,000 requests per second on a single CPU core — 50 times lower latency, 70 times the bandwidth. And that is just because I used recent Linux kernel machinery: io_uring in Linux kernel 5.19 introduced a feature where the kernel manages and controls a pool of file descriptors on its own side, so if you know which kernel APIs to tap into, you can bypass a lot of the slow parts of the kernel and do this the right way. I used a few other dependencies, all of them SIMD-accelerated, obviously, and all in all we have a library that looks and feels almost the same and is roughly 70 times faster — which in this case is enough to run our inference at the maximum possible speed. Problem solved.

Last but not least, USearch, the library we are covering today: the vector search engine has a very simple constructor and essentially three primary interfaces — the constructor, adding points, and searching for points. As easy as that. There are plenty of optional constructor arguments you can use to tune the index for your use case.
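A minimal sketch of that three-call interface, assuming the open-source usearch Python package; the parameter spellings (ndim, metric, dtype) are from memory and may differ slightly between releases, so treat this as illustrative rather than authoritative.

```python
import numpy as np
from usearch.index import Index  # assuming the `usearch` PyPI package

index = Index(
    ndim=256,      # vector dimensionality
    metric="cos",  # cosine similarity
    dtype="f16",   # store vectors downcast to half precision
)

vectors = np.random.rand(10_000, 256).astype(np.float32)
keys = np.arange(10_000)

index.add(keys, vectors)                # interface #2: add points
matches = index.search(vectors[0], 10)  # interface #3: search for points
```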
All of those pieces, even though they are modular, come with problems that are not problems of the implementation — they are fundamental problems of different domains. When you try to cram everything into a single general-purpose solution, you end up accumulating the problems of every component and multiplying them by each other. That is why, when we deploy something in a new domain at scale, we do not take a general-purpose mixture; we take the four pieces we designed and assemble them in a different proportion. Say we are building a cluster and we can control what hardware we buy, or rent on AWS or another cloud vendor: we can take a different number of machines depending on which part of the pipeline consumes the most time. For vector search, at best, on small collections, you can do around a million search operations per second, but even with our software and all the assembly optimizations, to make it run fast you need machines with DDR5, or soon with HBM on-package memory. For a database you need really fast NVMe SSDs — in our case we now also try Optanes. For neural networks you obviously need good GPUs. And for networking you ideally want small packets and a recent Linux kernel — which is, surprisingly, also a hardware issue, because sometimes you cannot upgrade to a recent kernel since the hardware vendor does not support it in their drivers; it looks like a software problem but it is a hardware problem too.

We employ all kinds of dirty tricks to make this run faster — things you probably should not do. UCall originally implemented JSON-RPC, and JSON does not handle binary data. One of the weird things you can do, if you have a crazy twisted mind like ours, is downcast your float32 values to int8, drop all the negative values — keep just the integers from zero to a hundred — and fit them into the visible range of ASCII symbols. So instead of converting to Base64 or some other binary encoding to exchange inside JSON, you essentially make your neural network emit ASCII codes that are then used directly in vector search. We love dirty tricks like this. You do have to escape a few characters, for example the backslash, but you keep those out of the usable range and it works.

Okay, now we got to the final part, which is me telling stories. I am sad when I see people just using technology; I want them to abuse it, to misuse it, to put technology in weird places and apply it to tasks and problems it was never meant to solve. I released USearch a few months ago, and a few weeks ago I published an article about abusing vector search for texts, maps, and other non-AI search, and it kills me that all the other vector search companies just do AI. AI is fun — I have been doing it for a few years as well — but there is so much more we can do. This is a generic search structure, and it is not really just vector search: it is similarity search for any two objects for which you can define a similarity, even if they have a different number of dimensions — one vector six-dimensional, the other sixty-dimensional — you can still define some similarity measure, say a Jaccard similarity comparing the share of matching scalars between two representations, and it will work. If you want to apply it to text search, you can build a Levenshtein distance operating on strings, which are essentially ASCII vectors again, or tokenize and search over arrays of tokens. If you have time series, you can search through them with dynamic time warping or similar metrics. If you have plain latitudes and longitudes — the common problems — you can still use the haversine distance to apply USearch to your domain. One of the cooler stories is about the Tanimoto distance. Recently I was chatting with a friend who does natural sciences, chemistry, and he told me: Ash, I am searching through molecules, but I cannot search through too many of them — my system crashes, and it is way too slow; can you help? I pulled out the laptop, and I kid you not, it took ten minutes to implement a custom template overload in C++ that adapts USearch to a chemistry problem. So this week — you are actually the first to hear it — I am publicly releasing the largest chemical dataset with the largest chemistry search ever built, with 7 billion entries in it; it will be shared publicly by the end of the week, and if any of you are in the natural sciences, I hope you will check it out.
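For reference, a sketch of the kind of metric involved — Tanimoto similarity over packed binary molecular fingerprints — written here in NumPy rather than as the C++ template overload from the story; the fingerprint length and packing are assumptions for illustration.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints,
    stored as packed uint8 arrays (e.g. 2048-bit molecular fingerprints)."""
    intersection_bits = np.unpackbits(np.bitwise_and(a, b)).sum()
    union_bits = np.unpackbits(np.bitwise_or(a, b)).sum()
    return intersection_bits / union_bits if union_bits else 1.0

def tanimoto_distance(a, b):
    return 1.0 - tanimoto(a, b)  # a distance the index can minimize
```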
To sum up the whole presentation, I would split my recommendations into two parts. The first part is search-specific, as you can see from the first word of each line. Search is a memory-bound problem: no matter how much assembly we write or how clever the instructions we use to optimize the compute part, if you have large vectors there is probably nothing more to be done beyond a certain point. It is great that our performance is so high, especially on smaller collections, but once you have an index that is, say, 600 gigabytes, or a terabyte or two, in memory, there will be next to no performance difference between us and FAISS, because everything is memory-bound — you are just waiting for memory to be fetched. To solve that you have to find the right hardware; for us that is DDR5 or HBM. The second point is that downcasting is often at least as good as quantization: don't trust the ads, do downcasting rather than quantization — at least that is my take, and it is debatable. The other two points are more generic, not just about search. The first is my personal fight against abstractions: I want software to be short, concise, and understandable, and if an abstraction does not make it much shorter, I would rather not have another level of indirection to read through just to understand what every single line of code is doing. That is a common anti-pattern, but another, probably even more common one, is that scale breaks things. I have seen way too many projects built as monoliths, and it took me maybe the first five years of building this company to understand how to split our abstractions into separate projects. Today I have shown you four of our projects; we actually have closer to 25.
Most of them are not open source yet, but we have a lot of cool low-level technology to release in the coming years. The ones already available are on GitHub, again under the Apache license — go check them out, abuse them, and find use cases for them that I never found. Thank you, guys. [Applause]

Moderator: Excellent, thank you to the master abuser — it is always intriguing to hear from people who push poor hardware beyond any reasonable limits. With that, please rate the session — the feedback is really needed — and now we move to questions and answers. Any questions for Ash? The very persistent gentleman at the front, please go ahead.

Q: I am wondering what the difference is between the benchmarks on DDR5 and DDR4 hardware — how much of a performance drop is it for you?

A: It can be drastic. If I take a DDR4 machine and a DDR5 machine — even though the DDR4 systems we use are often x86 and the DDR5 systems are often Arm — we often see a 2x to 3x performance difference, so maybe two and a half times on average is what I would expect.

Q: And does that drop basically bring you to the same performance level as FAISS and ScaNN — do you both become memory-bound? What I mean to ask is: would USearch still perform noticeably better on DDR4 hardware, which we still need to run for the legacy machines we have anyway?

A: The faster the medium your index fits in, the bigger the gap between our solution and the second-best one — unless there is some newer solution I am not aware of that is even better. If your index fits entirely into the CPU caches, the difference can be drastic. This is actually a really interesting topic, leading not just to indexing but to clustering and then indexing: if you really need to squeeze out maximum performance, it makes sense to cluster all your data points into shards and make each shard as big as the level-3 cache on your machine. If you look at the next generation of AMD CPUs, you will have around a gigabyte of cache, which is big enough to fit a significant number of points into the cache itself. I hope that answers the question.

Q: It does, thank you.

Moderator: Excellent. Any other questions? Right here at the front.

Q: Thank you very much for a great talk. I have a question about the hack with uint40. Won't uint40 cause unaligned reads and therefore cache-line misses — some values will be split across cache lines; won't that affect performance?

A: Yes, it will affect performance, but not noticeably, and sometimes not in the way you expect. There is an entire debate about cache associativity and how to organize the memory addresses you touch most, so that the CPU prefetches them fast enough. In our case, when I am building a very large index, I try to fit the whole vector and all of its neighbors into a single tape that spans at most two, three, maybe four cache lines, and the access pattern over it is simple and regular: even if I hit a split-line load, I am going to read the next line anyway.
So yes, that specific load will end up slower, but it inevitably brings in the next cache line, which I am also going to use, so in this design pattern it is not really a performance penalty.

Q: Thank you, sounds great.

Moderator: All right, excellent. Any other questions? There is a lady over there.

Q: Hi again. It seems to me that you understand how your projects work; we would like to understand a little how your — your words, not mine — crazy twisted mind works. Once you face an obstacle, or a decision someone else made, you see it differently than most people do. Is there a question you ask yourself, or something like that?

A: I do not quite know how to answer, but I have been doing this maybe long enough — too long — to see that we need to solve a problem: we want to build advanced, smart, intelligent systems at a scale no one else has tried, and we are doing it with far fewer resources than, say, FAANG companies. To reach the same level of intelligence — equally smart and big systems working with equally large datasets — you need to rethink the whole architecture from the very beginning. Sometimes we see a solution that seems good enough, but it obviously has problems here and there: it does not let us plug in custom memory allocators, it does not let us take control of hyper-threading. We start adopting a new library that sort of implements the functionality we need, and a few weeks in we realize the problems are bigger than expected, and that it would be easier to re-implement everything from scratch — after gathering seven and a half years of experience in how to do it properly — than to keep patching somebody else's system. It is a bit of a weird culture; it is well known that it does not suit big companies, and it pushes people to rewrite everything completely, which becomes hell to manage and maintain. But if you are a small deep-tech startup that is passionate about reaching the state of the art, it is the only way to go.

Q: Thank you.

Moderator: Excellent. And now the persistent gentleman who is pretending to be disconnected from the audience — your turn; please use the microphone.

Q: I have just checked the GitHub page, and your benchmark uses the f32 implementations of FAISS and so on. The question is: I think there is a mode in FAISS that does quantization and quantizes the vectors down to eight bits just as well — did you benchmark that too?

A: We tried a few different modes of working with FAISS, but not every single one. Most of the quantization schemes they provide are learned quantizers, which we generally try to avoid, because when you reach indexes of billions of entries, you train on a million points and then apply that to a billion. The problem is that once the index is that big, everything breaks at scale again, and we do not want one more pipeline stage to debug when we are trying to understand where a problem is coming from.
We would not want it to turn out that a trained quantization mechanism caused a distribution shift within the points we add to the index. To avoid such problems, we try to keep our systems as simple as possible; that is actually one of the reasons all of our implementations are so short — UCall is a few thousand lines of code, USearch is a few thousand lines, UStore is a few thousand lines — even though they sometimes compete with systems that are millions of lines, especially on the database side.

Q: All right, but in any event, what was the exact mode in FAISS that you benchmarked?

A: The IndexHNSW class, which uses the HNSW index underneath — sorry for the naming; once you go through the sources you see they have a few similarly named classes. It does not use any additional flat registry mapping labels to IDs or anything like that; it is as bare-bones as it can be, to get the raw performance.

Q: Right, so that was the direct approach, which does no clustering — just the graph.

A: Exactly. And if you want to add clustering, our system is meant to be modular, so you can retrofit it: a two-step pipeline where you first do your clustering, or some other adaptation, and then you run the search.

Moderator: Excellent. You know where to look on GitHub for the details, and you can continue this discussion in the hallway. Thank you again. Now the last remaining task for you: you need to choose the best question. This is tough — there were a lot of good questions — but you need to choose one.

A: I think we should value persistence, in the sense that building something deep-tech required us to work for a very long time, stubbornly, on a problem everyone had abandoned — for seven years, until the investments came, no one cared about it. For the same reason, I think we should award the person in the first row, for the range of questions he asked.

Moderator: That is your choice and your decision. So the gentleman in the front row gets the prize. And that is it — thank you once again. With this, a round of applause, please. Thank you, guys. [Applause]
Info
Channel: Tech Internals Conf
Views: 1,027
Keywords: HighLoad++
Id: UMrhB3icP9w
Length: 52min 24sec (3144 seconds)
Published: Fri Mar 15 2024