LPUs, NVIDIA Competition, Insane Inference Speeds, Going Viral (Interview with Lead Groq Engineers)

Video Statistics and Information

Captions
"Are you describing an LLM that can actually provide multiple outputs and iterate on those outputs before it even presents you with the final output?" "100%." "I didn't even consider that. That is so cool."

Groq makes the fastest AI chips out there. They're called LPUs, and today we're going to talk about how they work. I was fortunate enough to interview two of their incredible engineers, Andrew and Igor. They are both hardware and software engineers who know exactly how Groq is able to achieve 500, 600, even 700 tokens per second inference speed. We talk about everything from manufacturing, to the differences between NVIDIA and Groq chips, to some of the incredible benefits you get from having insane inference speeds, which even surprised me. So make sure you stick around to the end, because what they told me absolutely blew my mind in terms of what you can do with that kind of speed. Thanks to Groq for setting this up and sponsoring this video. Let's get into the interview.

All right, we have Andrew and Igor from the Groq team, two engineers, and I'm super excited to talk to them today. We're going to learn all about the hardware side of things and how Groq ended up where it is today, at the fastest inference speed I've ever seen. We have a lot to get into, but first, Andrew, I'd love to hear a little bit about you, how you joined Groq, and a little bit about your career.

Yeah, definitely. I started in this space through my graduate studies at the University of Toronto, a great school for computer architecture and machine learning. I did a PhD in a really esoteric area, compilers and FPGAs, and then spent more than a decade building compilers at various semiconductor companies, starting out with C++ compilers. That evolved, particularly around 2016, when AlexNet and all these things started taking off, and my team turned into a machine learning compiler team. That's roughly when I got wind of Groq; Groq started in 2016. I knew a bunch of folks within the company, more on the business side; I actually didn't know the founder, Jonathan, at the time. But a lot of my colleagues ended up at Groq, because there were a lot of similarities between what I was doing, the business I was in, and what Groq was doing, really accelerating and reinventing silicon. So that's how I ended up here: I got pulled in through a mutual network. I've been here roughly four years, and it's been incredibly satisfying and exciting to see the company mature.

Yeah, and we're going to talk about what it's like today, because Groq is everywhere. But Igor, I want to hear the same thing from you. How did you get into hardware, what was your career, and how did you end up finding and joining Groq?

Yeah, so I think we started at the same school; I went to U of T as well, the University of Toronto. My first serious job was at IBM Microelectronics, in the Vermont area, and I moved through the ranks there. The ASIC organization I was working in was building custom chips, at that point mostly for networking applications, and those networking chips ended up looking very much like AI chips as we started adding lots of memory, lots of SerDes, and things like that. I became the CTO of that organization as it moved to GlobalFoundries, and in fact that's where I met Jonathan. When I met him in 2016 he was just starting Groq; it was about seven engineers in a tiny little office, and we had to steal chairs from across the room to have our first meeting. That led to the first chip, which is the chip we're discussing today, the LPU, the chip we're running these workloads on. After that I moved to Marvell, where I was the CTO of the ASIC business unit, and then to Google, leading the custom silicon effort for the TPU, their AI accelerator, optimizing the physical design for that accelerator. At that point Jonathan pinged me again and said, hey, we're doing this next big thing, it's time for you to join. I'd stayed in touch with him throughout, so it was an easy jump. There are some really cool architectural features of this design that are unique, really a silicon architect's dream, so I couldn't say no. I started back with Groq two years ago, so I've been here two years now.

Awesome. There's definitely something unique about the hardware, and I'm a novice, to say the least, on all hardware topics, so I'll try to ask intelligent questions. I want to learn what is so special about the hardware, but before we get there, let's start with the traditional GPUs, the traditional hardware used to run inference. From the highest level, how do you describe it?

What you see on the left is the graphics processing unit, the GPU, and it's really state-of-the-art silicon: the largest silicon interposer, the largest die you can manufacture, effectively the size of a reticle, over 800 mm², six HBMs, which are 3D-stacked DRAM memories that are as high-bandwidth as you can get off-chip, and the core is implemented in 4 nanometer, which is really pushing the limits of what Moore's Law is giving you in return. So it's a really complex design.

And what does the nanometer actually mean? What does that give you when it's smaller?

Nanometers have become more of a marketing thing now, but they're supposed to be the minimum feature size you're implementing on that chip. Transistors are so tiny now that you can fit tens of thousands of them in a single red blood cell, and that 4 nanometer is meant to be the smallest feature of the transistor, at least that's what it's supposed to mean. The number has lost a bit of touch with the actual physical size, but roughly that's the information you get from "4 nanometer."

Okay. And Igor, you mentioned a couple of other terms. Maybe you can briefly describe what those mean and how they contribute to the performance of a traditional chip.

Sounds good. If you look inside, this is the core die, the compute die, implemented in the latest and greatest process node, and it's doing a lot of the heavy compute work. Next to it you see these boxes; those are high-bandwidth memories, effectively multiple dies stacked one on top of another implementing DRAM, dynamic random-access memory. They're there to store a lot more information than you can store on the chip. The downside is that this memory is significantly slower than what you can implement on-chip; the bandwidth to one of these memories can be hundreds of times lower than the bandwidth you get on the chip. But they're a really good storage unit you have alongside the chip.

So if I'm buying a gaming GPU and it's described as 16 gigabytes, is that what you're describing right now?

Maybe, if it's a really high-end graphics card. Typically those would be lower-bandwidth memories, GDDR or something like that, but yes, you could have HBMs providing some of that data. That number you see is the DRAM capacity.

Got it, okay, thank you.

So that's the GPU. It's pushing the limits: the biggest reticle, the biggest silicon interposer, the most HBMs you could implement at the time, when the H100 was the most advanced piece of silicon around. On the right you see the LPU. This is the chip we implemented first at Groq, and it's in 14 nanometer. 14 nanometer is about three nodes older than the 4 nanometer the GPU is using; from 14 you would have moved to 10 nanometer, 7 nanometer, then 5 or 4. So technically you get significantly fewer transistors than you would in 4 nanometer, and you'd expect significantly less work out of a 14 nanometer chip than a 4 nanometer chip, because the devices are bigger and you can fit fewer of them in the same silicon area. Does that make sense?

It does, and I'm sure there was good reasoning behind the decision to use a node a couple of generations older. I'd love to hear why, and how you decided on those tradeoffs.

Part of the decision was that this chip was manufactured many years ago, basically three or four years ago, and at that point 14 nanometer was near the most advanced node you could use. The other part was that we wanted silicon manufactured right here in the US, to maintain a US supply chain. This silicon was manufactured in Malta, New York at GlobalFoundries and packaged in Bromont, Canada. And Andrew's team, since Andrew is on the call here, does all the compiling, in other words moving these algorithms onto this silicon to implement a specific workload. Those were the two major decisions that led to 14 nanometer.

Understood. So to clarify, the Groq chip as we know it today was actually designed a few years ago?

That's correct, yeah.

Okay. And it seems like, and we'll come back to this, something has occurred in the last few months where people are realizing just how powerful it is. I know you guys already knew it, but let's come back later to what it was about the state of the market and the state of inference that made everybody look at this and say, oh, this is actually super valuable, let's start switching to Groq. In the meantime, I see "non-deterministic" for the traditional GPU and "deterministic" for the Groq chip. Maybe you could give a brief definition of what deterministic means, and then how it actually affects the chip's performance.

We actually have a little animation here that takes you through that. What we mean by non-deterministic is that when a task is supposed to complete in the compute portion of the chip, it might complete very quickly if the data the processor is working on is nearby, in a local cache, but it might also take a lot longer if it needs to access that data from an HBM. You can see this Gaussian: every time you go to the HBM, you might get the data you're working on in as few as 300 nanoseconds, that's 300 billionths of a second, or it might take you a microsecond. What that means is that you never quite know how quickly you'll be able to get the job done. It's non-deterministic, basically.

Got it. And how does that affect the chip's performance? Was that one of the key decisions that led the LPU to be so unique and to perform at the inference speed it does?

I'll cover the hardware piece and then pass the benefits of the determinism on the software side to Andrew. On the left, what you see is many cores making up this chip. Think of them as small processing units that are all executing on their own time, and then they need to combine all their results to get the global job done. But if you have one core that's waiting for the HBM to provide data, then all the cores are waiting for that data to be completed, so you're as slow as the slowest core in the group. And not just that: AI is now not a single-chip problem but a multi-chip problem, and the more non-determinism you have in the system, the harder it is to get the job done expeditiously.

Yeah, I'm more on the software side, but if I'm doing a job asynchronously and everything else is waiting on that job, then everything else is as slow as that job is.

Exactly, and that brings complications on the software side, so maybe Andrew can touch on that piece, and then we'll switch to the LPU and explain how we improve or eliminate this problem.

Your intuition is spot-on, Matthew. If you don't know how long something is going to take, you just have to be excessively conservative, and that's effectively what you're seeing with multicore CPUs and graphics cards: the compilation problem is conservative, and that's why it's so hard to compile to these devices. I think it's an open secret that the big tech companies really don't have fully automated compilers anymore; they're using armies of people to hand-tune things, to actually map these ML workloads onto the silicon.
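To make the "slow as the slowest core" point concrete, here is a small, self-contained simulation. It is purely illustrative, not Groq code: each of N workers usually finds its data in fast local memory but occasionally waits on a slow off-chip access, and a synchronized step can only finish when the slowest worker does. The 300 ns and 1 µs figures come from the interview; the miss rate and everything else are made up for the example.

```python
import random

FAST_NS = 300      # best-case access, ~300 nanoseconds (figure from the interview)
SLOW_NS = 1000     # worst-case off-chip access, ~1 microsecond
MISS_RATE = 0.10   # assumed fraction of accesses that go off-chip (illustrative)

def core_step_time() -> int:
    """Time for one core to finish its piece of work (non-deterministic)."""
    return SLOW_NS if random.random() < MISS_RATE else FAST_NS

def synchronized_step(num_cores: int) -> int:
    """A step that must wait for every core: as slow as the slowest core."""
    return max(core_step_time() for _ in range(num_cores))

random.seed(0)
for n in (1, 8, 64, 512):
    avg = sum(synchronized_step(n) for _ in range(2000)) / 2000
    print(f"{n:4d} cores -> average step time ~{avg:6.0f} ns")
# The more cores have to synchronize, the closer the average step gets to the
# worst case -- which is why a scheduler for this kind of hardware has to
# budget conservatively, exactly the point Andrew makes above.
```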
Can you elaborate on what that means exactly? When I hear they have armies of actual human beings doing this, that blows my mind. Can you dive into that a little bit?

No, it's true. It sounds counterintuitive, because you think about all the automation we have, all the compilers we have: isn't this a solved problem? It's not. This is the hardest problem in computer science. The industry in general has effectively given up on making the holy grail, an automated vectorizing compiler that gets peak performance. A perfect example: Intel has the Math Kernel Library, MKL, and that's written for the most part by hand. They have templates, they have some level of automation, but for the most part these are templated, hand-written libraries built by extremely smart, very talented mathematicians and computer scientists, the best in the world at this, and Intel has hundreds or thousands of those engineers doing it. The compiler is really just invoking those library calls under the hood, and that's how you're eking out all the performance. You see this in other industries too: in fintech, the folks who want peak performance have kicked the compiler to the side and just write these things by hand; take a look at what they're doing under the hood for all these finance applications. So it's an extremely tough problem.
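Andrew's point that "the compiler is just invoking library calls under the hood" is easy to see from Python: NumPy's matrix multiply dispatches to a hand-tuned BLAS library (MKL or OpenBLAS, depending on the build), and the gap versus a naive loop is typically orders of magnitude. Part of the gap below is the Python interpreter itself, but even compiled naive loops lose badly to hand-tuned kernels; the sizes here are just illustrative.

```python
import time
import numpy as np

n = 128
a = np.random.rand(n, n)
b = np.random.rand(n, n)

def naive_matmul(a, b):
    """Textbook triple loop -- what you get without a hand-tuned kernel."""
    n = a.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i, k] * b[k, j]
            out[i, j] = s
    return out

t0 = time.perf_counter()
naive_matmul(a, b)
t1 = time.perf_counter()

t2 = time.perf_counter()
a @ b                      # dispatches to the hand-written BLAS kernel
t3 = time.perf_counter()

print(f"naive loop  : {t1 - t0:.3f} s")
print(f"BLAS (a @ b): {t3 - t2:.6f} s")
```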
The strategy the big tech companies are taking makes sense: they have the capital to do it. We could not. We're a small startup, we don't have an infinite amount of money, so we had to do something different. That's really the core of why, A, our chip looks so different, and B, how we're getting this amazing performance out of it. In fact, Jonathan, who started the company, spent the first six months thinking only about the software; he literally did not think about the underlying hardware that would run it. He thought about it more as a decomposition and graph problem: given an ML workload, here's the general structure, how do you decompose the problem into its primitive form? From that he devised a hardware substrate to execute those operations. He went backwards: he started with the problem and worked down, finally ending up at the silicon. He tells the story, I don't know how true it is, that he banned whiteboards in the company because everyone kept drawing block diagrams of hardware blocks, and he said, no, stop thinking about the hardware, let's just understand how we compile this thing in an automated way, because we cannot afford to hire a thousand engineers to write kernels.

So the traditional workflow is that the hardware engineers come up with a new design, something they believe to be cutting-edge on the hardware side, then they essentially pass it off to a team that had very little to no input on the hardware design, and that team is tasked with building the kernels. Is that correct?

Yeah, and that still happens quite a bit. You've heard the buzzwords by now about software-hardware co-design, and Jonathan was doing this in 2016. We're talking about the chip right now, but if you look at the whole system, there's really a vertical optimization across the stack; we joke that it goes from sand to cloud, sand being silicon, all the way to our cloud. Andrew's team works very closely with my team; they're embedded in our hardware meetings and my team is embedded in Andrew's team, so we make decisions that are optimized across the whole stack, and that's been the approach from the get-go.

Yeah, I guess that's one of the benefits of coming to a problem completely fresh: you see it through a different lens, start from the ground up, and get to make certain decisions that maybe the larger incumbents just took for granted.

There's a beauty to starting with a whiteboard; you have unlimited capability there. That's how Jonathan approached this problem, and it's yielded a lot of benefits, not just on the hardware side. As we move to the next slide and look at how Groq works through the hardware problem: on the right you see the LPU. It's a 14 nanometer chip, pretty scrappy, no HBM, no stacked DRAM on the sides, no silicon interposer; it probably costs about a twelfth of the chip on the left, so it's very affordable. And it's very regular in nature; you see something that looks almost like a single core, and the lines show roughly how data moves across it in a very predictable manner. Andrew's team can reason about exactly where the data is on the chip at any point in time. At every nanosecond, Andrew's team can say exactly which functional units are active, which memories are being accessed, and how the chip is working. That not only enables some superpowers on the hardware side, it makes the software problem significantly easier. I'll pass it to Andrew to talk to that piece.

Yeah, 100%. Ultimately, because we have 100% transparency into every single component in the hardware, it actually makes the scheduling problem easier, oddly enough. That's not true with traditional architectures: you don't really know how the cache is going to behave, there are branch predictors, things like that. We have none of that, no reactive components; it's 100% controlled by the software. The knock-on effect, and I don't know, Igor, if you have a picture of this, is that we can now scale up the problem. Here Igor is showing one chip, but we can combine these chips together like Lego blocks and use the compiler to schedule that larger problem onto multiple chips, and that is how we're getting the extreme performance you're seeing out of the device. This is very hard to do on multicore CPUs and graphics cards, because you don't really know when chip A is done versus chip B, and again, you have to be overly conservative and deal with dynamic routing between the chips. We have none of that. We treat all these chips as one monolithic substrate that we schedule onto, and that's how we get this blazing performance out of the device: they're all working as one.
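Here is a toy picture of what "the compiler schedules everything" can look like when every operation's duration is known at compile time. This is an illustration of static scheduling in general, not Groq's actual compiler or instruction set; the op names and cycle counts are assumptions made up for the example.

```python
# Durations (in cycles) are illustrative -- the point is only that they are
# known exactly ahead of time, so nothing has to be decided at run time.
OPS = {
    "load_weights": 40,
    "matmul_block": 120,
    "send_to_peer": 16,   # chip-to-chip transfer, also a fixed, known latency
    "activation":   24,
}

def build_schedule(program):
    """Assign an exact start cycle to every op; no handshakes or arbitration needed."""
    schedule, t = [], 0
    for chip, op in program:
        schedule.append((t, t + OPS[op], chip, op))
        t += OPS[op]
    return schedule

program = [
    ("chip0", "load_weights"),
    ("chip0", "matmul_block"),
    ("chip0", "send_to_peer"),   # chip1 knows the exact cycle this data arrives
    ("chip1", "matmul_block"),
    ("chip1", "activation"),
]

for start, end, chip, op in build_schedule(program):
    print(f"cycle {start:4d}-{end:4d}  {chip}  {op}")
# Because nothing is variable, chip1 never waits on a "ready" signal -- it simply
# starts its work at the precomputed cycle, the way Andrew describes.
```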
There seem to be two principles I try to live by that have shown to be true here as well. Number one, simplicity typically wins: the less complexity you have, the easier it is to reason about, and thus you can eke out a lot of performance benefits. And number two, innovation often comes from constraints: the fact that you weren't a huge team, the fact that you didn't have unlimited funding to hire thousands of engineers, forced you to think about the problem in a different way, and this innovation came to be. I absolutely love it.

Yeah, and the big companies are being totally rational in how they approach the problem. They've invested so much in their kernel libraries, so much in the existing silicon. What is the rational thing to do at that point? You incrementally update based on your existing worldview. You incrementally update that architecture to be a bit faster for, say, linear algebra, because that's so pervasive in machine learning. So what do you do? Maybe you widen the vector lengths, you add a few more compute units, but the architecture itself is still, at the end of the day, a graphics card or a CPU; you're still in that domain. And as Igor said, we had the luxury of starting from scratch: what would you actually build if you were solving this problem fresh? You don't have the inertia or the legacy decisions that you're tied to; you're essentially able to start from scratch, and that's really enabled us to get out of the local minimum that so many of the other companies are stuck in.

Yeah, and the inertia is the key piece. Once you've invested in armies of kernel writers, are you going to start from scratch, open up a brand-new architecture, and make the software investment all over again? Or do you just keep marching forward and trying to squeeze out whatever little bit of Moore's Law is left, whatever bit of network performance is left? I think that's the challenge. And the simplicity, Matthew, as you touched on, not only starts at the chip level, which we showed, but carries to the network level. At the network level, we've taken what is a conventional network, on the left, which has three strata: you have compute, which as we described earlier is non-deterministic, being routed or switched by networking, which is also non-deterministic. The network is trying to manage these packets moving from processor to processor, and it's trying to make locally optimal decisions; it's not optimizing the overall problem, it's just trying to minimize congestion in a specific spot.

Igor, if I can interrupt for a second, let's take a quick step back. When you're describing a network, what do you mean exactly?

What I mean is that you have these processors that are each executing some portion of the task. AI is no longer a single-chip problem; it needs to be tackled by hundreds, in some cases hundreds of thousands, of chips that are all working together on a common problem. In order to do this, you need to connect these chips; they need to communicate with each other, and that is the network. And the network is usually plagued with routing challenges: how do you make sure chip A talks to chip B while, at the same time, chip C talks to chip X, for example? There are collisions taking place in these switches and routers; the routers are trying to decide who gets prioritized and how to maximize the bandwidth from one processor to another, and that usually results in high latency, but also non-deterministic latency. Again, that packet might arrive fast or it might take longer, and now software is left with the really difficult task of trying to schedule a really well-behaved algorithm, like a dataflow algorithm, onto a non-deterministic mess of hardware underneath it. This is where you see those large numbers of kernel writers and PhDs trying to optimize how those kernels are designed.

Got it. So it's not only that the chips themselves are non-deterministic, but when you put them, in layman's terms, in a big server room to talk to each other, all of a sudden there are additional complexities, because each one is essentially operating independently, non-deterministically, and you have to write software to account for all of those complexities, and all of that additional software also causes latency. Is that all correct?

That's exactly right. And when you have variation, what do you do? You bound it, at every level, and every time you bound it you accumulate that margin at every single spot in order to be able to execute a specific workload.

Very cool. I just love that all of this came from not having enough money to hire a bunch of engineers; that is how innovation is born. Okay, so switching over to how Groq operates, what does that look like?

What we've done is totally remove this networking layer. Our chips are not only an AI accelerator, they are also a switch in one. We do not have top-of-rack switches, so we don't have these very complex, expensive bills of materials for building systems; our chips simply talk to our own chips. They're tightly connected in these global groups, and each of these chips is also connected to other local groups. For example, if this chip wants to talk to that chip, you might have to do a local jump, then a global jump, then another local jump. But the beauty of this is not just that we have fewer hops to go through, which automatically improves latency and bandwidth between chips; the beauty is that software gets to orchestrate this whole system, because even the system level is deterministic. We have effectively created a mega-chip, made up of many small chips in a big system, and now the software gets to schedule not only how the compute is executed inside one of our chips, but also how the communication is executed across the chips. Andrew can probably add a lot more there than I can.

And Andrew, I have a question for you. As Igor is describing this, I'm guessing there's been a ton of software written for the conventional type of network, and you don't need that, and there's a ton of software written for the traditional GPU. Did you have to build all new tooling for yourselves, even though the hardware is simpler? What did that process look like?

Yeah, 100%. This is a very different software stack; the problems are different, the approaches are different. We're still using some common compiler infrastructure out there, because at the end of the day you don't want to rebuild everything from scratch, but ultimately the approach is very unique to our silicon. It goes hand in hand; it's effectively an entire platform. You can't just have the silicon on its own; it's bolted to the compiler and to how the compiler maps these machine learning workloads onto both the silicon and, as shown here, this software-scheduled direct network. The software is orchestrating everything, and this is where the determinism point comes into play: if you know exactly how long every single chip is going to take to run some basic block XYZ, you know exactly when to send things over those links, and the problem becomes quite tractable at that point. If you don't know, then, as Igor says, you have to bound it and be extremely conservative.

I really like this picture you're showing here, Igor. Can you explain it?

I was just trying to help with Andrew's explanation. If everything is pre-scheduled, Matthew, and you know exactly the latency of everything that's traveling, it's similar to taking your car to work in the morning and somebody telling you: Matthew, leave at 8:00 a.m. exactly, drive at 40 miles an hour, don't stop at intersections, don't stop at stop signs, just keep driving along this route, and you're going to get to work at exactly 8:23 a.m. That's how our network works; everything is pre-scheduled. Each of these cars could be a tensor being sent through the network. The beauty of it is not only that these tensors get to their destination fast, because everything is pre-scheduled, but that you get more utilization out of your roads, which are the chip-to-chip connections, the wires connecting chips together. We can schedule so that nobody has to stop; there are no traffic jams, so you can move data really quickly. On the left you see the conventional network: typically if there's congestion, a whole bunch of decisions have to be made, back pressure on the incoming link telling it, hey, you've got to slow down your packets, or finding a different adaptive route to the destination. Ours is orchestrated; on the left, decisions are being made on the spot. I think that's along the lines of the explanation Andrew gave.

Yeah, and to your point, Matthew, about how we thought about the problem: it's so different from traditional architectures. It's funny you mention that, because we did have some folks who were experts in CPUs, who spent years at the big semiconductor companies, very successful, by the way, building CPU compilers or solving the multicore problem, and some of them actually struggled looking at this, because they kept trying to map the complexity of that world into this world, and there was friction; it just didn't work. They really had to forget everything they'd learned and start from scratch. It's a totally new framework for how you think about these problems and how you map HPC and ML workloads onto silicon. It kind of flips things on its head, which is interesting.
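The car analogy maps naturally onto a compile-time link timetable: every transfer is assigned a start time on every link it crosses, and the only thing to check is that no two transfers occupy the same link at the same time. The sketch below is a generic illustration of that idea, with made-up link names and durations; it is not Groq's scheduler.

```python
# Each transfer is (name, link, start_cycle, duration). The "compiler" verifies
# at build time that no link is double-booked, so no run-time arbitration,
# back pressure, or adaptive routing is ever needed.
TIMETABLE = [
    ("tensor_A", "chip0->chip1", 0,  16),
    ("tensor_B", "chip1->chip2", 16, 16),
    ("tensor_C", "chip0->chip1", 16, 16),   # reuses the link the moment it frees up
    ("tensor_D", "chip0->chip1", 24, 16),   # deliberately conflicts, to show the check
]

def conflicts(timetable):
    """Return pairs of transfers that would occupy the same link at the same time."""
    bad = []
    for i, (n1, l1, s1, d1) in enumerate(timetable):
        for n2, l2, s2, d2 in timetable[i + 1:]:
            if l1 == l2 and s1 < s2 + d2 and s2 < s1 + d1:   # interval overlap on one link
                bad.append((n1, n2))
    return bad

print(conflicts(TIMETABLE))   # [('tensor_C', 'tensor_D')] -- caught before anything runs
```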
So I want to talk for a minute about use cases. This completely different and unique architecture has obviously been applied extremely well to inference. Are there other use cases developers, or anybody, should be thinking about with regard to how the Groq chip works, and are you planning on expanding the applicable use cases in the future? Anything you can share would be great.

It's interesting; inference by itself is a pretty broad category, but yes, beyond LLMs we have definitely seen very specific use cases where the Groq architecture shines, and it comes down to the fact that, even though it wasn't obvious from the images, internally there's actually a ton of memory bandwidth on the chip. Why does memory bandwidth matter? Well, you've got to feed the beast: you have to be able to pull in data and feed the compute units on the device as fast as possible; otherwise, they often call it starving for data, you're just sitting there doing nothing, waiting for your data to come in. Because we have that superpower in the hardware, there are a ton of different areas, and Igor is listing a few here, related to drug discovery, things where there's a lot of recurrence in the application. There are other types of deep learning models, LSTMs, RNNs, graph neural networks, and we shine extremely well for all of those, on top of the LLMs the LPU was designed for. So it's this really nice superset of problems we can tackle, because the architecture is so different and we can take advantage of it; those characteristics are unique to us.

This may be a naive question, or just a lack of knowledge: would we ever find a Groq chip in consumer hardware?

The architecture is very organized and very regular. If you look at the chip, the vertical lines are effectively SIMD structures, structures you can feed specific information and provide an instruction to from the bottom. They're replicated, so you can tile them right next to each other to build the chip. So we could very quickly tile a version of this chip that could be embedded on another piece of silicon, taped out as a chiplet, or made as a standalone chip like the one we've shown up to this point.

Got it. So one thing I'm particularly excited about for the future is being able to run really powerful large language models locally, especially on my phone or other mobile devices. Is that potentially something I can look forward to, with Groq powering it?

I think the beauty of Groq is that the latency is so low, Andrew, that you can run large language models from your cell phone right now. I don't know if we want to do this on the spot here, but we can try something; let's see how this works out. Let me know what question you'd like to ask. Do you have a question?

No, go for it, Igor.

Okay. "Can you tell me more about artificial intelligence hardware and the biggest challenges for it?"

"Sure. In the context of artificial intelligence hardware, some of the biggest challenges include power efficiency, scalability, and computational capabilities."
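The demo above is the phone hitting Groq's hosted API. For reference, a minimal script doing the same thing from a laptop might look like the sketch below. It assumes the Groq Python client (`pip install groq`), an API key in the `GROQ_API_KEY` environment variable, and a model name such as `mixtral-8x7b-32768`; the model identifier and the tokens-per-second arithmetic are illustrative, not official documentation.

```python
import time
from groq import Groq  # assumes the `groq` package is installed and GROQ_API_KEY is set

client = Groq()

start = time.perf_counter()
response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumed model id; use whichever model Groq currently serves
    messages=[{
        "role": "user",
        "content": "What are the biggest challenges in AI hardware?",
    }],
)
elapsed = time.perf_counter() - start

answer = response.choices[0].message.content
tokens = response.usage.completion_tokens
print(answer)
print(f"~{tokens / elapsed:.0f} tokens/second end to end (includes network latency)")
```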
Yeah, that's crazy, because it's very low latency, but you're still hitting a server. Is there a world in which it can be completely local, with the actual models running on-device?

So, and Andrew can probably talk to this more intelligently than I can, typically with a larger number of parameters you get better model quality. Technically we can map smaller models onto our hardware, or onto a version of our hardware with maybe some DRAM on the side, but those are the tradeoffs. As I mentioned, these are multi-chip problems at this point; even if you're using GPUs, you still need multiple chips to tackle the biggest large language models. So it becomes a quality-versus-size tradeoff. But as we move into the future and do more and more integration and 3D stacking, perhaps we can get there.

The way I look at it is that you would not have the biggest and baddest LLM running on your phone; you wouldn't have GPT-4, or the Llama 3 that's coming out soon, on your phone. You'd probably have a distilled version of it, potentially tuned for a very specific area, so it'll be much smaller but fine-tuned for, say, getting your groceries, and that would be the one model running locally, and you wouldn't need the offload, if that becomes pervasive. That's the way I'd look at it: there will be these subclasses of smaller models that are much cheaper to run, but probably derived from the massive server-scale models running within the large companies today.

So I have a selfish question now. Currently, Groq's API and GroqChat support Llama 2 70B as well as Mixtral, two of the best open-source models out there. Are you planning on expanding that list? What does the process look like to bring a new model to Groq's architecture, and when can we expect another model?

Good question. Ultimately the answer is yes. The technology and software we've built is not Llama-2-specific or anything like that, so as the models evolve, we digest them through the compiler. What we often have to do is take the PyTorch model, and there's some massaging, quite frankly, because sometimes they're specially tuned, they'll have Intel primitives in there, or some other company's primitives, and we have to make them agnostic to the vendor. From there we can pull it into the compiler, map it on, and run it. We do this often for some of our internal customers who have proprietary models and want to see how well they run, so they can benchmark. So that's roughly the process: take those customer models or open-source models, massage the front end of the description to be a bit more vendor-agnostic, then push it through our proprietary software stack and run it. That's about it.
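Groq's compiler flow itself is proprietary, but the "massage the PyTorch model into something vendor-agnostic" step Andrew describes has a familiar generic analogue: exporting the model to a neutral graph format before a hardware-specific compiler takes over. The ONNX export below is my illustration of that general idea, not a statement that Groq uses ONNX; the model and file names are placeholders.

```python
import torch
import torch.nn as nn

# A stand-in model -- in practice this would be the open-source or customer model.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.GELU(),
    nn.Linear(1024, 512),
).eval()

example_input = torch.randn(1, 512)

# Export to a vendor-neutral graph so no vendor-specific primitives remain;
# a downstream, hardware-specific compiler can then map the ops onto its silicon.
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "batch"}},   # keep the batch dimension flexible
)
print("exported model.onnx")
```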
Got it, very cool. Okay, two last things I want to talk about. First, manufacturing. I know nothing about silicon manufacturing; maybe you can briefly describe what the traditional silicon manufacturing process looks like and how it might differ with Groq. You also mentioned it's all done in the US, which is awesome, so I'd love to hear about that process.

This is probably the most advanced technology humanity has ever built, especially if you combine artificial intelligence with actual semiconductors. Like I mentioned, we're putting thousands of transistors in the size of a single red blood cell; that's how tiny these things are. They're effectively switches, on and off switches, and they've gotten to the point where they're extremely complex to make. The printing was initially done with just light shining through a mask, projected onto a piece of silicon coated with photoresist, and then you would etch away different features until you got exactly the features you wanted on the chip. You'd have many of those layers, close to 70 or more in an advanced chip right now, all printed like that one after another. Now it's gotten so complex that the wavelength of normal light is too big for the feature sizes you're trying to print, which is crazy, so we've gone to extreme ultraviolet, and on top of that we're doing double patterning, which means we only print where the waves superimpose, to create the tiny feature sizes found on the latest and greatest chips. This is why it's been such a big challenge for companies like Intel and Samsung to catch up to TSMC; TSMC has been leading the pack here, but we're seeing more and more investment into it. We saw Sam Altman asking for $7 trillion to build his own fabs; as Jonathan puts it, if you have a clever architecture like ours, you should be able to do it with a hundredth of that investment. It's a really complex technology, and you can see it's at the forefront of all the geopolitical discussions we're having at this point. Hopefully that gives you a little bit of background.

Yeah, absolutely. Are there specific qualities about the Groq chip that made the manufacturing process different, or did you have to invent something? You talked about extreme ultraviolet; is that unique to Groq, or is that just the more modern chipsets, and what is unique to Groq specifically?

We followed the normal logic process that's part of the fab, so we haven't had to modify anything in the fab. The one thing that's really nice about the Groq chip is its regular nature. You can see it really looks like the same shapes replicated from left to right, and regularity is critical as you push the limits of the process deeper and deeper: more regular structures scale a little better, and you can squeeze a lot more transistors into an area if it behaves very regularly. An SRAM, for instance, has a higher transistor density than a normal logic block, because everything is regular and perfectly organized. The other beauty of Groq's chip is that we don't have a lot of control logic on it; the instruction control portion only makes up about 3% of the die, so the rest is devoted to what really matters, compute and memory. In a typical GPU or CPU, the control logic takes upwards of 20 to 30% of the silicon area, and that's really there to support the dynamic decisions the hardware is making; on those chips you need a lot more control and a lot more on-chip networking. We've replaced that with transistors that actually do the work rather than control the rest of the chip.

All right, that's helpful, thank you. I want to switch gears now completely. Just a few months ago most people probably hadn't heard of Groq, and then all of a sudden Groq is everywhere. So I want to ask two things. One, what was the energy like, and I guess still to this day, what is it like? And two, what do you think clicked for people? What was that moment, that inflection point, where the broader engineering base, the developers, realized Groq really has something special with its insane inference speed and the different uses it unlocks? What was the change in energy inside the company, and what was the moment people really started to notice how valuable Groq's inference speed can be?

In a lot of ways it feels a little surreal. You've been working on this problem for so long, and quite frankly, to give him credit, Jonathan Ross saw this day coming since the beginning. People would say, oh yeah, that's interesting, and keep doing what they were doing, but he was right. He might have been a little early in his prediction, but he knew in his core that this architecture, because of how he thought of it, would achieve this level of performance, and the team is extremely proud to see it happen. Internally it's kind of funny: when you're working in the space, you don't really realize what has changed; day to day you're doing the same thing over and over, just making things a little bit better each day. And then there was that push, from Mark Heaps and Jonathan, to make it real, to show the world this is not just an idea, to make it accessible to everybody. That was really the decision we made roughly late 2022, early 2023, that led to this moment: just sharing the technology. A lot of engineers within Groq were saying, no, this is way too early, we're not ready, but we got a little kick in the bum and pushed it out there. It's been super exciting.

Yeah, so Groq is one of those eight-year overnight success stories.

Exactly.

People see it and say, wow, it came out of nowhere, but you guys have been grinding for a long time. Was there a particular marketing event, some kind of moment? How long has GroqChat been up and functional, and when did people start realizing, when did it really go viral, I guess is the right word?

I think it was roughly October 5th of last year. It was Llama 2 that showcased the capability we had on the hardware. We had the capability, and we'd used it successfully on a number of workloads, but it was Meta going open source on large language models that gave us the ability to showcase what this hardware can do, and as Andrew pointed out, that was the tipping point on our side.

Yeah, and since then it's been hard to go to sleep.

So I'm particularly excited about taking this inference speed and plugging it into AI agent frameworks; that's one area I'm really, really excited about, and coding use cases. Are there any use cases you're particularly excited about that get unlocked because of this inference speed?

This is a little non-intuitive, but because we have such fast inference, the behavior of the model actually lends itself to better output. What do I mean by that? There's this notion of chain of thought: you basically get higher-quality answers if you stream successive answers back in and rephrase the question. It's almost like you're teaching the model on the fly; you're giving it more information. I think that is really unlocked with the Groq architecture. You're going to see these models supercharged now; there are going to be fewer hallucinations, there are just going to be higher-quality answers, because you have faster inference. This is non-intuitive, because people generally just think, oh, I'll get my result faster. No, you're going to get a better answer. That's the thing that excites me.
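A concrete way to picture the "iterate before you present the final answer" idea is a draft-critique-revise loop that runs entirely before the user sees anything; with inference at hundreds of tokens per second, the extra round trips add little perceived latency. This is a generic pattern, not a Groq or GroqChat feature; it assumes the same hypothetical Groq client and model id as the earlier API sketch.

```python
from groq import Groq  # assumed client, as in the earlier sketch; needs GROQ_API_KEY

client = Groq()
MODEL = "mixtral-8x7b-32768"  # assumed model id

def chat(prompt: str) -> str:
    """One round trip to the hosted model."""
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer_with_refinement(question: str, rounds: int = 2) -> str:
    """Draft, critique, and revise internally; only the final answer is returned."""
    draft = chat(question)
    for _ in range(rounds):
        critique = chat(
            f"Question: {question}\n\nDraft answer: {draft}\n\n"
            "List any mistakes, gaps, or unclear points in the draft."
        )
        draft = chat(
            f"Question: {question}\n\nDraft answer: {draft}\n\n"
            f"Critique: {critique}\n\nRewrite the answer, fixing the issues above."
        )
    return draft

print(answer_with_refinement("Why does deterministic hardware simplify compiler scheduling?"))
```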
Yeah, that is so cool. I don't think I'd thought about that specifically. It's kind of adjacent to AI agents, right? Because if they're working together, if they're checking each other's work, it's kind of the same thing. But are you describing an LLM that can actually run or provide multiple outputs and iterate on those outputs before it even presents you with the final output?

100%. That's exactly it.

I didn't even consider that. That is so cool. So have you implemented that with the models that are live on GroqChat today?

We have some POCs with some of that, but no, we haven't exposed anything commercially yet.

But anybody can try that, Matthew. You can play with any of the large language models: keep feeding in your information, ask for better suggestions, then implement those suggestions in a new answer, and see if your answer gets better.

Very cool. I have so much building ahead of me. All right, I know we're over time. I want to thank you, Andrew and Igor; this has been so fascinating. I'll probably have other questions, and I know a lot of the viewers who watch this will have a ton of questions as well, so maybe I'll send some follow-up questions. I appreciate your time so much; this has been awesome. Thanks. All right, we'll talk soon.
Info
Channel: Matthew Berman
Views: 51,052
Keywords: groq, inference, ai, openai, open-source, interview
Id: 13pnH_8cBUM
Length: 51min 11sec (3071 seconds)
Published: Fri Mar 22 2024