Apple M3 Machine Learning Speed Test (M1 Pro vs M3, M3 Pro, M3 Max)

Captions
Hello everyone, Daniel Bourke here, machine learning engineer. As you can see, I've got four MacBook Pros here. These are the M3 variants, the brand new ones from Apple: M3, M3 Pro and M3 Max. And this is my M1 Pro 14-inch that I've been using for the past two years. As a machine learning engineer, I want to figure out: are these new machines good, cost to performance, against the other machines I use regularly? I've got a deep learning PC upstairs and Google Colab, and we've got five machine learning tests. We're going to benchmark all of these and see how they go.

First, a quick overview of what you're looking for when you're buying a machine learning PC, whether you're going with a MacBook Pro or something else; these are standard across the board. One: processing power. If you want to build deep learning models, which is what's behind all of the AI revolution, what you want is GPU processing power. Two: memory, meaning RAM, or GPU RAM if you're using an NVIDIA GPU. Since we're focused on MacBook Pros, these all have unified memory, so the GPU, CPU and RAM are all on one chip. The M3 base model has 8 GB of unified memory and starts at $1,599 USD. The M3 Max here has 36 GB of unified memory and starts at $3,999 USD. And the M3 Pro starts with 18 GB of unified memory. These are all the baseline models of their variants; you can of course upgrade any of them. The more GPU cores you have, the more numbers your computer can crunch, and machine learning algorithms are all based on finding patterns in numbers. The more memory you have, the bigger the model you can store in memory and crunch numbers with, and the bigger the model, the more parameters it has and the more patterns it can find in data. That's a big, broad overview, but essentially more GPU cores and more memory equals better. So if I were to place a bet on which of these three would perform best, of course it's the M3 Max, but we're probably all not surprised by that; we're going to find out with the tests.

So what tests are we going to run? We've got the three M3 variants; my daily driver, an M1 Pro that I've been using for the past two years; my deep learning PC upstairs, which I use in collaboration with the M1 Pro to build Nutrify and train computer vision models almost every day; and then of course we have to compare them to everyone's favorite, Google Colab. We have two PyTorch tests, a computer vision test and a natural language processing test; two TensorFlow tests, again computer vision and natural language processing; and then something I'm really excited about, a Llama 2 test. We're going to test the Llama 2 7B variant, a large language model for text generation, and see how many tokens per second each of these machines can generate. Hopefully, if I've got all my code correct, it's going to run across all the machines. Let's do it.

Before we check out some results, I have two small things to show you. Number one: if you'd like all the code for the tests I ran in this video, it's all available on GitHub. I'll put that link in the description so you can set up your own Mac and try to run the tests yourself. Number two: if you'd like to learn machine learning, particularly if you're a beginner who wants to learn right from the start in a code-first, beginner-friendly way, I teach machine learning from Python, pandas, scikit-learn and NumPy through to deep learning with TensorFlow and PyTorch in my beginner-friendly machine learning courses. In fact, all of the code we run in these videos is taught in those courses. Links to both will be in the description.
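One quick technical note before the results: for the same scripts to run unmodified across Apple Silicon, an NVIDIA deep learning PC and Google Colab, the code has to detect whichever accelerator each machine offers. Here's a minimal sketch of that device selection in PyTorch; an illustration of the idea, not necessarily the exact code in the repo:

```python
# Minimal sketch of per-machine device selection (assumed, not the exact repo code).
import torch

def get_device() -> torch.device:
    """Pick the best available accelerator: Apple GPU (MPS), NVIDIA GPU (CUDA), or CPU."""
    if torch.backends.mps.is_available():  # Apple Silicon (M1/M2/M3) Metal backend
        return torch.device("mps")
    if torch.cuda.is_available():  # NVIDIA GPUs (deep learning PC, Colab)
        return torch.device("cuda")
    return torch.device("cpu")

device = get_device()
print(f"Using device: {device}")
```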
But without any further ado, let's get into some results. First up: PyTorch, ResNet50, CIFAR100, batch size versus time per epoch, where lower is better. Let's see how all of these machines went. There we go: out front we have the M1 Pro, then the M3, the M3 Pro and the M3 Max; the NVIDIA Titan RTX did outstanding, and same with the Tesla V100. Far out. Now, the first trend you'll probably notice here is that as the batch size increases, we get lower and lower epoch times. Why is this? Because modern hardware, especially modern GPUs and hardware accelerators (and GPU power is what we're focused on here, by the way, since it's what influences these times the most), is very good at crunching numbers, so the more data you can fit on the GPU, in other words a larger batch size, the better. Something to keep in mind is that CIFAR100 is quite a small dataset, 32x32 images, so to me a batch size of 16 is not that practical; I really just had it there to start off with, and you're not going to see it in practice with a small dataset on modern hardware. It's only once we get up to the larger batch sizes that it becomes a more practical example. Ignoring those first few batch sizes (you can inspect them if you want), from about 128 onwards I see the M3 series all having basically the same performance. In fact, the M3 Max performs slightly worse on many of the smaller batch sizes. I'm not sure exactly why, but my intuition is that it has a larger surface area on the chip, more GPU cores and more CPU cores, so it actually takes a little more effort to move the data around; once we get up to the higher batch sizes, it really does even out. And although the M series chips are capable, the NVIDIA dedicated cards perform 5x, maybe even 10x in this case, better. But let's move on to the next one.

This is a larger dataset: Food101, 100,000 images of size 224x224. That's actually the same image size I use for training the computer vision models that power Nutrify at the moment, so this is a much more practical example, especially for me. We want lower here, and we see right from the start that the M1 Pro actually outperforms the M3. Keep in mind that's because I upgraded my M1 Pro when I first bought it: if we go over to GitHub and scroll down (spoiler alert), my M1 Pro 14-inch, two years old, has a 16-core GPU, whereas the baseline M3 MacBook Pro you can get has only a 10-core GPU. So it outperforms it there, but then we get a bit of a trend as we go up the range: the M3 Pro is faster, and the M3 Max faster again. Same as before, though, the NVIDIA chips perform best, especially the Titan RTX, which runs almost three times as fast as the M3 Max here. And these little fire emojis, like the one for the M3 at a batch size of 64, mean the RAM capped out: it couldn't hold 64 images in memory with ResNet50 at 224x224. That's something to keep in mind; we'll get to it later on.
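To make the batch-size-versus-epoch-time trend concrete, here's roughly what one of these timed PyTorch runs looks like. This is a sketch of the general shape of the benchmark, with assumed hyperparameters and transforms, not the exact code from the repo:

```python
# Sketch of a batch-size vs epoch-time benchmark: train ResNet50 on CIFAR100
# and time one epoch per batch size. Hyperparameters here are assumptions.
import time
import torch
import torchvision
from torchvision import transforms

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

train_data = torchvision.datasets.CIFAR100(
    root="data", train=True, download=True, transform=transforms.ToTensor()
)

for batch_size in (16, 32, 64, 128, 256, 512, 1024):
    loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    model = torchvision.models.resnet50(num_classes=100).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    start = time.time()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"batch_size={batch_size}: {time.time() - start:.1f}s per epoch")
```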
If you want to train larger models, or at least use larger models, and perhaps train them with a larger batch size, you'll definitely want more RAM than the base M3 has to offer. You can see here that even at a higher batch size we're maxing out the M3 Pro. If we quickly go back: the M3 Pro has 18 GB of RAM, whereas my M1 Pro has 32 GB. Just something to keep in mind if you're looking for a new machine.

Let's jump into the next test. This is probably the biggest model we tested throughout these experiments, and in this case higher is better; keep that in mind, it's samples per second. DistilBERT is quite a modern natural language processing model, IMDB is a bunch of IMDB movie reviews, and this test was text classification: classifying a sample as positive or negative, with a sequence length of about 200, I believe. By the way, this is a Transformer-based model, whereas ResNet50 is a convolutional neural network, so that may influence how well different hardware handles it. The M1 Pro performs quite well here and actually outperforms the M3 Pro, but again, keep in mind that I upgraded mine. What's the difference in GPU cores? We have two more GPU cores on the M1 Pro than on the M3 Pro. And this is training rather than inference, though it wasn't the whole model: it was the top two layers, plus one of the top Transformer layers. The M3 Max improves upon the M1 Pro and the M3 Pro; however, as we saw before, the NVIDIA GPUs go right ahead and just crunch the numbers. I'd say the M3 Max maxes out at around 50 or so samples per second, and the closest NVIDIA chip is maybe 140, so the NVIDIA chips are nearly 2.5 to 3x the M3 Max here. Then we see the little flame emojis: the Apple M3 bowed out at a batch size of 32, so 64 didn't make it, and 128 of course wasn't going to either. Once we get to a really large batch size of 256, we start to see the others drop off too: the M3 dropped off way back, the M3 Pro drops off here, and the NVIDIA Tesla V100 with its 16 GB of VRAM drops off there. None of the machines could make the 512 batch size.
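For reference, here's roughly what that partial fine-tune setup looks like with the Hugging Face transformers library: freeze everything, then unfreeze the two classification-head layers plus the last Transformer block, as described above. The layer names come from the standard DistilBERT implementation; treat this as a sketch of the approach rather than the exact benchmark code:

```python
# Sketch of a partial fine-tune of DistilBERT (assumed layer names from the
# Hugging Face transformers DistilBERT implementation, not the exact repo code).
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # positive / negative IMDB reviews
)

for param in model.parameters():
    param.requires_grad = False  # freeze the whole model first

# Unfreeze the top two layers (pre_classifier + classifier) and the last Transformer block.
for module in (model.pre_classifier, model.classifier, model.distilbert.transformer.layer[-1]):
    for param in module.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```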
Next up we have TensorFlow. Something to note with TensorFlow versus PyTorch: in my experience, TensorFlow is slightly faster, controlling for the model and the dataset. However, this is likely because of my hand-written PyTorch training loops; not poorly written, exactly, but they could probably be optimized a bit more, like I'm sure TensorFlow's have been. So what do we have here? A very similar trend to what we saw with PyTorch on the CIFAR100 dataset: as the batch size increases, the time per epoch starts to go down before it saturates. Why is that? Because we want to reduce the amount of data we're copying back and forth from memory; we want to pack the GPU with as much data as possible so it can leverage parallelization (I'll practice saying that for the next video). Then we have a similar story: across the M3 series, the bigger the chip, the faster the time, right across the board. The M3 performs probably the slowest across almost all examples, except at the start with the smaller batch sizes; I'm paying most attention to batch size 64 and up on this smaller dataset, maybe even 128 and up. And then, same story, the NVIDIA GPUs are leaps and bounds ahead. Actually, not leaps and bounds in this case: the M3 Max is quite on par with the Tesla V100 here. The NVIDIA Titan RTX, however, just rips ahead on CIFAR100 with ResNet50.

Now we have Food101, which is again the larger dataset, and what do we have here? At batch size 32, that's actually not too bad: the M1 Pro outperforms the M3 Pro across the board, which is pretty darn good for a two-year-old machine, mind you. And it's actually not that big a jump from my two-year-old M1 Pro to the M3 Max. That just goes to show: you might have seen a lot of benchmarks and Geekbench scores and whatnot for the M3 Pro and M3 Max, and of course they're improved over the M1 Pro, but you're probably not going to want to train neural networks on over 100,000 images (that's how many are in Food101) with the base M3 chip. You could write your code on it and then train somewhere else, but if you want to actually train models, I'd be going M3 Pro and above. The trend continues, though: the NVIDIA chips perform best here. Not as big a gap as there was before, but still possibly 2x better performance with a dedicated GPU.

Finally for the framework tests, we have TensorFlow on the IMDB dataset with a small transformer. This is an attention-based Transformer network, just one layer, so a relatively small model, really just to do some benchmarks. Again, as we increase the batch size we get a slight improvement in average time per epoch, and it's the same story across the board: lower is better here, and we get a lower time on the M3 Max compared to the M3 Pro, compared to the M3, compared to the M1 Pro, which makes a lot of sense; it's a two-year-old machine after all. At the larger batch sizes we get the fastest performance from the NVIDIA chips. Let's call it about 15 for the Titan RTX (I could have put numbers on the chart, which would have been much easier, but it does look nice and aesthetic) and maybe about 115 for the M3 Max, so almost an 8x performance difference in favor of the Titan RTX. Again, this is a relatively small model; we did see closer performance with the larger datasets and more information packed onto those GPUs.
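Since the TensorFlow tests mirror the PyTorch ones, here's the equivalent sketch in Keras, again with assumed settings rather than the exact repo code. On Apple Silicon, TensorFlow picks up the Metal GPU automatically once the tensorflow-metal plugin is installed:

```python
# TensorFlow/Keras version of the same batch-size timing idea (a sketch with
# assumed settings, not the repo code).
import time
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar100.load_data()
x_train = x_train.astype("float32") / 255.0

for batch_size in (16, 32, 64, 128, 256, 512, 1024):
    model = tf.keras.applications.ResNet50(
        weights=None, input_shape=(32, 32, 3), classes=100  # train from scratch on CIFAR100
    )
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    start = time.time()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    print(f"batch_size={batch_size}: {time.time() - start:.1f}s per epoch")
```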
So let's finish up these results with llama-cpp-python, the Python bindings for llama.cpp (I'll put all the links in my GitHub and in the description). We have a Llama 2 7B Chat model, Q4, meaning quantized to 4-bit, so really minimized, in GGUF format, a format that has been designed specifically to run on Apple Silicon, or almost anywhere, really. Here it's tokens per second, so higher is better, and we can see straight from the start that the trend is: the more GPU cores you have, the better. In this case, with my Apple M1 Pro that I bought two years ago, it was worth upgrading those cores, because it's now performing faster than an M3 Pro that has literally just come out. It does have two more GPU cores, but again, they are two years old, so I'm really happy with the M1 Pro. The M3's result we kind of knew was coming, and the overall trend line makes a lot of sense too: the Apple M3 Max with those 30 GPU cores is out in front, generating on average somewhere around 35 to 48 tokens per second. The test here was that I asked Llama 2 7B Chat 100 different questions, let it generate up to about 500 tokens per question, and then measured the average token generation speed across those 100 questions. Pretty exciting results there.
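Here's roughly how a tokens-per-second measurement like that can be scripted with llama-cpp-python. The model path and the two example questions are placeholders, and this is a simplified sketch rather than the exact benchmark script:

```python
# Simplified sketch of a tokens-per-second measurement with llama-cpp-python.
# The model path and questions below are placeholder assumptions.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # a 4-bit quantized GGUF model file
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal on Apple Silicon)
)

questions = ["What is machine learning?", "Explain backpropagation simply."]  # 100 in the real test
speeds = []
for question in questions:
    start = time.time()
    output = llm(f"Q: {question} A:", max_tokens=500)
    elapsed = time.time() - start
    n_tokens = output["usage"]["completion_tokens"]  # tokens actually generated
    speeds.append(n_tokens / elapsed)

print(f"Average generation speed: {sum(speeds) / len(speeds):.1f} tokens/second")
```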
Then we have the Geekbench ML scores. I've seen Geekbench for almost everything else, film editing, cinema, all that sort of jazz, but I hadn't quite yet seen it for ML. There is, though, an official Geekbench ML app, and I ran it across all of the M series machines that I have. Just note that it was version 0.6.0, so these results may vary as the version increases, and all of the Geekbench ML tests are inference only, not training. As we can see, the M3 Max has the highest CPU score, though the CPU scores weren't actually that different across the board for the M3 series; they're within about 50 points of each other, despite the M3 Max having 14 cores versus the M3's eight. We're going to pay most attention to the GPU score, because that's what we're focused on for modern machine learning, and we see the trend from the rest of the results, which makes a lot of sense: more GPU cores, higher the score. However, my M1 Pro has more GPU cores but a lower score than the M3, and yet, as we saw in the training examples with TensorFlow and PyTorch, the M1 Pro performed on par with or better than the M3, and sometimes even better than the M3 Pro. So keep that in mind: I always take these Geekbench scores with a grain of salt. I like to try things I would practically do, and that's why I included the TensorFlow and PyTorch tests, training actual models, writing actual code. Then we have the Neural Engine score. Not too surprising to see the M1 Pro lowest here; they all have a 16-core Neural Engine. The M3, at 8,399, was slightly below the M3 Max, but I'm not sure what happened next: the M3 Pro had the highest Neural Engine score. I ran these a bunch of times, three or four times, because I thought the M3 Max and M3 Pro would be basically the same, but they're not; who knows exactly what's going on there. GPU score-wise, the M3 Max is leaps and bounds over everything else, as expected. But for the Neural Engine, to be honest, I don't think you'd notice much of this in practice; the Neural Engine is mostly for inference with Core ML, as far as I know. There may be one thing coming in the future that aids the use of the Neural Engine, but at the moment the Neural Engine is still a little bit of dark magic to me.

Let's jump into a discussion. I've hinted at some of these things throughout the video. The M chips are good for entry-level ML tasks; if you're learning machine learning, they're fantastic, great machines. I use an M1 Pro, I've used it every day for two years, it's excellent for everyday use, and as we've seen from the tests, it's quite future-proof: my two-year-old M1 Pro is still performing on par with brand-new machines. The M series chips also seem to be maturing in terms of software, and this is really important: I didn't have any hiccups during setup. If you go to the GitHub with all the code, I ran that setup process across four different machines and it worked every single time. Of course it took me a bit to refine it, but it will get you up and running code, especially machine learning code, in about 10 to 15 minutes, so that's really good to see. More RAM and more GPU cores equals better; we kind of knew that from the start. Machines with more RAM and more GPU cores do better on larger datasets, and that makes a lot of sense: more RAM means you can store and use bigger models and larger batch sizes, and more GPU cores means you can crunch numbers faster. In my experience, TensorFlow is slightly faster than PyTorch; again, I'm not training super-large state-of-the-art models, but I am training pretty sophisticated ones, and this is likely because of my own hand-coded PyTorch loops. Finally, a dedicated NVIDIA GPU is still going to perform best. That's my deep learning PC: I use my M1 Pro to connect to it and train computer vision models on there all the time, and it works like a charm.

Finally, recommendations for ML. If you're looking at buying an M series chip, please avoid the 8 GB unified memory option. You could get the base M3, but if you're going to upgrade anything, please upgrade the memory, because you're just going to run into a lot of headaches trying to load modern machine learning models. Ideally you'll upgrade in this order: RAM first, then GPU cores. RAM is going to allow you to train larger models with larger batch sizes, and all of the GPU cores across the new M3 series are quite quick. So if you have the budget, upgrade the RAM; if you have even more budget, upgrade the GPU cores. However, if you want to save some money on upgrading your Mac: just buy one of the baseline M3 Pros with 18 GB of RAM, or potentially upgrade that, save your money there, and buy a dedicated NVIDIA GPU, because that's going to give you the best bang for your buck. You're going to be able to train models much faster on a dedicated chip, for now anyway; Apple may do something that improves their M series chips. And the good news is, if you have an existing M series chip, you don't need to upgrade, or you likely won't need to. I'm personally not upgrading my M1 Pro to an M3 Pro, because, as you saw, my M1 Pro kind of already outperforms the M3 Pro. And if you do buy a new M3 series machine (this goes for M2 as well), it will likely be good for at least three to five years. I think Apple went a little too hard on the M1, because it's still performing incredibly, and we look at the M3s and go, oh, I was kind of expecting the same quantum leap we had with the M1. But that's not necessarily a bad thing, because you don't need to upgrade your machine every year.

Then finally, one little extra something to check out for the future, and I've put a little thinking face here because I've only just stumbled upon it: MLX, an array framework for Apple Silicon developed by Apple Machine Learning Research. Potentially, MLX might take advantage of all of the beautiful features of Apple Silicon in the future. I haven't tried it out too much, but it may get us some further speed increases down the line, rather than using TensorFlow and PyTorch bindings; if a framework is developed specifically for a certain type of hardware, generally it's going to perform better. So check that out; potentially that's going to warrant some future speed tests. But happy machine learning, and I'll see you in the next video.
Info
Channel: Daniel Bourke
Views: 150,113
Id: cpYqED1q6ro
Length: 24min 3sec (1443 seconds)
Published: Sat Dec 23 2023