LIVE: Parallel programming for training and productionization of ML/AI systems

Hi friends, a very good evening, and thank you for joining in. Could you please confirm in the chat window that you can see my webcam and hear my voice clearly? I'll start sharing my screen as we progress; I just want to be sure there are no technical glitches before we begin. The scheduled time is 7 p.m. sharp, so let's wait a few minutes. Thank you to everyone who joined a little early; some of our team members and I have been answering questions in the chat. Good, no technical glitches, so let's wait a few more minutes and then dive in.

Somebody is asking me to say something in Hindi. I'm not fluent in it, Hindi isn't my mother tongue, so I'm much more comfortable in English, where I'll make fewer mistakes. For this session, as we made clear in the announcement video, I hope all of you have seen the previous session, or at least know about multi-threading and multiprocessing in Python in general, because we'll build on top of it for the rest of this session.

"Can we expect a new course on data structures and algorithms in Python from Applied AI?" Vamsi, our interview-prep course already covers the whole of data structures and algorithms. We initially designed it with C in mind, but on students' request we added Python code walkthroughs and detailed explanations for all the code snippets, and we also provide code in other languages like Java. We try to explain every concept from a pseudocode perspective, following the CLRS book, Introduction to Algorithms, so that tomorrow if you want to use some other language, say Go or Scala, you can use any language of your choice.

"How much time do I need to spend to complete the Applied AI course in one year?" From what we observe, an average student needs about 10 to 15 hours a week. Your learning will be slow in the first month or so, but you will ramp it up. Some people put in more time on weekends and less on weekdays, and that's perfectly all right; that's why we quote 10 to 15 hours per week.
It also depends on your goal: if you want to get into the best companies, it's better to put in 15 hours per week, which works out to roughly two hours a day plus a couple of extra hours over the weekend. If you can put in that effort diligently, without distractions, day in and day out, it's fairly easy to complete the course, including the assignments, in under six months.

It's already 7 p.m., so let's get started; I'll switch to my screen. The plan of action is as follows. This is a two-hour live session, there is a lot to cover, and I don't want to stretch this into another session, so I'll spend about 90 to 100 minutes covering how parallel algorithms can be used for training, whether for simple data science, machine learning, or deep learning work, and how they can also be used for productionization. There will be a few places where I do code walkthroughs, but a lot of the code walkthrough was done in the previous session, where we explained how to do multiprocessing and multi-threading; here it will be more about concepts.

This is a topic I personally love, because I worked on these specific problems extensively at different companies. Early in my career, before today's great libraries existed, I built many of these things from scratch, from the ground up, primarily parallel algorithms for data science and machine learning applications, so I've had some really interesting experiences that I'll share in this session.

Here are my notes. Let's spend about 90 to 100 minutes going through everything, and then I'll open up the floor and try to answer as many questions as I can. (Yes, the screen is being shared.) Please understand that in a public live session it's almost impossible to answer every question; anything I can't get to, feel free to email us at team@appliedaicourse.com, and either I or one of our mentors will respond as soon as possible.

So today's session is on parallel programming for data science, machine learning, and deep learning applications, focusing a little on training and a little on productionization. For productionization I'll do a more detailed code walkthrough: we'll do parallel-programming-based productionization using web APIs. We've already done sessions on Flask APIs in Python; Flask is one of the most popular ways to build web APIs in Python and is used extensively in industry. So we'll build on top of Flask APIs and show how to do parallel processing for productionization as well, with some code walkthroughs and a few new concepts.
We are big fans of introducing concepts as we solve problems. In the previous session last week we learned how to do multiprocessing and multi-threading in Python, what the challenges of multi-threading in Python are, and why multiprocessing is used so extensively; we also discussed libraries like joblib and did a ton of code walkthroughs. We'll build on top of that: in this session we will design parallel algorithms for training machine learning models, whether in data science, machine learning, or deep learning, with a few examples in each, and then spend some time on how you can use parallel processing for productionization of models. Let me keep the chat window open on my phone for a second, so that if there is any technical glitch I'll see it. I'll come back to your questions after we finish all the concepts and code walkthroughs; I'll try to allocate 20 to 30 minutes at the end just for Q&A.

Most importantly, please realize that parallel processing is a very vast subject. You can spend weeks learning it; it's a non-trivially large subject, multiple books have been written on it, and I took a parallel-processing course as a graduate student. Please think of this as an introductory session on parallel processing for machine learning applications, where we will not be able to go into every detail; we will introduce new ideas as we solve a bunch of problems, which I think is the best way, because it helps you connect the concepts to theory.

Parallel processing is a vast area partly because there are many, many models of computation, and this is important to note. You could have multi-core, multi-threaded CPUs: this is one computational architecture on which you can do parallel processing, and it's what we will focus on. In the previous session we talked about hyper-threaded processors; I showed you my computer, which has six cores, and because it's an i7 with hyper-threading, there are 12 threads we can work with in parallel. But the general area is much broader: computation on GPUs is also parallel processing (we discuss how GPUs are used for deep learning in our course videos); there are FPGAs, custom hardware built to do computations extremely fast; and there are clusters of computers: we have done a few sessions on Spark, one of them publicly available, where we talk about how cluster computing, or distributed computing, works. So there are many,
many computational models in which you can perform parallel processing. In this session we will stick to a system with multi-core CPUs and multiple threads. We will not go into GPUs, FPGAs, or clusters here: distributed or cluster computing (how Spark and Hadoop work) we have discussed earlier; GPUs, and how a GPU uses its thousands of cores along with video RAM, we cover in the course videos on TensorFlow and Keras; and FPGAs are fairly advanced and beyond the scope of this course. Just wanted to give you that heads-up.

Another very important question many people have: what is the difference between parallel computing and distributed computing? I've seen people use these terms in many different ways, and there are slightly varying definitions of both. In general, when somebody says parallel programming, they're talking about multiple cores on a box, or multi-threaded systems, or GPU-based systems. Notice that in all of these, the memory is shared by the multiple cores, the multiple threads, or the thousands of cores in a GPU; everything happens in one box, literally one beefy computer where all the memory is shared. Distributed computing, on the other hand (we've done three or four live sessions on Spark and on how Hadoop works internally, including code walkthroughs), uses distributed memory: you have multiple computers connected via a network, which could be very fast, 1 Gbps, 10 Gbps, even 100 Gbps, but each computer has its own RAM, so the data itself is distributed across machines. There is RAM on this computer, RAM on that computer, and so on. If one of those computers has six cores, those cores cannot directly access the data sitting in another computer's RAM; if they need it, it has to be fetched over the LAN and copied into the local RAM before it can be accessed. Distributed computing is typically used when you have very large data. Earlier people used Hadoop a lot, but nowadays most people use Spark for machine learning and data science applications, and we've covered Spark in lots of detail in the course videos. Today we will focus on a single computer, whether one you personally own or a cloud machine: if it has 6, 12, or 30 cores, supports multi-threaded code, or has GPUs, how do we do parallel processing in that context, not the distributed computing context? Just wanted to make that clear.
Let's start with some very simple problems. Imagine you want to add two matrices A and B and generate C = A + B. Standard matrix addition is just two nested for loops: you go through each cell of A and the corresponding cell of B and add them, i.e. C[i][j] = A[i][j] + B[i][j], with i running over all rows and j over all columns. That's how you write it on a standard computer.

On a parallel computer you have multiple cores. A major assumption I'll make here is that both A and B fit into RAM. Say A is an n × m matrix; B must also be n × m, otherwise we can't add them. There are many ways to parallelize this; one of the simplest is the following. Take the first n1 rows of A and the first n1 rows of B and add them on one core, say core 1 (suppose your computer has six cores). To add one cell of A to the corresponding cell of B, you don't need any other cells, so the first n1 rows of A and B are sent to core 1, the next n2 rows of A and B are sent to core 2,
and so on and so forth. This is one simple way to create parallelism: while one core is generating the first n1 rows of C, another core is generating the next n2 rows, and so on; the first n1 rows of your final result C = A + B are produced on the first core, the next chunk on the second core, and so forth. This is almost trivially parallelizable. The technique we are using here is called data parallelism. Why? Because if you have six cores c1, c2, ..., c6, each core sees a different subset of the data: c1 sees the first n1 rows of A and B, c2 sees the next n2 rows, and so on. You're breaking up, parallelizing, the data itself, and each core operates on a different subset of it; see the sketch below.
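Here is a minimal sketch of this data-parallel addition (my own illustration, not code from the session), assuming six cores; the sizes, the chunk count, and the add_chunk helper are all made up for the example. The three matrices live in shared memory, a concept we'll walk through properly in a few minutes, so every worker process can see them without copying:

```python
import numpy as np
from multiprocessing import Pool, shared_memory

N, M = 1200, 300  # illustrative sizes; A, B, C are each N x M

def add_chunk(args):
    """Worker: compute C[lo:hi] = A[lo:hi] + B[lo:hi] in shared memory."""
    names, lo, hi = args
    shms = [shared_memory.SharedMemory(name=n) for n in names]
    A, B, C = (np.ndarray((N, M), dtype=np.float64, buffer=s.buf) for s in shms)
    C[lo:hi] = A[lo:hi] + B[lo:hi]   # each worker touches a disjoint row block
    for s in shms:
        s.close()

if __name__ == "__main__":
    nbytes = N * M * np.dtype(np.float64).itemsize
    blocks = [shared_memory.SharedMemory(create=True, size=nbytes) for _ in range(3)]
    A, B, C = (np.ndarray((N, M), dtype=np.float64, buffer=s.buf) for s in blocks)
    A[:] = np.random.rand(N, M)
    B[:] = np.random.rand(N, M)

    names = [s.name for s in blocks]
    bounds = np.linspace(0, N, 7, dtype=int)       # 6 row chunks, one per core
    jobs = [(names, bounds[i], bounds[i + 1]) for i in range(6)]
    with Pool(processes=6) as pool:
        pool.map(add_chunk, jobs)                  # n1 rows to core 1, n2 to core 2, ...

    assert np.allclose(C, A + B)                   # same answer as the serial loop
    for s in blocks:
        s.close()
        s.unlink()
```

The result is exactly what the two nested loops would produce; the row blocks are simply computed concurrently.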
Now, you might say: all this is cool, but why are we taking n1 rows here? Why can't I take n1 columns, send them to core 1, the next n2 columns to core 2, and so on? The reason is that in most programming languages and libraries, a matrix is stored in row-major order. What does that mean? Suppose you have row 1, row 2, row 3, row 4. The way the matrix is actually stored in memory is sequential: first row 1, then row 2, then row 3, then row 4. Even a matrix is internally stored as a one-dimensional array, and this way of storing, first row first, then the second row, then the third, and so on, is called row-major order. In column-major order you'd store the first column first, then the second column, and so on. Most modern programming languages and libraries use row-major order. So if I want the first n1 rows, they are stored consecutively in memory, and the next n2 rows are likewise consecutive, which makes accessing row chunks much easier. That's the core idea, and it's why we chunk by rows. Matrix addition is one of the simplest of all applications; we'll see a slightly more complex example shortly.
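A quick way to see row-major order in action (my own illustration): NumPy stores arrays in row-major, "C" order by default, so a block of rows is one contiguous stretch of memory while a block of columns is not:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)         # rows are laid out one after another
print(a.ravel(order='C'))               # [ 0  1  2 ... 11]: row 1, then row 2, ...

print(a[0:2].flags['C_CONTIGUOUS'])     # True: a row chunk is contiguous memory
print(a[:, 0:2].flags['C_CONTIGUOUS'])  # False: a column chunk is strided
```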
There's one more important thing before we go to other examples. We have A + B = C, and we want A, B, and C all to be stored in shared memory: imagine your RAM, with A, B, and C in it, and you want that memory to be accessible to cores c1 through c6 and to all the processes running on them. How do you do that? There is a concept called shared memory. We discussed it in the previous session, but let me show you a very simple code snippet from the official Python documentation; I'll share these notes at the end of the session so you can go through it.

Step by step: first we import NumPy and create a NumPy array. It could be anything, a NumPy array, a SciPy structure, a DataFrame, it doesn't matter. Next, from multiprocessing, the module we learned about in the previous session, we import shared_memory. Now, one very important thing: if you have processes p1 through p6, with p1 running on core 1, p2 on core 2, and so on, each process has its own dedicated memory that only the code in that process can access. What we want to create is a shared memory region, accessible to all of these processes, and we will store A, B, and C there.

Creating it is actually very simple in Python: shm = shared_memory.SharedMemory(create=True, size=a.nbytes). You're saying: create a shared memory chunk whose size is a.nbytes, because you want to place a there. Remember, the moment you created the NumPy array a, it's already in the creating process's own memory; now you want to copy its contents to the shared location. Next, the docs say: create a NumPy array backed by shared memory. We create a new array, b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf). What buffer=shm.buf does is tell NumPy: create an array b, with the same shape and dtype as a, but place it in this memory buffer we have already allocated. Since that buffer is shared memory, b can be accessed by multiple processes. Then b[:] = a[:] copies everything from a to b; if you print b now, you'll get all the contents of a. Here comes the most interesting part: you can get the name of the shared memory with shm.name, which returns an alphanumeric string. Remember, shm is just a variable name inside this process, say process p1, so another process can't refer to it as shm; instead you get this generic, unique name, the name of the shared memory as understood by Python. Now you can start one more terminal, a completely different terminal (changing terminals effectively creates a new process), and in that other process access the shared memory using this name. In the child process you again say from multiprocessing import shared_memory, and then you say
existing_shm = shared_memory.SharedMemory(name=...), giving the unique name and assigning the result to a variable, say existing_shm. Now something interesting happens: if this code runs in process p2 while the original ran in p1, p2 can access whatever is in that region, because we are using the unique name, the 'psm_...' string. But whatever is there is just a bunch of bits and bytes; we don't want to treat it as bits and bytes, we want to treat it as a NumPy array. So, very simple: create a NumPy array with exactly the same shape and dtype as a. Here the variable a doesn't exist (a is a variable inside p1), so you specify the shape and dtype explicitly and say buffer=existing_shm.buf. Now whatever is in that region is treated in this process as a NumPy array, and you can start modifying it: if process p2 modifies something in it, the modified value is observed everywhere. That is how you create shared memory that can be accessed across multiple processes. The reason I'm showing you the walkthrough from the official Python documentation (this is from the most recent version, Python 3.8) is that if you want to read more, the whole article is clear and all the function details are there; I'll provide the link along with this document. So for our A, B, C, assuming they're NumPy ndarrays, all you have to do is place them in shared memory, exactly like this.
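Putting the walkthrough together, here is the example essentially as it appears in the official Python 3.8 documentation (the name 'psm_21467_46075' is just whatever shm.name happens to print on your machine):

```python
# Process 1: create a shared block and copy a NumPy array into it
import numpy as np
from multiprocessing import shared_memory

a = np.array([1, 1, 2, 3, 5, 8])
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)  # backed by shared memory
b[:] = a[:]                                             # copy a's contents into b
print(shm.name)                                         # e.g. 'psm_21467_46075'
```

```python
# Process 2 (say, a second terminal): attach to the same block by name
import numpy as np
from multiprocessing import shared_memory

existing_shm = shared_memory.SharedMemory(name='psm_21467_46075')
c = np.ndarray((6,), dtype=np.int64, buffer=existing_shm.buf)  # shape/dtype must match a
c[-1] = 888   # this write is immediately visible in process 1's b as well
```

When you're done, every process should close() its handle, and exactly one of them should unlink() the block to release it.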
Now let's go to matrix multiplication. You might say: why all these matrix operations, let's get to machine learning models. Please understand that matrix operations are foundational to the whole of data science, machine learning, and deep learning. Look at the update rules of any algorithm, whether a machine learning algorithm like logistic regression, linear regression, or linear SVM, or a deep learning algorithm: all of them are basically matrix, vector, or tensor operations (we'll come to vectors and tensors in a few minutes), multiplications and additions of matrices, vectors, or tensors. That's what most machine learning implementation is about. In our course we implement some algorithms from scratch, for example linear regression, which is also one of the assignments, and if you look carefully, the whole thing is a bunch of matrix, vector, and tensor operations. Matrix multiplication in particular is fundamental to many scientific applications beyond machine learning and deep learning: a lot of physics simulations and mathematical simulations involve tons of matrix operations, especially matrix multiplication. There are many variations and tons of parallel matrix multiplication algorithms; there are actually full-fledged books on parallel matrix multiplication alone. We'll cover some simple ones.

Let's say we want C = A · B, where A is n × m and B is m × k; C will be n × k, and the inner dimension m must match on both sides or it won't work. The simplest approach uses data parallelism again. As always, process 1 runs on core 1, process 2 on core 2, and so on. What is matrix multiplication? You take a row of A and a column of B, multiply them, and you get one entry of C. So process 1 gets the first n1 rows of A together with all of B: it reads the whole of B, but it only uses those n1 rows of A. The first row times the first column gives c11, the first row times the second column gives c12, the first row times the third column gives c13, and so on; the second row gives c21, c22, and so forth. In this way process 1, running on core 1, produces the first n1 rows of C. Similarly, the next n2 rows of A are multiplied with the whole of B in process 2, as in the sketch below.
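A minimal sketch of this row-block scheme (my own example, with made-up sizes). For simplicity, the chunks of A and the whole of B are pickled to each worker here; in a real version you would place A and B in shared memory, as above, so that B isn't copied six times:

```python
import numpy as np
from multiprocessing import Pool

def multiply_chunk(args):
    """Worker: one row block of A times the whole of B gives a row block of C."""
    A_chunk, B = args
    return A_chunk @ B

if __name__ == "__main__":
    n, m, k = 600, 400, 300
    A, B = np.random.rand(n, m), np.random.rand(m, k)

    chunks = np.array_split(A, 6, axis=0)          # n1, n2, ... rows for 6 cores
    with Pool(processes=6) as pool:
        row_blocks = pool.map(multiply_chunk, [(c, B) for c in chunks])

    C = np.vstack(row_blocks)                      # stack the row blocks back into C
    assert np.allclose(C, A @ B)
```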
Here too we are using data parallelism: B itself is commonly shared by p1, p2, and all the other processes, but only chunks of A are distributed, one chunk to process 1, another used by process 2, and so on. This is one of the simplest ways; there are many others.

Another approach (you can check the Wikipedia page "Matrix multiplication algorithm", which lists a bunch of parallel and distributed algorithms) goes like this. To construct C = A · B, break A into four sub-matrices A11, A12, A21, A22, and break B into four sub-matrices the same way. Note that each of these is a sub-matrix, a block: A12 here is not the single cell a12, it's a whole sub-matrix. Then you can write C in block form: the first block is C11 = A11·B11 + A12·B21, and so on, where of course the block sizes must be compatible or the multiplications won't be possible. You multiply one sub-matrix of A with the matching sub-matrix of B, take the resultant, and add it to the product of the other pair, and that gives you the first sub-matrix of C. This approach is more general than the previous one: there, the whole of B was fixed and only A was broken into parts; here both A and B are chunked into sub-matrices. If you know basic matrix multiplication, it's fairly straightforward to verify.

How is this parallelized? This is slightly more interesting. Each of the eight block products is computed independently by a different process, with the resulting values placed in a shared memory location; then, in the parent process (we talked about parent and child processes), you perform the final additions. So all the child processes do the matrix multiplications, and the parent does the additions at the end. If you want to learn more algorithms, the Wikipedia link covers them. Now let me walk you through the small pseudocode given there. Whenever you read the term fork, it's a term from Unix/Linux computers: fork means create a new process, equivalent to the process creation we
have seen in Python. We already talked about the join operation in the previous session. The pseudocode for multiplying by breaking A into four sub-matrices and B into four sub-matrices goes like this: first partition A, B, and C (T is a temporary matrix); then fork one process to compute one block product, fork another to compute the next, and so on, as I explained; then join, which means wait for all of those jobs to complete; and once all of them are complete, just add things, because at the end of the day you need C11 = A11·B11 + A12·B21: you get one product from one child, the other from another child, and you add the two to get the final result. That addition is what happens after the join. So what you see there is a simple pseudocode of exactly what I just explained; a Python version of the same fork/join pattern is sketched below. This is one way of doing matrix multiplication, so let me move on after this.
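Here is that fork/join pattern as a small Python sketch (my translation of the Wikipedia pseudocode, not code shown in the session): the eight sub-matrix products are "forked" out to a pool of child processes, and the parent "joins" on them and then performs the final additions:

```python
import numpy as np
from multiprocessing import Pool

def matmul(args):
    X, Y = args
    return X @ Y

if __name__ == "__main__":
    n = 512
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    products = [(A11, B11), (A12, B21),   # C11 = A11.B11 + A12.B21
                (A11, B12), (A12, B22),   # C12
                (A21, B11), (A22, B21),   # C21
                (A21, B12), (A22, B22)]   # C22
    with Pool(processes=8) as pool:       # "fork": one product per child process
        P = pool.map(matmul, products)    # "join": returns when all children finish

    C = np.block([[P[0] + P[1], P[2] + P[3]],   # the parent does the additions
                  [P[4] + P[5], P[6] + P[7]]])
    assert np.allclose(C, A @ B)
```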
Another small topic I want to touch upon, because there is so much scope in parallel algorithms, and many of you may not know this architecture unless you've studied computer science. Data is permanently stored on a hard disk or solid-state drive, and it is loaded into RAM. Many people think the CPU accesses data directly from RAM, but between RAM and the cores there are two more levels of memory: cache memory and registers. RAM nowadays runs into a few GB, and disks into a few hundred GB; cache typically runs into a few MB, and registers are just a few bytes, tiny memory units that sit on each core and are extremely fast to access. Cache exists because it is roughly 10 to 100 times faster to access than RAM: accessing data from disk is very slow, from RAM fast, from cache very fast, and from registers very, very fast. Modern microprocessors like the i7 have multiple levels of cache, called L1, L2, and L3. You might say: why don't we have everything in cache? Because cache is very expensive compared to RAM; they can't put 1 GB of cache on a chip at reasonable cost. So, as a strategy to cut costs, you get multiple GB of RAM, a few MB of cache, and a few bytes of registers. This is the standard architecture; a lot of people don't know about cache, which is why I introduced it. If you want to know more, you'd study computer organization and how operating systems work internally, but this much is enough for our context.

Now the most interesting part: if our A, B, C matrices were in cache rather than RAM, our code would run much, much faster, because the cores can access cache 10 to 100 times faster, and at the end of the day they have to fetch this data to perform any operation. Because cache has become such an important part of modern microprocessors, there is a whole area of algorithms called cache-aware algorithms: there are cache-aware matrix multiplications that try to use the cache as optimally as possible so the whole multiplication speeds up significantly. So there is a lot of complexity here: if you really want to become an expert in parallel computing and pursue this topic in full depth, you have to know a lot of computer architecture. For most data science and machine learning applications you don't need that depth; even applying some basic ideas will speed up your workloads enormously.

(I see the comment section: yes, there are many, many algorithms for matrix multiplication, recursive ones, distributed ones, tons of them. In the interest of time I'm only covering the basic ones, to give you an intuition for how to parallelize; as I said, there are full-fledged books on parallel matrix multiplication alone, so we won't be able to go into all of it.)

The challenge is that you can't grow cache to a few GB; we're still not there, it's fairly expensive, and your computer would cost a few thousand dollars if you increased the cache like crazy. So most modern computers have RAM, an SSD or hard disk, a decent amount of cache, multiple cores, and registers, and there are cache-aware matrix multiplication algorithms we won't cover in depth. For those who want to understand L1/L2/L3: L1 cache is the smallest, the fastest, and the closest to your CPU; then comes L2, then L3, then RAM. Think of it as a memory hierarchy, with registers also a type of memory: as memory gets closer and closer to the CPU, the speed increases, the cost increases, and the size reduces. Basic logic: you can't stuff 1 GB of RAM inside each core; it's just not possible today. So that's cache-aware algorithms, a huge research area with entire books and tons of research papers written on it.
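Here's a toy demonstration of why this matters (my own example, nothing from the session): the two loops below do the same number of additions, but the row-wise one streams through contiguous memory while the column-wise one jumps a whole row's worth of bytes between reads, so it keeps missing the cache:

```python
import time
import numpy as np

a = np.random.rand(4000, 4000)   # ~128 MB, far bigger than any cache level

t0 = time.perf_counter()
row_sum = sum(a[i, :].sum() for i in range(4000))   # contiguous, cache-friendly reads
t1 = time.perf_counter()
col_sum = sum(a[:, j].sum() for j in range(4000))   # strided reads, cache-hostile
t2 = time.perf_counter()

print(f"row-wise:    {t1 - t0:.3f}s")
print(f"column-wise: {t2 - t1:.3f}s")   # typically noticeably slower
```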
Another question some of you may have: we discussed matrix operations, but what about vector operations? A vector is just an n × 1 matrix, so whatever we discussed for matrices also holds for vectors. Some of you may say: in deep learning we use tensors; what about tensors? Imagine a three-dimensional tensor of shape n × m × k: it is nothing but a stack of k matrices, each n × m. All the logic we developed carries over; you can chunk a tensor in whichever way you want. With a matrix we chunked it into rows; with a tensor you can chunk it into cuboids, so one whole slab could be one chunk. It all depends on how you chunk the data and how you design the parallel algorithm, but whatever we learned for matrices scales to tensors too: for instance, with k matrices of size n × m, you could send each n × m matrix to a different process if the operation allows it. Whether you have vectors, matrices, or tensors, the basic concepts we learned apply everywhere. Again, we won't be able to spend a lot of time on all the internals; this is primarily for data science and machine learning folks.

Now let's get into machine learning training. Whether you have logistic regression, linear regression, or a linear SVM, at the end of the day how do you write the code when you implement it from scratch? You have a weight vector w ∈ R^d, a d-dimensional vector. At each iteration, whether you're using gradient descent, stochastic gradient descent, or batch SGD, you update each component w_i (where w_i is the i-th component of w):

w_i_new = w_i_old − η · (∂L/∂w_i, evaluated at w_i_old)

where η is the learning rate. Basic logic. The derivative itself changes from linear regression to logistic regression to SVMs, and L here means the loss function plus regularization: for logistic regression, logistic loss (which is nothing but cross-entropy loss) plus a regularizer; for linear regression, squared loss plus whatever regularization you want; for a linear SVM, hinge loss plus L2 regularization. So everything I discuss here applies to logistic regression, linear regression, and linear SVMs alike.

One of the most important pieces is this: how do you compute ∂L/∂w_i? Depending on how you implement it, the loss typically involves a summation over the data points; let me use j for the points, to avoid confusion with the component index i. This summation could run over j = 1 to n if you use all n training points, or over some k points if you're using a batch size of k in batch SGD.
That's the core of it; we explain the mathematics behind this in lots of detail in the course videos, and we've covered it in some live sessions too. From an implementation standpoint, this update equation is what you have to implement at the end of the day.

There are many ways of parallelizing logistic and linear regression. One of them, since we've already seen plenty of data parallelization, is task parallelization. Our w is a d-dimensional vector, so the update equation needs to be executed for each of the components w_1, w_2, ..., w_d. One of the simplest ways (I am not saying it's the only way) is: some of these w_i's you update on core 1, another bunch on core 2, ..., another bunch on core 6; process 1 executes on core 1, process 2 on core 2, and so on. One thing you have to understand is that the training data is used by all these processes: to update any w_i with simple gradient descent you need the whole training data, and even with SGD you need a sample of it. The whole weight vector is also being read and updated. So you keep D_train and w in shared memory, accessible to all the processes and all the cores. This is task parallelism: the whole of D_train and the w vector are visible to everyone, and we are breaking the larger task, running the update equation for all d components, into smaller tasks, some done in process 1, some in process 2, some in process 6, whatever it is.

One thing you have to be very careful about is iteration boundaries. Suppose w_0 is the randomly initialized vector at iteration 0 (the subscript here is the iteration number; a superscript would be the component number). Random initialization is easy. To produce w_1, some components are updated in process 1, some in process 2, and so on; only once all of them are completed do you go to the next iteration and produce w_2, and so forth until convergence. A sketch of this loop follows.
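Here is a hypothetical sketch of that scheme for linear regression with squared loss (my own toy code, not from the session): each worker computes the gradient for its own slice of w's coordinates, the parent applies all the updates, and only then does the next iteration begin. Note that every worker needs the full X, y, and the current w, exactly as discussed; here they're simply pickled to the workers, whereas a real version would keep D_train and w in shared memory:

```python
import numpy as np
from multiprocessing import Pool

def grad_slice(args):
    """Gradient of the squared loss w.r.t. coordinates lo..hi of w."""
    X, y, w, lo, hi = args
    err = X @ w - y                               # needs ALL the data and all of w
    return lo, hi, X[:, lo:hi].T @ err / len(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 12))
    true_w = rng.normal(size=12)
    y = X @ true_w
    w, lr = np.zeros(12), 0.1

    bounds = np.linspace(0, 12, 7, dtype=int)     # 6 coordinate slices, one per core
    with Pool(processes=6) as pool:
        for _ in range(200):                      # one synchronization per iteration
            jobs = [(X, y, w, bounds[i], bounds[i + 1]) for i in range(6)]
            for lo, hi, g in pool.map(grad_slice, jobs):
                w[lo:hi] -= lr * g                # w_new = w_old - eta * dL/dw
    print(np.allclose(w, true_w, atol=1e-3))      # True: it converges
```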
This is one way, not the only way, of parallelizing linear or logistic regression, or any machine learning model whose update equation looks like this.

An obvious question: what if D_train doesn't fit into RAM? It's a good question. If you stick to one computer, one option is to break the whole of D_train into chunks D1, D2, D3, D4: load D1 into RAM, use all six cores to update your w vector with it, then load D2, then D3, and so on. If you have only one computer, say with 4 GB of RAM and 1 TB of data, you can still use this method, but it will take more time. If you want it fast, a better approach is distributed computing, like Spark: spin up a cluster with some 20 or 30 Spark nodes and use distributed computing there. In our Spark videos we explain how logistic regression is trained on Spark using Spark internals, map and reduce commands, reduceByKey, and how distributed memory works in Spark. (I want to touch upon a few more techniques before the Q&A; I'm glancing at the comments, and I'll answer as many questions as I can at the end so I don't lose the flow.) The chunked, single-machine strategy looks roughly like the sketch below.
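A hedged sketch of the chunked strategy using scikit-learn's partial_fit (my own example; load_chunks is a stand-in for whatever streams D1, D2, ... off disk):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def load_chunks(n_chunks):
    """Stand-in for streaming chunks of D_train from disk; here we generate them."""
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        X = rng.normal(size=(10_000, 20))
        yield X, (X[:, 0] > 0).astype(int)

# logistic regression trained with SGD ("log_loss" in scikit-learn >= 1.1)
model = SGDClassifier(loss="log_loss")
for X_chunk, y_chunk in load_chunks(4):
    # partial_fit updates w using only this chunk, so only one chunk
    # needs to sit in RAM at a time
    model.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))
```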
Next: decision trees. How do you train a decision tree in parallel? A very decent question, and there are multiple strategies here too. I won't be able to touch every machine learning algorithm, but decision trees are uniquely different from the optimization-based techniques like logistic and linear regression, so I want to cover them. Suppose you're training a decision tree and you're at some node whose data is D1. At this node you have d features, f1, f2, ..., fd, and you have to decide which feature to split on so as to get the maximum entropy gain, standard decision tree building. Some features could be categorical, some real-valued, some integer-valued; it doesn't matter. What's the brute-force way? If you implement a decision tree from scratch, you first try feature 1 and find its best split, the one that gives the best entropy gain; then feature 2, and so on through all d features. Whichever feature gives the best entropy gain is the one you split this node on, creating two child nodes with data sets D11 and D12, and then you recursively construct the left subtree and the right subtree. That's how decision tree building works.

Here, one strategy is task parallelism. Send some features to core 1, so the best split and best entropy gain for those features are discovered on core 1; another set of features goes to core 2; another to core 6; and so on. What are we doing? Our overall task, check each feature, find the best split with the best entropy gain per feature, then pick the best among all of them, is being broken into smaller per-feature tasks. Suppose on core 1 you realize that feature 2 with some threshold τ2 has the best entropy gain among its features, while another core finds that feature 10 with threshold τ10 is its best, and so on; among all of these, you pick whichever has the best entropy gain and use it. Remember that each of these cores, or the processes running on them, needs access to the whole data D1; otherwise they cannot compute the best split and entropy-gain threshold for each feature. So the best split and entropy gain per feature is what's being split across cores. A rough sketch of this idea follows.
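A rough, hypothetical sketch of the per-feature task parallelism (my own code; the split score below is a crude stand-in for the entropy gain you would really compute):

```python
import numpy as np
from multiprocessing import Pool

def best_split_for_feature(args):
    """Worker: try every threshold for one feature, return its best (score, tau)."""
    X, y, j = args                      # note: every worker sees the WHOLE data D1
    best = (0.0, None)
    for tau in np.unique(X[:, j]):
        left, right = y[X[:, j] <= tau], y[X[:, j] > tau]
        if len(left) == 0 or len(right) == 0:
            continue
        score = abs(left.mean() - right.mean())   # crude stand-in for entropy gain
        if score > best[0]:
            best = (score, tau)
    return j, best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    y = (X[:, 3] > 0.5).astype(int)     # feature 3 is the informative one

    with Pool(processes=6) as pool:     # features are spread across the cores
        results = pool.map(best_split_for_feature, [(X, y, j) for j in range(8)])

    j, (score, tau) = max(results, key=lambda r: r[1][0])   # pick the global best
    print(f"split on feature {j} at threshold {tau:.2f}")
```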
Again, that's one strategy, not the only one. Another strategy is this: you're building a decision tree on the original data D, and in one process you discover a rule at the root, say feature 2 < τ2, with yes/no branches. The moment you have such a rule, the whole data D gets broken into two data sets D1 and D2; one subtree will be built using D1, the other using D2, and the construction of one subtree is completely independent of the construction of the other. So one subtree can be constructed on, say, cores 1 and 2, and the other on cores 3 and 4. Here we are using data parallelism: the whole data D is split into D1 and D2, so processes 1 and 2, running on cores 1 and 2, only look at D1 and never D2, while cores 3 and 4 only look at D2. That's another way of parallelizing decision trees. So there are two broad approaches, and many others besides; I've read at least 20 research papers on decision tree parallelization, because I implemented some of this in practice.

Now, most people don't use single decision trees in practice; they use random forests or GBDTs, much more sensible models. How do you train a random forest? A random forest is often referred to as a trivially parallelizable model. Imagine you have six cores and you have to train, say, 100 base learners: tree t1 is trained on core 1, t2 on core 2, t3 on core 3, and so on. The training of each tree is independent: t1's training doesn't impact t2's training, because each tree gets its own sample, row sampling and column sampling, all the ideas we discussed in the course videos. So with six cores you can build t1 through t6 in parallel, then t7 through t12, and keep going; it's a trivially parallelizable algorithm, as in the sketch below.
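In practice you rarely write this yourself: scikit-learn's random forest, for example, exposes exactly this parallelism through n_jobs (a one-liner, shown here on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)

# 100 independent base learners; n_jobs=-1 spreads tree building over all cores
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)
```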
There is very nice documentation on this pool of workers — it is also called a pool of workers, and we will use this process-pool concept in productionization as well. So, some simple examples — I'll share the document with you; I'm walking you through examples from the official Python documentation itself. It is actually very simple. There are multiple examples there; let me show you a slightly more detailed one instead of the simplest, because it covers lots of cases. All you have to do is this: from multiprocessing import Pool — simple stuff. Suppose you have a function f(x) which just squares its input. Then, in the main block, first I create a variable called pool. Look at this: I am saying pool = Pool(processes=4). The lowercase pool is my variable name; the uppercase Pool is the class name — I am calling the class constructor and saying processes=4. The moment I run this, it creates a pool of four processes, p1 through p4, once; each process will run on some core — this one might run on core 1, this one on core 2, this one on core 6, this one on core 5 — they can run on different cores. The moment I have these processes, I can start running things on them. Look at the simple example: it says pool.map. What we want is to run the function f with the arguments range(10), which means we want to compute f(0), f(1), and so on up to f(9). And pool.map means: across this pool of processes, please run f. So f(0) probably goes to one process, f(1) to another, f(2) to another, f(3) to another; once those processes have completed and returned the values for f(0) through f(3), then f(4) goes out, then f(5), f(6), and so on.
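This is essentially the example from the official multiprocessing documentation being walked through here:

```python
from multiprocessing import Pool

def f(x):
    return x * x        # the function that just squares its input

if __name__ == "__main__":
    with Pool(processes=4) as pool:       # the pool p1..p4, created once
        print(pool.map(f, range(10)))     # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```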
So here you have ten function calls being split across the four processes in the pool. If you want to compute f(i) over a range of i, this is one way to split the work across the pool you have — very simple logic. That is number one, the standard way of doing it. Number two: you can instead say pool.imap_unordered. What does that do? pool.map sends f(0) to the first process, then f(1), f(2), and so on, and prints 0 squared, 1 squared, up to 9 squared, in order. imap_unordered hands the work back in arbitrary order — when you actually run it you can get 0, then 4, then 1, then 16, then 9, in any order. imap_unordered basically means: the values you pass in — range(10) is 0, 1, 2, up to 9 — need not be executed or returned in any specific order; first f(0) might run, then f(2), then f(3) — any order, it does not matter. So that is how you can use a pool of processes. There is another very nice piece of syntax. Suppose you have a pool of processes p1 to p4, and there is only one call you want to run — say f(20) — but you still want to reuse those four processes running on your cores. You can say pool.apply_async. apply_async basically means: just call f(20) on any one of the processes you have, asynchronously — don't worry about which; I just want f(20) to run on one of the pool's processes. All you have to say is apply_async(f, (20,)).
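Again roughly the official documentation's example, showing both variants side by side:

```python
import os
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Results stream back in whatever order the workers finish:
        for v in pool.imap_unordered(f, range(10)):
            print(v)                        # e.g. 0, 4, 1, 16, 9, ...

        # A single asynchronous call, on whichever worker is free:
        res = pool.apply_async(f, (20,))
        print(res.get(timeout=1))           # 400

        # apply_async runs on exactly ONE worker; launch four to see four pids:
        results = [pool.apply_async(os.getpid, ()) for _ in range(4)]
        print([r.get(timeout=1) for r in results])
```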
That call returns a result object, and you get the value by saying result.get(), which will print 400. So you can run just a single function call on one of these processes, or you can hand over a whole array and it will run there — that's another way. You can do many more complex things too. For example, look at this one: we imported os — the module that gives us operating-system functionality — and the code says pool.apply_async(os.getpid, ()). Whatever function you pass will be run on one of the processes; here the function is os.getpid — pid means process id on any Linux or Unix-based system — called with no parameters. What do you get back? The pid of one process; it runs on only one process, and that is important. Or, if you want to hit all the worker processes, you can call apply_async in a loop, for i in range(4), and the calls land on the different workers. Mostly, in real applications, you will use pool.map, pool.imap, or pool.apply_async. What all this gives you is a pool of processes that you can keep reusing. In the case of a random forest, you could have a function that trains one base learner, with all the parameters needed to train it passed in; with six processes, process 1 runs one call, process 2 another, process 3 another, and so on. So a pool of processes is very, very useful when you have to train a random forest — and we'll see how it is useful when you productionize models a little later. Now, the next very interesting thing: how do you do this for gradient boosted machines, or gradient boosted decision trees? This is the update equation for gradient boosted decision trees — I've taken it from Wikipedia, and we also explain it in loads of detail in the course videos. What it says is: the model after m base learners is the model after m−1 base learners, plus a multiplicative term gamma_m — which we discover by solving a simple optimization problem — times the m-th base learner. That is the GBDT formulation. If you have to train a GBDT model from scratch, you keep finding each base learner in turn, and to train base learner h_m you need the residuals from F_{m−1}(x), the model built from all the previous base learners. So first you build h_0(x), your zeroth base learner; then, to construct the first base learner h_1(x), you need the residuals from the zeroth; then you construct h_2(x), and for that you need the residuals based on those two models; and so on and so forth.
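Written out, the update being described (this matches the standard Wikipedia formulation):

```latex
F_m(x) = F_{m-1}(x) + \gamma_m \, h_m(x),
\qquad
\gamma_m = \underset{\gamma}{\arg\min} \; \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma \, h_m(x_i)\bigr)
```

with each h_m fit to the residuals (negative gradients) of F_{m-1} — which is exactly why h_m cannot be started before F_{m-1} exists.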
So if you think about it, to construct any base learner h_m(x) you need all the previous learners h_0(x) through h_{m−1}(x) to already be constructed. It is inherently serial: you can't build h_0(x) on one core, h_1(x) on another core, h_2(x) on yet another, because to construct h_m(x) you need everything up to h_{m−1}(x), and you need the residuals of F_{m−1}(x). But GBDT can still be parallelized, very simply. Remember that in GBDT each base learner is a shallow decision tree, typically two or so levels deep — and we discussed how to parallelize decision tree construction just a few minutes back. So for GBDT the rule is: parallelize the construction of each base learner h_m(x), but you can't parallelize the outer loop — coming up with h_m(x) requires all of h_0(x) through h_{m−1}(x), so the learners are built one after the other. The construction of each individual base learner, which is a decision tree, you can do in parallel, using either the task parallelization or the data parallelization we just discussed. The reason I discussed decision tree parallelization first is that both mechanisms are actually used in GBDT: if you look at the source code of XGBoost, or any of the major boosting code bases, they all use these methods — either data parallelism or task parallelism or a combination of both. Okay, cool. Now what about deep learning models? Let me show you a simple example. Imagine you have a layer with five activation units, then a layer with three activation units, and something else after that; between them you have a weight matrix W of size 5×3 — let's assume it's a multi-layer perceptron. If instead it were a CNN, you would have convolutional kernels there; whatever it is, at the end of the day you have some sort of matrix operation. How do deep learning algorithms work? You have a forward pass, then a backward pass — very simple. The forward pass is nothing but matrix multiplication followed by an activation function that you apply. That is why, when Google wrote their library, they called it TensorFlow: what it does is a bunch of tensor operations, and the flow is the computational-graph-based flow. At the end of the day, most deep learning is matrix, tensor and vector operations. For a simple MLP with a 5×3 weight matrix W: you take the outputs of the previous layer, multiply them by W, and apply an activation function. It all boils down to matrix multiplication — your forward pass is basically matrix multiplication along with some activation functions.
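A minimal sketch of that one layer — the sizes and the ReLU choice are illustrative:

```python
import numpy as np

# One MLP layer's forward pass: 5 inputs -> 3 activations.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)          # outputs of the previous layer
W = rng.standard_normal((5, 3))     # the 5x3 weight matrix W
b = np.zeros(3)

relu = lambda z: np.maximum(z, 0.0)

h = relu(x @ W + b)   # forward pass = matrix multiply + activation
print(h.shape)        # (3,)
```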
That's all it is. We already discussed how matrix multiplication can be parallelized, so the whole forward pass can be trivially parallelized. Now what about the backward pass? Backpropagation is nothing but the chain rule with memoization, as we discussed in the course videos: you get some partial derivatives, and based on them you have to update the weights. This too is trivially parallelizable. You get your partial derivatives from backpropagation, and suppose this 5×3 matrix W is what you want to update. Simple: on core 1 update some of these weights, on core 2 another set, on core 3 another set, on core 4 another set. Each weight w_ij in the k-th iteration is updated using the update rule — imagine all of w_11, w_12, and so on have to be updated in iteration k; some of them you update on core 1 based on the partial derivatives you get. And your partial derivatives are just function evaluations: dL/dw for any weight is an actual function, depending on the activation function and the previous operations, and you can implement those partial derivatives as a simple function. So your forward pass is simple matrix multiplication, and your backward pass you can trivially parallelize by updating some weights on core 1, some on core 2, some on core 3, some on core 4. That said, for deep learning, GPU-based systems are typically much, much faster — we discussed this in lots of detail in the GPU-architecture course videos. The idea is simple: GPUs typically have hundreds or even thousands of cores. Those cores are not individually very powerful, but because there are thousands of them, if you have a thousand weights they can all be updated in a single shot. So GPUs are preferable to multi-core CPUs for most deep learning — though you can also train on multi-core systems; I'm not saying you can't. That's the key idea. And the same logic applies across most of deep learning — CNNs, transformers, LSTMs, any model you have — all of them boil down to matrix multiplications plus updating the weights or parameters of every layer. Whenever you have a bunch of weights or parameters — convolutional kernels, the weights of an LSTM unit or a transformer encoder layer, whatever you have — all the weights within a single layer can be updated in parallel, for one core reason: these weights are independent of those weights, so you can update one chunk on one core and another chunk on another core.
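Purely to illustrate that independence, here is a sketch of a chunked weight update, assuming the gradients dW have already been computed by backprop. In reality this belongs on a GPU; for a matrix this small, process overhead would dwarf the work:

```python
from multiprocessing import Pool

import numpy as np

def update_chunk(args):
    # Each w_ij is updated independently: w <- w - lr * dL/dw
    W_chunk, dW_chunk, lr = args
    return W_chunk - lr * dW_chunk

def parallel_sgd_step(W, dW, lr=0.01, n_workers=4):
    # Chunk the weight matrix row-wise; each worker updates only its own chunk.
    Ws, dWs = np.array_split(W, n_workers), np.array_split(dW, n_workers)
    with Pool(n_workers) as pool:
        parts = pool.map(update_chunk, [(w, g, lr) for w, g in zip(Ws, dWs)])
    return np.vstack(parts)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, dW = rng.standard_normal((1000, 3)), rng.standard_normal((1000, 3))
    print(parallel_sgd_step(W, dW).shape)   # (1000, 3)
```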
Suppose you have a weight matrix in one layer: you can chunk those weights and process some of them on core 1, some on core 2, and so on. Wherever there is independence — wherever there is no dependence among the weights — try to process them in parallel, and this holds true for most deep learning models. Now I'll try to spend some 15–20 minutes on productionization, and then I'll spend time on Q&A. So let's go to productionization — this is also an important topic. First and foremost: we have done multiple sessions on how to build Flask APIs. A Flask API is a web API; Flask is very popular, very lightweight and very easy to build with, and we have done tons of sessions on building web APIs. For productionization of models, what do you do? You give an x as input; the API computes your model — let's assume your model is f(x) — and you can call it over the web, given the IP address and port. We discussed some of this when we explained how to design Flask APIs and build them from scratch in the course videos and previous live sessions. You give x as input, it computes some function, gives you the output — that's it. This function is something you implement in your code: suppose you have a logistic regression model, you give the x vector as input, f(x) computes the class label y and outputs it. Very simple. Now things become slightly tricky when multiple people send requests simultaneously: this person sends x1, that person sends x2, another sends x3, and so on. You might ask, where does this happen? Imagine our Flask API is serving a recommendation system — I'm just giving an example. Every user sends some data about themselves, and the API has to return a bunch of similar movies or similar products. Say you have an e-commerce website: one user's properties are all in x1, he sends x1, and we have to send y1 back to him. But it's not as if only one person is ever accessing your website — there will be thousands of people hitting this system simultaneously. This Flask API takes each request and returns the output. So: can you run this f(x) on multiple cores of your computer? Imagine you have cores 1 through 6. This person sends x1 — can you compute f(x1) on this core and return y1, and, while f(x1) is being computed, accept the request for x2 and process it on another core, since you have multiple cores? This is where parallelism in productionization is used. There are many solutions to this problem. One simple solution, actually recommended by Flask itself for productionization, is to use something called a WSGI server — this diagram has been taken from dzone.com, a very interesting repository of nice articles. WSGI stands for Web Server Gateway Interface.
Each of your users is a client, who could be accessing our website or our API from a web page, from some program, whatever it is. First they send us a request: hey, here is my x1, please give me y1. The application code — this is where our f(x) lives — executes it. You can think of a WSGI server as a web-server-like layer: when multiple people send multiple requests, it calls the application in a very ingenious way and takes care of all the parallelism for us. Whenever requests come in, it calls the application — hey, evaluate this — and once it has the response, it sends it back. So a WSGI server is a piece of software, a web server gateway interface, which lets us handle these requests in a server-like setup. WSGI servers are typically fairly lightweight and are very often used with Flask. One of the popular ones is called Gunicorn — it's actually "Green Unicorn"; you can see all of Gunicorn's documentation online, and I'll show you some simple code snippets for it. It is a very lightweight, very simple system. If you go to the Flask productionization page in its official documentation, you will see Gunicorn listed as one of the WSGI servers it works with. What is Gunicorn? It is a WSGI HTTP server designed primarily for Unix-based systems — remember, a vast majority of production systems are Linux or Unix or equivalent — and it is used very extensively with Flask APIs. What does Gunicorn do? Let me show you how it works first; then I'll show you the code. This client sends x1, this one sends x2, this one sends x3, and so on. Gunicorn maintains a pool of processes. It knows it will have to run the function f(x) — it knows about the whole Flask API and what function to execute — so first it creates a bunch of processes, p1, p2, and so on up to, say, p6. Whenever somebody sends a request, it sees which process is not currently busy and dispatches to it: x1 goes to p1, x3 to p2, and so on — Gunicorn takes care of the parallelization, orchestration and distribution for you. And if all the processes are busy, Gunicorn internally stores pending requests in a queue-like structure, so that as soon as a process is freed up the next request is dispatched; it coordinates all of this so you don't have to write that code. You could write it yourself — but why bother, when there is a very nice, battle-tested open-source tool that does it? So that's one way of doing it. As far as the code itself is concerned, I've pasted the whole thing here; this is on a Linux or Mac system — on Windows some of these commands might change slightly. You install Gunicorn with pip install gunicorn — very simple; if you want to read more, the Gunicorn website is very easy to read.
The implementation itself is very simple. First I have written something called myapp.py — all of these examples are taken from the Flask official documentation and the Gunicorn website. The application code is this: we import Flask, create a Flask app, and define a very simple function called hello_world — nothing fancy, it just returns "Hello World". This function could be anything; our f(x) could be implemented right here. How to build Flask endpoints with parameters and all of that, we discussed in other videos where we spent about four hours explaining Flask internals and designing Flask APIs from scratch, so I won't go into the depths of Flask. Simply speaking: there's a function I'd like to call, I import Flask, create a basic app, and if the file is run directly, I run the Flask app. Now let me run it on my terminal so you can see it actually working — one second, I was running something else. What's my present working directory... okay, I have myapp.py here. If I run python myapp.py, look at what it says: the Flask app myapp is running, but it gives you a warning — "This is a development server. Do not use it in a production deployment. Use a production WSGI server instead." That is the official Flask statement, and it is very, very important: Flask itself says what you're doing is fine while developing, but if you want to productionize this, please use a WSGI server instead — Gunicorn is the one we're using. If I now go to the 127.0.0.1 address in my web browser while the Flask app is running, it just says Hello World — a very simple system, nothing more; I'm just calling a Flask API. But this Flask API is only using one core; it's not using the multiple cores of my computer. If I want to use multiple cores of my machine, the best way is to go through the WSGI server, which is Gunicorn in our case. The way we'll do it is this: we create a small file called wsgi.py in the same folder — look, myapp.py is in the same folder as wsgi.py. Inside wsgi.py, all I say is: from myapp import app — myapp is in the same folder, so I can import app — and when this file is run, call app.run(). And what is this app? It is nothing but my Flask app from myapp.py. So the source of wsgi.py just says: whenever wsgi.py is run, execute this app. Then the Gunicorn command I can give is the one shown below.
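Reconstructed from this walkthrough (the worker count and port are the ones used in the demo; the pattern matches the Flask and Gunicorn docs):

```python
# myapp.py -- the Flask app shown on screen
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    return "Hello World"    # in a real service this would compute f(x)

if __name__ == "__main__":
    app.run()               # Flask's development server -- not for production
```

```python
# wsgi.py -- lives in the same folder as myapp.py
from myapp import app

if __name__ == "__main__":
    app.run()
```

```bash
pip install gunicorn
# 4 worker processes, bound to 127.0.0.1:4000, serving the `app` object in wsgi.py
gunicorn -w 4 -b 127.0.0.1:4000 wsgi:app
```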
In that command, -w 4 means create four worker processes. Remember, Gunicorn itself runs in one process — call it the parent process — on some core, and it creates four worker (child) processes, p1 through p4. -b means bind: the whole app is accessible at 127.0.0.1 on port 4000. And wsgi:app says what to run: the app object inside wsgi.py. The moment you execute this command — let me show you on the terminal, one second — look at what it says: starting Gunicorn, listening for requests at http://127.0.0.1:4000, and this is my process id — the parent process where Gunicorn itself runs. Then look at these four lines: booting worker with this process id, booting another worker, another, another. Now imagine I get a flood of requests — let me use the earlier diagram — suppose I have 100 clients in parallel saying: here is my x1, here is my x2, please give me y1, y2, y3, and so on. Gunicorn sits in the middle: it takes all these requests and uses those four processes to send back the y_i's. So this is one way you can use parallelization in a production environment; a lot of the work is handled for you by the WSGI server, like Gunicorn. There are other alternatives too: if you go to the official Flask deployment documentation, it describes deploying on Heroku, Google App Engine, AWS or Azure, and using Gunicorn or other types of WSGI containers — there is a lot of documentation there. What we covered is how to use Gunicorn as a WSGI HTTP server for executing requests in parallel — and I took the command from there as well: the same four workers, and the IP address and port bound to the Gunicorn job, and things like that. So yeah, that's what I wanted to cover today — we just got it under 90 minutes. Let me now go to the chat window and try to answer as many questions as possible. Okay, let's get into it. Biju, good question — your question is: isn't this the same as load balancing? It's not. Load balancing is a concept from distributed systems. A lot of people use Gunicorn with Flask APIs together with nginx — spelled n-g-i-n-x — which is often used as a load balancer, and a load balancer balances across multiple computers. Here, Gunicorn is only working as a web server interface; it's not a full-fledged load balancer. If you want, you can use a full-fledged load balancer too —
there are multiple load balancers, nginx being one of the popular options, and in some of the other documents in the Flask official documentation people do use nginx. But that takes us into software engineering, distributed computing and things like that, which is why I didn't want to go into it — let's focus on the parallel computing part. Okay, cool. Abhijit has a good question: why don't we use Dask? Yes, Dask is a very popular, up-and-coming system — and I'm not partisan about any one system over another. Dask is a somewhat lower-level API and it is getting very, very popular, but here's the thing: if I had to choose between Dask and Spark, I would actually use Spark, the reason being that Spark is much more mature and has been used in industry at much larger scale. I have personally used it — both at Applied AI Course and at my previous company, Amazon — on fairly large amounts of data; we used Spark extensively to crunch data, and even to build some simple machine learning models. So Spark is much more thoroughly tested. Dask is a good system, no doubt about it — I think Dask could become very popular in the future; could, I'm not saying will. You can use Dask too; it's just a call. If time permits, maybe we'll do a session on the internals of Dask. Is CUDA used in machine learning? Yes, absolutely — when I say GPU programming, we're talking about CUDA. For those of you who don't know what CUDA is: it's basically a library of functions released by NVIDIA to do parallel processing on their GPUs. Whether you use TensorFlow, PyTorch or Keras, whatever code you write gets converted into CUDA code. CUDA is used extensively in deep learning, and also in classical machine learning — for example, using TensorFlow you can train a logistic regression model, a linear regression model, a linear SVM model, all of that. Trojan has a question: when we have sub-matrices inside our data matrix, how do we fit them inside our box — we have dedicated memory, but what if they overflow? Trojan, that's what I said in my discussion: if your data is so large that it doesn't fit into your RAM, can you still run your algorithm? Most modern libraries handle this by fetching chunks of data from your disk into RAM, processing a chunk, throwing it out of RAM, then fetching the next chunk — most libraries already do this for you internally. But it will take more time, because writing to and reading from disk is slow — even with SSDs it's quite slow compared to reading from RAM or from cache. So very often, when you have a very large amount of data — for example, one of the case studies we do as part of the course videos, Microsoft malware detection, processes close to 100 GB of data; that data will never fit into our RAM, and we discuss how to tackle such problems — you have to be patient if you have only one computer at your disposal. It can be processed by copying a chunk of data into RAM, processing it, letting it go, copying the next chunk, processing it, letting it go, and so on and so forth. That's one approach — or, if you have multiple computers, just go use Spark.
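A minimal sketch of that chunk-at-a-time pattern using pandas — the file name and column are hypothetical:

```python
import pandas as pd

# Stream a file far larger than RAM in fixed-size chunks: copy a chunk in,
# process it, let it go, repeat -- exactly the loop described above.
total, n = 0.0, 0
for chunk in pd.read_csv("huge_dataset.csv", chunksize=1_000_000):
    total += chunk["price"].sum()   # "price" is a hypothetical column
    n += len(chunk)                 # chunk memory is released on the next iteration
print(total / n)
```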
Very simple. Somebody asks: what should I do to teach my friends? Simple — I became better over time. I was a public speaker in my school days; I enjoyed public speaking, so some of my teaching and explanation ability was built then. But my suggestion is this — one of the things I learned from Andrew Ng, from a very nice video of his: he says that when he first joined Stanford he was not a very good teacher; he was rated one of the worst teachers in the computer science department. So he would take a topic, record himself — or teach a concept to a mirror for 10 minutes — then watch his own video, criticize it, say "hey, I'm not teaching this correctly," and re-teach it. He consistently improved that way, using a mirror or a video recorder, and within one to two years he became a very good teacher. That's one. The second thing I'd recommend: teaching requires patience, and it requires you to put yourself in the shoes of the learner and think from the learner's perspective. That's a skill I learned over the last few years; I didn't have it earlier. Next question: is there any online solution where we can preload the data needed for processing — something like Redis — and crunch more quickly? Look, you can use any distributed system, but think about what Redis is at the end of the day: a distributed key-value store — a distributed hash table. Of course it has redundancy and all the other fancy things, but the bottom line is that it's a distributed dictionary. Say you have ten boxes — box 1, box 2, box 3 — and the dictionary is stored across the memory of all ten computers; you read from that. You still have compute as your bottleneck: each of those computers still has only a limited number of cores. Redis enables you to query the data quickly because it's stored in a distributed hash table, but at the end of the day all the compute — the equation updates — still has to run on your cores; Redis can't speed that up, it can only speed up access to data in a distributed fashion. And note that Redis is designed for distributed computing, not for a single-box parallel computing approach — if it's a single box, why do I need Redis? I can just use Python's built-in dictionary. In a distributed-systems environment, yes: for example, with Spark you could fetch data into RAM using Redis, but even that seems a little overkill, because RDDs in Spark — as we discussed in previous live sessions — are distributed in-memory units, and Spark has optimized them enormously for speed. Cool. Can Julia also perform all the ML functions you mentioned? I'm not an expert in Julia, so I don't want to comment too much, but languages like Julia, Go, etc. are still not mainstream, to be honest
with you. The easiest language for building machine learning systems today is Python, because there are so many libraries available for it — it's as simple as that. You can write in Julia, I'm not saying you shouldn't — I've written code in C as well — but the available libraries are too few, so you'd have to implement a lot of stuff from scratch. Okay, cool. There are some questions from earlier that I missed, so let me go up and answer a few. Will the TensorFlow Developer Certification from Google help my resume stand out? If you have the bandwidth, time and resources, go ahead and do it. Some of these certificates are useful — I think the Google TensorFlow developer certificate will certainly add some value to your resume — but the effort you need to put in and the examination price are non-trivial, so take all of that into account. Feel free to do it. Can we use Spark for I/O operations instead of multi-threading? I think many of you are getting confused between multi-threaded or parallel code and Spark code. We did another session — the first Spark session is publicly available on our YouTube channel — where we discuss how Spark fits in. Remember, Spark splits the data across multiple computers, and on each computer it already uses multiple processes, multiple cores and multiple threads internally. Spark typically kicks in when you have amounts of data that you can't store on one computer, either on disk or in RAM. Spark falls into the distributed-systems model, but internally it uses multi-processing, multi-threading, all of that. Check out that two-hour live session on the basics and architecture of Spark and Hadoop; it will help you understand this better. Yes — Parikshit says: doesn't numpy already use some of this parallelization? Yes, it does. Very simple example: imagine you write matrix multiplication on your own, the triple-for-loop version. The code you write is going to be much, much slower than numpy's actual multiplication. Why? Number one, numpy's internal implementation is written in C, so it runs much faster than your Python code. Number two, it also leverages multi-processing and multi-threading capabilities. That's why, whenever possible, if you want fast execution, use numpy's internal functions — they are very beautifully implemented. The reason I went into the mathematical details is that understanding the foundational skills is important; many people unfortunately don't realize it and just say "hey, I know this library." That's why, at most top product-based companies, they don't ask "do you know this tool, do you know that tool?" — they say "tell me how this works internally." I've been in interviews of senior engineers and managers where someone says "I've used Redis," and we'd say: okay, explain how Redis handles this situation — because I want to understand whether that person understands Redis's internal architecture, since in our team we used Redis and Memcached.
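Coming back to Parikshit's numpy point for a moment, here is a minimal sketch of the gap being described — exact timings will vary with your machine and the BLAS build numpy is linked against:

```python
import time

import numpy as np

n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Hand-written triple loop in pure Python:
t0 = time.perf_counter()
C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
t1 = time.perf_counter()

# numpy's matmul: C-level code, typically backed by a multithreaded BLAS:
t2 = time.perf_counter()
C2 = A @ B
t3 = time.perf_counter()

print(f"pure python: {t1 - t0:.2f}s, numpy: {t3 - t2:.5f}s")  # orders of magnitude apart
```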
So what is important is a deeper understanding of what is happening under the hood. Otherwise people say "hey, I know how to write TensorFlow code, I know deep learning" — no, you know how to write TensorFlow code; you don't know deep learning. To understand deep learning you have to understand the underlying math, and how TensorFlow optimizes things internally. If you want to be a good engineer, a good data scientist or a good machine learning scientist, try to go below the hood and understand what's happening, because that way your depth of skill improves and you know which tool to use where. I could have finished this session just by saying "hey, numpy has this parallel stuff." I also wanted to show you how to implement a Gunicorn-like system from scratch, but that would take two to maybe even four hours. We could implement a Gunicorn-like system ourselves, by the way — it's not impossible; we could build it from scratch using the basic functionality of Python — but most people will just end up using Gunicorn in production, so I talked about that instead. If you are a software engineer, it's a very good exercise to work out how Gunicorn works internally — or how, say, CUDA works internally. Those are interesting. Always try to go one level deeper — multiple levels deeper if you can — so that you get a deeper understanding of the concept and better judgment about what to apply where. Suppose a new distributed-systems platform comes up — take Kafka. It's not very difficult to learn, because if my foundations in computer science, distributed computing, machine learning, and data structures and algorithms are strong enough, I can pick up Kafka in no time: at the end of the day it has to use basic data structures, algorithms and distributed-computing concepts. It doesn't invent anything fundamentally new; it probably optimizes some of these things. So having strong foundations — mathematics, data structures, algorithms, computer networks — is very, very useful. For example, the whole of multi-processing and multi-threading is taught to computer science students in the operating systems subject. If you've done a B.Tech in computer science, you should know all of operating systems; you should be able to implement something like Gunicorn in a weekend, because you know the basics of computer networks and operating systems. Unfortunately most people don't learn that, which is a pity — but knowing the foundations is important. Okay. Can you implement logistic regression in C++? Yes, Manoj, you can. One of our assignments in the course videos is to implement linear regression with regularization in Python; you can certainly implement it in C or C++, and the code you write could be faster. The only problem is the amount of time it takes: remember, C and C++ don't have garbage collection, so you have to take care of memory management yourself — you're re-implementing everything that numpy and scikit-learn already give you. Your code could well end up faster than scikit-learn
as well. In my previous work experience we implemented things like logistic regression and gradient boosted decision trees across multiple cores in C++, way faster than most Python libraries — we implemented it from scratch because it was a production environment where we needed that kind of speed. We didn't even use Java; we wrote the code in C and C++ because speed was critical there. Unless speed is critical for your application, there is no point doing it: if you want ease of use, Python is much better — scikit-learn, XGBoost, all of these are pretty good — but if you want raw speed in productionization, you have to do it in C/C++. Next: "My Python executable produces predictions for multiple time series in parallel, but the program gets stuck at some point — have you encountered this behavior?" Avnish, debugging parallel programs can be a little tricky, but probably what's happening in your system is that it's getting into a deadlock-like situation: one process is waiting for another process to complete, which is in turn waiting for yet another. If there is any such dependency structure, that could be one cause. Put in tons of debug output and step through it slowly — I understand debugging parallel code can be a nightmare, because keeping track of which process has run up to which point is painful — but you can do it with the Python debugger, or just print and log everything so you can go through the output and see where things get stuck. One likely reason is a deadlock-like situation. Okay, cool. "Can you briefly touch upon how to build a Gunicorn-like system using Python?" Look at what Gunicorn is doing: it takes HTTP requests, it maintains a pool of processes, and whenever a request comes in it adds it to a queue. So the system takes the request, puts it in a queue, assigns queued requests to the pool of processes, and returns the outputs. Effectively, you can build this using Flask, a queue, and a pool of processes. You may not get the full-scale functionality of Gunicorn, but all you need is a pool of processes, a first-in-first-out queue — Python has a queue implementation built in — and a system that can accept the requests. You can take web requests using a Flask API, store them in a queue, and maintain a pool of processes; assign each request to the pool, and as soon as a process returns, send the response back. A little bit of orchestration needs to be done, but it can be implemented over a weekend.
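A toy sketch of that idea, assuming Flask and the multiprocessing pool already discussed — none of Gunicorn's real robustness, just the queue-plus-pool shape (apply_async's internal task queue plays the role of the FIFO here):

```python
from multiprocessing import Pool

from flask import Flask, request

def f(x):
    return x * x          # stand-in for the model's f(x)

app = Flask(__name__)
pool = None               # created in main, so spawn-based platforms stay safe

@app.route("/predict")
def predict():
    x = float(request.args.get("x", "0"))
    # Hand the call to whichever worker is free; pending calls wait in the
    # pool's internal task queue, then we block for this request's result.
    return str(pool.apply_async(f, (x,)).get())

if __name__ == "__main__":
    pool = Pool(processes=4)   # the reusable worker pool, created once
    app.run(port=5000)         # dev server only -- the toy front end
```

Hitting it with, say, curl "http://127.0.0.1:5000/predict?x=20" should return 400.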
Someone asks: between VS Code and PyCharm, which is the best IDE? Honestly, I don't use either of them — I use vim. I've used vim all my life; it's one of my favorite editors. I've also used Atom — it's on my computer and I've used it a few times. But IDEs are like t-shirts: I like this one, you might like another. Or like hairstyles — everybody has their favorite. My favorite is vim, because I like to log into a production system, look at the code, and change it right there; I like that vim is bare-bones, and I've been using it for 15 years now. Everybody has their own favorite editor; mine happens to be vim. "Did anyone finish all the course material first, do all the assignments later, and still land a good company with a great package?" Abhishek, to answer your question to the point: there are some students like that — some finish all the content and then start the assignments — but we don't recommend it. We recommend doing the assignments in parallel. One suggested strategy is this: throughout the week you learn a bunch of concepts, and on the weekend you do an assignment. If you want to spend the whole two days on the assignment, that's perfectly okay, because the assignment is connected to the concepts you learned through the week — think about it that way. Five days, say two hours a day — ten hours of content, and even five is enough to finish the assignment — then do the assignment; it can be thought of as a revision of the week's concepts before you move on to the next topic. That's what we recommend. Janardhan has a question: "I'm working on feature engineering for my final case study at Applied AI Course. My dataset is 5 GB and increasing; I have two boxes with 12 and 32 GB of RAM. Would Spark be an option?" Don't use Spark, man — come on, you already have a 12 GB box, why bother? Spark should only kick in when you have something like a few hundred GB of data, maybe a terabyte; otherwise Spark becomes overkill. Prakhar — sorry, it's a tongue twister again — asks: is it sensible to pursue the Applied AI Course while doing an M.Tech at IIT? We have hundreds of students from IITs, IISc and other top research institutes across the world who take the Applied AI Course — I have friends from IISc who graduated 10 years back who are taking some of these courses now. The reason is as follows: most IITs and IISc teach the theoretical parts very well; they focus a lot on the mathematics, and they do a lot of research, and they do that very well. What they lack is the applicative details. So a lot of students at IITs and IISc take our course primarily to learn how to apply all the theory they are learning to real-world problems. When I was a student at the Indian Institute of Science — hands down one of the best universities for machine learning in India — I learned a lot of beautiful theory; I was doing beautiful mathematics, and I have huge respect for my professors, but to be honest I didn't learn any applied stuff there. It was in industry that I learned it — I had a lot of mentors in industry who helped me and nurtured me over the years, and through their help I learned how to apply all this beautiful math to real-world problem solving. Okay, cool. How do I install the vim
editor? Vim comes with the terminal — every Linux terminal has vim, and I think Windows has it too. It should be simple to install: I've been using Linux and Mac computers for the last 15, almost 20 years, and they've always had vim built in, so I don't know Windows well, but it should be straightforward — there should be a vim installer. "Can you suggest one Docker container that includes multiple deep learning and machine learning libraries?" I don't know one off-hand — I've used Docker, but I would write the Docker config file myself and install these things — though I'm sure if you search for it you'll find one; it shouldn't be hard. Okay, cool — I think I've answered most of the questions; I'm trying to get to as many as possible. "Can JavaScript be used for machine learning?" It's a good question. Yes — if you want to do some really interesting stuff, there is something called TensorFlow.js, and I really love it. There are people who build machine learning systems in JavaScript so they can run the models on the client end — in your browser — instead of on the server. I love that tool, because the TensorFlow models you build can be executed in the client's browser. We at Applied AI Course have used it extensively: we built TensorFlow.js systems for video analysis and face tracking. We conduct tests regularly — there are video questions, multiple-choice questions and things like that — and to discourage cheating (these tests are primarily for students' self-assessment) we used TensorFlow.js to track where the person is looking and whether there is any abnormal behavior. It's a cool tool that we really enjoyed using. Okay. Somebody wants to mix gaming and machine learning: machine learning and deep learning in gaming are used primarily to design smart agents, where reinforcement learning comes in — that's one of the biggest applications, and one space you can look at. Between Python and R, I would choose Python any day. I've used R too — R is a beautiful tool for some advanced statistics functionality — but I mostly use Python because it's much easier: productionizing Python code is easier than productionizing R code. That's why we teach Python and not R. R has its own applications — if you want to do some advanced statistics, R is certainly useful — but if you look at the market in general, Python has a much bigger lead, because much more can be done in Python. You can pick either, but I'd pick Python any day. Okay, cool. "How can we connect the theoretical concepts when building ML models, rather than treating them as black boxes?" The logic is this: suppose you're training a model — when you get some weights, try to interpret them from a geometric standpoint. Say you're doing simple logistic regression or linear regression: you have a hyperplane. Across multiple trainings, is the hyperplane changing a lot? If it is, the
model itself is not robust. So whenever you're training a model, don't think of it in terms of scikit-learn's functions — think about it from the loss function's perspective. Ask yourself: on this data, what is happening with the loss function? What is happening with the regularizer? What is happening with the constraints, if you're training something like an SVM? Code is important, I'm not denying that, but thinking purely from the code front is not very helpful. Even when you write code, look at each of the hyperparameters and always try to connect it to the underlying mathematics or geometry. That's something we strongly recommend our students do, and something I do every day when I'm training a model. Code is only a way to express what you want to do; the thinking should be: I'm calling this function — what is it doing underneath? What does the loss function look like? What is happening with the regularizer, or with this hyperplane? That's a better way to think about it, and it's a thinking process you build slowly: whenever you're solving a problem, think from the geometry perspective and from the mathematical-equations perspective, not from the code. Code is only a means to get a numerical solution. "What do you think about market mix modeling — is it a data science problem?" Market mix problems have existed for a long time; they were mostly solved by statisticians, and they were around even before modern data science. You can pose them as data science problems — I've seen people do that — but a much better way is to think of a market mix model as an optimization problem. At its core, that's what it is: you want to determine where to allocate how many resources, or where to allocate how much money. So instead of treating it as a data science problem, it is better to think of it as an optimization problem — and remember that optimization is core to a lot of machine learning too. Optimization is this beautiful area of mathematics used in operations research, supply chain management, market mix modeling, and by tons of people including machine learning folks. Okay, cool. "Is Google Colab Pro available in India?" I don't think so — I tried to get one for myself and couldn't, so I'm just using the regular Google Colab; I think it's only available in the US right now, not in India. Somebody asks how to crack competitive programming problems. We are thinking about this — probably in the near future we'll do a session on how to solve dynamic programming problems, and more generally on how to think when you're facing a competitive programming problem: not "for this exact problem, here is the solution," but how some people think when they have to solve one. "An applied blockchain course?" We're not doing one anytime soon — it's not on our horizon for sure. Okay, cool. Sounds good. Thank you, folks, for attending the session; we'll announce the next live session
for next week. Thank you once again for joining the session — I hope some of you learned some new concepts. I will share the notes from this discussion, with all the reference links, in the description section of this video in the next few minutes. Okay, see you — thank you very much, have a good weekend. Bye.
Info
Channel: Applied AI Course
Views: 6,749
Rating: 4.9069767 out of 5
Id: mxhAE9Zj89I
Length: 121min 22sec (7282 seconds)
Published: Sun Aug 30 2020