Hello everyone, I'm Xiaobai. This is a machine learning course that combines theory with practice. We will use Python for the practical part, so if you haven't learned Python yet, you can take my Python course first. As for the theory, this is not a mathematics class, so there won't be any very difficult math in it. Don't worry. Okay, without further ado, let's get started.
Welcome to this course. First, artificial intelligence. AI stands for artificial intelligence, a term you have probably heard often. So what is artificial intelligence? Put simply, we want machines to have the same intelligence as humans. Once a machine has intelligence, it can help us handle many complicated tasks.
Then what is machine learning, or ML? How do we give a machine intelligence? One way is to let the machine learn. The question then is how to let a machine learn. We can first think about how humans learn: humans learn from history and from past experience. The same is true for machines. A machine's history and past experience are simply the data stored in the past. So machine learning, put simply, is finding rules in data.
Then what is deep learning, or DL? Machine learning is finding rules in data, and there are many ways to do that. One of them is deep learning, which finds rules in data using neural networks that imitate the human brain. What is a neural network? I will introduce it later in the course.
Okay, let's summarize artificial intelligence, machine learning, and deep learning. We can use this picture to explain.
Artificial intelligence is the ultimate goal we want to achieve: we want machines to be as intelligent as humans so that they can help us with many complicated tasks. How do we achieve artificial intelligence? One way is to let machines learn. Put simply, machine learning is finding rules in data. There are many ways to find rules in data, and one of the most powerful is deep learning, which uses neural networks that imitate the brain to find those rules. So that is the relationship between artificial intelligence, machine learning, and deep learning.
Next, let's understand how a machine learns. As mentioned, machine learning amounts to finding rules in data. So how does a machine, or a computer, find rules in data? It is done by combining some mathematical techniques with programs. Let's look at what the process looks like.
Suppose we have some image data: pictures of dogs and cats. We want the machine to learn from these pictures how to tell dogs from cats. In other words, we want the machine to find the rules for distinguishing dogs and cats from this data. How can we do that? First think about how humans learn to tell dogs from cats. Suppose you want to teach a child who does not know what a cat is or what a dog is. At first, you might show a picture and ask the child whether it is a cat or a dog. You can see that the child looks rather lost.
He doesn't know what you're talking about and just wants to go to bed, so he casually answers that it's a dog. Obviously he got it wrong, so we, as the teacher, mark it with a cross and tell him that an animal that looks like this is a cat. Then we show another picture and ask whether it is a cat or a dog. This time the child's eyes show a little more confidence than before, and he answers that it's a cat. Unfortunately, he is wrong again. We mark another cross and tell him that an animal that looks like this is a dog. We keep going like this, showing photos and letting the child answer; whenever he answers wrong, we ask him to correct himself. Over time, after seeing many pictures of cats and dogs, the child will have a good idea of what a cat looks like and what a dog looks like. Then when you ask him whether a picture shows a cat or a dog, he will tell you firmly and confidently that it is a dog.
The same learning method can be applied to a machine. We input a picture into the machine, that is, the computer, and ask whether it is a cat or a dog. Once the picture is input, some programs are triggered, and behind those programs are mathematical techniques. The combination of mathematical techniques and programs is what we call a model. So we can say that we input a picture into the model and ask it whether the picture shows a cat or a dog. At the beginning, the model is like the child who knows nothing: it does not know what a cat or a dog is, so it guesses randomly and says it's a dog. Obviously it got it wrong, so we mark a cross, tell the model that something that looks like this is a cat, and ask it to correct itself.
Then we take another picture, input it into the model, and ask whether it is a cat or a dog. This time it answers cat. Obviously it is wrong again, so we mark a cross, tell it that something that looks like this is a dog, and ask the model to correct itself. We keep training like this, showing the machine many pictures of cats and dogs and asking it to correct itself whenever it is wrong, until the model reaches a certain accuracy: you input a picture of a dog and the model tells you it is a dog. After the model is trained, if you get a new picture in the future and want to know whether it is a cat or a dog, you can input it into the model and it will tell you the answer.
Next, let's look at another example. Suppose we have some data on house sales: the floor area of each house in ping (a unit of floor area) and its selling price. We want the machine to find the rule relating floor area to selling price; in other words, we want to estimate the selling price from the floor area. Here the floor area is called the feature and the price is called the label: because we estimate the price from the area, the area is the feature and the price is the label. The training process is similar to before. We input the floor area, the feature, into the model. The model knows nothing at the beginning, so it guesses randomly. Suppose it guesses 4 million. Now we check the data we have: this 50-ping house sold for 5 million, so our label is 5 million, which is clearly some way off from the guess.
So we tell the model that the selling price is 5 million and ask it to correct itself. Next, we input the next record, 66 pings, and the model guesses 9 million. We check the data: the label is 6.5 million, so the model is corrected again. We keep correcting and keep going through the data until the model's predictions are close to the real data, that is, until the error relative to the labels is small. Then we say that training is complete. After training, suppose you have a 72-ping house to sell and you don't know how much it should sell for. You can input the feature, 72 pings, into the model and have it predict the selling price.
You may have a doubt at this point: the price of a house should not be determined by its area alone. We may also need to look at its location, its layout, and other factors to judge how much a house is worth. That's right, so we can collect more complete data. Besides the area, we also collect the location and the number of rooms, so the machine now estimates the price from the area, the location, and the number of rooms. Those three are the features, and the price is still the label. The learning process is the same as before. We input the features, namely area, location, and number of rooms, into the model. The model knows nothing at first, so it guesses randomly, say 4 million. We check the label, which is 5 million, and ask the model to correct itself. Then we input the second record.
The model guesses 9 million, the label is 6.5 million, and we ask the model to correct itself again. We keep training until the model's error falls within a certain range, and then we say the model is trained. In the future, if you have a 72-ping house with 6 rooms in Washington that you want to sell, and you want to estimate how much it should sell for, you can input it into this model, and it estimates, say, 7.3 million. That is roughly the process of machine learning.
Next, let's get hands-on. In this course we will use Colab. Colab is a Python environment that Google provides for free, which lets us write Python programs quickly and easily in the browser. Because it is provided by Google, you need to create a Google account and log in first. After logging in, you can see the nine dots here; click them and find Google Drive. In the upper left corner there is a New button; click it, then click More. If Google Colab already appears here, you can just click it. If it doesn't, click Connect more apps, find Colab, click into it, and click Install. It may ask you to log in, so log in. After the installation finishes, click OK. Now when we click New and then More, there is a Google Colab entry. Click it and you will see that an environment for writing Python has been created for us.
First look at the upper left corner, which shows the file name. You can change it; I'll change mine to "environment setup". I'm also more used to a black background, so I go to Tools, open Settings, find Theme under Site, select dark, and click Save. The background turns black.
Of course, you can pick whatever theme you like. In the Editor section you can also set the font size, spacing, and so on; I'll leave that for you to explore.
Before writing a program, we need to connect. Click Connect in the upper right corner, and Colab will allocate some resources for us. Once connected, look at the block in the middle: this is where we write programs. We can type code directly into this cell. Suppose I type print(87); let me zoom in so you can see it more clearly. There is a run button here; click it, the cell runs, and the result is displayed below.
In Colab, code can be split into separate cells. Hover the mouse just above a cell and buttons for adding code or text pop up; the same happens just below the cell. If I click to add code, a new cell appears where we can write Python. Suppose I write print(88) and run it; the result is displayed below. If I click to add text, a cell appears where I can write text. Suppose I write "Hello everyone". Besides plain text, you can also choose formatting such as font styles. You can see a text cell has been added. We can add many code cells and many text cells. If you don't want a cell, click the trash can icon to delete it.
Colab has another important feature: it provides free GPUs and TPUs, which can speed up computation a lot. Open the Edit menu at the top and find Notebook settings.
Here there is a hardware accelerator option, where you can choose GPU or TPU. We will use the GPU for acceleration in later lessons, so I choose GPU and click Save. Colab then reallocates the connection, so we wait a moment. After it connects, it asks whether to delete the previous runtime; I click Cancel for now. With the GPU connected, we may want to see which GPU we were given. In a code cell, type an exclamation mark followed by nvidia-smi, that is, !nvidia-smi, and run it. You can see that the GPU assigned to us is a Tesla T4. With that, our environment setup is complete.
The first mathematical technique I want to introduce is simple linear regression. It has the word "simple" right in its name, so you can guess that it is not hard; don't worry too much. Let's first describe the situation. Suppose you are the boss of a new start-up and want to hire your first employee, but you're not sure how much salary to offer. So you go to the market and collect some data about people in the same position: their years of experience and their corresponding salaries. Let's plot the data, with years of experience on the x-axis and monthly salary on the y-axis. Each cross is one record you collected. From this plot we can easily see that monthly salary rises with years of experience: the more experience, the higher the salary. Now here is the problem. Your first candidate arrives and tells you he has two and a half years of experience, that is, 2.5. How much salary should you offer him?
This is where simple linear regression comes in. Since salary rises roughly linearly with experience, we can use a straight line to represent the data. Simple linear regression means representing the data with the best-fitting straight line. Suppose we have found that line. Then we can plug 2.5 years into it and see what salary to offer; doing so, we see that we can offer this employee about 51K.
The problem now becomes how to find the best-fitting line. Let's first see how a straight line is written mathematically: y = w*x + b. Applying that to our example, y is the monthly salary and x is the years of experience. So the problem becomes finding the most suitable w and the most suitable b for the line that represents the data.
Let's implement it and see what lines different values of w and b produce. First we read in the data. We can use the pandas module, which I import as pd. pandas is a very useful data processing tool, and it is already installed in Colab by default, so we can import it directly. The URL of the data can be found in the course files; I store it in a variable. Note that the file we are reading is a CSV file, so we use read_csv from pandas: pass the URL as the argument and it reads the file. I store the result in a variable too, then display it and run the cell.
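As a rough sketch of the reading step, the cell below parses a small CSV with read_csv. The course URL is not reproduced here, so the CSV text, the column names YearsExperience and Salary, and the numbers are all made up for illustration; in Colab you would pass the URL string to pd.read_csv instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course CSV: made-up column names and values.
csv_text = """YearsExperience,Salary
0.3,36.2
0.6,36.6
1.3,42.8
2.0,43.8
3.3,56.0
"""

# read_csv accepts a URL, a file path, or a file-like object such as StringIO.
data = pd.read_csv(io.StringIO(csv_text))

# In a notebook, leaving `data` as the last line of a cell displays the table;
# print works everywhere.
print(data)
```

Rows are numbered from 0, which is why a file with 33 records shows indices 0 to 32.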
This is the data we will use: years of experience and the corresponding salary. The rows are numbered 0 to 32, so there are 33 records in total. We want to represent the data with a straight line, which as we said can be written as y = w*x + b. In our example we use years of experience to predict salary, so experience is x and salary is y. Let's separate x and y first. For x we want the years experience column; to get a column, put its name in square brackets after the data variable. y works the same way, using the salary column. I display x and y to check that they look right.
Now let's plot the data. I open another cell. For plotting there is a very useful package called matplotlib; we use pyplot from it, which I import as plt. Like pandas, matplotlib is preinstalled in Colab, so we can import it directly. We use scatter to draw the plot: pass in the x and y we just separated. Finally, to display the figure, call show and run the cell. You can see each record drawn as a dot. To change the style of the points, say I want crosses, I add the marker parameter and set it to x. I also want to change the color, so I set color to red. Run it and the markers become red crosses. As for what other markers and colors are available, I'll leave that for you to explore. Next, let's add a title to the figure.
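The column selection and scatter plot described above might look roughly like this. The DataFrame is a small made-up stand-in for the course data, and the column names YearsExperience and Salary are assumptions; use whatever names your CSV actually has.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the course data (assumed column names).
data = pd.DataFrame(
    {
        "YearsExperience": [0.3, 0.6, 1.3, 2.0, 3.3],
        "Salary": [36.2, 36.6, 42.8, 43.8, 56.0],
    }
)

# Square brackets with a column name pull out that single column.
x = data["YearsExperience"]
y = data["Salary"]

# Draw each record as a red cross instead of the default dot.
plt.scatter(x, y, marker="x", color="red")
plt.show()
```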
I set the title with title; my title, written in Chinese, says that this is years of experience versus salary. Run it and you can see something is wrong with the title: it shows up as four boxes. The reason is that matplotlib does not support Chinese by default. To display Chinese, we can add a Chinese font ourselves.
First we need to download a Chinese font. I create a new cell for the download. We can use the wget package, but it is not installed in Colab by default, so we install it first. To install a package in Colab, type an exclamation mark followed by pip install and the package name. After installing, import it, and then use its download function, passing the URL of the file to download. The URL can be found in the course files. Run the cell: it installs, imports, and downloads. When the download finishes, open the Files panel on the left and you can find the file we just downloaded, Chinesefont.ttf. I'll close the panel.
After downloading the font, we add it to matplotlib. We import matplotlib, which I call mpl, and from matplotlib's font_manager we import fontManager. Then we use addfont under fontManager to add the font; our file is called Chinesefont.ttf, so that's what we pass. After adding the font, we also need to make it the font in use. For that we use rc under mpl: the first argument is font, and the second specifies the font to use, Chinesefont. That should do it.
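Here is a hedged sketch of the font registration step. It assumes the font file has already been downloaded next to the notebook; the path below is a placeholder, and the helper use_font is a name made up for this example. Instead of typing the family name by hand, the sketch reads it from the file with FontProperties, which is a substitution on my part, not necessarily what the course notebook does.

```python
import os

import matplotlib as mpl
from matplotlib.font_manager import FontProperties, fontManager

FONT_PATH = "ChineseFont.ttf"  # placeholder path; use your downloaded file


def use_font(path):
    """Register a .ttf file with matplotlib and make it the default font.

    Returns the font's family name, or None if the file does not exist.
    """
    if not os.path.exists(path):
        return None
    fontManager.addfont(path)  # make matplotlib aware of the file
    family = FontProperties(fname=path).get_name()  # family name stored in the file
    mpl.rc("font", family=family)  # use it for all text from now on
    return family


family = use_font(FONT_PATH)
if family is None:
    print("Font file not found; download it before running this cell.")
else:
    print("Now using font family:", family)
```

Titles and axis labels drawn after this call use the registered font, so Chinese text should no longer appear as boxes.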
We add the font first, then set it as the font in use, and run again. Now the Chinese displays properly.
Besides the title, we can also set the x-axis and y-axis labels. For the x-axis we use xlabel; our x-axis is years of experience, so that's what I write. For the y-axis we use ylabel; the y-axis is monthly salary in thousands, so I write monthly salary (thousands). Run again and the axes are now labeled with years of experience and monthly salary.
Next, let's draw the line. I add another cell. As mentioned, a straight line can be written as y = w*x + b. Pay special attention: we are using x to predict y, so this y is different from our data y. This one is the predicted value, while the other is the real data. So below, we multiply x by w and add b to get the predicted y, which I call y_pred. Suppose I first set w to 0 and b to 0 and see what that line looks like. To draw a line we use plot under plt: the first argument is the x values and the second is the y values, which this time are the predictions, y_pred. We call show to display it and run. This is the line for w = 0 and b = 0. I can also change its color; I set it to blue, which looks good. Next I want to draw the crosses from above on the same figure.
I copy that code here and run it again. This is our prediction line and these are the real data. Our goal is to find the line that best represents the data. Right now the line is w = 0, b = 0, so we need to find the most suitable w and b.
Before that, let's add a legend to the chart. plot draws the line and scatter draws the crosses. We add a label argument to plot, prediction line, and a label to scatter, real data. After adding the labels, we call plt.legend to display the legend. Run it and the legend shows that the line is the prediction line and the crosses are the real data.
Let's wrap all of this in a function so that we can try different values of w and b. I call it plot_pred; it takes w and b, and the code is indented under it. Now let's call plot_pred. I pass 0 and 0 first and run; the result is the same as before. Suppose I change b to 10 and run again. The line seems not to have moved, but it actually has: the x-axis and y-axis ranges are not fixed, so the graph rescales and looks unchanged. So let's fix the axis ranges. For that we write plt.xlim and set the minimum and maximum. The x values run to a little more than 10, so I set 0 to 12, written as a list containing the minimum and maximum. y works the same way with ylim. Since predictions can be negative, I set the minimum to -60 and the maximum to 140.
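Putting the pieces above together, plot_pred could be sketched as follows. The function name, the labels, and the axis limits come from the walkthrough; the x and y arrays are made-up stand-ins for the course data, and the English axis titles are placeholders for the Chinese ones used in the video.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in data: years of experience and monthly salary (thousands).
x = np.array([0.3, 1.3, 2.0, 3.3, 5.1, 7.0, 9.6])
y = np.array([36.2, 42.8, 43.8, 56.0, 68.3, 91.0, 112.6])


def plot_pred(w, b):
    """Draw the line y_pred = w*x + b together with the real data."""
    y_pred = w * x + b
    plt.plot(x, y_pred, color="blue", label="prediction line")
    plt.scatter(x, y, marker="x", color="red", label="real data")
    plt.xlabel("years of experience")
    plt.ylabel("monthly salary (thousands)")
    # Fixed ranges, so that changing w or b visibly moves the line.
    plt.xlim([0, 12])
    plt.ylim([-60, 140])
    plt.legend()
    plt.show()


plot_pred(0, 10)
```

Without the xlim and ylim calls, matplotlib rescales the axes on every draw, which is why changing b alone appears to do nothing.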
Run it again and the axis ranges are now fixed: the y-axis is -60 to 140 and the x-axis is 0 to 12. Let's try again. With 0 and 0, the line sits at y = 0. If I change b to 10 and run again, the line moves up. So b controls how far the line is shifted up or down. Change it to 40 and it moves further up; with -30 it moves down. Now let's change w. It was 0; change it to 10 and the line tilts upward. At 20 it gets steeper. If I use a negative number, -10, it slopes downward. A positive w slopes up and a negative w slopes down. With -5 we get a line like this; part of it is not visible because it is beyond the border.
Now let's make the figure interactive so that we don't have to keep changing w and b by hand. To add interactive controls in Colab, we can use ipywidgets. I import interact from it. To use it, the first argument is the function name. What we want to adjust are w and b, the two parameters of that function. For w, I want a range from -100 to 100 with a step of 1, and the same for b: -100 to 100 with a step of 1. Run it and interactive controls appear. w and b both default to 0, and I can adjust them.
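A minimal sketch of the interactive version, assuming a notebook environment such as Colab where ipywidgets is available. The data is again made up, and show_controls is a wrapper name invented for this example; it exists only so that the import of ipywidgets happens when you ask for the sliders, not when the cell is loaded.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in data.
x = np.array([0.3, 1.3, 2.0, 3.3, 5.1, 7.0, 9.6])
y = np.array([36.2, 42.8, 43.8, 56.0, 68.3, 91.0, 112.6])


def plot_pred(w, b):
    """Draw the prediction line for the given w and b over the real data."""
    plt.plot(x, w * x + b, color="blue", label="prediction line")
    plt.scatter(x, y, marker="x", color="red", label="real data")
    plt.xlim([0, 12])
    plt.ylim([-60, 140])
    plt.legend()
    plt.show()


def show_controls():
    """Create sliders for w and b that redraw the plot on every change."""
    from ipywidgets import interact  # notebook-only dependency

    # A (min, max, step) tuple of integers becomes an integer slider.
    interact(plot_pred, w=(-100, 100, 1), b=(-100, 100, 1))
```

In a Colab cell, calling show_controls() displays the two sliders; both start at 0, the midpoint of the range.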
When I raise w to 8, the line looks like this; then I raise b and it slowly moves up. Try adjusting different values of w and b and see what the line looks like. I'll leave that for you to play with.
Having seen what lines different w and b produce, our question becomes: which line fits the data best? Let's define that by giving each line a score. First look at a simpler set of data. Suppose I want to represent these data with a straight line, and I use this one, the line with w = 0 and b = 0. We want to score how well this line fits the data. How? It's simple: we measure the distance between the real data and the line. The better the line fits, the smaller the distances. So we compute that distance for each candidate line and pick the line with the smallest total; that is the line that fits the data best.
Let's work through it. In this example there are three data points: (1, 1), (2, 2), and (3, 3). We want to represent them with this line, w = 0 and b = 0, and score how well it fits. We use the sum of the distances between the data points and the line as the basis for the score. For this point the distance is 1 - 0 = 1, for this one 2 - 0 = 2, and here 3 - 0 = 3.
So our formula is written like this. You may notice that besides 1 - 0, I also squared it. We square because it makes the calculation convenient: the differences may be negative later, and squaring takes care of the sign. So in this example we have (1 - 0) squared plus (2 - 0) squared plus (3 - 0) squared, and the sum of distances, or rather the sum of squared distances, is 14.
Let's look at another line, the one with w = 2 and b = 0. We compute the distances the same way: (1 - 2) squared, (2 - 4) squared, and (3 - 6) squared. You can see the differences are negative here, which is why we square them. The sum is again 14.
Now the next line. This one fits well; it is w = 1 and b = 0. Compute the distances and you find that each predicted value equals the real value, so the total is 0. Of the three lines, we say that the one with w = 1 and b = 0 fits the data best.
The score, the sum of squared distances for a given w and b, can be written as a function called the cost function. In our example the cost function takes the real value minus the predicted value, squared: cost = (y - y_pred)^2. In other words, the squared distance between the real data and the line is the cost, which is the score we just talked about. Let's go back to the original example.
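The three-point example can be checked with a few lines of plain Python. The points and the three candidate lines are the ones from the walkthrough; the function name sum_squared_distance is made up for this sketch.

```python
# The three data points from the example.
points = [(1, 1), (2, 2), (3, 3)]


def sum_squared_distance(w, b):
    """Sum of (real y - predicted y)^2 over all points for the line y = w*x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in points)


# The three candidate lines discussed above, all with b = 0.
for w, b in [(0, 0), (2, 0), (1, 0)]:
    print(f"w={w}, b={b}: sum of squared distances = {sum_squared_distance(w, b)}")
# w=0 gives 14, w=2 gives 14, w=1 gives 0
```

The lines with w = 0 and w = 2 tie at 14 even though one lies below the points and the other above, which shows that squaring removes the sign of each difference.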
We want to find the line that best fits the data, so we need a score for each line, which means computing its cost. Suppose for now that every line has b = 0, and let's see what the cost is for different w. We plug them in one by one: first w around -10, whose cost is here, then a second, a third, and so on. After plugging in many points, we find that the cost function is a parabola like this. That is the case where b equals 0. If b is not fixed at 0, what does the cost look like as a function of w and b? Here one axis is w, another is b, and the z-axis is the cost. The graph looks like this, and the red point is the point of lowest cost. Looking from two other angles, you can see that the red point is indeed the lowest. The lowest cost occurs at about w = 10 and b = 20. That is our goal: finding the best-fitting line means finding the w and b for which the sum of squared distances between the line and all the real data is smallest, that is, where the cost is smallest.
Now that we understand this, let's implement the cost function. I've opened a new Colab file for it. First we read the data, the same way as before, so I copy, paste, and run that code. With the data loaded, let's implement the cost function in a new cell.
Our cost function takes the real data, subtracts the predicted value, and squares the result, so we can write it like this. First calculate the predicted value, which I call y_pred; it equals w times x plus b. We don't know anything about w and b yet, so let's assume w is 10 and b is 0. After calculating the predicted value, we take the real data y, subtract y_pred, and square it, so I put the difference in parentheses and square it. I call this cost. Let's display it. You can see 33 values; each is a real value minus its prediction, squared, which is our squared distance. To add up these squared distances, just write .sum() after it. This is the sum of squared distances. Because this value can get very large, we usually take the average. We have 33 records, so we could divide by 33, but it is better to divide by the length of x, which is 33, so that the code still works if the data changes. The result is the average squared distance. Next, let's write this calculation as a function so that we can plug in different values of w and b. Open another cell. I call the function compute_cost; it takes our data x and y and the w and b of the line. The calculation is the same as above, so I paste it in: first compute the predicted value, w times x plus b, then compute the squared differences, add them up, and take the average. I assign that averaged result to cost.
Then we return cost. Let's see what results different values of w and b produce. We call the function here: x and y are the real data, and for w and b we use the same values as before, 10 and 0. When I run it, it says compute_cost is not defined. Oh, the cell above has not been run yet, so run that first and then this one. The value returned is a little over 602, the same as before. Change the input to 10 and run it again, and the cost this time is a little over 227. You can plug in other values yourself to see what cost different w and b produce. Next, let's fix b at 0, let w range from -100 to 100, and see what the cost is. I first define costs as an empty list, which will store the cost for each w between -100 and 100. We can use a for loop for the calculation. I let w run over range(-100, 101); the end has to be written as 101 so that the values go from -100 up to and including 100. Then we call the cost function with x, y, the current w, and b fixed at 0, call the result cost, and append it to the costs list. Finally I display it. Wow, there are 201 values in total for w from -100 to 100. I'll hide that output for now and draw the cost for each w as a picture. We import our plotting tool, pyplot from matplotlib, which I call plt. Then we can use scatter to draw each w against its cost. Running it, you can see the points for w between -100 and 100, 201 points in total.
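The cost function and the sweep over w described above can be sketched as follows. The four data points are hypothetical stand-ins, since the course reads its 33 records from a file that is not shown here; only the names compute_cost, costs, x, y, w, and b come from the walkthrough.

```python
import numpy as np

# Hypothetical stand-in data; the course reads 33 real records from a file instead.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([12.0, 21.0, 33.0, 39.0])

def compute_cost(x, y, w, b):
    """Mean squared distance between the data and the line y = w * x + b."""
    y_pred = w * x + b
    cost = (y - y_pred) ** 2
    return cost.sum() / len(x)

# Fix b at 0 and sweep w over the integers from -100 to 100 inclusive.
w_values = range(-100, 101)
costs = []
for w in w_values:
    costs.append(compute_cost(x, y, w, 0))

# Position of the smallest cost in the list, and the w it belongs to.
best = min(range(len(costs)), key=costs.__getitem__)
print(len(costs), "costs; lowest at w =", w_values[best])
```

Passing w_values and costs to plt.scatter or plt.plot gives the parabola discussed in the walkthrough.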
The 201 points are packed very densely when drawn this way. Besides a scatter plot, we can also connect them into a line, here a parabola. We use plot, pass in the w values and the corresponding costs, and run it to get the connected parabola. Let me add a title and labels. The title says this is the cost function when b is 0 and w is between -100 and 100. Then set xlabel to w and ylabel to cost, and run it; the title and the x-axis and y-axis labels all show up. Next we take b into account as well and let b also range from -100 to 100 to see what the cost is. Here I import a very useful tool for matrix operations, numpy, which I call np. Now both w and b are between -100 and 100, so we can use arange under np. It works almost the same as range above: I want an array from -100 to 100 with a step of 1, so the end is again 101. I call it ws because there are many w values, and likewise bs for b. Then I create a two-dimensional matrix with np.zeros, which creates a matrix filled with 0s. The first dimension is 201 and the second is also 201, because there are 201 values of w and 201 values of b. This matrix stores the cost for each combination of w and b, so I call it costs. Now we can calculate. I use one for loop to go through all the values in ws and, inside it, another for loop to go through all the values in bs, and compute the cost for each pair. We already wrote the function above, so just copy the call down here: x and y are the real data, and passing in w and b gives the cost. To store it in the costs matrix we need indexes, so we define i, equal to 0 at first, before the outer loop, and then define j inside the outer loop.
With j also starting at 0, the entry of costs at position i, j is set to this cost. Then we add 1 to j at the end of the inner loop, and finally add 1 to i at the end of the outer loop. After the calculation, the cost for every combination of w and b is stored in costs. Let's display it. When I run it, it says numpy has no such attribute; I wrote an extra r in arange. Run it again. It takes a little while. The result is a two-dimensional matrix in which each value is the cost calculated for one pair of w and b. Next, let's draw the cost as a function of w and b together. Because we have to consider w and b at the same time, we draw a 3D graph. To create one we can write plt.axes and set projection to "3d". I call the result ax. Let's display it first. You can see that a 3D plot is created, but it is empty for now. The panes of the plot are a bit gray, and I don't want that, so I can change it with ax.xaxis.set_pane_color. I want it to be white, and I can write an RGB value here; I type 0, 0, 0 (in standard RGB terms white is 1, 1, 1, so adjust the value if the pane does not come out white). When I run it, it says there is no such attribute; I had written the name in the wrong order. After fixing it, you can see the pane on the x-axis side turn white. I copy the line for the y-axis and z-axis, changing x to y and z, and run it again. Now it is all white, which looks much nicer. Next, let's draw the cost for each w and b as a surface. We can use plot_surface on ax and pass in the w values and b values we just created, ws and bs, followed by the costs we stored. Pay special attention here: we cannot simply pass in the two one-dimensional arrays ws and bs. What plot_surface needs is a two-dimensional grid generated from those two one-dimensional arrays.
To generate this two-dimensional grid we can use meshgrid from numpy. Pay special attention to the order: we pass in bs first and then ws, and it returns the two-dimensional grids, which I call b_grid and w_grid. If you want to know more about what this grid is and how it works, you can look at this URL; the explanation there is very detailed, so I won't go into it here. I will put the URL below for anyone who is interested. Then we change the plot_surface call to pass in w_grid and b_grid, and there is no problem. Run it, and the surface plot is displayed. Let's add a title and labels. Here we write ax.set_title. Note that we need set_ in front, unlike before, where we just wrote title. The title says this is the cost for each w and b. I want to write it in Chinese, and for that we need to add a Chinese font. So I go back to the previous file, where the Chinese font was added: it has to be downloaded first and then registered, so I paste both parts here, first the download and then the code that adds the font, and run them. After the title is set, we set the labels of the x, y, and z axes, again with the set_ prefix: set_xlabel for the x-axis, whose label is w; then I copy the line and set the y-axis label to b and the z-axis label to cost. Display it again, and you can see w, b, and the corresponding cost on the z-axis. We can also rotate this plot.
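To see why bs is passed to meshgrid before ws, the sketch below builds a small cost matrix with the same nested-loop layout (rows follow ws, columns follow bs) and checks that the grids line up with it. The ranges are shrunk and the "cost" is a made-up expression chosen only to make the orientation easy to verify; enumerate replaces the manual i and j counters from the walkthrough.

```python
import numpy as np

ws = np.arange(-2, 3)   # 5 values of w (small hypothetical range)
bs = np.arange(-1, 2)   # 3 values of b

# Stand-in "cost" that makes the orientation easy to inspect.
costs = np.zeros((len(ws), len(bs)))
for i, w in enumerate(ws):
    for j, b in enumerate(bs):
        costs[i, j] = 10 * w + b

# bs first, then ws: both grids come back with shape (len(ws), len(bs)).
b_grid, w_grid = np.meshgrid(bs, ws)

print(costs.shape, w_grid.shape, b_grid.shape)
# True means entry [i, j] of every array refers to the same (w, b) pair.
print(np.array_equal(costs, 10 * w_grid + b_grid))
```

With the arguments in the other order the grids would have shape (len(bs), len(ws)), and the surface would be drawn against the wrong axes whenever the two ranges differ.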
To rotate it we can call view_init, which takes two parameters: the first is the up-and-down angle (elevation) and the second is the left-and-right angle (azimuth). Suppose I write 45 for the first and -120 for the second. Run it again, and the plot turns like this: this side is w, this side is b, and this is the cost. You can play with the angles yourself. I also think the color of the surface is not very good-looking, so let me change it. plot_surface has another parameter called cmap; I set it to Spectral_r. Run it, and the colors look much better. I'll leave it to you to explore the other available color maps; I think this one is good. We can also set the opacity, a parameter called alpha. I set it to 0.7 and run it, and now the surface looks more transparent and more pleasant. To make the plot look even better, I can add a wireframe. I call plot_wireframe with the same first three arguments and set the color to black. Run it, and the wireframe is added, but it is a bit too dark, so I also set its alpha, say to 0.1. Run it again, and it looks clean and beautiful. Next, let's find the point with the lowest cost. We can use min from numpy on the whole costs matrix. I print it to see what the lowest cost is: a little over 32.69. To find the w and b that correspond to this cost, we can use np.where to locate the position of the lowest cost among all the costs.
np.where gives us the index of that position. Because costs is a two-dimensional matrix, it returns two indexes, which I call w_index and b_index. Print them and run it: the indexes found are 109 and 129. We then look up the w value at that index in the ws array, and the same for b. Run it again to see the corresponding values: w is 9 and b is 29. That is, the cost is smallest when w equals 9 and b equals 29. We can print a sentence saying that when w equals this value and b equals this value, we get the minimum cost. The minimum cost itself can be read from the costs matrix at those two indexes. Run it again, and it says that when w equals 9 and b equals 29 we get the minimum cost, a little over 32.69. Finally, we can also draw the minimum-cost point. To draw it we use scatter on ax: the first argument is the w value, then the b value, then the z value, which is the cost. Run it, and you can see the minimum-cost point here. I want to change its color, so I set color to red, and I want it bigger, so I set s, which is the marker size, to 40. Run it again, and it looks much better. Lastly, I want to make the whole figure bigger, which I can set at the top with plt.figure and its figsize. figsize takes two values, the width and the height of the figure. Suppose I start with 5 and 5. Let's see; it's still a little small, so I make it bigger. 7 and 7 looks better.
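The lookup described above, from the minimum cost to its indexes and then to the matching w and b, can be sketched like this. The cost surface here is a synthetic bowl whose minimum is deliberately placed at w = 9, b = 29 so that the indexes match the walkthrough; it is not the course's real cost matrix.

```python
import numpy as np

ws = np.arange(-100, 101)
bs = np.arange(-100, 101)

# Synthetic bowl with its minimum placed at w = 9, b = 29 (not the course data).
b_grid, w_grid = np.meshgrid(bs, ws)
costs = (w_grid - 9) ** 2 + (b_grid - 29) ** 2

# np.where returns one index array per dimension of costs.
w_index, b_index = np.where(costs == np.min(costs))
print(w_index, b_index)                  # index arrays, one entry per minimum
print(ws[w_index[0]], bs[b_index[0]])    # the w and b at the first minimum
print(costs[w_index[0], b_index[0]])     # the minimum cost itself
```

Because ws starts at -100, index 109 corresponds to w = -100 + 109 = 9, and index 129 corresponds to b = 29 in the same way.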
The other parameters, such as the rotation angles, the figure size, or the colors, you can adjust yourself. That is our implementation of the cost function. Now that we have the cost function, the question becomes how to find the best w and b efficiently, where the best w and b are the ones at the point with the lowest cost. As you can see from the implementation, we used brute force. I enumerated every value of w from -100 to 100, computed its cost, and picked the lowest point. With both w and b it was the same: I enumerated every combination of w and b from -100 to 100 and found the lowest cost among them. But this is not a good approach; we need to find the best w and b efficiently. For that we can use a method called gradient descent. Don't overthink it: gradient descent simply changes the parameters according to the slope. In our example the parameters are w and b, so we change w and b according to the slope. Let's look at how it works with this example. Assume b is 0 and consider only w. How do we find the w that minimizes the cost? First we set an initial value for w. It can be chosen at random; let's say I set it here, at about -75. Then we can calculate the slope of the tangent at this point by differentiation. Pay special attention: when we use gradient descent, we do not actually know the blue line. We only know this blue parabola because we computed it by brute force, so in practice you don't know that the lowest point is here. Let's restate the problem. It goes like this.
Today you are blindfolded and dropped in a place where you can only walk forward or backward. What is your goal? To reach the lowest point. Fortunately, you have a way to measure how steep the ground is in front of and behind your current position, and you can use this steepness to find the way down. Back to the original example: at this point you can calculate the slope of the tangent by differentiation, and the slope of the tangent plays the role of the steepness. We can then use this slope to find the way down. Let me briefly explain how the slope is calculated. First we need to decide what the cost function is. In our example, we take the real data, subtract the predicted value, and square the result; in math, y minus y_pred, squared. y_pred can be written as w times x plus b, because we represent the data with a straight line. In this example b is 0, so we can drop it, and then we only need to differentiate with respect to w to get the slope of the tangent. After differentiating, it looks like this. I won't show the detailed steps; if you are interested you can work through them yourself. In fact it doesn't matter if you don't know differentiation at all, because there are many tools that can do this calculation for us automatically. Once we have the derivative, if we want the tangent slope at w equal to -75, we plug in -75. The x and y here are our real data, so plugging them in as well gives us the slope. Knowing the slope is the same as knowing the steepness here, and once we know the steepness, we can go down. How do we go down?
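For anyone who wants to see the differentiation step that the walkthrough skips, here is a short derivation for the b = 0 case. It is written per data point; averaging over all records, as the later implementation does, only changes the result by a constant factor of 1/n.

```latex
\begin{aligned}
\text{cost}(w) &= (y - y_{\text{pred}})^2 = (y - w x)^2 \\
\frac{d\,\text{cost}}{dw} &= 2\,(y - w x)\cdot(-x) = 2x\,(w x - y)
\end{aligned}
```

Plugging in w = -75 together with the data values x and y gives the tangent slope at that point.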
We update w by subtracting from it the slope multiplied by a learning rate. I will explain what the learning rate is later; for now, look at subtracting the slope from w. In this example w is about -75 and the tangent slope is clearly negative, so subtracting a negative number from w means adding a value, and w moves forward. We then repeat this: calculate the tangent slope at the new point by plugging in the numbers, and put the slope into the update formula. Now w is about -60, and you can see that the slope is still negative, so subtracting a negative number again adds a value and w moves forward again. We keep repeating this: calculate the slope, update w, calculate the slope, update w. We repeat until we are close to or at the lowest point, where the slope of the tangent is very close to 0. Once the slope is very close to 0, subtracting a value close to 0 from w leaves w essentially unchanged, and that tells us we have found the lowest point. That is roughly how gradient descent works. Next, let's look at the learning rate. Go back to the earlier example. We start from an initial w and calculate its tangent slope. To go down, we subtract from w the slope multiplied by a learning rate. The learning rate is a value you have to set yourself; you decide how large it is. Look at the product of the slope and the learning rate: the larger your learning rate, the larger this product will be in magnitude.
It may be a larger positive value or a larger negative value. When we subtract a value of larger magnitude from w, w changes more; in other words, the step is larger. Conversely, if your learning rate is smaller, the product is smaller, subtracting it changes w less, and the step is smaller. Let's see what different learning rates look like. First a high learning rate: the steps are relatively large. Seeing this, you may ask whether the steps are gradually getting smaller. That's right, they are. Even though the learning rate stays the same, the steps shrink. Why? Because the slope also changes. The slope here is relatively large, and the closer we get to the lowest point, the smaller the slope becomes, so the steps become smaller too. Now let's see what a smaller learning rate looks like: the steps are smaller. Let's put the two pictures side by side. The one on the left has a high learning rate and the one on the right a low learning rate, and you can see that the steps on the left are larger and those on the right are smaller. At this point you may say that we should simply set the learning rate high, because then we go down faster and reach the lowest point sooner. That is a very good question, but the learning rate can also be too large. What does it look like when it is too large?
Take a look. The first step brings you here, and with the second step you don't land on the lowest point; you step right over it to the other side. With the third step you step back over again. You keep stepping back and forth across the bottom and never reach it, because your steps are too big. That is the problem with a learning rate that is too large. What happens if the learning rate is too small? Every step is tiny, so tiny that it takes an impractically long time to reach the lowest point. So the learning rate can be neither too large nor too small; we need a moderate value. How do we find it? Through repeated experiments and tests. That is the learning rate. Now let's go back to gradient descent itself. The examples so far only considered w. If we also have to consider b, how does gradient descent work? Basically the same way. First you set random initial values for w and b. The picture you see here was again obtained by brute-force enumeration, so in practice you don't know where the lowest point is. Let's apply the same metaphor. Today you are taken to some place and blindfolded. The terrain is something like a canyon, and your goal is to find its lowest point. This time, unlike last time, you can walk not only forward and backward but also left and right, and you have a way to measure the steepness front to back and the steepness left to right. With these you can find the way down. As before, the steepness means the slope.
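The effect of the learning rate on the single-parameter case can be tried out with a small sketch. The data, the three learning rates, and the helper names slope_w and run are all hypothetical choices for illustration; only the update rule (w minus slope times learning rate) and the starting point of about -75 come from the walkthrough. In this toy setup the too-large rate does more than bounce around the bottom: each step lands farther away than the last.

```python
# Hypothetical toy data on the line y = 2x, so with b fixed at 0 the best w is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def slope_w(w):
    """Average derivative of (y - w*x)^2 with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def run(learning_rate, steps=50, w=-75.0):
    """Repeat the update w = w - slope * learning_rate and return the final w."""
    for _ in range(steps):
        w = w - slope_w(w) * learning_rate
    return w

# Too small, moderate, and too large learning rates, in that order.
for lr in (0.0001, 0.05, 0.25):
    print(f"learning rate {lr}: w after 50 steps = {run(lr):.4f}")
```

Which values count as too small or too large depends on the data, which is why the walkthrough suggests finding a workable rate by experiment.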
So in one direction we calculate the slope along w, and in the other the slope along b. To calculate the slopes we again differentiate the cost function. In this example the cost function subtracts the predicted value from the real data and squares the result, that is, y minus y_pred, squared, and y_pred can again be written as w times x plus b. Differentiating with respect to w gives the slope in the w direction, and differentiating with respect to b gives the slope in the b direction. After differentiating, the slope in the w direction looks like this and the slope in the b direction looks like this. To get the slopes at this point, we just plug in the values: the current w, the current b, and x and y, which are the data we have. That gives us the slope in the w direction and the slope in the b direction, and then we can update w and b, which means we can go down. To update w, we subtract from w the slope in the w direction multiplied by the learning rate. To update b, we subtract from b the slope in the b direction multiplied by the learning rate. That takes us one step down. At the new point we recalculate both slopes and update again, and so we slowly walk toward the lowest point. But we are walking blindfolded, so how do we know whether we have reached it? As before, the closer you get to the lowest point, the smaller the slopes in the w and b directions become, and subtracting a very small value from w and b means they hardly change. We can use this to judge where we are. As for the learning rate, you have to set it yourself, and it can be neither too large nor too small.
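The two slopes and the update step described above can be written out as follows. The expressions are per data point, and the symbol α for the learning rate is a notational assumption; the transcript does not name it.

```latex
\begin{aligned}
\text{cost}(w, b) &= \bigl(y - (w x + b)\bigr)^2 \\
\frac{\partial\,\text{cost}}{\partial w} &= 2x\,(w x + b - y) \\
\frac{\partial\,\text{cost}}{\partial b} &= 2\,(w x + b - y) \\
w &\leftarrow w - \alpha\,\frac{\partial\,\text{cost}}{\partial w},
\qquad
b \leftarrow b - \alpha\,\frac{\partial\,\text{cost}}{\partial b}
\end{aligned}
```

Setting b = 0 in the first slope recovers the single-parameter case discussed earlier.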
If your learning rate is too high, the steps are so large that you may never reach the lowest point. If it is too small, the steps are so small that you may not reach the lowest point even if you walk until the end of time. That is how gradient descent works. Now let's implement it. First we read the data; this is the same as before, so I use the copied code. As we said, gradient descent calculates the slope and then updates the parameters. To get the slope in the w direction we differentiate the cost function with respect to w, and for the b direction we differentiate with respect to b. Let's do the calculation. Differentiating the cost function with respect to w gives 2 times x times the quantity w times x plus b minus y. Differentiating with respect to b gives the same thing without the multiplication by x. I call the slope in the w direction w_gradient and the slope in the b direction b_gradient. Suppose I want the slopes in the w and b directions when w is 10 and b is also 10. I first set w to 10 and b to 10. Let's look at the w direction first. Running it, we get 33 values, because we have 33 records and each one produces its own slope. We average them: add them up and divide by the number of records, which I call n and compute as the length of x.
Run it again. The averaged result is about -118. Now the b direction. We have 33 records, so it also produces 33 values; again we add them up and divide by n. Run it, and the average is -27.46. In fact, to take the average we can also just write .mean(), which computes the same thing without having to calculate the length of x separately. Run it, and the result is the same, so I change the w direction to use it too. For convenience, I wrap the slope calculation, or gradient calculation if you prefer, in a function. I call it compute_gradient. It takes x and y, our data, and the values of w and b, and returns the results for the w direction and the b direction. I open another cell to try it. Suppose I want the slope when w is 20 and b is 10. The cell with the function has to be run first, and then this one. The w direction is a little over 537 and the b direction a little over 70. Once we have the slopes in the w and b directions, we can update w and b. Our update rule is to subtract from each the corresponding slope multiplied by a learning rate. Let's assume the initial w is 0 and the initial b is also 0. You can choose these at random; I use 0. We calculate the slope at w equal to 0 and b equal to 0 with the function we just wrote: we pass in w and b, and it returns the slopes in the w and b directions. Then we can update w and b according to those slopes.
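A minimal sketch of the gradient function described above, keeping the factor of 2 as it stands at this point in the walkthrough. The three data points are hypothetical and lie exactly on y = 2x + 1, so the correct answer is known in advance; the names compute_gradient, w_gradient, and b_gradient follow the transcript.

```python
import numpy as np

# Hypothetical stand-in data; the course uses 33 real records.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])   # exactly y = 2x + 1

def compute_gradient(x, y, w, b):
    """Average slope of the squared-error cost in the w and b directions."""
    w_gradient = (2 * x * (w * x + b - y)).mean()
    b_gradient = (2 * (w * x + b - y)).mean()
    return w_gradient, b_gradient

print(compute_gradient(x, y, 0, 0))  # both negative: w and b should increase
print(compute_gradient(x, y, 2, 1))  # both zero at the exact fit
```

A negative slope means the update, which subtracts the slope times the learning rate, will push the parameter upward.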
For w, we subtract from w the slope in the w direction multiplied by a learning rate. After some tests and experiments, I found 0.001 to be a good learning rate. Updating b works the same way: subtract from b the slope in the b direction multiplied by the learning rate. I assign the updated results back to w and b. Display them and run it: after the update, w has changed from 0 to a little over 0.87 and b from 0 to a little over 0.14. Now let's check whether moving w and b from 0, 0 to these values really reduces the cost, that is, whether we are really going down. We already wrote compute_cost, so I copy it over and use it directly. I print the cost before and after. You can see it was 6,040 and is now 5,286, so the cost really is decreasing; we really are going down. Let's pause here and look back at how the gradient is calculated. Find compute_gradient. Whether it computes the gradient in the w direction or the b direction, you will have noticed a multiplication by 2. This multiplication by 2 can actually be omitted, so here I remove it and run the cell again. Why can it be omitted? Look at the update below: we subtract from w the slope in the w direction multiplied by the learning rate, and from b the slope in the b direction multiplied by the learning rate. Multiplying both slopes by 2 inside compute_gradient is equivalent to multiplying by 2 here in the update.
So this multiplication by 2 is unnecessary: it only affects the size of the step, and the step size is already controlled by the learning rate. Instead of writing it in the gradient, we could write it here: multiply the learning rate by 2 and calculate again, and the result is the same. So I delete the multiplication by 2 and run it again. This time the cost goes from 6,040 to 5,656. That is the result of a single update of w and b. Now let's see what happens after 10 updates. I use a for loop to repeat the update 10 times, and I record the values of w, b, and the cost each time. I write the record with an f-string. First I print which iteration this is, the i-th update; then the current cost; then the values of w and b. Run it and see. At iteration 0 the cost is this value, followed by w and b. Is the cost really decreasing? It is, from 5,656 to 3,161. I add some spacing between the fields and run it again, which reads much better. But the numbers have many decimal places, which looks untidy. To display only two decimal places, write :.2f after the value inside the braces; for three places it would be :.3f, and so on. I also limit w and b to two decimal places and run it again, and it is much neater. The cost keeps decreasing while w and b are updated each time. Let's try 20 iterations. Run it, and the cost again decreases all the way.
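The claim that the factor of 2 can be folded into the learning rate can be checked directly. The sketch below performs one update from w = 0, b = 0 in two ways: with the factor in the gradient and a learning rate of 0.001, and without the factor but with the learning rate doubled. The data and the helper names gradient and one_update are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # hypothetical data, exactly y = 2x + 1
y = np.array([3.0, 5.0, 7.0])

def gradient(x, y, w, b, factor):
    """Mean gradient of the squared error; factor is 2 (kept) or 1 (omitted)."""
    error = w * x + b - y
    return (factor * x * error).mean(), (factor * error).mean()

def one_update(w, b, learning_rate, factor):
    """Apply a single gradient descent step and return the new (w, b)."""
    w_gradient, b_gradient = gradient(x, y, w, b, factor)
    return w - w_gradient * learning_rate, b - b_gradient * learning_rate

with_two = one_update(0.0, 0.0, 0.001, factor=2)      # factor 2 in the gradient
without_two = one_update(0.0, 0.0, 0.002, factor=1)   # factor moved into the rate
print(with_two, without_two)
```

Dropping the factor while keeping the learning rate at 0.001 therefore just halves every step, which matches the smaller first-step improvement reported in the walkthrough.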
By the 20th update the cost is down to 1,705. But now the output looks untidy again. The reason is that from the 10th update on, the iteration number takes up two characters. If we want it to always take up the same number of characters, we can write a colon after it followed by the width we want. Suppose I want it to take up 5 characters; then I write :5 and run it. Now it takes up 5 characters no matter what the number is, which is much neater. Besides the cost and the values of w and b, I also record the slopes in the w direction and the b direction. These also have a lot of decimal places, so I display only 2 digits. Run it again, and now everything is recorded. Looking at the cost, we can see that it is still clearly declining, so I let it run a few more times to see how far it can drop. After 100 updates it still seems to be declining. But you can see that the right-hand side is becoming ragged again. The reason is that the numbers have different magnitudes, maybe two or three digits before the decimal point, so the columns still go out of alignment. If we really want it to be neat, we can use scientific notation. Here I change .2f to .2e, which shows two digits after the decimal point and presents the rest in scientific notation. What is scientific notation? Let's run it and see. You can see that it writes 5.66e+03, which means 5.66 multiplied by 10 to the third power, or about 5,660. If what you see is 5.66e-03, that is 5.66 multiplied by 10 to the power of -3. Written in scientific notation, the output is much neater. The cost is still falling, so I let it run longer. It ran 100 times before; now I let it run 1,000 times. Run it and take a look. It is still falling at 42, and then still falling at 41.
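The format specifications mentioned so far can be tried on their own. The values below are made up for illustration; only the format codes come from the lecture.

```python
cost = 5656.123456   # hypothetical value
iteration = 7        # hypothetical value

two_decimals = f"{cost:.2f}"   # fixed-point, 2 digits after the decimal point
padded = f"{iteration:5}"      # right-aligned in a field 5 characters wide
scientific = f"{cost:.2e}"     # scientific notation, 2 digits after the point
small = f"{0.00566:.2e}"       # a small number in scientific notation

print(two_decimals)   # 5656.12
print(repr(padded))   # '    7'
print(scientific)     # 5.66e+03
print(small)          # 5.66e-03
```

A width and a precision can be combined in one specification, for example `f"{cost:10.2f}"`.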
Let's let it run even longer: I let it run 10,000 times. Printing 10,000 lines is too much, so I print only once every 1,000 iterations. Here I check whether i is divisible by 1,000, and only then print this information. Run it again, and it prints only once every 1,000 iterations. Now we can see that this part looks a little messy. The reason is that some values have an extra minus sign. To solve this, we can add a space at the front of the format specification. With the space, one extra character is reserved for the sign. I reserve one extra character in each of these places and run it again, and the problem is solved. Looking at the cost, it is still declining: after 10,000 iterations it seems to be still going down, from 3.51 times 10 to 3.39 times 10. Let's let it run 20,000 times and see. Here I also display more decimal places: I change it to .4e and run it again, so it displays 4 digits after the decimal point. The cost continues to decline, but the decline has become very small. We let it run 20,000 times, and we can see that the decline gets smaller and smaller. Now look at the slopes next to it. The slope in the w direction and the slope in the b direction are also very, very small, almost 0. Next I write this gradient descent process as a function so that it is convenient to use later. I name the function gradient_descent. There are many things to pass in: x and y, which are our data; the initial w and the initial b; the learning rate; the cost function used to judge how good the fit is; the gradient function used to calculate the slope; and finally the total number of iterations to run, run_iter, and how often to print the information, which I call p_iter. I give p_iter a default value of
1,000, that is, it prints once every 1,000 iterations. So here we change 1,000 to p_iter, and we change 20,000 to run_iter. The initial w is w_init, so I set w equal to w_init, and the initial b is b_init. Then the function that calculates the cost: compute_cost has to be changed to the cost function that was passed in, and compute_gradient to the gradient function that was passed in. While I'm here, I also store the cost and the values of w and b along the way. For the cost I use a list called c_hist, for w a list called w_hist, and for b a list called b_hist. These three lists store the cost, w and b for each of the many updates we run. I store w, b and the cost after each update, so I write w_hist.append to store w, and do the same for b and the cost. Finally I do the return. I return the final w and b, together with all the w, b and cost values recorded along the way. Let's run it. It reports a problem. Let me open another cell here. Before using the function, we need to set the initial w and b. Suppose my initial w and b are both 0, and the learning rate is 0.001. I can also write this in scientific notation: 1.0e-3, that is, 1.0 multiplied by 10 to the power of -3, which is 0.001. For the cost function, we pass in compute_cost, the function we wrote to calculate the cost, and for the gradient function, we pass in the compute_gradient we wrote. A function can be passed in as an argument like this. Then comes run_iter. As for p_iter, I can leave it alone; it doesn't need to be set. Okay, let's run it and see. It returns 5 values, so I write them out: the first is the final w, which I call w_final; the final b I call b_final; and then the stored w, b and cost histories. Let's run it. Why did it print only once? Let's see why it printed only once.
Oh, I accidentally put the return inside the for loop, so it should be moved outside the loop. Run it again, and now there should be no problem. It prints every 1,000 iterations. We can see that the cost continues to decrease, and the slopes are decreasing too. The final w can be seen to be a bit more than 9.17, and b is about 27. I also print out the final values of w and b to take a look: the final w is a bit more than 9.14 and the final b is about 27.88. Here too I don't want so many decimal places, so I display just two digits with .2f. The final w and b are 9.14 and 27.89. Now we can use these final values to make predictions. You should not have forgotten the problem we want to solve, but let me review it. Suppose you are the boss of a start-up company and you want to hire your first employee, but you don't know how much to pay. So you go to the market and collect some information related to this position, namely seniority and the corresponding salary. You want to use a straight line to represent these data, and once you have it, you can use the line to predict how much salary the employee should be given. Now that we have found this straight line, we can make predictions. Suppose the applicant tells you that they have three and a half years of work experience, which means their seniority is 3.5 years. Then we can calculate how much salary to offer. With a seniority of 3.5, we multiply the w_final we found by 3.5 and then add b_final. This is because we use a straight line to represent the data, and a straight line can be written as w multiplied by x plus b; here x is 3.5, and the unit of the result is k. It says that with a seniority of 3.5, we can offer about 59.88k. That is too many decimal places, so I'll show just two digits, or rather, one.
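Putting the pieces described above together, the function and the prediction step might look like the sketch below. The parameter names follow the ones spoken in the lecture, but the body is a reconstruction, not the author's notebook, and the data is a made-up set of points lying exactly on y = 9x + 28 so that the result can be compared with a known line. The cost is assumed to be the mean squared error, and the gradient is written without the factor of 2.

```python
import numpy as np

def compute_cost(x, y, w, b):
    return np.mean((y - (w * x + b)) ** 2)

def compute_gradient(x, y, w, b):
    error = (w * x + b) - y
    return np.mean(x * error), np.mean(error)

def gradient_descent(x, y, w_init, b_init, learning_rate,
                     cost_function, gradient_function, run_iter, p_iter=1000):
    c_hist, w_hist, b_hist = [], [], []
    w, b = w_init, b_init
    for i in range(run_iter):
        w_gradient, b_gradient = gradient_function(x, y, w, b)
        w = w - w_gradient * learning_rate
        b = b - b_gradient * learning_rate
        cost = cost_function(x, y, w, b)
        w_hist.append(w)
        b_hist.append(b)
        c_hist.append(cost)
        if i % p_iter == 0:  # print only once every p_iter iterations
            print(f"Iteration {i:5}: cost {cost: .4e}, w: {w: .2e}, b: {b: .2e}, "
                  f"w_gradient: {w_gradient: .2e}, b_gradient: {b_gradient: .2e}")
    # The return sits outside the loop, so all run_iter updates are performed.
    return w, b, w_hist, b_hist, c_hist

# Hypothetical data on the line y = 9x + 28; not the course dataset.
x = np.arange(1.0, 11.0)
y = 9.0 * x + 28.0

w_final, b_final, w_hist, b_hist, c_hist = gradient_descent(
    x, y, w_init=0.0, b_init=0.0, learning_rate=1.0e-3,
    cost_function=compute_cost, gradient_function=compute_gradient,
    run_iter=20000)

print(f"final w, b: {w_final:.2f}, {b_final:.2f}")
predicted = w_final * 3.5 + b_final  # a straight line: w * x + b
print(f"seniority 3.5 -> predicted salary {predicted:.1f}k")
```

Passing the cost and gradient functions in as arguments means the same loop can be reused with a different model, as long as the replacement functions take the same four arguments.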
One decimal place is fine. Run it again, and we get 59.9k: according to the prediction, we can offer about 59.9k. Let me add some spacing here. Suppose another applicant comes along and tells you that their seniority is 5.9 years. Then we simply make another prediction. Here I also add a label, "predicted salary", and run it. If the seniority is 5.9, the predicted salary is about 59.9k... no, that's not right, I have to change the value to 5.9. Run it, and the predicted salary is about 81.8k. In this way our problem is solved: we can predict salary like this. Next, I'll plot some of the data to take a look. For example, I can plot the cost. To draw a graph we can use the plotting tool matplotlib, together with numpy. Let's plot how the cost declined over the 20,000 updates. We can use plt.plot to draw it as a line. For the x values we use np.arange; there were 20,000 updates, so it runs from 0 to 20,000. The y values are the history of our cost, which is stored here after each update. Let's display it. Run it, and you can see what it looks like. Then I add a title and some labels. The title describes the 20,000 updates and their cost, so I write "iteration vs cost". Then I add an x label: the x-axis is the number of updates, so it is "iteration", and the y-axis is "cost". Run it again, and the title and labels are all in place. From this plot we can see that the cost drops very quickly at the beginning and more slowly later on. If we want to look at the early part in detail, say only updates 0 to 100, I change the range to 0 to 100 here and write :100 on the history, that is, the first 100 values. Run it again, and this is how the cost fell over the first 100 updates. If you want to look at other intervals, you can set that yourself.
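The cost plot described above can be sketched as a standalone script. The data and the short training loop are stand-ins that only produce a cost history to plot, and the Agg backend and `savefig` call are assumptions made so the script runs without a display; in a notebook you would simply let the figure render.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data and a short gradient descent run to produce a cost history.
x = np.arange(1.0, 11.0)
y = 9.0 * x + 28.0
w, b, learning_rate = 0.0, 0.0, 1.0e-3
c_hist = []
for _ in range(2000):
    error = (w * x + b) - y
    w, b = w - np.mean(x * error) * learning_rate, b - np.mean(error) * learning_rate
    c_hist.append(np.mean((y - (w * x + b)) ** 2))

# Plot only the first 100 updates: x values 0..99, y values c_hist[:100].
plt.plot(np.arange(0, 100), c_hist[:100])
plt.title("iteration vs cost")
plt.xlabel("iteration")
plt.ylabel("cost")
plt.savefig("cost_first_100.png")
```

To look at a different interval, change both the `np.arange` bounds and the slice so that the two have the same length.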
It seems I left an "a" out of "iteration", so I fix that and run it again. Finally, we can also draw the update process of w and b over the 20,000 updates. Let's first go back to the earlier section on the cost function and copy the 3D plot. The 3D plot is here, so we copy the code that draws it. At that time we set w and b to range from -100 to 100, so the calculation of the cost for w and b from -100 to 100 has to be copied over as well. I open another cell to do this calculation and let it compute first. We also use a Chinese font here, so we have to download the font. In the same way, I go back to the cost function section, find the place where the font was downloaded, copy it, and open a cell to download it. Okay, let's wait for the calculation, the installation and the drawing to finish. OK, the plot is done. I adjust its viewing angle: it was originally 0 degrees and 0 degrees, and I set it to 20 degrees and -65 degrees and run it again. That angle is good. The red point is the lowest point. Next we draw the update process of w and b as a line. To draw a line we can use plot and pass in the w_hist, b_hist and cost history, c_hist, that we saved. Let me run it first. You can see that it draws the line, but we can't tell where the starting point is, so I draw the starting point as well. To draw a point I use scatter; here I reuse the code I copied. The starting point is the 0th value of w_hist, the 0th value of b_hist and the 0th value of the cost history. I make it green and run it again. You can see that the green point is our starting position, and from there it updates all the way over to here, basically reaching the lowest point. But I think the color is a bit of an eyesore.
The color of the surface is a bit of an eyesore, so I change the color. I also don't think the border is necessary, so I remove it. Run it again, and it looks better. I also lower its opacity to 0.3 and run it again, which looks much more comfortable. It is a curved surface like this. Our green point is the starting point, and it updates along the way until it has basically reached the lowest point, which is the red point. We can also play around a bit here. What if I use different settings when doing gradient descent? I copy the code down below. Suppose I set the starting point to -100 and -100; let's try it and see what the result looks like. After running 20,000 times, we again get the final w and b and the stored w, b and cost for every update. We run the plotting code again to see what it looks like. You can see that this time the starting point is here, and it goes down, down, down, then turns like this and also ends up close to the lowest point. Next, let me try not letting it update so many times. I let it update only 1,000 times and run it. Let's see what it looks like. This time it walks down, gets only this far and goes no further, because we did not update enough times. You can adjust other things too. For example, I can increase the learning rate a little: I set it to 1.0 times 10 to the power of -2, which is 0.01, and try again. Now it reaches the lowest point again. You may have noticed that the starting point on the plot is not at -100, -100; by your reckoning the starting point should be over here. Why is it here instead? The reason lies in what we actually stored.
What we stored starts from the result of the first update, so the first stored point is where it landed after the first update. The true starting point, -100, -100, is over here; the first update took it to here, and what we stored runs from the first update to the last. Let me try again and increase the learning rate a little more. Run it and see what it looks like. You can see that the steps are very large this time: it swings back and forth like this, but it still seems to reach the lowest point. What if we make it even bigger? Here I change it to 1.0 times 10 to the power of -1 and run it. You can see that something terrible has happened: the surface in our plot is gone. Why? The reason is that the path has gone far out of range. When we drew the surface, we set w and b to lie between -100 and 100; now the values have gone so far beyond that range that the surface seems to disappear. When the learning rate is set too high, it is possible that each update does not bring us closer to the lowest point but takes us farther away from it. In our example, every update ends up farther from the lowest point than the last, which leads to this situation where the entire surface disappears. That is what happened just now: if you set the learning rate too high, you may move farther and farther from the lowest point. The first step lands here, the second step here, and by the third step it has gone off to who knows where. You can play with this yourselves: set different initial values for w and b, different learning rates and different settings. That is our implementation of gradient descent. Now that we have completed simple linear regression, I'll briefly summarize the machine learning process. In our example, the first step was to prepare the data.
Based on the distribution of the data, we thought a straight line could represent it, so we used a straight line to represent the data. Then we needed to find the straight line that best represents the data, that is, the line that fits the data best. What kind of straight line fits the data best? We have to give it a standard for judging, so we defined one: the smaller the sum of the squared distances between the data points and the line, the better the line fits the data; conversely, the larger the sum, the worse the fit. Once we have a scoring method, we still cannot exhaustively enumerate every straight line, score them all and pick the best one; that would be far too inefficient. We must find the best-fitting line in an efficient way, and here we used gradient descent. This whole process is more or less what machine learning looks like, and the process is the same when you apply it to other examples. First you prepare the data. Then you choose a model based on your data; in this example, the model we chose is a straight line. Once the model is chosen, it contains some parameters that you need to adjust; in this example, those are w and b. Which adjustments count as good and which as bad? Again we need a standard for judging, so we set a cost function; for this example we set it as described. Then, since it is impossible to enumerate every combination of parameters, calculate the cost of each and pick the best one, which would be too inefficient, we must find the best parameters in an efficient way. That means we need to set an optimizer.
In this example, the optimizer we used is gradient descent. That is roughly what a simple machine learning process looks like, and it is the same no matter which example it is applied to: first prepare the data, then set up a model, then set a cost function, and finally set an optimizer. OK. The second model I want to introduce is called multiple linear regression. It is actually similar to the simple linear regression we introduced before, but it can handle many features. Let's look at the problem we want to solve this time. In the earlier simple linear regression example, we used seniority to predict salary. But think about it carefully: predicting salary from seniority alone seems a bit odd, doesn't it? We should consider more factors. So now we have collected more complete information. In addition to seniority, we have also collected education level and work location, and we want to use seniority, education and work location to predict salary. In other words, our features now also include education and work location. If we still want to use a linear model to represent these data, we can use multiple linear regression. Mathematically, multiple linear regression can be written as y = W1 × X1 + W2 × X2 + W3 × X3 + ..., continuing for as many features as you have, and finally adding b. In our example there are three features, namely seniority, education and work location, so the formula written out in full is: monthly salary = W1 × seniority + W2 × education + W3 × work location + b.
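The formula above is a weighted sum plus b, which can be written as a dot product. The sketch below uses made-up weights and assumes each feature has already been expressed as a number; the values are purely illustrative.

```python
import numpy as np

# Hypothetical parameters and one hypothetical record, all already numeric.
w = np.array([2.0, 3.0, 1.5])   # W1, W2, W3
b = 30.0
x = np.array([3.5, 1.0, 0.0])   # X1 (seniority), X2, X3

# y = W1*X1 + W2*X2 + W3*X3 + b, written term by term and as a dot product.
y_explicit = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b
y_dot = np.dot(w, x) + b

print(y_explicit, y_dot)  # 40.0 40.0
```

The dot-product form stays the same no matter how many features there are, which is why it is the usual way to write the model in code.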
Our goal, then, is to find the combination of W1, W2, W3 and b that best represents the data. That is what multiple linear regression has to do. Before we start looking for the best w and b, I think you will have noticed a problem. In our formula we multiply W2 by education and W3 by work location, but both of these features are text. How can text be multiplied by a number? It can't, so we must first process these two features and convert them from text to numbers before we can do the calculation. Let's deal with education first. This feature has three possible values: high school or below, university, and master's degree or above. We can see that these values have a natural order from low to high. Since there is an order, we can use the numbers 0, 1 and 2 to represent the three cases: 0 for the lowest, high school or below; 1 for university; and 2 for master's degree or above. If a feature has this kind of size or rank relationship, we can replace the text in this way. This method of replacement has a name: label encoding. After we replace the education values, the data looks like this. We use 2 for master's degree or above, 1 for university and 0 for high school or below, so the column becomes 1, 2, 0 and so on. Now let's implement label encoding. First, read the data in. The reading step is the same as before, so I reuse the code I copied. The URL changes a little, however, because this time our data has two more features. This is the second version of the salary data, so we add a 2 at the end of the name and read it. Let's display it and take a look.
Our data looks like this. There are three features in total: seniority, education and work location. We want to deal with the education feature first and convert it from text to numbers. We can do the conversion as follows. From the data, let's take the education feature, which is called education level. Let's display it first to see what it looks like. We want university to be 1, master's degree or above to be 2, and high school or below to be 0. To do this, we write .map after the column and pass in a dictionary: high school or below corresponds to 0, university corresponds to 1, and master's degree or above corresponds to 2. Written like this, it converts every value of the feature for us. After the conversion, I assign the result back to the feature. Then we can display the whole data set and see that the column now holds the corresponding numbers. The first record is university, the second is master's degree or above, and the third is high school or below, so they become 1, 2, 0, and the rows below are converted in the same way. That completes our label encoding. With education done, let's deal with work location. This feature has three possible values: city A, city B and city C. You may wonder whether we can use the same method, label encoding, with 0, 1 and 2 representing cities A, B and C. But think about it carefully: is that really appropriate? We don't know whether there is any rank or size relationship among cities A, B and C. If we represent them with 0, 1 and 2, which city should be 0, which should be 1 and which should be 2?
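The label-encoding step for the education feature described above can be sketched with pandas `Series.map`. The column name `EducationLevel` and the exact label strings are assumptions for illustration; the real data file may spell them differently.

```python
import pandas as pd

# Hypothetical column name and label strings; the course data may differ.
data = pd.DataFrame({
    "EducationLevel": ["University", "Master or above",
                       "High school or below", "University"],
})

# The categories have a low-to-high order, so map them to 0, 1, 2.
education_order = {
    "High school or below": 0,
    "University": 1,
    "Master or above": 2,
}
data["EducationLevel"] = data["EducationLevel"].map(education_order)

print(data["EducationLevel"].tolist())  # [1, 2, 0, 1]
```

Any value that is missing from the dictionary becomes NaN after `map`, so it is worth checking the column for missing values after the conversion.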
If we want to convert a feature like this, one with no size relationship, we can convert it directly into multiple features, like this. We originally had a single city feature, and I turn it into three features: city A, city B and city C. The first record originally worked in city A, so we set its city A feature to 1 and its city B and city C features to 0. The second record originally worked in city C, so we set its city C feature to 1 and the others to 0, and so on. This method is called one-hot encoding. Whenever we want to convert a feature with no size or rank relationship from text into numbers, we can use one-hot encoding, which turns one feature into several. The number of new features equals the number of possible values the original feature had. In our example the possible values are cities A, B and C, so it becomes three features: city A, city B and city C. After the conversion, we can actually delete one of these three features. Why? Because we only need to know the values of two of them to deduce the value of the third. Suppose we delete the city C feature. Let's see whether city C can be deduced from cities A and B. If city A is 1 and city B is 0, then city C must be 0, because only one of the three features can be 1. Now look at the second record: if city A is 0 and city B is 0, then city C must be 1, because if it is neither city A nor city B, it must be city C. And so on. So we can deduce the value of the third feature from the other two, and that is why we can delete one of them. Let's take a simpler example. Suppose you have a feature called gender.
It has only two possible values, male or female. Obviously this feature has no rank or size relationship, so we can use one-hot encoding to turn it into two features, male and female. But a record is either male or female, so one feature can be derived from the other. In that case we don't need two features; too many features make our calculations more complicated, so we can delete one of them. Now back to the original example. Among the three features city A, city B and city C, we only need to know two of the values to derive the third, so we can delete one of them. Here we choose to delete city C. A reminder: not every feature that can be derived from others should be deleted. Some derivable features have a special meaning or can make our calculations more efficient, and those we keep. But with one-hot encoding, we can delete one of the features after the conversion. Now let's implement one-hot encoding. Our data currently looks like this, and we want to convert the city feature from text to numbers, which we can do with one-hot encoding. For this we can use the preprocessing module of sklearn, and from it the OneHotEncoder.
The sklearn package provides many of the things we use in machine learning, such as the OneHotEncoder we are introducing now, which can do the conversion for us quickly. OneHotEncoder is a class, so let's first create an instance, a converter, or you could say an encoder. I call it onehot_encoder. After creating it, we first let the encoder read our feature, the city feature, so we call fit and let it look at the city column. Note that it only accepts two-dimensional input. Selecting the column with one pair of square brackets gives a one-dimensional array, so we can't write it that way. To make it two-dimensional, we add another pair of square brackets; with two pairs of square brackets it is a two-dimensional array. We let it read all the values of this feature, and then we can do the conversion with its transform method. We again pass in the feature we want to convert, and it again has to be two-dimensional, so two pairs of square brackets are needed. I call the converted result city_encoded, and then we display it. After running it, you can see what the result looks like. Why does it look like this? Because by default the OneHotEncoder returns a sparse matrix. It doesn't matter if you don't know what a sparse matrix is. What we want here is the complete matrix, so we add toarray() at the end, and it returns the complete matrix. Run it again, and you can see that this is the result we want: it turns the three possible values of the feature, cities A, B and C, into three features.
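The fit and transform steps described above can be sketched as follows. The column name `City` and the values `CityA`, `CityB`, `CityC` are assumptions for illustration; the real data file may spell them differently.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column name and city labels.
data = pd.DataFrame({"City": ["CityA", "CityC", "CityB", "CityA"]})

onehot_encoder = OneHotEncoder()

# Double square brackets select a one-column DataFrame, which is two-dimensional.
onehot_encoder.fit(data[["City"]])

# transform returns a sparse matrix by default; toarray() gives the full matrix.
city_encoded = onehot_encoder.transform(data[["City"]]).toarray()

print(onehot_encoder.categories_)  # the order of the new columns
print(city_encoded)
```

The new columns follow the order in `categories_`, so that attribute is what tells you which column corresponds to which city.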
Then we can use the converted result to replace the original city feature. We can do it like this. Our original data is data; let's display it once. OK, the original data looks like this. If we want to add three more features, that is, three more columns, we can write two pairs of square brackets with the new names: cityA, then cityB, and finally cityC. Then we assign them the converted result from before and display the data. You can see that writing it this way adds three more columns, the features cityA, cityB and cityC. Next we can delete the city feature, because it has already been converted into these three features. And of the three new features we can delete one; I delete cityC. We can do this with data.drop. The first argument is the features we want to delete, given as a list: we want to delete the two features city and cityC. The second argument is the axis along which to delete. What we want to delete are columns, since city and cityC are two columns, and to delete columns we set axis to 1. If we wanted to delete rows, we would set axis to 0. So here we write 1, and then display the result. Run it, and you can see that the two columns city and cityC are gone. The work location feature has now been processed too. With the text converted, we move on: when training a model, we usually divide the data into a training set and a test set. Training the model means finding the most suitable parameters; in our example, that means finding the most suitable w and b.
Usually when we train the model, we don't use all of the data; we use only part of it. What do we do with the other part? The other part is for testing. Think about it: if we use all the data for training and find the best w and b, how do we verify how well this w and b work? We want to test them, and if we test, we should test on data the model has not seen. We do not test on the training data, because the machine has already seen it, which means it already knows the answers. If it already knows the answers and we still test with that data, the test won't be accurate. So in order to have unseen data for testing, we usually divide the data into a training set and a test set, like this. Suppose we have 10 records. The training set usually takes about 70% to 80%. In this example we use 80%, so 8 records form the training set and the remaining 20%, that is, 2 records, form the test set. The training set is used to find the best w and b; the test set is used afterwards to test them and see how well they work. Now that we understand what training and test sets are, let's implement them. Our data currently looks like this. First, let's separate x from y. You should remember that the model we want to use is multiple linear regression, which as a formula is y = W1 × X1 + W2 × X2 + ..., continuing for as many features as you have, and finally adding b. In our current example there are 4 features, namely seniority, education, cityA and cityB, and the y we want to predict is salary. So we first separate x and y. Here x is the four features we need to select: the first is seniority, the second is education, then cityA and cityB. And y is our salary.
Here I display x and y: x is these 4 features, and y is the salary. Next we divide the data into a training set and a test set. Here we can again use sklearn, which is very useful: from its model_selection module we import train_test_split, which does the split for us. The first parameter is our x, the features, and the second parameter is y. Then we specify the size of the test set with test_size. I want it to be 20%, so I set it to 0.2, and it automatically takes 20% of our data as the test set and the other 80% as the training set. If you want the test set to be 30%, write 0.3, and so on. Written this way, it returns 4 values: x for training, x for testing, y for training and y for testing. So I name them x_train, x_test, y_train and y_test. Let me display x_train, the x we use for training; you can see these rows were taken out as the training set. Its length is 28. The original length of x is 36. We take 20% as the test set, so 80% is the training set, and 36 multiplied by 0.8 is 28.8. Since it is 28.8, the .8 is dropped automatically and 28 rows are taken as the training set. Now the length of the test set: it is 8. So x has 36 rows in total, and after the split 28 are used as the training set and 8 as the test set.
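As a rough sketch of the split described above, here is train_test_split on placeholder arrays that only match the lecture's data in size (36 rows, 4 features). The array contents are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same size as in the lecture: 36 rows, 4 features.
x = np.arange(36 * 4).reshape(36, 4)
y = np.arange(36)

# test_size=0.2 keeps 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

print(len(x_train), len(x_test))
```

With 36 rows this gives 28 training rows and 8 test rows, matching the lengths shown in the lecture.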
The same is true for y. Now let's display x_train again, and you will notice one thing: x_train seems to change every time we execute the split. Look at the first row: it is 5.1, 0, 1, 0. Execute again and it becomes 6.9, 2, 1, 0. Execute again and it becomes 7.8, 2, 0, 0. Why does it change? Because by default the split is random, so every split gives a different result. If you want the split to be fixed, you can set another parameter called random_state and give it a number. Each number corresponds to a different split. Suppose I give it 87 and execute again: this time the first row is 4.6, 1, 1, 0. With random_state fixed at 87, if I execute again, the result does not change. If I change the number, say to 86, and execute, the result changes again; and if I keep executing with 86, it stays the same. So if you want a fixed split, give it a number. I will use 87, because a fixed split makes the later demonstration easier. Next I display x_test: it has 8 rows. Then y_train, and then y_test. With that, we have successfully separated the test set from the training set.
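The effect of random_state can be checked with a small sketch on placeholder arrays. The helper name split is an assumption for illustration; the seeds 87 and 86 are the ones used in the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 36 rows, 4 features (contents are made up).
x = np.arange(36 * 4).reshape(36, 4)
y = np.arange(36)

def split(seed):
    """Hypothetical helper: split with a fixed random_state."""
    return train_test_split(x, y, test_size=0.2, random_state=seed)

first = split(87)
second = split(87)
other = split(86)

# Same seed: identical training rows. Different seed: normally a different split.
same_seed_matches = np.array_equal(first[0], second[0])
different_seed_matches = np.array_equal(first[0], other[0])
print(same_seed_matches, different_seed_matches)
```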
Finally, to make the later calculations easier, I first convert x_train and x_test to numpy format. Right now they are both in pandas format, so they display nicely as a grid. To convert to numpy format I write .to_numpy() after the name, and it becomes a matrix. The matrix looks uglier, but it makes my later calculations more convenient, so I convert x_train here, and x_test in the same way.

After converting the text into numbers and splitting into training and test sets, we can return to the model. The model we want to use is multiple linear regression, so it can be written like this, except that our features have changed from 3 to 4: work location has become the two features cityA and cityB, so we rewrite the formula accordingly. Our goal now is to find a combination of w1, w2, w3, w4 and b that brings the predicted monthly salary as close as possible to the real data. Let's implement this part. First I set the values of w and b. I set w arbitrarily to begin with. It has w1 through w4, so I represent it as a one-dimensional matrix with 4 values. I import numpy and call it np, then create w as a matrix of 4 values, say 1, 2, 3, 4, which stand for w1, w2, w3, w4. Then we need a value for b; I also set it arbitrarily, to 0. With w and b set, we want w1 to multiply the first feature, w2 the second, w3 the third and w4 the fourth. We are in the training phase, so the features to multiply are in x_train. Let me display x_train first; it looks like this.
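A minimal sketch of this setup, assuming a small made-up DataFrame in place of the real training features (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the pandas training features (4 feature columns).
x_train = pd.DataFrame({
    "seniority": [5.1, 6.9, 7.8],
    "education": [0, 2, 2],
    "cityA": [1, 1, 0],
    "cityB": [0, 0, 0],
})

# Convert from pandas to a plain numpy matrix for the later arithmetic.
x_train = x_train.to_numpy()

# Arbitrary starting parameters: one weight per feature, plus b.
w = np.array([1, 2, 3, 4])
b = 0

print(type(x_train), x_train.shape)
```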
It has 4 columns, and each column represents a feature. So what we want is 1 multiplied by the first column, 2 by the second column, 3 by the third and 4 by the fourth. To multiply like this, we can simply write x_train * w, and it automatically multiplies 1 by the first column, 2 by the second column, and so on. Let's execute it. This is the result of the multiplication: the first column is w1*x1, the second column is w2*x2, and so on. Next we want to add them up, w1*x1 + w2*x2 + w3*x3 + w4*x4, which means summing each row. To do that, put the calculation in brackets and call .sum() on it. If I write only that, it sums every value in the matrix. What we want is the sum within each row, so I set axis equal to 1, which sums in the horizontal direction; to sum in the vertical direction you would set it to 0. Let's execute it. This is the result: each value is the sum of one row, that is, w1*x1 + w2*x2 + w3*x3 + w4*x4. After the additions we add b at the end. Adding b here adds it to every value. Since b is currently 0, you cannot see the effect, so I set it to 1 and execute again: every value has 1 added. Each value calculated here is our predicted salary, which I call y_pred. Next, we want to find the most suitable combination of w and b.
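The multiply-then-sum step can be sketched on three made-up rows (the numbers are illustrative, not the lecture's data). The intermediate name weighted is an assumption.

```python
import numpy as np

# Made-up training features: 3 rows, 4 columns (one column per feature).
x_train = np.array([[5.1, 0, 1, 0],
                    [6.9, 2, 1, 0],
                    [7.8, 2, 0, 0]])
w = np.array([1, 2, 3, 4])
b = 1

# Broadcasting multiplies w[0] into column 0, w[1] into column 1, and so on.
weighted = x_train * w

# axis=1 sums across each row; then b is added to every row's sum.
y_pred = weighted.sum(axis=1) + b
print(y_pred)
```

The same result can also be written as x_train @ w + b; the elementwise form is kept here because it mirrors the steps in the lecture.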
To find the most suitable combination of w and b, we must first define what "most suitable" means. That is, we need a standard for judging, which is the cost function. The cost function can be the same as in the earlier simple linear regression, because here too we want the predicted monthly salary to be as close as possible to the real data. So I set it to the real data minus the predicted value, squared. We square because the subtraction may give negative numbers, and squaring removes them, which is convenient for calculation. So our current goal is to make this squared difference as small as possible; the smaller the cost, the better. Let's implement the cost function. The predicted value was computed in the previous step: it is y_pred. Let me display it first. As for the real data, we are in the training phase, so the real data is y_train, and we subtract y_pred from y_train. I also display y_train first. Subtracting gives one difference per row, and then we square each one and execute. Each value here is the real salary minus the predicted salary, squared. We want these values to be as small as possible, so we can compute either their sum or their average. Here I compute the average and hope it is as small as possible. To compute it, enclose the expression in brackets and write .mean() at the end. Execute, and this is the average.
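A small sketch of the cost calculation just described, on made-up salaries and predictions (the numbers and the intermediate name squared_error are assumptions):

```python
import numpy as np

# Made-up real salaries and predictions for three training rows.
y_train = np.array([10.0, 14.0, 15.0])
y_pred = np.array([9.0, 16.0, 12.0])

# Real minus predicted, squared, so negative differences do not cancel out.
squared_error = (y_train - y_pred) ** 2

# The cost is the average of the squared differences.
cost = squared_error.mean()
print(cost)
```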
Here I write the cost calculation as a function, so that we can easily plug in different w and b. I call it compute_cost. It takes x and y, our real data, and the values of w and b. Inside, it first computes y_pred, the predicted value, with x_train changed to x. Then it computes the cost the same way as before, with y_train changed to y: the real data minus the predicted value, squared, then averaged. Finally it returns the cost. Let's try it. We pass in x_train and y_train, because we are in the training phase, and for w and b I first pass in the same two values as before. Execute, and the result is the same as above. If I set b to 0 and change w to 0, 2, 2, 4 and execute again, the cost for this combination of w and b is even higher than before; the previous one was 1,772.

So we have defined the evaluation standard, the cost function. Next we need an efficient way to find a set of w and b that makes the cost as low as possible. That efficient way is the optimizer. In our example we can again use gradient descent. Remember that it changes the parameters according to the slope. In simple linear regression there were only two parameters, w and b; now there are 5: w1, w2, w3, w4 and b. Recall how we updated the parameters before: to update w, we subtract from w the slope in the w direction multiplied by the learning rate.
To update b, we subtract from b the slope in the b direction multiplied by the learning rate. The update method is the same as before; we have just gone from two parameters to five. To update the five parameters w1, w2, w3, w4 and b, we subtract from each one the slope in its own direction multiplied by the learning rate. How do we calculate the slope in each direction? The same as before: we differentiate the cost function. Our cost function is the real data minus the predicted value, squared. As a formula, it is (y - y_pred) squared, and we can expand y_pred into w1*x1 + w2*x2 + w3*x3 + w4*x4 + b. To get the slope in the w1 direction we differentiate with respect to w1, or to be precise, take the partial derivative with respect to w1. For the slope in the w2 direction we take the partial derivative with respect to w2, and so on; w3, w4 and b are the same. It is fine if you do not know what differentiation is, because many tools can do it for us automatically. After calculation, the slope in the w1 direction looks like this. It is a very long expression, but notice that the part w1*x1 + w2*x2 + w3*x3 + w4*x4 + b is just y_pred, so it simplifies to 2 * x1 * (y_pred - y). Using the same method, the slope in the w2 direction looks like this: x1 is simply replaced with x2, and everything else is the same.
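For readers who want to see the step written out, here is the derivative for a single data point. The hat notation for the prediction is an assumption; the lecture calls it y pred.

```latex
\begin{aligned}
\text{cost} &= (y - \hat{y})^2, \qquad \hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b \\
\frac{\partial\,\text{cost}}{\partial w_1} &= 2\,(y - \hat{y}) \cdot (-x_1) = 2\,x_1\,(\hat{y} - y) \\
\frac{\partial\,\text{cost}}{\partial w_2} &= 2\,(y - \hat{y}) \cdot (-x_2) = 2\,x_2\,(\hat{y} - y)
\end{aligned}
```

The factor of -x1 comes from the chain rule, since the prediction depends on w1 only through the term w1*x1.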
From this we can see that the slope in the w3 direction has x3 in that place, and the slope in the w4 direction has x4. The slope in the b direction looks like this; there is no x to multiply. Notice that both the w directions and the b direction have a factor of 2 in front. You may remember that we said this factor of 2 can be omitted: when we update the parameters we multiply by a learning rate anyway, so we can let the learning rate absorb the 2. Omitting it everywhere gives this, and now we know how to calculate the slopes. Back to the update rule: we can now calculate the slope in every direction, but it is then multiplied by a learning rate. How do we set the learning rate? As before, through testing and experimentation. It cannot be too large, because the update may jump straight over the lowest point, or even move farther and farther away from it. It cannot be too small either, because it may take forever to reach the lowest point. Once the learning rate is set and the slopes are calculated, we just keep updating the parameters so that they approach the lowest point step by step, and then we find the most suitable combination of w and b. Let's implement gradient descent directly. First we calculate the slope in each direction. The slope in the b direction is the simplest.
So I calculate it first and call it b_gradient. It is y_pred minus y; this is the training phase, so our y is y_train. y_pred was calculated earlier, so I just copy it, changing x to x_train since we are training. Let me display the result. Our training set has 28 rows, so there are 28 values here, and we take their average: put the expression in brackets and call .mean(). The average is -46.94. With the b direction done, let's calculate the slope in the w1 direction, which I call w1_gradient. It is y_pred minus y_train, multiplied by x1. This x1 is our first feature, and since we are in the training phase, it is in x_train. Let me display x_train: it has four columns, which are our four features, and x1 is the first column. To get the first column we write a pair of square brackets containing :, 0. The colon means we want all the values in the first dimension, and the 0 after the comma means we want only the first value in the second dimension. This matrix is two-dimensional: the first dimension is the rows, and we want all of them; the second dimension is the values within each row, and we want only the first. Together that is the first column. Execute it, and it takes out all the values in the first column, which is our x1. So I substitute it into the expression and display the result.
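The column slice and the two slopes can be sketched on three made-up rows (the numbers are illustrative; the names b_gradient and w1_gradient follow the lecture):

```python
import numpy as np

# Made-up training data: 3 rows, 4 features.
x_train = np.array([[5.1, 0, 1, 0],
                    [6.9, 2, 1, 0],
                    [7.8, 2, 0, 0]])
y_train = np.array([10.0, 14.0, 15.0])
w = np.array([1, 2, 3, 4])
b = 1

y_pred = (x_train * w).sum(axis=1) + b

# ":" keeps every row, "0" picks the first value of each row: the first column.
x1 = x_train[:, 0]

# Slopes with the factor of 2 dropped, averaged over the training rows.
b_gradient = (y_pred - y_train).mean()
w1_gradient = ((y_pred - y_train) * x1).mean()
print(b_gradient, w1_gradient)
```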
It calculates 28 values, because we have 28 rows in the training set, so we take the average: enclose it in brackets, take the mean and execute. The slope in the w1 direction is -295.8. w2, w3 and w4 are calculated the same way, so I copy the line and change the names to 2, 3 and 4, and the column index to 1, 2 and 3. Display the results and execute: these are the slopes in the w1, w2, w3 and w4 directions. It is more convenient to write this with a loop, so I create a new w_gradient that stores the slopes for w1 through w4. First I create a matrix of all zeros with 4 values and display it. But writing 4 directly is not ideal. The 4 means the number of features, since we have as many w's as features. To find the number of features, we can ask how many columns x_train has: its shape tells us it has 28 rows and 4 columns, that is, 28 values in the first dimension and 4 in the second. We want the second, so we index the shape with [1] to get 4, and I use that instead. Then I use a loop to calculate the slopes. It runs four times in total. I paste the calculation inside, so that the i-th entry of w_gradient is computed as above with the column index changed to i. I display the result after the calculation: the slopes in the four directions w1, w2, w3, w4 are these four values.
So we have calculated the slopes in the w and b directions: this is the w direction and this is the b direction. Next I write the slope calculation as a function so it can be reused. I call it compute_gradient. It takes x and y, our real data, and the values of w and b. Inside, we first calculate y_pred; I paste the earlier calculation and change x_train to x. Then I create a matrix of all zeros and use a for loop to calculate the slope for each w, again changing x_train to x and y_train to y. After that we calculate the slope in the b direction, pasted here as well, with y_train changed to y. Finally we return w_gradient and b_gradient. Execute it, and let's use it. We are in the training phase, so x and y are x_train and y_train, and for w and b I pass in the values from before. Execute, and the result is exactly the same as above. Let's change w and b: say b becomes 1 and w becomes 1, 2, 2, 4. Execute again, and the calculated slopes are different. With the slopes calculated, we try updating the parameters. To update w, we subtract from w the slope in the w direction multiplied by a learning rate. Updating b is the same, with w replaced by b. I set the learning rate arbitrarily, say 0.001, and use these same values as the initial w and b. For the slopes in the w and b directions, we call this function and it returns them to us. Then I display the updated w and b to see whether anything has changed.
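A minimal sketch of such a function and of a single update step, assuming made-up data. The function body follows the steps described in the lecture; the names w_new and b_new are assumptions.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Slopes in each w direction and in the b direction (factor of 2 dropped)."""
    y_pred = (x * w).sum(axis=1) + b
    # One slope per feature; x.shape[1] is the number of columns.
    w_gradient = np.zeros(x.shape[1])
    for i in range(x.shape[1]):
        w_gradient[i] = ((y_pred - y) * x[:, i]).mean()
    b_gradient = (y_pred - y).mean()
    return w_gradient, b_gradient

# Made-up training data: 3 rows, 4 features.
x_train = np.array([[5.1, 0, 1, 0],
                    [6.9, 2, 1, 0],
                    [7.8, 2, 0, 0]])
y_train = np.array([10.0, 14.0, 15.0])
w = np.array([1, 2, 3, 4])
b = 1

w_gradient, b_gradient = compute_gradient(x_train, y_train, w, b)

# One update step: move each parameter against its slope.
learning_rate = 0.001
w_new = w - w_gradient * learning_rate
b_new = b - b_gradient * learning_rate
print(w_new, b_new)
```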
You can see that w was originally 1, 2, 2, 4 and becomes 1.2-something, 2.0-something, 2.0-something and 4.0-something, and b has changed as well. Let's check whether the cost has really decreased after this update. I use the cost function directly: calculate the cost before the update and print it, then calculate it again after the update and see whether it is really smaller. It really is smaller: it was originally more than 1,800 and becomes 1,675. Having confirmed that, we need to repeat these steps so that w and b keep being updated, so we write a function called gradient_descent. We already wrote this function in the earlier simple linear regression, and the code is basically the same, so I copy it and execute it. The earlier call can also be copied, but a few places need changing. I change the initial values of w and b to these; the learning rate I set to 0.001; and I let it run 10,000 times to start. Because the data is now split into a test set and a training set, and we are in the training phase, I pass x_train and y_train. The rest is basically the same, and the names compute_gradient and compute_cost are unchanged. I execute it, and it reports an error. The cause is here: w and w_gradient are now numpy matrices, and a numpy matrix cannot be formatted directly with :.2e in this spot. So I delete the format specifier for now and execute again, and there should be no problem.
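The lecture reuses a gradient_descent function from an earlier lesson that is not shown here, so the following is only a sketch of what such a loop might look like. The parameter names run_iter and print_every and the synthetic data are assumptions.

```python
import numpy as np

def compute_cost(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    return ((y - y_pred) ** 2).mean()

def compute_gradient(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    w_gradient = np.zeros(x.shape[1])
    for i in range(x.shape[1]):
        w_gradient[i] = ((y_pred - y) * x[:, i]).mean()
    b_gradient = (y_pred - y).mean()
    return w_gradient, b_gradient

def gradient_descent(x, y, w_init, b_init, learning_rate, run_iter, print_every=1000):
    """Repeat the update step run_iter times (hypothetical signature)."""
    w, b = w_init.astype(float), float(b_init)
    for i in range(run_iter):
        w_gradient, b_gradient = compute_gradient(x, y, w, b)
        w = w - w_gradient * learning_rate
        b = b - b_gradient * learning_rate
        if i % print_every == 0:
            cost = compute_cost(x, y, w, b)
            # Scalars accept :.2e; the array w is printed without a format spec.
            print(f"iteration {i:5d}: cost {cost:.2e}, w {w}, b {b:.2e}")
    return w, b

# Synthetic training data (assumed): 28 rows, 4 features, known true parameters.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(28, 4))
y_train = x_train @ np.array([3.0, -1.0, 2.0, 0.5]) + 4.0

w_init = np.array([1, 2, 3, 4])
b_init = 0
cost_before = compute_cost(x_train, y_train, w_init, b_init)
w_final, b_final = gradient_descent(x_train, y_train, w_init, b_init,
                                    learning_rate=0.001, run_iter=10000)
cost_after = compute_cost(x_train, y_train, w_final, b_final)
print(cost_before, cost_after)
```

Printing only every print_every iterations keeps the log short; the lecture's own function may print differently.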
You can see that there is no problem now: the cost is decreasing, and the values of w and b are being updated. But without formatting, the output looks ugly and uneven. Here are the slopes in the w direction and the slope in the b direction; because w has w1 through w4, there are 4 values for w and 4 values for its slope. To make it look better, we can set the format in which numpy matrices are printed: call np.set_printoptions and set its formatter, which is a dictionary. I want floating-point numbers formatted as {: .2e}, that is, a colon, a space, then .2e. Setting it this way is equivalent to the earlier formatting, and every value in the matrix is printed in this format. Execute again, and the output is much nicer. Looking at the cost, it keeps decreasing, so that is fine. After 10,000 iterations the cost still seems to be decreasing, so let me test a somewhat higher learning rate. It is still decreasing, and clearly faster than before, so this learning rate should be fine. Gradually the decrease slows down. I will leave it set like this and let it run 10,000 times. You can experiment yourselves with different initial w, different initial b, different learning rates and different numbers of iterations, and see what results you get. Finally I display the w and b it found: these 5 values. Now I want to verify whether these 5 values are good.
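Both the formatting error and the print-option fix can be reproduced in a short sketch. The array values and the flag name direct_format_failed are made up for illustration.

```python
import numpy as np

# Made-up parameter values.
w = np.array([1.2345, 20.5, 0.00321, 4.0])

# Formatting a whole array with :.2e in an f-string raises TypeError.
try:
    text = f"{w:.2e}"
    direct_format_failed = False
except TypeError:
    direct_format_failed = True

# Tell numpy how to print every float inside an array instead.
np.set_printoptions(formatter={"float": "{: .2e}".format})
formatted = str(w)
print(formatted)
```

Note that np.set_printoptions changes a global setting, so every array printed afterwards in the same session uses this format.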
We can use them on the test set. We take the final w and multiply it by x_test, the test set. This gives w1*x1, w2*x2, w3*x3 and w4*x4, which must be added together, so I enclose the product in brackets and sum along axis 1. At the end we add the final b. The result is our prediction on the test set, which I call y_pred. We can then compare the prediction on the test set with the real values. Displaying them as a pandas DataFrame looks better, since it has a grid. I give it two columns: the first is our prediction on the test set, y_pred, and the second is the real data on the test set, y_test. Execute. There is an error: I typed an extra u. After correcting the mistakes and executing again, you can see the predictions on the left and the real data on the right. We predict a little over 40 where the real value is 43.8, which is not much different. 67.7 against 72.7 is somewhat worse. 61.6 against 60 is fine. The differences seem acceptable. So we have successfully implemented gradient descent, found the final w and b, and applied them to the test set. Finally, to judge more precisely whether these predictions are good, besides looking at the individual errors we can calculate the cost, since the cost is our criterion for good or bad. We use compute_cost directly: this is the test set, so x is x_test and y is y_test, and w and b are the w_final and b_final we found. The cost comes out to a little over 18.1. Let's compare it with the cost at the end of training.
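The test-set check can be sketched as follows. All numbers here are made-up stand-ins, including w_final and b_final, which in the lecture come out of gradient descent.

```python
import numpy as np
import pandas as pd

def compute_cost(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    return ((y - y_pred) ** 2).mean()

# Made-up stand-ins for the test set and the trained parameters.
x_test = np.array([[1.0, 0, 1, 0],
                   [2.0, 1, 0, 1]])
y_test = np.array([44.0, 60.0])
w_final = np.array([10.0, 5.0, 3.0, 2.0])
b_final = 30.0

# Same formula as in training, applied to the test features.
y_pred = (x_test * w_final).sum(axis=1) + b_final

# Side-by-side view of predictions and real values.
comparison = pd.DataFrame({"y_pred": y_pred, "y_test": y_test})
print(comparison)

# The same cost function gives one number for the whole test set.
test_cost = compute_cost(x_test, y_test, w_final, b_final)
print(test_cost)
```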
During training, the final cost came down to 2.52*10, that is, a little over 25.2. That looks fine: our cost on the test set is smaller than on the training set, which means the model seems to do even better on the test set than on the training set. After the cost calculation and the comparison above, if the error is acceptable to you, we can apply this model to real situations. Suppose someone comes in for an interview today. He tells you that he has 5.3 years of seniority, a master's degree or above, and would work in city A. After the interview I want to hire him, but I do not know how much salary to offer. We can use the trained model to predict that. First we have to process the data, because it contains text that must be converted into numbers. Real-world data is converted exactly the way the test set was, so I convert it directly. Seniority 5.3 is used as is. Master's degree or above becomes 2 under our label encoding. Work location becomes 3 features under one-hot encoding, and one of them is deleted, leaving 2 features; for city A, the third feature is 1 and the fourth is 0. I represent this as a matrix called x_real, created with np.array. If you used feature scaling, then real-world data needs feature scaling too. Let me scroll up to where we did the scaling. Here is how the test set was scaled, and we scale the real data the same way: I copy the line that scales x_test and apply it to x_real. Let's display the scaled result. It reports a problem: it says only two-dimensional arrays are accepted.
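The shape problem reported here, and one way around it, can be reproduced in a small sketch. The training rows, w_final and b_final are made-up stand-ins; only the applicant's values 5.3, 2, 1, 0 come from the lecture.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training features used only to fit the scaler.
x_train = np.array([[1.0, 0, 1, 0],
                    [5.0, 1, 0, 1],
                    [9.0, 2, 0, 0]])
scaler = StandardScaler()
scaler.fit(x_train)

# A one-dimensional array is rejected by transform.
x_real_1d = np.array([5.3, 2, 1, 0])
try:
    scaler.transform(x_real_1d)
    one_dim_rejected = False
except ValueError:
    one_dim_rejected = True

# An extra pair of square brackets makes it one row of a 2D matrix.
x_real = np.array([[5.3, 2, 1, 0]])
x_real_scaled = scaler.transform(x_real)

# Made-up trained parameters; the prediction uses the same formula as before.
w_final = np.array([10.0, 5.0, 3.0, 2.0])
b_final = 50.0
y_real = (x_real_scaled * w_final).sum(axis=1) + b_final
print(one_dim_rejected, y_real)
```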
One-dimensional input is not accepted, and I forgot to convert it: the scaler takes only two-dimensional matrices. So I add another pair of square brackets to make x_real two-dimensional, and execute again. This is the scaled result, and now we can apply it to the model. I copy the model's calculation and change the input to x_real, then display the result as y_real. This is the model's final prediction: it tells us we can offer this person a salary of 6.55*10, that is, 65.5k. Now suppose a second person comes to interview. He tells you he has 7.2 years of work experience, education below high school, and would work in city B. To predict his salary, I add another row here: 7.2; below high school converts to 0; and city B converts to 0, 1. That is the second person. We run the same prediction, and the model says we can offer this person 23.4k. That is how to apply the model to real situations.

We have already implemented gradient descent. In our example we can actually speed it up with a small technique called feature scaling. Let's look at the current example. It has four features, seniority, education, and the work locations cityA and cityB, and we use multiple linear regression to predict the monthly salary, written as w1*x1 + w2*x2 + w3*x3 + w4*x4 + b, where x1, x2, x3 and x4 are our four features. Looking at the values of these four features, we find that their ranges are different.
The first feature ranges between 1 and 10, the second between 0 and 2, and the third and fourth are either 0 or 1. So we can clearly see that feature x1 has a larger range than the other three. Putting this back into the original formula: w1 is multiplied by x1, a relatively large value, while w2, w3 and w4 are multiplied by relatively small values. This makes our gradient descent slower. Why? Because w1 is multiplied by a relatively large value, a slight change in w1 greatly affects the calculated result, and through it the calculated cost. In other words, even a slight change in w1 changes the cost a lot. If we compare the costs for different w1 and w2, it looks roughly like this. I use a contour map: the center point is where the cost is lowest, and the cost gets higher toward the outer rings. The x-axis is w1 and the y-axis is w2. The contour map is long and narrow. Why? Because w1 is multiplied by a relatively large value, so a slight change in w1 greatly affects the cost. The lowest point is the red dot, and the cost grows the farther out you go. You can see that a small change in w1 changes the cost greatly, whereas a change in w2 affects the cost much less. Let's see what happens if we run gradient descent in this situation.
Suppose our starting point, the initial w1 and w2, is here. When we update the parameters with gradient descent, something like this is very likely to happen: the path oscillates back and forth. Why? Because, as we said, a slight change in w1 has a large effect on the cost, so when gradient descent updates w1 it can easily overshoot. One update overshoots and lands over here, the next overshoots back over there, and we get this back-and-forth oscillation. That makes us reach the lowest point very slowly; in other words, our gradient descent becomes very slow.
So how do we solve this? It is very simple. The problem arises because the features have different ranges: the first feature has a relatively large range and the others are relatively small. To fix it, we bring the features into the same range; here I shrink the large one. Once all four features are in the same range, our contour plot looks like this, and gradient descent now goes smoothly and almost directly to the lowest point. So we only need to scale every feature to the same range to speed up gradient descent.
There are many ways to do feature scaling. Here I will introduce a classic and commonly used one called standardization. We subtract the mean of a feature from each value of that feature and then divide by the feature's standard deviation. For our seniority feature, that means subtracting the mean seniority from each seniority value and then dividing by the standard deviation of seniority. It doesn't matter if you don't know what standard deviation is, because there are many tools for this.
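The speed-up described above can be sketched on made-up data. The data, the learning rate, the step count and the mean squared error cost are all assumptions chosen for illustration, not the course's settings; the only thing that changes between the two runs is whether the features are standardized.

```python
import numpy as np

# Made-up data with very different feature ranges.
x1 = np.linspace(1, 10, 30)             # range 1..10
x2 = np.tile([0.0, 1.0, 2.0], 10)       # range 0..2
X_raw = np.column_stack([x1, x2])
y = 2.0 * x1 + 5.0 * x2 + 10.0

# Standardization: subtract each feature's mean, divide by its standard deviation.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def gradient_descent(X, y, learning_rate=0.01, steps=1000):
    """Plain gradient descent on mean squared error; returns the final cost."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        error = X @ w + b - y
        w -= learning_rate * 2 * (X.T @ error) / len(y)
        b -= learning_rate * 2 * error.mean()
    return np.mean((X @ w + b - y) ** 2)

cost_raw = gradient_descent(X_raw, y)
cost_scaled = gradient_descent(X_scaled, y)
print(cost_raw, cost_scaled)
```

With identical settings, the run on standardized features should end at a much lower cost than the run on raw features, which is the same comparison the lecture makes on the salary data.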
These tools can do the calculation for us automatically. If we standardize every feature, it looks like this, and you can see that their ranges become very close. You may wonder about the last three features: aren't they already small? Why do we still need to scale them? There is no harm in scaling them too.
Let's implement feature scaling directly. We do feature scaling after splitting the data into a training set and a test set. We split them here, so we do feature scaling right after. The method we use is standardization. To use it, we can import StandardScaler from preprocessing in sklearn. Because it is a class, I first create an instance, which I call scaler. After creating it, we let it look at the feature data of the training set. Note that it may only see the feature data of the training set, not the test set, because test set data should only be used when testing. So here we write .fit and pass in x_train. After that, it calculates the mean, standard deviation, and so on of each feature. Once that is done we can do the conversion: we write .transform and again pass in x_train, and we replace the original x_train with the converted result. Let's display it. This is the result after conversion. Here I also convert x_test while I am at it. We only convert it for now and don't use it yet, so that it is more convenient when we test later; again we just replace it with the converted result. Pay special attention here: we do not calculate a mean or standard deviation from the test set data and then convert with those.
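Here is a minimal sketch of the "fit on the training set, transform both sets" rule. It is written with plain NumPy rather than StandardScaler so that the two steps are visible, and the feature rows are made up.

```python
import numpy as np

# Made-up feature rows: [seniority, education, cityA, cityB]
x_train = np.array([[1.0, 0, 1, 0],
                    [4.5, 1, 0, 1],
                    [7.0, 2, 0, 0],
                    [9.5, 1, 1, 0]])
x_test = np.array([[3.0, 2, 0, 1]])

# "fit": the statistics come from the training set only
train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0)

# "transform": the same training statistics are applied to both sets
x_train_scaled = (x_train - train_mean) / train_std
x_test_scaled = (x_test - train_mean) / train_std
print(x_train_scaled)
print(x_test_scaled)
```

After this, every training column has mean 0 and standard deviation 1, while the test row is simply expressed on the training set's scale, so nothing about the test data leaks into the preprocessing.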
Instead, we directly use the result of fitting on the training set: the mean and standard deviation calculated from the training set are used to convert the test set. Let me display it. This is the converted test set.
Now let's compare gradient descent after feature scaling with gradient descent without it. The gradient descent we ran before is here, and this is its result. I copy it and run it again here. All the parameters are the same; the only difference is that our x_train has now been scaled. So I run it under the same conditions to see whether it is really faster. You can see that the cost reaches about 2.5*10 after roughly 1,000 updates, and over the following 2,000 to 3,000 updates the cost basically stops moving, so we can guess that it has almost reached the lowest point. Compare that with the previous result: it took about 4,000 updates just to get down to 2.5*10^1. So we can clearly see that gradient descent really does descend faster after feature scaling.
Next, let me introduce a model that is very commonly used for classification problems, called logistic regression. Let's first understand the problem we want to solve today. Suppose we know that whether a person has diabetes may be related to age, weight, blood sugar and gender. So we collect some data. Each record represents one person's age, weight, blood sugar and gender, and whether that person has diabetes: 1 means diabetes and 0 means no diabetes.
So we want to use these 4 features to predict whether a person has diabetes. The output has only two possible values, 1 or 0, that is, diabetes or no diabetes. This is quite different from our previous two examples. In simple linear regression and multiple linear regression we predicted the monthly salary, and the monthly salary can take infinitely many values. Here there are only two kinds of output: either you have diabetes or you don't. When the set of possible outputs is limited like this, we call it a classification problem. A quick disclaimer: this data was randomly generated by me, so if you have a medical background, don't take it too seriously. You might also say, isn't it simply the case that you have diabetes once your blood sugar exceeds a certain level? Let's pretend we don't know that for now.
Let's look at it with a graph. This is the earlier linear regression example, where we use seniority to predict salary. The features need not be only seniority; they could also include workplace, education, and so on. I only show seniority here to make it easy to draw. We said we can use a line to represent this data. Now back to today's example: we want to use a person's blood sugar to predict whether they have diabetes. There can be many features here too, such as the weight, age and gender just mentioned, but for ease of drawing I only plot blood sugar. We can see that the data is distributed very differently from the plot on the left, because the output has only two possible values, 1 or 0, diabetes or no diabetes, so it looks like this.
At this point, using a straight line to represent this data seems wrong, right? So what should we do? We don't actually need to change much: just bend the line a little so it looks like this. Doesn't it now seem able to represent the data? The value of this curve always lies between 0 and 1, which fits our data well, because our data is either 1 or 0. The question now becomes how to bend the line. It is very simple: we only need a function called the sigmoid function, which in Chinese can be called the S-shaped function. Take our earlier linear regression: its formula can be written as y = w*x + b, or with many features, y = w1*x1 + w2*x2 + ... + b, continuing for however many features you have. To bend such a linear model, we just plug it into the sigmoid function. The sigmoid function looks like this: 1 / (1 + e^(-z)). Read aloud it sounds like a tongue twister, so just remember that we take the linear model, put a minus sign in front of it, and place it in the exponent of e. You may ask what this letter e means. It is just a constant, 2.7182818..., like pi in mathematics. Before plugging the linear model in, I first rename its output from y to z, so z = w1*x1 + w2*x2 + ... + b. After plugging it in, our new predicted value becomes y = 1 / (1 + e^(-z)), with one w*x term inside z for each feature you have. This is our logistic regression model. Don't be scared by how it looks; there are many tools that can do the calculation for us automatically. Back to the problem we want to solve: today we want to use age, weight, blood sugar and gender to predict whether a person has diabetes.
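A quick way to see the bending is to feed a few values of z into the sigmoid function. The z values below are arbitrary examples standing in for the output of a linear model.

```python
import numpy as np

def sigmoid(z):
    """S-shaped function: 1 / (1 + e^(-z))."""
    return 1 / (1 + np.exp(-z))

# Arbitrary example outputs of a linear model z = w*x + b
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
y = sigmoid(z)
print(y)
```

Very negative z values are pushed toward 0, very positive ones toward 1, and z = 0 lands exactly in the middle at 0.5, so the straight line turns into an S-shaped curve that stays between 0 and 1.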
So we have these 4 features. That is to say, we want to find the combination of w1, w2, w3, w4 and b that, once plugged into the sigmoid function and bent, best represents our data. Let's implement it directly.
First we read the data in. Reading works the same as before, so I just copy the code, but the URL of the data is different, so I change it. As before, you can find the URL in the course files. OK, let's run it, and we can see our data. The rows are numbered 0 to 399, so there are 400 records. After reading the data in, we first process it. There is text here, and we need to convert the text into numbers. Let's take the gender column, that is, the gender feature, and convert it. We can use map to do the conversion: male corresponds to 1 and female corresponds to 0. Let's display it after converting. You can see the conversion is done.
Then we split the data into a training set and a test set. The splitting method is the same as before, so I copy it here, but x and y need to change: our features are age, weight, blood sugar and gender, and y is whether the person has diabetes. After that, I display x_train and x_test to have a look. OK, this is the result of the split. Then we also display y_train and y_test. OK, no problem. Here we again use 0.2 for the test set, that is, 20%. We have 400 records in total, so 20% gives 80 records for the test set and 320 for the training set. Our 4 features have different ranges, so if we want gradient descent to run faster later, we need to do feature scaling.
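The two preprocessing steps above can be sketched as follows. This is a stand-in written with plain Python and NumPy: the labels "male" and "female" are hypothetical placeholders for the text in the real column, the 1/0 assignment follows what is said for the map step, and the shuffled-index split replaces whatever splitting helper the course copied from the earlier lesson.

```python
import numpy as np

# Hypothetical gender labels standing in for the text column (400 records).
gender_text = ["male", "female"] * 200
gender_to_number = {"male": 1, "female": 0}
gender = np.array([gender_to_number[g] for g in gender_text])

# 80/20 split of the 400 records using a shuffled index.
rng = np.random.default_rng(0)              # fixed seed so the split is repeatable
indices = rng.permutation(len(gender))
n_test = int(len(gender) * 0.2)             # 20% of the records go to the test set
test_idx, train_idx = indices[:n_test], indices[n_test:]
print(len(train_idx), len(test_idx))
```

Indexing the feature table and the diabetes column with `train_idx` and `test_idx` would then give the training and test sets.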
The feature scaling part is the same as before, so I copy it directly. Here we also display the scaled result, and you can see this is x_train after scaling. I also display x_test. OK, no problem. With the data processing done, we can feed it into the model. First I set the values of w and b arbitrarily. I import numpy first. I make w an array with 4 values, because we have 4 features, w1 to w4, and I set them to 1, 2, 3, 4. Then I set b to 1. Our model first multiplies w by x; we are in the training stage, so I multiply by x_train and display the result. This is the result after multiplying. Next we add the products up, the same as before, so we use sum with axis=1 and run it. This is the result of the addition. After that we add b. In multiple linear regression this expression was our predicted y, but now we are doing logistic regression, so I call it z. Then we want to plug z into the sigmoid function. I first define the function; I call it sigmoid and it takes z. It computes 1 divided by 1 plus e raised to some power. For e raised to a power we can write np.exp, with the exponent inside the parentheses, and what we want is the power -z, so we put -z there. The function returns this value, and the denominator needs to be wrapped in parentheses. I also move the numpy import up here. OK, run it. Then we use it directly: pass z in and have a look. You can see that after the sigmoid function, every value is between 0 and 1.
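Put together, the steps just described look roughly like this. The rows of `x_train` are made up (assumed to be already scaled) because the real data is not reproduced here; `w`, `b` and the order of operations follow the walkthrough.

```python
import numpy as np

def sigmoid(z):
    # 1 divided by (1 plus e to the power -z); note the parentheses around the denominator
    return 1 / (1 + np.exp(-z))

# Made-up, already scaled feature rows: [age, weight, blood sugar, gender]
x_train = np.array([[ 0.5, -1.2,  0.3,  1.0],
                    [-0.7,  0.4, -1.5, -1.0],
                    [ 1.1,  0.9,  0.2,  1.0]])

w = np.array([1, 2, 3, 4])   # arbitrary starting values, one per feature
b = 1

z = (w * x_train).sum(axis=1) + b   # linear part: one z per person
y_pred = sigmoid(z)                 # every value now lies between 0 and 1
print(z)
print(y_pred)
```

Multiplying `w * x_train` relies on NumPy broadcasting the four weights across every row, and `sum(axis=1)` then adds the four products within each row.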
So now we have successfully transformed the linear model. It was originally a multiple linear regression, and we have bent it. What comes next is the same as before: we want to find the combination of w and b that best represents our data.