Create GPT From Scratch - Neural Network | Deep Learning | Transformer #tutorial #chatgpt #pytorch

Captions
hi. Recently, ChatGPT has taken the world by storm. In this video I'll present a walkthrough tutorial on building a GPT-like model; by the end we will generate some poems about cats, and we'll also discuss some new concepts beyond GPT. I believe everyone should learn about neural networks because they relate to many fields. I am interested as a neuroscientist, because it gives me more insight into how AI and the brain can inspire each other. I'll try my best to avoid abrupt jumps and provide a gradual transition between the concepts. I assume zero knowledge in machine learning; if you know some programming you can follow along, and if not, you can still learn from the illustrations and the analogies.

Let's start. Open your Python interpreter; if you don't have one, I suggest downloading Anaconda. After installing, open the Anaconda prompt and type spyder. The IDE opens. On the left side, type print('hello world') and press the green button above to run it. If you see hello world on the right side, we are ready to code. You can also select each line, right click, and run the selection.

Intelligence is about predicting outcomes. Let's take this simple example: your brain tries to associate the switches with the lights after observing the events. One way to model that is using good old-fashioned AI: we define the buttons and then model the conditional events using if-else statements. But this can quickly get messy as the number of conditions increases, and there is no obvious way to make this model learn from the data. Another approach is the perceptron. For simplicity, let's do one light for now. Change off to 0 and on to 1. Here we use the weighted sum of the events, then threshold the result and use it to predict the outputs. But different buttons have different relations with the lights, so we can add the relations to our model too.
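To make the weighted-sum idea concrete, here is a minimal sketch in Python; the button states, the weights, and the 0.5 threshold are made up for illustration:

```python
import numpy as np

# hypothetical example: 5 button states and their relations to one light
xs = np.array([1, 0, 1, 0, 1])              # which buttons are pressed
ws = np.array([0.9, -0.2, 0.1, 0.0, 0.5])   # how strongly each button relates to the light

y = xs @ ws                      # weighted sum of the inputs (the dot product)
light_on = 1 if y > 0.5 else 0   # threshold the result to predict on/off
print(y, light_on)
```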
All we need now to predict the output is to model the relations correctly. Let's simplify the variable names: Xs are our inputs, Ws are the weights (the relations), and the result is Y. This can still get tedious if we add more inputs. Numpy can help, so import numpy as np. Now you can put all the inputs into an Xs array, put all the weights into a Ws array, and simply type y = xs @ ws. The @ sign here is the dot product, which calculates the weighted sum of all the inputs, just like below but in a much more compact way. Now that we have simplified this, remove it.

In real life the relations are hidden from us, so comment the weights out. We can only observe the inputs and their outcomes. To predict the outcome we need to figure out the relations; this is what learning is about in machine learning. But to figure out the weights we need more than one event to observe, so let's observe a few more inputs and outputs. Now that's enough to figure out the relations between these inputs and outputs. If we figure out the weights, I mean the relations, we can predict the output for new inputs, like this input or this one. Delete this.

We now have an optimization problem to solve. The number of inputs is 5 and the number of outputs is one. To find the correct weights we need to search for the right combination. You can think of the solution as a multi-dimensional point; searching here is not different from geographical searching, in which your search space is only two dimensions. One approach is brute-force random search. Let's define the initial weights as a random set of numbers: ws equals random weights with the size of the inputs by the outputs. Let's print our random guess. The correct weights are hidden, so we don't know if our guess is correct or not; we need feedback. We can get a global feedback by making predictions from our guess and then comparing the output Yh with the expected output Ys. The sum of the absolute errors can be used as the global feedback: if it falls below a threshold of 0.05, then we have found the solution; otherwise we make another random guess. Let's try random guessing for 5,000 iterations. Let's record the outcomes in this array, and to plot the results we need to type this and import the matplotlib library like this. After running the simulation you can see we haven't found the solution; none of our 5,000 iterations came close to our threshold of 0.05, and the minimum error is 0.8. This brute-force search becomes worse with more dimensions and more variables. So what can we do to find the needle in the haystack?

A simple approach comes from evolution. Here we record the current error; the father makes a child; the child's location is slightly and randomly mutated; the child's error is assessed. If the child's error is more than the current error, the child is removed, the father makes another child, and we repeat the process. If the child's error is less than the current error, the father can now rest in peace: the child grows to become a father, and the error is updated. If you repeat that, one day you reach the goal. Let's implement this. Define the mutation as a small random set of weights. Copy the original weights and rename them CW, for child weights, and rename E to CE, for child error. Compare the child error with the original error and select the better weights for the next iteration. Let's run. The error is decreasing but gets stuck right above our threshold, so let's reduce the mutation amount. Now it found the solution. Let's break the loop. The last error is now below 0.05, the solution Yh is close to the true solution Ys, and the guessed weights are also close to the true weights, which were hidden from us.
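Here is a rough sketch of the random-guessing and mutation-based search described above; the data, mutation size, and thresholds are illustrative rather than the video's exact code:

```python
import numpy as np
np.random.seed(0)

ins, outs = 5, 1
xs = np.random.rand(10, ins)            # observed inputs (made up for illustration)
true_ws = np.random.rand(ins, outs)     # hidden relations we are trying to recover
ys = xs @ true_ws                       # observed outputs

ws = np.random.rand(ins, outs)          # initial random guess
error = np.sum(np.abs(xs @ ws - ys))    # global feedback for the current guess

for i in range(5000):
    cws = ws + np.random.randn(ins, outs) * 0.01   # child: a small random mutation
    ce = np.sum(np.abs(xs @ cws - ys))             # child's error
    if ce < error:                                 # keep the child only if it does better
        ws, error = cws, ce
    if error < 0.05:                               # good enough: stop searching
        break

print(error)
```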
Now let's make an arbitrary output, where we don't know if we can even find the solution with a linear regression. Let's rerun. No solution is found and the last error is high. Let's add a bias term, because our data are not centered on zero. Bias terms are additional ones appended at the end of the inputs, and they are necessary to model shifts in the data. The error is still above the threshold. The reason is that our simple network can only find linear relationships between the inputs and the outputs; if the relation is non-linear, then we can't find a solution, because we cannot fit a line to a curve, no matter what slope or shift we try.

To solve this we need to add two things: first, another layer of weights connected to some middle nodes; second, a non-linear activation function applied to the middle neurons. You can use any activation function from this list, but I prefer sine waves: we know from the Fourier transform that we can approximate any signal by adding together different sine waves. Let's implement that. Add Wi, which takes the inputs to the nodes; the Ws now get their input from the nodes. Define 15 nodes, dot product the Xs with Wi, then apply the sine wave to generate the non-linearity. Let's add that to the child network too, but we need to change this to nodes. Let's rerun. We get an error because we accidentally changed the initial input Xs, so let's rename all the intermediary results to X and only keep the initial inputs as Xs. Rerun. Yes, we found the solution again and the error is below the threshold. Let's reduce the number of nodes to 5 and rerun. Now no solution is found; it looks like we need more than five nodes to solve this problem within 5,000 iterations. This is a simple problem, and it could be solved with fewer parameters if we had a better search for a more optimal solution.

We could use parallel computing to find better solutions faster. For example, the father can make several children at once, we select the best ones to breed, and they too can make more children in parallel. This way we can reach the solution faster; however, it requires sacrificing a lot of children, and that's horrible and immoral. Shameless DNA has been doing this for billions of years, but now we live in the age of brains: individual lives matter, so we need to find a better way. First, take advantage of the error magnitude, which is known to us. Second, even though it's dark and you don't know the direction towards the global optimum, you can follow the steepest path from your immediate surroundings: use derivatives to infer the direction of the steepest slope. From calculus we know that the derivative of x is 1, the derivative of x² is 2x, the derivative of sine is cosine, and so on. Apply the derivative to the error and update the weights by the error. Some nodes have higher activities and contribute more to the error, so it makes sense to scale the update amount for each weight by x. One important note: the magnitude of the gradient doesn't tell you how far you are from the solution, it only tells you how steep your direction is. That's why you may overshoot the solution, and the error plot will look noisy. Therefore you need to scale down the gradients by another factor; it will take longer to reach the solution, but at least it won't overshoot the target. Let's implement that and update the weights like this. The derivative of the output is just one, because we haven't applied a non-linearity to the last output. Okay, now that we know how to learn from our mistakes, we don't need to sacrifice children, so delete that part and rerun the code.
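A minimal sketch of the network with a sine hidden layer and the gradient-style update of the outer weights, assuming toy inputs and targets; the sizes and learning rate are just placeholders:

```python
import numpy as np
np.random.seed(0)

ins, nodes, outs = 5, 15, 1
xs = np.random.rand(10, ins)            # toy inputs (illustrative)
ys = np.random.rand(10, outs)           # arbitrary targets

wi = np.random.randn(ins, nodes)        # inner weights: inputs -> hidden nodes
ws = np.random.randn(nodes, outs)       # outer weights: hidden nodes -> output
lr = 0.01                               # scale down the updates so we don't overshoot

for i in range(5000):
    x = np.sin(xs @ wi)                 # hidden activations with a sine non-linearity
    yh = x @ ws                         # prediction (no non-linearity on the output)
    e = ys - yh                         # error; the output derivative is just 1
    ws += x.T @ e * lr                  # scale each update by the node activity and the lr
    if np.sum(np.abs(e)) < 0.05:
        break
```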
We found a solution in just 300 iterations. Let's remove the break and run for the whole 5,000 iterations; the error is almost reaching 0 after 5,000 iterations. Cool. Let's see if we can find the solution with five nodes. Yes, the error is below 0.05. Let's see if we can find the solution with two nodes. Unfortunately we can't; the error is stuck around 6. Because you follow the steepest direction in your immediate surroundings, you may get stuck on a nearby local optimum. That's why more nodes can help: more nodes means more dimensions, and if you follow the local evidence in more dimensions, you may find a solution eventually.

But we can do better. So far we only fine-tune the outer layer. If we fine-tune the inner layer too, we may capture hierarchical structures and be able to model with fewer parameters. For example, you need at least 15 parameters to model 5 triple letters; however, if you add hierarchy, you can share the common pair with all the other letters. This can reduce the number of parameters dramatically, especially when modeling deeply nested structures. Fine-tuning the inner layer is similar to the outer layer, except now we need to propagate the errors backward. Rename the weights with indices to indicate the layer number, assign x0 to the initial input, and rename all the intermediary variables appropriately. To backpropagate the error, just like the forward pass, you do the dot product between the error and the weights and multiply by the cosine of the previous output, because cosine is the derivative of sine. Rename these scaling factors to LR, for learning rate, so that we can tune it as a hyperparameter. Now rerun the code. It found the solution even with two nodes, and that's awesome.

Let's rearrange the code like this. If we add more layers, it will become obvious what's happening: whenever you add a new layer, the output of the previous layer becomes the input of the next layer, so adding a new layer is just as simple as copying and pasting with proper indexing of the variables. You can see a clear pattern. Starting with the input, you do the weighted sum of the activations using the dot product, followed by a non-linear activation like the sine function, and you repeat the process until the last layer. Then you compare the output Yh with the target Ys; the mismatch is the error, which is propagated backward in a similar way: you do the weighted sum of the error and then multiply it by the cosine of x. Finally, you use the forward activations and the backward errors to update the weights in each layer. There are four main parts: the part where we define the structure of our neural net, the forward pass where the inputs are propagated through the network, the backward pass where the errors are backpropagated, and the update part where the weights are updated to minimize the error. The entire code can fit into a page. Let's rerun; as you can see, the error is almost approaching zero.

Let's try other datasets. Let's try the XOR problem. XOR has two inputs, so reduce the ins to two. It solves the XOR problem perfectly. Let's try fitting some other functions. Define the Xs as a small range of even numbers from -10 to +10, define the output as an arbitrary linear function of the Xs, and change the input size to 1.
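Putting the four parts together, here is one possible compact version of the two-layer network with full backpropagation, shown on the XOR problem; the variable names loosely follow the video, but the details are my own sketch:

```python
import numpy as np
np.random.seed(0)

# --- structure ---
ins, nodes, outs, lr = 2, 10, 1, 0.1
xs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
ys = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets
w0 = np.random.randn(ins, nodes)
w1 = np.random.randn(nodes, outs)

for i in range(5000):
    # --- forward pass ---
    x0 = xs
    x1 = np.sin(x0 @ w0)                 # hidden layer with a sine activation
    yh = x1 @ w1                         # output layer (linear)

    # --- backward pass ---
    e1 = ys - yh                         # output error
    e0 = (e1 @ w1.T) * np.cos(x0 @ w0)   # backpropagated error; cos is the derivative of sin

    # --- update ---
    w1 += x1.T @ e1 * lr
    w0 += x0.T @ e0 * lr

print(yh.round(2))
```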
Let's increase the number of nodes and reduce the learning rate, then rerun the code. The error plot is noisy, so let's reduce the learning rate more. That looks like a great fit. Let's try another function; it fits that too. It looks like we can fit any linear function. Now let's try a non-linear function. Okay, now we are back to underfitting and noisy error plots; we need to reduce the learning rate further. That looks better. Let's increase the learning rate slightly. We are still underfitting, so more nodes may be necessary: let's use 100 nodes. Again a noisy error plot; we need to reduce the learning rate, and we may need to increase the number of iterations. Finally, a good fit. Okay, now it's obvious we can fit any data if we tweak the number of nodes, the learning rate, and the iterations.

Even though a multi-layer neural network is a very powerful modeling tool, in its naive form it comes with many side effects which need to be addressed before you can use it in any real-world application. First, trying to solve simple linear problems using multi-layered neural networks is like trying to kill a fly with a tank: it may do the job, but it's very messy and inefficient. Second, backpropagation is not bulletproof, especially for deep networks. Imagine several players collaborating to score a goal, but they make errors; you need to assign the blame for the errors to each player correctly, starting from the last player back to the first one. Two problems may occur here. Either too much of the error is blamed on the later players, and by the time you reach the first layer there is nothing left to blame, so the early players have nothing to learn from: this is called the vanishing gradient problem. Or the opposite: you are conservative in blaming the errors, and by the time you reach the early layers all of the remaining errors are blamed on them: this is called exploding gradients, where the gradients become uncontrollably large. Mathematically speaking, the two problems happen when you repeatedly multiply values smaller than one or larger than one. Perfectly balancing this is quite challenging, and you can't achieve it without some extra tricks and tools.

Here is some advice. If you have a simple linear relationship, then linear regression with a simple perceptron is more than enough. If you have a non-linear but shallow problem, then a simple neural network with one hidden layer is probably enough for the job. If you have complex nested structures, then you need a deep neural network, but you also need some extra tricks to make it work. Talking about the extra tricks: sometimes you have recurrent and sideways connections between the layers, as well as various mathematical operations. You need to be an expert in math and programming to properly propagate the error in such networks, and even then you cannot avoid countless hidden bugs. Therefore, do not reinvent every single wheel. Now that you understand the basic principles, it's a good time to add a deep learning tool for our next task.

Close your Spyder IDE and go to the PyTorch website. Here, choose conda; if you have a good GPU choose this, otherwise choose CPU, and copy this line. Open the Anaconda prompt again and paste the line there. Once all the installations are complete, reopen the prompt and restart the Spyder IDE. Now you can import torch, and from torch.nn import functional as F. To be compatible with PyTorch, make the following changes: change every np to torch; set the Ws' requires_grad to True so the gradients can be computed; set loss equal to the mean squared error loss, which takes Yh and Ys as arguments; initialize the optimizer with zero gradients; and define the optimizer above as follows: it takes the list of the weights and the learning rate as arguments.
Here, type loss.backward(). Now you don't need this part, because loss.backward() takes care of that. Type optimizer.step(); now you don't need the update part, because the optimizer takes care of that. One more thing: you need to change the numpy arrays to tensors so that they are compatible with PyTorch. Let's print the error every 500 iterations. Now rerun the code and reduce the learning rate. Manually changing the learning rate is very tedious; it would be awesome if we had an adaptable learning rate, and fortunately there is one. Change the optimizer type to Adam and increase the learning rate to 0.003. Now the error plot is smooth but slightly underfits, so let's increase the nodes to 200. We have a great fit, and we didn't even need to change the learning rate after increasing the number of nodes. That's because the optimizer now takes care of the learning rate to some degree: it adaptively changes the step size. For example, up to this point the optimizer makes small steps, because its past steps changed direction frequently, so the optimizer was not confident; but here the optimizer makes large steps, because its past steps are in the same direction, so it is more confident to make larger leaps. Amazingly, we can do that in the dark, just by relying on the global score and our past steps.

In programming it's good practice to encapsulate the commonly used parts into their own classes. Define class Model, move the weights into the class, and add the self keyword so that they become part of the class. Define a forward function and move the forward pass to that function. We can simplify the variable names, because the optimizer now takes care of all the tracking. Define the model here, change the list of weights to params, and define it as an empty list above; then add the weights to the list once they are generated. Now we can use the model here: it takes the inputs and spits out the output. Everything looks neat now. This part defines the structure of our network, and this part takes care of the training.

One important note before we generate cat poetry: let's re-examine how neural networks learn a function like the square function. You can see it fits the training data very well, with error rates approaching zero. Now let's test it on unseen data. To test the model, define a value, change it to a tensor, pass the value to the model, and print the result; you also need to append a one for the bias. Let's try 4: we get a value close to 16, which is correct. Now let's try -5, which is a new value not seen by our model. The result is -3.7, which is not even close. So what's happening here? Didn't we fit the training data perfectly and get error rates close to zero? Well, we only fit the even numbers in our small training set. These oversized networks have a lot of degrees of freedom, and they can fit the training data however they want. One way to reduce the crazy fit is to regularize the network. A very simple solution here is just reducing the initial weights by multiplying them by 0.1. Now let's rerun. Again a good fit for the training data, but let's see how it generalizes to unseen data. Let's try 5: we get 24.5. Not perfect, but much better than before. When the initial weights were large, the output was complex, and it was possible for backprop to find crazy solutions just by fine-tuning one or a few impactful weights. However, with smaller weights no weight alone has a big impact, so all the weights have to be fine-tuned and collaborate to fit the data.
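Here is a rough PyTorch sketch of this stage: a small Model class, Adam, the mean-squared-error loss, and the 0.1 weight scaling used as a simple regularizer. The data (even numbers and their squares, with a column of ones appended for the bias) and the node count are illustrative, not the video's exact code:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

class Model:
    def __init__(self, ins, nodes, outs):
        # small initial weights (scaled by 0.1) act as a simple regularizer
        self.w0 = (torch.randn(ins, nodes) * 0.1).requires_grad_(True)
        self.w1 = (torch.randn(nodes, outs) * 0.1).requires_grad_(True)
        self.params = [self.w0, self.w1]

    def forward(self, x):
        x = torch.sin(x @ self.w0)     # sine activation, as used up to this point
        return x @ self.w1

# toy data: even numbers and their squares, with a bias column of ones appended
xs = torch.arange(-10, 11, 2, dtype=torch.float32).reshape(-1, 1)
xs = torch.cat([xs, torch.ones_like(xs)], dim=1)
ys = xs[:, :1] ** 2

model = Model(2, 200, 1)
optimizer = torch.optim.Adam(model.params, lr=0.003)   # adaptive step sizes

for i in range(5000):
    yh = model.forward(xs)
    loss = F.mse_loss(yh, ys)          # mean squared error between prediction and target
    optimizer.zero_grad()
    loss.backward()                    # the backward pass is handled for us
    optimizer.step()                   # ...and so is the weight update
    if i % 500 == 0:
        print(i, loss.item())

print(model.forward(torch.tensor([[4.0, 1.0]])))   # test on a value from the training range
```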
Another improvement can be achieved by changing the sine function to relu. Relu fits the curve with small straight lines, and that makes it behave less erratically. So now with -5 we get 25.1. Let's try 8: it's a perfect 64, which is expected because 8 is already in the training set. For 9 we get 82.4; still not perfect, but we are getting better interpolation. Here is the real challenge: let's try 100, which is very far from our training set. Unfortunately, the result is also very far from the target. In light of these results, let's revise our conclusion. Neural networks can fit any data provided that we have enough nodes in the hidden layer, and they can interpolate fairly well, but there is a condition: they can't extrapolate far beyond the data distribution. To be honest, humans are also bad at extrapolation. To make your neural network good at interpolation, do this: add more data, train the network for longer, and finally add more constraints like regularizers and similar tricks. Honestly, there is no hard rule here; sometimes you have a great idea but it won't work, and sometimes you have an insane idea but it works. In short, you have to mess around a lot to find out.

Many useful applications can be built with neural networks, for instance image detectors, medical applications, and translators; they can be used to model the mapping between any inputs and outputs. Another useful application is autoregression. Here you divide the sequential data into past and future; once you train the network on the past context to predict the future, you end up with a generative AI. You can do that for text, audio, images, or any other sequential data. Let's take language as an example: you can train a network to predict the next letter in a sentence, and the predicted letter becomes part of the next context. This way you can generate sentences. Let's implement that.

Open a text file using this Python code, or copy and paste your text between these triple quotes. I'm going to use this simple poem about cats for the training. The text is just over 3,000 letters, which is considered very small, but let's keep it simple for educational purposes. First, change it to lowercase to reduce the vocabulary size, then get the set of letters from the text. The neural net only understands numbers, so we need to assign a number to each letter and record that in a dictionary, then use the dictionary to map all the letters to their corresponding numbers. Now we have successfully converted the text to numbers; the vocabulary size is 48 unique letters. Let's move these settings here, set the input size to 5, and keep the output as 1.
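A minimal sketch of the text preparation described above, assuming the poem lives in a hypothetical file called cat_poem.txt; s2i and i2s are the letter-to-index and index-to-letter dictionaries:

```python
import torch

text = open('cat_poem.txt', encoding='utf-8').read().lower()   # hypothetical file name

chars = sorted(set(text))                       # unique letters in the corpus
s2i = {ch: i for i, ch in enumerate(chars)}     # letter -> index
i2s = {i: ch for ch, i in s2i.items()}          # index -> letter (for decoding later)
data = [s2i[ch] for ch in text]                 # the whole text as numbers
vocab_size = len(chars)

ins = 5                                         # context length: 5 letters in, 1 letter out
xs = torch.tensor([data[i:i + ins] for i in range(len(data) - ins)])
ys = torch.tensor([data[i + ins] for i in range(len(data) - ins)])
print(xs.shape, ys.shape, vocab_size)
```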
This means we use 5 letters as input, and the letter after them will be used as the output. Convert the data to tensors: the Xs are the stack of letters from i up to the input size, and the Ys are the stack of letters that come right after the last input letter. Let's check the Xs and Ys. So all we have done so far is prepare the text data for training: we feed these arrays to our model and train it to predict the corresponding outputs. Delete the bias part, because we will center our data later on, and remove the plus one here because we no longer have the bias term. Now let's run the entire script. This will take some time, but you can see the error plot is decreasing. Let's examine the fit. We are using a context of five letters, so the fit is expected not to be perfect. Let's increase the context size to 16 letters. This will take ages to finish. It's common practice to divide your dataset into smaller batches, so let's do that: sample a hundred examples randomly from the data and repeat that for each iteration, so move this part to the beginning of the training loop. Rerun the code and let's zoom in to see how well the model fits the data.

We see negative values in the output; that's because the network produces a floating-point value for the prediction, which can be converted back to a letter. We need to tell our network that we are dealing with categories in our output. To do so, change the Ys here to long format, and change this to cross entropy, which is suitable for categorical classification. We need to change the output size to the vocab size, so make the necessary changes in the code and finally set the output size to vocab size. Basically, we do the forward and backward passes just like we explained before. To plot correctly, we select the maximum probability and use that as our predicted value. Let's fix that typo here and rerun. We have a better fit now.

We can now easily convert the predicted indices back to letters; let's do that. Set s to be the initial context and feed s to our model. The model produces a vector of probabilities; in this case the highest probability is for index number 15, which I think corresponds to the letter n. Here is an interesting trick: if we multiply Yh by a factor less than one, the higher-probability outputs are suppressed more, which gives lower-probability letters more chance of being selected. We will come back to this useful trick later; for now let's select the maximum probability. Let's repeat that three thousand times. Each time we predict the next letter, we roll the context s by one step and replace the last letter with the predicted one. Define gen_text as an empty string and append the predicted letters to it. We also need to convert the letter indices back to alphabet characters, so define another dictionary, i2s, to reverse the indices back to letters. Let's rerun the code and finally print the generated text. You can see it spits out the memorized text. Now, instead of selecting the maximum probability, sample the letters according to their probabilities. We see newly generated words and phrases; however, the overall structure doesn't look like a poem. That's because we use a small context of 16 letters, which is not enough to model entire sentences. Let's increase the context size to 64.
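Here is one possible version of the generation loop, assuming a trained model whose forward pass takes a batch of contexts (letter indices) and returns a score for every letter in the vocabulary, plus the i2s dictionary from earlier; the 0.8 factor is the scaling trick mentioned above:

```python
import torch
from torch.nn import functional as F

s = xs[0].clone()            # start from some context taken from the training data
gen_text = ''

for _ in range(3000):
    logits = model.forward(s.unsqueeze(0))[0]   # scores for the next letter (assumed shape: vocab_size)
    probs = F.softmax(logits * 0.8, dim=-1)     # scaling by a factor < 1 flattens the distribution,
                                                # giving lower-probability letters more chance
    idx = torch.multinomial(probs, 1).item()    # sample a letter according to the probabilities
    gen_text += i2s[idx]
    s = torch.roll(s, -1)                       # slide the context window by one step
    s[-1] = idx                                 # and append the predicted letter

print(gen_text)
```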
Let's rerun. The model spits out memorized text, and by the time it reaches the phrase "always free", which is the end of our poem, it continues with meaningless phrases. As I said, I do not expect high-quality generation from such a small dataset; however, I do expect that a well-trained model should be able to produce new phrases that look like variations of the original text.

The current network is very wasteful. For example, to learn the word cat, at least one of the nodes needs to activate whenever it encounters that word, and the weights need to be adjusted properly so that the node can detect it. The problem is that this node is position dependent: it will not activate for the word cat in other places, so other nodes need to be recruited to learn about cats elsewhere in the sentence. To solve that, convolution can help. The relations among the letters are localized, so we can use a filter, which is just a set of adjacent weights, and move it across the whole context. Once the filter learns to recognize a pattern such as the word cat, it's position invariant and can recognize the word anywhere in the context. Different filters can be combined in the higher layers to produce phrases; for instance, this node can learn to detect the phrase "cats fun", however "fun cats" can't be detected, so we need to recruit another node for that. Generally, we want a network that's invariant to position, rotation, and scaling: the more invariant our network, the more generalizable it becomes.

Let's imagine a simple filter that takes one input at a time and moves across the entire context. If we sum together the outputs from this filter, we get a result that's invariant to both position and permutation; however, the capacity of this filter is too small to learn and memorize any pattern. To solve that, we can represent each letter with a unique vector instead of a single value. Now, if we sum the embeddings for each letter, we get a contextualized vector that's invariant to the position and permutation of the letters. So, putting this all together: you embed each letter into its corresponding vector, then pass the embedding through a simple linear network and record the output, then move the filter and do the same for the next letter in the context, then sum all the outputs and pass the result through a non-linear neural network. The output of that network will be used to predict the next letter. Hopefully, this network can now learn to recognize patterns regardless of their position. Let's implement that. Set the embedding size to 64.
Let's associate each vocabulary entry with a random vector. X now passes through the embedding, so we need to change the ins to the embedding size here, and change the data to long format, because we are now using the data as indices into our embedding table. Now we take the fetched embedding and dot product it with the weights of our filter. Define the filter weights as Wv, which takes a vector of size n_embed and outputs a vector of the same size. Then we sum the output of the convolution across the input dimension before feeding it to the non-linear layers of our neural network. Let's rerun. The output is garbage, because it spits out letters without caring about their position; we need to add the position information too. You can be clever about it, but a simple way is to mark each position in our context with a random vector and then multiply it with the embedding. Okay, now we get better results.

Let's put the filter into its own class, called Head, and move this part into the class, just like before. Now you can easily reuse the class here by defining self.heads and passing the input x to the forward pass of the head. You can easily make more filter heads, so let's make an array of four heads. We then need to concatenate the output from all four heads, which increases the embedding size by 4; to keep the output from expanding, we divide it here by the number of heads. The reason for having more filter heads is to increase the capacity of our network so that it can detect various words and word connections. You may say four heads will not be enough to hold all the possible patterns; well, the trick is that we are using a distributed representation. Representing each word with one node was only for illustration purposes.

Despite learning the training data well, our model is still bad at generalization. One important reason is that we are only using the longest context to predict the next letter. Predictions depend on the context length too: if the model has to predict what comes after the letter t here, then any of these letters is reasonable; however, if the past context was the word cat, then a more reasonable prediction is the space character. To make our network aware of various context lengths, we need to pass in inputs of various context lengths along with their corresponding outputs. Surprisingly, this is easy to do: change this part from i plus ins to i plus 1.
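A rough sketch of the filter-head idea from this section: an embedding vector per letter, a random vector per position, a shared filter applied at every position, and a sum that makes the result position invariant. Names like Head, wv, and n_embed loosely follow the video; the batch and sizes are illustrative:

```python
import torch

torch.manual_seed(0)
vocab_size, n_embed, ins = 48, 64, 16

emb = torch.randn(vocab_size, n_embed)    # one random vector per letter
pos = torch.randn(ins, n_embed)           # one random vector per position in the context

class Head:
    def __init__(self):
        self.wv = torch.randn(n_embed, n_embed) * 0.1   # the shared filter weights

    def forward(self, x):              # x: (batch, ins, n_embed)
        out = x @ self.wv              # apply the same filter at every position
        return out.sum(dim=1)          # summing makes the result position/permutation invariant

x = emb[torch.randint(0, vocab_size, (4, ins))]   # fetch embeddings for a toy batch of contexts
x = x * pos                                       # mark each position (later changed to addition)
heads = [Head() for _ in range(4)]
out = torch.cat([h.forward(x) for h in heads], dim=-1) / len(heads)
print(out.shape)                                  # (4, n_embed * 4)
```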
Now, instead of summing only the longest context, we need to sum the inputs of the smaller contexts too and feed them separately to our non-linear network, so that the network learns what comes after each context length. One easy trick to do that efficiently is to take the dot product between the input and a matrix of the same size filled with ones, except the upper-right half, which is masked with zeros. You can implement that like this: make a matrix of ones with the input size, then tril it to zero out the upper-right half. A good practice is to change the matrix values into probabilities by applying the softmax function; however, before applying softmax, you need to mask the zeros with large negative values so that they turn back into zeros after softmaxing. Finally, you dot product the resulting matrix with the input X. Now, even after making our network aware of various context lengths, we are still going to use the longest context for our prediction, so index -1 here to fetch the last context.

We still cannot model long-term dependencies. It's easy to model what comes right after t, but it's very difficult to model what comes many steps later, because that depends on the context, and learning to form a proper context from the inputs is a challenging task. Since we are summing the inputs directly, backpropagation will share the error equally over all the inputs, but that's not good for learning: inputs need to be penalized appropriately. One of the old solutions comes from recurrent neural networks. In this case we use another set of weights to carry the inputs from the past to the present in a recurrent way; these weights learn to keep the important letters in their internal state. To train this network we need to backpropagate through time, but remember we talked about the vanishing gradient problem: in this case it's even worse, because we need to backpropagate through many, many steps. An interesting solution to vanishing gradients comes from LSTMs. The basic idea behind LSTMs is to gate the inputs with a separate network that learns to re-weight the inputs according to their significance. The idea is that we can reduce the number of steps by attenuating the non-significant parts, and hence pay attention to the more important parts. Talking about attention: if every player pays attention to the other players, they can learn to collaborate better and minimize their error. The basic idea is that we need to transform this matrix into an attention matrix, where the values indicate how much attention each input needs to pay to its neighbors, and then we re-weight the inputs accordingly. This network structure was the state of the art for language modeling up until 2017.
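Going back to the masking trick described above, here is a minimal sketch of the matrix of ones with the upper-right half zeroed out and turned into probabilities with softmax; the sizes are illustrative:

```python
import torch
from torch.nn import functional as F

ins = 8                                    # context length (illustrative)
x = torch.randn(4, ins, 64)                # a toy batch of embedded contexts

# matrix of ones with the upper-right half zeroed: row t only "sees" positions <= t
mask = torch.tril(torch.ones(ins, ins))

# turn each row into probabilities: the zeros must become -inf so they vanish after softmax
weights = mask.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)

out = weights @ x                          # every context length gets its own summed-up summary
print(out.shape)                           # (4, ins, 64); out[:, -1] is the longest context
```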
However, it was computationally intensive and very hard to train in parallel, because you cannot compute this part until you are done with the earlier step: its inputs rely on the output from the previous step. One group paid more attention to the attention part: they removed all the recurrent parts and left only the attention part, and it worked. This was an important step, because now we can train this in parallel and take advantage of GPUs to scale these networks up to 100 billion parameters.

Okay, now let's implement the attention part, because that's all we need. First we need to produce the attention matrix. We already had the matrix of ones; to produce the attention matrix we take the dot product of the input vectors with themselves and use the similarity scores as our attention matrix. Let's implement that. Remove the part that produces the matrix of ones, then dot product the input vectors with themselves, but transposed across the input dimension, and then normalize it. Now call that attention and pass it like this. Generally, this x is called Q, for query, and this one is called K, for key, so rename them accordingly. To have finer control over the attention scores, we pass the query and key inputs through some learnable weights, so let's define those weights like this. For consistency with the literature, let's call this x V, for value. Now the self-attention is complete. One final important thing: you need to change the multiplication to addition here, for the position embedding to work properly with attention. Let's rerun. As you can see, we now get better results, because instead of adding the input vectors together in a dumb way, we have the attention scores, which re-weight the inputs appropriately before adding them together. Now you can see better phrases and better-looking structure.

This is just one block of attention; you can stack these blocks on top of each other to form a multi-layered transformer. Let's move the attention part to a class called Block. Now we can reuse the class: let's define an array of three attention blocks and stack them on top of each other. As we have mentioned before, learning becomes difficult with more layers because of the vanishing gradients. Another easy trick to mitigate this is residual connections. Think of residual connections as convolutions across the layers: just like convolutions share weights across the input dimension, residual connections share the input across the layers, allowing the learned patterns from the lower layers to be reused in the upper layers. To implement that, just re-add the input x across all the blocks. Okay, we get an error, because we need to change the output of each block to the embedding size so that the blocks match when stacked on top of each other. Let's fix this typo here and rerun. This will take a while, so grab a coffee.

Okay, now we get a poem about cats that looks like the original poem. Overfitting happens here because we have a very small corpus and a relatively large network, so almost all of the phrases are just regurgitated, but there are some new made-up phrases too. So we have transformed a rigid neural network that only learns from one context and is not invariant to position or permutation into a more flexible network that learns from contexts of different lengths and is invariant to position and permutation. This is not a standard implementation of GPT, because we are still missing some details; for a more standard implementation, refer to Karpathy's awesome video.
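For reference, here is one rough, non-standard sketch of a self-attention block with a causal mask, residual connections, and three stacked blocks; the variable names and the scaling by the square root of the embedding size are my own choices, not the video's exact code:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
n_embed, ins = 64, 16

class Block:
    """One rough self-attention block, loosely following the video's naming."""
    def __init__(self):
        self.wq = torch.randn(n_embed, n_embed) * 0.1   # query weights
        self.wk = torch.randn(n_embed, n_embed) * 0.1   # key weights
        self.wv = torch.randn(n_embed, n_embed) * 0.1   # value weights
        self.mask = torch.tril(torch.ones(ins, ins))    # causal mask

    def forward(self, x):                 # x: (batch, ins, n_embed)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        att = (q @ k.transpose(-2, -1)) / n_embed ** 0.5      # similarity scores, normalized
        att = att.masked_fill(self.mask == 0, float('-inf'))  # hide the future positions
        att = F.softmax(att, dim=-1)
        return att @ v                    # re-weight the inputs by the attention scores

blocks = [Block() for _ in range(3)]
x = torch.randn(4, ins, n_embed) + torch.randn(ins, n_embed)  # embeddings + positions (added)
for b in blocks:
    x = x + b.forward(x)                 # residual connection: re-add the input across blocks
print(x.shape)
```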
However, I would like to continue with some alternative ideas and directions. I have been thinking a lot about where the magic is in the self-attention mechanism, because the end result is just a matrix of values ranging from zero to one, which we use to re-weight the inputs at each layer. So I tried to produce the re-weight matrix directly, without the self-attention mechanism, using learnable lateral connections. This is easy, since we are modulating each input by lateral connections from the other inputs. You can think of this method as an LSTM on steroids, because now we are gating all the past inputs using separate gates. Let's implement that. Delete the self-attention part; we are going to define lateral connections among all the inputs and call them WR. Now we use this to generate the attention matrix, which we now call a re-weight matrix, and pass that instead of attention. Let's rerun the code. In my experience, this method learns very well, but it overfits the data if it has the same number of parameters as the self-attention mechanism, so this method may benefit from more regularization, or we can simply make the model size smaller.

Let's reduce the temperature by multiplying Yh by 0.8; as we said before, this allows us to generate more inventive text. Here it puts an e at the end of storm and tries it with other sentences, basically interpolating, as we have discussed. If you reduce the temperature further you may get more inventive phrases; however, if your dataset is too small or your model didn't generalize well, you will get more errors and nonsense phrases.

By the way, if you have a good GPU, you can run this faster by making these slight changes. Type device = 'cpu' right after this line, send the Ws to the device, and send all the other tensors to the device too; finally, send the Xs and Ys to the device as well. To run on the GPU, set the device to cuda instead.
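Before moving on, here is a rough sketch of my reading of the lateral-connections alternative described above: the re-weight matrix is a learnable parameter WR rather than being computed from queries and keys; everything else here is illustrative:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
n_embed, ins = 64, 16

# learn the re-weight matrix directly as lateral connections among the input positions,
# instead of computing it from queries and keys (a rough reading of the idea above)
wr = (torch.randn(ins, ins) * 0.1).requires_grad_(True)      # learnable lateral connections
wv = (torch.randn(n_embed, n_embed) * 0.1).requires_grad_(True)
mask = torch.tril(torch.ones(ins, ins))

def reweight(x):                                   # x: (batch, ins, n_embed)
    r = wr.masked_fill(mask == 0, float('-inf'))   # still block information from the future
    r = F.softmax(r, dim=-1)                       # turn each row into mixing weights
    return r @ (x @ wv)                            # gate all past inputs into each position

x = torch.randn(4, ins, n_embed)
print(reweight(x).shape)                           # (4, ins, n_embed)
```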
I have already tested a bigger model on a Colab instance, with more layers and with a temperature of 0.7. Here you can see that even though in the original text the term chasing was only used twice, our model learned to use this term in various other contexts, like chasing rays, chasing milk, chasing toys, chasing in the sun, chasing strings, chasing tails, and so on. So the idea here is: if you train this model on a large corpus of poetry about cats, and you train the same model on a large corpus of text about dogs, then we can generate a poem about dogs that's new and doesn't exist in the training set, because the model can interpolate and integrate what it has learned about dogs into a poem.

I believe we can simplify the current state-of-the-art neural networks much more. In the end, everything is translated to dot products and floating-point operations. However, we know that in principle all arithmetic operations are translated back into simple logical operations on binary inputs. So in theory, if we cut out the middleman, we may find a lot of redundant operations and simplify AI to a few lines of logical operations on binary or symbolic inputs. One day we may come back full circle. This is important because we know that the neurons in our brain don't do floating-point operations; instead, they communicate with binary spikes. And if we identify intelligence in its basic essence, then we will have a better chance of understanding intelligence and the human brain.

The objective of any intelligent system should be aligned with human interests, but human interests are not aligned with each other. Some people say they will follow the truth even if it hurts their feelings, but what is truth anyway? If the base reality is subjective, then I don't know what truth is; if the base reality is objective, then I still don't know what truth is, but I can entertain some thoughts. Basically, our brain tries to predict the future. That's a bit tricky, because we are not just observers, but let's talk from the eye of an observer. The truth is the path that will actually happen; therefore intelligence is the ability to see that path more clearly through the noise. So intelligence is the ability to predict correctly, which is the ability to compress losslessly. This has even inspired competitions for compression algorithms, and it resonates well with Occam's principle, which is a simple guide for finding more truthful hypotheses.

Let's say this graph is all the knowledge that we can attain, and the part inside the circle is what we have already discovered. Finding new nodes inside the circle is interpolation; finding new nodes outside the circle is extrapolation. Ideally, we want to discover new emergent nodes with minimal trial and error. Let's say we have two models, a big model and a smaller model, and both are capable of fitting and explaining the discovered knowledge inside the circle very well. By Occam's principle we should trust the smaller model more, because it has more potential to extrapolate truthfully. Now, what's the smallest possible model that can fit all of our discovered data? The theory of everything, which physicists are looking for, is probably the smallest possible model. If we apply this model recursively to the initial state of our universe, then its output can predict the Big Bang, even the emergence of galaxies and planets and the evolution of life, up until the current point where we are trying to predict the next letter in a sentence. But that's too much computation just for predicting the next letter.
Also, it's not possible to compute the model of our universe faster than the universe itself. Therefore, even though the theory of everything is the most truthful model, it's useless for predicting the future, because we cannot simulate it faster than our universe. All the other models, even our mental models, have some degree of uncertainty, so we can probably never find the absolute truth, and we are cursed to live with uncertainty. Thanks a lot for watching; stay tuned, or watch a related video.
Info
Channel: Brainxyz
Views: 16,531
Keywords: chatgpt, machine learning, tutorial, coding, pytorch, deep learning, gpt-4, gpt-3, programming, neural networks, gpt, gpt-5, gpt4, gpt5, bard, openai, google, llms, large language model, transformers, self-attention, backpropagation
Id: l-CjXFmcVzY
Length: 47min 53sec (2873 seconds)
Published: Mon May 01 2023