NLP Roadmap 2024: Step-by-Step Guide | Resources

Video Statistics and Information

Captions
Hey everyone, welcome back to my channel. Today we're going to discuss an NLP roadmap. If you're a beginner data scientist, or someone who wants to build a career in natural language processing, we'll cover how you should approach it: what steps to take, what resources to use, and what kinds of problems you can solve with that knowledge. I've noted down some points to guide the discussion, so let me start.

I actually shared an NLP roadmap on LinkedIn, maybe eight to twelve months back, and got a really good response. But things have changed since then: there are new libraries, and even the tools I use in my day-to-day job as a freelancer have changed. If you don't know me, I'm Pradip, a full-time freelance data scientist working primarily on natural language processing. I'm an Expert-Vetted freelancer on Upwork, which is the top 1% there, and I've earned more than a hundred thousand dollars using exactly the things I'm going to discuss with you. So stay tuned, and I'll give you a step-by-step approach to becoming at least an applied NLP expert.

First, before jumping into natural language processing: since it's a specialization within machine learning, you should have a basic understanding of machine learning itself. What exactly is machine learning? What are supervised and unsupervised learning, classification, regression, clustering? If you're not familiar with those things, you should first go and check a machine learning roadmap or something like that instead.
So I'll assume you're already familiar with machine learning; say you've tried logistic regression and random forests. Now you want to take your career to the next step and learn natural language processing, so let's get into it.

The first thing in NLP, which is all about text, is text preprocessing. Say you have a big paragraph, web pages, or a PDF file: you need to process them. You might tokenize the text into individual words, do lemmatization, remove punctuation; all of that falls under text preprocessing. Earlier there was mainly NLTK, but nowadays it's usually best to use the spaCy library for this kind of preprocessing.

The next thing is text representation. As you know, machine learning algorithms can't understand text or words directly; ultimately you need to convert those words into some form of vector or numeric representation. Say you want to do sentiment classification: you have reviews that could be positive, neutral, or negative, but you can't send the raw review text to a machine learning model; you need to process it so it becomes numbers. So what are the different ways you can represent text?
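The preprocessing step above (tokenize, drop punctuation and stop words) can be sketched with spaCy. This is a minimal sketch using a blank English pipeline, which needs no model download; lemmatization and entity recognition would require loading a trained model such as `en_core_web_sm` instead:

```python
import spacy

# A blank English pipeline gives tokenization plus lexical attributes
# (is_punct, is_stop) without downloading a trained model.
nlp = spacy.blank("en")

text = "The movie was great, but the ending felt rushed!"
doc = nlp(text)

# Keep lowercased tokens, dropping punctuation and stop words.
tokens = [t.text.lower() for t in doc if not t.is_punct and not t.is_stop]
print(tokens)
```

The same loop works unchanged after swapping in `spacy.load("en_core_web_sm")`, which additionally fills in `token.lemma_` for lemmatization.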
Traditionally, you could use a bag of words, or count vectors: you simply create a vector saying whether each word is present or not, or how many times a particular word appears. Bag of words and count vectors are one technique; a more sophisticated one is TF-IDF; and then there are word embeddings like word2vec and doc2vec. You should learn about all of them. Whether you end up using them or not is a different question, but you should definitely have a good understanding of the different ways you can represent text. Nowadays we have more advanced techniques, which we'll discuss, but these basics — bag of words, TF-IDF, word2vec, doc2vec — you should definitely understand. Once you understand them, you can build text classification problems or cluster text using those features; you can think of these representation techniques as feature engineering.

The next thing is information extraction. Say you want to extract named entities: you have some text and you want to pull out the names of people, organizations, and places. That's entity extraction. It could also be something like part-of-speech tagging and similar linguistic problems. So you need to understand the information extraction side of things, and here again you can use spaCy. As for the feature engineering I mentioned — bag of words, count vectors, TF-IDF — scikit-learn has functions for all of those.
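The bag-of-words and TF-IDF representations above can be sketched with scikit-learn in a few lines (the tiny review list is just illustrative data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = [
    "great movie, great acting",
    "boring movie",
    "great acting but boring story",
]

# Bag of words / count vectors: one column per vocabulary word,
# values are raw occurrence counts.
bow = CountVectorizer()
counts = bow.fit_transform(reviews)
print(sorted(bow.vocabulary_))   # the learned vocabulary
print(counts.toarray())          # shape: (3 documents, vocabulary size)

# TF-IDF: same matrix shape, but words that appear in every document
# are down-weighted relative to rarer, more informative words.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(reviews)
print(weights.shape)
```

Either matrix can be fed straight into a classifier such as `LogisticRegression` for the sentiment-classification example above.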
For word2vec I used to use Gensim, but you can check what other libraries are available for that kind of thing. For information extraction and text processing, spaCy is good. With spaCy you get pre-trained models to extract entities like person names and organization names. But what if your problem statement is different and you want to identify your own custom entities? Say you're working on something medical: then you might want to extract diseases, medicines, and so on. You can sometimes find pre-trained models for those too — search for models already trained on your problem — but ideally you should be able to train a custom named entity recognition model. This is a problem statement that comes up regularly; as a freelancer I'm often asked to build custom NER models, where the entities depend on the client's domain. In banking, they'll have their own entities like account ID, account type, and so on. You need to understand that.

Up to this point it was all about basic text preprocessing and representation. To go to the next level, we need to get into deep learning, so you should be familiar with neural networks. Nowadays I wouldn't say you need to go deep into LSTMs or RNNs, but at least have some understanding of neural networks and how the backpropagation algorithm works. There are popular courses for this; one of the most popular is Andrew Ng's, which you can find on Coursera or on their own website, deeplearning.ai. I'll share those links in the description, or maybe I'll create a companion blog post where I put all of this information.
Once you're familiar with neural networks, you need to understand transfer learning. I mentioned pre-trained models — what does "pre-trained" mean? It means an existing model that was trained on a certain dataset, which you can use directly. You'll find pre-trained models for sentiment classification and for named entity recognition. You can also take a base pre-trained model and fine-tune it: you take a model already trained on some data, then expose it to your own dataset so it learns the nuances of your data. That's fine-tuning, and it's very important in machine learning, because a lot of the time you won't find a model that fits your business problem out of the box. If the task is very common — sentiment analysis, or extracting common entities like person and organization names — you'll get a pre-trained model; but for your custom business problem, you'll need to fine-tune. Similarly, not every problem is sentiment classification: sometimes you have custom categories and documents you want to classify into them, and then you should know how to fine-tune a classification model. For this you can use the Transformers library, the famous Hugging Face library, which has models like BERT — one of the Transformer models — and T5, and many other models you can fine-tune. I've created many videos on this: one on fine-tuning a named entity recognition model, one on using pre-trained models from Transformers, and one on fine-tuning a Transformer model.
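As a taste of those pre-trained Transformer models, here's a minimal sketch using the library's `pipeline` helper. The default sentiment model is downloaded from the Hugging Face Hub on first use, so network access is assumed; actual fine-tuning uses the `Trainer` API and is beyond this sketch:

```python
from transformers import pipeline

# pipeline() pulls a default pre-trained sentiment-classification
# model from the Hugging Face Hub the first time it runs.
classifier = pipeline("sentiment-analysis")

result = classifier("I absolutely loved this product!")[0]
print(result)  # e.g. {'label': 'POSITIVE', 'score': 0.99...}
```

The same one-liner pattern works for other tasks such as `"ner"` and `"summarization"`, each backed by its own pre-trained model.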
You can check other people's videos too; I'm just telling you these are the things you should definitely do, no matter where you learn them. All of this — pre-trained models, Transformers — is the concept of transfer learning, and you should get really good hands-on practice with it. There are also a couple of famous blog posts on the topic; I think one is called "The Illustrated Transformer", though I can't recall exactly. I'll share the links, because those posts are very popular.

Up to this point, you're able to build a custom text classification model or a custom named entity recognition model, or use a pre-trained model. But ultimately a business wants to use these models, so you need to deploy them somewhere so they're accessible. During development you can deploy as a web application using Streamlit — a Python library for building UIs — and integrate your models there while you're iterating. But when you put it into production, it needs to be exposed as an API, so that other developers — front-end folks working with React.js and the like — can communicate with it. So you need to build an API. And where do you deploy it? Most likely on one of the cloud providers: AWS, Google Cloud, or Azure. I mostly use AWS EC2; sometimes, if a client is on another cloud provider, they'll share their VM details so I can connect to that VM and do the deployment there.
So you need to understand this basic thing: how to deploy a model so that others can use it via an API. For the API you could use Flask, or FastAPI — I personally use FastAPI. You could also use AWS Lambda, which I used to do, but it's simpler to stick with Flask or FastAPI for deployment.

At this stage we can handle problems like entity extraction and classification. The next thing to understand: we talked about different ways to represent text — count vectors, TF-IDF, word2vec — but nowadays we create those representations using something called embeddings. Embedding means converting your text into a vector representation. Once you have text as a vector, you can compare things: take two documents, create embeddings for both, and measure how similar they are using cosine similarity or another metric. So you should understand the concept of embeddings.

Then comes semantic search. How does Google work? You search for something; Google has all those pages indexed; it needs to map your query to the semantically matching pages that can answer it. The same applies if you want to build a question answering system: you have hundreds of PDF documents, and when a user asks something, you want to find the document most likely to answer that particular query. So you take the user query and compute its embedding.
Then you have embeddings of your documents or paragraphs; you compare the query embedding against each of them, and eventually you find the document with the highest similarity score. That's semantic search, and that's what you return. It's very important nowadays — we use it very frequently.

So what options do you have for embeddings? The most popular open-source option is sentence-transformers: you can use that library to create embeddings. There are also paid options — OpenAI embeddings, Cohere embeddings — but when it comes to embeddings I stick to open source. They're pretty good, and I haven't found a reason to use the OpenAI embeddings.

Now let's talk about LLMs, large language models, which are the buzzword — and most of my work is actually related to LLMs. The most popular LLMs are the OpenAI models: GPT-3.5 and GPT-4 (earlier it was GPT-3). Recently there's also an open-source model called Llama 2, released by Meta (Facebook), which you can use commercially. I'm still exploring it — I don't have first-hand experience fine-tuning that open-source model yet, but it's something I'm going to explore.

Now, how do all these things fit together, and what kinds of problems do we solve with them? I mentioned one problem statement: question answering. Say I have a bunch of PDF documents, and I want to build a system where I can ask a question and get a relevant answer from my own documents.
Typically there's a difference between question answering and a chatbot, so let's talk about question answering first. What do you need? First of all, some way to convert your documents into embeddings so they can be compared with the user query — again, you can use the sentence-transformers library or OpenAI for that. Now you have the embeddings, but where do you store them? Sometimes you'll have thousands or even millions of vectors. You need an efficient way to retrieve and match those vectors, because you don't want someone's query to spend a lot of time matching against lots of documents — that means very high latency. So you need a database capable of efficiently searching through vectors, and that's where vector databases come into the picture. There are paid, cloud-hosted options like Pinecone and open-source ones like Chroma DB; you can try both. Personally I've used Pinecone, and nowadays I do a lot with Chroma DB, so you can explore those too.

Back to how it all fits together for question answering: we create embeddings of our documents and store them in the vector database. A user asks something; we match the query embedding against all the document embeddings we have, and we find that, say, out of a thousand documents, these two can answer this particular query. But we still don't have an answer — we only know that those two documents probably contain it.
That was only semantic search: we narrowed down the documents we could use. The next step is the large language model. We take the user query, take the two documents we got from the vector database, give them to an LLM like GPT-3.5 or GPT-4, and say: based on these documents as context, can you generate the answer? And it generates the answer for you. That's how embeddings, semantic search, vector databases, and large language models work together.

That was question answering — but what if the user wants to ask follow-up questions? Then we need to remember what the user said earlier and what the model answered. That's where conversational AI comes into the picture — the ChatGPT kind of thing. People now have that basic expectation, because they're used to ChatGPT. So whatever you build — question answering over your documents and PDFs, or even over your own database — you could create an interface where you ask questions in natural language and the system checks against your SQL database, generates the SQL query (or executes it), and gives you the result.

I already have videos about all of this: how to use the OpenAI models (ChatGPT, GPT-4) to build products, how to build semantic search, how to use OpenAI together with Chroma DB, and how to use OpenAI together with Pinecone to build these kinds of chatbots and question answering systems. I can put links to all those videos in the description.
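The retrieve-then-prompt flow described above can be sketched without any external services. Everything here is a toy stand-in: word-overlap scoring instead of real embeddings, made-up documents, and the final LLM call is left as a printed prompt rather than an actual API request:

```python
def score(query: str, doc: str) -> float:
    """Toy relevance score: word overlap (stand-in for cosine similarity)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # In a real system this is the vector-database query.
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "Refunds are processed within 30 days of purchase.",
    "Support is available by email and chat.",
    "Shipping is free for orders over 50 dollars.",
]
query = "how long do refunds take"

prompt = build_prompt(query, retrieve(query, docs))
print(prompt)  # this string is what you'd send to GPT-3.5 / GPT-4
```

The key design point is visible in `build_prompt`: only the retrieved documents go into the prompt, which is what lets the LLM answer from your data without fine-tuning.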
So this is the latest state of applied NLP: you take large language models, vector databases, and embeddings, fit them together, and build these kinds of products. These are the most in-demand projects we have — simply put, people want ChatGPT for their own data; that's exactly what they want.

But can we fine-tune GPT-3.5 and GPT-4? Currently, no. We could fine-tune GPT-3, but GPT-3.5 and GPT-4 aren't available for fine-tuning yet; maybe they'll release that in a couple of months. Since we can't fine-tune them, and we can't put all of our information inside the prompt, that's exactly why we use semantic search: when a user asks something, we search through thousands of documents for what's relevant to that query, and we put just that document into the prompt. This way we don't need to fine-tune — we can dynamically augment the large language model with the relevant information. This matters because ChatGPT is trained on data that's a year or two old; it doesn't know what's happening recently, and it doesn't know anything about your company or your company's documents. So this is the way to augment it with your own information: semantic search over embeddings, with the results put dynamically inside the prompt. It's very interesting, and there are a lot of problem statements around it nowadays.

Since large language models are getting popular, there should be an easier and better way to build those applications — and that's why there are two popular libraries. One of the most popular is LangChain, a library for building large language model applications.
That's what I use in most cases. It has really good utilities to chunk your PDF files, and it has interfaces to large language models like OpenAI's and to the vector databases — that makes things very neat and clean. Sometimes I prefer the plain OpenAI library, but if a project involves a lot of document processing, interacting with lots of different components, or something like natural-language-to-SQL, that's where I use LangChain. I have a video on LangChain together with all these other components — Pinecone, Chroma DB, and so on — and I think I also have a video on LangChain's SQL agent, on building a natural-language-to-SQL interface. You can check that too.

The other popular library is LlamaIndex. The two actually overlap somewhat when it comes to connecting to different data sources. The whole idea is: how do we take our data and give it to the LLM — how do we augment an LLM with our own data? LlamaIndex has a lot of connectors: Google Docs, Notion, PDF files, all kinds of sources you can plug in to feed data into the large language model. And when it comes to indexing documents — creating the vectors and storing them somewhere — LlamaIndex has much more to offer. There are different ways to create an index, with different structures: a tree index, a keyword table index, a simple list index. There's a lot more you can do with LlamaIndex, and nowadays they also have something called data agents, a way you can interact with your data.
Both these libraries, LangChain and LlamaIndex, have nowadays started to overlap. Usually, if I want some kind of agentic behavior — a SQL agent, some intelligent behavior — I use LangChain. But if I want to experiment with different ways of indexing documents and see how the accuracy of semantic search improves, I'll most likely experiment with LlamaIndex, and with their latest data agents — something I'm just exploring and haven't done much work with yet.

Now, with everything we've talked about, if you want to build an end-to-end application — building these machine learning applications, deploying them on EC2, creating APIs so other people can consume them, fine-tuning models — you should also have some understanding of databases. Where are you going to store all this information? Sometimes the information your large language model or your machine learning model needs lives in a relational database, so you should understand databases like MySQL and PostgreSQL. And once you've made a prediction, you want to store it somewhere; you need some way of recording it. So if you really want to build end-to-end applications, you should definitely be comfortable with MySQL or some other relational database.

Another thing: once you deploy an application, you need to monitor whether it's working well. You can also take feedback from users — just like in ChatGPT, where you can indicate whether you liked a response or not. Anywhere you make a prediction, you can ask for feedback.
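Recording predictions and user feedback can be sketched with SQLite from the standard library; a production system would point essentially the same SQL at MySQL or PostgreSQL. The table and column names here are illustrative:

```python
import sqlite3

# In-memory database for the sketch; use a real DB connection in production.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE predictions (
           id INTEGER PRIMARY KEY,
           input_text TEXT,
           prediction TEXT,
           feedback TEXT
       )"""
)

# Store a prediction at serving time...
cur = conn.execute(
    "INSERT INTO predictions (input_text, prediction) VALUES (?, ?)",
    ("great product", "positive"),
)

# ...then attach the user's thumbs-up/down when it arrives later.
conn.execute(
    "UPDATE predictions SET feedback = ? WHERE id = ?", ("like", cur.lastrowid)
)

row = conn.execute("SELECT prediction, feedback FROM predictions").fetchone()
print(row)  # ('positive', 'like')
```

Rows like these are what you later analyze, or export as labeled examples for fine-tuning the model.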
You can store that feedback in a database and analyze it later, or even use it to fine-tune the model. This whole area — deploying your model, monitoring it, capturing feedback, making sure performance stays good — is called MLOps. There are different tools for it: there's Kubeflow; there's MLflow, which I used to use earlier; and you can use Weights & Biases. There's also a well-known course from Andrew Ng's deeplearning.ai on evaluating LLMs, done in collaboration with Weights & Biases — I haven't watched the full course yet, but you can check it out — and there's a deeplearning.ai course on prompt engineering you can check as well.

Again, 90 to 95 percent of the things I'm telling you about, I use in my day-to-day job, and I have videos for each of them. So I'd suggest you at least go and look at my channel; there are many videos, each one practical and with code you can actually use. If you have any doubts, you can message me on LinkedIn — sometimes I'm not able to answer, but if you message me directly with your question (not just "hi, hello"), I'll most likely answer. And I might have missed something — if you think I have, please comment and let me know, because as I said, I'll be releasing some material alongside this video, a blog post or something, where I'll put all the links. At the very least I can add it there, or I might use it while editing this video.
Okay, by the time you see this the video will already be edited and deployed — sorry about that — but at the least I can update the blog post I'll be writing based on your feedback. And if there's anything you want me to discuss or cover — natural language processing, data science, freelancing, anything — definitely comment, so I can create the next video. I hope you found this video useful; as I said, you'll get all the material from my side. Thank you, bye!
Info
Channel: Pradip Nichite
Views: 7,288
Keywords: NLP Roadmap, NLP, ChatGPT, Sentence Transformers, Semantic Search, LangChain, LlamaIndex, Pinecone
Id: 4uxKEqZV-7A
Length: 24min 6sec (1446 seconds)
Published: Fri Aug 25 2023