Build your high quality LLM apps with Prompt flow

Captions
Prompt flow is your one-stop toolkit designed to smooth out the entire development process of LLM-based AI applications. It comes fully equipped with an SDK, a CLI, and even a Visual Studio Code extension, ensuring a seamless development experience from the spark of an idea through prototyping and testing, all the way to production deployment and monitoring. With prompt flow, superior quality is always at your fingertips. When we talk about high quality, it's not just about accuracy; it's equally important to strike a balance between accuracy and the token cost of the LLM. In this video, we walk you through how to achieve high-quality output by tuning and evaluating prompts in prompt flow.

Let's proceed with flow development in the prompt flow Visual Studio Code extension, which offers a more user-friendly interface for authoring. First, open Visual Studio Code; as you can see, I already have the prompt flow extension installed. I create a new flow from the chat template, open the dag.yaml file, and click the visual editor button above to view it in a more intuitive flattened UI view. Now navigate to the chat node, select the connection you've previously created, select the model you want to consume, then click the Jinja file to customize your prompt for the chat API. In this case I want my chatbot to answer math questions, so I modify the system prompt to define its capabilities.

Now let's put it to the test. Click run, select interactive mode, input a math question such as "What is one plus one?" and press enter. As you can see, it provides the correct answer. Let's test it with another one. Great, it seems my bot excels at solving simple math problems. To truly test its intelligence, let's ask a complex question. I opt for standard mode rather than an interactive run to test a single input session: put the question into the question input box and click run. Oops, I only asked for the answer, but look here, my bot has provided an explanation as well. That's not exactly what I wanted. Because LLM output is inherently random, to get exactly the answer I want I need to add a step that processes the output and extracts only the answer. Therefore I now add a python node, which I name extract_python, opting to generate a new python file. See, a new python node is here. I set the chat output as its input and edit the flow output to be the python output. Then let's click the python entry file, input my data-processing code (a sketch of this node follows at the end of this section), and save. All right, time for a test run. Fantastic, it performed admirably this time, isolating only the correct answer.

All right, now it's time for a batch test on a wider range of questions to verify its true quality. Here I've got a test dataset containing 20 data points stored in a JSONL file. Each data point includes a math question (the input), the correct answer (the ground truth), and a detailed explanation (the raw answer). Before running, I'd like to change some model configurations: I prefer the powerful GPT-4 model, which is super smart, and to keep the output deterministic I set the temperature to zero. By clicking on the batch run button, we can select the test dataset from a local JSONL file and then set up the column mapping. In this case we won't need the chat history column, so leave it empty. Next, map the question input to the corresponding column in the dataset. Once everything is set, hit run and watch the magic happen. While it runs, let's keep an eye on the log, which shows the progress for each line. (If you prefer the SDK over the extension UI, an equivalent batch run sketch is also shown below.)
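As a reference, here is a minimal sketch of what the extract_python node could look like. The function name, the "Answer:" convention, and the fallback behavior are my assumptions; the video only shows that the node strips the explanation and returns the bare answer.

```python
from promptflow import tool


@tool
def extract_answer(chat_output: str) -> str:
    """Keep only the final answer from the chat node's output."""
    # Assumption: the model is prompted to end its reply with a line like
    # "Answer: 42"; fall back to the raw output if no such line is found.
    for line in reversed(chat_output.splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return chat_output.strip()
```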
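The video drives the batch run from the VS Code extension, but the same run can be started from the prompt flow Python SDK. The flow and data paths below are hypothetical placeholders; the column mapping mirrors the UI steps, with chat_history simply left unmapped.

```python
from promptflow import PFClient

pf = PFClient()

# Batch run over the 20-question JSONL test set (paths are hypothetical).
base_run = pf.run(
    flow="./my_chat_flow",
    data="./test_data.jsonl",
    # Map the flow's "question" input to the dataset column.
    column_mapping={"question": "${data.question}"},
)

pf.stream(base_run)  # print per-line progress, like the log in the extension
```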
And here's the best part: once the test is done, just click on the run link to check out the output answers. However, it's very time-consuming to manually compare the generated answers with the ground truth answers, so how can we quickly evaluate the accuracy of our bot? Here's where the evaluation flow comes into play. I've prepared an evaluation flow consisting of python nodes specifically designed to calculate the accuracy of my bot (a sketch of the calculation is shown after this section). To trigger this evaluation flow, just like before, click on the batch run button, but here I can select an existing run; this automatically reads the output JSON file of that specific run. Next, set the input data file that includes the ground truth; in this case I use the same test dataset. Then proceed to the column mapping and set the prediction and ground truth accordingly. All right, it's time to click run and simply wait for the result. It completed in no time. Once it's finished, we can head over to the run history, locate the run, and simply right-click to view the aggregated metrics of these 20 data points. It's really not good: only 35 percent accuracy. It's clear that we need to make significant improvements to ensure better quality if we plan to ship this to other users.

With prompt flow you can easily whip your prompts into shape with multiple variants and test their performance. Let's dive right into refining our prompt to achieve top-notch quality for production. With prompt flow, it's a breeze to create a variant for your prompt: all you need to do is click on the show variants button, then click clone to duplicate the current one. Now you're free to modify this duplicate and give the prompt a bit of a twist; this will be your brand new prompt variant. How simple is that? All right, I've whipped up three variants of my original simple prompt. By clicking on the show variants button in my chat node, you can see all the variants and their IDs. I'm using the chain-of-thought method to feed the LLM more examples, complete with the question, the reasoning, and the final answer. This way the LLM will meticulously process the question step by step with logical thinking, resulting in a more accurate and sensible answer.

But here's where it gets exciting: prompt flow offers a feature that lets you trigger batch testing on your three prompt variants using the same dataset. Just repeat the previous steps to kick off a batch test and choose to run all three variants of the chat node; this triggers three runs, which automatically proceed one after the other. Next, we move on to evaluating these three runs. Open the evaluation flow again, click batch run, select the existing variant_0 run, then make your selection of the dataset and column mapping. Repeat these steps for variant_1 and variant_2. (An SDK sketch of the variant runs and their evaluation follows below.)
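Here is a minimal sketch of the accuracy calculation inside the evaluation flow, assuming one line-level grading node and one aggregation node. The node names and the exact-match rule are assumptions; the video only shows the resulting accuracy metric over the 20 data points.

```python
from promptflow import tool, log_metric


@tool
def grade(prediction: str, ground_truth: str) -> str:
    # Line-level node: exact match between the bot's answer and the ground truth.
    return "Correct" if prediction.strip() == ground_truth.strip() else "Incorrect"


@tool
def aggregate(grades: list) -> float:
    # Aggregation node (marked as such in dag.yaml): receives the grades of
    # all lines and logs the "accuracy" metric shown in the run history.
    accuracy = round(grades.count("Correct") / len(grades), 2)
    log_metric("accuracy", accuracy)
    return accuracy
```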
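For completeness, the variant batch tests and their evaluations can also be triggered from the SDK instead of the extension UI. The flow paths, the variant IDs, and the assumption that the flow output is named "answer" all come from my reading of the video's UI steps, not from documented names.

```python
from promptflow import PFClient

pf = PFClient()

# Batch-test each prompt variant of the chat node with the same dataset.
variant_runs = [
    pf.run(
        flow="./my_chat_flow",
        data="./test_data.jsonl",
        variant=f"${{chat.variant_{i}}}",  # e.g. ${chat.variant_0}
        column_mapping={"question": "${data.question}"},
    )
    for i in range(3)
]

# Evaluate each variant run: prompt flow joins a run's outputs with the
# ground-truth column from the same JSONL dataset.
eval_runs = [
    pf.run(
        flow="./eval_accuracy_flow",
        data="./test_data.jsonl",
        run=variant_run,
        column_mapping={
            "prediction": "${run.outputs.answer}",
            "ground_truth": "${data.answer}",
        },
    )
    for variant_run in variant_runs
]
```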
Then it's just a waiting game until the runs are completed. Okay, we're done. As before, head on over to the run history and click to see the metrics. Wow, check out the improvement: the accuracy has soared to almost 90 percent. That's fantastic. But you may be wondering if there's a way to view all the runs in one place rather than clicking on them one by one. Absolutely: simply multi-check the runs you want to visualize, then hit the visualize and compare button. This generates a local HTML page where you can scrutinize the run results line by line, examine them in detail, and effortlessly compare your different runs (the same view can be opened from the SDK; see the sketch below). Upon comparison, we can observe that variant_1 and variant_2 have similar accuracy; however, variant_1 stands out as the better choice due to its lower token cost. In just a few steps we identified that prompt variant_1 best suits our needs. With prompt flow you can easily test and evaluate different prompt variants and find the optimal point that balances cost and accuracy, enabling you to ship high-quality LLM-native apps to production.
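The run-history and visualize-and-compare steps have SDK equivalents as well; eval_runs below refers to the list of evaluation runs from the earlier sketch (a hypothetical name).

```python
from promptflow import PFClient

pf = PFClient()

# Print the aggregated metrics of each evaluation run, e.g. {'accuracy': 0.9}.
for run in eval_runs:
    print(run.name, pf.get_metrics(run))

# Generate the local HTML page for line-by-line comparison of the runs.
pf.visualize(eval_runs)
```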
Info
Channel: chenlu jiao
Views: 1,783
Id: gcIe6nk2gA4
Length: 9min 19sec (559 seconds)
Published: Fri Sep 15 2023