OpenAI - Build an AI Solution on Azure Using Cognitive Search

Video Statistics and Information

Captions
As you may have seen in my last video, I've been spending a lot of time researching artificial intelligence on the Microsoft platform, and I want to share some of what I've learned, along with some use cases and a practical solution you can use to understand how Azure applies AI. A few months ago Microsoft and OpenAI extended their partnership. The big takeaways are that we're going to get new AI-powered experiences in Microsoft products, and that Microsoft is the exclusive cloud provider for OpenAI services. GPT-4 is included in those services, and we want to leverage it, so the question we're covering today is: how do I integrate ChatGPT with my business data and my applications?

We'll start with a diagram I created to illustrate the idea. There are a lot of ways you could go about this, but this particular use case is a knowledge mining solution. Why does knowledge mining matter to cyber security? One of the biggest risks in a business is not understanding how things work: how the business operates, how systems are set up. Businesses usually have a massive amount of "unstructured" data: Word documents, PowerPoint presentations, Visios, and tons and tons of Excel spreadsheets, all sitting on a file server somewhere. The business has used that collective knowledge to get where it is, but you can't make business decisions off of it today because the data just sits there. We want to use artificial intelligence to extract things we can learn from that data, and we want to leverage it for operations going forward. If you created a process document and saved it on a file share that nobody ever accesses, that doesn't help anybody; but if it's searchable, retrievable, and enriched so it can provide meaningful context, it might. If you do infrastructure work and you need to understand how a system was built because there was a cyber breach, you want to be able to find the documentation for that system, who built it, and how it works, even if it's just a Word document or something scribbled in a OneNote, so that you can understand the nature of the breach.

Step one on the diagram: most of your stuff is on-prem. You might have files and images on a file share; it could be an appliance or a Windows file server. Somehow we need to get that data up to Azure, whether we pull it into Azure Files or blob storage, and maybe we use Data Factory to pipe data in consistently. The data might be entered on-premises and live on-prem day to day, but you need to pump it up to Azure consistently so you can leverage it in your application; that's what the Data Factory is for. As a general workflow improvement, you might also get your users off of file shares for team storage: have them share documents through Microsoft Teams and SharePoint storage, which can sync to the desktop so it looks just like a regular file share, and use OneDrive for personal storage, so the data is already natively in the cloud.
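Whichever path the data takes, the landing zone for this solution is blob storage. As a minimal sketch of that ingestion step (the connection string, container name, and local folder below are placeholders, not values from my environment), the azure-storage-blob Python SDK makes the upload side simple:

```python
# Sketch: copy local files into an Azure Blob container so Cognitive Search can index them.
# Assumes the azure-storage-blob package; connection string, container, and folder are placeholders.
from pathlib import Path
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<storage-account-connection-string>"   # placeholder
CONTAINER_NAME = "unstructured-data"                        # placeholder container name
LOCAL_FOLDER = Path("./file-share-export")                  # placeholder local path

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER_NAME)

for path in LOCAL_FOLDER.rglob("*"):
    if path.is_file():
        # Preserve the relative folder structure as the blob name.
        blob_name = str(path.relative_to(LOCAL_FOLDER)).replace("\\", "/")
        with path.open("rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)
        print(f"uploaded {blob_name}")
```

In practice Data Factory or Teams/SharePoint/OneDrive sync handles this movement for you; the sketch just shows the shape of the step.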
Once the data is in the cloud in some shape or form, you can build a Cognitive Search service and point it at that data, whether it's in blobs, SharePoint, SQL, or Data Lake Storage, create an index, and use artificial intelligence to enrich it. Cognitive Search has a lot of built-in skills and uses Cognitive Services to do the enrichment: text analytics out of the box, translation, computer vision, OCR. If you've got PDFs, pictures, or forms, the OCR is really powerful at extracting the text and writing it into the indexed object so it's actually searchable. On top of that you can build custom skills, and we'll get into this in more depth later, but in my opinion the out-of-the-box skills alone don't cut it, so you'll want to implement some custom skills using Azure Machine Learning or Azure Functions (there's a small sketch of the custom-skill contract just below). The idea behind a custom skill is that there's some type of data specific to your business, whether it's custom entity recognition or a comparison against something only your company has access to; that's where a custom skill comes into play. The AI does all of this work in the back end: it creates knowledge store projections, puts them on blob storage, enriches each document, extracts the text from images, whatever it can come up with, and builds a giant search index. Once you have the search index you can query it directly, and you'll typically put a web application in front of it. That web application and the code behind it are critical, because it's the pivot point between your company's data and the Azure OpenAI service. Understanding its development cycle, what code goes into it, doing reviews, using the Microsoft Authentication Library so you can enforce conditional access, and putting Application Gateways with WAFs and DDoS protection in front of any public access is going to be critical. Infrastructure aside, that web application is what your users interact with day to day, and it doesn't have to be a full-scale web app, because you can also use Microsoft Power Apps to interact with these services; I'm going to show an example using both methods. Once you've got this big knowledge store and your web application, you can query the knowledge store for information, take that data, and prompt OpenAI with it. You're giving it your company's knowledge and using the ChatGPT service on top of it; you get a response back from ChatGPT, and your web application runs that little cycle with the user. That's the general workflow. In the use case I built, it's a search engine over a big file repository on blob storage, so let's switch over to the infrastructure view.
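Before moving on, here's a minimal sketch of the custom-skill idea mentioned above. A Cognitive Search custom skill is just a web API (an Azure Function works well) that receives a batch of records and returns enriched fields for each one. The `values` / `recordId` / `data` request and response envelope is the documented custom-skill contract; the acronym lookup itself is a made-up example enrichment, not the skill used in this solution.

```python
# Sketch of an HTTP-triggered Azure Function acting as a Cognitive Search custom skill.
# The {"values": [{"recordId", "data"}]} envelope is the documented custom-skill contract;
# the acronym lookup is a hypothetical enrichment used only for illustration.
import json
import azure.functions as func

ACRONYMS = {"WAF": "Web Application Firewall", "MFA": "Multi-Factor Authentication"}  # hypothetical list

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text", "") or ""
        found = sorted({a for a in ACRONYMS if a in text})
        results.append({
            "recordId": record["recordId"],
            "data": {"acronyms": [f"{a}: {ACRONYMS[a]}" for a in found]},
            "errors": None,
            "warnings": None,
        })
    return func.HttpResponse(json.dumps({"values": results}), mimetype="application/json")
```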
Switching over to the infrastructure view, I've created a resource group to house all of the resources for the solution: an action group, an App Service, an App Service plan, Application Insights, the Azure OpenAI service, Cognitive Services on the back end, a Function App performing my custom skill, and the search service itself, which builds the index. All of that data comes from two storage accounts: the primary one is Data Lake Storage on the Gen2 platform, and the knowledge base storage is on the previous-generation platform, which is really just there for caching and search performance. In the search service, the first thing you need to do is build an index. When you go in you can import data and select from various sources: SharePoint, file storage, blob storage, Data Lake Storage Gen2, and a bunch of other options. Essentially you're telling Cognitive Search, "go get this data and perform these AI functions on it." You can see I've got the KB Data Lake Storage in there, and that's what the index is built from. Looking at the indexes, I've got a KB ADLS Gen2 index with about 60,000 documents, a little bigger than I thought. Clicking into it shows the fields the index is populated with, and this configuration matters for the API calls you'll make later; for instance, a field has to be marked facetable if you're going to use it in a facet query, but we'll get into that a little later. Once you've got your index, you create an indexer and point it at your files so it can actually perform the operations. Mine runs every hour, so if the data changes it just keeps re-indexing. Then you apply AI skillsets to that data. In here is the JSON representation of the skillsets being applied, including V3 entity recognition among a bunch of others, and you can create custom skillsets from templates: custom entity lookups, custom web API skills, custom API skills backed by Azure Functions. This is the layer where you use artificial intelligence to enrich your data. Once all of that is in place you have an index and you'll want to explore it: go up to Search Explorer, type anything in, and add top=1 to get a single result, and the JSON that comes back is exactly what your application will receive from Cognitive Search. In this example the result is a PDF named "firewall change core 80 unify and vse3 rules signed," and you can see the extracted key phrases, the detected language, a few important pieces of text, image tags, captions, and personally identifiable entities. You'll see a lot of repeated phrases because it goes deep into all of these objects, analyzes them, and produces text that tells you what's in each one.
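What Search Explorer does in the portal you can also do from code. Here's a minimal sketch with the azure-search-documents Python SDK; the endpoint, key, index name, and enriched field names are placeholders based on the fields discussed above, not confirmed values.

```python
# Sketch: query the Cognitive Search index the way Search Explorer does (top=1),
# then print a few of the enriched fields. Endpoint, key, index, and field names are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",  # placeholder
    index_name="kb-adlsgen2-index",                           # placeholder index name
    credential=AzureKeyCredential("<query-api-key>"),         # placeholder key
)

results = client.search(search_text="firewall", top=1)
for doc in results:
    print(doc.get("metadata_storage_name"))
    print(doc.get("keyPhrases"))   # assumed name of the enriched key-phrases field
    print(doc.get("language"))     # assumed name of the detected-language field
```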
All of this is fine and dandy if you've got a defined business process that lands data somewhere you can pull into blob storage; it just gets indexed and you're good. But what happens when you want to create new knowledge? Your web front end needs to be able to inject information into the index as well, and for that I created a Power App. In this knowledge base Power App you can submit a link, an image, a file, or a video, and it uses artificial intelligence to enrich those objects before sending them into the knowledge store; once they're there, they get indexed and become searchable too. If we go to "browse links," I'll show you what I mean. For this particular function I simply give it a link, it goes out to the internet, and its job is to give me the key phrases, main points, a brief summary, the topics covered, and who the audience for that link is. Honestly this is me trying to be lazy: if you're a cyber security engineer, over the course of a day you'll probably open 150 browser tabs, and not all of them are worth reading. I throw links into this and ask, "give me a summary, is this worth my time to read in full?" and it has actually saved me a lot of time. I also capture any final notes, who created the entry, and the file object itself; in this routine it goes out to the internet, grabs the URL, saves it to a file, and puts it on blob storage so I have it to reference. If we switch back to the main menu and go to "submit link," the only input field is the link itself, so where is the logic? Switching over to the flows inside Power Apps: one limitation of the OpenAI actions is that they require approval after every single step, so I've broken this out into three separate flows. The first one is "get summary." Its trigger is when something gets added to the table: there's a Dataverse table associated with the form, and submitting the link creates a new row with mostly blank values. The flow creates variables for the link ID and the URL, then prompts GPT; this one is pretty simple, just "provide a brief summary, in three paragraphs or less, of the link provided," passing the URL so it can go look at it. That runs through an approval process, we set variables based on how the approval comes out, and if it's approved we update the row for that link ID and write the summary into the summary field. The metadata flow is a little more complicated. It instantiates a bunch of variables at the beginning and checks the "enriched" field, a flag I created to tell me whether the enrichment process has already completed; as long as it isn't "yes," the flow goes through and prompts GPT for the metadata. This is where a concept called prompt engineering becomes very important, whether you're using the full Azure OpenAI service or Power Apps: the text and instructions you give ChatGPT have to be explicit enough that you get a predictable outcome. In this case I ask for the data in JSON format so that, after the approval step, the flow can parse that JSON, create variables from it, and update the table with that information.
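To show that prompt-engineering idea outside of Power Automate, here is a minimal sketch of the same pattern in Python against an Azure OpenAI chat deployment; the deployment name, endpoint, field list, and prompt wording are my own placeholders, not the ones used in the flow.

```python
# Sketch: ask a chat model for link metadata as JSON, then parse it into fields.
# Deployment name, endpoint, field list, and prompt wording are placeholders/assumptions.
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<openai-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                          # placeholder
    api_version="2024-02-01",
)

url = "https://example.com/some-article"  # placeholder link
prompt = (
    "Return ONLY a JSON object with the keys key_phrases (list of strings), "
    "main_points (list of strings), topics (list of strings), and audience (string) "
    f"describing the content at this link: {url}"
)

response = client.chat.completions.create(
    model="<chat-deployment-name>",  # placeholder deployment
    messages=[{"role": "user", "content": prompt}],
)

metadata = json.loads(response.choices[0].message.content)
print(metadata["audience"], metadata["topics"])
```

The explicit "return ONLY a JSON object with these keys" instruction is what makes the parsing step predictable enough to automate behind an approval.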
The approval step shows you what the AI generated; you can look at it and approve or decline it, and the flow then updates the table, which is also browsable inside the app. That's the Power Apps side of it: Power Apps as the web front end and Dataverse as the database back end, though you can pipe this into blob storage and do other things too, which is what I've done here by layering on top of it. The Power Apps version is handy for me personally; if I'm out during the day and catch something on my phone that I want to save for later, I can just toss the link into that app. For the actual Azure infrastructure solution, though, we're going to use an App Service sitting on an App Service plan. The web app, Imperion KB, has its own Application Insights, and there's another one for our custom skill function down below. So the front end is that web app; the back end is the Azure OpenAI service, Azure Cognitive Services, Azure Cognitive Search, and those two storage accounts as the data layer. Looking inside the storage account containers, we've got the unstructured data container, where I just dropped copies of my OneDrive folders, and we've also got the cognitive search knowledge store: these are the extractions, enrichments, and projections the artificial intelligence is creating for us. If it's an image, it needs somewhere to store the text extracted from that image, and that's what lands in the knowledge store; there's also a container for image projections. There's a whole refinement process for correlating the projections back to the actual file that I'm not going to go into in this video, but it's there and it's searchable. Then we have the web app front end. I'm not really a C# guy, so I'm going to go over a couple of examples and include the links in the description. Microsoft has done a bunch of great work creating working examples, and I can reverse engineer almost anything, so even though I don't know C#, I messed around with it enough to make this web app work. Switching over to the .NET samples: there are a bunch of great C# examples for Azure Cognitive Search. The one I'm going to show is the knowledge mining solution accelerator, but I'm also incorporating the Azure search power skills, and if you want to query multiple indexes there are examples of that in here as well, along with other tutorials.
There are tutorials for doing AI enrichment and for creating a model-view-controller style app, so there's a bunch of different .NET material in there, and if you just want a modular search component to add to your website there's an example for that too, plus quick starts. There are also a bunch of custom skills, and I'll go into more detail in a few minutes about the ones I'm using, but the Azure search power skills repository has a lot of great examples of things you can pipe into Functions or Docker, however you want to deploy them, and use to enrich your Cognitive Search on top of the basics. Going specifically to the knowledge mining solution accelerator, there's a general workflow included, and it produces a searchable web front end like the one I'll show. You need Visual Studio, you have to understand API calls, you'll mess around with Cognitive Search and deploy your resources, and then we wire the C# example they've given us up to that App Service web app. We also define custom skills, the things beyond the baseline Azure functionality that we want in our search results, and layer those on top. Then we can do visualization and reporting on top of all of it: we can go off of Application Insights, use Power BI against the data store, and have a really powerful visualization layer where we can take a lot of dormant data and hopefully extract meaningful information from it to improve business processes. So what does that look like? I'm opening a private browser window because I want to show you the authentication. One of the improvements I made to their working example was incorporating the Microsoft Authentication Library, because I'm a cyber security engineer who focuses mostly on identity, and my recommendation is that any application you put on the public internet should be behind authentication if there's any type of sensitive data on it. The knowledge base is sensitive information, and I don't want just anybody on the internet able to see this. Once you get prompted, go ahead and sign in; if you're authorized on the Enterprise Application you'll be able to log into the application, and two-factor authentication and conditional access are also great things to enforce through the Microsoft Authentication Library.
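The web app itself handles sign-in through the ASP.NET MSAL middleware, but as a language-neutral illustration of the same authorization-code flow, here is a minimal sketch with the MSAL Python library; the client ID, tenant, secret, scopes, and redirect URI are all placeholders.

```python
# Sketch of the OAuth 2.0 authorization-code flow with MSAL, the same flow the web front end
# uses to force Azure AD sign-in (which is where MFA and conditional access get enforced).
# Client ID, tenant ID, secret, scopes, and redirect URI are placeholders.
import msal

app = msal.ConfidentialClientApplication(
    client_id="<app-registration-client-id>",
    authority="https://login.microsoftonline.com/<tenant-id>",
    client_credential="<client-secret>",
)

# Step 1: send the user to Azure AD to sign in.
auth_url = app.get_authorization_request_url(
    scopes=["User.Read"],
    redirect_uri="https://localhost/auth/callback",
)
print("Redirect the browser to:", auth_url)

# Step 2: Azure AD redirects back with ?code=...; exchange it for tokens.
def handle_callback(code: str) -> dict:
    result = app.acquire_token_by_authorization_code(
        code,
        scopes=["User.Read"],
        redirect_uri="https://localhost/auth/callback",
    )
    return result  # contains token/claims if sign-in succeeded
```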
And here's the web front end. It's pretty cookie-cutter out of the box; like I said, I didn't do too much tinkering with it since I'm not a C# guy, but there's a basic search component, a place to upload files directly to the blob storage, and an interactive search experience. If I go into the search box and type "backups," there's a whole bunch of content I have on that topic; it could be buried in a file somewhere that I wouldn't really know how to get to, but the search pulled it up, ranked it, and put it top of mind. One thing I think is pretty cool is that it even extracted the text from my memes and put that in the search results, so I get a few memes in here too. If I click into a document, it pulls up the key phrases extracted from it, a transcript if there is one, and all of the metadata, including the exact path if I want to go directly to the file. I can also search for "firewall," and I get another meme (the "it's not the firewall" one I used to send people whenever they insisted the firewall was messing up and it in fact was not), plus diagrams and educational material I've collected over the years. For a nine-page document it was able to go through, extract a lot of information out of the text, and give me relevant search results. Over on the left side you'll also see the facets. I mentioned facets in your search index earlier: if there's something specific you want to pivot on and use as a facet, make sure you include it in the index appropriately when you create it. That's what the web interface looks like at its core, and you can modify it as you need to, maybe put a bunch of different menus on it. The key thing is that if you're not getting back the data you expect, your search index is probably a bit dirty; maybe you're only using the default AI skills and some garbage is coming back, so you want to clean that data up and make sure the index is clean so that you can query clean data and get relevant results. You can also customize the search index to put specific weights on certain types of data so they show up higher in the search results; that's totally an option as well.
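As a small sketch of how those facets surface in a query (again using the azure-search-documents SDK; the facet field names here are assumptions about this index, not confirmed ones):

```python
# Sketch: run a faceted search so the UI can show counts to pivot on, like the left-hand
# facet panel in the web front end. The fields used must be marked facetable in the index.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",  # placeholder
    index_name="kb-adlsgen2-index",                           # placeholder
    credential=AzureKeyCredential("<query-api-key>"),
)

results = client.search(
    search_text="firewall",
    facets=["organizations,count:10", "keyPhrases,count:10"],  # assumed facetable fields
    top=10,
)

for name, buckets in (results.get_facets() or {}).items():
    print(name, [(b["value"], b["count"]) for b in buckets])
```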
Now, going over the custom skills pipeline and what that might look like: one of the things you need to do with your index is define the schema and identify what relevant information you're trying to extract about different types of files or data. For the knowledge base example I'm basing this on those Azure search power skills we looked at earlier. The schema I'm pulling back includes people, organizations, locations, and key phrases, which come by default with Cognitive Search. On top of that, with the power skills, I also want acronym enrichment, so if it recognizes specific acronyms it gives me meaningful information instead of just an acronym; descriptions, highlights, and summaries; who the audience is; tags for search; topics; any addresses found in the data; geo-coordinates; and then a couple of extras, the Bing structured data and a GPT miscellaneous field. You'll see it in the prompt structure in a bit, but if there's any extra information GPT can extract it goes into that miscellaneous column, and if Bing can return results on a known entity, the Bing entity skill pulls that structured data back into your Cognitive Search as well. There's an order of operations here. The default enrichment is what Cognitive Search does out of the box; it doesn't handle video very well from what I've seen, so there's the Azure Video Indexer service if you have video in your pipeline. Then we do a custom entity lookup, because one of the first things we want is to find all of the organizations and brand names associated with our business and extract those from documents. Then the acronym linker, which enriches definitions from known acronyms and populates the acronym field in the index. The text quality watchdog goes back and cleans up garbage entries inside the index. Then we enrich addresses if any are found in the fields, and geopoints, so if there are named entities or addresses it can pull up the associated geo-coordinates. The Bing entity search, as mentioned, enriches the index with structured data from the public internet, and then we deduplicate terms. At the end of the enrichment pipeline we go out to ChatGPT and ask it for meaningful information about the object: have it check whether what Cognitive Search produced is accurate compared to the data, and then ask it for the main points, a description, a highlight, and a summary. The description, highlight, and summary are basically the same thing in different formats; one of the things you can prompt GPT to do is write something in a given length, so the summary is "what is the summary of this thing in five paragraphs or less," the highlight is one paragraph, and the description is the one-line elevator pitch: one sentence on what this file is about. We also want to extract the audience. If you've got accountants and you've got lawyers, the audience is different; the lawyers aren't going to care about a tutorial aimed at the accountants popping up in their results, and since that data often isn't protected and just sits open-access on a shared platform, audience targeting matters. So we extract the audience, generate tags to help search relevance, and enrich the topics as well. How do you actually do the ChatGPT part? There are a bunch of tutorials on the internet for OpenAI: the OpenAI samples, with Python and C# material you can check out, and the OpenAI Cookbook, which gives you a lot of what you need to understand how your web application interacts with the GPT service. I'm basing the logic for these custom skills on the tutorials in those links, and I'll put them in the description as well. The first thing you've got to understand with GPT is what context you're sending it. GPT-4 is going to be a bit more advanced, but GPT-3 was only trained on data up to 2021; it doesn't have access to other information, it doesn't have access to your corporate data, it's simply an agnostic model, and if you want it to know about your data you need to give it your data. That's what the Cognitive Search is for: you produce a search result, pass it in, and it produces relevant answers based on that data. For the enrichment pipeline specifically, you give it the file, the link, whatever it needs for context, along with all of the default enrichments we went through and any knowledge store projections related to that file. Most importantly, you frame how GPT responds at the very beginning. In this example the framing says: you are an SME reviewing this file for relevant information; you will be enriching this file with that information to aid end users searching a knowledge base using Azure Cognitive Search; this is strictly for business purposes, and corporate security policies must be followed during the enrichment process. At the start you're setting GPT up, telling it what its guardrails are, giving it the file and the relevant information, so it has its scope. Then the prompts: check the information and make sure everything is accurate; if the data is not accurate, update it to better reflect the data; what are the main points of the file (add that to the main points field); what audience would find this file relevant (audience field); what tags or keywords would help an end user find this document when searching (tags field); what topics is this file about (topics field); can you provide a summary of the file, in five paragraphs or less, explaining the most important information (summary field); can you produce the same summary in one paragraph or less (highlight field); how would you sum it up in a one-sentence elevator pitch (description field); and are there any other important details you would include in a search index, which is where that miscellaneous field comes in.
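Putting the framing and the prompt list together, a minimal sketch of that enrichment call against an Azure OpenAI chat deployment might look like the following; the deployment name, endpoint, and field names are placeholders, and the document text would come from the Cognitive Search enrichments and knowledge store projections described above.

```python
# Sketch: frame the model with a system message ("you are an SME reviewing this file..."),
# hand it the document context, and ask for the enrichment fields back as JSON.
# Deployment name, endpoint, and field names are placeholders.
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<openai-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",
    api_version="2024-02-01",
)

SYSTEM_FRAMING = (
    "You are an SME reviewing this file for relevant information. You will be enriching "
    "this file to aid end users searching a knowledge base using Azure Cognitive Search. "
    "This is strictly for business purposes and corporate security policies must be followed."
)

def enrich(document_text: str, default_enrichments: dict) -> dict:
    user_prompt = (
        "Using the file content and the existing enrichments below, return ONLY a JSON object "
        "with the keys: main_points, audience, tags, topics, summary, highlight, description, "
        "miscellaneous.\n\n"
        f"Existing enrichments: {json.dumps(default_enrichments)}\n\n"
        f"File content:\n{document_text}"
    )
    response = client.chat.completions.create(
        model="<chat-deployment-name>",  # placeholder deployment
        messages=[
            {"role": "system", "content": SYSTEM_FRAMING},
            {"role": "user", "content": user_prompt},
        ],
    )
    return json.loads(response.choices[0].message.content)
```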
Now you've got your enriched data, and you can go into your search index or your application and interact with it. You could also take that web front end, add a ChatGPT-style chat bot experience, maybe a little widget in the bottom left, and ask questions about the data returned by the Cognitive Search so it knows what's going on with it. It can even produce Cognitive Search queries for you: if you're looking for specific results, you could prompt GPT first for the relevant query, run it against the search service, return the data to your web app, and work with it from there. There's a lot of cool stuff here, and security is really important: identity protection for any of your apps that touch sensitive data, and being very aware of which data sources you're pulling into Cognitive Search and who has access to which indexes. Maybe you create separate indexes and keep delegation permissions the way they need to be, one over here for the accounting team and one over there for public records, and make sure each app can only touch the data it needs to touch. You also have to make sure the apps are querying ChatGPT in a way that's acceptable to the organization; you can ask ChatGPT pretty much anything, and the prompts are engineered in the software, in the web app or in the Power Apps, so those are all very key things. There are a lot of interesting developments coming down the pipe as everybody else starts to integrate this with their data, and I'm excited and looking forward to it.
If all of this is a lot to digest, know that there's one other option for integrating your enterprise data into GPT's model, but it's prohibitively expensive; only do this if you have the money, because I definitely cannot afford to run it in the lab. You can actually train a custom GPT model on your data. If we go over to our Azure OpenAI instance, there's a separate portal for the OpenAI resources. Out of the box you can click Deployments and deploy any of the cookie-cutter models, but the cookie-cutter models don't know your data, which is why we had Cognitive Search pulling the relevant data and handing it to GPT. If you want the model to be aware of your data all the time, for high-performance applications or a specific use case where the costs are justified, you can interact with OpenAI directly. In that case you go down to Models and create a custom model. Apparently the demo doesn't work right now; there are capacity constraints and the OpenAI features are still being gated, so I'll just walk you through it. You create the custom model, give it a base model (for GPT-style fine-tuning that's the DaVinci family), and point it at training data, which can live in an Azure blob; that was going to be the example I walked through. Then you have to pay for it: it creates a new model that has to be trained on your data, it has to make inferences, which costs tokens, and you have to host the service. Some of the calculations I've seen suggest a custom-tuned model is around six times more expensive, so you need to know where to draw the line between curating the data you feed the model and making it permanently aware of specific data. Looking at the pricing calculator for a fine-tuned DaVinci model: imagine we keep adding data to blob storage, so we retrain roughly every month, say 150 hours of training a month, host it for 730 hours (the whole month), and allow 1,000 units of 1,000 tokens, about 1 million tokens of inference. That comes out to roughly $15,000 a month, which is pretty expensive. If your business application requires it, it's definitely an option, but you're probably better off using Cognitive Search to curate your data. And that is how you build an OpenAI solution on the Azure platform. I hope this has been informative; I had to do a lot of digging, research, and troubleshooting to get this concept out there, and I figured that because I struggled so much, all of you could probably benefit from an actual example walking front to back through exactly where all these pieces go, how they interact with each other, what links to what, and how much it costs. I hope you enjoyed this, and I look forward to talking to you in the next video.
Info
Channel: Imperion
Views: 4,990
Keywords: cybersecurity, microsoft, microsoft365, security, identity, azure, openai, chatgpt, artificialintelligence
Id: ekN0C27WZIE
Length: 36min 12sec (2172 seconds)
Published: Mon Jun 12 2023