How to Build and Deploy a Document Anonymizer with Streamlit and SpaCy

Captions
Hello everyone. In this video I'll show you how to build a web application that anonymizes text data with Streamlit. We'll use Streamlit to build the web interface (if you're not familiar with it, have a look at my previous videos), the st-annotated-text component, a great third-party component that plugs directly into Streamlit, and the spaCy library, which gives us named entity recognition models to detect entities. Install these three packages and you're good to go.

I've already started the application. It's empty for now and lives in a folder that contains my dependencies, the spaCy language models, and some test data, nothing really fancy.

In the main script of the app, we start by importing the packages: import streamlit, import spacy, and from annotated_text import annotated_text.

Let's say we want to select a language from a list before processing the data, with two options, English or French: selected_language = st.sidebar.selectbox, with the label "Select a language" and the options "en" and "fr". If we save this, the app hot-reloads and the change shows up directly in the browser.

Next we pick the entities we want to extract: selected_entities = st.sidebar.multiselect, since several entities can be selected at once, with the label "Select the entities you want to detect", the options LOC, PER and ORG (location, person and organization), and the default value set to that same list.

Now I want to load the models before processing any data, so I create a function called load_models. It loads the French model from the fr folder and the English model from the en folder, then returns a dictionary with "en" as the key for the English model and "fr" as the key for the French model. I call this function in the main script to get my models.

The script runs and loads the models, but notice that if we change anything on the interface, the models are reloaded again. That behavior is unwanted: we don't want to reload the models every time we interact with a widget. So I add a cache decorator on top of load_models so the models are only loaded once: st.cache with show_spinner=False, allow_output_mutation=True and suppress_st_warning=True. Now when something changes on the screen the models are served from the cache and the function is not executed again. A sketch of this first part of the app follows below.
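Here is a minimal sketch of that first part (imports, sidebar widgets, cached model loading), assuming the spaCy pipelines are saved in local folders named en and fr next to the script, as in the video's project layout; adapt the paths (or load packaged models such as en_core_web_sm) to your own setup. It uses the legacy st.cache decorator shown in the video; recent Streamlit versions would use st.cache_resource instead.

```python
import streamlit as st
import spacy
from annotated_text import annotated_text  # pip install st-annotated-text; used further down

# Sidebar widgets: language and entity selection
selected_language = st.sidebar.selectbox("Select a language", options=["en", "fr"])
selected_entities = st.sidebar.multiselect(
    "Select the entities you want to detect",
    options=["LOC", "PER", "ORG"],
    default=["LOC", "PER", "ORG"],
)

# Load the spaCy pipelines once and cache them so they are not
# reloaded every time a widget changes
@st.cache(show_spinner=False, allow_output_mutation=True, suppress_st_warning=True)
def load_models():
    french_model = spacy.load("./fr/")   # assumed local model folder
    english_model = spacy.load("./en/")  # assumed local model folder
    return {"en": english_model, "fr": french_model}

models = load_models()
```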
Now let's get the data from the interface. I add text_input = st.text_area with the label "Type a text to anonymize". To also accept text from a file, I add uploaded_file = st.file_uploader with the label "Or upload a file" and the accepted types: docx for Word documents, pdf, and the classic txt.

To extract the data from the uploaded file: if uploaded_file is not None, I set text_input = uploaded_file.getvalue(). That gives a bytes object, so I convert it into a string by decoding it as UTF-8. To display the text I typed or uploaded, I simply call st.markdown with a formatted string containing text_input. If I type something it gets reflected in that area, and if I upload a document I get the text inside it. Looks like it's working.

Next I select a model based on the chosen language: selected_model = models[selected_language], using the selected language as the dictionary key.

Now I want to process the text so I can extract the entities. For this I create a small function called process_text. The doc it works on is obtained by passing text_input through the selected model, and the function has to return a list of tokens in the format expected by annotated_text: a plain string when a token is not part of an entity, and a tuple of three elements, the text, the entity label and a color, when an entity is detected.

We build this list by iterating over all the tokens of the doc: for token in doc. If the token's entity type is PERSON, I append a tuple containing the token text, the label "Person" and a color of my choice. I do the same for the other entities: elif the entity type is GPE or LOC, the label is "Location" and I pick another color, say "#fda"; elif the entity type is ORG, the label is "Organization" and the color is "#afa". Finally, if no entity is detected, I append the plain token text with a leading and a trailing space, just to keep tokens from being glued together. At the end I simply return the tokens.

With the document processed, I get the tokens from process_text and visualize them with annotated_text. My test text is "My name is Sarah, I live in Chicago and I work at Microsoft", and it looks like it's working. Sketches of the input handling and of this first version of process_text follow below.
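Continuing the sketch, here is roughly what the text input, the file upload and the model selection look like. The widget labels come from the walkthrough above; note that only plain-text decoding is handled, as in the demo, so a .docx or .pdf upload would need its own parser.

```python
# Free-text input and optional file upload
text_input = st.text_area("Type a text to anonymize")

uploaded_file = st.file_uploader("Or upload a file", type=["docx", "pdf", "txt"])
if uploaded_file is not None:
    # getvalue() returns bytes, so decode it into a string
    text_input = uploaded_file.getvalue().decode("utf-8")

st.markdown(f"Input text: {text_input}")

# Pick the spaCy pipeline matching the language chosen in the sidebar
selected_model = models[selected_language]
```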
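And here is a sketch of the first version of process_text. The first hex color is a placeholder (the video picks it from a swatch without naming it), and the label check accepts both PERSON (used by the English pipeline) and PER (used by the French one), which is an assumption about the loaded models.

```python
def process_text(doc):
    """Turn a spaCy doc into the token list expected by annotated_text."""
    tokens = []
    for token in doc:
        if token.ent_type_ in ("PERSON", "PER"):
            tokens.append((token.text, "Person", "#faa"))  # placeholder color
        elif token.ent_type_ in ("GPE", "LOC"):
            tokens.append((token.text, "Location", "#fda"))
        elif token.ent_type_ == "ORG":
            tokens.append((token.text, "Organization", "#afa"))
        else:
            # Plain text, padded with spaces so tokens don't run together
            tokens.append(" " + token.text + " ")
    return tokens

doc = selected_model(text_input)
tokens = process_text(doc)
annotated_text(*tokens)
```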
Now I want to add a feature that anonymizes each detected entity. I add an argument to process_text called anonymize, set to False by default. If anonymize is true, I create an anonymized_tokens list and go over the previously built tokens: if a token is a tuple, I append a new tuple that keeps the original entity label as its second element and the color as its third, but replaces the text with a string of X's of the same length as the original token; otherwise I keep the token unchanged. In that case I return anonymized_tokens instead of tokens.

To drive this from the interface I create a small checkbox widget: anonymize = st.checkbox("Anonymize"). It is False by default and becomes True when checked, and I pass it into process_text as anonymize=anonymize. Now when we tick the box, the entities are anonymized.

One last thing: to honour the entities selected in the sidebar, I make a small modification to process_text and pass selected_entities to it as a second argument. In the first condition I additionally require that PER is in selected_entities, and I add the same kind of check to the two other conditions, with LOC for locations and ORG for organizations. (The error we briefly saw was simply because the new argument wasn't being passed yet.) Now if I remove Person or Location from the selection, those entities are no longer highlighted. It works; the final version of process_text, with both options, is sketched further below.

So, as you can see, it's quite easy to build a text anonymizer in a few steps in Python using Streamlit, spaCy and the st-annotated-text package.

If you want to deploy this application to Heroku, you need three additional files. Put the dependencies in a requirements.txt; create a Procfile, which is quite generic and can be copy-pasted, and which basically tells Heroku what to run, namely setup.sh followed by app.py; and add a setup.sh that writes some basic Streamlit configuration for the app, which you can also copy-paste and reuse directly. Sketches of these two files are included below.

Before deploying to Heroku, the code has to live in a GitHub repository. I've already done this; mine is called anonymizer. Then go to Heroku and create a new app: give it a name, select a region and create it. Link your project repository, the anonymizer repo gets detected, and connect it to Heroku. Finally, enable the build: I'm not using automatic builds here, I just deploy the main branch of the project manually. Once you hit deploy (I won't do it here because it's already deployed), you're given a URL at the end that gives you access to the application, and if you open the deployed app, everything we built is there.
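The video copies the Procfile and setup.sh from the screen rather than dictating their contents, so the two files below are a commonly used pair for deploying Streamlit apps to Heroku, offered as an assumption rather than as the exact files from the author's repository.

Procfile (assumed contents, running the setup script and then launching the app):

```
web: sh setup.sh && streamlit run app.py
```

setup.sh (assumed contents, writing a minimal Streamlit config so the app runs headless on the port Heroku assigns):

```bash
# Write a minimal Streamlit server config for Heroku
mkdir -p ~/.streamlit/
echo "\
[server]
headless = true
port = $PORT
enableCORS = false
" > ~/.streamlit/config.toml
```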
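For reference, here is a sketch of the final process_text with both the selected_entities filter and the anonymize option folded in, together with the checkbox and the rendering call. The same caveats as before apply to the colors and the entity labels.

```python
def process_text(doc, selected_entities, anonymize=False):
    """Build the annotated_text token list, honouring the sidebar selection
    and optionally masking every detected entity with X's."""
    tokens = []
    for token in doc:
        if token.ent_type_ in ("PERSON", "PER") and "PER" in selected_entities:
            tokens.append((token.text, "Person", "#faa"))
        elif token.ent_type_ in ("GPE", "LOC") and "LOC" in selected_entities:
            tokens.append((token.text, "Location", "#fda"))
        elif token.ent_type_ == "ORG" and "ORG" in selected_entities:
            tokens.append((token.text, "Organization", "#afa"))
        else:
            tokens.append(" " + token.text + " ")

    if anonymize:
        anonymized_tokens = []
        for token in tokens:
            if isinstance(token, tuple):
                # Replace the entity text with X's of the same length,
                # keeping the label and the color
                anonymized_tokens.append(("X" * len(token[0]), token[1], token[2]))
            else:
                anonymized_tokens.append(token)
        return anonymized_tokens

    return tokens


anonymize = st.checkbox("Anonymize")  # False by default
doc = selected_model(text_input)
tokens = process_text(doc, selected_entities, anonymize=anonymize)
annotated_text(*tokens)
```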
So that's it; as you can see, it's quite easy to build an application like this, and there are several ways you could improve it. You can use other language models: in this demo I grabbed the small English and French models from spaCy, but you can use the large ones, or more sophisticated pipelines based on transformers and deep neural networks. You can add other functionality to the app, for example processing a batch of documents or adding a download button. You can also handle other types of data: in this demo I only processed .txt files, but you can do the same for Word documents and PDFs, and for PDFs you may want to run OCR first. Finally, if you need more power behind the deployed app, you can move it to AWS, for example, in which case you'll have to deploy it yourself with Docker or similar tooling.

I hope this app gives you an overview of what you can do with Streamlit and these libraries, and gets you started prototyping. Thank you for watching; if you like the content of the channel, don't hesitate to like and share this video and subscribe. See you next time for more videos.
Info
Channel: Ahmed Besbes
Views: 604
Rating: 5 out of 5
Keywords: streamlit, build web app with python, build data apps, build data app with python, model deployment machine learning, simple web app using python, streamlit python, build a data science web app with streamlit and python, how to use streamlit, machine learning streamlit, machine learning web app, streamlit dashboard, streamlit machine learning, text anonymizer with Streamlit, Streamlit and Spacy, streamlit components, st-annotated-text, named entity recognition spacy, spacy
Id: lCbG05lyDNk
Length: 20min 44sec (1244 seconds)
Published: Tue May 04 2021