Word File Processing in Python

Video Statistics and Information

Captions
What is going on, guys? Welcome back. In today's video we're going to learn how to process, edit and create Word files in Python, so let's get right into it.

All right, so we're going to start with the creation process. We're going to learn how to create a new, empty Word file in Python, how to add some text, some headings, some lists, some tables and so on. Then we're going to learn how we can extract information from an existing Word file on our computer.

For this video we're going to use an external Python library called python-docx. In order to install it we need to open up a command line and type pip install python-docx. In my case this is already installed. Once you have that, you can say from docx import Document, and we're also going to need to import paragraphs, like this. Then we can just start by creating a simple document: we can say document = Document(), and to this document we can now add a bunch of things. Once we're done we can just say document.save and call this, for example, test.docx. I think this should actually already create a new file. Let's see: if I run this, we get this test.docx file, and if I open it, it's an empty Word file. So this already works.

Now we can use this document object to add all sorts of things. For example, we can create a new paragraph and add it to the document, or maybe you want to start with a heading. So we can say, for example, document.add_heading, and I want to say, for example, "Hello World". Then I can just run this again, and if I now open this, you can see that this is Heading 1. Now I think that if I pass zero instead of one, so if I say "Hello World" and then comma zero, this should pick the title. There you go, so now I picked the title. I'm not sure if one is Heading 1 or if it refers to something else, but I think one should be Heading 1, if I'm not mistaken.
There you go, and two should be Heading 2 and so on. So this is how you can do that. Let's see if Heading 2 works as well. There you go, it works. So zero is the title, and one, two, three, four and so on are the respective headings. This is how we create a simple heading.

We can also now add a simple paragraph of text. We can say p = document.add_paragraph and, okay, not "Hello World" again, let's say "This is a sample text". Then I can just run this, and if I open this now, you can see "This is a sample text". Now I can say 0 again to get a title, so that we see that there is a real difference here. So this works.

Now we can also say: to this paragraph I want to add some more text, but I want to add it in bold or italic style. So I can say, for example, p.add_run, maybe with a leading space, " This text is bold", and then .bold = True, like this. Then I can do the same thing with " This text is italic" and .italic = True. If I run this, it should already produce the desired text. You can see bold and italic, so this works easily as well. This is how you can do simple text styling in a paragraph.

We can do all sorts of things here. For example, I can now add multiple paragraphs that I want to style as list bullets. So I can say document.add_paragraph, and I want to say, for example, "This is item one", and the style of this item should be "List Bullet"; I think that's the correct style here. I can copy this and change it to two, three, four and five, and I hope that... okay, this doesn't work. What's the problem here? add_paragraph, style "List Bullet"... oh, it's misspelled, it should be "Bullet". I had a typo here, I have to remove the first t. There you go. Now if I run this, we can see that I have five list bullets here, so this also works.

And of course, as always when I make a video on a library: if you want to know all the features, all the styles that you can pass, all the keywords that you can have, all the add functions, look at the documentation. When I make a video on a library, it's usually to show you that this library exists and how it works in general. But if you now want to know, for example, how to do something else instead of list bullets, just go to the documentation. For the sake of completeness, in this video I'm going to show you a big-picture perspective on this library, and then you can go deeper into the individual features.

What I want to show you here as well is how to create a simple table. Let's say I want to add a table to our Word document where we have, I don't know, people with name, age and job or something like that, and we want to have a simple table representing these people. So we can say, for example, table_header = and I'm going to say I want to have a name, an age and a job. Then I'm going to say some_data = and here I'm going to have a list of lists, and those are going to be the table rows. So I can say, for example, we have John; John is 46 years old and he's a programmer. Then, let's copy this, we're going to have Mary, Anna and Bob. Mary is going to be, I don't know, 55, and she's also going to be a programmer. Anna is going to be 27 and she's going to be, I don't know, an accountant. Now I'm going to have Bob; Bob is going to be 50 and he's going to be a chef, so a cook.

Then we create a new table by saying table = document.add_table, and we say we want to have one row and three columns, like this. Then we're going to say for i in range, in this case 3: table.rows[0].cells[i].text is going to be table_header[i]. The basic idea is that we go to the first row, so index zero, which is the header of the table, and for each cell, for each column, we set whatever we have in table_header at that respective position. So first column name, second column age, third column job. That's the basic idea.

Now we're going to say for name, age, job in some_data: cells = table.add_row().cells. We add a new row to the table like this and get the cells as a result. Then cells[0].text is going to be the name, and then we have 1 and 2 for age and job.

So I can run this now, and of course I have a problem: add_table got an unexpected argument "row". I think it has to be "rows"; is that the problem, that it's singular? Next: "int object is not iterable". Which line is that? cells[1].text. Oh, this is a number, so we need to say str of age. "Permission denied": yes, because the file is open. Okay, third error. Are we good to go now? There you go. Let's open it up again, and now you can see we have a table. The table doesn't have a style; I can add a style here by saying this here, for example. But we have a table here: we have name, age, job, and we have name, age and job for the individual people.

Now before we go to the reading of Word files, I want to show you a few more little things. For example, we can do a page break: we can say document.add_page_break to say, okay, now we're going to start working on a new page, so document.add_paragraph, for example, "Hello new page", something like this. And then one thing that I also want to show you is how to add images. Oh, I think I'm actually after the save statement, so this won't work. Now it should work. There you go, so we have a page break, and here we have the new paragraph on the new page. What we can also do now is add an image. I have the NeuralNine wallpaper on my desktop, and all I'm going to do is say document.add_picture and pass neural wallpaper.png. And there you go, we have the image in the Word file. So those are the basics of creating and writing Word files using Python.

Now let us look at how this is done the other way around: how can we take a Word file that already exists and extract information from it? For example, we might want to go through the individual tables and extract some table data, or go through all the headings and extract the titles of the individual sections. If we want to do something like this, we need to change the import statement to be from docx.api import Document instead of just from docx import Document. Then we can still just say document = Document and refer to the test.docx file, and then we can iterate over different segments of the Word file.

For example, a very simple thing we can do is get all the headings, at least if they're actual headings. You can format something to look like a heading and then it's not a heading. Let me show you how this works. If I go to the Word file, this is a heading, because we can see here that the style is a heading. This is a title or a heading, whatever, but it's actually one of those styles. I can also go down here; now I have the Normal style, but I can take this and choose a bold font, increase the font size, center it and all that. This is still not a heading. Even though it works as a heading, it's not a heading, because it doesn't use the heading style. This is important: we're not going to find something like this, we're only going to find actual headings.

So what we can do is say for p in document.paragraphs. We want to find all the headings, and how do we do that? We say if p.style.name starts with "Heading", so Heading 1, Heading 2, Heading 3 and so on (and of course, depending on your document, you can change that; maybe you have different names that you want to look for), or p.style.name is equal to "Title", then in our case we're just going to print the value; we're going to say p.text. However, you can of course also store this somewhere or do something with it. In this case we only have "Hello World", so let's open the file and add some headings. If I say down here, for example, Heading 1, "What is up?", and then maybe down here I want to have Heading 4, "This is also a heading", whatever. Now I close this and run it again, and we can see that we have more headings. I think this is a blank line that is still a heading, so that is why we see a blank line here. But this is how you can do that: you go through all the paragraphs, and if the style starts with "Heading" or is "Title", you print it, you process it, whatever you want to do.

We can do the same thing for tables in a slightly different way. We can do something like for table in document.tables. This can actually be quite useful for data science, even though docx files are not usually files that we use to send data frames or data sets in general; but maybe you have data sets or previews of data sets in books and you want to extract them, or something like that. So we can say for table in document.tables, I'm going to print "New table", and then we can say for row in table.rows and print, separated by a pipe symbol, the join of the following list comprehension: cell.text for cell in row.cells. If I run this, you can see our table: name, age, job and our rows. And of course, if I open the file and somewhere at the end insert a new table and just add some information (I'm not going to do anything meaningful, I'm also going to skip some cells, whatever, something like this), then save it and run this again, you can see we have a second table here, "New table", with the new information. So this also works. This can be quite useful for data science as well.

We can of course also get all of the text that we find in ordinary paragraphs. We can say all_text is equal to an empty string, and then for p in document.paragraphs we say all_text += p.text and then, oh sorry, all_text += "\n". Then I'm going to run this, and of course we need to print all_text at the end, otherwise this is going to be a problem. Then you can see we have all the text of this document. Of course this is without the tables, but you can concatenate this with the tables and then you have all the text.

One last thing I want to show you is that maybe you want to filter based on text size. Maybe you want to say: I'm not interested in all these paragraphs; what I'm interested in is all the paragraphs that have a certain font size. For example, down here I can say "This is one of those that we're looking for", then increase the font size to 16. Then I can copy this and maybe put one here in the list as well, "This is one of those". Then I can say, okay, this one I want to be font size 16.
"This is special", I'm going to call this one. I'm going to save this, and now what we're going to do is look for paragraphs that have a font size of 16 points. So we're going to say, actually we need to import here first, from docx.shared import Pt. And we're going to say that all_16pt_text is going to be an empty string. Then for p in document.paragraphs, I'm going to say for all the runs, so for run in p.runs, if run.font_size, or not font_size, sorry, run.font.size is equal to Pt(16), then all_16pt_text += p.text and then all_16pt_text += "\n". Then we can print all_16pt_text, and you can see that we get those here.

Now, why do we get this twice? Do we have a problem here? Let me just see. Do I have it twice maybe? Or maybe the reason is that for some reason those are treated as two things, maybe because it's part of the list. Let's remove this from the list and put it somewhere down here without it being a list. Let's see if this changes something. Okay, now I have it three times. What the hell is going on? How does this print the whole thing three times? Okay, I'm going to find out what it does and come back to you.

Okay, I think I figured out what's happening. I changed this to "This is special" or so, and now it's printed five times, because I think each of those words is treated as a separate run in this case. This is not the case for the one down below. So I think if I change this whole thing to Normal as well, then change the font size to 16 again, save it and run it... we still see it three times. Okay, why do we see it three times? Okay, now I changed it again: I rewrote the text and deleted some of the blank lines that had the same formatting. I ran this again and now I get it only once. So I think it is because we go for each run in p.runs and then add the same p every time. This can produce problems; you can also try a different strategy, but this is essentially how you can do that.

All sorts of things like this can be done: you can look for styling, you can look for color, you can do all sorts of things. Again, as always, if you want to have the complete arsenal of commands and attributes to use, you can just look up the documentation. But this is how you process Word files in Python.

So that's it for today's video. I hope you enjoyed it and I hope you learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below. And of course, don't forget to subscribe to this channel and hit the notification bell to not miss a single future video, for free. Other than that, thank you very much for watching, see you in the next video, and bye.
Info
Channel: NeuralNine
Views: 55,807
Keywords: python docx, python word files, python word, python parse word, python parse docx, docx, doc, python doc files, python docx files, python read word files, python edit word files
Id: so2illANiRw
Length: 19min 43sec (1183 seconds)
Published: Mon Jun 20 2022