String manipulation in R with regular expressions using stringr and glue (CC111)

Captions
Something I find just super tedious is manually editing text that's going to appear in a figure or a table. Why is it so tedious? Well, if you're like me, you don't just do an analysis once. You might do it five times, and if I'm manually editing that text five times, it gets overwhelming and invariably I miss something. That's why I really like to code those modifications as much as possible. So how do we code these modifications to update text and make it look pretty? We'll use something called a regular expression, and today I'm going to talk about how we can do that in our R scripts for making really attractive visuals in ggplot2.

Hey folks, I'm Pat Schloss and this is Code Club. Today I'm going to let you in on one of my pet peeves. What I've been noticing in a lot of figures recently for microbiome studies is that they'll have a bacterial name for any taxonomic level, and it's clear that they didn't know how to italicize the bacterial name, because it'll be in a standard upright font rather than being italicized. Now, journals vary in how they handle italicization of bacterial names. At the American Society for Microbiology (ASM), their journals' instructions to authors call for all taxonomic ranks to be italicized. At the same time, sometimes we might have "unclassified" and then, you know, Bacillus. What I'll see is that sometimes both "unclassified" and Bacillus will be italicized, or neither of them, and what's even worse is that sometimes I'll see an underscore between "unclassified" and Bacillus. Come on, people!

As we proceed to working with data for operational taxonomic units, we might also want to indicate the OTU number, so you might want to have something like "unclassified Enterobacteriaceae" and then in parentheses "OTU 23". There's a lot of formatting that has to happen to make that all look good, and that's exactly what we're going to do in today's episode. We're going to use something called
regular expressions using functions from the stringr package. We'll also revisit our old friend the glue package. We'll ease into this area of regular expressions because it's an area that I find to be really powerful. It's also very confusing sometimes and very easy to screw things up. But don't worry, you won't really screw it up; you'll just have to iterate a few times before you get it right. So I'm going to introduce you to a few of the more basic things within regular expressions, and these are called quantifiers. If this all seems new to you, don't worry, I'm going to go through it with you. Let's head over to RStudio so we can get going.

So I'm going to start with this vector of OTU names. In our actual data, the data we've been working with in the past episodes, there are a few thousand different OTUs, and this gives us a general feel for what the different OTU names might look like in our data set. What I'd like to do is have Otu0001 instead be OTU 1. So I want to make OTU 1, OTU 10, OTU 100, and OTU 1000: I want OTU to be in all caps, I want there to be a space between OTU and the number, and I don't want those leading zeros.

As I mentioned, we're going to get this to work using functions from the stringr package. The stringr package is part of the tidyverse, so I'll do library(tidyverse). I can use str_replace, and for str_replace I'll give it the string that I want to match, so let's start with Otu001, maybe one more zero in there, and then we give it a pattern and then a replacement value. The pattern is what we want to match and the replacement is what we want to replace that with. If you've done find and replace all in something like Microsoft Word, it's the same idea, but this is going to end up being far more powerful than what you can typically do with Microsoft Word. So the pattern that I might want to use to get Otu0001 to be OTU 1, I could do, say,
tu0001, right? Or let me remove that one. And then for my replacement I'll do TU, space, and then let's see if we get it. That outputs it as OTU 1. Good, we're winning, right? Well, what if we had that second value in my vector, Otu0010? It doesn't do anything, because it can't find this pattern in the string that I gave it. And we could try it with the other values of our vector as well. You could say, "Well, Pat, you could just rerun it with one fewer zero and then you'd get OTU 10 out," right? Definitely that works, but I don't want to make a different regular expression for each value that I'm looking at. So if I come back to this original pattern that I had, something that I could put in here would be tu0 and then a plus sign. A plus sign means match the preceding character, that zero, one or more times. So I want to match that zero 1, 2, 3, 4, all the way up to however many times you want. Now again, remember this tu000 didn't match anything up here, and so with that plus sign we should now get out OTU 10.
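A minimal sketch of the literal pattern versus the plus-sign quantifier described so far. The four-element `otus` vector is an assumption about what the example vector on screen looks like (the captions only indicate names such as Otu0001 and Otu0010), and `plus_result` is a name introduced here for illustration.

```r
library(stringr)

# Assumed example vector; the real data set has a few thousand OTU names.
otus <- c("Otu0001", "Otu0010", "Otu0100", "Otu1000")

# A literal pattern only matches names with exactly three leading zeros,
# so only the first element changes.
str_replace(string = otus, pattern = "tu000", replacement = "TU ")

# `0+` matches one or more zeros, so one, two, or three leading zeros are
# all handled. "Otu1000" is left alone because no zero directly follows "u".
plus_result <- str_replace(string = otus, pattern = "tu0+", replacement = "TU ")
plus_result
```

Because `str_replace` is vectorized over `string`, one pattern is applied to every name at once, which is what makes a quantifier more useful than rerunning a literal pattern with a different number of zeros.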
And sure enough we do, so that works. Again, we could replace this with Otu0100 and that works as well. So that plus sign matches one or more instances of the preceding character. So if I come and do Otu1000, will this work? No, it should not work. Why not? Because we're expecting the pattern to match the zero one or more times, and there's no zero character right after the u. So again, that didn't work. What can we do in that situation? Well, there's another quantifier besides the plus sign that we can use, which is the star. The star matches the preceding character zero or more times. So the plus is one or more times and the star is zero or more times. So that works now, and that's pretty wonderful.

Let's put it all together and run str_replace on our OTU vector, using 0 star because we want to match zero or more times, with the replacement equal to TU, space. Ah, it was otus, not otu. And so then we get the nice formatting of our four different OTU labels, and I didn't have to manually go in and change those at all. So that was pretty nice.

One other quantifier that I want to briefly show you is the question mark. The question mark means match the preceding character zero or one time. Where this is typically used is for things like "color": in U.S. English it's c-o-l-o-r, whereas in British English it's c-o-l-o-u-r. So you could match "colour" with a question mark after the u, and that would match both the U.S. English and British English spellings. So if I put a question mark after the zero, well, let's see what this does, because I think it's going to give us some funny results. The results are a little bit funky, so let's look at what it did. The question mark again matches that zero zero or one time. For Otu1000 it matched it zero times because there was
no zero between the u and the 1, and it worked for Otu0100 because it matched that zero one time. Now, it did the same thing over here. It did exactly what we told it to do: it replaced that tu0 with the capital TU, space. But then it still left in those leading zeros for OTU 1 and OTU 10. So again, what we clearly want is that star character to match that zero zero or more times. These are three quantifiers that are really powerful for telling your pattern how many times to match a preceding character, especially when you don't know how many times you're going to see that character. I could give this pattern to any number of OTUs. I might have 10,000 OTUs, I might have 10 OTUs, and this pattern would still work. So as we go through today's episode, try to remember these three quantifiers, the plus sign, the star, and the question mark, as we go about modifying our figure from representing genus-level data to OTU-level data.

All right, so we're going to work with this code that we've been working with over the past episodes. Again, if you want to get a copy of this, I'd strongly encourage you to go to the link down below in the description, where there's a blog post that's associated with today's episode. You can get this starting chunk of code so you can work along with me as I modify it. You can then use that code to do your own experiments, and ultimately you can take the code that we work on together and apply it to your own data to really make it your own. If you can take the code that we work with and apply it to your own data to get the figure you want, that's perfect. That's exactly what I want you to be able to do, because not only will you have something that's useful to you, but you will also have demonstrated some level of mastery of the material, and that's what we want to see. All right, so in this code we read in these
libraries, we get the metadata, we get the OTU counts, we figure out our limit of detection, we get our taxonomy information here as well, and then we join this all together to generate our OTU relative abundance data. Further down in the code we see level equals genus, which means that we're filtering our data to only look at the genus rows. I want this to be OTU, so we'll go ahead and change level to be otu. This then joins all of our data together with the taxonomy data, and we are pooling to only include those taxa that have a maximum median relative abundance within each of the three disease status groups of greater than one percent. So if a taxon has a median relative abundance lower than one percent for all three groups, we're going to pool those together. And again, the three disease status groups: these data were collected from a study looking at people with and without C. difficile infections, and we're looking for biomarkers to indicate whether we can predict who has C. difficile. So we have people who are healthy, people with diarrhea who don't have C. diff, and people with diarrhea who do have C. diff.

Okay, so we join all this together and then this builds our nice pretty plot. I'm going to change "schubert genus" to "schubert otu", and let's give this a run and see what we get. Great, so we have a figure like we've been seeing, but instead of having the taxa names on the y-axis we have our OTU names, and we're going to use those regular expressions to see if we can clean them up and make it look better. We'll start by modifying these labels to be OTU, space, whatever. So remember what we did before. We'll come back up to our code where we ran the filter function, and running these two lines we see that we have sample ID, disease status, relative abundance, the level, and then the taxon name as Otu0001 or whatever. And so
I think what I'll do is, after that filter line, I'll do a mutate. I'm going to do taxon equals str_replace, and then our string will be taxon, our pattern, and this is what we practiced at the beginning, will be tu0 star, and then replacement equals capital TU, space. And be sure we've got a pipe at the end there. Ah, it's not happy about something: I forgot to close off the pipe. If we look at taxon_rel_abund, we now see that we've got the nice formatting of that taxon column. Let's go ahead and run everything else and see that our figure looks the way we want. Very good, we have our OTUs labeled OTU, space, 30, 22, and so forth, without that weird capitalization and those leading zeros. So this did exactly what I'd hoped it would do.

One thing I would like to do, though, is back up here where we're pooling our data. We're looking for things that have a maximum median abundance greater than one to not be pooled. As we go to finer and finer taxonomic levels, the amount of the total data that we can represent by pooling at one percent is going to drop, so I'm going to reduce this to 0.5 percent. We now see that we have a few more OTUs included, but this Other category is still probably around 50 or 60 percent of the data. That's kind of the breaks of what happens when you have a large number of features or OTUs: there are just so many ways to split the relative abundance data.

The next thing that I want to worry about is that we've got our OTUs, but we don't have any taxonomic names to go with them. So I'd like to combine my taxonomy with my OTU information. I'm going to come way back up to the top here, and I might end up revising the code we just inserted so that we can combine the taxonomy information with that OTU. To remind you what taxonomy looks like, we have an OTU, we have the kingdom, phylum, class, order, family, genus, blah blah blah,
right, and we've got 5,445 OTUs represented. What I'd like to do is create a column that goes with all this for a pretty OTU name that has both the genus name, perhaps, as well as the OTU. So in here I'll put a mutate to create a new column that I'll call pretty_otu, and this is going to be the code that I had down below here. Yeah, this str_replace, I'll cut that out and move it up here and do some more cannibalizing of the code. So now if I look at this... what did I do wrong? Oh, I didn't want taxon, I wanted otu. If I look at taxonomy, I now have my OTU, all my taxonomic levels, as well as the pretty_otu column, so we're in good shape.

What I would like to have is my genus name and my pretty OTU merged together. Remember that down below, when we make the plot, the y-axis labels are taken from the taxon column, which we actually create further below. So I'm going to use mutate to create a taxon column up here, and I will use the glue function, which we saw a number of episodes ago. We can do glue and then, in quotes, I'm going to put in curly braces the genus column, then a space, and then in round parentheses, inside of that, I'm going to put pretty_otu. I also need to make sure that I've loaded the glue package. Let's see what this all looks like if we run our taxonomy data frame. I'm running mutate from within mutate, which is not right. Okay, we then get our OTU, all our different taxonomic names, our pretty OTU, and then the taxon, which is truncated. To clean up that output a little bit, I'm going to do a select with otu and taxon. That way I'll have the pretty taxon name with the genus and the OTU label associated with the original OTU name, so when I do my joins with things like counts and whatnot, I'll be able to map those together. If I look at taxonomy, I now see I've got that OTU and the cleaned-up name. Now, one of the things I notice right off the bat is those blasted underscores.
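The mutate, glue, and select steps just described might look roughly like the sketch below. The two-row `taxonomy` tibble is a made-up stand-in for the real taxonomy data frame (which has every taxonomic level and 5,445 rows), and `taxonomy_labels` is a hypothetical name used only here; the column names `otu`, `genus`, `pretty_otu`, and `taxon` follow the names spoken in the episode.

```r
library(tibble)
library(dplyr)
library(stringr)
library(glue)

# Made-up stand-in for the taxonomy data frame.
taxonomy <- tibble(
  otu = c("Otu0001", "Otu0023"),
  genus = c("Bacteroides", "Enterobacteriaceae_unclassified")
)

taxonomy_labels <- taxonomy %>%
  mutate(
    # "tu" followed by zero or more zeros becomes "TU ".
    pretty_otu = str_replace(string = otu, pattern = "tu0*", replacement = "TU "),
    # glue() fills each {column} placeholder row by row.
    taxon = glue("{genus} ({pretty_otu})")
  ) %>%
  # Keep the original OTU name so later joins can still map to it.
  select(otu, taxon)

taxonomy_labels
```

Note that `taxon` comes out as a glue-class column rather than a plain character column, and the underscore in the second label is still there at this point.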
Right, so I have Enterobacteriaceae_unclassified. Now, in a previous episode we did clean this up, but I want to go back through that again because I'm spending a little more time in this episode talking about regular expressions. I will add a mutate for my genus. I don't need to run mutate and mutate again. For genus I'm going to do str_replace, and here we're going to use the genus column as our string, and we'll need a pattern and a replacement. Our pattern, again, is _unclassified. One of the cool things we can do with regular expressions is match different parts of a string and save them to memory. I can save things by putting them in parentheses, so I'll do, in parentheses, a period, star, and then _unclassified. What the period means is match any character, and the star means match that zero or more times. Then we've got that _unclassified, and what we're doing is saving the stuff before the _unclassified. We can then replace that with Unclassified, then a space, and then backslash backslash 1. Backslash backslash 1 means put in the stuff that was saved in that set of parentheses. Then we'll put a comma at the end of that, and now we see we have Unclassified Enterobacteriaceae, Unclassified Ruminococcaceae, and that's all good.

Looking at this, though, there's one more thing that I'm worried about, and that's my italicization. I want the genus name to be italicized but not the OTU 1; I want the Enterobacteriaceae to be italicized but not the Unclassified. To fix this, I think I'm going to modify our genus mutate line a little bit. I'm going to do genus equals str_replace, and then string equals genus. Again we'll need a pattern and a replacement, and then a comma there. What I'm going to do is take the genus name and wrap it in stars so that we can use ggtext to make it italicized. So I
will then do, again in my parentheses, period star, and we will match the whole string, and then I will do star, backslash backslash 1, star. That will then come into this next line, where we'll have the _unclassified, star. So it'll start with a star and end with _unclassified and a star, so maybe I'll put star, dot star, _unclassified, star. The stars here are the actual characters, and where this gets a little bit messy for patterns is that the star is not being used as a quantifier. If I want to use it as the actual character, I can put two backslashes in front of the star. So backslash backslash star means match the actual star. Then I can do Unclassified, and I can put stars around the backslash backslash 1. And now we see that we got it right: we have our genus name and OTU, but the genus name is in stars, and also down here in Unclassified Enterobacteriaceae, the Enterobacteriaceae is wrapped in stars, so that's going to be italicized.

So at our otu_rel_abund, let's run these two inner_join statements and see what we get. It looks like what we want, so we'll continue on with the pipeline. Here we'll get the relative abundance data, and then we get sample ID, disease status, OTU, count, taxon, and relative abundance. We can probably get rid of the count column like we had here. I don't need the pivot_longer because I'm already looking at the taxonomic level I want, and I don't need to filter it further in the next step. So now if I look at otu_rel_abund, great, we have all the columns we were expecting. I'm not totally sure I need this otu column, but I'm going to leave it there just in case, because you never know what might happen. If we look at taxon_rel_abund, this is where we did the filtering. I don't need that, so I can comment this out for now. One thing I notice that we do for relative abundance is multiply it by a hundred to get it into percent, so I'm going to put that hundred back up here where
I calculate the relative abundance, and so now I've got otu_rel_abund, which is exactly what I wanted. Here, instead of taxon_rel_abund, I'll do otu_rel_abund, and this is where we figure out which OTUs to pool. Then here for the inner join we have taxon_rel_abund still, so we want otu_rel_abund. So we're getting a complaint: problem with mutate input taxon, false must have class character, not class glue/character. So where was I doing something? Up here, where I'm pooling things, if it was labeled as pool being true then it gets the name Other, otherwise it gets the name taxon. But taxon is of type glue, not character. What I can do is wrap taxon in as.character, and that way, when it gets to these two lines, the output taxon will be of type character. So this looks good. We've got our genus name, or the family name that we're best able to classify to, along with the OTU designation. Again, we got that by using str_replace as well as the glue package to do all the formatting and make it look pretty.

One thing I'm not totally a fan of is having the Other category in the middle. It also doesn't seem like there's any great ordering to the data here, so what I'm going to do is order it by the maximum relative abundance of that OTU in any of the three disease status groups. To fix the order, let's come back up here. I did have a factor reorder, fct_reorder, ordering by the median in a non-descending sort. I now see I have median up here and I'm using the minimum; I think I'd rather have the max of the median. Let me look at where I'm defining median. Up here I'm looking at the median of the medians, so here I think I'll put the max of the medians. And so now we see our OTUs are ordered by the maximum median relative abundance across our three disease status groups. And for those of you that haven't been watching, I realized just
now that I haven't told you what we're looking at here. The ball indicates the median across all subjects in the study for that disease status group. You can then see we have a nice curve across our disease status groups, descending in terms of the median relative abundance for whichever is the largest across the three disease status groups, and then we get Other positioned at the bottom. So we're in good shape there. I like the ordering here; it makes me happy.

One thing I'm not totally a fan of is that some of these names get rather long, so what I might like to do is put a break in between the genus name and the OTU label. To do that, if we come back up to where we had our regular expression, right here in my glue statement I could put in a br. The br in angle brackets, less than and greater than, is for ggtext: down below in our theme we had axis.text.y equals element_markdown, which will go ahead and impose Markdown or HTML formatting on our text. So it's really slick, just a little bit of HTML so that you can get the right look for your figure. That br will put a line break between the taxonomic name and the OTU, and that looks pretty good.

One thing I'm not totally a fan of is that this Unclassified Ruminococcaceae gets really long, and I'm of multiple opinions about this. I'm not totally sold that I need to say Unclassified Ruminococcaceae. I think if you're talking to microbiologists who study the gut microbiome, they know that Ruminococcaceae is not a genus name, that it's a family name, and so Unclassified Ruminococcaceae isn't totally necessary. But at the same time, I also appreciate that I know a lot, and maybe not everybody knows that that's an unclassified Ruminococcaceae. So maybe we'll leave it, but so that I don't have such a long label for that one, I'll go ahead and put in another line break between Unclassified and Ruminococcaceae, and again that was back up here where we are modifying the code. In this replacement I can do Unclassified, break, and then the name of the genus or the family or whatever the deepest classified level was. That looks a little bit tidier there on the left side with that y-axis label. I could see this strategy of having multiple lines per label becoming a little unwieldy if we had more taxa than what we have here.

So again, the challenge in this episode was how to make an attractive plot at the OTU level. As we go to finer and finer taxonomic levels, the relative abundances of those levels get smaller and smaller. We saw that by going down to that half-percent relative abundance; you could probably go even smaller if you wanted. But the challenge then of looking at OTU data was that we have both OTU information, like OTU 1, and a taxonomic name. We need the taxonomic name because the OTU number doesn't really mean anything between studies: OTU 1 in my study is this bug I've never heard of before, and in your study it might be Bacillus. The other reason it matters to have both pieces of information is that, as we see here, both OTUs 2 and 5 are Bacteroides. If I talk about OTUs 2 and 5 across my study, they might behave differently, and so it'd be nice to know that OTUs 2 and 5 are both Bacteroides but are perhaps different entities. Their behavior, their frequency, abundance, and distribution might vary across the study, so I might want to talk about those OTUs separately. In this case it does appear that OTUs 2 and 5 have the same relationship to each other, which causes me to do a little bit of head scratching. But anyway, that's the value of being able to show both the taxonomy information and the OTU information. And if you're doing something like amplicon sequence variants, it'd be the same idea, except instead
of OTU 1 you might have ASV 1, whatever you want to do there. But the formatting and the idea of mixing different types of text together is the same as what we've done in this episode. So again, don't settle for underscores in your figure labels, and don't settle for vertical text when it's supposed to be italicized. It's just not necessary. Hopefully you get something out of this episode, so that when you make your next relative abundance plot you don't feel the need to keep those underscores in there or to keep things in, what is it, roman or vertical typeface, instead of using the italicization.

Anyway, I really hope you dig into this and try to apply it to your own code, making your own figures more attractive and more presentable, and giving them just that little bit more polish. Let me know how you fare down below in the comments. Keep practicing, be sure that you've subscribed and liked, and tell everyone you know about Code Club. It's really been awesome to see the growth of interest in the channel, more subscriptions and views and everything, and I'm just over the moon and really happy with people's positive reception. So keep practicing, and we'll see you next time for another episode of Code Club.
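As a recap of the label formatting walked through in the episode, here is a minimal sketch on plain vectors rather than the full pipeline. The `genus` and `pretty_otu` values are made-up examples and `starred` is a name introduced here for illustration; the patterns follow the steps as spoken (capture group, `\\1` back-reference, escaped star, `<br>` break, `as.character()`), but the exact code shown on screen may differ.

```r
library(stringr)
library(glue)

# Made-up example values standing in for two rows of the taxonomy table.
genus <- c("Bacteroides", "Enterobacteriaceae_unclassified")
pretty_otu <- c("OTU 2", "OTU 23")

# Step 1: wrap the whole name in stars (Markdown italics). `(.*)` saves the
# matched text and `\\1` puts it back in the replacement.
starred <- str_replace(string = genus, pattern = "(.*)", replacement = "*\\1*")

# Step 2: `\\*` matches a literal star instead of acting as a quantifier.
# Only the captured name stays inside the stars; "Unclassified" moves outside,
# followed by an HTML line break.
starred <- str_replace(
  string = starred,
  pattern = "\\*(.*)_unclassified\\*",
  replacement = "Unclassified<br>*\\1*"
)

# Step 3: glue the name and OTU label together with a line break, then convert
# from the glue class to plain character, since the pooling step in the episode
# complained about class glue/character.
taxon <- as.character(glue("{starred}<br>({pretty_otu})"))
taxon
```

The stars and `<br>` tags only render as italics and line breaks when the plot's theme uses `element_markdown()` from ggtext for the axis text, as described in the episode; otherwise they appear as literal characters.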
Info
Channel: Riffomonas Project
Views: 4,008
Keywords: r string manipulation, r stringr, rstudio stringr, r glue, r regular expression, r regular expression quantifier, r regular expression save, rstudio tidyverse string, r manipulate strings, rstudio glue, tidyverse glue, r paste, r paste0, microbiome, italicize axis labels, bold axis labels, r regular expression tutorial, italicize axis labels r, r glue example, r glue function, r glue tutorial, bold axis labels in r, bold axis labels ggplot
Id: u0yQfr7d0vs
Length: 29min 12sec (1752 seconds)
Published: Tue Jun 01 2021