CRISPR Screening Workshop: Running Your Own Analysis RStudio (Harismendy)

Video Statistics and Information

Captions
So who was there last year? It's quite a bit of a difference, right? Last year we had a hands-on approach with PinAPL-Py, and I think we went through it okay. So thank you, thanks a lot, Phillip. One of our goals with this workshop is that the same people who are actually designing these CRISPR screens and conducting them in the lab should be able to analyze their own data, without waiting forever for a computational biologist or bioinformatician to have time to look at it, someone who may not even have the biological interest. That was the motivation behind PinAPL-Py, and that was the motivation behind setting it up as a web application. Nevertheless (and I'm going to leave that microphone) you might be interested in going a little beyond what the website and PinAPL-Py are giving you, and in running your own analysis. This is the part I would like to teach you tonight, for the next maybe 40 minutes; we're quite ahead of schedule. And this is going to happen in the RStudio environment.

I don't know if most of you are familiar with R. Who knows R? Okay, good. Who has some notion of R? This is one of my favorite environments for the post-analysis: after the larger, more computationally intensive analyses such as read alignment, you end up with relatively small data sets that are still too big to be handled in an Excel sheet. So one of my goals as well is for experimental biologists to move away from Excel and scale up their game a little bit by running some of these analyses in R.

So let's see how that runs, because this is not a slide presentation; it's a hands-on analysis. I would like everybody to run RStudio. Most of you have installed it, because it was in the instructions on the website, and I want this to be interactive. We're also going to be circling with a microphone so we can help you. We're going to take a slow pace, because
what's important is that you get a handle on a few of the packages that I would recommend for this type of data, and that you can even go beyond that: whether it is a CRISPR screen or any other type of data you see in your lab, you may be able to move away from Excel or Prism and consider using R.

For those of you who were not in the auditorium at 3, I recommended you download this folder. It is in the GitHub repository of my lab [account name unclear]; the repository is called workshop CRISPR R, and to download it you click on this green button here. Assuming you've done that, or you're doing that right now, the next step is to open RStudio. Let's see how the resolution is here; I would like a broader view. Is it too small? Is there a way to increase the font? Here you go.

For those of you not familiar with RStudio: you've got a console here on the left-hand side, your environment on the upper right-hand side, and the file system on the bottom right-hand side. We're going to make another panel appear as we go and open one of the files located in the folder you just downloaded. So go to that folder (I think mine is in the Downloads folder, right here) and open the file called R practice .Rmd. That's an R Markdown file, what's called a notebook. A notebook is a combination of text and blocks of code, so it will walk you through all the steps, with comments between each step. Once you're finished with a notebook, there is a way to generate an HTML file corresponding to that notebook. You can see the corresponding HTML file is actually available in that same folder. Oh, sorry, I clicked the wrong one; see, it's too small. But you have the corresponding
HTML file that has the blocks of code, the figures and the tables we will be generating. So if you want a sneak peek, or don't want to run the analysis yourself but just want to look at what I'm talking about, you can browse that HTML file.

Okay, let's go back to RStudio. You have this R notebook (I'm going to reduce this) and we're going to walk through it. In the first step I recommend you install some packages that you will probably be missing from the default installation of R. That is this first block of code. It's all commented out for now, which means it's inactive: if you run the entire notebook with Run, Run All, it's not going to run that. But I want to make sure you can all run this, so uncomment that particular line. Technically we could run the entire notebook right now, but I suggest we only run that block of code, just to make sure everybody is on the same page and all of you have the right packages installed.

What are those packages? You can see they're separated by commas here. There are packages to manipulate data frames, dplyr and tidyr; there are packages to reshape data frames from a compact version to a long version; there are packages to plot, like ggplot2 or pheatmap; and some more statistical packages. So if you uncomment that line and run install.packages, you can click yes, I want to update these, and it should run. All right, that starts to look bad. Okay, no, I don't want to restart R. So it is downloading, decompressing and installing. At the end of this process you should see the message that the downloaded binary packages are in some folder, and they should go in the default folder.

If you're having trouble, please raise your hand, and anybody who knows R, or my co-instructor, can help. I can help too, so we'll take it slow. If you want to go out because you're not interested, you can, but there's an interesting raffle coming up. So in some sense
we'll see how much I can do on a one-by-one basis. [To an attendee:] Okay, so download the folder by clicking on the green button. Yes, the green button. It's just downloading the folder, so this should download it into your default download location on your Mac. Okay, that was an easy one.

All right, so once we have installed these packages... who's having trouble installing them, or just running that bit of code? Anyway, there's another thing we should do, in the lower right panel: we're going to set our working directory. The working directory should be the folder we just downloaded. If you look here on the side, I'm not in the right location; this is just the root location of my computer. To move to the directory we just downloaded, you can navigate to it through this file system (the folders are usually at the bottom), and here is my folder. This is the folder I want to be in: I have my HTML, the file we just opened, and, especially important, those two folders in which the input data is.

Let's finish setting the folder. Once you have moved in this file system to that folder, I suggest you set it as the working directory: you go to More and Set As Working Directory. Now everything we're doing will happen in this working directory.

Again, let's browse the files a little. In these two folders I put files that are directly downloaded from the output of PinAPL-Py. Once you've run PinAPL-Py, you download this big archive and you're going to see these files. By default PinAPL-Py will put these files in a separate folder for each library or each experiment, so the little trick I did here is that I copied them into the same folder, so it's easier for R to find them and stitch them together. There's a folder for the counts, which is the raw
counts of every single guide RNA, and then there's a folder for the results of the enrichment analysis, which has, in addition to the normalized counts, the p-values, etc.

All right, so we have now installed the packages we wanted; now we're going to load them. That's the second block of code: we load these packages. That should be trivial and fast, and now they're all available in our working environment.

I move down again. I can comment on what I just wrote here. The experiment I wanted to share with you is one of my own. If you look in these folders again, you may see the different types of experiments; we have different results here. In fact, we have four replicates per condition and three conditions. One of the conditions is called the baseline; that's the baseline JP talked about before, right after the puromycin selection, so really before any selection has happened. It's what we call T0 in this nomenclature. The two other conditions are at T3, a later time point several weeks on: one is T, which is treated (we treated these cells with chemotherapy), and one is U, untreated. So again, three conditions and four replicates per condition. The replicates are indicated by the names of these files: infection one replicate one, infection one replicate two, infection two replicate one, infection two replicate two. So that is the biology background of my experiment.

The next couple of blocks of code are going to import the count files. This is a small function; you can try to decipher it, but otherwise you can just reuse it. When you run it, you first teach R that this should be a function, so running it shouldn't do anything except create a function in this panel here. The actual import happens in the next block of code, which I run now. This block of code gives the list of files to R and runs this function for each file in the list of
files. The result of this is what's called a list: it's a list of data frames. Every single big table is a data frame, and these data frames are put together in a list. A list is always a bit difficult to handle in R; I'd rather have an even larger data frame if I can build one. That's what the next block of code is doing: it removes this list and combines all the data frames into one giant data frame, just adding a new variable, the library name, which was part of the data frame in the first place. So here we are, stitching together all these data frames.

So how do I view my samples? You can go to the upper right corner, and you can see there are a bunch of... let me actually redo that and clean up this working environment a little. It's not really clean right now; sorry, I have some old project there. So let me rerun the few steps I did. Please raise your hand if you're a bit lost. Again, I want this part to be a general teaching about R more than just about the CRISPR screen.

[Question from the audience.] So if you cannot open the Rmd, maybe you can open it in a text editor and then paste it into a new file. Can you do that? If you have trouble opening the Rmd file: sometimes the fact that I uploaded it to the GitHub repository will change the nature of this Rmd file a bit, and RStudio will not recognize it as a notebook. The way to trick it is to open it in a text editor, copy it, and paste it back into the RStudio environment. Yes? Oh, you don't have the... it's called R GUI, so you've got the wrong program. Okay, thank you. All right, so we propose RStudio. There may be some other IDE (an IDE is an integrated development environment) that could work for you; I'm just very used to RStudio, and I think it's one of the better ones out there.

So let's continue, if you don't mind. How do you now view the objects that have been generated? You can see here on
this upper right corner the list of objects that are in your environment. You remember we have a function object that we created, we have this list of files, and now we have this new giant data frame that's called the counts single guide. That's the result of the aggregation of all the counts for all 12 experiments that I have. We can click on it, and the nice thing about RStudio is that you can open that as a table. Again, I did not change the output of PinAPL-Py much. The first column is the single guide RNA name, the second column is the gene name, the next column is the count (the raw count, not normalized), and the last column is this library name. Well, it's not exactly the library name; it's the file name with the whole path to the file. So we'll make it a little prettier, and that's what the next block of code is doing. It uses the function separate to read through that string, split on the slashes, and pick the actual string that corresponds to the library name. You can see the structure of this separate call: I specify the data frame, I specify which column I want to separate, I specify the separator, which is either a dot or a slash, and then I specify the new column names I want for the results of this separation. When you do that, it runs through this entire giant table, and now I have a much prettier name, at least under the library column. So I can get rid of the other columns, which are empty or have really irrelevant information.

To get rid of columns I use the dplyr function called select, and I select specifically the columns I want. Again, the first argument of this function is the name of the data frame: I select columns from the counts SG data frame. Which columns? I select the single guide RNA name, the gene name, the value of the count, and the library name. And
we copy that back to the actual data frame that we want. So now we have a much cleaner data frame with all the information we need. As you notice, it's a pretty big data frame. You can't do that in Excel, or your computer would have a hard time: it's a table of four columns with 1.5 million rows, and for R it's a piece of cake.

Okay, who needs a bit more time, so I slow down? Who's up to speed? All right, we'll do another 10 or 15 minutes of this, and then we're going to have some scientific talks. Who is way behind me? How much time do you need? That's okay; I'm sure some other people are happy that I'm stopping for a bit.

Any other issues? Some people forgot to uncomment the installation step, so that means they didn't have any of the packages installed. If your error is that R cannot find the function separate: that's the first function we call from one of the packages we installed, so it means the installation step didn't work. Who didn't get through the import count? The block with the import count doesn't do anything; it just creates the function, so then you need to run the next block, where you use the function. [To an attendee:] Okay, so you don't have a problem with tidyr? Are you comfortable doing that? Yeah, that'll be great. Who's having a problem with the separate function? I think there's one problem. Thank you, I've got a nice volunteer. If the separate function doesn't work, don't worry too much about it. It means your library name will be a little funky, with a lot of other characters in it, but don't worry too much about it.

So what else do we do? We're still in the preparation step here, but we have now finished that step. Let's start computing a few statistics and the normalization. As I mentioned, this is
the unnormalized count. If you read the manual of the dplyr package: dplyr has a lot of functions to filter rows, select columns, and group by factors, and the beauty of dplyr is that its functions work like a pipe, meaning you can take the output of one function and pipe it into the next function, and pipe that into the next one. That's what the syntax of this next block looks like. You first take your data frame, and you tell dplyr that you want to group by library, because we're going to compute things that are library-specific, or sample-specific. So we specify the variable by which we want to group. The piping operator is percent, greater-than, percent. Don't mess it up: space, percent greater-than percent, space, and that's your pipe in the dplyr syntax. So we pipe the counts SG data frame into this grouping; we just tell it to group by library. The next function is summarize, which means we're now going to create some summary variables. I have a question here, so I'll be right back; you can run the block, see the results, and figure out if you understand the code. [To an attendee:] You're ahead of me, but yes, it looks good.

So the summarize function allows you to create new variables in which you put the result of a mathematical operation. Here I'm introducing more than one new variable. The first one, total, is just the total number of reads in each library, which means I sum up this count column across the entire library: that's the sum of the variable value. I create another variable, the total number of single guide RNAs, and if you learned your lesson about GeCKO version 2, you should see 123,000. But the way to summarize
that number of single guide RNAs, the number of unique single guide RNAs, is pretty much just the length of that data frame within each library. I picked the length of the variable value, but I could have picked another variable, such as the length of the single guide RNA column. The next one is a little more interesting: the number of single guide RNAs that are covered in our library. What do I mean by covered? It means they received a number of reads greater than 0. To calculate that, we take the length of a vector, and this vector contains all the values that are greater than 0; that's the syntax to calculate it. Finally, to be even more relevant, we calculate the fraction of single guide RNAs that are covered. For this we take the same syntax as before, the length of the vector of values that are greater than 0, and we divide it by the length of value.

If you run that piece of code, we put the result in an object called total count, and then we just call total count to display it. This notebook displays the first few rows, but you can scroll through it. We have 12 rows because we have 12 samples. Does everybody understand? As you can tell already, every library was sequenced at a different depth. We have a library that was sequenced at 7.2 million reads, which is the minimum, at least in this panel, and another library that was sequenced at 18 million. That's where you realize you need to normalize for the total number of reads you generated, because Kristen is really good, but she's never going to be able to sequence the exact same number of reads from all your libraries. Or you're really bad at pooling your libraries, because sometimes she doesn't pool; you pool. In any case we need to correct for that. The way to do that: again,
we're going to use this magic piping from dplyr, and that's the next block of code. This time we take the data frame, we group by library again (grouping by our samples), and then we create a new variable, this time using the function mutate. With mutate we create a new variable called RPM (I think Phillip calls it CPM, counts per million), reads per million, which is the count for each single guide RNA divided by the sum of all the values, the total coverage, and we multiply by 1 million to get the per-million. So that's the normalization. Again, PinAPL-Py does normalization, so we could have started directly from the normalized file, but this is a way to teach you how to normalize this type of data.

The next block is another kind of summary block. The difference between summarize and mutate: with mutate you create a new variable for each row of your input data frame; with summarize you create a variable with an aggregation type of function, where you aggregate a mathematical formula according to another variable, the grouping variable. So that is the difference between summarize and mutate. And again, there are great tutorials on dplyr, so you don't have to listen to me too much. Here we just verify that the total number of reads, when you sum up all the RPM, is now 1 million. So we have normalized all these libraries, and they are now comparable to each other.

Let's do a little bit of plotting now that we have normalized our counts. We'll take it slow. We're going to focus first on one of the libraries and plot the distribution of normalized counts in it. To do that, we create a new, smaller data frame, the C1 data frame, which is just the result of filtering this giant data frame for the library that has the name I1 R1 T0 underscore 1. That's the
first replicate of the baseline condition. So we have now created this C1 data frame. We can look at it if you want; you just click on it. We now have a new column, this RPM column, and we have all these single guide RNAs. Let's plot this C1 data frame. To plot, I encourage you to learn the package ggplot2. It's really flexible, and a lot of the nice plots you see in papers have been generated with ggplot2. It has this characteristic gray background, so every time you see a figure with this gray background, it's ggplot2. You can modify it, you can remove the gray background if you want, but the default is a gray background.

A plot call has two sides. You first specify, with the ggplot function, what your data frame is (C1) and which variables you use for the aesthetics. One of the variables we want to use is the RPM: we want to plot the distribution of RPM. Actually, I changed it a little bit here. The RPM is pretty unevenly distributed: you've got a lot of low reads per million, a lot of low coverage, and very few with high coverage. So we plot instead the log 10 of the RPM, but in order not to miss the ones that have zero coverage, I add a small number, 0.01, and just make sure that number is lower than the minimal non-zero RPM. This way anything that is at minus 2 in log 10 will actually be my zeros in the distribution.

That was the first term of the ggplot call, but we haven't told ggplot what type of plot we want. After that you can specify many types of plots: you can specify a bar chart, you can specify a box plot. The one I like for distributions is the cumulative distribution function, and that is the function stat_ecdf. Once you plot that, it's actually very similar to the Lorenz curve that PinAPL-Py generates. The axes are kind of flipped, but it's very similar to this Lorenz curve. The way
you interpret this particular plot: the y axis is the fraction of single guide RNAs in your library. You've got 123,000 single guide RNAs, so you pretty much have 50 percent of your single guide RNAs with a coverage of 10 or less, that is, log 10 equal to 1 or less. So this is the way to look at this distribution.

Now we can modify the ggplot call a little: we can add a few elements to make it pretty. We can add a label for the x axis and a label for the y axis, we can change the font size, and there are lots of options that you can learn online by learning ggplot2. Once you do that, you have bigger fonts and nice labels, and you can almost publish that. You can always export that plot if you want: just right-click, and you can copy it and paste it into your PowerPoint, or you can download it.

All right, so we did it for one library. Now wouldn't it be cool to use that plot to compare the distributions between all our twelve libraries? ggplot is extremely powerful for that; it's a bit like Prism. You don't have to plot twelve individual plots. The way you specify that is to add another aesthetic variable. So again it's very similar to the call before, but this time our data frame is counts SG, the entire data frame, and in the aesthetic parameters we added color equals library. Color equals library means it takes library as one of those parameters and plots a different color for each library, but it does the exact same plot for each of them. So you specify your grouping variable with the color, then the rest as before, and then you obtain this cumulative distribution for all 12 libraries. If you're good at distinguishing all those rainbow colors, you can already compare the baselines, which are really nice and almost like
a step function, with the treated ones, like the green one, which already show quite some bias: a lot of single guide RNAs are lost and a few are enriched. It's not the most convenient way to look at this, because twelve different colors in a plot is not great. ggplot can also plot one panel per type, which would be quite useful. For example, we could have one panel with all four untreated replicates and one panel with all four baselines. To do that, we need to add another column to our data frame that indicates what type of library it is: is it baseline, is it treated, or is it untreated? Right now the only column that kind of does that is the library name, but that's not sufficient.

This is what the next block of code does. We introduce another variable using the mutate function. This variable is called type, and to determine the type I just interpret the string of the name. That's what this ifelse grepl does. grepl tests whether the string T0 is present in my library name: if it is present, I call that baseline; if it is not present, I call that T3. The next step looks at whether T3 underscore T is present in my library name: if it is, it calls it treated; otherwise it keeps the type, so it keeps baseline for baseline and T3 for untreated. The last one is about the untreated: if I see the string T3 underscore U, it's going to be untreated, and otherwise I keep the type. So that's one way to add a new variable derived from the library name. Now if I go back to my counts, we can see that type is baseline here, and if I scroll down I can see that the other libraries are treated or untreated. That's great, because it gives me another plotting variable, the type of library, and I can now tell ggplot to separate those panels: one panel for baseline, one panel for treated and another panel for untreated.
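The type annotation just described can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the object name `counts_sg`, the toy library names and the label strings are assumptions chosen to mirror the spoken description.

```r
library(dplyr)

# Toy stand-in for the combined count table. The library names are
# hypothetical; they only mimic the T0 / T3_T / T3_U naming described above.
counts_sg <- data.frame(
  library = c("I1R1_T0_1", "I1R1_T3_T", "I1R1_T3_U"),
  value   = c(10, 0, 25)
)

counts_sg <- counts_sg %>%
  mutate(
    # Step 1: anything containing "T0" is baseline, everything else is T3
    type = ifelse(grepl("T0", library), "baseline", "T3"),
    # Step 2: T3 libraries carrying "T3_T" are the treated ones
    type = ifelse(grepl("T3_T", library), "treated", type),
    # Step 3: T3 libraries carrying "T3_U" are the untreated ones
    type = ifelse(grepl("T3_U", library), "untreated", type)
  )
```

Each later step falls back to the current value of `type` when its pattern does not match, which is why baseline libraries keep their label through all three steps.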
I'm going to skip that block. That block was just a sanity check, verifying that I assigned the right type to the right library by selecting those two columns, library and type, and checking that they fit together, so I can verify that treated is really treated, etc.

The next plotting call is very much the same thing, but the last term, on the bottom line, is called facet_wrap. It's going to facet this into multiple panels, and the variable by which you facet is type. You need this little tilde before it because you can specify columns versus rows if you want: you could have one variable that distinguishes panels in columns and another variable in rows. Right now we only have one variable, so we just use the right-hand term of that tilde formula, and we can specify how many columns we want. We can just rerun that. It's a lot of data, but now the exact same curves we had all on one single panel are displayed in three different panels, baseline, treated, untreated, which makes it a little easier to read.

Okay, who's totally lost? Let's take a two-minute break; I think we covered quite a bit already. If you didn't have the latest RStudio version or the latest packages, you may run into little issues, but hopefully most of you are up to speed. So two more minutes, and then we're going to have [name unclear] present some of these results.

The next new type of data I would like to compute, in addition to the total number of reads and the normalized number of reads, is the Gini index. Again, PinAPL-Py gives you that Gini index, but what if you wanted to calculate it yourself? Well, you can again use this nice dplyr piping strategy, where you start with your data frame, counts SG, you group by library and by type this time, and you summarize. You don't need to group by type, actually; you could have used only library. But then you use the function from the ineq package that's
called the ineq function, and you specify that it's the Gini method you want to use for that function. If you don't know how a function works or what its syntax is, there's always a help menu. So if I want to know how the ineq function works, I can go to the lower right panel and type ineq, and then I have a direct view of the manual. ineq takes this input, and the different methods are there, Gini, RS, entropy, etc., and it really describes how to use it. So here is how I use it: I compute the Gini on the RPM, and here are my Gini indexes. We got a table, which is nice. We could put that in Excel and do a plot in Excel, or we can use ggplot2 to plot it, in order to see that there is actually a significant selection happening.

I put the result of the Gini computation into a new object, the Gini data frame, and it should have appeared here on the upper right side with all my Gini values. How do I plug that into ggplot? I specify ggplot again: I want to plot the Gini data frame, and for the aesthetics I want my x axis to be the different libraries, so it's library. For the y axis (this time we need to specify a y axis) it's Gini. I want each type of library to be a different color, to distinguish them visually. And the type of plot I want is geom_bar. ggplot is so powerful that it can even run some aggregation statistics if you don't specify otherwise, but right now we don't need anything: the number we have in our table, this 0.98 Gini, is the number we want on our bar. So in order to let ggplot do that, you need to tell it to use just the identical value in your data frame for the statistic: that's stat equals identity. That's a very specific way to run a bar chart. The other fields are familiar to you now, and these are my Gini plots, by Gini value. There are a few things that look ugly on there. First, all the
x-axis labels are all intermingled, plus my libraries are not sorted by type, which is a little annoying. You can modify that. To make sure your libraries are plotted in the order you want, you need this little piece of code, in which you specify the levels of these library strings. You tell R: this name should go first, this name should go next. So you really go one by one and say, okay, these are the levels, and please modify my library variable so that I can now plot them in the right order. You can also rotate the labels... sorry, I didn't run that, so I'm going to rerun it now. So I change the levels, and you can rotate the x-axis labels by 45 degrees, and then you have a much nicer plot. You can really see that the Gini index increases as you select the sample: there's a little bit of an increase between the baseline and the untreated, but you see a strong increase in the treated samples. That's what you wanted.

I think I'm going to stop here. I hope that in these past 30 minutes I gave you enough background about this particular notebook and the way I wrote it, and I hope there are enough comments in there that you can work your way through it, because there's a lot more plotting and statistics in there: I show how to do a principal component analysis and how to produce a heat map and clustering. Hopefully you will be able to take it and use it for your screen, or for your next microarray project. So hopefully this was useful to you guys.
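The per-library summary, the reads-per-million normalization and the Gini computation walked through in this session can be sketched as below. This is only an illustration under stated assumptions, not the notebook itself: the object and column names (`counts_sg`, `value`, `rpm`, `total_count`, `gini_by_library`) and the toy counts are made up for the example. The `ineq()` call with `type = "Gini"` is the function from the ineq package mentioned in the talk.

```r
library(dplyr)
library(ineq)

# Toy stand-in for the combined count table (hypothetical names and values)
counts_sg <- data.frame(
  library = rep(c("libA", "libB"), each = 4),
  sgRNA   = rep(paste0("sg", 1:4), times = 2),
  value   = c(0, 10, 30, 60, 5, 5, 5, 5)
)

# One row per library: sequencing depth and sgRNA coverage
total_count <- counts_sg %>%
  group_by(library) %>%
  summarise(
    total        = sum(value),               # total reads in the library
    n_sgrna      = length(value),            # number of sgRNAs
    n_covered    = length(value[value > 0]), # sgRNAs with at least one read
    frac_covered = n_covered / n_sgrna
  )

# One value per row: reads per million within each library
counts_sg <- counts_sg %>%
  group_by(library) %>%
  mutate(rpm = value / sum(value) * 1e6) %>%
  ungroup()

# Gini index of the normalized counts, one value per library
gini_by_library <- counts_sg %>%
  group_by(library) %>%
  summarise(gini = ineq(rpm, type = "Gini"))
```

After the `mutate()` step, summing `rpm` within a library should give one million, which is the sanity check described in the talk. The resulting `gini_by_library` table is the kind of object that could then be handed to `ggplot()` with `geom_bar(stat = "identity")`.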
Info
Channel: CCMI Admin
Views: 306
Rating: 5 out of 5
Keywords: R Studio, RStudio, CRISPR, Crispr screening, CCMI, UC SAN DIEGO, the cancer cell map initiative
Id: cZUT8I9RzYs
Length: 44min 5sec (2645 seconds)
Published: Tue Dec 19 2017