Analyzing Genomics Data in R with Bioconductor

Captions
My name is Stephanie Hicks. I am a faculty member at Johns Hopkins. Actually, a little bit about me: I've loved that everybody started with the "about me" slides, it's very helpful. So I am a faculty member at Johns Hopkins in the biostatistics department, similar to Roger Peng. I have a couple of interests. One is data science education, so if you have any interest in talking about that, I would love to talk to you. That's not what I'm going to talk about today, but I have a lot of opinions about it.

What I am going to talk about today is a little bit of what I do for my research, which is the analysis of genomic data. Who knows what genomic data is? Okay, that's a great number of you, awesome. If you work with genomic data, you may or may not have heard of the software project called Bioconductor. I'm going to explain a little bit about what that is. It's kind of like the cousin or sister project to CRAN, specifically for the analysis of genomic data.

A few other fun facts about myself: I recently founded the R-Ladies Baltimore chapter, as of May 2018, so we are on our third event next week, but I loved getting to meet so many R-Ladies here in the local DC area. I'm also creating a children's book about women statisticians and data scientists. As Jared mentioned, I have young kids, and I went on a mission to try to find children's books featuring statisticians and data scientists. There weren't really any. There were some specifically for STEM, but I'm on a mission to make that happen for statisticians too, so I'm going to give you a little taste of that at the end of the talk.

All right, another thing about myself, as Jared mentioned: I have two young kids. Theo is my youngest. Happy birthday, Theo, he was born a year ago yesterday. Alex is my oldest, so he's two and a half. If you look at this picture long enough, you'll quickly realize two things. One, this was taken very recently: it was the day we voted earlier this week,
so he's got his "future voter" sticker on. And two, you'll see that he's got Minions on his shirt, or as he likes to call them, "aliens." So when I was preparing my talk at the last minute, as I'm sure all of us did, I decided to go with the Minion theme.

So, CRAN. As most of us are here for an rstats conference, we all know what CRAN is. It's a great software project that has over 10,000 R packages to date. It does basic things such as data wrangling or creating plots, but it can also do more advanced things such as web scraping, analysis of financial time series data, analysis of clinical trial data, and so forth. We all love CRAN.

If we look at the subset of people in this room, for example, half of us are on our phones, either tweeting or looking at cool links or learning about new packages. You might also be on Twitter, and so you might be tweeting with these hashtags: #rstatsdc, #rstats, and #rladies. You can also go on your phone or computer to the r-project.org website. If you go there, there's a link called "Other Projects". Is this working? Yeah. There's a link called "Other Projects", and if you go there you'll see a few things. There's this thing called the Community Services Area, which is wonderful, and right below it is something called Special Areas of Application. The first item under Special Areas of Application is "Bioconductor: Bioinformatics with R".

It has a list of broad goals: providing access to a wide range of powerful statistical and graphical methods for the analysis of genomic data; facilitating the integration of biological data; allowing the rapid development of scalable and interoperable software; promoting high-quality reproducible research; and creating training in computational and statistical methods for the analysis of genomic data. That gives you a broad overview of what Bioconductor is. So, for those of you who have not heard of Bioconductor,
formal introductions: CRAN, meet your cool cousin Bioconductor.

Okay, more seriously, Bioconductor is an open source and open development software project. Open source means anybody can go look at the code and anybody can use the code; open development means anybody can contribute code. It began in 2001 under Robert Gentleman, from Genentech, now at 23andMe I believe. Today I think it has over twelve full-time employees, funded by the Bioconductor project, who work on developing the core infrastructure and maintaining the packages.

Bioconductor has some big priorities. I would argue that those are reproducible research and high-quality documentation. For example, every software package in Bioconductor comes with a vignette, and that is really helpful when you don't necessarily want to scroll through just the reference manual. The vignette is meant to walk you through an example analysis, or to demonstrate how to actually use the package, which is not the same thing as looking up a function in the reference manual.

It has a pretty diverse community support group. We have an online forum where you can ask questions and get experts to respond. We also have a Slack team that you can join, and you can ask those questions in the Slack environment.

Workflows: Bioconductor is really nice because users have developed and contributed workflows specific to different types of genomic analysis. If you've ever heard of the analysis of gene expression data, or DNA methylation data, or mutations in your genome, there are workflows for those. If you are new to the field of genomics and you want to figure out "how do I analyze my fancy genomic data?", you can just download a workflow and follow the steps they have demonstrated to analyze your own data.

And then teaching resources: Bioconductor has spent a lot of time and effort developing teaching resources that you can use in the
classroom to teach your students.

So when you think about Bioconductor, the image that Bioconductor aims for is this: when you are analyzing genomic data, there are a lot of great and unique packages and tools out there that you can use to analyze the data. Each one is very unique, similar to a musician, and Bioconductor serves as the conductor of that set of musicians in an orchestra, keeping things on pace and making sure everything works together in a flexible way.

Okay, so in the next part of my talk I'm going to give you a demonstration of a little bit of what's in Bioconductor, and then a package that I think you'll find really cool.

First is a package called BiocPkgTools. It contains functions that access metadata around Bioconductor packages, and it returns that metadata in a tidy data format. If you load the package with library(BiocPkgTools), there's one function, biocDownloadStats(), and in one line of code you have a table loaded into R. Each row contains the name of the package and the number of distinct download IPs for a given month and year, and there's a last column called repo. This is the type of package it is.

So you could ask: how many types of packages are there? Because this is a table, the world of the tidyverse is opened up to you, so you can apply anything you want from the tidyverse here. For example, you can use the filter function to select the rows from 2018, select just the package and repo columns, ask for the distinct rows, group by repo, and then summarize by the total number of packages. You'll quickly see that Bioconductor has three main types of packages. One is software, similar to the software that's available on CRAN. And then we have these things called annotation packages and experiment data packages. If you've ever done any type of analysis of genomic data, you'll realize
there's a lot of bookkeeping that has to be done. For example, if you want to look at genes that are expressed in the human genome, you have to ask which human reference genome it is. Every six months to a year the reference genome changes, so the position where you would expect a gene to be in one reference genome may be slightly different from the position you would expect in a different reference genome. So there's a lot of bookkeeping, and the annotation packages basically streamline that process and make your life much easier. The experiment data packages essentially contain processed data, which is great for teaching because the data is just ready there for you to go.

You could also ask how many packages there are. CRAN, as I mentioned, has over 10,000 packages currently available. Bioconductor has around 1,700, but it has been steadily increasing. The BiocPkgTools function only loads data from, I think, 2009 onward. I'm not sure why there's no data before that. I know there's data somewhere, but it's not in that function. Starting in 2009, it looks like there were around 400 or 500 packages, and we've been steadily increasing through 2018. And this is just software packages, so we have a really big, diverse community of packages.

Okay, so what is the standard object when I talk about genomic data? How does R think about that? There's this thing called GenomicRanges, and I would argue that it's one of the most standard objects that Bioconductor uses. GenomicRanges is essentially a data frame, a constrained data frame. So you can see up here there's a chromosome. This is not a piece of data; you can imagine this as a chromosome in our body, chromosome 1. Inside of this region right here is a genomic location on that chromosome. When you translate that to R, or to something on the computer, we want to record, for that one specific genomic region, a few
things. One, we want to know the chromosome number; in this case it's chromosome 1. Two, what's the start of that genomic location: counting bases from one, two, three, four, five, all the way down, where does my genomic region start, and where does it end? In the world of genomics there are positive strands and negative strands. I'm not going to go into that, but it's, again, bookkeeping essentially. And then there's a dotted line here, and everything to the right of the dotted line is considered metadata that can be whatever you want.

So the left side of the table is a constrained data frame: a GRanges object expects those four columns every single time you create one. On the right side you can put whatever you want. If you think about this in the world of the tidyverse, this is actually already a tidy data frame, because each row is one observation, namely a specific genomic region, and every column is a specific variable. I'm going to explain why that's useful in a minute.

So how do you create a GenomicRanges object? It's really easy. There's a package called GenomicRanges, and there's a function called GRanges(). You create the object by providing those four pieces of information it is looking for. seqnames contains information about the chromosome. strand contains the positive or negative direction on the chromosome. And then there's this thing called ranges. Ranges are essentially a simplified version of GRanges: it doesn't have information about the chromosome name, it just contains the start and the end of the genomic region. Thinking back to the figure I was showing you before, these two genomic regions start at this base position and end at this base position. And then we had these two pieces of metadata: gene ID, or the gene name, you can think about it like that, and score. Score is super generic, but
you can just think about it as something that you've measured about that genomic region. So in a really quick way we've now loaded up a standard infrastructure, I'd argue the standard infrastructure, that Bioconductor uses to build almost all of its packages when you're talking about genomic data.

Okay, so what can you do with that? I mean, that's great, but what can you do with it? For example, it's common to ask: what's the width of that genomic region, how long or wide is it? You are essentially taking the difference between the end value and the start value, and you can get the width for those two genomic regions. You can also select or filter a set of rows. This is base R notation: gr, which is the new GenomicRanges object, open bracket. This is a classic base R way to filter a set of rows, looking for all the genomic ranges that have a score greater than 15.

Similar to the reasons that the tidyverse and dplyr were invented, that's not super human readable. So there have been recent efforts to make the analysis of genomic data more human readable, specifically with what I've discovered recently, called the plyranges package. If you're interested in getting into the analysis of genomic data, I would suggest checking it out, because I've incorporated it into my workflow and it's been wonderful. The whole goal was just to make genomic data analysis more human readable.

This is not my work. It is the brainchild of a student, Stuart Lee, from Monash University, with Di Cook and Michael Lawrence. Michael works at Genentech, and he's one of the main core Bioconductor developers. The idea is to define an API, one that literally extends the dplyr package, that maps relational genomic algebra to verbs, similar to the way dplyr does, but acting on this tidy genomic data. Another great idea: just straight-up borrow dplyr syntax and design principles. And
another great idea: compose verbs together with the pipe operator from magrittr, which makes the process of analyzing genomic data much more human readable.

So, for example, you can load the plyranges package, and instead of doing it the base R way to select a set of rows, you can take the GRanges object and pipe it into the filter function. Those of you familiar with the dplyr package will recognize the filter function. You are filtering for the genomic regions that have a score greater than 15, and we had one genomic region that met that criterion. You can pipe that GenomicRanges object, now filtered, into the function called width and ask for the width of that genomic region, and so forth. So we're really taking the analysis of genomic data, or the packages that have been developed on the GRanges object, and making them much more human readable.

There is a whole set of verbs that have been developed for the plyranges package. Some of these you'll recognize: the ones in bold literally came from the dplyr package, so you'll recognize summarize, mutate, select, arrange, group_by, and filter. The ones that are not in bold are very similar. They're just some things unique to genomic data that make the analysis more fun, as I like to say, and they do very similar things. For example, there's a whole set of joins and unions, and there's filter_by_overlaps, filter_by_non_overlaps, and so forth.

For example, suppose I were interested in joining two GenomicRanges objects. In figure A, I have a GenomicRanges object x and a GenomicRanges object y, depicted by the pink bars. If I wanted to join those two objects, there are a couple of ways I could think about doing that. I could join them by taking the overlap, the inner part of the genomic regions, or the intersect, or the left. There are just a lot of different things
that are unique to genomic data. And because everything is in a tidy framework, again, the world of the tidyverse is open to you, so for example you can make fantastic ggplots, and so forth. And that's it.

This is my hat tip to Gaby. She's the founder of the global organization called R-Ladies. I mentioned that I'm creating a children's book, so I'm working with a sketch artist who works at Hopkins, and she's been doing a fantastic job. If anybody is interested in getting involved in this project, I would love your help, and I can tell you all about it. Thank you for your time. [Applause]
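
The GRanges and plyranges steps described in the talk can be sketched roughly as follows. This is a minimal illustration, not the code from the speaker's slides: it assumes the Bioconductor packages GenomicRanges and plyranges (plus magrittr for the pipe) are installed, and the object names, coordinates, gene IDs, and scores are invented for the example.

```r
# Sketch only: coordinates, gene IDs, and scores below are made up.
library(GenomicRanges)  # Bioconductor: GRanges() and IRanges()
library(plyranges)      # Bioconductor: dplyr-style verbs for GRanges
library(magrittr)       # provides the %>% pipe

# Two regions on chromosome 1. seqnames, ranges (start/end), and strand
# form the constrained core; gene_id and score are free-form metadata.
gr <- GRanges(
  seqnames = c("chr1", "chr1"),
  ranges   = IRanges(start = c(101, 501), end = c(200, 650)),
  strand   = c("+", "-"),
  gene_id  = c("geneA", "geneB"),
  score    = c(10, 20)
)

# Width of each region. Both endpoints are counted, so this is
# end - start + 1 rather than a plain difference.
all_widths <- width(gr)

# Base R style: keep the regions with a score greater than 15.
high_base <- gr[gr$score > 15]

# plyranges style: the same filter written as a pipeline, then the width
# of whatever survives the filter.
high_width <- gr %>%
  filter(score > 15) %>%
  width()

print(gr)
print(all_widths)
print(high_width)
```

Both filtering styles select the same single region here; the piped version simply reads top to bottom as "take gr, keep score > 15, report the width".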
Info
Channel: Lander Analytics
Views: 9,916
Rating: 4.8620691 out of 5
Id: l1MQ7x8cn7Y
Length: 17min 47sec (1067 seconds)
Published: Tue Mar 26 2019