Demystifying Conda (Anaconda, Miniconda and Bioconda) and Virtual Environments

Video Statistics and Information

Captions
Hi guys, welcome back to my channel, and thank you so much for tuning in for another exciting video. Today we are diving into the world of package management: conda, Anaconda, Miniconda, Bioconda, and virtual environments, with a focus on R environments. In this video we are going to demystify these essential tools for bioinformaticians and data science professionals. A lot of you have previously reached out to me about your struggles with installing packages, maintaining dependencies, keeping consistent development environments, or ensuring your pipelines are reproducible. Whether you are an expert bioinformatician or just getting started, knowledge of these tools can be a game changer in your analysis journey, especially when you are building pipelines.

We often work on various projects, each with its own set of requirements, tools, and dependencies; it's like having different plant species that need specific conditions to thrive. So I like to draw an analogy between virtual environments and terrariums. For those who are not familiar, a terrarium is essentially a self-sustaining ecosystem: at its most basic, it is a sealed transparent globe or similar vessel in which plants are grown. Similarly, in the coding world, a virtual environment is a controlled, isolated space for a specific project. It's as if we could create a separate little terrarium for each project, with its own requirements and its own conditions to nurture the software we are working on. Changes in one terrarium won't impact the others, and likewise changes in one virtual environment do not affect another. For example, say you are working on a project that needs Python 3.6 with specific libraries, while another project requires Python 3.8 with a different set of dependencies. With virtual environments you can create these isolated spaces to keep your projects healthy: you can have different versions of tools and software in each environment and prevent them from interfering with each other.

Let's consider two real-world scenarios in the context of bioinformatics analysis. In the first scenario, imagine you want to try three different analysis workflows: one for a GWAS study and two different variant-calling workflows. Unfortunately, these workflows require similar software but different versions. In the second scenario, you perform an analysis that requires installing certain R packages to run a workflow. It runs fine on your computer, but when you share it with a member of your lab to reproduce the analysis, it fails on their system because the pipeline requires a specific version of R and of the R packages. The alternative solutions would be: for scenario one, run each workflow on a different machine, and for scenario two, ask the other person to reinstall and change their package versions to match the versions on your system. Of course, neither of these options is feasible or optimal. It would be better to have a system where you could create separate, isolated environments on one computer, install specific versions of the requirements, dependencies, and packages in each of them, and have these environments not interfere with each other. That would be the optimal solution.

There are various software-installation and virtual-environment managers one can use, depending on whether you are working in the Python or the R ecosystem, but another option is conda. Conda is essentially a package and environment management system. Conda is operating-system agnostic, meaning it works on Windows, Mac, and Linux. It simplifies software installation along with dependencies, and it lets you install binaries, that is, previously compiled versions of a program. It also helps you search for compatible software versions, and it has access to many software packages, including packages for Python and R. One of the advantages of conda is that it does not require admin permissions, which is good for use on high-performance computers; otherwise you would need to request software installations or specific versions from the HPC admin. Today I will demonstrate how to create virtual environments using conda. We will create two environments, and in each of them we will install the same tools but different versions of those tools, and you will see how convenient it is to have different versions of the same tool without them interfering with each other, or with the tools that are already installed on your system.

I'm sure at some point you have heard the terms Anaconda and Miniconda, and you might be wondering what these different condas are and how they differ from each other. Anaconda and Miniconda are distributions, or variants, of conda. Anaconda is a full-blown Python distribution that includes a lot of open-source packages. Miniconda comes in two versions: Miniconda2 is a Python 2.7 distribution and Miniconda3 is a Python 3 distribution. The Minicondas, as the name suggests, are smaller and take up much less drive space than Anaconda, because they do not include all the packages found in Anaconda. Both of these Python distributions include conda, which is the package and virtual-environment manager itself.

Conda installs programs from repositories called channels, and Bioconda is a channel devoted specifically to bioinformatics programs. A conda channel refers to a repository, or location, from which conda can fetch packages. Conda doesn't have just a single repository where all uploaded packages live; when a package is uploaded to conda, it must be uploaded to a specific channel, which is just a separate URL where packages published to that channel reside. There is a default conda channel where the stock conda packages live; these are maintained by conda and are generally very stable. A lot of this will make more sense when I show you what these channels look like. You can set priorities on these channels, and whenever you want to install a package, conda will first look at the channel you have given the highest priority. We'll talk about this in more detail during the demonstration. You might also have come across Mamba. Mamba does exactly what conda does, but it is a reimplementation and an improved version of conda: it is much faster, more efficient, and does not carry certain bugs that are present in conda.

In addition to conda, I also want to talk about R environments. renv is a package that helps you create reproducible environments in R. I usually use conda to install bioinformatics tools that I run on the command line, like BWA, the GATK tools, Picard, bedtools, and samtools, and I create environments when I am building a pipeline, like a variant-calling pipeline: I install all the required tools and dependencies in that environment, which prevents it from interfering with other tools and packages already installed on my system. I'm talking about R environments because I mostly code in R, but there are equivalents in Python for creating virtual environments as well. Today, though, I will mainly be talking about renv and demonstrating how to create virtual environments in R.

Reproducible environments and R projects can be created using the renv package, which makes our projects more isolated, portable, and reproducible. It provides isolation by allowing us to install or update a package in one project without breaking another project, and vice versa; that's because renv gives each project its own private library. It provides portability, which allows us to easily transport a project from one computer to another, even across different platforms, because renv makes it easy to install the packages the project depends on. And lastly, it provides reproducibility: renv records the exact package versions in a specific file and ensures that those exact versions are the ones that get installed every time that file is used to recreate the environment. It is important to emphasize that renv is not a one-stop solution for all problems surrounding reproducibility; rather, it is a tool that helps make projects more reproducible by handling one part of the overall problem, namely R packages. There are a number of other pieces that renv does not currently help with much. For example, renv tracks, but does not help you manage, the version of R that is used with the packages; renv can't easily help with this because it runs inside of R, but you can find other tools that allow you to keep multiple versions of R on one computer and switch between them.

For today's demonstration, I will create two conda environments with different versions of the same tools, to prove the point that we can have different versions of the same tools in different environments without them interfering with each other, or with other programs or tools installed on the system. I will also demonstrate how to export these environments to a yml file, which can be shared with others so that they can recreate the same environment. And lastly, I will demonstrate how to create an environment in R using the renv package, and I will briefly introduce the concept of projects in R, where we will use renv to track the package versions. These are the requirements for today's demonstration.

This is the official documentation provided by conda, which gives the installation instructions. Depending on your requirements, you can install either distribution, Miniconda or Anaconda, but they recommend that the fastest way to obtain conda is to install Miniconda, which includes only conda and its dependencies. If you prefer to have all the packages that are part of Anaconda, you can install Anaconda instead. As you can see, Miniconda takes up much less disk space than Anaconda, so depending on your available disk space and your requirements, you can choose either distribution. Conda can be installed on Windows, Mac, or Linux, so depending on your system, you can open the installation instructions for that platform. Since I have a Mac, I'm going to open the installation instructions for Mac. I've already installed conda, so I'm not going to install it again, as I have some environments set up that I do not want to mess up, but I will walk you through the process. The first step is to download an installer, which is nothing but a bash script. Since I am walking you through installing Miniconda, I'm going to click on the Miniconda installer for macOS. This links to another page that lets you download the bash script matching your system build. Once you have the installer downloaded, you can follow the instructions provided there, which are pretty straightforward, and once you follow them, conda will be installed on your system.

Once conda is installed, you can open your terminal and check for conda by typing conda; this should display the help page. You can also test it by typing conda -V, which gives you the conda version. This indicates that you've successfully installed conda. Among the first things you will be required to do is to configure some channels for your conda. These are the locations from which you will be installing your packages. You will have to configure and set priorities for your channels, so that conda can resolve conflicts between packages of the same version that exist on multiple channels. I'm not going to go into the details of that, but I will link documentation in the description if you are interested in reading more. Currently, these are the channels I have set up for my conda: I've set bioconda as my highest-priority channel and conda-forge as the lowest. You can change that according to your preference. Again, I won't go into the details of these priorities, but I will add documentation that explains the various parameters that can be set to prioritize package installation from these channels, especially when the same version of a package exists in different channels.

Having said that, the next thing I want to show you is how to search for certain packages and look up their versions and the channels they are available in. There is a command called conda search, followed by the package name. Let us try searching for GATK packages. When you do that, it gives you a list of the packages with the available versions and the channels they are available in. This is helpful when you want to install certain packages: it tells you which versions are available in which channel. Here we can see that various versions of GATK are available in the bioconda channel, and you would expect this, because GATK is a bioinformatics package and bioconda is the channel that specifically holds bioinformatics packages.

Before going into the details of creating environments, I want to show you the environments I already have set up on my system. We can see the list of environments by running the conda env list command. When we do that, you'll see the environments present on my system, and you'll also notice that there is a base environment with an asterisk next to it; "base" also appears in parentheses at the beginning of the command-line prompt. Base is the default environment, which includes a Python installation and some core system libraries and dependencies for conda. It is best practice to avoid installing any additional packages in base; we always create a separate environment when we want to install additional packages. If you activate any other environment, "base" in the prompt is replaced by the name of that environment, and that's how you know which environment you are currently in. Right now we are in base, and we will not install any packages here.

For the demonstration, we wish to create two environments. Before creating them, let us first check whether any of the tools or packages we wish to install are already present locally, and which versions. The only reason I want to do that is to show you that creating conda environments and installing the same or a different version of the same tool does not interfere with my local installation, and that I can have multiple versions of the same tool on my system. I said we would need three tools or packages, Perl, FastQC, and Picard, installed in my conda environments, and I just want to check whether I have them locally. The first thing I'm going to check is whether I have FastQC installed; I'm going to use the which command, which gives me the location of the tool if it is installed. The result points to a folder that is not under miniconda, so this is a local installation, and I can add the version flag to get the version of the FastQC installed locally on my system. Let us check for Perl: Perl is also installed on my system, and perl --version gives me the version of the Perl that is installed. Let us check for Picard, and it seems I do not have Picard on my system.

Now let's go ahead and create the two environments, where we will install specific versions of each of these tools. Before creating them, let us also search for the available versions, as I demonstrated previously, using the conda search command. These are the FastQC versions available in bioconda; let us also check for Perl, and let us check for Picard. I just want to make sure these tools are available on bioconda and to see which versions are present, so we can pick and choose which versions to install in our conda environments.

Now let us create a conda environment, and let's name the first environment env1. As good practice, avoid giving conda environments ambiguous names, because down the line you might not remember what you installed in a particular environment unless you go into the details of its yml file. It's better to use intuitive names based on the programs or the pipeline you are downloading them for; that helps you identify what must have been installed in a particular environment. For demonstration purposes, I'm just giving a very generic name. For Picard let's pick version 3.1.0, which is the latest version; for Perl let's pick 5.34.0; and for FastQC let's pick 0.12.1. Here you can see the packages and dependencies that will be installed for the packages of interest. You proceed with yes, and it looks like the environment has been created. To activate the environment, you run conda activate env1, and to deactivate, you type conda deactivate. So let us activate the environment. Notice how "base" in the prompt changes to "env1", so now you know you are in env1. Let us check where FastQC is installed now: you'll see that its location has changed, because this copy is installed in the environment and stored in a different location from the one installed on my local system. You can also check the version, and you'll see that it is different from the one on our local system; this is the version we chose to install, so we should expect to see it.

Now let us deactivate the environment to get back to base, and create the second environment. Here I'm just going to change a few values. Just as we did for environment one, we allow all these packages and dependencies to be installed, and it looks like the second environment has also been created. Again, we can activate the second environment; notice how "base" changes to "env2". Now let us check where FastQC is installed: again it's in a miniconda folder, but in a folder specific to env2, the second environment. And now let us check the version: this is a different version from the one installed in environment one, and again we expect this, because it is the version we installed. So now we have the set of tools installed in both environments with different versions, and there are also certain tools or packages installed on my local system; basically, I have different versions of the same tool installed on my system right now.

Now that we have created these environments, let us export an environment to a file, so that the file can be shared with others and they can recreate the same environment. We will export it into a yaml file. We are in env2, and from within the environment we can export and create the yaml file using the command conda env export, redirecting the output to a file; I'm just going to name it environment2.yml. Yaml files are file formats used to set up and create configuration files. So let us save all the details of this environment — the packages, the versions, the channels, everything — into the file, and then take a look at it. When we open this file, you can see the list of all the packages, dependencies, and tools, as well as the name of the environment and the channels. This file can be shared with anyone who wants to recreate the same environment, and they will get the same set of tools and dependencies, with the same versions, that are present in your environment. If you have been given a yml file for an environment and you want to reproduce that environment, you can run conda env create and provide the name of the yaml file, and you will be able to reproduce the same environment.

Now let us switch gears to R environments and renv. As I said previously, I use conda for bioinformatics packages that I run on the command line, but there is a lot of analysis that I do in R, using RStudio. So it is helpful to have something like renv, where I can create isolated environments for my projects and keep my analysis pipelines in those environments; that makes it easy for me to share my analysis with someone who wants to recreate it using the same versions of the packages that I have in my environment.

Let us open RStudio. Before going into the details, I want to quickly explain that in R there is usually a global repository: there are one or two locations where, whenever you download or install any package, it gets installed. Let me quickly show you: there is a function called .libPaths(), which gives you the locations where your packages are installed. Every time you load a package, it is loaded from one of these locations; this is like a global repository, and whenever you type library() with a package name, these are the locations R looks in. Just to quickly show you what packages are installed in each of these locations: as you can see, there are a lot of packages installed here that I use for much of my analysis. Every time I type the library command to load a package, R goes to these locations, pulls out the package, and loads it into my R session.

Now, the problem with this setup is that whenever I require a specific version of a package, or need to update one package, the other packages or projects that depend on it may start having issues or might break. Let's say I previously had an analysis set up that required a certain version of a package, and I later update that package for another project: the original analysis might not function anymore, because it can no longer find the particular version it requires. So there are a lot of issues with keeping packages in one central location with all projects and analysis pipelines accessing them, each with separate requirements. It is better to have what renv provides: each analysis in a separate project, with its own library location where packages can be installed and maintained at different versions without interfering with each other.

RStudio provides a great feature for this: projects. R projects are great and help ensure reproducibility, because they create isolated working environments. Basically, a project is a separate folder that captures your working environment: your analysis scripts, your figures, the objects you create, your history — all of that is captured and stored in an isolated environment that can be shared with a collaborator or anyone who wants to reproduce your analysis or your environment. renv can be used within R projects, and today I will create two projects, in one of which we will use renv, so that I can show you how renv adds another layer of isolation in terms of packages: it provides a separate location where packages can be installed and updated without affecting the global packages.

Let us create a new R project and name it test1. I'm not going to check "Use renv with this project", because I want to create one project without renv and one with it; this is the one without. When you create the project, you are taken to a new screen, and you can see "test1" indicated, which means we are inside this project; within the working directory, you can see that an R project file has been created, and this location has been set as our working directory. Let us write some script. First, let us check where our packages will be loaded from: this is the global location, so I'm just going to paste it here as a comment, so it's easy for us to compare later. These are the locations the libraries will be loaded from — our global package locations. Now let us load the tidyverse package, because I want to create a ggplot, and I'm going to use the iris dataset, a very famous open dataset available within R. I'm saving it to a data frame called df, and I'm going to plot some data from it: sepal length against sepal width, colored by species, as a dot plot. So this is the figure. Let's say this is your script; I'm just creating a very basic one so that we can recreate it in the second project with renv. Here we loaded the tidyverse library, used the iris dataset, and created a plot. I also want to run sessionInfo(), because I want to see which version of tidyverse is being used here, and which versions of ggplot2 and dplyr have been loaded as dependencies, as part of the tidyverse package. Let us save this script as iris script 1.

Now let us create another project, and this one is going to use renv. Let us create a new project, name it test2, and check the box "Use renv with this project". Something to notice: in this folder it created a folder called renv, with additional files, in addition to the R project file. If you go back to the location where we chose to create the projects, you'll see it created separate folders: this folder corresponds to the project test1 and this one to test2, and it has now set this folder as the working directory. Let us create the script. First, .libPaths() again: I want to know where the packages will be loaded from, and this time you'll see that it is a different location from our global path — it is not the location we got when we ran .libPaths() in the test1 project. This indicates that any packages we install here will be stored in a different location, and any updates or changes to them should not affect the global packages or other packages installed on my system.

Now let us repeat what we did in test1. Let us load the tidyverse library — and this should give me an error. The reason is that R is now fetching from a different location, and at this location we haven't installed tidyverse. So let us install tidyverse using the renv install function, typing in "tidyverse". It gives me a list of all the packages and dependencies that will be installed for tidyverse, along with their versions; we proceed with yes, and it installed 103 packages in a matter of seconds, which is quite efficient and fast. Now let us load the library and repeat what we did in the first project: read in the iris dataset, and plot sepal length against sepal width, colored by species, as a dot plot. It created a figure similar to what we saw before. Let us run sessionInfo() again: these are the packages loaded in addition to tidyverse, and these are their versions.

Now let's say you want to share this environment with someone, so that they can run the script, get the same result, and recreate the analysis. Just as in conda we exported the environment to a yaml file that could be shared with anyone who wanted to recreate the environment, in R we have something called the lockfile, renv.lock. The lockfile stores information on the R version and metadata about the packages that are loaded in your environment. This file can be shared with anyone who wants to reproduce your analysis, and it will install the same packages, at the same versions, that are mentioned in it. Here you can see that there is currently only metadata about the renv package itself; we do not yet have any information about tidyverse or the other packages that are loaded. In order to record that information and update the lockfile, we have to run a command called snapshot. When you run snapshot, it asks whether you want to add information about all these packages and their versions to the lockfile, and you proceed with yes. When you say yes, the lockfile is updated; let us take a look at it again, and now you can see that a lot more metadata has been added about the other packages, in addition to renv. This is essentially all the packages currently in your environment, with their exact versions, and it can be shared with anyone who wants to reproduce and recreate this analysis, and recreate the environment as well.

That brings me to the end of this video. This was a short introduction to conda and to virtual environments in R. I will be adding links to several resources in the description section below. I will also be uploading the commands I ran today to my GitHub repository, and you will find the link in the description section as well. If you enjoyed this video, please consider giving it a thumbs up, and don't forget to subscribe to my channel for more bioinformatics-related content. If you have a question or a topic you would like me to cover, please leave it in the comment section, share this video with your friends and co-workers, and stay tuned for more bioinformatics-related content. I appreciate your support, and I'll see you in the next video.
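The `renv.lock` lockfile discussed at the end is a JSON file. A trimmed, hypothetical example is shown below; the R and package versions are illustrative, not taken from the video, and a real lockfile would list every package recorded by the snapshot.

```json
{
  "R": {
    "Version": "4.3.1",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://cloud.r-project.org" }
    ]
  },
  "Packages": {
    "renv": {
      "Package": "renv",
      "Version": "1.0.3",
      "Source": "Repository",
      "Repository": "CRAN"
    },
    "ggplot2": {
      "Package": "ggplot2",
      "Version": "3.4.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```

Running `renv::snapshot()` after installing packages records entries like these, and `renv::restore()` on another machine reads the lockfile back and installs those exact versions.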
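The channel-priority setup described in the video can be sketched as a `.condarc` file. This is a hedged illustration: the channel order (bioconda highest, conda-forge lowest) follows the video, but `channel_priority: strict` and the inclusion of `defaults` are assumptions, and the file is written to `/tmp` so a real `~/.condarc` is never touched.

```shell
# Hypothetical .condarc contents, written to /tmp so we don't
# overwrite any real conda configuration on this machine.
cat > /tmp/condarc.example <<'EOF'
channels:
  - bioconda
  - defaults
  - conda-forge
channel_priority: strict
EOF

# The same ordering could be produced with `conda config`;
# `--add` prepends, so the channel added last sits highest:
#   conda config --add channels conda-forge
#   conda config --add channels defaults
#   conda config --add channels bioconda
cat /tmp/condarc.example
```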
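A minimal `environment.yml` of the kind produced by `conda env export` can be sketched as below. The tool versions are the ones chosen for env1 in the video; the channel list mirrors the video's priorities, but a real export would contain a much longer dependency list with exact build strings, so treat this as an illustrative skeleton only.

```shell
# Write a minimal environment file like the one `conda env export`
# would produce for env1 (real exports list every dependency).
cat > /tmp/environment1.yml <<'EOF'
name: env1
channels:
  - bioconda
  - defaults
  - conda-forge
dependencies:
  - picard=3.1.0
  - perl=5.34.0
  - fastqc=0.12.1
EOF

# Someone else could then recreate the environment with:
#   conda env create --file environment1.yml
#   conda activate env1
grep -c '=' /tmp/environment1.yml   # counts the pinned packages
```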
Info
Channel: Bioinformagician
Views: 3,490
Id: 2bXGm0ZnJ38
Length: 36min 16sec (2176 seconds)
Published: Tue Oct 31 2023