DataOps 3 - Databricks Code Promotion using DevOps CI/CD

Video Statistics and Information

Captions
Hi, welcome back. Today we've got another demo about DataOps, and this one is about Databricks. The idea is exactly the same as with Data Factory: we'll have a Databricks workspace for development, where you use feature branches in Git. You write some code, and once you've saved and committed it, a pull request kicks off a build process on the collaboration branch, where all of the code merges together. That collaboration branch build then produces an artifact containing the notebook. Normally we would also be building and deploying libraries, but to keep the Databricks workflow simple, today we're concentrating purely on notebook deployment. Exactly the same methods work with libraries, so that may come in a later video, but you can absolutely work out how to do it from this demo. We then take that artifact into a release pipeline. In today's demo there's just a single staging area, which is testing; in your production environment you'd also have QA or pre-prod and a production stage, so you'd have multiple release stages. These are shown in a screenshot at the bottom of the instructions on the site. We then push the notebook into a Databricks workspace. In this instance I'm using the same workspace, just because it's easier to deploy one, but the way this works you could simply add a second, third, or fourth workspace in different environments with different tokens. The demo also uses Azure Key Vault so that we're not storing the access token in the PowerShell script that deploys the notebooks; this is a best-practice way of storing those credentials. It does mean you have to rotate those credentials every now and then and understand how to manage them, which may be the subject of another demo, but for today this is a nice, secure script that should do everything you need right out of the box. So with that, enjoy the demo, let me know in the comments below if there's anything else you'd like to see on the channel, and please hit that like button and subscribe so that you see new demos as they come along. Hope to see you next time. Thank you.

So first things first: log into your Azure portal and create a Key Vault. The Key Vault is going to hold the token for Databricks so that we don't need to store it in the script file within Azure DevOps. Give your Key Vault a name and set up a new resource group for the demo; in this case I've used DataOps for the group, and KV plus the current date in reverse order for the vault, which generally gives you an available name even when things need a public name. I've shown the access policy side, which configures who is allowed to get and put keys and secrets within the Key Vault. For the time being I've set that just to myself; in a production environment you'd probably set it to the team or whoever is in charge of the Key Vault, and later on we'll enable Azure DevOps to have access too.
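If you prefer scripting these portal steps, something like the following Az PowerShell sketch creates the resource group and Key Vault and grants secret permissions. The names here are placeholders following the naming pattern described above, not the exact values from the demo.

```powershell
# Placeholder names following the pattern above (resource group "DataOps",
# vault named "kv" + date reversed); adjust for your own environment.
$resourceGroup = "DataOps"
$vaultName     = "kv20200211"   # must be globally unique
$location      = "EastUS"

Connect-AzAccount
New-AzResourceGroup -Name $resourceGroup -Location $location
New-AzKeyVault -Name $vaultName -ResourceGroupName $resourceGroup -Location $location

# Equivalent of the portal's access policy step: grant a user (or a team,
# by repeating the call or using a group object ID) rights on secrets.
Set-AzKeyVaultAccessPolicy -VaultName $vaultName `
    -UserPrincipalName "you@example.com" `
    -PermissionsToSecrets get,list,set
```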
So once that's done, start creating an Azure Databricks workspace. Again, choose that same resource group; I've used ADB and the current date for the name. Choose the location; I always use East US just because everything is nice and available there. Once that's validated, click Create. We don't need to set up many settings here, because this environment isn't going to do much; it's just there so that we can use the Git integration for the demo (if you'd rather script this step, there's a sketch just below). Then go over to Azure DevOps, log in, and create yourself a new project. Again I'm using the DataOps name and keeping it private here; in a real environment you'd probably set the visibility to enterprise so that other people have access. Then click on Repos and initialize the repository with the button at the bottom, including a readme; this lets us use the repository for everything else. Once that's done, your Databricks workspace should hopefully be created. Log in to the workspace (that takes a second or so), and once you're in, click your user icon at the top and go to User Settings; this is where we configure the Git integration. As I'm showing on screen, you can use GitHub (which requires a token) or Bitbucket, but we're going to use Azure DevOps. Then click on Home or Workspace and create a new notebook. Here we're creating one called demo notebook and choosing Python, just because the code is set up for Python. You can change that to the language of your choice, but be a bit careful about using the scripts I've provided, because the release script does actually specify Python. Then click on the Git integration, click Link, go back into your repository, copy the URI, and paste it in here. For the branch, we're going to create a new branch called feature branch: just type a name in there and a Create Branch button appears.
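As an aside, the workspace creation can also be scripted. This is a rough sketch assuming the Az.Databricks PowerShell module is installed; the workspace name follows the ADB-plus-date pattern described above but is a placeholder.

```powershell
# Sketch only: creates a Databricks workspace equivalent to the portal steps.
# Assumes the Az.Databricks module (Install-Module Az.Databricks) is available
# and uses placeholder names following the pattern described above.
New-AzDatabricksWorkspace -Name "adb20200211" `
    -ResourceGroupName "DataOps" `
    -Location "EastUS" `
    -Sku standard
```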
Hit Create and Go, and the notebook will be synchronized. We put in a comment here, "init feature branch", and click Save; that commits the notebook into the repository so it's captured in Git. We then hide the revision history and just type a comment into the notebook. The only reason we need something here is to show that it's the same notebook between the environments, so don't worry about what you type, just make sure it's unique. Once that's done, click Save and make sure "Also commit to Git" is ticked. You don't always want that ticked, because sometimes you're just working locally and that's fine, but do commit when you've done some actual work. Back in the repository we can see that there is no code in the master branch, and the feature branch has the commit we've just made. Click on that and we can see the code that's been written. It does say "Databricks notebook source" at the top; that isn't visible in the notebook itself, so don't worry too much about it. We can then create a pull request by clicking the link within Databricks, which just launches the portal; there's nothing stopping you doing the pull request from within DevOps instead. Give it a name, make sure the from and to branches are set correctly, and click Create. In this instance we're just going to approve it and click Complete; obviously you'd have some workflow around that in your final environment. When we complete the merge, by default that removes the feature branch, so you'd then want to configure Databricks to use a new feature branch. Here we can see that master now has the code in it and the feature branch has disappeared.

Now we configure the key. Back in the Databricks user settings, go to Access Tokens, generate a new token, name it Azure DevOps or something relevant in your environment, and click Generate. You must copy this key while the box is on the screen; it is never shown again. Then go back into the resource group, into the Key Vault, and click Secrets, then Generate/Import. Give it a name of Databricks and paste the value in. This stores the value so that it can be requested by anything with permission: people looking into the Git repository won't be able to see the token or break into our systems, but when Azure DevOps asks for it, it will have access. I'm setting an activation date and an expiry date here so that we've got a record of when this expires, and we can look into this one centralized Key Vault and see when everything is going to expire so that we know to renew it (a scripted version of this step is sketched below). You'd probably want to automate that in a real environment. If you've got permission you can come in and see the secret at any time, but generally you would never access it this way again; you'd always use it through Azure DevOps or some other mechanism. So in Azure DevOps, click Pipelines and then Library, and create a variable group to link to that Key Vault: give it a name like Databricks var group and select "Link secrets from an Azure Key Vault".
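For reference, the secret-with-expiry step can also be scripted with Az PowerShell. This sketch uses the secret name from the demo, but the vault name and the 90-day lifetime are assumptions.

```powershell
# Store the Databricks personal access token as a Key Vault secret with an
# activation date and an expiry date, mirroring the portal steps above.
# The vault name and the 90-day lifetime are placeholders.
$token = Read-Host -Prompt "Paste the Databricks access token" -AsSecureString

Set-AzKeyVaultSecret -VaultName "kv20200211" `
    -Name "Databricks" `
    -SecretValue $token `
    -NotBefore (Get-Date) `
    -Expires (Get-Date).AddDays(90)
```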
We then need to authorize our subscription, and once that's authorized we can select the Key Vault from the drop-down list, which authorizes again; what this is doing is giving Azure DevOps permission to the Key Vault. Then we select the variables: all of the variables you select here become available to your pipelines. You can create multiple variable groups, say one for dev, one for QA, and one for prod, keep them nice and separate, and align a variable group to a particular stage rather than to the whole job. Next we create the build pipeline, which is exactly the same as in the recent Azure Data Factory videos: give it a name of Build notebook artifact, add a job to publish build artifacts, and point it at the Git repository. In this instance there is no actual build process; we're purely copying artifacts, because they're just scripts. When we get on to the more advanced topics, there will be a build step that takes the Python code and turns it into a Python wheel and uses that wheel as the artifact rather than the raw code, but in this instance we don't need that. We set the trigger to enable continuous integration, which means that every time a pull request is completed into the master branch (which is specified there) a build kicks off automatically. Here we do Save and Queue, because this first time the auto trigger isn't going to fire, since we've already done the pull request, so we manually queue the first run; next time it will happen automatically. We just click Next through that, the build process runs for the first time, and it creates a build artifact that simply contains everything in the Git repository. You can see the job running: it initializes, checks out the code, publishes the artifact, and then does some other bits and bobs for Azure DevOps. When it's finished the job completes and we have our artifact. We're then going to take that artifact and deploy it into three different environments. In this instance those three environments are the same Databricks workspace, but there's no reason they have to be; with different keys for different workspaces they could even be in different regions, and you'd just set up each stage with a different workspace, possibly deploying and then creating jobs with different clusters. So here we're creating a release pipeline, giving the first stage the name testing. Later on we'll create other stages, but for now we leave it at that, because what we want to do is clone the testing stage once we've finished setting it up. We'll call the pipeline Notebook release pipeline, to show that we're taking the notebook and releasing it in a staged way. Click Add an artifact and add in the build; you can see it highlighted there, telling us it has created something called notebooks. Then we enable the continuous deployment trigger using that little lightning bolt, which means that every time a build finishes, that build is put into the pipeline to go into our testing, pre-prod, and production environments. Then we go to the Variables tab, add a variable group, and link up the one we created earlier.
It's as simple as that: we can then just start using those variables within the job. There's no complex Key Vault setup or anything like that; we can access the values because Azure DevOps already has permission to read them. In the testing stage we add a PowerShell task. I'm using an inline script; you could just as easily use a script from a repo, but in this instance it's a nice small script that's easy to follow and demonstrates what we need. Expanding it out so you can see the whole script on screen: at the top it just says where the documentation for this API is, which is always handy in your scripts. Then the filename is where we're copying from. Make sure, by looking into your artifact (which you can do in Azure DevOps by clicking Artifacts) or into your repository, that this path is correct for your Python file; I hit an issue when I was doing this because I'd used a different name and it wasn't finding the file. Under that we've got the new notebook name, which is what it will be called once it gets into Databricks. Beneath that we've got the secret, referenced as $(Databricks); that pulls the variable from the Key Vault, and DevOps takes care of it for us, so it really is that easy. We put "Bearer " in front just because that's what the API requires. Then we set the URI of the workspace, which you can get from your Databricks workspace. Beneath that we have to encode the file into base64, again a requirement of the API, so we read the file contents and encode them. The body then puts all of these things together; these are the options for the API, with overwrite set to true and the language set to Python (that's where you'd change it for Scala etc.).
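To make that concrete, here is a minimal sketch of the kind of inline script being described, calling the Databricks workspace import API. The artifact path, workspace URL, and target notebook path are placeholders, so check your own build artifact for the real file path; $(Databricks) is the Key Vault secret surfaced through the linked variable group, and $(System.DefaultWorkingDirectory) is the Azure DevOps predefined variable, both substituted before the script runs.

```powershell
# Where the notebook sits inside the downloaded build artifact.
# Placeholder path; check the Artifacts view of your build for the real layout.
$fileName = "$(System.DefaultWorkingDirectory)/_Build notebook artifact/notebooks/demo notebook.py"

# Where (and under what name) the notebook should land in the target workspace.
$newNotebookName = "/Users/you@example.com/demo notebook - testing"

# The token comes from the Key Vault-linked variable group; Azure DevOps
# substitutes $(Databricks) before the script runs and masks it in the logs.
$accessToken = "Bearer $(Databricks)"

# Target workspace URL plus the workspace import endpoint of the REST API.
$uri = "https://eastus.azuredatabricks.net/api/2.0/workspace/import"

# The import API expects the notebook source base64-encoded.
$fileBytes     = [System.IO.File]::ReadAllBytes($fileName)
$base64Content = [System.Convert]::ToBase64String($fileBytes)

# Build the request body: overwrite any existing copy, import as Python source.
$body = @{
    path      = $newNotebookName
    format    = "SOURCE"
    language  = "PYTHON"     # change for Scala, SQL, or R notebooks
    content   = $base64Content
    overwrite = $true
} | ConvertTo-Json

$headers = @{ Authorization = $accessToken }

Invoke-RestMethod -Method Post -Uri $uri -Headers $headers -Body $body -ContentType "application/json"
```

Because overwrite is true, repeated releases simply replace the notebook, which is what lets the same task be cloned unchanged into each stage.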
Here I'm just showing where you can see the username that's used, because you need to put your own path in so that the notebook lands in the right place. Then we convert the body to JSON (again just a requirement of the API), set up the headers, and Invoke-RestMethod simply calls the API. You can copy and paste these bits to handle multiple notebooks, or use a foreach loop to do all of the files; it's up to you whether you want to automate that. We can then click the Add button and clone the entire stage, calling it pre-prod, and once we go into the tasks there we see the exact same PowerShell script. Here we're just going to give the notebook a different name; realistically, in a production environment you would be sending this to a different Databricks workspace using a different key. Then we create one called production, so you can see that we're going from testing to pre-prod to production, with the lines showing the flow. You can set prerequisites for each of these, you could have it deploying in parallel to two different environments, and you can have different preconditions. If you click on the lozenge shape at the start of a stage you can set pre-deployment approvals, gates, and that kind of thing. Here I'm adding a pre-deployment approver, set to myself just for demo purposes, and saving that. I've only done that on pre-prod, so what's going to happen is it goes to testing, then I'm asked to approve the deployment to pre-prod. I did not set anything on production; obviously I would in real life, but in this instance it will just go ahead and deploy as soon as pre-prod finishes. We then create a release, which takes the build artifact and pulls it into this flow. We wait for the agent, and the agent runs the job: it initializes, downloads the artifact, and pulls the secret from the Key Vault. If you use the secret, by the way, it just shows up as three stars in the logs; even if you explicitly write-host the secret, it still shows up as three stars. You can then see I was asked to approve the pre-prod deployment (normally I would wait until I'd tested the testing environment), and we've now got the imported notebook as per the script, with the deployment to pre-prod in progress. I'm not going to click into it this time, but you can if you want to see all the fancy text scrolling past on the screen; it looks nice and techy. As soon as that succeeded it just queued into production, because we didn't set any preconditions, and you can see the notebook deployed for pre-prod and production being deployed. I'll stress again: do not deploy to the same workspace if you're doing this for real. That's purely for the demo; in reality you would have different workspaces so that you have full separation between your environments.

So that's the end of the demo for today; hopefully you've enjoyed it. As I said at the beginning, please hit that like button below if you enjoyed the video, and do subscribe to the channel so that you don't miss anything in the future. Thanks very much, and we'll see you next time.
Info
Channel: Dave Does Demos
Views: 7,683
Rating: 4.9487181 out of 5
Keywords: Databricks, Azure, DevOps, DataOps, Code Promotion, CI/CD
Id: R7tJZelEt-Q
Length: 19min 59sec (1199 seconds)
Published: Tue Feb 11 2020