DataOps 3 - Databricks Code Promotion using DevOps CI/CD

Video Statistics and Information

Captions
Hi, welcome back. Today we've got another demo about DataOps, and this one is about Databricks. The idea is exactly the same as with Data Factory: we'll have a Databricks workspace for development, where you use feature branches in Git. You write some code, and once you've saved and committed it, a pull request kicks off a build process on the collaboration branch, where all of the code merges together. That collaboration branch build then produces an artifact containing the notebook. Normally we would also be building and deploying libraries, but to keep the Databricks workflow simple, today we're concentrating purely on notebook deployment. Exactly the same methods work with libraries, so that may come in a later video, but you can absolutely work out how to do it from this demo. We then take that artifact into a release pipeline. In today's demo there's just a single staging area, which is testing; in your production environment you'd also have QA or pre-prod and a production stage, so you'd have multiple release stages. These are shown in a screenshot at the bottom of the instructions on the site. We then push the notebook into a Databricks workspace. In this instance I'm using the same workspace, just because it's easier to deploy one, but the way this works you could simply add a second, third, or fourth workspace in different environments with different tokens. The demo also uses Azure Key Vault so that we're not storing the access token in the PowerShell script that deploys the notebooks; this is a best-practice way of storing those credentials. It does mean you have to rotate those credentials every now and then and understand how to manage them, which may be the subject of another demo, but for today this is a nice, secure script that should do everything you need right out of the box. So with that, enjoy the demo, let me know in the comments below if there's anything else you'd like to see on the channel, and please hit that like button and subscribe so that you see new demos as they come along. Hope to see you next time. Thank you.

So first things first: log into your Azure portal and create a Key Vault. The Key Vault is going to hold the token for Databricks so that we don't need to store it in the script file within Azure DevOps. Give your Key Vault a name and set up a new resource group for the demo; in this case I've used DataOps for the group, and KV plus the current date in reverse order for the vault, which generally gives you an available name even when things need a public name. I've shown the access policy side, which configures who is allowed to get and put keys and secrets within the Key Vault. For the time being I've set that just to myself; in a production environment you'd probably set it to the team or whoever is in charge of the Key Vault, and later on we'll enable Azure DevOps to have access too.
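If you prefer scripting these portal steps, something like the following Az PowerShell sketch creates the resource group and Key Vault and grants secret permissions. The names here are placeholders following the naming pattern described above, not the exact values from the demo.

```powershell
# Placeholder names following the pattern above (resource group "DataOps",
# vault named "kv" + date reversed); adjust for your own environment.
$resourceGroup = "DataOps"
$vaultName     = "kv20200211"   # must be globally unique
$location      = "EastUS"

Connect-AzAccount
New-AzResourceGroup -Name $resourceGroup -Location $location
New-AzKeyVault -Name $vaultName -ResourceGroupName $resourceGroup -Location $location

# Equivalent of the portal's access policy step: grant a user (or a team,
# by repeating the call or using a group object ID) rights on secrets.
Set-AzKeyVaultAccessPolicy -VaultName $vaultName `
    -UserPrincipalName "you@example.com" `
    -PermissionsToSecrets get,list,set
```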
So once that's done, start creating an Azure Databricks workspace. Again, choose that same resource group; I've used ADB and the current date for the name. Choose the location; I always use East US just because everything is nice and available there. Once that's validated, click Create. We don't need to set up many settings here, because this environment isn't going to do much; it's just there so that we can use the Git integration for the demo (if you'd rather script this step, there's a sketch just below). Then go over to Azure DevOps, log in, and create yourself a new project. Again I'm using the DataOps name and keeping it private here; in a real environment you'd probably set the visibility to enterprise so that other people have access. Then click on Repos and initialize the repository with the button at the bottom, including a readme; this lets us use the repository for everything else. Once that's done, your Databricks workspace should hopefully be created. Log in to the workspace (that takes a second or so), and once you're in, click your user icon at the top and go to User Settings; this is where we configure the Git integration. As I'm showing on screen, you can use GitHub (which requires a token) or Bitbucket, but we're going to use Azure DevOps. Then click on Home or Workspace and create a new notebook. Here we're creating one called demo notebook and choosing Python, just because the code is set up for Python. You can change that to the language of your choice, but be a bit careful about using the scripts I've provided, because the release script does actually specify Python. Then click on the Git integration, click Link, go back into your repository, copy the URI, and paste it in here. For the branch, we're going to create a new branch called feature branch: just type a name in there and a Create Branch button appears.
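As an aside, the workspace creation can also be scripted. This is a rough sketch assuming the Az.Databricks PowerShell module is installed; the workspace name follows the ADB-plus-date pattern described above but is a placeholder.

```powershell
# Sketch only: creates a Databricks workspace equivalent to the portal steps.
# Assumes the Az.Databricks module (Install-Module Az.Databricks) is available
# and uses placeholder names following the pattern described above.
New-AzDatabricksWorkspace -Name "adb20200211" `
    -ResourceGroupName "DataOps" `
    -Location "EastUS" `
    -Sku standard
```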
Hit Create and Go, and the notebook will be synchronized. We put in a comment here, "init feature branch", and click Save; that commits the notebook into the repository so it's captured in Git. We then hide the revision history and just type a comment into the notebook. The only reason we need something here is to show that it's the same notebook between the environments, so don't worry about what you type, just make sure it's unique. Once that's done, click Save and make sure "Also commit to Git" is ticked. You don't always want that ticked, because sometimes you're just working locally and that's fine, but do commit when you've done some actual work. Back in the repository we can see that there is no code in the master branch, and the feature branch has the commit we've just made. Click on that and we can see the code that's been written. It does say "Databricks notebook source" at the top; that isn't visible in the notebook itself, so don't worry too much about it. We can then create a pull request by clicking the link within Databricks, which just launches the portal; there's nothing stopping you doing the pull request from within DevOps instead. Give it a name, make sure the from and to branches are set correctly, and click Create. In this instance we're just going to approve it and click Complete; obviously you'd have some workflow around that in your final environment. When we complete the merge, by default that removes the feature branch, so you'd then want to configure Databricks to use a new feature branch. Here we can see that master now has the code in it and the feature branch has disappeared.

Now we configure the key. Back in the Databricks user settings, go to Access Tokens, generate a new token, name it Azure DevOps or something relevant in your environment, and click Generate. You must copy this key while the box is on the screen; it is never shown again. Then go back into the resource group, into the Key Vault, and click Secrets, then Generate/Import. Give it a name of Databricks and paste the value in. This stores the value so that it can be requested by anything with permission: people looking into the Git repository won't be able to see the token or break into our systems, but when Azure DevOps asks for it, it will have access. I'm setting an activation date and an expiry date here so that we've got a record of when this expires, and we can look into this one centralized Key Vault and see when everything is going to expire so that we know to renew it (a scripted version of this step is sketched below). You'd probably want to automate that in a real environment. If you've got permission you can come in and see the secret at any time, but generally you would never access it this way again; you'd always use it through Azure DevOps or some other mechanism. So in Azure DevOps, click Pipelines and then Library, and create a variable group to link to that Key Vault: give it a name like Databricks var group and select "Link secrets from an Azure Key Vault".
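For reference, the secret-with-expiry step can also be scripted with Az PowerShell. This sketch uses the secret name from the demo, but the vault name and the 90-day lifetime are assumptions.

```powershell
# Store the Databricks personal access token as a Key Vault secret with an
# activation date and an expiry date, mirroring the portal steps above.
# The vault name and the 90-day lifetime are placeholders.
$token = Read-Host -Prompt "Paste the Databricks access token" -AsSecureString

Set-AzKeyVaultSecret -VaultName "kv20200211" `
    -Name "Databricks" `
    -SecretValue $token `
    -NotBefore (Get-Date) `
    -Expires (Get-Date).AddDays(90)
```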
We then need to authorize our subscription, and once that's authorized we can select the Key Vault from the drop-down list, which authorizes again; what this is doing is giving Azure DevOps permission to the Key Vault. Then we select the variables: all of the variables you select here become available to your pipelines. You can create multiple variable groups, say one for dev, one for QA, and one for prod, keep them nice and separate, and align a variable group to a particular stage rather than to the whole job. Next we create the build pipeline, which is exactly the same as in the recent Azure Data Factory videos: give it a name of Build notebook artifact, add a job to publish build artifacts, and point it at the Git repository. In this instance there is no actual build process; we're purely copying artifacts, because they're just scripts. When we get on to the more advanced topics, there will be a build step that takes the Python code and turns it into a Python wheel and uses that wheel as the artifact rather than the raw code, but in this instance we don't need that. We set the trigger to enable continuous integration, which means that every time a pull request is completed into the master branch (which is specified there) a build kicks off automatically. Here we do Save and Queue, because this first time the auto trigger isn't going to fire, since we've already done the pull request, so we manually queue the first run; next time it will happen automatically. We just click Next through that, the build process runs for the first time, and it creates a build artifact that simply contains everything in the Git repository. You can see the job running: it initializes, checks out the code, publishes the artifact, and then does some other bits and bobs for Azure DevOps. When it's finished the job completes and we have our artifact. We're then going to take that artifact and deploy it into three different environments. In this instance those three environments are the same Databricks workspace, but there's no reason they have to be; with different keys for different workspaces they could even be in different regions, and you'd just set up each stage with a different workspace, possibly deploying and then creating jobs with different clusters. So here we're creating a release pipeline, giving the first stage the name testing. Later on we'll create other stages, but for now we leave it at that, because what we want to do is clone the testing stage once we've finished setting it up. We'll call the pipeline Notebook release pipeline, to show that we're taking the notebook and releasing it in a staged way. Click Add an artifact and add in the build; you can see it highlighted there, telling us it has created something called notebooks. Then we enable the continuous deployment trigger using that little lightning bolt, which means that every time a build finishes, that build is put into the pipeline to go into our testing, pre-prod, and production environments. Then we go to the Variables tab, add a variable group, and link up the one we created earlier.
It's as simple as that: we can then just start using those variables within the job. There's no complex Key Vault setup or anything like that; we can access the values because Azure DevOps already has permission to read them. In the testing stage we add a PowerShell task. I'm using an inline script; you could just as easily use a script from a repo, but in this instance it's a nice small script that's easy to follow and demonstrates what we need. Expanding it out so you can see the whole script on screen: at the top it just says where the documentation for this API is, which is always handy in your scripts. Then the filename is where we're copying from. Make sure, by looking into your artifact (which you can do in Azure DevOps by clicking Artifacts) or into your repository, that this path is correct for your Python file; I hit an issue when I was doing this because I'd used a different name and it wasn't finding the file. Under that we've got the new notebook name, which is what it will be called once it gets into Databricks. Beneath that we've got the secret, referenced as $(Databricks); that pulls the variable from the Key Vault, and DevOps takes care of it for us, so it really is that easy. We put "Bearer " in front just because that's what the API requires. Then we set the URI of the workspace, which you can get from your Databricks workspace. Beneath that we have to encode the file into base64, again a requirement of the API, so we read the file contents and encode them. The body then puts all of these things together; these are the options for the API, with overwrite set to true and the language set to Python (that's where you'd change it for Scala etc.).
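To make that concrete, here is a minimal sketch of the kind of inline script being described, calling the Databricks workspace import API. The artifact path, workspace URL, and target notebook path are placeholders, so check your own build artifact for the real file path; $(Databricks) is the Key Vault secret surfaced through the linked variable group, and $(System.DefaultWorkingDirectory) is the Azure DevOps predefined variable, both substituted before the script runs.

```powershell
# Where the notebook sits inside the downloaded build artifact.
# Placeholder path; check the Artifacts view of your build for the real layout.
$fileName = "$(System.DefaultWorkingDirectory)/_Build notebook artifact/notebooks/demo notebook.py"

# Where (and under what name) the notebook should land in the target workspace.
$newNotebookName = "/Users/you@example.com/demo notebook - testing"

# The token comes from the Key Vault-linked variable group; Azure DevOps
# substitutes $(Databricks) before the script runs and masks it in the logs.
$accessToken = "Bearer $(Databricks)"

# Target workspace URL plus the workspace import endpoint of the REST API.
$uri = "https://eastus.azuredatabricks.net/api/2.0/workspace/import"

# The import API expects the notebook source base64-encoded.
$fileBytes     = [System.IO.File]::ReadAllBytes($fileName)
$base64Content = [System.Convert]::ToBase64String($fileBytes)

# Build the request body: overwrite any existing copy, import as Python source.
$body = @{
    path      = $newNotebookName
    format    = "SOURCE"
    language  = "PYTHON"     # change for Scala, SQL, or R notebooks
    content   = $base64Content
    overwrite = $true
} | ConvertTo-Json

$headers = @{ Authorization = $accessToken }

Invoke-RestMethod -Method Post -Uri $uri -Headers $headers -Body $body -ContentType "application/json"
```

Because overwrite is true, repeated releases simply replace the notebook, which is what lets the same task be cloned unchanged into each stage.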
Here I'm just showing where you can see the username that's used, because you need to put your own path in so that the notebook lands in the right place. Then we convert the body to JSON (again just a requirement of the API), set up the headers, and Invoke-RestMethod simply calls the API. You can copy and paste these bits to handle multiple notebooks, or use a foreach loop to do all of the files; it's up to you whether you want to automate that. We can then click the Add button and clone the entire stage, calling it pre-prod, and once we go into the tasks there we see the exact same PowerShell script. Here we're just going to give the notebook a different name; realistically, in a production environment you would be sending this to a different Databricks workspace using a different key. Then we create one called production, so you can see that we're going from testing to pre-prod to production, with the lines showing the flow. You can set prerequisites for each of these, you could have it deploying in parallel to two different environments, and you can have different preconditions. If you click on the lozenge shape at the start of a stage you can set pre-deployment approvals, gates, and that kind of thing. Here I'm adding a pre-deployment approver, set to myself just for demo purposes, and saving that. I've only done that on pre-prod, so what's going to happen is it goes to testing, then I'm asked to approve the deployment to pre-prod. I did not set anything on production; obviously I would in real life, but in this instance it will just go ahead and deploy as soon as pre-prod finishes. We then create a release, which takes the build artifact and pulls it into this flow. We wait for the agent, and the agent runs the job: it initializes, downloads the artifact, and pulls the secret from the Key Vault. If you use the secret, by the way, it just shows up as three stars in the logs; even if you explicitly write-host the secret, it still shows up as three stars. You can then see I was asked to approve the pre-prod deployment (normally I would wait until I'd tested the testing environment), and we've now got the imported notebook as per the script, with the deployment to pre-prod in progress. I'm not going to click into it this time, but you can if you want to see all the fancy text scrolling past on the screen; it looks nice and techy. As soon as that succeeded it just queued into production, because we didn't set any preconditions, and you can see the notebook deployed for pre-prod and production being deployed. I'll stress again: do not deploy to the same workspace if you're doing this for real. That's purely for the demo; in reality you would have different workspaces so that you have full separation between your environments.

So that's the end of the demo for today; hopefully you've enjoyed it. As I said at the beginning, please hit that like button below if you enjoyed the video, and do subscribe to the channel so that you don't miss anything in the future. Thanks very much, and we'll see you next time.
Info
Channel: Dave Does Demos
Views: 7,683
Rating: 4.9487181 out of 5
Keywords: Databricks, Azure, DevOps, DataOps, Code Promotion, CI/CD
Id: R7tJZelEt-Q
Length: 19min 59sec (1199 seconds)
Published: Tue Feb 11 2020