Everything as Code

Captions
[Applause] Hi everyone. All right, let's try that again: hi everyone! Awesome. It's the final talk of the conference, who's excited? Awesome. I'm the only thing that stands between you and alcohol, so I recognize that and I will go as slow as possible. I'm just kidding.

Today I'd like to talk to you about everything as code. For those of you that have ever seen me give a presentation before, you know that I like to do live demos, so I'm just going to set expectations from the beginning: there are no live demos. I'm sorry, but there aren't any. That's okay, because the content is still really interesting.

Thank you, Nick, for the introduction. For those of you that don't know me, my name is Seth. I'm a developer advocate for Google. Prior to that I worked at HashiCorp, so many of you may recognize me; I wrote tools like Consul Template and a bunch of other things. Today I work at Google, primarily focused on the HashiCorp tools and the DevOps ecosystem, making sure that GCP, Google Cloud Platform, is a really excellent place to use these tools. But I don't really want to talk about that today. What I want to talk about is code, particularly everything as code, and what's next as code.

For a little bit of background, let's time-travel back ten or fifteen years. If we were to go ten or fifteen years into the past, operations was really easy, relatively. It was hard, but it was really easy: we had one server in one data center. We called that server a mainframe. It was very large and it required a team of people, but those teams were actually managing the server at a very mechanical layer. They were fixing the glass tubes that would explode and rewiring things. These servers were huge, and they had punch cards that you programmed on, but they were relatively easy to maintain.

Then we take a step forward, and our data centers move from mainframes to servers. We go from having this huge machine that takes up an entire room to blade servers that we can put in a rack and network together. This was still relatively easy to manage by hand. Most companies, even large Fortune 500 or Fortune 50 companies, would only have ten, fifteen, maybe a hundred of these servers. You could name them, you could treat them like friends, you could invite them to dinner. They were easy to maintain by hand, even with a relatively small number of people. Each of these servers generally had one purpose: it ran one COBOL application or one Java application, and it had decent CPU and RAM, but it wasn't being used to its full potential.

But then the world got a little bit bigger. A lot of things happened: we became global, the economy became global, the need to connect systems internationally became global. We had to deal with things like regulation, where the content that I serve to people in a country has to differ based off the laws of that region. Additionally, we want to deliver content closest to the request: if I'm based here in Europe, I don't want my requests to have to go to a data center in Australia and back again in order to receive a response. We want to deliver and process content as close to the edge as possible. So we started building out multiple data centers, both for disaster recovery and for the ability to deliver responses and process requests as close to the origin as possible. This was great, but it was slowly becoming impossible to manage by hand: now I have to get on a plane to tend to my servers, or I have to have multiple employees in multiple different time zones to tend to these different data centers.
Another thing we recognized is that we had these huge fleets of compute, and we would look at the dashboards and see that we were using 10% of our CPU and 5% of our total memory capacity. This is where hypervisors came along. Instead of having one application on one server, now we're able to virtualize the operating system and the kernel, and we can provide an isolation layer that was never really possible before. We can isolate things in a secure way, or at least a more secure way, and we can have 15 or 16 different VMs running on one of these servers.

Then things get even more complex, because in this model we're running on shared hardware and relying on isolation at the operating system layer. This definitely requires a team of people, and it definitely requires some type of automation: we went from thousands of servers to tens of thousands or hundreds of thousands of VMs overnight. This is where tools like Chef, Puppet, Ansible, and Salt really come in, to help manage the complexity of an individual system, of one VM, of one server, and tools like Terraform come in to help manage the complexity of provisioning all of this infrastructure and keeping it up and running.

More recently we've entered into an era that looks something like this: a hybrid collection of containers and VMs and bare metal, and containers on VMs on bare metal that you may or may not own. It's a really, really complex world. And it doesn't actually end here, because with the proliferation of containers and microservices we also see what I like to call the proliferation of "aaS"es: software as a service, platform as a service, database as a service, object storage as a service, DNS as a service, CDN as a service. Instead of being a company that runs a globally distributed content network, you can offload that onto a service provider, like a cloud provider or someone like Akamai that provides a CDN as a service. But just because you offload that work doesn't mean the configuration goes away: you still need to configure the caches, you still need to configure the content expiration.

So we live in this amazing world where, if we go back in time fifteen years, literally none of this compute was possible. In my pocket I have more compute than was ever available fifteen years ago on something like a mainframe. But we also introduced so much complexity with this new kind of architecture. It's a complex mix of different tools and different technologies, trying to bridge the gap between legacy applications and cloud-native applications, VMs and containers, on-prem and cloud, and we need a strategy for managing this complexity. What's worse is that this diagram doesn't even include some of the most critical parts: it doesn't include anything about security, policy, compliance, or regulation. There's no monitoring, logging, or alerting. These are all third-party subsystems that we have to maintain and keep up and running. Who monitors the monitoring system? How do we integrate all of these components together in a global way? The point here is that we live in a world that demands tooling to manage complexity. It is no longer an option: you have to have tooling and automation in order to manage this complexity, or you will go insane trying to manage it by hand.
So we have to ask ourselves: why? Why did we make things more complex? It seems like mainframes would have been easier. Yeah, they were slower, but they were easier. And that turns into what I'd like to talk about, which is the APUD cycle. How many people are familiar with the APUD cycle? You shouldn't be, because I made it up. The APUD cycle is very straightforward: it corresponds to the four pillars of the evolution of infrastructure: acquire, provision, update, and then delete or destroy. Look at that animation.

When we take a look at the modern world we live in, this hybrid data center, multi-cloud world: in the past, in order to acquire compute, we had to pick up a phone and call a vendor, and that vendor would process a purchase order, and six to nine weeks later we would get boxes with cardboard packaging, and we would have to unpack them, put a server in a rack, screw it in, and connect data cables. We don't live in that world anymore, but that was a real world, and there are still people who work in data centers who are still doing that: unboxing Dell and IBM servers and putting them in a rack. Then we had our data center operations team that would come in once that server was connected to the network, the local network or whatever you might call it, and provision it: put on the initial users, the initial operating system, the initial software packages, and so on. Then there was probably another team that would manage that system over time: OpenSSL is vulnerable for the fourth time this month, gotta update again. There's a team constantly managing that. And then we have our data center operations team, which was responsible for decommissioning those servers.

Fifteen years ago this was a very painful process. The vendor acquisition process would take weeks if not months to go through legal and purchasing and shipping to acquire new compute. The data center operations process could take days or weeks, depending on the backlog, just to provision a new server. The updating process would also take hours or even days depending on the backlog, and the decommissioning or destroy process would take days.

So what changed? Probably the biggest thing that changed is the introduction of cloud. Cloud technologies and hybridization technologies have allowed us to start treating things like compute, networking, and storage as resources, and we don't have to think about the underlying physical machines anymore. Cloud providers helped shift the acquisition of compute, storage, and network from weeks or days down to minutes or even seconds. I can fire off an API call right now and have more compute than ever existed a year ago, in a single API call. That's what's available to us today. Then tools like configuration management took the middle parts, the provisioning and the updating, down from days to minutes or even seconds, to update these systems and keep them updated.

So we have strategies for managing this complexity. They're kind of all over the place, and there's a lot of crossover between all of these disparate systems, but there's also a heck of a lot going on. This is a word cloud of the top 2017 Hacker News posts, filtered by IT. If you look very closely, the word "buzzword" is on that word cloud. But if we just take a look at some of these, we have, obviously, Kubernetes, Terraform, Docker, cloud, DevOps, serverless. There are so many things going on, so many tools, and so many choices that sometimes you feel like your head's going to explode. It's my favorite GIF ever.

So why do we feel like this, and more importantly, what was our original goal? It seems like mainframes might have been better, right? Mainframes did not have all of these problems, yet we're in a world where we demand tooling. It's not optional. We have so much tooling and so much automation just to get through our daily lives. Like I said before, we have to have a strategy for managing this complexity, and time and time again that strategy has been code. This vectors in from the application world; we see it time and time again with tools in this ecosystem, where code is always the natural strategy for managing complexity.

So what is codification? The term "codify" is very straightforward: it means you're going to capture a process, routine, or algorithm in a textual format. Basically, write something as text. That codification may be declarative or imperative, meaning it may describe the end result, or it may describe the series of steps to take. For example, a recipe to make a cake or a casserole is imperative: it says do this, then that, step 1, step 2, step 3. Some systems are declarative; they instead just say "this is the desired state, I don't care how you get there." Terraform is a great example of this: in Terraform, everything is parallelized by default, so you don't actually have a lot of control over what happens before other resources unless you explicitly declare those dependencies.
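To make the distinction concrete (a minimal sketch of my own, not from the talk), here it is in Go: the imperative version prescribes every step in order, while the declarative version only states the desired end state and lets a reconciliation loop derive the steps, which is roughly how tools like Terraform behave.

```go
package main

import "fmt"

// Imperative: the caller prescribes each step, in order.
func imperativeDeploy() {
	fmt.Println("1. create network")
	fmt.Println("2. create VM")
	fmt.Println("3. attach VM to network")
}

// Declarative: the caller states only the desired end state.
type State struct {
	NetworkExists bool
	VMCount       int
}

// reconcile compares desired state with observed state and derives
// whatever steps are needed; the caller never specifies them.
func reconcile(desired, observed State) {
	if desired.NetworkExists && !observed.NetworkExists {
		fmt.Println("creating network")
	}
	for i := observed.VMCount; i < desired.VMCount; i++ {
		fmt.Println("creating VM", i+1)
	}
	for i := observed.VMCount; i > desired.VMCount; i-- {
		fmt.Println("destroying VM", i)
	}
}

func main() {
	imperativeDeploy()
	// Desired state: one network, three VMs, starting from nothing.
	reconcile(State{NetworkExists: true, VMCount: 3}, State{})
}
```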
So let's look at some existing ways in which we manage complexity with code. Configuration management is obviously one of the biggest ones. To manage the complexity of a single machine, operating system, or VM, we have config management tools like Chef, Puppet, Ansible, and Salt that codify and automate a machine's definition. Chef recipes, Puppet modules, Ansible playbooks, whatever Salt calls their things: those are the codification of these systems. And then Chef, Puppet, Ansible, and Salt are the technologies, the tools, that enforce that codification. They might be executing a Python file or reading a Ruby script; it doesn't matter what the implementation is, they're the enforcement of that codification.

When we look at containers, things like Docker and OCI, we have a tool for managing the complexity of application requirements: my application needs this exact version of Python with this exact set of system dependencies, and containers are a great way to ship that application and all of its dependencies in one unit. The Dockerfile is the codification; it's the textual format capturing the application and its dependencies. And then Docker, or OCI, or whatever your container runtime is, is the automation that enforces that Dockerfile, building and running the application.

In the infrastructure as code world we have tools like Terraform, which manage the complexity of infrastructure at scale. Terraform configurations are the codification: a single text file or multiple text files that describe infrastructure and the relationships between resources. And then we have the tool, again Terraform, which applies those configurations: it reads them and enforces them to bring about a desired result in the correct order.

Those are the ones we're probably most familiar with, but we're starting to see some emergence in other spaces. In the CI/CD world, for example, there's the Jenkinsfile and the .travis.yml file, where now we're using code to describe our build systems. If you've ever used Jenkins in the past, you know that most frequently Jenkins is configured via the web UI, but now there's a desire to start capturing those configurations, the build steps and the build output, as code, and versioning that with the application. In these examples, the YAML file or the Jenkinsfile is the codification, and then the tool, Jenkins, Travis CI, Circle CI, is the automation or tooling driving the instantiation of that file.

We also see security and compliance complexity managed with things like APIs and policy files. As the surface area of microservices grows larger and larger, securing it becomes difficult to reason about, and we need a tool and a strategy for managing the complexity of security at scale. For example, with Vault, the policy configuration is the codification, describing how our services should be able to get credentials and secrets, and then Vault is the automation, the enforcement, of that policy, of that code. And then obviously we have container orchestrators like Kubernetes and Nomad, where you have a YAML file or an HCL file that you submit, and the orchestrator runs it and executes it such that the end result is an application or a service or a load balancer that is running.

So there clearly exists this incredibly well-defined pattern for using code. Applications use code all of the time, but we've seen it in config management, we see it in infrastructure as code, we see it everywhere. Which begs the question: why? Why is code such a valuable tool for managing complexity? It turns out there are a few really important reasons.

The first is linting. Very straightforward: once something is captured as text, as code, we can enforce our own opinions on that code. We can do static analysis, we can do linting, we can do alerting. We can go as simple as a regular expression or as complex as machine learning, but that analysis of the code leads to linting, and that linting can enforce consistency. Especially in a large organization or a broad community, consistency is a key component of adoption. If you want a tool or a technology to succeed in its adoption, it has to be consistent. You can't have team A doing something one way and team B doing it another way and expect them to collaborate effectively. You have differing opinions, and this is where tools like Go and others have built-in formatters to help enforce that and alleviate those arguments.
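As a toy example of that idea (mine, not the speaker's), a lint can be as small as a regular expression over your configuration files that exits non-zero when an opinion is violated, failing the build:

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// A deliberately tiny lint rule: flag any hard-coded IPv4 address in a
// config file, on the opinion that addresses belong in variables.
var hardcodedIP = regexp.MustCompile(`\b\d{1,3}(\.\d{1,3}){3}\b`)

func main() {
	failed := false
	for _, path := range os.Args[1:] {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		for _, match := range hardcodedIP.FindAllString(string(data), -1) {
			fmt.Printf("%s: hard-coded IP %q; use a variable instead\n", path, match)
			failed = true
		}
	}
	if failed {
		os.Exit(1) // a non-zero exit fails the CI job
	}
}
```

Because the configuration is just text, the same check runs identically on a laptop and in CI, which is what makes lint-enforced consistency practical at organization scale.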
Once you have linting, you can take that a step further and bring in testing. The moment you capture something as code, you have a significant ability to test those configurations. It might be just a client-side test, which could be a lint where you exit 1 if the lint fails, or it might be something more difficult and time-consuming: we might spin up an entire copy of our production cluster in a different environment, in a different region, and run a bunch of tests against it. Because we've captured it as code, not only do we guarantee that we're getting the same result, but we can iterate on it and automate it over time.

Code also gives us collaboration, and this is a really key benefit of capturing something as code. Once we have a common format, once we're speaking the same language, we can do things like pull requests, change requests, merge requests. We can automatically test integration with CI and CD workflows, and automatically build a workflow that works for collaboration. Collaboration is a key piece of why we capture things as code. I wouldn't say it's the driving motivator, but when we talk about complexity, and the reason we're trying to manage these complex things, it's that they're difficult for one person to reason about. Having collaboration, having checks and balances, having the ability for people to be on the same page and share the same ideas, is a key reason why you might capture something as code, and capture everything as code.

On the exact opposite point, we have separation of concerns. A lot of these tools provide some sort of modularization: Chef recipes, Puppet modules, Terraform modules. They provide an isolation layer where, if we're in a large organization or we're trying to support a community, we can define our problem domain very specifically. We don't have to solve the world; we can solve our domain. We can say: this is the set of problems we solve, here is the module, or the Ruby gem, or the Python package, that does exactly that, and it's your job to use it. I call this the Lego principle: we can't all be experts in every domain. We can only be experts in red Legos with four dots. So you pick your red Lego and you build the best red Lego that you can, and you expect other people to build the Star Wars spaceship that's awesome and amazing, or the roller coaster, because you can't possibly reason about all of that complexity and be a domain expert in all of those fields.

Code also allows us to do really cool things with modeling abstract concepts. When you take something and capture it as code, there are a lot of third-party resources and tools that let us visualize that code. Take the Terraform tool, for example: you can graph all of the relationships between all of your different nodes and dependencies in Terraform, and that's a great way to visualize all of your infrastructure in one PDF or dot file. You can actually take the output of that, which is DOT, an open format, and feed it into other tools. In the academic world there are tools that accept DOT input and provide a 3D visualization world where you can explore your infrastructure in three dimensions instead of two. This is a really helpful way to model relationships and dependencies in this incredibly complex world. When it was a mainframe, everything was right here, but now we have things all over the world, with different orders and different egress and ingress, and we need a strategy to be able to think about that. Sometimes we struggle to just think about these things.
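The DOT format mentioned here is plain text, which is the whole point: once infrastructure is captured as code, deriving a diagram from it is mechanical (Terraform itself does this with `terraform graph`). A hedged sketch of the idea in Go, with made-up resource names:

```go
package main

import "fmt"

// dependencies maps each resource to the resources it depends on.
// The names are illustrative, not from a real configuration.
var dependencies = map[string][]string{
	"aws_instance.web":       {"aws_subnet.public", "aws_security_group.web"},
	"aws_subnet.public":      {"aws_vpc.main"},
	"aws_security_group.web": {"aws_vpc.main"},
}

// Emit a Graphviz DOT digraph; pipe the output into `dot -Tpdf`
// (or a 3D viewer that accepts DOT) to visualize it.
func main() {
	fmt.Println("digraph infrastructure {")
	for resource, deps := range dependencies {
		for _, dep := range deps {
			fmt.Printf("  %q -> %q;\n", resource, dep)
		}
	}
	fmt.Println("}")
}
```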
So really, what am I saying? What I'm saying is that this entire talk is actually about theft. I am 100% encouraging larceny. Not grand larceny, just, like, mini larceny. What I mean by that is: if you take a look at what we're doing today in the infrastructure and operations space, it's everything that application developers have had for the past five to ten to fifteen years. CI/CD? Not a new concept for application developers. Code? That is what application developers write. Source control, pull requests, collaboration: these all came out of the application developer workflow. This idea that we should be working together, this idea of breaking things down into microservices, these are all coming from the application development workflow. So we have to ask ourselves: if what we see happening now in the infrastructure world is what happened 10 to 15 years ago in the application development world, what does infrastructure look like in 5, 10, and 15 years? What is next?

For the rest of this talk I'm going to pose to you, based off of what we currently see in the application landscape, what is next for infrastructure. To date, to the best of my knowledge, none of this exists.

At a very high level, the number one thing we're going to see in the operations space is less operator intervention. There's a kind of famous quote that goes, "my job is to make myself obsolete," and I think more and more, as technology evolves, we're going to see less and less operator intervention. Operators are going to be creating fires, not putting them out. Look at companies like Netflix that have chaos engineering (I think they just renamed it): instead of constantly responding to fires, my system is so stable that I now inject fires in order to see how my system responds, whether it's resilient.

To a certain extent we have auto-scaling, and auto-scaling exists today, but not at a level where we have deep application insights into exactly what we need to scale. Do we need to scale CPU? Do we need to scale RAM? Do we have direct insights between our monitoring, logging, and alerting to know exactly what we have to scale, exactly when we have to scale it, and for how long, based off of historical data? If I'm an e-commerce site, can I preemptively auto-scale for the holiday season so that I can maintain capacity? I want to be proactive, not reactive. Right now auto-scaling is almost entirely reactive: based off of the current ingress from my load balancer, kick off some type of auto-scaling. I want to be proactive. I want to already have the capacity before the load hits. That's what we're going to start to see coming in the next five to ten years.
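A sketch of the difference (my own illustration, with made-up numbers): a reactive policy sizes the fleet for the load it sees right now, while a proactive one also sizes for the load history predicts, say, the same hour in previous weeks.

```go
package main

import "fmt"

// instancesFor converts requests/sec into a node count, assuming
// (hypothetically) each node handles 100 req/s, plus one spare.
func instancesFor(reqPerSec float64) int {
	return int(reqPerSec/100) + 1
}

// Reactive: size the fleet for current load only.
func reactiveTarget(currentLoad float64) int {
	return instancesFor(currentLoad)
}

// Proactive: size the fleet for the worst of current load and the
// load observed at the same time in previous weeks (e.g. holidays).
func proactiveTarget(currentLoad float64, history []float64) int {
	expected := currentLoad
	for _, past := range history {
		if past > expected {
			expected = past
		}
	}
	return instancesFor(expected)
}

func main() {
	current := 250.0                         // req/s right now
	sameHourPastWeeks := []float64{240, 900} // a traffic spike last week
	fmt.Println("reactive: ", reactiveTarget(current))                     // 3
	fmt.Println("proactive:", proactiveTarget(current, sameHourPastWeeks)) // 10
}
```

The proactive target already has the capacity in place before the spike repeats, which is exactly the holiday-season scenario described above.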
Another one is automated security scanning. "Wait, this exists today." Yes, automated security scanning exists today. Many cloud providers provide some level of automated security scanning for containers and applications, and there are tools like Black Duck which will analyze different software and software dependency packages. But they're not in a state yet where they know everything. They primarily rely on CVEs, reports from external people saying that software is vulnerable. They're not using fuzzing, they're not actively pen-testing your applications. Instead they're pulling from a CVE database, parsing your entire dependency tree, and saying, "okay, 15 layers down the stack there's a vulnerable version of a package, you should update it," and you get an email, and you wake up in the morning, and you're like, "oh, I should patch all of my systems." And that's really great, and it's going to get a lot better, because we're going to start seeing these security scanning tools really integrating with things like fuzzing. No longer will they just be scanning a vulnerability database that a human put some CVE into; instead they'll actually be penetration testing. It goes back to chaos engineering: no longer are we just putting out fires, we're trying to cause them, so that we cause them in a controlled environment before our users can, and before our users can be brought malicious harm.

So imagine a world where you wake up to an email from some automated system that says, "hey Seth, 50% of your applications have a vulnerable version of this package, you should patch them." That's a great world, and some of that exists today, but it's not in a mature state where we can actually rely on it with confidence. But let's say we get to that world, and I wake up. That's still an operator, right? That's still an operator who has to SSH into systems and push Chef configs or Puppet configs to fix it everywhere. There are still manual tasks that have to take place. So the next logical step is automated security patching, and this is not happening, to the best of my knowledge, today. The reason I think it isn't happening is not a lack of available technology; it's a lack of willingness to change culture.

How many people here would feel comfortable if tomorrow morning you woke up to an email from your logging or monitoring system that said: "hey, there was a vulnerable version of a package; I patched it on 98% of your fleet, and for the 2% that weren't patched, I removed those nodes from the load balancers so they're not serving requests; let me know when you fix them"? How many people would feel comfortable receiving an email like that? That's some Skynet right there. As an industry, some of us are like, "yeah, all in, I'm so in on that," and then a bunch of people are like, "but job security!" There's a cultural movement that has to happen, where our job as systems engineers or systems operators or DevOps engineers or whatever your business card says shifts from putting out fires to causing them. Only when that happens does something like automated security patching make sense. Only then will we be confident enough to say, "oh yeah, that's great, I'll deal with it in the morning." We have to be comfortable with these systems using automation.
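The workflow that hypothetical email describes could be sketched like this (entirely illustrative; the fleet, patch, and load balancer calls are stand-ins, not a real API):

```go
package main

import "fmt"

// Node is a stand-in for a machine in the fleet.
type Node struct {
	Name    string
	Patched bool
}

// patch pretends to upgrade the vulnerable package; in this sketch
// one node fails, as some always will in a real fleet.
func patch(n *Node) error {
	if n.Name == "web-3" {
		return fmt.Errorf("%s: package manager timed out", n.Name)
	}
	n.Patched = true
	return nil
}

// drain is a stand-in for removing a node from the load balancer
// so it stops serving requests until a human intervenes.
func drain(n *Node) { fmt.Printf("drained %s from load balancer\n", n.Name) }

func main() {
	fleet := []*Node{{Name: "web-1"}, {Name: "web-2"}, {Name: "web-3"}}
	var failed []string
	for _, n := range fleet {
		if err := patch(n); err != nil {
			drain(n) // fail safe: unpatched nodes stop serving traffic
			failed = append(failed, n.Name)
		}
	}
	// The "Skynet email": report what was done and what needs a human.
	fmt.Printf("patched %d/%d nodes; needs attention: %v\n",
		len(fleet)-len(failed), len(fleet), failed)
}
```

Note that nothing in the sketch is technically exotic; as the talk argues, the barrier is trusting a system to take the drain action without asking first.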
The last thing we're going to see is more intelligent insights, and this is probably the deepest one. If you were to look at what's biggest right now in the developer landscape, I think many of us would agree that it's AI and ML. If you're applying for VC funding, you have to put AI, ML, or blockchain somewhere in your pitch or you're not going to get funding, and I refuse to talk about blockchain in any public manner, so we're going to talk about AI and ML.

Artificial intelligence and machine learning are really, really hyped in the application developer landscape right now. I can take a picture of a receipt on my phone, and it uses AI and ML and OCR to decipher all of the text and auto-submit it, so I don't have to type in numbers. That's a great world. But can we leverage that technology in the operations space? What does that look like? What does TensorFlow in operations look like? Instead of analyzing pictures to decide which one is the best photo of your cat or your dog, or analyzing photos to find "this is mom, this is dad, these are the kids," what if we're analyzing logs and metrics, and using machine learning and AI to do intelligent alerting, based off of advanced heuristics that aren't available from some regular expression engine or the complex rule-based things we have now?

There are certain things that humans are really good at, like empathy, and there are certain things that machines are really, really good at, like anomaly detection. There's a really fun video on YouTube: how many people have ever played those spot-the-difference picture games, where they give you side-by-side pictures and you have to tap on the differences between them? On average, with TensorFlow, we can solve that in less than 100 milliseconds, but for humans, the clock runs out; you lose every time. It's a game, but machines and computers are really, really good at anomaly detection. So when you have a hundred thousand log lines spewing past you, because you have that many microservices distributed across the world, how do you find that one log line that looks different? That one log line where there was a blip, and maybe that blip only happens every once in a while, but it led to a poor user experience, or worse, that blip is actually an attacker, a hacker, an intruder doing some type of malicious operation? There's so much data, so much noise, that you're never going to see it, and attackers rely on that fact. So we need to be able to use AI and ML to ingest all of our logs and all of our data and actually point us to the things that matter. You might argue that you shouldn't be ingesting log data that doesn't matter, and that's a whole different discussion, but we need anomaly detection. As our systems scale, as our services scale, as the amount of data and the number of users scale, we have to rely on technology to find the anomalies.
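Even without a neural network, the shape of the idea fits in a few lines. A hedged sketch: flag any metric sample more than a couple of standard deviations from the mean (real systems would learn far richer baselines than this):

```go
package main

import (
	"fmt"
	"math"
)

// anomalies returns the indices of samples more than k standard
// deviations from the mean: a crude stand-in for learned baselines.
func anomalies(samples []float64, k float64) []int {
	var mean float64
	for _, s := range samples {
		mean += s
	}
	mean /= float64(len(samples))

	var variance float64
	for _, s := range samples {
		variance += (s - mean) * (s - mean)
	}
	stddev := math.Sqrt(variance / float64(len(samples)))

	var out []int
	for i, s := range samples {
		if math.Abs(s-mean) > k*stddev {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Request latencies in ms; one blip a human would scroll past.
	latencies := []float64{102, 98, 101, 99, 103, 97, 912, 100, 104}
	// k=2: a single huge outlier inflates the stddev itself, so a
	// modest threshold is needed with a naive estimator like this.
	fmt.Println("anomalous sample indices:", anomalies(latencies, 2))
}
```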
Next, look at distributed tracing. We already have tools like OpenCensus that are really great at distributed tracing, but what about applying AI and ML to distributed tracing to detect rogue actors? What does a normal request path through my application look like? Normally, requests hit microservice A, then microservice B, then microservice C, and then the database for persistence. Given that data, having learned on it with machine learning, I pipe that into OpenCensus or whatever I'm using for distributed tracing. If all of a sudden I see some traffic going from microservice A to D to Q and all the way back to the database, that's anomalous. That's something I should alert on, because it could be a malicious actor in the system, it could be a bad deploy, it could be a misconfiguration. These are ways in which we can use AI and machine learning, again, technologies that application developers are already leveraging today, to bring this to the infrastructure and operations world.
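A minimal version of that check doesn't even need ML, which makes the idea easy to see: record the request paths observed during normal operation, then alert on any path outside that set (my own sketch; the service names mirror the A-to-D-to-Q example above):

```go
package main

import (
	"fmt"
	"strings"
)

// PathDetector remembers request paths seen during a learning phase.
// A real system would learn a statistical model; a set of known-good
// paths is the simplest possible stand-in.
type PathDetector struct {
	known map[string]bool
}

func NewPathDetector() *PathDetector {
	return &PathDetector{known: make(map[string]bool)}
}

// Learn records a normal path, e.g. from historical traces.
func (d *PathDetector) Learn(path []string) {
	d.known[strings.Join(path, ">")] = true
}

// Anomalous reports whether a traced path was never seen in training.
func (d *PathDetector) Anomalous(path []string) bool {
	return !d.known[strings.Join(path, ">")]
}

func main() {
	d := NewPathDetector()
	d.Learn([]string{"A", "B", "C", "db"}) // the normal route

	trace := []string{"A", "D", "Q", "db"} // the suspicious route
	if d.Anomalous(trace) {
		fmt.Println("alert: unseen request path:", strings.Join(trace, " -> "))
	}
}
```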
And then, lastly, everyone's favorite thing: functions, or serverless. We're seeing a lot of adoption from application developers using serverless functions. I hate the word "serverless", that's why it's in parentheses here, and I say "functions" instead, because it's not serverless: it's someone else's servers. I've seen them. You just don't have to manage them anymore, and that's okay. There are a lot of benefits to using serverless: there are cost-saving benefits, there are time-saving benefits. But we introduce a lot of problems when leveraging serverless technologies as well. When that serverless function dies, it is gone, and you are now relying on historical logs and data to figure out what the heck happened inside of that function or inside of that stored procedure. All of these tools and techniques and functions are really just a way to offload operations.

And it goes back to the crux of my talk, which is that we're trying to capture everything as code, and the reason we're trying to capture everything as code is so that we can evolve to the next operator. What is systems administrator 2.0? What is DevOps engineer 3.0? It is only when we switch from putting out fires to causing them that that shift will happen. In order to stop putting out fires, we have to embrace code, we have to embrace automation, we have to embrace new technologies, and we have to embrace an acceptable level of failure.

One of my favorite analogies is when we talk about system availability and someone says, "my system must be 100% available." There is no such thing as a 100% available system, because there inherently exists unreliability in the ecosystem between your user and your data center. If they're on any home network, whether it's fiber or cable or a dial-up line, there's an SLA that that content provider is giving, maybe three nines or two nines of availability. If you're on a mobile network, there's an SLA for the availability there. So even if your service is 100% reliable, an end user is never going to see that, and between you and me, they're going to blame Verizon and Comcast far before they ever blame your application. So accept risk. We have to be willing to accept risk. Define what is an acceptable level of availability for your application, so that you can be risky, so that you can automate things that have never been automated before, so that you can try new things. If you go down for five seconds, you go down for five seconds, and unless you're in a world with mission-critical, time-sensitive data, that's okay, and we have to be willing to accept that.

So, to conclude here today: what's next for operations is everything as code. I think you're all at this conference today because one of the philosophies of the Tao of HashiCorp is everything as code. It's a key pillar of all of the products that HashiCorp builds. You already believe in that, but some of your peers might not. Everything is code, everything has to be code, whether it's security, policy, automation, compliance, or infrastructure. It all has to be code, for all of the benefits we talked about. Once it's code, we can start moving from putting out fires to creating fires, with less operator intervention, and once we're intervening less on the typey-typey, we can do more of the strategic things, more intelligent insights. Our systems are only going to grow in complexity; nothing has gotten simpler in the past 20 years. That was the whole point of the beginning: mainframes were far simpler. We're only going to see increasing complexity, and we need to leverage new technologies and new techniques to manage that complexity. And lastly, we're going to see a significant rise in serverless, or functions, this idea of "just run my code, and I don't care about the system it's running on or the requirements of that system."

Thank you very much for having me here today. I'll be around if there are any questions. Thank you for coming to the conference. [Applause]
Info
Channel: HashiCorp
Views: 6,348
Rating: 4.7685952 out of 5
Id: HcmPi7-IVQo
Length: 35min 41sec (2141 seconds)
Published: Wed Jul 11 2018