Managing a Large and Complex GCP Migration (Cloud Next '19)

Captions
[MUSIC PLAYING]

VENK SUBRAMANIAN: So you know any migration is always complicated, because migrations aren't a thing that we do as a business, right? When you start a technology business, your goal isn't to do a migration. So it's a rare occurrence in anyone's life, and it always brings with it anxiety and a lack of understanding of what we need to do to really be able to push this ball forward. So we're not here to tell you what technologies to pick, or how to avoid that weird GKE load balancer issue that happens every now and then. But rather, if you're an engineer here, if you're a manager here, if you're a leader here, each one of you today owns a piece of your business, right? You're no longer just doing work. You actually own a piece of your business. So our goal here is to teach you, as an owner of that piece of the business, what you can do to plan and execute effectively on the piece that you own, and also how you can collaborate better so that, as an organization, you can manage something that can be pretty large and complex with a fair number of unknowns. So the way we're going to do it is give a little bit of information about who the key players were here, what Google brought to the table, and who Unity is. And we're going to talk about the why, the vision, what our partnership was, and what the goals were of what we were trying to achieve here. And then, finally, we're going to give you details about the approach-- what we did, how we did it, what worked for us, what didn't work for us, and hopefully, takeaways that you can walk away with that will help you make your next migration a lot easier.

SOM ROY: So as we start, I just wanted to quickly talk about the Google Professional Services Organization. There might be some other folks in the room who are working with your TAMs or your PSO team on the ground. If not, you know, PSO's mission is to help customers get the most out of Google Cloud. And basically, we go to market in three separate pillars. We have our consulting and migration services. We have our Technical Account Managers, the TAMs, which some of you are working with already. And then we have our training and certification. So these three together form our Google Cloud Professional Services, and our aim is to work almost as an extension of your team and get all the feedback back over to our product and engineering. So as we start, we'll talk about the Unity and Google partnership first. Venk, do you want to?

VENK SUBRAMANIAN: Yeah. So who is Unity? I'm hoping a fair amount of you know us, but I'll cover it anyway. We are the world's most popular 3D content creation system. And we started with games. We expanded into the 3D content creation space. But really, what we are is a complete game creation platform. Our pillars of create, operate, and monetize allow game developers to build a business. Because today, game development is not just about building games. We have to build a viable business around it. And game developers, what they do best, is build games. So Unity's goal is to take away the rest of it and make it really easy for them to build a viable business around their games. Now, we've been in play for a while, and we are wildly popular in many spaces. We've had over 29 billion installs with some kind of Unity experience in them within just the last 12 months. 60% of all AR and VR content today is powered by Unity. And 50% of all new mobile games today are built on Unity. And then the Unity Google Cloud partnership?

SOM ROY: Yes.
So as you can see here, I'm already sporting it on my jacket. So Unity and Google Cloud also partner together on this thing called Connected Games. The whole point of Connected Games is to connect players to each other and players to developers as well, and basically provide a more enriching experience to game players. So Google Cloud and Unity specifically are partnering on that. Again, as you can see, we have a lot of initiatives with Unity going on across the company. But this is something very specific between Google Cloud and Unity. And if you need more information, you can read up on Connected Games on the Unity blog. All right, so as we start, Venk, do you want to give an overview of your engineering side?

VENK SUBRAMANIAN: Yeah. So one of the unique challenges we had was that our team is extremely global. Unity today is in over 30 offices, in over 22 countries worldwide. We have close to 10 key locations across the globe. So a company that is built like this cannot operate in an extremely top-down structure. So we focus very heavily on collaboration and empowerment. And I'm bringing this up because, as you're starting out on your migration, it's very important for you to know how your organization is structured. This actually plays a very important part in how you're going to plan and how you're going to execute on the migration. So for us, this was a unique challenge that I wanted to highlight.

SOM ROY: While it's really cool on Unity's side to be globally distributed, it becomes a really big challenge when you're a services arm and you're trying to achieve the migration. So we started thinking about, from the Google PSO side, how do we map a team to address Unity's global and distributed nature of business. So we started with teams in San Francisco, Seattle, and Austin in the Americas. We quickly realized that we had to scale up our team in EMEA, across Stockholm and Helsinki, where one of their key business units is. We also had a team out there in London. And as you can see in the previous slide, Unity does have an office in Shanghai, China. Google doesn't have a PSO team in China, so to cover that in the same time zone, we had a team in Singapore. Now, this is really important because local time zone interaction during a tight schedule and timeline is really important. And having that local support where the migration is happening, this is really-- that's why we went with this global nature of the team. Also, sometimes, when issues come up, the teams cannot wait eight hours for the US to wake up and actually get their questions answered. So removing the local blockers-- like, removing blockers in local time-- is a really, really important thing. The third thing is the TAMs across the globe were kind of always keeping tabs on how the migration was going. So, again, we had TAMs in all the regions that we saw. All right, so when we try to summarize the whole migration journey in one slide, it's really, really difficult. Because there are, as Venk said, so many business units across so many cities. Again, there was a very tight timeline. The migration had to be completed within 2018. So we tried to represent the whole journey in at least one slide. The important thing that I want to draw your attention to is, if you see the red box there, which says pre-migration, and then the green box, which is the migration. This is really important.
As Venk can kind of agree to, we spent a lot of time building out the foundation in the first three months. Now, there were concerns. Are we moving fast enough? Are applications coming on to cloud? And we basically, as a joined team together, had to push back and say that it's really, really important to set up your foundation first. You need to set up your network, your IAM, your security controls. And what we saw is, because we spent three dedicated months in the first phase of the migration, the phase 0 and the phase 1, the next six months were highly, highly accelerated. So if you are embarking on a migration or you are just about to kind of start a migration phase 1, please do build your foundations properly. Because, I think, if you do that, it's just easier to bring the applications and the workloads over later in the remaining half of the year.

VENK SUBRAMANIAN: Yeah. This plan at the beginning looks counterintuitive, right? In fact, when we went and pitched this, what it essentially looks like is that we're basically training for half the time on an eight-month migration for what is an extremely high-scale set of services. But there was a plan behind it. Because when the teams were ready to go, they were ready to really run with it. We had made sure to account for all of the major blockers. We're going to cover in more detail what we did within that pre-migration period that really set us up. But this point is so important that we're going to highlight it a few times through this presentation. There is a lot of value in taking the time to prepare.

SOM ROY: And one more thing that kind of caught us off guard a little bit last year: GDPR was the hot thing. A lot of you, if you're a customer, have been hit with the GDPR requirements. So again, it goes back to the foundation. If you set the foundation right, I think getting compliant with many of these standards is pretty easy later on. So we also did spend a bunch of time on the GDPR part because that was literally the time when everybody had to be GDPR compliant. So given such-- and I see this across all the customers, more so for a very, very tech-savvy, digital-native customer like Unity-- they are pushing our products and our platform to the boundary. Like Venk said, 29 billion downloads-- there are millions of transactions, even billions, per minute, per second. So it's a very intense application stack that they have. So as part of that, there are many, many things that came up. The TAM team and the PSO team overall worked very closely with our engineering and product teams to actually unlock a bunch of features. GKE, shared VPC was kind of [INAUDIBLE] at that time. Cloud Composer, GKE private clusters-- these were absolutely major blockers to the migration. And kudos to our product teams who were able to get those products [INAUDIBLE] during the migration, and we were able to move all the workloads over. But there are always newer things coming down the pipe, as you saw. I think I just saw an announcement around Traffic Director. So it was really important, and I think all thanks to the Unity team, we said that when we have ten feature requests, it's really important to prioritize those feature requests. Like, what are the things that will block the migration? What are the things which are kind of nice to have, that in the next six months, if they go [INAUDIBLE], that will be fine? And what are the things that you would need one year down the line?
And I think that collaboration with the Unity team worked really well. They said the private GKE cluster and the shared VPC were really, really critical items that needed to be there for the migration.

VENK SUBRAMANIAN: What this slide also highlights, really, is the fact that a lot of you are going to think of your migration as your burden to carry. But it's not. It's a partnership. There is a team of people out there that don't just sell this product, but they're proud of this product. They want to learn more about it. They want to know what you need and what you're missing. They want to come to the table. And what it does for you is it actually makes it a lot easier for you to understand the insides of the product too. But for us, we really focused on understanding not just how the product worked but also what didn't work for the product. Because it wasn't important for us to pick it if it worked perfectly but rather, just like any business today, what's down the roadmap, what can we use today, how do we set it up so that we can work around any potential issues. And that doesn't happen in a silo. So any partnership for any migration requires that you actually invest the time, as part of your pre-migration strategy, to really learn with the team about what they have and how it works.

SOM ROY: So if you look across the different Unity business units across the stack, they're using a lot of products right now. Even after the migration was complete, newer workstreams are getting kicked off. I think 2018-- correct me, Venk, if I'm wrong-- the focus was really, really on lift and shift, move the workloads over to GCP. 2019 is more around enhancing the data, looking at ML use cases. So I would say last year was very, very focused, when we did the migration, around compute, storage, networking, all of the core foundations of the stack. And 2019, again, as I said, the story only starts after the migration when the workloads actually come over. So this year, we are actually working with Unity to enhance that and spend our time in the data and ML space.

VENK SUBRAMANIAN: And once again, this was not a generic decision that we took by looking at the technologies. We looked deeply within our architecture and evaluated per situation what made sense to take over. It became very quickly clear to us that compute was a central piece of our technology. So it wasn't enough that we just lift and shift it. We had to look at Google-managed technologies and figure out what we could use there in order for us to scale. On the other hand, Unity has always been very, very passionate about data and about machine learning. But those were things that were very customized within Unity to date. So in that case, it didn't actually make sense for us to try to bring it over within the Google setup because a re-architecture would be too much for us to handle at the same time as a migration. So we looked at each piece individually to make a decision about how we had to move it over.

SOM ROY: So why was this important, Venk? Can you give us an example?

VENK SUBRAMANIAN: So I don't know if any of you know this game "Apex Legends," but once we finished the migration, we had our multiplayer platform able to support workloads through GCP. And when "Apex" launched, it was a huge launch. For those of you that remember it, they had over one million players within the first couple of days. Today, they have over 50 million players in and out of the game, 2.3 million simultaneously. And this is around the world.
So at peak times, we're actually using 230,000 compute cores. I mean, I'm pretty sure we've maxed out the Google quota multiple times through this process. But it was possible to really quickly launch this and scale it because the backing and the foundation of GCP and the way that we architected it was solid.

SOM ROY: Yeah. This is really important. Google Cloud will scale up, and it will scale up to handle such launches. But again, because we set up the networking and IAM and the security in collaboration with the Unity team, just adding cores was relatively the easier part of the problem once we set up the foundation. Then again, we keep harping on that point because, I think both for me and Venk, it is really close to our hearts that we spent that time in the first quarter.

VENK SUBRAMANIAN: So let's dig really deep into how we actually planned this migration. So here's the background. Now, we talked about the partnership. The partnership was not just a PR spiel that Som and I came up with because our companies told us to. The partnership is important here to mention because we actually had to take this vision not just to the executives about the importance of this partnership, but you actually have to do that for each one of your teams. You cannot expect your teams to believe and move independently on something as key as a migration unless you explain to them the vision, the goals, and the business value. And if you're not doing that with your teams, then you're not empowering them to be part of the business. So when we set up this partnership, we took that to the teams. And we empowered them by explaining the problem. And after that, we got out of the way. We empowered them with the problem and asked them to come up with the right solution. We told them all of the business needs, the deadlines, what happens if we go over x versus y, how spend is affected through this year. We laid it out for the teams. And when I say teams, I don't just mean the directors. I mean every engineer, every lead, had access to this information and was encouraged to know about it. When you do that, now you have teams that are truly bought in because they're aligned with that vision. And then it becomes a self-fulfilling prophecy. The second thing we did was we used this as an opportunity for evaluation and refresh. One of the things that you hear a lot in the industry is lift and shift or transformation, right? And it becomes this binary thing that you're solving for. You're either taking all of your crap that you have in your house and moving it over, or you're just scrapping it entirely and just buying everything new. But that's not really true, is it? Every situation is unique. You have to look at every scenario and evaluate, using the guidelines and the goals that you've outlined as part of the empowerment, to understand what makes sense. So for example, in our case, we had been using our own Kubernetes stack. But we took the time to evaluate GKE, and we realized the value that it was going to bring, because what we had was essentially many, many sets of isolated Kubernetes clusters. But what GKE allowed us to do was really centralize that system, and it actually fundamentally changed how we operate as an organization. Today, instead of having every team figure out their own GKE, we've centralized these things back to teams that are dedicated to it. As in, they are experts in GKE. They centralized the modules. They roll out the clusters. We use different models to help other teams scale.
But when we made that decision, it actually helped us move faster. We did the same thing with logging too. We had an in-house stack that we had always been a little troubled by because it wasn't super effective. When we looked at Stackdriver, we evaluated the advantages we get in terms of gaining people back into the workforce, because they don't have to worry about managing an in-house stack, versus what it would cost us to run it. These are real things. These are practical things that each team should be looking at. It shouldn't just be happening at the executive level. On the flip side, we have a highly customized data pipeline. And that was not something that we wanted to take to chance. We evaluated Pub/Sub, and our evaluation was that we weren't going to move over to it now; we're going to evaluate how to move it over next year. So that one, we actually knew, because we'd done the math behind it, that it made more sense for us to take as is. Now, one of the things that was to our advantage was we were a distributed architecture. We've kind of been very microservice mesh from the beginning. Everything that we do is API and message based. So essentially, we already had a bunch of independent layers that we could figure out how to migrate over. Now, I call this out because this worked to our advantage. But there are technologies that you're working on that may be a monolith, maybe multiple monoliths, or some version of a monolith and a distributed architecture. These are important because you have to do your own independent evaluation. It's not just enough to look outside and pick the standard that everybody else is using. And then, global teams-- again, we've talked about this before. This was a particularly unique problem for us, which is we're trying to do a migration in a very short timeline with teams that are across the world, and how do we coordinate that. So what we did, and I'm going to cover this in the next slide in planning, what we did essentially was really kind of focus on the results rather than the process. We focused on empowering these teams so that they could move on what they needed to independently. But we had to come up with ways to be able to track the system. And finally-- and this was the toughest part-- business continuity during the migration. There was no stopping for the business. We continued to work on new products and features and rolled them out. We had no downtime while we migrated a high-scale system. I mean, we have 2 to 3 billion users in and out of this system every month. We process 20,000 to 30,000 events every second. And all of this is at low latency. In some cases, the latency requirements were so strict that we had to pick data centers on both sides so that we could route in under 10 milliseconds. And we did all of that without experiencing any downtime.

SOM ROY: And that, I think, Venk, is another key point that always comes up. Again, from the PSO side, we always see the customer asking, are our engineers going to build the next product, or are they going to focus on the migration, because both are equally time-consuming. And while, on one side, you have to make the business go forward and come up with the new features, you also have to finish this migration. So I think the balance that the Unity team, with help from the PSO side, was able to create is one of the key reasons why this went well and we were able to complete it within a short time frame.
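To make that kind of build-versus-buy math concrete, here is a minimal sketch of the comparison described above. Every number is an illustrative assumption, not Unity's actual figure; the only point is the shape of the calculation that each team can run for its own stack.

```python
# Hypothetical build-vs-buy comparison for an in-house logging stack versus a
# managed service. Every number below is an illustrative assumption.

engineers_freed = 2                  # people no longer maintaining the in-house stack
loaded_cost_per_engineer = 200_000   # assumed fully loaded annual cost, USD
inhouse_infra_cost = 120_000         # assumed annual VM/storage spend, USD

managed_service_cost = 250_000       # assumed annual managed-service bill, USD

cost_in_house = engineers_freed * loaded_cost_per_engineer + inhouse_infra_cost
difference = cost_in_house - managed_service_cost

print(f"in-house: ${cost_in_house:,}  managed: ${managed_service_cost:,}  "
      f"difference: ${difference:,} per year")
```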
VENK SUBRAMANIAN: And again, for us, it was a practical decision because we looked at the logistics of it. If we're building a feature, what value do we get from deferring sections of the migration versus what do we get from speeding up the migration in certain aspects? We always went back to the basics. We always looked at the goals when we were trying to solve these day-to-day tough problems. So migration planning-- we already covered this, but plan twice, execute once. I cannot stress this enough. Now, what did it actually mean when we did planning? We started with the very, very basics. We put people into Qwiklabs and Coursera. We gave them access to online training that they used to pick up at least the basics of a new system, because most of us were actually unfamiliar with GCP to begin with. Then, we went into the next stage, which is, now that we understand the basics, we started to work closely with the Google TAMs to actually understand certain technologies very, very deeply. We'd already identified things like GKE. But it wasn't just enough that we looked at the documentation. So we would sit down with the TAMs and walk them through our use cases across the board and explain how we were planning to use it and what made sense. Then, we went even one more level deeper. At this point, we had each of the teams that owned pieces of the architecture sit down for days on end with the Google TAMs. And the TAMs would bring in their own experts based on specific technologies that we wanted to use. If we needed a Postgres equivalent, so Cloud SQL, they would bring those experts over. If we were evaluating Pub/Sub, then we would make sure we had a Pub/Sub expert there. And we would spend all day in a room, walking through the stack, breaking it down over and over, digging into how they're supposed to talk to each other, how the firewall rules work, how does access flow-- everything from where do we set up the projects to which services run on which cluster. We tried to lay as much out as possible. The advantage of doing this in a very collaborative fashion, as opposed to having people go out and write documents and review them over and over, is that it really sped up the time to get to the ideal result. So we actually focused a lot on just bringing people together. We actually had people fly to different offices. We had local TAMs, and this was another big advantage of local TAMs. Then, this actually gave us time for learning and exploration. So as I mentioned, we covered a lot of detailed content with the TAMs. Once we had gotten to the point where we understood what each team had to do, we went to the next level, which was we actually started working through cross-team dependencies. So we would have teams sit down together where known dependencies existed, once again, in collaboration with Google, to talk through the details, like, OK, how do we manage latency here? How do we manage the data workloads that are going to flow through at peak periods? This forces you to do a lot of research. This forces you to gather the right metrics, to understand your system better. It's not just enough for you to have the basics. Everybody knows how many requests their service deals with, but you don't necessarily know what your error rate is, what your peak loads look like, what patterns your traffic flows through, how it deals with specific types of exceptions.
Because when you move from one stack to the other, you're actually going to see a change in where the errors are generated, because different clouds handle resiliency differently. So we really focused on those aspects. Now, while the teams were off doing that, we were focusing on phase 0. What is phase 0? Compliance, security, SRE, network, and developer tools. So while the engineering teams were off figuring out how they could migrate, we had actually set off the phase 0 teams, which, really, if you think about it, is your foundation. This is what creates the right GCP foundation for you to use. We had these teams off and running in parallel, setting up the actual stage and production infrastructure. So the engineering teams were off in sandboxes having these detailed conversations and learning more about the tech. And in the meantime, security was working out how you set up the Google Groups to set up the right access. How do you do editor versus read only? How do you do billing access? Where do you give super admin? How does SRE get access to systems across different organizational units at your company versus what you do for your own teams within your own organizational unit? We also covered compliance. GDPR was a big one that we were dealing with. There's a lot of little nitty gritty that you figure out as you're digging into it. For example, GCP, by default, does geoblocking of certain countries that are embargoed. And these are pieces that you're only going to figure out when your compliance team gets in there and asks the right questions, because that's what they do. At the same time, network was a heavy, heavy focus. We pushed bandwidth like crazy. So we definitely had the network team in there, not just looking at how they were going to use the existing setup, but how we could improve it and actually build a much stronger foundation for what we knew we were going to scale in the next three to four years. Now, workstreams and dependencies-- so this is a tricky one to talk through, but I'm going to hand-wave my way through it and hopefully you guys are going to see. Because this was actually a critical part of our success. What we focused on in this whole pre-migration area was building out workstreams. Think of your workstreams as layers within your stack. At the very base, you're probably going to have network. Then on top of that, you're probably going to have your foundational pieces of how you start up the clusters. Then you're going to have the actual services that run between them. But the services are also layered, right? You will have your backend services that are talking to the data pipelines, that are actually talking to a middleware, that are talking to a pure frontend service. So there are all these layers that you have. Now, if you're able to neatly draw them out so that they create these parallel paths, what you've effectively done is, one, you've identified workstreams. Each of these workstreams is a cohesive unit that can move independently. Your data pipeline is different from your services, is different from your frontend. The other thing that you're going to end up doing when you draw this out is you're going to have lines that cross over between workstreams. Those are your dependencies. So now you know exactly who talks to whom, because you've laid this out in charts. Towards the end, we had giant charts that had a ton of different workstreams in them. And so we actually ended up creating workstreams within workstreams in some cases.
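As a rough illustration of that layering, here is a minimal sketch, with entirely hypothetical workstream names, of how the parallel paths and the cross-over lines can be captured as a simple dependency graph and grouped into waves that are safe to migrate in parallel.

```python
# Hypothetical workstreams mapped to the workstreams they depend on.
workstreams = {
    "network":             [],
    "foundation":          ["network"],                    # projects, IAM, clusters
    "data-pipeline":       ["foundation"],
    "ads-backend":         ["foundation"],
    "multiplayer-backend": ["foundation", "data-pipeline"],
    "ads-frontend":        ["ads-backend"],
}

def parallel_waves(graph):
    """Group workstreams into waves; everything within a wave can move in parallel."""
    done, waves = set(), []
    while len(done) < len(graph):
        wave = [ws for ws, deps in graph.items()
                if ws not in done and all(d in done for d in deps)]
        if not wave:
            raise ValueError("circular dependency between workstreams")
        waves.append(wave)
        done.update(wave)
    return waves

for i, wave in enumerate(parallel_waves(workstreams)):
    print(f"wave {i}: {wave}")
```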
But when you're doing a migration as a company, you have the bandwidth. You have the manpower to be able to go and break it down. So what we did was, each workstream that we started with was a business unit. And then that business unit would go in and break down their workstream into further layers. But these layers were very important. One, we always had visibility into what the layers were and where the dependencies lay. Two, we never ran into the issue of people not understanding how the architecture was flowing, because this was always a great reference point for us to be able to go back to. Finally, tracking the milestones-- so after we had these workstreams and dependencies laid out, it actually let us track to milestones. Now, I'm going to say this to you a little differently, because a lot of us, when we're doing this migration, we're going to focus on the process, and we're going to focus on the deadlines. That's natural. That's what we do. But when we have a global set of teams and our goal is to empower them, there was no way for us to be able to pull it off by saying, hey, this is the date you've got to hit it by, these are all the processes that you have to follow. So we kind of flipped it on its head. We focused on milestones, not status. And we focused on results, not process. So what that means is every team was able to own a piece of the workstream independently. All we did was create standard milestones that we could track across these workstreams. This is something as simple as your stage and prod rollouts being milestones, or a security review being a milestone. Like, pick your milestones, right? But they become standard across the teams. The second thing we did was, when we would collaborate, when we would review updates, we never focused on the status. We focused on the results. We focused on the milestones. So what have you achieved so far and what are you going after next? How can we help you get there? Where do the dependencies exist, and how can we help bring the collaboration together? What it did was it kind of changed the framing of this whole migration. It became less of a process and it became more of a shared goal that we were all going after. The other thing that was unique, at least in our situation, was our teams used different supporting stacks. So not everybody is on the same ticketing system, not everybody is using the same developer tools. So once again, in that case, how are we going to actually track this? How are we going to know when an engineering team has finished x milestone? Because you know how the traditional process goes, right? You put everything into Jira. You create this giant chart. And then everybody's looking at it every week going, oh, we slipped by two days here. But that was not what we wanted to do. Because, again, the focus was not the deadline. The focus was the shared vision, this goal. So, believe it or not, we spent a week creating this tracker that essentially took all these disparate pieces of data and just kind of rolled it up into simple workstream-based progress. And we shared that every week across the board. The executives all knew what was going on because it was a very simple way to view it. In fact, I think-- and I'll say that out loud-- I believe Google liked our tracker so much that it may be showing up now as a template for other migrations. So if you see one that looks kind of like a set of workstreams, that was us.
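A minimal sketch of that kind of rollup, assuming a handful of shared milestone names and hypothetical workstreams: each team reports only which standard milestones are done, whatever ticketing system it uses internally, and the tracker reduces that to one simple weekly view.

```python
# Standard milestones shared by every workstream (names are hypothetical).
MILESTONES = ["plan reviewed", "staging live", "security review", "prod live"]

# Each workstream reports the set of milestones it has completed.
progress = {
    "ads-backend":   {"plan reviewed", "staging live"},
    "multiplayer":   {"plan reviewed", "staging live", "security review"},
    "data-pipeline": {"plan reviewed"},
}

def weekly_rollup(progress):
    """Roll disparate team updates into one simple per-workstream view."""
    lines = []
    for workstream, done in sorted(progress.items()):
        completed = [m for m in MILESTONES if m in done]
        next_up = next((m for m in MILESTONES if m not in done), "complete")
        lines.append(f"{workstream:15s} {len(completed)}/{len(MILESTONES)} done, "
                     f"next: {next_up}")
    return "\n".join(lines)

print(weekly_rollup(progress))
```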
And then, finally, the cadence for the global teams-- now, as I said, it's very hard to track these global teams, so we focused on the goals and the results. We tried really hard to stay away from the process. So at the end of the day, this is kind of a very, very high level of what the workstreams looked like. These cohorts are essentially big milestones that each one of these people owned delivering. And if you notice, it was possible for them to be extremely parallel. Big ones we even broke down into further workstreams. And we had dependencies tracked between them. But the reason why we could move on all of this in parallel was because we focused on making these workstreams as independent as possible.

SOM ROY: And I think really defining what goes in cohorts one and two and three was also very important, because it gave us a staggered approach. And when we went live, it wasn't all or nothing. We went live with the first set, everything worked well, and then we went live with the second one. So I think that was really, really useful.

VENK SUBRAMANIAN: Yeah. And we're actually going to talk about some of our learnings now, because we also stumbled along the way. There were things that we learned. And they're kind of the nitty gritty. They're the practicalities of when you try to do a migration. So we'll start with the don'ts, because we always want to know the don'ts first. OK. First of all, don't migrate all of your baggage. But also, don't migrate none of your baggage. You've got to pick. Think about it like moving into a new apartment. When you move into a new apartment, you don't just take all those boxes that have been sitting in your garage for months or years. But you also don't just leave them all in the old house. You want to go through and clean up as you go. So do that. Be practical about it. And more importantly, don't try to put a ton of process around it. Your engineering teams know these things. They know where the baggage practically can be moved versus not. Just let the engineering teams be empowered to do something like this. Two, don't build a snowflake. If you have a snowflake, don't migrate the snowflake. We all have stacks, especially if the technology we're working on is a few years old, and we all have these snowflakes sitting around. But the world has changed. So much is out there now. Google aggressively looks across and tries to find these common patterns. So there are known best cloud practices. There are known application patterns to follow. Use them. This is actually a great time for you to be able to reinforce them. You know that there is a VM out there somewhere that is open to the internet and nobody knows about. You know there is that one guy whose laptop has access to the production infrastructure. Take the time to clean it up. Take the time to put better practices in place. Don't migrate the monolith. Now, this is a harder problem than just putting it up as a point on a slide. So when you're looking at a monolith, do your best to apply the strangler pattern. Pull out what you can that's safe, what's feasible, and try to move that over. It's actually going to have a twofold result. One, it's going to make your monolith smaller and easier to migrate. But two, it's also going to provide a proof of concept for how to be able to scale to bigger migrations, because you're going to pull out really, really small pieces. You can start with simple things, like your config service, or your identity, or just a connector to the database.
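As a rough sketch of that strangler approach, with entirely hypothetical service names and URLs: the small pieces that have been carved out get their own backends, and a thin routing layer sends those paths to the new services while everything else keeps falling through to the monolith.

```python
# Paths already carved out of the monolith and served by new, independent
# services. All names and URLs below are hypothetical.
CARVED_OUT = {
    "/config":   "https://config.new-stack.example.com",
    "/identity": "https://identity.new-stack.example.com",
}
MONOLITH = "https://monolith.old-stack.example.com"

def route(path: str) -> str:
    """Return the backend that should serve this request path."""
    for prefix, backend in CARVED_OUT.items():
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return MONOLITH  # anything not yet extracted stays on the monolith

# Carved-out paths go to the new services; everything else still hits the monolith.
assert route("/config/flags") != MONOLITH
assert route("/checkout/cart") == MONOLITH
```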
Separate out your database through an API and just move that over. Now, don't retain single points of failure. This one was especially important for us. For example, I told you we used network bandwidth like crazy. So when we were moving over to Google, we really buttoned that up. We built in multiple paths and redundancies that have actually allowed us, in recent times, to deal with network outages that we've seen on the third-party side. But we've been able to deal with that because we actually took the time to evaluate our bandwidth, understand it, and then put the right redundancy in place. We did the same thing for Kubernetes. Multi-zone is always a great idea. Multi-region is even better. But multi-zone is always a great idea, so we took the time to do that too. And finally, don't re-architect everything, especially not your plumbing. Don't try to move over to that brand new monitoring system while you're trying to do a migration. Move the parts over that need to be migrated and deal with everything else as a separate initiative. Now, in some cases, they may be tied together, in which case, use your workstreams and dependencies to be able to separate them out. If you treat your developer tools as a later workstream that other workstreams have a dependency on, you're actually going to create a phased approach to how you're doing things.

SOM ROY: So Venk talked about the very specific things on the technology side. I'm going to talk about the people and process side as well, because until all these three come together, you won't have a successful migration. So the first key don't on the people and process side is, please don't migrate in a vacuum. Think about the downstream dependencies. Think about what other teams downstream are going to be impacted by this migration. And please involve all these cross-functional stakeholders when you do so. If you don't do that, then even though your component is successfully migrated, the shared vision will not be met. So no migration in a vacuum. Don't focus on deadlines-- I think Venk talked about it, so I'm not going to go into the details. It should be shared goals. It should be talking about shared focus areas. And everybody should feel part of the migration. Just saying, I have to do it by October 30, that's not going to cut it, because you have to take everybody along. Another thing which is important is lift and shift versus transformation. I think the point we are trying to make is there is no one correct answer. Some of your components will be lift and shift; it makes sense to just take what you're running and just move it to GCP. Versus some, where you should take the opportunity-- you are doing a migration anyway, why not transform? So each workstream should be treated very differently. Each component should be treated and evaluated differently. Across all the workstreams, there is no mandate that everything needs to be either lift and shift or transformation. I think Unity's migration was a very good example of a mix of both these approaches. We did lift and shift where it made sense, and we actually took a transformation approach where that made sense. And then, just don't assume that planning leads to success. While the planning is really important, following up and executing on it is really, really important. And iterate on the plan. In every migration that we are seeing, there are unknown blockers that will crop up. There will be churn in terms of folks joining the team.
So you have to keep iterating on the plan, because if you don't iterate on the plan and you just stick to something that you planned six months back, that's not setting you up for success. So that is a really, really key don't from a process perspective.

VENK SUBRAMANIAN: So now, what to do from the technology side. We've covered phase 0 in detail. It is very important to establish a solid GCP foundation. Because here's the thing to remember-- the migration is not the end of the journey. It's the beginning of it. You are tying yourself to a technology for a long period of time, hopefully. Hopefully migration is not what you do as a company. And when you're doing that, you want the foundation to be extremely solid. The more debt you accumulate, the harder it will be going forward. Now, some kinds of debt are isolated, and you're going to actually make calls on them. But foundational debt is different. If your access controls are not in place when five people are in the system, it is going to be a lot harder to do it when 300 or 3,000 people are going to be in the system. If your network firewalls aren't in place to segregate the environments when there are no services in there, it's not going to happen when there are thousands of services in there. And what also changes is the manpower needed to do it at a later time goes up exponentially too. So it is very, very important that you establish this foundation. Automate everything-- this is the key part. So we actually took this time to automate everything that we found that was manual. And we actually did it in three different styles. So everything was infrastructure as code. That was kind of the principle that we aligned on, that we were going to do. But also, when we were looking at infrastructure and we realized all infrastructure was being written as code, we centralized a lot of the pieces that we knew were in use across multiple parts of the company, and we offered them as infrastructure as a service or infrastructure as a framework. And what infrastructure as a framework-- I'm assuming you all know infrastructure as a service. But what infrastructure as a framework does is it actually takes some of the burden of maintaining the service and passes that on to the user of that service. And this works really well for internal teams. So as a good example, we had many, many uses of Mongo. And they were all isolated, some of them were manually set up, some of them were set up with different kinds of automation, and they all had different configurations, different use cases. But we spent the time to have a single team actually pick up all of that, understand from each of the users what they were trying to do, take it all back, and rewrite it as a centralized module. Now, this module is a lot better maintained, because we actually use automation not just to deploy and run it but also to test it. We have a team that's dedicated to continuing to improve it, which means that you don't fall behind on versions. You don't fall behind on new functionality. You don't have bad bugs sitting around in your system. But also, it didn't make sense for a team like this to try to run every database in the company. You're just creating an unhealthy central dependency. So we offered it out as a framework. We maintain the module, we improve it, you go use it. You own the cluster that it runs on. It also passes the ball a little bit in terms of making sure everybody knows how to run infrastructure.
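Here is a minimal sketch of what that infrastructure-as-a-framework split can look like, with entirely hypothetical names: one central team owns and tests the module, and each consuming team calls it with its own settings and owns the cluster the result runs on.

```python
# Hypothetical "framework" module maintained by one central team.
# Consuming teams call it with their own spec and own what gets deployed.
from dataclasses import dataclass

@dataclass
class MongoClusterSpec:
    team: str                     # consuming team, which owns and pays for the cluster
    environment: str              # e.g. "staging" or "prod"
    replicas: int = 3
    disk_gb: int = 100
    mongo_version: str = "4.0"    # bumped centrally, picked up by every consumer

def render_mongo_resources(spec: MongoClusterSpec) -> dict:
    """Produce a deployable resource description from a team's spec."""
    return {
        "name": f"mongo-{spec.team}-{spec.environment}",
        "replicas": spec.replicas,
        "version": spec.mongo_version,
        "disk_gb": spec.disk_gb,
        "labels": {"owner": spec.team, "managed-module": "mongo-framework"},
    }

# A consuming team stays in control of its own instance:
ads_prod = render_mongo_resources(MongoClusterSpec(team="ads", environment="prod"))
print(ads_prod["name"])  # mongo-ads-prod
```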
Most of us today understand the concept of, you build it, you run it, you own it. In fact, at Unity, we have this joke that says, you build it, you run it, you pay for it. Because our engineering teams actually know what it takes, in terms of spend, for their services to run. They use that to continuously optimize their services. And this is the kind of empowerment that you want to make sure that you offer to your teams. Minimize technical debt-- now, we've talked about technical debt quite a bit. Specifically, what you want to look at is just old debt that's sitting around that's going to severely hamper you. In one of our cases, we had an old, highly customized load balancer that had been written in house and was sitting in a system. And when we looked at GKE, we knew that we could re-architect the system a little bit and actually get rid of this custom load balancer. So we took the time to do it. So minimize technical debt where feasible, especially fundamental technical debt. Don't take it with you. Iterate and learn-- so the workstreams were actually set up in increasing levels of complexity and collaboration. So the first systems to go out were extremely simple. Most of them didn't even have a backend database. It was just GCS. But after a while, we started to get into bigger services, but those that were completely independent. Then you start introducing a dependency layer. So now you have multiple systems trying to go live at the same time. So we did it slowly, with the goal of accelerating towards the end, because you start to gain momentum once confidence builds from the deployments. So our migration actually looked like almost no work done within the first three months, because we were focused on the learnings and the pre-migration work; very, very minimal services migrated over the next two months, because we were getting our confidence and understanding how everything worked in stage. And then, all of a sudden, you start seeing your high-scale services just ramping up really quickly, and all of that happened within the last two, maybe three months. And finally, train, plan, and prepare-- the reason why I'm calling this out is, separate the artificial pressure of learning from execution. As an engineer, you know this. When you are told to practically learn on the go, or you're being asked to deliver something that's brand new, it creates this artificial pressure of, I need to do two things at the same time. So we just set it up in a way that took that pressure off. The teams were able to focus on learning and training first, and then, when they were ready to go, they were able to execute.

SOM ROY: And finally, from the people and process side, what are the things you should do? This is very much like what we covered in the don'ts: don't start in a vacuum. So you should align on the overall vision and goals for the GCP migration. Close gaps in understanding; it has to be top-down as well as bottom-up. Like, if the engineer who is actually working on the migration doesn't align with what the vision of the company is, why we picked GCP, and where the migration is going, that leads to issues and conflicts unnecessarily and it will delay your migration. So alignment, top-down as well as bottom-up, is really, really important. Establish a migration PMO. This may sound like a really, really old school term, PMO. But a project management office does help when you are doing a large migration with multiple workstreams and so many different stakeholders involved.
And absolutely identify the right stakeholders, both from the Google side as well as from the Unity side. That's really, really important. Identify clear owners, because responsibility and accountability-- again, a very old school term-- who's responsible, who's accountable for the success of that workstream? That is really important to identify. The fourth one is, I think, the one that is really close to my heart, which is, set realistic migration goals. Aspiration versus reality-- like we keep talking about, yes, let's use this migration to do something dramatic, your entire stack will change, something will change drastically-- versus, let's be real about what can be achieved in the six months, seven months, or one year of migration that you're going to do. So this alignment is really, really important. And you as a company need to take a call on what the realistic goals are. Establish and track milestones-- I think Venk talked about this, and I'm not going into the details again because we don't have so much time. But have proper tracking and make sure that you're looking at it on a weekly, bi-weekly, monthly cadence. Prioritize risks, issues, and features-- and this goes back to the slide where we talked about all the feature requests that Unity had. Like, please identify what are P0s and P1s for you. What are P2s and P3s, and can they wait for six months, one year? Please do try to actively identify that. It always works better, both for Google and the customer, if we have a prioritized list. And on the last point, Venk, do you want to-- I know this is--

VENK SUBRAMANIAN: Yeah. This is the one I really want to hammer home, because a lot of you are going to treat a migration as a project that needs to be done. But it's a unique event, and it's an important event. It actually sets the stage for your company. And it's not going to happen without your people. It's not going to happen without your engineers. So celebrate success. We talk about people, process, technology quite a bit. But if you think of it in terms of people, planning, and empowerment, you're going to get the right technology as a byproduct of that. If you focus on letting your smart engineers be empowered, and then get out of the way and focus instead on how you can unblock them and how you can help them get to the right kind of tracking and milestones, they're going to get the right results for you. Deadlines are never the focus. The goals and the vision are. And that's it for us. [MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 8,562
Rating: 4.9148936 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: WdqQEcEUwRE
Length: 46min 56sec (2816 seconds)
Published: Thu Apr 11 2019