AWS re:Invent 2020: CloudEndure Migration Factory best practices

Captions
Hello, everyone, and welcome to this session on the CloudEndure Migration Factory Solution. My name is Wally Lu, Principal Consultant at AWS. In the next 30 minutes, I will introduce the CloudEndure Migration Factory Solution, talk about typical large-migration challenges, and at the end share some best practices and lessons learned from our customers. Let's get started.

First, large-migration challenges. Have you thought about migrating 1,000 servers in about six months, or maybe 3,000 servers in about 12 months? Migrating one server may be very easy, but scale really changes everything. A simple five-minute task, such as restarting a server, repeated 1,000 times is 5,000 minutes. A simple 10-step process to migrate a server, across 1,000 servers, becomes 10,000 steps. We want to design a solution that helps you simplify the migration and reduce the number of steps for large migrations.

So, this is the desired state. Ideally, we would design a solution simple enough that you push one button today and migrate all your servers from your data center to AWS tomorrow. However, every customer is different; there is no one size fits all. In reality, there are a few things to consider. For large migrations, we use different tools to support the migration, and different customers use different tools: you may have discovery tools, migration tools, a CMDB, and data in Excel spreadsheets, and you may also want project management tools to manage the migration. Then there is the people side: lots of people and different teams support a large migration, such as infrastructure teams, cloud teams, application teams, testing teams, and so on. And the third thing is the many small tasks that are part of a large migration. For example, you may want to check C drive free space, install an agent on a source machine, or select a target instance type for your servers. Repeating any of those 1,000 times is a big deal. So, we want to design a solution that is simple enough, but also flexible enough to help a customer solve all of these problems.

So what do we do? Let's revise the design a little bit; here is the revised desired state. What if we split the one big button into 3, 6, or 9 smaller buttons? In theory, if we push the right button at the right time, in the right order, we can achieve the same result, right? Or even better, because you can add a new button, or replace existing buttons, to integrate with your existing systems. That's how we want to solve this problem.

So, let me introduce the CloudEndure Migration Factory Solution. Before we talk about the solution, I want to spend one minute on what CloudEndure Migration is. CloudEndure Migration is a rehost migration tool that helps you migrate from anywhere to AWS. It was designed for rapid, large-scale migrations; it is a block-level replication tool that replicates every single block from the source to AWS, and it is an agent-based migration tool. That's CloudEndure. So what is the Migration Factory Solution, and why do we need it? Just like any other service or solution we develop at AWS, we always work backwards from customers to think about how to design a solution that solves customer problems. This specific solution is built to solve the large-migration challenges.
The CloudEndure Migration Factory Solution, or CEMF, is an automation engine built to accelerate your CloudEndure migration using APIs. It is also a metadata store that keeps all your data in one place: your server data, application data, and wave data, in a single source of truth. And the third thing to share is the perfect use case for the solution: if you have more than 100 servers to lift and shift to AWS, using the solution will help you accelerate your migration.

From a solution design perspective, we try to solve two problems. One is integration. As we discussed, there are so many tools and so many people involved, so we want to build a metadata store that can integrate with everything as a single source of truth. As this diagram shows, you can import data from CSV files if you want to, or import data from your CMDB through the same standard APIs. You can also leverage the metadata in the store to automate migration activities, such as installing software on all the servers in wave one; since the metadata store knows which servers are in which wave, that becomes easy to do. You may also want to automate the cutover process: instead of cutting over servers one by one, we integrate with the CloudEndure API, which enables us to cut over a large number of servers, say 20 or 30, together.

That's integration; now the automation piece. Since we have everything integrated, it is easier to automate across different tools as well. Take these CloudEndure automation activities as an example. The first column is the build phase, which has three tasks. The first is checking the prerequisites. Why is this important? Because you don't want to spend hours, or days, troubleshooting. What if something doesn't work and your application fails during the cutover window? You may spend a couple of hours finding the root cause, and the root cause can be as simple as not enough free space on the C drive. You can spend five minutes checking free space, and for the cutover that could save you hours. That's really worth doing. However, five minutes across 1,000 servers is 5,000 minutes. So what we do here is write one automation script that checks the prerequisites for all Windows and Linux machines in the same wave, all together. For example, on Windows we check C drive free space, the .NET Framework version, TCP 443 connectivity to the CloudEndure console, and TCP 1500 connectivity to the CloudEndure replication server. You run that once for all the servers in the same wave.

Now, when your servers are ready, you want to push the agent to the source machines, right? Installing one agent is super easy, maybe three to five minutes per server. However, with 100 servers, things get more complicated, because you have Windows and Linux, and you may also have ten different target accounts. Ten different target accounts means ten different installation tokens, one for each CloudEndure project. So you may end up with 20 different ways to install an agent on 100 servers. That's not easy. Even with tools like Ansible or SCCM, you still need to figure out, for any given server, whether to use method number 1 or method number 20 to install the agent on the source machine.
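To make the prerequisite check concrete, here is a minimal sketch of what a wave-level check could look like. This is an illustration, not the shipped CEMF script: the replication server IP and free-space threshold are placeholder assumptions, and the real script would read its server list from the Factory metadata store.

```python
# Minimal sketch of a per-server prerequisite check (not the actual CEMF script).
import shutil
import socket

REPLICATION_SERVER = "10.0.0.10"   # assumption: replace with your replication server IP
CLOUDENDURE_CONSOLE = "console.cloudendure.com"
MIN_FREE_GB = 2                    # assumption: pick a threshold that fits your estate

def tcp_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_prereqs() -> dict:
    # Free space on the root of the current drive (typically C:\ on Windows, / on Linux).
    free_gb = shutil.disk_usage("/").free / 1024**3
    return {
        "free_space_ok": free_gb >= MIN_FREE_GB,
        "console_tcp_443": tcp_open(CLOUDENDURE_CONSOLE, 443),
        "replication_tcp_1500": tcp_open(REPLICATION_SERVER, 1500),
    }

if __name__ == "__main__":
    for check, ok in check_prereqs().items():
        print(f"{check}: {'PASS' if ok else 'FAIL'}")
```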
Now, using the automation script here, we are able to push the agent to any source machine, Windows or Linux, and to any target project, using one automation script. This is the automation we provide as part of the solution, but as I mentioned, one size does not fit all: you can add or customize automation to integrate with your existing systems, such as your password management system or your CMDB, to build fully end-to-end automation for a large-scale migration.

That's automation. From an architecture perspective, you deploy the solution to your AWS account using one CloudFormation template. The template deploys the front end and the back end: the front end is a JavaScript application, and the back end is Lambda functions and DynamoDB. We use Cognito to authenticate to the solution. Even if you have multiple accounts to migrate, you only need to deploy the solution once and use it to support migration to multiple target accounts. On the left-hand side is the migration execution server, a Windows server in your data center joined to your AD domain. We use this server to connect to your source Windows servers using remote PowerShell over WinRM, or to your Linux machines using standard SSH. That's the architecture of the solution.

Let's do a quick demo; I want to show you how to use this solution to accelerate your migration. I want to show three things. First, how to import server data into the CEMF solution from a CSV file, instead of updating server metadata one by one on the console. Second, how to check the prerequisites and push the agent to multiple source machines, both Windows and Linux, at the same time for an entire wave. And third, how to do the cutover: how to launch many servers together instead of launching them one by one from the CloudEndure console.

Let's get started with the demo. I'm on a demo server right now, which I use to run the automation scripts; it mimics the migration execution server in the source data center. This server is in the source AD domain, so I can use it to connect to all my source Windows servers over remote PowerShell (WinRM), or SSH to all the Linux servers. Let's start with automation number one: import server data into the CloudEndure Migration Factory. Before I do that, I want to show you how this is normally done manually, so we can compare the two. This is the CloudEndure console; we have two servers here. Normally, you select one server and update its blueprint. The blueprint is the target instance information: you have to select the instance type, subnets, and security groups, and save the blueprint one by one for every server. Think about repeating that 100 times for your 100 servers. We change that by putting the data in a CSV or JSON file and importing all of it into the Migration Factory. Let's take a look. This is the CloudEndure Migration Factory web console, on the resources page. In the resource list, we have the wave list, application list, and server list. Wave 1 may have three applications and 10 servers, wave 2 maybe 20 servers. Now let's import data from a CSV into the Factory.
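To illustrate the "any source, any target" idea, here is a minimal sketch of the OS-based branching such a script performs. It is not the shipped CEMF script: it assumes the pywinrm and paramiko libraries, the installer paths are hypothetical staging locations, and the installer commands follow CloudEndure's documented agent installers, so verify everything against your own project before use.

```python
# Minimal sketch of OS-based agent push (illustration only, not the CEMF script).
import paramiko
import winrm

INSTALL_TOKEN = "YOUR-PROJECT-INSTALLATION-TOKEN"  # one token per CloudEndure project

def install_windows(host: str, user: str, password: str) -> str:
    """Run the Windows agent installer over remote PowerShell (WinRM)."""
    session = winrm.Session(host, auth=(user, password))
    # Hypothetical staging path; installer flags per CloudEndure's documented installer.
    result = session.run_ps(f"C:\\Temp\\installer_win.exe -t {INSTALL_TOKEN} --no-prompt")
    return result.std_out.decode()

def install_linux(host: str, user: str, password: str) -> str:
    """Run the Linux agent installer over SSH."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    _, stdout, _ = client.exec_command(
        f"sudo python /tmp/installer_linux.py -t {INSTALL_TOKEN} --no-prompt"
    )
    output = stdout.read().decode()
    client.close()
    return output

def push_agent(server: dict, creds: dict) -> str:
    """Branch on the server's OS metadata, as the Factory does for each server in a wave."""
    if server["server_os"].lower().startswith("windows"):
        return install_windows(server["server_fqdn"], creds["win_user"], creds["win_pass"])
    return install_linux(server["server_fqdn"], creds["ssh_user"], creds["ssh_pass"])
```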
This is my CSV. We have four servers here, two Windows and two Linux. I have full server information, including the operating system and FQDN, and target server information such as subnet, security groups, and instance type; we use this information to update the CloudEndure blueprints. Let's import the data into the Factory by selecting the CSV. Within just a few seconds, we have four servers in the Factory: two Windows servers and two Linux servers, in wave one. Another option to get data into the Factory, if you have a large dataset with servers, applications, and waves together in a big CSV, is to run a Python script to ingest the data.

Now that we have the data, what if you want to change something? You can always switch to the pipeline page and change the information: for example, move an application from wave three to wave four if there is a delay, and save the application. You can do the same for a server, changing it from one subnet to another and saving the server information.

Now we have the data, so let's do automation number two: validate the prerequisites on the source machines and push the CloudEndure agent to them. Let's run our first automation script, the prerequisite check. We provide a wave ID as a filter and the CloudEndure replication server IP, because we want to validate connectivity from the source machines to the CloudEndure replication server over TCP 1500. The first step is to log in to the Migration Factory with my username and password. Now we have the server list for wave one: two Windows servers and two Linux servers. Everything looks good for Windows, so let me type a username and password for Linux; the script checks different settings for Linux, of course. Only a few seconds later, we have a final report telling us which servers passed the checks and which failed. It looks like everything is good.

Now, if we switch to the resource list page and filter on migration status, we see the change in the Factory: for the four servers in wave one, the status changed to prerequisite check passed. This is because every time you run an automation script, it sends feedback to the CloudEndure Migration Factory API to update the status for you automatically. So you always have visibility into the entire lifecycle of your servers, and your migration engineers don't need to spend their valuable time on status updates, because it is all automated.

The next step is to push the CloudEndure agent to the source machines. Let's do that. As I mentioned, this should work for any source and any target; let's see. First, I log in to the Factory with my username and password, and now I provide my CloudEndure API token, pasted here. We get the server list: two Windows servers and two Linux servers. This works for any source, any Windows and any Linux. We also have servers in the Demo2 and Demo3 projects, which shows it works for any target. Let's type a username and password for Linux. The process starts with the first server in the first project. Since that server is Windows, the script uses remote PowerShell to connect to it; when the next server is Linux, the script automatically switches to SSH to connect to the Linux server.
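As a rough illustration of the scripted ingest path just mentioned, here is a minimal sketch of a CSV-to-Factory import. The API URL, endpoint path, and auth header here are hypothetical placeholders; the actual CEMF import script and its API contract may differ.

```python
# Minimal sketch of a CSV-to-Factory import (illustration only).
import csv
import requests

FACTORY_API = "https://example.execute-api.us-east-1.amazonaws.com/prod"  # placeholder URL
TOKEN = "COGNITO-ID-TOKEN"  # obtained by logging in to the Factory (Cognito)

def import_servers(csv_path: str) -> None:
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Each row carries source info (OS, FQDN) and target blueprint info
            # (subnet, security groups, instance type) for one server.
            resp = requests.post(
                f"{FACTORY_API}/user/servers",       # hypothetical endpoint path
                json=row,
                headers={"Authorization": TOKEN},
                timeout=30,
            )
            resp.raise_for_status()
            print(f"Imported {row.get('server_name', '<unknown>')}")

import_servers("wave1-servers.csv")
```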
This will take a few minutes, so in the meantime I want to show you automation number three: how do we cut over a large number of servers in a cutover window? Again, before we do that, I want to show the manual way for comparison. Normally, on the CloudEndure console, you first update the blueprints one by one before cutting over any servers: select the right instance type and settings, then switch back to machines, select the machines, and click the button to launch the servers in test mode or cutover mode. If you have 100 servers, you repeat those steps 100 times. Think about selecting 100 servers out of 500 on the console and updating blueprints one by one; that's a big task.

In the Migration Factory it is completely different, because we never touch individual servers; we always operate at the wave level. Let's grab our API token, select a project name, and do a dry run first, with launch type test and wave ID 3. A dry run does not launch any real servers; it validates your data. We imported the data from CSV, and now we validate it to make sure there are no typos or invalid values in the CSV. You don't want to spend valuable time in a cutover window troubleshooting issues like a typo, so you should do a dry run a couple of days or weeks before the cutover. Let's do a dry run. As soon as I click the Launch Servers button, the data is sent to the CloudEndure API for validation, and you get a response: dry run was successful, or dry run failed. It looks like the dry run was successful for all the machines.

Now we can change dry run from yes to no, to launch real servers. Let's do that and launch the servers. Similarly, this sends the data to the CloudEndure API, updates the blueprints, checks replication settings, and creates a job, not for one server, but for the entire wave. Now we have a test job created for the machines. Comparing with the CloudEndure console, there is a job for my entire wave, wave three. You may notice the difference: I did not select any servers. I simply chose a project name and a wave ID and clicked Launch Servers. Whether it's one server, 10 servers, or 50 servers doesn't matter, because we launch the entire wave together. This helps you accelerate the migration by focusing on waves, and it eliminates some of the potential issues of the manual process.

Okay, let's go back and check the agent installation. Everything looks good: agents were successfully installed on all four servers. If we switch to the resources page, we can see the status changes as well: the migration status for these four servers changed to CE agent install success, and these two show Test instance launched. As with the previous script, every time you run automation, or do anything from the Factory console, the status is updated automatically, so you always have visibility into the entire lifecycle. Let's validate again using the CloudEndure console: we have two Windows servers in the Demo2 project and two Linux servers in the Demo3 project, which again means the script works for any source and any target. That's the end of the demo; I wanted to show you how automation can help you accelerate your migration to AWS.
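For reference, here is a minimal sketch of what such a wave-level launch looks like against the CloudEndure API. The login and launchMachines calls follow CloudEndure's public API documentation as I understand it, but treat the exact paths, payload shapes, and XSRF-token handling as assumptions to verify; dry-run handling and error reporting are omitted.

```python
# Minimal sketch of a wave-level test launch via the CloudEndure API (illustration only).
import requests

API = "https://console.cloudendure.com/api/latest"
API_TOKEN = "YOUR-CLOUDENDURE-API-TOKEN"

session = requests.Session()
resp = session.post(f"{API}/login", json={"userApiToken": API_TOKEN})
resp.raise_for_status()
# CloudEndure returns an XSRF token cookie that must be echoed back as a header.
session.headers["X-XSRF-TOKEN"] = session.cookies.get("XSRF-TOKEN", "")

def launch_wave(project_id: str, machine_ids: list[str], launch_type: str = "TEST"):
    """Create one launch job covering every machine in the wave at once."""
    resp = session.post(
        f"{API}/projects/{project_id}/launchMachines",
        json={
            "launchType": launch_type,  # "TEST" or "CUTOVER"
            "items": [{"machineId": mid} for mid in machine_ids],
        },
    )
    resp.raise_for_status()
    return resp.json()  # the job for the entire wave
```

In practice, the machine IDs for a wave would come from the CEMF metadata store rather than being listed by hand; that is what lets the script launch a whole wave without ever selecting servers on the console.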
Okay, let's talk about best practices and lessons learned from our CEMF customers. Customer A did a large cutover in just a few hours. What do I mean by large? Have you thought about cutting over hundreds of servers in just a few hours? That's exactly what this customer did: they cut over 600 servers in just a few hours. How did they do it, and what did we learn from them?

Number one, minimizing change is key for a large cutover. Change can be a good thing, but it is also a risk. It's fair to say that changing 20 things generally carries more risk than changing one thing. You may want to change the computer name, AD domain, or IP address of a server. However, what if some legacy application has an IP address hard-coded somewhere, and nobody knows? That's a risk. This customer mitigated that risk by not changing IP addresses at all, cutting over entire subnets together. Something similar applies to networking: ideally, we want to build an application-specific security group for every single application before migration, but due to a tight schedule or a lack of application knowledge, that may not be possible in a large migration. This customer developed a generic migration security group to support the migration and pushed the application-specific security group design to a later stage. That's how they did a large cutover in just a few hours.

The next thing we learned from this customer is that automation with the Migration Factory solution is also key, because you do not want to cut over your servers one by one; you want to bundle the servers together and launch them together to save time. And the last thing from this customer, also very important: make sure your app teams are ready. Your application teams are critical, because a large migration is never just an infrastructure project. We need the application teams and business units involved as part of the migration; they are on the same team, not just playing a supporting role. They have to help with application testing and change validation, and make the go/no-go decisions as part of the large migration. So it's really important that your app teams are fully aware of and supporting the large migration. That's Customer A.

Customer B scaled from 10 servers a week to 90 servers a week, actually scaling from one migration cutover per week to three, which is about 90 servers. How did they do it, and what did we learn from this customer specifically? One: plan the migration waves ahead of time. Believe it or not, large migrations sometimes stall not because you don't have the right team and skill set, or the right tools; you may have a perfect team, perfect skills, and perfect tools, but not enough servers ready to migrate. What I mean is: say you want to migrate 50 servers a week continuously for a few months. What if you don't have that many servers ready for migration? That's a challenge we sometimes see in large migrations. This customer finished wave planning for 900 servers ahead of time, so they were able to import all 900 servers into CEMF.
With those 900 servers ready for the large migration, even at 90 servers a week, that's enough for 10 weeks. That's one thing we learned from this customer. The second thing is to automate the server data intake process as well. You may have server information in an Excel spreadsheet, in the CMDB, and in discovery tools. Try to avoid manually copy-pasting from A to B to merge the data together: one, it's not efficient, and two, manual copy-pasting introduces a lot of errors. This customer has a big wave-planner Excel spreadsheet in SharePoint. Someone logs in to SharePoint to do the wave planning, basically deciding which server goes in wave one and which in wave two. That triggers a Terraform process, which updates the on-premises firewalls and creates security groups for those specific servers. That triggers a Lambda function to validate all the data, making sure there are no typos and that everything needed for migration is present. And that Lambda triggers a second Lambda function to import the data into CEMF, ready for migration. As you can see, there is only one manual step in the middle, someone logging in to SharePoint to update the wave planner; that triggers all the other processes, and everything else is fully automated end to end. That saves a lot of time and avoids a lot of errors in a large migration. From this customer we learned that more automation actually means less troubleshooting, and less troubleshooting means faster migration.

Customer C had a 1 Gbps Direct Connect (DX) link, yet they were able to migrate 500 servers in just three months. And remember, they shared that 1 Gbps link with production traffic as well. What did we learn from this customer, and how did they do it? Number one, they developed the end-to-end process at an early stage of the migration. Don't wait until two days before cutover to figure out who is going to install the agent on the source machine, who is going to shut down the server, and who is going to change DNS. Develop the process RACI model early in the migration and make sure everyone is fully aware of their roles and responsibilities, so no one ever has to ask who does what in the migration process. Next, they developed a centralized tracking dashboard for everyone to track the status of the entire migration, just like this one. That can save maybe 15 minutes every day, because you no longer need status update meetings; everyone simply logs in to the dashboard to see the status of the whole migration. Fifteen minutes may sound small, but with 20 people on the team, that's 300 minutes every day, which is quite a big number. This really helped them, gave leadership visibility, and let them manage the entire migration from a centralized dashboard. And the next thing: since, as I mentioned, they only had a 1 Gbps Direct Connect link, they developed additional automation. We designed the CEMF solution to be flexible so any customer can build additional automation for their use cases, and that's what they did. They built automation that integrates with CEMF through the metadata store to disable CloudEndure replication during business hours, to avoid any impact on production traffic, and to re-enable replication for the entire wave after business hours. That's really useful.
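Here is a minimal sketch of what that scheduled replication control might look like. It assumes the CloudEndure API exposes pauseReplication and startReplication operations for a project's machines; treat the endpoint paths and payload shape as assumptions to verify against the API documentation. In practice this could run as a scheduled Lambda, pulling the wave's machine IDs from the CEMF metadata store.

```python
# Minimal sketch of business-hours replication control (illustration only).
import datetime
import requests

API = "https://console.cloudendure.com/api/latest"

def set_replication(session: requests.Session, project_id: str,
                    machine_ids: list[str], pause: bool) -> None:
    """Pause or resume replication for a whole wave of machines at once."""
    action = "pauseReplication" if pause else "startReplication"
    resp = session.post(
        f"{API}/projects/{project_id}/{action}",   # assumed endpoint
        json={"machineIDs": machine_ids},          # assumed payload shape
    )
    resp.raise_for_status()

def business_hours(now: datetime.datetime) -> bool:
    """Weekdays 08:00-18:00 local time (adjust to your own schedule)."""
    return now.weekday() < 5 and 8 <= now.hour < 18

# On each scheduled run: pause during business hours, resume after hours.
# 'session' is an authenticated requests.Session, as in the launch sketch above.
```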
That's how they could use only a 1 Gbps link, shared with production traffic, and still migrate 500 servers in just three months. That's what we learned from Customer C.

To quickly summarize the large-migration best practices: Number one, plan the migration waves ahead of time. Do not wait until the last minute to figure out which server goes in which wave; make sure you have enough buffer to support the migration, at least a couple of weeks ahead of the migration schedule. Number two, develop the end-to-end process and automation at an early stage. Again, do not wait until the last minute to figure out who is going to log in to a server, install the agent, or shut down a server; defining the process and developing automation early will help with a large migration. Next, automation with CEMF is also key; as we saw in one example, more automation actually means faster migration. Number four, minimize unnecessary change. Change is a good thing, but keep it off the migration wish list. Since we are doing a rehost migration, the goal is typically to exit the data center by a hard deadline. Some app owners may want to modernize their applications from monoliths to microservices, or move to serverless. That's understandable, but we should push that to a later stage instead of making it part of the migration schedule. And the last thing, again very important: prepare your application teams. Make sure they know they are part of the migration team, not just supporting it, because without the app owners we will not have successful migrations.

That's all for today. Here are the takeaways from today's session and some useful resources. The first is the CloudEndure Migration page, if you're interested. The second is the CloudEndure Migration Factory implementation guide, for deploying the solution in your environment. The third covers best practices for using the solution for large migrations. We also have a few other links, such as Migration Immersion Day and AWS Workshops; feel free to take a look. Thank you for watching the session today, and see you next time. Thank you! Please complete the session survey.
Info
Channel: AWS Events
Views: 2,569
Keywords: re:Invent 2020, Amazon, AWS re:Invent, ENT304, Enterprise/Migration, CloudEndure Migration
Id: is7cOcNUHlw
Length: 29min 19sec (1759 seconds)
Published: Fri Feb 05 2021