AWS re:Invent 2018: Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311)

Captions
Everyone, thanks for joining us on Tuesday afternoon of re:Invent for a talk on legacy migrations. I'm here today with Mike G., senior director of business technology at Sysco. My name is Mike Lapidus; I'm a Solutions Architect for AWS. When Mike and I were planning this session, we really wanted you all to come away feeling like you had experienced a legacy migration — both the failure of a legacy migration and the success of one — so that with that experience you can go move your own application. What's not included in this agenda is the Q&A at the end: we're going to leave ten minutes, we have two mics, and you can ask questions of Mike and me. We'll also be around afterwards in the hallway if anything can't be covered.

I'm going to kick it off by talking about AWS and migrations in general. If you're using AWS today, you know that we have a plethora of services out there to support your migration efforts, and we're releasing new services in this space based on customer feedback all the time. Just two days ago we released AWS DataSync, which further extends our ability to move data into AWS in a seamless manner. At the center of this is AWS Migration Hub, and around it we have services that range from application discovery all the way to moving the actual bits, and finally keeping databases in sync so that you can cut over seamlessly.

Now, it's not enough to have a plethora of services that support your migration effort. It's also good to know that you're walking in the footsteps of giants: organizations of all shapes and sizes have successfully migrated to AWS. We have Sysco up here with us today, but if you go to our migrations landing page you'll find case studies from McDonald's, Ticketmaster, GE and others who have also migrated to AWS in varying capacities.

For folks who have dabbled in migration or cloud adoption in general, you'll know that the technology itself often isn't the most challenging thing — cultural change is equally or often more challenging for organizations. AWS Professional Services has put together the Cloud Adoption Framework, a series of whitepapers that call out some of the challenges you may experience and give your organization a method and a rubric to follow in order to be successful in educating, hiring and training people. For Sysco in particular, the project you'll hear about in a moment — RME — sat at the very left-hand side of this graphic, and it really propelled Sysco forward. It propelled them through the migration phase, and today, when we talk to Sysco, they're in the reinvention phase: they're building fresh applications — rewritten and repurposed versions of things that existed on premises — in a cloud-native manner.

When it comes to migrations, there are a number of different ways an application can be migrated, and it's important to classify the migration effort before you kick it off. You may have a simple rehost, which is basically moving the application in its existing form into AWS. You could have something like a refactor, where you're actually using cloud-native technologies within the architecture of the application in order to take advantage of all the benefits of the cloud. For RME, the application you'll hear about in a moment, it started as a simple rehost, we changed that to a replatform, and then after the migration — we'll go into more detail in a moment — we actually refactored it to take advantage of those cloud-native technologies that allow us to be more resilient for our customers. And with that, I'm going to kick it over to Mike G., who's going to tell you all about Sysco.
Thank you, Michael. So, a little bit about Sysco. We are a global leading company focused on selling products and services to our customers — you've probably seen a lot of our trucks out in Vegas with the big Sysco logo — but what we're really focused on is for Sysco to be our customers' most trusted business partner. There are a couple of things we're really passionate about: enriching the customer experience; operational excellence — again, you see our trucks out on the road and in front of the hotels; activating the power of our people; and doing business the Sysco way. At a glance: I know you've seen a lot of presence from Sysco, but we are a forty-plus-billion-dollar company with a presence in the Americas, and we're growing in Europe — Ireland, the UK, France, among other countries. We have different brands: think of it as a broadline business with about one million SKUs, plus specialty companies focused on produce, specialty seafood and so on.

Mike talked about RME; the acronym stands for rehosted mainframe environment. This is a journey we took on several years back. The first thing we did as a company was focus on getting off z/OS, because we had a large ERP environment and the cost was just going through the roof. So we replatformed from z/OS to Windows using the Micro Focus platform. That positioned us in a nice way — we got everything worked out and our core ERP was running extremely well. However, if you look back four or five years, when we hosted it in-house, the way that usually works — and everybody is probably familiar with this — is you buy hardware and you run that hardware for five years. In our case, as you saw on the previous slide, we've grown through acquisition, and that created a lot of challenges for us because our growth couldn't keep up with the system. We use our core ERP for many different things — corporate billing, purchasing, finance — so we have millions of dollars running through the system every day, with, I would say, several terabytes of changes every week. An extremely large system, a rapidly growing business, and we wanted to integrate a little better. That became the catalyst for us to get onto an environment that's more scalable and brings a lot more elasticity, plus a couple of other reasons we'll talk through a little later in the presentation.

The first time we tried to migrate to the cloud, we failed — and you don't hear a lot of companies saying publicly, "we tried this and failed." Failure is a good thing, and what it did for us was flush out many things we were just not familiar with, especially when it comes to moving a very large system. Mike is going to go into more detail, but some of the things that got us: truly understand the delta and the size of the data you're trying to move to the cloud. And since we run a large ERP, performance is obviously critical, so we needed to make sure that as we moved something to the cloud, the performance was actually better, not just on par — because we wanted to make sure we were scaling not just for what we have today, but for the ability to grow with the business.
We also had to understand the resources required — and we tried to understand that both from the environment side and the skill-set side — because we were trying to migrate a system that is an old, COBOL-based environment, and usually when it works, it works, but nobody wants to unplug it and see what happens. So it was one of those fun exercises of shaking it and seeing what happens.

Let me talk a little bit about the architecture we have today. We've put up a simplified version — obviously there's a bit more behind it, and if somebody's interested we can talk about it after the session — but basically we took the existing environment, along with the storage and the data, and moved it to AWS. We created a primary and a secondary to make sure we have the right failover, and we wanted our data to reside in multiple Availability Zones, because that gives us the ability to fail over a lot quicker. We also put a lot of the data that does not reside in a database into S3 buckets, because that provides a lot of storage flexibility — I'll talk a little more about what we've done with that — and all of those boxes are fronted by load balancers, which gives us the flexibility to scale up, absorb different loads on the system, and then scale back down.

Here's an interesting chart I want to socialize with you. One of the things the cloud brings is a lot of interesting options. This is a cost chart showing when we went live with our system: our cost actually increased a little, and that was on purpose, because the cloud gives you the flexibility to say, "you know what, for the next few months, until I know what I'm dealing with, I'm going to scale up and extend my environment so I don't hit any roadblocks." But what we then did aggressively, after we validated that the system worked and we were getting the right performance, was a really aggressive optimization push — some of my team members who helped with this are in the room. We focused not only on bringing the cost down to where it would equal hosting it on-site, but once we verified we were at the same cost, we dropped it by close to sixty percent through various techniques.

This is the data lifecycle management we implemented. As with any system set up in your own corporate data center, the attitude tends to be "if it works, don't touch it," but that actually created a lot of problems for us. And it's not only about the technology: we started looking back and talking to finance, talking to legal, and asking, what about data retention? What are our policies? Can we comply with all of them? What we were able to do was not only take the data that isn't needed in production and remove it from our core production system, but actually apply the right data and retention policies to help us with that. Reducing your data load obviously equals better performance, but on the flip side it also covers the key things you need from a legal perspective.

Before we go to the lessons learned: we tried to do this in April 2016, and we failed. It was not easy — it was not easy going back to the CTO and saying, look, the team tried and it didn't quite work out.
But we went back and said, you know what, we were this close. What didn't work out was the timing: we started on Friday and we needed to be back online for the business by Sunday around 1:00 p.m. We were getting there — we were this close — but there were just a lot of unknowns. We had a plan to back out, so we put the system back online, we were able to revert, and we took a lot of notes. We went back basically the next day, captured a couple of really good notes, regrouped with the rest of the leadership at Sysco and said: look, we can do this, we were this close, we found exactly the lessons learned that we're going to review with you today, and we want to do this again. In June 2016 — sorry, 2017 — three months later, we were able to redo this project and successfully get it done. It was hard. I'm not going to sit here and say, hey, if you really want to move to the cloud, just use all the services and magically everything will work — it doesn't work that way. It was a lot of hard work that we put in, but we got it done.

Here's the thing I'm extremely proud of, for Sysco, for my team, for everybody who worked on this. You probably saw in the news that in August 2017 Houston had a horrible hurricane. Nothing we were responsible for in Houston impacted the business: the system was already on AWS, and a lot of critical infrastructure had also been moved to AWS. Luckily the building was not hit, but the water was literally two feet from the entrance, and that's not the kind of risk we wanted to take. So with that I'm going to turn it back over to Mike, who will walk you through the key things that apply to any project with a large data set and a large migration effort; hopefully there's a lot of useful information in it.

Perfect, thanks Mike. So, having failed, then tried again and succeeded, we were in a unique position to capture those lessons learned and test our theories — to make sure the assumptions we made about the reason for the failure were indeed correct. The list you see here isn't all of the lessons learned, but for your sake and the time allotted we felt it was best to boil them down to the most relevant, and we hope you'll come away feeling these will be very beneficial for your future migration efforts.

The first we'll talk about is knowing my data and my delta. What we ultimately mean by that is: it's important to understand the size of the data, but also its composition — where it's stored, how it's stored, the size of each file — and the reason that's important is that other systems depend on the I/O those files ultimately generate during the migration effort. It's something we didn't account for in RME during the first go-around, and we'll talk more in a moment about exactly what that meant. In order to actually know your data, you need a tool to measure it, and as folks in system administration you know there are a number of tools at your disposal. You can use SQL to run a query that dumps the total number of tables you have and the size of each table. You might use your storage array and its management tools to understand the data at a specific point in time. If you're taking backups, the deltas between backups on a regular schedule will give you those deltas over a period of time. It's important to measure twice and only transfer once.
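None of this tooling is shown in the talk, but as one concrete illustration, here is a minimal sketch in Python of profiling a file share the way the speakers describe — total size, file count, and the share of very small files — and diffing today's scan against a saved one to estimate the delta. The share path, the manifest filename, and the 256 KB cutoff are hypothetical placeholders, not values from the talk.

```python
#!/usr/bin/env python3
"""Profile a file share: size, count, small-file ratio, and delta since last run.

Minimal sketch of the "know your data and your delta" lesson; the share path
and manifest location are hypothetical placeholders.
"""
import json
import os
from pathlib import Path

SHARE_ROOT = Path(r"\\fileserver\erp-share")   # hypothetical share to profile
MANIFEST = Path("manifest.json")               # previous run's snapshot of the tree
SMALL_FILE_CUTOFF = 256 * 1024                 # illustrative "tiny file" threshold

def walk_share(root: Path) -> dict:
    """Return {relative_path: size_in_bytes} for every file under root."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            try:
                sizes[str(full.relative_to(root))] = full.stat().st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sizes

current = walk_share(SHARE_ROOT)
total_bytes = sum(current.values())
small = sum(1 for s in current.values() if s < SMALL_FILE_CUTOFF)
print(f"{len(current)} files, {total_bytes / 1e9:.1f} GB total, "
      f"{small} files below {SMALL_FILE_CUTOFF // 1024} KB")

# Compare against the previous manifest to estimate the delta since last run.
if MANIFEST.exists():
    previous = json.loads(MANIFEST.read_text())
    changed = [p for p, s in current.items() if previous.get(p) != s]
    changed_bytes = sum(current[p] for p in changed)
    print(f"delta since last run: {len(changed)} files, {changed_bytes / 1e9:.1f} GB")

MANIFEST.write_text(json.dumps(current))
```

Run on a schedule over 30-90 days, the per-run delta figures are exactly the numbers the speakers say they wished they had captured across a full financial quarter.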
This is another mistake we made during RME: we assumed that the first measurement we took of the data delta over a period of time would suffice. Well, it turned out that a number of financial factors — end of quarter, the number of reports generated in the system during that time — actually changed the delta. So, again, incremental measurements will help you capture that data delta over time, and 30 days may not be enough for your organization — go 90. Use regular reporting from things like your storage array, and keep the business in mind: month-end, quarter-end, year-end. Maybe you have an e-commerce business, and something like Black Friday will play into the migration. These sound obvious, but oftentimes in our effort to migrate we lose track of other things.

I think it's extremely important to understand that delta, because what hit us is that we measured everything — we thought we had everything — but then the financial quarter got introduced, an old batch job ran, and what we expected to take X number of hours exploded. Remember what I was saying about the last couple of hours? We just couldn't hit that target — that was it. Knowing all the facts about your data and knowing the delta is probably one of the key things to plan for.

Thanks, Mike. Speaking of planning: planning to fail is as important as planning to succeed. What I mean by that is incorporating your rollback plan into your initial migration window — capturing exactly how long it will take and at what point you'll need to make the go/no-go call on the rollback. I remember it vividly: it was 11 p.m. on Saturday, we had all gotten on a conference call during the failure and asked, "can we make this work without impacting the business on Monday morning?" It was determined that we couldn't, and that's when we took steps to roll back. Incorporating that failure time is crucial.

And I think this brings up another really critical lesson we learned early on. The reason we failed — and failed extremely gracefully — is that we knew that if something went wrong, we had a playbook ready with all the steps for putting the old system back. So while we were all exchanging emails and holding conference calls, as far as our business was concerned, when they walked in on Monday morning nothing had changed: the system was operational, everything worked, and there was no impact to the business. Obviously go for the success, but make sure the failure path is properly documented — and not just in somebody's head. Make sure you have a playbook with step-by-step instructions that you've practiced, ideally not on the day you go live.

In the next section we'll talk a little bit about storage, and specifically how to optimize your storage for its destination. Every destination on AWS has an optimal block or I/O size, especially when we're talking about EBS. EC2 instances in particular have I/O constraints to consider: putting more provisioned IOPS on a specific volume often will not have the intended results if you're not considering your instance size and its I/O constraints. And for large files that are ultimately landing in S3, consider multipart upload and its different properties, and the size of the file.
Diving into the EBS I/O size considerations, and talking a little bit about RME: RME had a file server, and that file server had millions of extremely tiny files — we're talking 1-2 KB each — adding up to just over 800 gigabytes. At the time we didn't really consider this a cause for concern, because it wasn't a large server at all, all things considered: we're talking 20-30 terabytes of overall data, and a few hundred gigs isn't that significant. What we didn't consider was that those many very small files would ultimately overwhelm the destination. The reason is that when we're talking about EBS, we have to consider the optimal I/O size, which is 256 kilobytes. Not only that — AWS and EC2 will do their best to write those blocks contiguously when multiple writes come through that are smaller than the optimal size, but they will not merge writes below roughly 32 kilobytes. What that ultimately means is that up to eight of those tiny files get combined into a single write operation, which may only add up to 10-15 kilobytes — nowhere near the 256 kilobytes needed to get the optimal I/O for that particular instance type and EBS volume.

If you've ever stress-tested or pushed an EC2 instance to its max, you'll recognize these two CloudWatch graphs. The first is average write latency; the second is average queue length. These are two red flags when you're optimizing a system, especially as it pertains to storage: we want to make sure our write and read latency are not high and that we're not building up a queue, because as an I/O queue builds, those operations are held in memory and can eventually overwhelm the system. We'll talk a little more about that in a moment.

Now, with S3 we have different things to consider. The S3 CLI will automatically handle multipart upload: if you have a large file and use the CLI's copy command, it will break it into multiple parts and upload them concurrently. If your file is extremely large, though, you may want to customize the part sizes and the number of parallel threads, and you might also want to add retry logic so that in the event of a network incident you can retry automatically. For that you can use the Boto SDK with something like Python, or the API directly (the CLI's s3api is a different set of commands). Add to that the newly released AWS DataSync, and there are a number of options for writing to S3. The other side of the coin is when you have a large number of small files: to optimize the connection and ensure latency doesn't hurt your copy commands, you can use S3 Transfer Acceleration, which uses our points of presence — edge locations all over the globe — to reduce that latency and copy files much more quickly.
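The talk points at the Boto SDK for controlling part sizes, parallelism and retries without showing code; the sketch below is one minimal way to do that, assuming a hypothetical bucket, key and source file, with illustrative tuning values rather than anything Sysco actually used.

```python
"""Upload a large file to S3 with explicit multipart and retry settings.

Minimal sketch of the Boto SDK approach mentioned in the talk; the bucket,
key and file path are hypothetical, and the tuning values are examples.
"""
import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# Retry automatically on throttling or transient network errors.
# For many small files you could also enable Transfer Acceleration on the
# client with Config(s3={"use_accelerate_endpoint": True}), provided
# acceleration is enabled on the bucket.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

# Break the object into 64 MB parts and upload up to 10 parts concurrently.
transfer_config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

s3.upload_file(
    Filename="/data/exports/erp-archive.bak",   # hypothetical source file
    Bucket="example-migration-landing",         # hypothetical bucket
    Key="seed/erp-archive.bak",
    Config=transfer_config,
)
```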
Now, when it comes to legacy applications, there's a good chance that EC2 instances alone won't get you there: some physical appliance in your data center is actually part of the architecture of the application. For those situations, looking to the AWS Marketplace is a great solution. For example, RME required that the data on a particular server be made available on a separate server instantaneously. That was a DR and availability requirement, because it's a financial system, and to meet it on premises they used a storage area network with replication capabilities. If you've built a Windows server on EC2, you know there's no native capability for replicating across Availability Zones in separate stacks, so Sysco looked to the Marketplace and found a virtual storage appliance similar to what they used on premises, and they were able to incorporate it into the architecture to meet their compliance requirements as well as the performance requirements of the application. So again: don't constrain yourself to EC2 or the AWS-native services — look to the Marketplace to match or improve upon the architecture of your legacy application on premises.

Now, we all know to test when it comes to a production deployment. We know that if we're updating software or rolling out a new tool, we want to test it in our dev or test environment the same way we would in production. What often gets overlooked, though, is testing the migration effort itself — making sure that in that limited cutover window you've accounted for all the different things that are going to strain the environment. So — I didn't say this, I swear I didn't — "I don't always test my migrations, but when I do, it's in production." That was us, collectively, during the first RME cutover: we didn't test what the impact of the migration would be on the system. Again, not all data is created equal — different file sizes, different destinations: one going to a storage array mount on a virtual storage appliance, another going to provisioned IOPS, a third going to a database. We didn't fully test what transferring 18 terabytes in a 72-hour window would do to that environment, and we didn't understand the different failure modes. We assumed that if the data couldn't be written because we were running out of throughput or bandwidth, it would just be a little slower. That's not always the case when you're straining a system from an I/O perspective.

You may have heard the saying "animals under stress are unpredictable" — well, so are our machines. These are two screenshots from the RME migration: the one in the background is perfmon, looking at the CPU metrics; the one in the foreground is from the hypervisor. What they show is that during the migration, when we were writing all of these tiny little files, the system became unresponsive, the CPU pegged, and we all panicked. We didn't consider at the time that the migration effort itself was causing this, because the behavior didn't line up with anything we'd seen before. We didn't know that sending a million files over UDP — we were using a special tool at the time to stream quickly — would overwhelm the I/O, that the I/O would back up into the CPU, lock up Explorer, seize the whole machine and ultimately make it unresponsive and inaccessible. So what did we do? What any system administrator does when they think they have a faulty instance on their hands: we rebooted. That didn't work. Then we destroyed it and rebuilt it, and that didn't work either. By that point we'd wasted one to two hours troubleshooting an issue that was ultimately caused by not considering the file sizes.
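The write-latency and queue-length red flags described above can also be watched programmatically instead of eyeballed in the console. Here is a minimal sketch against CloudWatch's standard EBS metrics; the volume ID is a hypothetical placeholder.

```python
"""Check an EBS volume for the two red flags discussed in the talk:
rising write latency and a building I/O queue. The volume ID is hypothetical.
"""
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
VOLUME_ID = "vol-0123456789abcdef0"   # hypothetical volume under migration load
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

def ebs_metric(name: str, stat: str = "Sum") -> list:
    """Fetch one AWS/EBS metric for the volume, sorted by timestamp."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=name,
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,                       # standard EBS metrics are 5-minute
        Statistics=[stat],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

writes = ebs_metric("VolumeWriteOps")
write_time = ebs_metric("VolumeTotalWriteTime")
queue = ebs_metric("VolumeQueueLength", stat="Average")

for ops, secs, q in zip(writes, write_time, queue):
    if ops["Sum"]:
        # Average write latency = total write time / number of write operations.
        latency_ms = 1000 * secs["Sum"] / ops["Sum"]
        print(f'{ops["Timestamp"]:%H:%M}  write latency {latency_ms:6.1f} ms  '
              f'queue length {q["Average"]:.1f}')
```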
And then I think the conversation went something like this — me calling Mike and saying, "hey, I think we broke AWS." I think that was the starting point of the conversation. But seriously, it's about understanding these nuances, and not just testing in a dev or QA environment. This is the beauty of the cloud: you can actually do a full production-scale test. Provision the environment — it's going to cost you a couple of dollars, but you don't have to go out and buy all new servers and line everything up. Provision exactly the same sizes, run it, and you'll understand the pinch points, you'll understand what you're dealing with, and you'll have statistics to analyze so you really understand what you're about to do before you go out and do it in prod. So the key takeaway: test with a production-like environment.

Next is honoring the transfer window. The business has agreed to give you a specific window to make that transfer, and it's important that you include the failback mode and any potential business constraints — maybe even family constraints. For me, it was actually my wife's birthday during our transfer window; we were at a brewery when the failure began to happen, and I had to rush away to go assist the RME folks. So consider all of the different things occurring during the transfer window the business has allotted to you.

And I think between April and June I got extremely comfortable. During the migration my executives were sending texts — "are we done yet? are we done yet?" — and I told them, look, by six o'clock we're going to be done, and by 5:49 or 5:50 we were done. Build the confidence that no matter what happens, you're going to land at the point in time you need to. If you're not confident, work on your iterations and get to the point where you know: if I take this data size and move it by this time, I'm going to get it done. Another key thing to remember: it's not only the transfer window, it's all the nuances that can happen during that window. Mike is going to talk a little more about this, but don't assume you're going to be on the highway flying 90 miles an hour from here to there — understand that there's going to be traffic. Mike, let me turn it over to you, because we ran into some of those scenarios.

Yeah, that's a great segue: respect the bandwidth of the interstate. Pre-seed the data when you can. Pre-seeding lets you shrink the transfer window, and it also reduces the constraint on the network — whether you don't have enough bandwidth, or you have other mission-critical applications competing for it, you may not want to use it all to transfer your data at once. Removing network congestion as a potential blocker is extremely important. Now, there are a number of different ways to get your bits into AWS. There are the hardware-based methods — think AWS Snowball for the petabytes of data you may want to transfer, or Snowball Edge if you need compute during that transfer. There are tools for pure connectivity — a VPN or Direct Connect to give you consistent connectivity into your AWS environment. And then there's the application level: for your servers you can use Server Migration Service, which replicates over time, allowing you to plan out that transfer window precisely and cut over multiple servers at once; or, if it's a database, we can do change data capture over a period of time to stream those changes and then do the cutover.
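The talk doesn't name the service here, but full-load-plus-change-data-capture with a later cutover is what AWS Database Migration Service provides. The sketch below shows that pattern in outline; every ARN, identifier and schema name in it is a hypothetical placeholder, and it assumes the source and target endpoints and the replication instance already exist.

```python
"""Start a full-load-plus-CDC replication task so the target stays in sync
until cutover. All ARNs and identifiers below are hypothetical placeholders.
"""
import json

import boto3

dms = boto3.client("dms")

# Replicate every table in a hypothetical "erp" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-erp-schema",
        "object-locator": {"schema-name": "erp", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="erp-seed-and-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load-and-cdc",     # seed the data, then stream changes
    TableMappings=json.dumps(table_mappings),
)
task_arn = task["ReplicationTask"]["ReplicationTaskArn"]

# Wait until the task is ready, then start replicating.
dms.get_waiter("replication_task_ready").wait(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)
```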
Now, one of my favorite unintended benefits of the migration effort — of all migration efforts — is the opportunity to document the architecture. For many legacy applications, the architecture is really a composition of institutional knowledge, of runbooks that exist to help in the event of a failure or a DR scenario, and of comments, scripts and code that can be pieced together to eke out some comprehension of what's going on, but that isn't substantial enough to trust. For RME we had the opportunity to take that information, form documentation from it, and then validate that hypothesis during the migration effort. That experience gave us the confidence that we had a clear understanding of our architecture, and with that understanding, with that confidence, we could walk into the refactoring process: we knew what every service did, we knew how it failed, we knew how to move it, and that allowed us to refactor the application to take advantage of cloud-native technologies.

Walking through that: the on-premises environment had very little redundancy — a single application, with some scripts that helped reboot systems and restart services. The backup options were fairly limited because of the physical hardware and the throughput allotted to it, and that hardware was aging and coming up on its renewal. The initial migration to AWS was aimed primarily at the availability and performance constraints of the on-premises environment, so we created two tightly coupled stacks — but in the event of a failure, an engineer would still need to log in and fail over the systems: change the DNS endpoint, start the services. That was step one. Through that migration process, though, we felt confident we had a complete understanding of the application, and that allowed us to begin the refactoring.

The first thing we did was use Lambda to run a health check on the two coupled stacks. If the active stack became unresponsive, Lambda would change our Route 53 DNS entry so traffic would go to the other stack, and it would also run a script on the server to start the required services in the secondary stack. This took our outage from what was previously four hours, whenever there was an incident, down to four minutes, as the Lambda function was able to start services and change DNS entries — and the impact on end users, and ultimately the business, was far smaller.

And look, to give you a little bit of color on the timing Mike mentioned: usually you put in a ticket to the network engineers, they pick up the ticket, they do the DNS change, then they hand the ticket over to somebody else. We looked at this as an opportunity to lean into automation and take the humans out of the busy work. By putting the right scripts in place — by putting code in place as a service component that can observe the system, automatically react and do whatever the engineers would usually do — it provided tremendous value for us.
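The transcript describes the failover Lambda only at a high level, so the sketch below is an illustration of that pattern rather than Sysco's actual function: a health check on the active stack, a Route 53 record update, and an SSM command to start services on the standby. The URL, hosted zone, record name, IP address, instance ID and service name are all hypothetical placeholders.

```python
"""Health-check the active stack and fail over to the standby if it is down.

An illustration of the Lambda-based failover described in the talk, not the
actual Sysco function. Every constant below is a hypothetical placeholder.
"""
import urllib.request

import boto3

HEALTH_URL = "http://active.erp.example.internal/health"   # active stack check
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"
RECORD_NAME = "erp.example.internal."
STANDBY_IP = "10.0.2.50"                                    # secondary stack
STANDBY_INSTANCE_ID = "i-0123456789abcdef0"

route53 = boto3.client("route53")
ssm = boto3.client("ssm")

def active_stack_healthy() -> bool:
    """Return True if the active stack answers its health check."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def handler(event, context):
    if active_stack_healthy():
        return {"failover": False}

    # Point the application DNS record at the standby stack.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": STANDBY_IP}],
            },
        }]},
    )

    # Start the application services on the standby Windows host via SSM.
    ssm.send_command(
        InstanceIds=[STANDBY_INSTANCE_ID],
        DocumentName="AWS-RunPowerShellScript",
        Parameters={"commands": ["Start-Service -Name 'ErpAppService'"]},
    )
    return {"failover": True}
```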
So what we were actually able to achieve: things will always fail, but the business does not go down anymore. We had a really high failure rate because we were using old, aged hardware and had to deal with a lot of those nuances — and when you take old hardware and add people on top of it, you're hoping everybody remembers to go from step X to step Y. By putting code in place, we were able to recover quickly: a lot of times when we have issues the system does go down, but the scripts pick it up and replace it. That's the beauty of the cloud — in more than a year of events, the business was never aware that anything was happening with the system.

Yep. And along with the improved availability, we also gained lower cost — we talked about that a little earlier — through the use of S3: a script was written to regularly copy older archive files off the expensive EBS storage over to S3 and eventually, following the business's retention practices, into Glacier for long-term storage.

Speaking of data lifecycle, this is an area that comes up often: how do I put lifecycle policies in place and be confident in the results? The first thing is to leverage storage class analysis, a feature you can turn on for your S3 bucket. Run it for 30, 60, 90 days and it will track when objects in your S3 storage are actually being accessed or written, which gives you the opportunity to write your lifecycle policies in a more strategic manner. We also just released S3 Intelligent-Tiering, which will take some of this off your plate, so I encourage you to look at that. It's also crucial to align the storage class to the business requirements: your compliance or legal team may tell you exactly how long you're allowed to keep objects — make sure your lifecycle policy aligns with that. That's something we were able to implement with RME. And finally, don't fear retrieval costs. Organizations, especially enterprises, can get spooked by the idea that accessing the objects they've stored in S3 or Glacier will carry a cost. That often isn't a real concern, especially for archived objects that you may only need in the event of an audit — that's no reason to be scared away from these lower-cost storage options. Use something like storage class analysis to see exactly how often the data is accessed, and then go into that retrieval cost with confidence.
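A lifecycle policy of the kind just described — archive objects transitioning to colder storage and eventually expiring in line with the retention period agreed with legal — takes only a few lines to apply. In this minimal sketch the bucket name, prefix and day counts are hypothetical examples, not Sysco's actual rules.

```python
"""Apply a lifecycle policy: move archive objects to colder storage, then
expire them when the retention period runs out. Bucket, prefix and day
counts are hypothetical examples.
"""
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-erp-archive",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": "archive/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},   # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},       # long-term archive
            ],
            "Expiration": {"Days": 2555},   # ~7 years, per the retention policy
        }]
    },
)
```

The 30/90-day thresholds are exactly the sort of numbers a storage class analysis run would inform.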
And if we're talking about EBS and snapshots: over the summer we released Data Lifecycle Manager, which lets you use tags to automatically snapshot your EBS volumes and then clean the snapshots up afterwards, so you can use them for backups — with some quiescing of the I/O on the volume. Those tags can also align with the tier a particular EC2 instance belongs to, so you might not snapshot dev/test as often as you snapshot production.

So, in summary: reducing the dataset size, both initially and for the migration, is crucial, as is understanding exactly what's needed from a data perspective to ensure the application is ready for production on the cutover date. Understanding the data composition is also crucial — knowing the size of the files, where those files are stored, and the number of files, not just the total size of the data. And finally, use this opportunity to learn, to validate and to optimize your environment. It's something we don't often get the chance to do, especially with legacy applications that have been running for ages while we cross our fingers that they'll stay online for the next few ages — or "I won't be here anymore when it changes." Now we have the opportunity to go in, document everything, fully understand the system, pull that tribal knowledge out of the heads of the consultants and the folks who have been doing this for 10-15 years, document it well, and then use that to begin the process of peeling apart the legacy application and automating its recovery and optimization.

And I know Mike talked a lot about the technical parts of this migration, but what I want to do to summarize and bring it home is this: it's not really about the technology. Let me walk you through a couple of things from the business perspective — how our business stakeholders looked at us when we migrated this system to the cloud. The first business benefit is that the system stays up and they can transact without any issues. Next, for the same cost we were able to bring three times more users onto the system without adding additional hardware, and we were able to reduce the overall cost to the business by sixty percent. We practiced a lot, so this built a lot of internal muscle. And the thing I'm most proud of: we took the second-largest ERP in a Fortune 50 enterprise and we put it in the cloud, and what that did for the organization is create a gravitational pull — if there were voices asking, "well, should we put this in the cloud?", it stopped that conversation. And what it did for the team: we talk about fail quick, fail fast, learn, move on — this really demonstrated how the team pulled together, how quickly we learned from it and internally started embracing the things we'd always talked about, with a very specific example of success. And lastly, last year we decided to buy a company in Hawaii, and we're building the next generation of this ERP platform, because now that we're in the cloud we're able to do that. I have several people sitting here who stood up additional APIs; we were able to integrate additional business units, and we're talking about weeks, not the months or sometimes years it used to take to implement those big projects. So: bring the agility, use the platform, reduce cost, use it as a gravitational pull for the organization — and even the COBOL-based systems are now going through DevOps, acting like DevOps, and we're really talking about how we deliver business outcomes. It's really not about the technology.

So with that — thank you. I know it's just us between you and happy hour. We're going to spend the next 10-15 minutes on questions; if anybody has any, we're more than happy to answer. If you want to ask a question, just come up to the mic. Yeah, please.

Two quick questions: what did your mainframe platform look like at the start of your migration, and what was the total time frame from start to full migration?

In the beginning it was a traditional mainframe system running on z/OS. For cost-reduction purposes — I think six or seven years ago there was a challenge to continue keeping z/OS — we migrated to the Micro Focus platform, and as part of that migration we were able to take a lot of flat files and move them into a database running SQL Server. That was the first cut: just get off z/OS and onto a traditional Windows environment. Once we started looking at the hardware renewal, we looked at a lot of different options, and that became what I would call the second iteration, where we were able to take the system and replatform it to AWS.
Now we're taking the bits and pieces of the functionality — what I would call the secret sauce. I talked about the different modules: finance, for example, we can go out and buy, but how we do purchasing is our secret sauce. So now we're taking that, isolating it — basically slicing the monolith into parts — creating APIs around it, and evolving the things that need to live on. That's the approach we took.

So this was a big-bang switchover, right? All at once, one weekend?

Let me say yes and no, and walk you through it. We were able to isolate the core data that resides in the platform and, with Snowball, move the majority of the data over to AWS. Then we focused on figuring out the delta and started syncing the delta. So in parallel: we took the core data, figured out the delta, and over a couple of months — I think about a month and a half — we kept bringing data over to its new home every weekend. It was about 25 terabytes seeded initially; we initially had a 12-terabyte delta, and we brought that down to 8 terabytes for the second iteration.

So you initially just pointed your other application — well, you couldn't have just pointed your old application at the new data?

No. We brought over as much as we could, then — think of it as change data capture without moving the system — we kept the two systems in sync, and the big bang was really: okay, here's what's remaining, we're going to do a cutover, move the remaining data over to the new environment and change the IP addresses. So users signed in, and the only thing they saw was the same screens — we added a little note that said "on the cloud."

Some of the things I've heard in other sessions are to pick off some new pieces and start creating microservices, so you slowly move everything over. Did you consider doing that before moving your whole application at once?

No, and there are different reasons for that. If you start building the microservices, you have to decide where you build them — on-prem or in the cloud — and you're still dealing with the data-sync issues. If you're building one or two microservices, great: you can always sync the data, it's not a lot of complexity. But as you build out more and more services, keeping the data in sync becomes too complex. For us — and it was really a business decision, since we had to do the hardware renewal — rather than continuing with the old model, we decided to go to the cloud and start building true microservices on truly cloud-native infrastructure. That was probably the right thing for us, but again, there's no wrong way or right way; this is just the way we decided made sense.

And to expand on that a little: for RME in particular, the primary reason we chose the big-bang approach was the dataset size, the change rate of the data, the latency from the existing data center to the AWS location, and the way folks interacted with it. The amount of time it would have taken to rewrite it as multiple microservices was a high bar.
So we decided to move first — get it as close as possible to those new services and cloud-native technologies — and then start the strangler approach of peeling away the different services. There are other teams at Sysco, for other ERP and management systems, that are taking exactly that approach in the other direction: they've developed their microservices within AWS, and those communicate back to data that remains on premises. So not every legacy application — or application in general — is created equal, and it's really up to the business, and the SLAs you've put in place around response times, to determine the best approach. And these simply become your patterns: you can put together the five or six patterns that exist in your ecosystem and then apply the right pattern to the right business scenario.

You mentioned problems with the throughput of many small files — you were moving these onto a Windows Server NTFS file system?

It was actually an EBS volume attached directly to a Windows Server.

Okay. I've seen similar problems in a different context — really large file servers performing very badly on VMware with NTFS. We did some benchmarking: on the same hardware, natively, we got 40 times better performance, and when we tried virtualization with ZFS instead of NTFS, the penalty was only about 5%. Our experience has been that NTFS is disastrously bad when you have many small files. Did you consider moving the files to some sort of staging system that is not an NTFS volume?

We did. The issue was — going back to those earlier slides — that we didn't test the migration effort initially, so by the time we were prepared to make that sort of assessment, we had run out of time. We were trying to think of other ways to go about it, but ultimately what we landed on was that it would have to end up on NTFS, because that was the file system used by the application and the service running on that server, so at some point we'd still have to make that final move. We walked through the options — normally you'd do something like archive the files, zip a bunch of them together so you get that ideal block size and can take full advantage of the I/O — but at some point we'd have to unpack them, and then we'd run into the same constraint. So it really boiled down to not being able to take full advantage of the I/O for our particular instance type. But that's great information about using a different file system if you have that option — that helps the most. Thanks for sharing.

And just for clarity, this was only during the transfer window — in normal operation the performance is up to par. But he's absolutely right: there are different options available, and it really comes down to test, test, test, so you know exactly what you're dealing with. For us, we knew the window, it worked for us, and we were able to get it done successfully.

When you migrated your data first, what kind of schema changes did you make on the new schema so that your microservices could perform better than with the schema you had on the mainframe?

I can talk to that.
If you go with the second option — moving over to the cloud and then setting up the microservice pattern — you want to take the data that a microservice relies on and put it in a completely different database. Because you're in the cloud, you can take that data and put it in DynamoDB, or in another SQL Server database — whatever database you want — so you can isolate the data that the microservice relies on, move it over, and transact against that data. That's the technique we used successfully. Or, depending on your volume, you can create a small microservice with a JDBC/ODBC connection back to the original database table; if you're not getting hit with a lot of volume you can use that technique as well, where you write back to the table of origin, so your microservice is still working against the old schema. Usually that's an anti-pattern if you're aiming for a true microservice, but the idea is: take whatever data you need in order to transact, move it over, create a new microservice on top of it, and then sync up the changes it needs. That's the approach we used.

Just come up to the microphones. Okay — you over here. Yeah, go ahead.

Is there anything additional to evaluating and choosing the migration solution before the migration itself — since during the migration the system is split between the on-premises side and the cloud, you have to prove out some solutions beforehand and then choose?

Yes. The way we started — and Mike is sitting here in the front — we said, look, we want to do this, we've never done anything at this scale before, and we spun up a true proof of concept, or proof of value, whatever you want to call it, and we went through the steps to answer: can we even do this? We were able to run through the scenarios — this is where some of the conversations we've had during this talk come from. You can take a system and, instead of going out and buying brand-new hardware, say "give me a production-like volume for the next two weeks or a month," run it, go through the exercise and make sure it's a viable solution. And this is where we created what I would call entrance and exit criteria: in order for this to be successful, here is the sequence of things that need to be true. For example, we wanted to make sure that three critical batch jobs ran a lot faster than they did on-prem — and we validated that worked. We wanted to make sure that the data sync between another system and our ERP worked, so we went through the exercise of benchmarking real-time transactions going from one system to the other. So scope out those key critical components — obviously you're not going to be able to do a full migration, but I would strongly recommend doing a true POC, with clearly defined entrance and exit criteria, so you can say: here's where we start the project, here's how we end it, and here's what success looks like on paper.

Can you talk a little bit about security and compliance on this project — not necessarily roadblocks, but what issues and concerns those teams raised that you had to work through?

Sure, I can talk to that, and Mike, feel free to add. One of the things we wanted to make sure of was getting legal involved, as well as corporate finance and corporate audit.
We walked them through all the different scenarios we wanted to pursue from a security perspective, brought them up to speed, and made sure they fully understood that AWS security was going to be a lot better than what we could do in-house. There are a couple of things: we have a Direct Connect link to AWS, so we have a secure connection — we're not sitting on the public internet with all our systems exposed. Make security part of that initial POC so those teams are actually working side by side with you. And it's not just about security: bring your legal, finance and audit people on board and walk through what you're trying to do and how it's actually going to help them support the business policies. It's important to bring people along for the journey. This is not just a technical "I'm going to go play with hardware" project — make sure everybody understands the "why."

The one thing I'll add is that because we chose a big, hairy, complex, mission-critical application, every compliance and security check mark the organization needed to pass was cleared by this effort, and that cleared the path for efforts down the road and made the journey much cleaner for other applications — whether cloud-native builds or migrations — that came after it. Sysco has been able to accelerate its cloud adoption in a streamlined manner because the language is now the same: security understands the nomenclature of the cloud, the new controls that are in place, and what they replaced from the existing environment. It was a matter of ensuring the communication and the terminology were understood and that those new controls were well understood.

It's creating that gravitational pull I was talking about, but tactically we also set up a really good framework for archiving that's been validated by legal, and every time we start a cloud migration — or other business teams or DevOps teams start one — they adopt that same framework. It's been vetted by legal, it's been validated by all the right teams, so it's just: "I need to do X, Y and Z — here's the framework and the guardrails." It makes things a little bit easier. Any other questions?

You mentioned previously that you were running on z/Architecture, z/OS — did you face any challenges converting it to Intel-based when moving to AWS? Did you need to make changes to your application before you moved the data, and what challenges did you face?

Yes — remember, this was a two-step approach. When we came off of Z, we had already converted the data, and we took some of the critical tables from flat files into a database; that was the first step. The second step was to take that stack and move it to the cloud. Some of the things we were able to optimize — and again, this goes back to what I was just saying — depend on having the right statistics, the right numbers, so you can go back and check: is it better or worse? When we ran the initial test, I actually found out that the nightly job ran a lot worse, and we thought, that's not good. So we went back, played around with a couple of different options, and ultimately we got three times better performance. But again: measure, measure, measure — I think that's the key takeaway. It's not a magic bullet. So with that, thank you.
We really appreciate you staying until almost 6 o'clock. Please complete your surveys, enjoy the pub crawl — thank you all.
Info
Channel: Amazon Web Services
Views: 3,846
Rating: 4.8222222 out of 5
Keywords: re:Invent 2018, Amazon, AWS re:Invent, Storage, STG311
Id: naPWVBDuuqw
Length: 60min 23sec (3623 seconds)
Published: Wed Nov 28 2018