From Monolith to Microservices at Zalando • Rodrigue Schaefer • GOTO 2016

Captions
So my talk is mostly about our story at Zalando, so you can see where we came from, why we wanted to change things, how we approached this from a cultural and organizational perspective, and what that meant in the end for our tech landscape. We will also cover a couple of the challenges we encountered there, and if we still have time I can also show you a little bit of what we did with the online shop, because microservices as a concept applied to the front end is a different type of challenge, and it's not that easy to solve.

So that's me. Incidentally, I'm also wearing the same shirt here. I don't know, it's not by choice; I wash it sometimes.

OK, let's start with Zalando. Who of you doesn't know Zalando? OK, at least a few. In the States, for example, we are virtually unknown, which is interesting because we're really, really huge. And this is our vision: what we really want to do is connect people with fashion through a couple of different channels. We want to connect, for example, a stylist with a consumer and also with the supplier, and of course we want to sell fashion in the end.

We're really huge. I don't know if any of you have an idea what scale we are at right now. We are in 15 countries with around 18 million active customers who bought something during recent months, and last year we made around 3 billion in revenue. We were at one point the fastest European company to reach 1 billion in revenue, so it grew really quickly. Right now we have around 10,000 employees, and we're reaching 1,200 in tech alone. We have seven tech hubs all over Europe; the biggest ones are the ones in Berlin, Dublin, and Helsinki, but we also have smaller ones in Germany.

Let's go to the early days. Zalando was started in 2008. The proof of concept was made with Magento as the basis, to see if selling shoes online is feasible, because a lot of people at that time said: selling shoes online, what if they don't fit? So there were a lot of concerns. It was
like a crazy idea, and the Magento platform allowed the business to try it out and see how it goes. Magento is an open-source shop solution (there is also an enterprise one), and it's based on PHP. So we did that, and we grew pretty quickly. People really liked to buy their shoes online, also because we had this free return policy, and very soon we came to a point where the load couldn't be handled any more by that particular e-commerce system. So we had to build a new system, and we had to build it fast, because we were already at a point where it just couldn't go any further. We consulted with the CTO of Magento, we did every kind of tuning possible to the system; it just couldn't take it. So we spent approximately three months building our own e-commerce system based on Java and Postgres.

And now we had an interesting situation; I guess many of you have seen that before. If you go through a time where you have a lot of issues in operations, a lot of load issues, systems going down and stuff like that, then you tend to get a little bit on the defensive side of things. So when you design your next system, you tend to be very careful that it really can handle the load, that it's rock solid and stable, and you try to take precautions. What we did then was push a lot of the business logic into the database layer, so a lot of that was done in stored procedures in Postgres. And there was a lot of business logic: not just two or three stored procedures, there were hundreds. As you can imagine, maintaining that sort of code base gets tricky.

By the way, does anybody know this painting by chance? Yeah, right, it's also the loading screen of Civilization V. It's the Tower of Babel, and the Tower of Babel is also interesting because the people at that time, so the story says, could only build this because they spoke the same language. And that's also something which monoliths
tend to have as sort of a restriction: basically everything you build needs to be in the same language, otherwise it's not a monolith anymore. We had that situation, so we had this language lock-in here.

After we built this first e-commerce system, we started to add features on top, and on top, and on top, for five years. Every country obviously has its local specialties, like, let's say, payment or delivery, and we also did a lot on the logistics side of things. So the systems grew. At that point we didn't really have one monolith; that's a simplification here. We had a couple of them, but they were really big systems.

So what does that mean when we look at the productivity of our teams? We have a lot of teams working on the same code base, which is quite huge, so you have a lot of dependencies. If you want to do something, you have to check what the others are doing: are you impacting their code in any way, are you impacting the operations of the system? You need a lot of coordination between the teams to do that, also with releases, for example, and QA. This leads to the law of diminishing returns, which means that adding more people just doesn't bring a lot of extra value, because the more people you have, the more dependencies you create, and everything slows down. You can reach a point where adding people is detrimental, where you're basically slower than before, and this is not a good situation to be in.

We also saw that innovation suffered. To stay on top in a tech environment you need to be very innovative, you need to bring out new features, you have to reinvent yourself. With a huge code base you have the problem that the bug density just increases with the lines of code; there are certain studies around this. You also see a lot more side effects: if somebody changes code here, it can break something elsewhere, and it's very hard to contain this. So what people tend to do, or companies tend to do, in this
situation is that they create rigid processes around the whole software development cycle: QA processes, webhooks into your version control and your ticketing to check everything, checks you do in your deployment application, for example, to really boil things down to a controllable state and to sort of kill any kind of variance you might have in your software development process. And if you kill variance, of course you also limit your innovation, because innovation is variance: you try to bring in something new. So we had less innovation here. A good example is this programming language lock-in we had with Java. We couldn't deploy anything else, so if you wanted to go for a Go application or something, bad luck. Not very nice.

We also saw a bad impact on growth, growth from the people side of things. Of course we want to grow there: we want to have more engineers, we want to bring out more products. If you have a huge code base, one thing you see is that new starters have a really hard time getting confident with the code, because it's so big. They have to learn so much, and they really don't want to take any risks, because they can't understand everything. So this is quite bad; it takes months for somebody to really be productive.

And of course another thing, as the tech crowd you know this: we tend to like new technologies, and we tend to play around with something new, to experiment. But we were locked into an old tech stack, which is basically 10 years old, because in 2010, when we decided on this Java and Postgres stack, we decided on something which was proven and stable, so it had been around for quite some time. The younger engineers who come from the universities are used to other technologies, to newer ones, and if they get onboarded into a company which just doesn't provide that environment, they're demotivated. So in fact what we saw was that our company, for tech, was
not very magnetic. People didn't really see us as a tech company, and when they heard about what we were using, this was a point where many of them just decided not to join Zalando. So hiring was pretty slow. We had to change something: if we wanted to still be on top in the retail space, in the fashion space, then we needed to do something.

So then we sat together for a couple of months, around 30 people from tech, and thought about how we have to restructure our organization and our culture to improve on all of these things, to create an environment which engineers love. We called it Radical Agility, and that's basically what we wanted to have: autonomous teams which could deliver new products efficiently at scale. That means we wanted to take a couple of engineers, put them in a speedboat, tell them which direction they should go, and then leave them and see what comes out of it. That was quite a break from the former culture at Zalando, which, due to these rigid processes and the history we had, was a lot more controlling.

So that's what we did. We drew this comic strip here, and if you've read the book Drive by Daniel Pink, then this might be a little bit familiar, because it served as an inspiration. We have three principles here. One is purpose. Purpose means everybody wants to know why he's doing what he's doing, and that means that the teams, together with their product owner, can define their own purpose inside the company: what is it that they want to achieve? This serves as quite a motivational boost. There's not only the team purpose, there's also the individual purpose. Everybody has some sort of purpose: maybe you want to be the best DevOps guru in the world, or you want to be a highly skilled expert at Scala, something we also see, and we try to help our engineers achieve that purpose and go further there. And of course we have
the company purpose, which gives some sort of direction for what we want to achieve in the mid and long term.

Autonomy, the second principle, is a very important one and also quite hard to achieve. Autonomy basically means that people are free to choose how they solve things. In the book, so in theory, it's comprised of four T's: it's team, technique, and time, and the first one... OK, I forgot that one. Anyway, it's about having the ability to really define how you want to work. That also means that the teams can choose the technology they think is best for their current task at hand. They can decide how they want to structure their time, when to do meetings and stuff, so we try to leave them alone there. They can decide how to prioritize their tasks based on their own purpose. So if they define their purpose and build up, let's say, some product KPIs to track how they're doing, they can just look at those and think about how they can improve their own KPIs to get better at their purpose.

And we have mastery as a third principle. Mastery is about our drive to get better at what we do. Everybody wants to do a good job; there's probably nobody who really wants to do a crappy job. This mastery aspect is something which is often overlooked in most tech organizations, because usually tech leadership has two responsibilities: one is delivery, the other is people development. When it comes to deadlines, people development is mostly pushed to the side, so it doesn't get the same amount of effort as delivery. That's also why we changed our leadership structure here, to allow 100% focus on people development as well.

Everything is held together by trust. Obviously, if you give people autonomy, you have to trust them to do the right thing. This also means quite a mindset change from a leadership perspective, because you really have to step back sometimes and just let the team do what it thinks is best, and for a lot of people this is
difficult sometimes. If you have the opinion that this is the wrong direction, you can't just decide as a tech lead; you have to convince the team with arguments, not with your role in the organization. I think that's a very good move. So the goals here are also increased innovation and productivity.

Let's look at Conway's law. We changed the organization based on this philosophy, and if you change the organization, as Conway puts it, you also have to change your tech landscape; otherwise you get some sort of friction, it just doesn't fit. Right now we're in the middle of it. Organizational changes also take some time, but changing the complete tech landscape of a big company like Zalando takes even longer. In the middle of that, teams still have to cope with older systems which, let's say, cover more than one team, while building their new systems, which cover only one team. So Radical Agility is not only the organizational idea and the cultural idea which we had; it's this and the technology landscape which fits that idea.

Here I have a nice quote from Adrian Cockcroft regarding what a microservice is. I tried to create a sentence which describes our organization, and it sort of fits. That's the idea here: creating an architecture which fits the organization.

OK, let's go to the challenges we had. There are a couple of them. Operations is a pretty important one. I recently read a Twitter blog post which said that they spent, I don't know, hundreds of staff-years just regaining operability of their systems after switching to microservices. We also have Martin Fowler, who said that there are three prerequisites to microservices, by which he means you shouldn't do it if you don't care about these three things: rapid provisioning, rapid application deployment, and basic monitoring. These are all a lot harder in a microservice environment, where you have hundreds or thousands of microservices. We've heard a couple of talks today, especially about rapid provisioning, and
we did our own solution there. Monitoring, especially if you want to track, let's say, a call from a customer across dozens of different microservices, is also kind of tricky. You cannot really do it by logging into the server and tailing the logs; it's a little bit more complex here.

So that's our operational setup. For provisioning we use AWS, also because of its huge ecosystem. We use Docker as the deployment technology, and for monitoring we go with AppDynamics and with ZMON, which is a system we built ourselves. It's also open source, if you're interested. I like it because it's quite good for, let's say, more business-oriented metrics and KPIs. And we use STUPS.io, which is our framework for Docker-based applications on top of AWS. You can have a look at STUPS.io; they have a website that explains how we solved that. You cannot really take it and use it for yourself, because it has some very customized points in it, but you can probably get some ideas. What STUPS offers is the following: we have Docker
deploy, we have SSH access to machines, we have audit reports, which is pretty important for us because we are a public company and need to comply with certain things, and we can give teams full AWS access. Every team has its own AWS account and can run whatever it needs to run.

Mindset is also very important, on the engineering side but also on the leadership side. We have a change here when it comes to looking at the availability of services. If you design a new microservice and it depends on others in some way, then you really have to expect that the other services can fail, because they will at some point in time. You have to try to build your system in a way that it's resilient, that it can cope with failure. There are multiple things you could do. For example, you could try to do mostly asynchronous communication between the systems. You could try to use Hystrix. You need to think about service degradation: what can I offer to the customer if that particular service is down? Can they still do something or not? How can I try to keep things stable?

And we have end-to-end responsibility. This was a big thing, because before that change teams were just developing new software, and then it was thrown over to QA: they will fix it, or they will see what's wrong with it. After QA, by the way, it went to sort of a platform team which covered the whole of operations. We changed that and moved it all to the team itself. So the team itself is responsible not only for developing, but also for testing and for operating what it builds. This is a bit tricky, because you need certain skills for some of these things, so we really have to train the people and give them the necessary tools to achieve that.

Software as a service is something of a guideline. What we really wanted is that teams also feel a little bit like they're a small start-up, because they have this
responsibility, and they should think about what they build as you would if you were creating a software-as-a-service company. So have excellent APIs, have good documentation, be stable, have SLAs. Build your system in such a way that you could theoretically offer it as a product somewhere.

And we had API First, which is a new approach we took. If you're a team and you want to build a new service, let's say a stock service, then your first step would be to create the API, and this API would then also be reviewed and checked by a dedicated group of engineers. What we wanted to make sure here is that the APIs are correct, also in a semantic way. We also created one interesting tool called play-swagger, which allows you to write your Swagger API and create the Play code for that automatically. It's open source.

Also global architecture. That's a very interesting one, because if you have a lot of teams and everybody does what he thinks is best to solve their particular concern, then it gets hard to have a big picture of your architecture. What we also didn't want to have is some sort of ivory-tower software architect sitting somewhere, designing some great tech landscape. So how did we try to do it? For one, we introduced peer review. For example, the API guild is one place where we try to give advice to the teams regarding their architecture. We try to get a shared concept of the business entities into the teams. For example, a brand: what is a brand? A brand can be something attached to a shoe, but it can also be an organization which acts as some sort of user of one of our systems. These sorts of things have to be handled.

We created the Tech Radar, which is a list of technologies we think are worth experimenting with, or which are safe, or which are unsafe. So we rate them. We won't tell a team not to use something if they have a very good reason for it. If you want to build, for example, an augmented reality app for the
warehouse and you just have to use .NET for that, OK, you can. And we created the Rules of Play document, which covers a lot of these topics: how microservices should be built, the whole resilience part, the sizing of a microservice, which is always a big question. You can have a look; we have it on GitHub.

Another big topic here is compliance and security. As a publicly listed company we have to take a look at that. We have to have a four-eyes principle, and we have to keep it even after teams are autonomous. We have to guarantee audit trails: if something bad happens, we need to prove how that came to be.

Identity and access management is a very big topic, and very difficult. We tried to move to a completely new identity and access management system in that time. We wanted service-to-service authentication, we wanted customer-to-service authentication, we wanted authentication for external services through APIs. We tried to solve it all in one go and didn't make it. We switched back to our own solution based on JWT tokens, at least for the customer side. Here you have to see that in the microservice world there are a lot of authentications going on. When you do one call to, let's say, one page, this might trigger twenty calls to some web services or microservices, and these might in turn trigger further calls, so you have a lot of them. We heard this morning that latency is also a point, so you want to reduce that.

Data protection is, in Europe at least, a very important topic. Many companies, especially bigger enterprises, don't trust Amazon so much, so they want to be on the safe side. We took, I think, around three months to get a data protection agreement with Amazon to be able to deploy our applications there.

So we still have, I think, a little bit of time for the case study of the fashion store. The fashion store is also quite monolithic, at least it was, and one of the missions we had last year was to rebuild it, and we tried to take
the concept of microservices and apply it to the front-end world, which is a little bit different, because in the end everything is one page in the user's browser.

So what did we do? We created a structure where teams own fragments of a page. If you look at a page, let's say a catalog page at Zalando, you see multiple parts of it: the header, for example, or the navigation, or the catalog itself, and the footer. All of these are their own fragments. These fragments are served by web applications owned by each team, so they serve the complete markup for that fragment. These web applications in turn can call microservices, and they can have their own data storage.

Then we have this nice layout service on top. The layout service knows what a page should look like: it knows the layout through the template and the context, and then tries to call all of the endpoints for the fragments it needs to build up this page, asynchronously. We borrowed a bit from Facebook's BigPipe concept there and tried to implement that.

On top of that we have the router. The router is an application which first allows us to route calls either to the old shop application or to the new one, and we can also do, let's say, small test runs here. We could say, for example, that we want people in Italy to go to the new part of the page, but only 5% of them, and we can then test that against the old setup and see if anything is wrong or not. The router is written in Go, by the way; the layout service is, in its current incarnation, Node. We did these things as open source, so you can have a look. There will also be a bigger blog post in the near future about Tailor, so the layout service, and how we do the streaming of all these fragments to the client.

So, the result: we can inject new features at runtime into the shop, which was not really possible before. We have faster feedback loops, because each team can adapt to its own speed; they don't really need to wait on other teams to deploy something.
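The percentage-based routing described above (for example, 5% of visitors from Italy sent to the new shop) could be sketched roughly as follows. This is a minimal illustration in Python, not the Go router mentioned in the talk; the names `bucket`, `route`, and `ROLLOUT`, and the choice of hashing a customer id so that each visitor keeps the same assignment, are assumptions made for the example.

```python
import hashlib

# Hypothetical rollout table: country code -> percentage of traffic
# that should be sent to the new shop. Not taken from the talk's tooling.
ROLLOUT = {"IT": 5}


def bucket(customer_id: str) -> int:
    """Map a customer id to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(country: str, customer_id: str) -> str:
    """Return "new" or "old" depending on the rollout percentage.

    Countries without an entry in ROLLOUT always go to the old shop.
    Because the bucket is derived from the customer id, the same
    customer is routed consistently across requests.
    """
    percent = ROLLOUT.get(country, 0)
    return "new" if bucket(customer_id) < percent else "old"
```

Hashing the id, rather than drawing a random number per request, keeps a visitor on one side of the experiment, which makes comparing the new setup against the old one easier.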
It's tech-agnostic: you can build your own fragment the way you want, more or less, though there is some implication here regarding UI. We are more productive, with end-to-end responsibility, with full control, with lean agile processes which every team can define for itself. We have independent development, continuous delivery is possible that way, we have faster onboarding, we are more magnetic, and we can just add new teams as we go. So here are some links; you can see them in the slides later.

I don't know, do we still have some time, or do you want to go to questions? Two more minutes, OK.

An interesting part in the front end is also the look and feel. Usually, when you look at standard web pages, there is some big CSS file which defines everything. So how do you really make that work with these autonomous teams? CSS is also not really a programming language in that sense, so you can easily override other people's CSS styles, and then your page can look quite crappy. There are certain best practices you can follow there, but you also have to keep in mind: what if you want to redesign your page? Would you then, as the UX team or the UI team, have to go to every team and ask them, please, please apply my new CSS guidelines and then deploy them? It would lead to a situation where we would have a system which is inconsistent, which would have the new button type on some pages and the old one on others. That's a situation you don't really want to have. That's why we thought a lot about this and tried to come up with a solution where we can inject changes centrally, and they would be applied immediately to all fragments which we serve. We do that in Tailor, and it's a UI component library. We are not there yet; right now it's more or less at a technical concept or experiment stage. We will develop that pretty soon and bring it live.

One thing I want to note here is that this architecture is live right now. We built it up while moving to AWS with the checkout, and if you're in Italy or in Poland
you would go to the new checkout, at least on your mobile phone. So it works; we didn't have any big incidents there. If you're interested, just write me a mail or something and I can connect you with our engineers who came up with that.

Please remember to rate the session. That's also something about this mastery aspect: of course I want to get better, so it would be nice to get some feedback here. And if you want, we can do some questions.

So the majority of the questions were actually about how to convince middle management to convert from a monolith to microservices. What would you suggest there?

Hmm, we didn't have that problem, in fact, because most middle management, especially in tech, are also people who are interested in new technologies and new approaches to technology, so they were behind that. I think the more difficult part was to convince them to give up control and to give autonomy to the teams. That was the tricky part.

OK, there's a question about the autonomy of the teams and how you internally communicate available services within the organization. If you have lots of small groups of people defining lots of small services, how do you let them know about each other?

Yeah, it's a good question, a fair one, because if you have autonomous teams, they tend to isolate themselves a little bit, because they don't need to speak to everyone. So there are some teams which just do their own stuff, and that's kind of bad. What you need to do as an organization is create new communication channels for the teams to get together and talk about what they're doing. Also with regard to new services: if a team is building a new service and in the end no other teams are using it, it's of no value to the company. So the adoption rate of new services, especially internal ones, is a KPI which we use to measure how successful that team is.

How do you measure that? Is that automatic, or...?

No, it's not automatic, but for example, if we build a new
continuous delivery service for other teams, then you can count the number of teams using it as a KPI.

OK, good. There's also a question here: how large are your teams in general?

We follow this two-pizza rule, so we try to be between two and twelve people, where I would say the best size is six, sorry for that, six to seven people.

OK. What kind of microservices architecture are you using? Are you using synchronous calls or asynchronous calls?

It depends on the team, in fact. We try to go asynchronous as much as we can, but if that's not possible we try to use Hystrix.

And the last question here is: do you feel that your approach to doing microservices on the front end has been a success, or is it getting there? What's your current gut feeling?

I think that the basic architecture is a success, because it runs and doesn't have a lot of problems. But we're only at the beginning of moving everything there. As I said, it's just the checkout for now; other teams are currently ramping up their migration to this new architecture. One thing which is an issue there is that we're not only rebuilding the shop, we're rebuilding everything: the entire back-end services, everything from article services to stock to payment. So we're trying to build something on top of a moving platform right now, which makes things a bit complicated.

OK, thank you. Thanks.
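The resilience ideas that come up in the talk and again in the Q&A (expect dependencies to fail, and offer a degraded response instead of an error) are what a library such as Hystrix packages up. A minimal circuit-breaker sketch in Python, assuming a hypothetical `CircuitBreaker` class with made-up thresholds, might look like this; it illustrates the pattern only and does not reproduce Hystrix's API or Zalando's code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Illustrative circuit breaker with a fallback (hypothetical, not Hystrix)."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While the circuit is open, skip the failing dependency entirely
        # and serve the degraded response.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a placeholder for whatever dependency might be down; the empty list is the degraded answer the customer still gets.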
Info
Channel: GOTO Conferences
Views: 31,740
Rating: 4.8269229 out of 5
Keywords: GOTO, GOTOcon, GOTO Conference, GOTO (Software Conference), Rodrigue Schaefer, GOTO Stockholm, GOTOsthlm, Zalando, Microservices, Monolith, Computer Science, Programming, Software Development, Software Engineering
Id: gEeHZwjwehs
Length: 37min 39sec (2259 seconds)
Published: Mon Aug 08 2016