RabbitMQ vs SQS: Build or Buy?

Captions

Hey, my name is Greg. I work at Carta, where I'm an information security manager, and today I'll talk to you a bit about our experience with queuing mechanisms and how we had to make the decision between build versus buy.

So this is Carta. If you haven't heard of us, we help private and public companies and investors manage their equity: cap tables, valuations, investments, and equity plans. So you can imagine, from two founders in a garage all the way up to thousands-of-people public companies, we're helping people manage their stock, selling it through tender offers, getting valuations, all of that. The reason we do that is because we want to create more owners. The vision we have is a whole world where every single person that works for a company gets ownership in that company at some level, and we want to make that as easy as possible and push people over the edge so they take that leap as an employer to create more owners.

We were founded in 2012 with three people. Now we're seven offices, and it's more like five hundred people now (I wrote this slide like three weeks ago), and then ten thousand subscription customers and seven hundred thousand shareholders. Sorry, extra zero there. And these are some of the customers that we really love to serve.

The reason I'm giving you all this context, and it will come up, is because every technology decision we make now has to be tied back to our core competencies and our mission. We have a core competency and we want to make sure we focus on it: moving financial technology problems from the physical to the digital and solving them. If you've ever heard the talk "We Are Software People" by Twilio's CEO, we kind of follow that principle: once you get something into the digital, you can do so much more with it. So we can solve really awesome problems
with technology, as long as it goes from paper certificates and Excel spreadsheets into our web platform.

So we had this problem where we were going from monolith to microservices. Age-old problem; I'm sure a lot of you are familiar with it. We were starting to break apart our monolith, but because we're a FinTech company we have this auditing system that is really needed. It has to be highly available, and we can't drop a single transaction. So we had to maintain that auditing system, which at the time was implemented as a connection between our monolithic database and an auditing database. It was just a tie right there, using Postgres replication, if you're familiar. We needed to move from that into something more decentralized, more of an event stream: publishing events on some sort of bus, and then we would pull them and put them into the audit system.

So we had that constraint. We were also working towards being cloud agnostic, in case we needed a warm backup in GCP, or multiple regions, or things like that. And we were also designing for hyper-growth. We've been in hyper-growth for years now, and it really does a number on you, so a lot of this talk goes into what that actually means from a technology standpoint when you're trying to design for hyper-growth.

So we ordered our priorities in this way, but we also tried to learn from the wisdom of others. We were like, there's absolutely no way we're the first people to have this event problem; let's go learn from everyone else. We learned that LinkedIn had the same problem many years ago and moved to Kafka. I think they actually wrote Kafka, but they moved to this model, and we were like, okay, brilliant, this solves all of our problems and architects for what we can foresee, four or
five years of technology problems. Let's move towards this architecture.

So we started evaluating message buses. We looked at three, Kafka, SQS, and RabbitMQ, because they were the most highly recommended ones at the time. This was about a year and a half, two years ago, so these technologies have changed quite a bit since.

Nobody in the company had experience with Kafka, so that was kind of a minus. Maintenance of Kafka, we heard, was kind of a bear because of ZooKeeper, and if you're adding more servers to manage the servers you're trying to manage, we thought that was a lot of operational overhead. So maintenance isn't that great, but it is cloud agnostic, and the performance was just bonkers. We were looking at that and going, there's no way we're ever going to need that many events. So we were just saying, okay, top-notch for Kafka.

RabbitMQ was similar, except for the famous last words we heard early on: "I set it up, it's bulletproof, I never had to touch it at my last companies." None of us had experience with that one either, but maintenance was good according to the research we did and the people we had talked to. It was also cloud agnostic, and even though it wasn't as performant as Kafka, it was still way more than we would ever need. Maybe not ever, but for the foreseeable future.

Then we looked at SQS. Still no experience with it across the company. The maintenance is great, right? Amazon handles all the servers; you don't have to do a damn thing. Cloud agnostic? No, definitely not. But it was still way more performant than we needed.

So we looked at these trade-offs. A note on performance, though: if anyone's ever tried to evaluate the performance of a queuing mechanism, you know that it's more complicated than AI sometimes. There's a white paper we read that's an example, but there are so many facets to queuing
that, when you talk about performance, it was just impossible for us to quantify. So when I say "top-notch" and "more than we need," I'm just summarizing, because there's so much. If you read this white paper, you'll get a taste of the many hours we spent trying to figure out the best technology.

We ended up looking at SQS like, this is bonkers cheap for a queuing mechanism, and we were really impressed. But the thing was, we thought that we needed cloud agnostic more than we did. We were optimizing more for cloud agnosticism than we needed to, and we underestimated the value of not having to manage any of these servers ourselves.

So we looked at Kafka: crazy fast, ZooKeeper is a little bit annoying, as we talked about earlier, but it was cloud agnostic. And RabbitMQ is great because it's a middle ground, right? Kafka takes a lot of maintenance; RabbitMQ was apparently bulletproof, and we're just paying for the RabbitMQ nodes, not everything around them to keep them going, like ZooKeeper. And it was still way more than we needed. So we went for RabbitMQ. We were like, huzzah, there we go, we solved it. Let's create some servers, let's configure them. We set up some consumers, set up some producers, and we were off to the races. We built so many new microservices on top of this. It was great. It was a victory.

Six months later, we had this new requirement. Because we're a FinTech company, even though we're a relatively young company, we have enterprise customers, and they have enterprise questions. Especially in FinTech, we're moving millions of dollars through our pipes, and so people ask, "Hey, do you patch your servers?" We said yes. "Can you prove that you do that on a regular, consistent basis?" And we were saying, okay, we can and we will, but it's much easier to prove to you if we do regular, time-based patching, like every month or week or day, as opposed to
whenever we get an alert that something we have is vulnerable.

One of the things about patching, if anyone's ever had to deal with it, is that you have to reboot servers. Rebooting servers is a kiss of death when it comes to availability, because you have to have some sort of pool and manage everything around that, and RabbitMQ did not handle that the best in the way that we had set it up, in its HA format. So we had to go back and say, well, now the maintenance has gone from good to not as great. And we still didn't have much more experience, because if it is bulletproof and you never have to touch it, you don't actually gain experience with it; you just set it up and forget about it.

So we looked at Amazon again, and we were like, okay, it's Amazon. Everything in Amazon kind of smells like Amazon; you can pick it up pretty quickly if you know anything about Amazon. And maintenance is still just as good. But one of the crazy things is that it got better over time.

What we learned was that all major tools are great at something, or at many things. Whenever someone's like, "oh yeah, React sucks," or "oh yeah, this tool is great," always take that with a grain of salt. Everything is good at something. RabbitMQ was good at something: we never had to touch it. But if you have to touch it, it may not be the best thing. HashiCorp Vault has been great when it comes to the actual technology of storing secrets, but when it comes to sealing and unsealing, and the automation of all those things when you have to patch it, it's not as great. And Rancher, the one we used for Docker orchestration, is a similar idea: a lot of trade-offs in that way.

So we went back and said, hey, how can AWS help us better with this? And we found that a lot of the time they provide better trade-offs, because we had been evaluating things in the wrong way. And one of the amazing things about Amazon is that
it seems to get better at an accelerating rate. When I was evaluating SQS a couple of years ago, I don't even know if it had exactly-once delivery, or how good its throughput or latencies were, but I do remember being surprised, when I researched it again, at how fast it had evolved. That was amazing. I've never seen a company that has the gall to say "unlimited throughput." That's insane; that's a huge claim, and obviously Amazon is big enough to do that. So we were saying, okay, we're going to go in on Amazon; this is an amazing technology, let's try it.

There are always trade-offs. If you come talk to me afterwards, you'll probably say SQS has some latencies that aren't as good as Kafka's, all these things. I totally agree with you, but for our use cases this was far more than we needed.

So we went back and said, hey, what other build-versus-buy decisions are we making, and how can we leverage "buy" better? You might be thinking, wow, you're going to spend a lot of money on SaaS now if you're going to keep buying everything instead of building it. But what we realized is that we were miscalculating the value of designing for hyper-growth. At a startup, everyone knows, yeah, we're going to grow a lot, and that means we're going to be bursting at the seams. But until you live through it, I'm not sure one learns the lesson. When you're one person, you started three years ago, and you're tripling every year, you're going to have like nine people under you at the end of three years. And if you have one problem at the beginning of three years, you're going to have nine problems at the end of three years. So a lot of the time you have to really think: am I designing for hyper-growth? Because hyper-growth, when you're in it, is the most important thing to keep going, because as soon as you come out of hyper-growth, the compounding rules of growth start to die out. And so every technology
decision has to be thought of as a part of the business. When we talked about this, we started to reformulate it and teach our engineers that the job of everyone here is to keep hyper-growth going without being irresponsible. So the job of everyone at a startup undergoing hyper-growth is to design in anticipation of future hyper-growth.

What we learned was that if we can set up Amazon once, we don't have to patch it. As a DevOps manager, I spent a month setting up all these different systems and tooling around my previous queuing system. I could have spent that month building out a whole new infrastructure around a new product, right? And if I enter a new product space a year earlier, or even a month earlier, the compounding rate of growth because you entered a month earlier is insane. So you start to see the business trade-offs in a technology decision like this.

So we went back and said, okay, we need to buy our way out of these problems when we can. If you have the money and you are hyper-growing, it is our opinion that it is better to buy than to build a lot of the time, because our core competency has nothing to do with queuing systems. To go back to that: we're trying to create more owners; we're trying to solve financial technology problems that have never been solved before. A lot of our engineering is devoted to the software development side, because that's where the leverage is, so our goal is to help them as much as possible to keep building amazing things. That means anything we can do to get clutter and operational overhead out of the way and move into a space of being able to iterate rapidly, where we're being good technologists and good decision makers when it comes to technology.

So the cost of not designing for hyper-growth is that you're not designing things that are going to solve your core
problems or accomplish your mission. So that's kind of what we learned. Thank you very much for your time.
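
The architecture described in the talk (services publish events onto a bus, and a consumer pulls them into the audit system without dropping any) can be sketched roughly as follows. This is a minimal in-memory illustration, not Carta's implementation: Python's standard `queue.Queue` stands in for the real bus (RabbitMQ or SQS), and the names `publish`, `drain_to_audit`, and the event types are hypothetical. It assumes at-least-once delivery, so the consumer deduplicates by event id.

```python
import json
import queue
import uuid


def publish(bus, event_type, payload):
    """Serialize one audit event, put it on the bus, and return its id."""
    event = {"id": str(uuid.uuid4()), "type": event_type, "payload": payload}
    bus.put(json.dumps(event))
    return event["id"]


def drain_to_audit(bus, audit_log, seen_ids):
    """Pull every waiting event and append it to the audit log once.

    Redelivered events (possible with at-least-once delivery) are
    skipped by id. Returns the number of new events written.
    """
    written = 0
    while True:
        try:
            raw = bus.get_nowait()
        except queue.Empty:
            return written
        event = json.loads(raw)
        if event["id"] not in seen_ids:
            audit_log.append(event)
            seen_ids.add(event["id"])
            written += 1
        # Acknowledge only after the event is recorded or known to be a duplicate.
        bus.task_done()


# Hypothetical usage: two events, plus a simulated redelivery of the first.
bus = queue.Queue()
audit_log, seen_ids = [], set()
first_id = publish(bus, "share_issued", {"shares": 100})
publish(bus, "share_transferred", {"shares": 25})
bus.put(json.dumps({"id": first_id, "type": "share_issued", "payload": {"shares": 100}}))
print(drain_to_audit(bus, audit_log, seen_ids))
```

With a managed queue, the acknowledgement step would correspond to deleting the message only after the audit write succeeds, which is what keeps a crash mid-write from losing a transaction.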

Info

Channel: Amazon Web Services

Views: 6,511

Rating: 4.6326532 out of 5

Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, rabbitMQ, SQS

Id: _X2Akzqr2cw

Length: 12min 12sec (732 seconds)

Published: Thu Apr 04 2019