Comparing the Network Performance of AWS, Azure, GCP, IBM Cloud and Alibaba Cloud

Captions

So, as Alex mentioned, my name is Angelique Medina. I work for a company called ThousandEyes, where alongside my colleagues I get to work on some really interesting research projects, like looking at the performance of DNS and CDN services. The subject of this session today is to share findings based on a research project that had its genesis about 18 months ago, when we started looking at the network performance and behavior of some of the major cloud providers. We saw that there wasn't a lot of independent, publicly available data on how cloud operators were running their networks, so we wanted to fill that gap. The first iteration of this research was released in late 2018 and presented here at NANOG in San Francisco last year. The second iteration greatly expanded on what we did in 2018, and that's what we're going to go through today.

The initial findings from 2018 were very interesting because, as you would expect, overall network performance of the cloud providers is very good, but there were some surprising anomalies that revealed differences in routing preferences and in how they managed their overall network architecture and connectivity. In 2018 we looked at three major cloud providers: AWS, Azure, and GCP. In the most recent research, which was published a few months ago at the end of 2019, we added two public cloud providers, Alibaba Cloud and IBM Cloud, and we also took a look at AWS Global Accelerator, a service that was announced just after our initial research was published. We also added testing to look at intra- and inter-mainland China connectivity, as well as performance from vantage points located within US broadband provider networks. So, a pretty significant expansion on what we did. And because we had a baseline data set from 2018, we could look at year-over-year changes for Azure, AWS, and GCP, which is really interesting; we'll spend most of our time looking at that. So in terms of our
research methodology, just at a very high level: the way that we derived these measurements was using a custom form of TCP-based traceroute between two endpoints that we controlled. We had an agent at either end, we were sending probing packets between them, and from that we were able to look at performance indicators including end-to-end latency, packet loss, and jitter, as well as hop-by-hop measurements, so we could actually map out the path between these two endpoints. We also looked in both directions, so it was bidirectional: we were looking at each of these measurements independently one way and then aggregating those. The unit we're using here is principally milliseconds, looking at averages. Also, because we could map out the paths, when we saw anomalies we could go back and look at what the layer 3 hops were between the two endpoints, and we enriched that with some metadata like geolocation and AS operator information.

The software agents that we used for these tests were either in regions hosted by the public cloud providers, or they were located outside the providers in hosting facilities that were connected to tier 1 and tier 2 service providers or, in some cases, broadband providers, and they were managed by us. That allowed us to do consistent measuring every 10 minutes over the course of 30 days, which yielded a lot of data. Most of the data wasn't super interesting: with packet loss and jitter there was not a lot to see, with the exception of China. But with latency there were some interesting differences in how the providers performed.

So in terms of the scope of our research, the questions that we principally wanted to answer were, first off: what is the performance for users (that could be an individual or an application tier) connecting to the various regions of the cloud providers? What we did is we used vantage points across a number of different cities globally, so 98 locations, and we tested from those locations to
each of the regions that you see here for the cloud providers. There were 15 regions for GCP and AWS, 25 for Azure, and then 21 and 17 for Alibaba Cloud and IBM, so a pretty significant set of measurements that we were able to collect.

Separately, we also looked at the performance for vantage points that were single-homed to US broadband providers. These were six providers that we were looking at across six US cities, and because we were just looking at users connecting from US locations, we only connected to four cloud regions for each of the providers, so we just looked at North America: US West, East, Central, and Canada.

Then inter-region measurements were taken for each of the cloud providers between region pairs, so there's just an incredible matrix of measurements that we were able to take. We're looking at, for instance, US West and US East, and at the performance between those two points. Something that we added since last year was a baseline measurement, to compare how those providers were performing relative to some baseline performance. What we did was, in the instances in which a cloud provider had, let's say, a region hosted in San Jose and another region hosted in Ashburn, then in addition to measuring between those two points, we also picked a vantage point outside of the cloud provider's network connecting to a vantage point that was also outside of the cloud provider's network, in those same cities. So, testing from San Jose, say, to Ashburn, and then looking at the performance difference: are they performing the same, better, or no different than this Internet baseline? It's just one mechanism to see if there are vast disparities or if, as we would expect, they're much more optimized connecting between those two points.

We also looked at the performance within specific regions, so we picked a handful of regions and
looked at the performance of different availability zones within that region connecting to one another. We picked three availability zones for the handful of regions that you see here. What we wanted to understand was this: if an enterprise is putting together an architecture and they want to host in a particular region, they might choose to host in both US West 1a and 1b, or some combination of a, b, and c, and that enables them to have a more resilient architecture while still being geographically hosted in a similar location. All of the cloud providers claim that their inter-availability-zone performance is less than two milliseconds, so we wanted to see if that actually held.

One thing that we won't cover today, but something we also looked at, is the interconnectivity between cloud providers. So between, say, a region hosted in AWS and a region hosted in Azure, are there interesting things to observe in terms of how they connect to one another? As you would expect, they're very densely peered with one another, so traffic flowing from a region in one cloud provider to a region in another cloud provider typically doesn't go over the Internet; it just gets handed from one cloud provider to the other. There is one exception to this, which I will touch on very briefly.

So, looking first, just to give you some context for the findings we're going to cover: there's a lot of data to unpack, so this is a bit of a whirlwind tour. We're just going to cover some highlights and then look more closely at some anomalous behavior. As I mentioned before, we're using a baseline to get an understanding of how the cloud providers are performing. Across all of the cloud providers, they performed very well: in almost every instance, the performance between region pairs was better than the Internet baseline that we measured. So, for example, for IBM Cloud, 97% of
their inter-region pairs were performing better than the Internet measurement that we were taking between similarly located endpoints. On the opposite end of the spectrum we have Alibaba Cloud, where about 15 percent of them were performing worse than the Internet baseline measurement we took. The interesting thing about Alibaba Cloud, which we saw when we dug a little deeper, is that it's unlike the other cloud providers. Across the board, with AWS, Azure, GCP, and IBM Cloud, when they're connecting between their own regions, they always use their own backbone. That is not the case in every instance with Alibaba Cloud. In some cases, connecting between two Alibaba Cloud regions, you're connecting over the Internet. And that was the same connecting to other cloud provider networks: sometimes it would go over the Internet, and sometimes, depending on their peering, you would connect directly to that cloud provider. So they were definitely an anomaly when it came to that, and it seems to have impacted their overall performance that you see here.

So, quickly touching on inter-AZ performance. We're looking here at the handful of regions that we tested, and overall all of the cloud providers look pretty decent. They claim less than two milliseconds inter-availability-zone performance, and that was pretty consistent across the board; everybody does really well. A couple of regions of IBM Cloud were kind of at that two-millisecond watermark, but overall pretty good. Azure was the most consistent, at about 0.7 milliseconds for all of the regions that we tested. And then for GCP and Alibaba Cloud, for example, there was a little bit more of a range: Alibaba Cloud goes from 0.7 milliseconds all the way up to over one and a half milliseconds, so a pretty wide range of performance there.

Now, these very low latencies were something of a red flag for us. I mean, if you're hosting in different
availability zones within the same region, and you're using them for redundancy purposes, you want to know that these availability zones have, for example, independent power sources and are differently connected, so that they're truly redundant. When we see this type of performance, where the latency is that low, it begs the question of whether that's really the case. Are they in a different rack, or on a different floor of this data center? Are they truly networked and powered independently of one another? I think that's at least worth asking the cloud provider if you use them. So that was just something that stood out to us.

The other thing that we were able to show, in looking at all of the measurements and how traffic gets routed from user locations (so these are points outside of the cloud provider networks) to the various regions, is that there are really two ways in which the providers prefer to route traffic, and these patterns are really consistent across the providers, again with one exception. There's one routing preference where, for traffic from user locations, the preponderance of the journey to the service hosted in the cloud provider takes place over the Internet. So if you're a user in Frankfurt, you might connect across the Internet and then enter the cloud provider's network very close to where the service you're trying to access is hosted. The opposite of this are the providers that are very backbone-centric: they pull users into their network as soon as they possibly can. In many instances you might see that you're entering the cloud provider's network within just a few hops, and then you're going to be riding their backbone all the way to the service that you want to reach. It doesn't matter where it is in their network; it could be on the other side of the globe. You're effectively going to be
primarily using the cloud provider's network. So these are very distinct approaches to routing users to services and, like I said, from a pattern standpoint the providers have very clear preferences. Azure and GCP have a very extensive edge, and users connecting to services hosted in those cloud providers get pulled into their network very, very quickly. The traffic doesn't spend very long on the public Internet; you're primarily going to be riding the backbone, and that goes for both the forward and the reverse path. So a lot of the service delivery for users of services hosted in these cloud providers is going to be done by the cloud provider's network.

Alibaba Cloud and AWS are very different. They prefer to keep users on the Internet as long as possible, and they don't use their backbone as much. So, again, you're going to be subject to the performance of ISPs, depending on where your user is and where the region you're connecting to is located. And then there's this third category that we saw this year, which was IBM Cloud. They have a hybrid approach, I guess you could say: in some parts of their network you are pulled into their network very quickly and you're going to ride their backbone, and Ashburn is an example of this; and then there are other regions where you're going to be connecting over the Internet until you're almost right on top of the service. So, again, very different approaches, and IBM is kind of a mixture of the two.

Does this make a difference in terms of performance? We'll see that in a moment. I think overall there does seem to be an impact, but it's not as clear-cut as "the Internet is worse" or "the backbone is better." It really depends on some fundamentals, like how you're implementing routing and whether you're optimizing, and you can have issues whether it's a backbone or the Internet. We'll see some examples of this. So
some of the more interesting things that we're going to share today are around comparing differences between what we saw in 2018 and what we saw in 2019. This is just looking at the three providers that we looked at in 2018, so we have AWS, Azure, and Google, and this is for users around the globe connecting to a region hosted in Mumbai. So this is just for this one region: 98 locations globally connecting to Mumbai. Look at what we see for users located in Asia, which you can see on the left-hand side. AWS is yellow, by the way, blue is Azure, and red is GCP, in case you can't see that. What we saw was that not only was latency higher for AWS, but the variation, effectively the min-max range of the latency measurements we were seeing, was just so much more extreme than it was with the other providers.

Now, one could argue: okay, if you're connecting from locations in Asia, maybe the service provider connectivity isn't as good as you might get in North America, so could that be contributing to the latency? Well, interestingly enough, if we look at the difference this year, latency improved, and the variability in the performance numbers that we were seeing also improved, so they were more consistent in their measurements. Bear in mind, they haven't changed anything in terms of their preference for routing users over the Internet, so it's not likely that that was the source of the issue, and in fact we'll look at some examples of what the issue actually was. So it's interesting to see that even though they're using Internet connectivity, they are still able to effect optimizations and make improvements year over year. And of course this is just looking at their regular service. We'll come in a few minutes to how they perform from a Global Accelerator standpoint, which is meant to do what Azure and GCP do by default, which is to allow users
to ride their backbone to services.

So this is a couple of examples of what we saw and what was contributing to some of the higher latency numbers. On the left-hand side you see two green dots. The top one is a vantage point located in Seoul (sorry, I know the text is quite small). For traffic from that location connecting to the Mumbai region, you'll notice just the huge number of hops. Those blue hops that you see there are ISP network hops, and the green dots are AWS's backbone, and, again, from a proportion standpoint you see a lot of time spent on the Internet. The reason for that, it turns out, is that users in Seoul were getting connected through New York, so they were effectively circumnavigating the globe to get to a region hosted in Mumbai. If you pull out a map, you'll see that that's really not the optimal path from Korea to India. So that was introducing a lot of latency and, of course, contributing to the performance difference that we were seeing there.

Now, what changed in 2019? We can see here that some changes were made in terms of how traffic was getting routed. From that same location, Seoul, they are now connecting through Singapore to reach AWS's network, so they're not going around the globe; it's not that odd, suboptimal path. It's the same for Singapore: one of the differences we saw was that an Equinix facility popped into the path, and there are really just a few hops between that location and a more optimal, I guess, direct connection through Equinix to AWS's network. So even though, again, they favor the Internet, they were still able to make these optimizations for routing customers to them.

Another interesting thing that we looked at was GCP. One of the things that we saw (and, again, they favor their backbone) is that they had a really significant latency difference compared to Azure and AWS. So why is that? So when
we looked at the actual network paths, we could see that traffic from users in Europe was again getting routed around the globe to India. GCP really does favor their backbone and using their own connectivity, and they don't have (excuse me, they did not have, last year) direct connectivity from Europe to Mumbai, or India, so they were routing traffic around the globe to get to that point. Now, actually, at NANOG last year, when we presented some of this information, someone from Google stood up and said that they were going to be making a change to their network infrastructure, adding direct connectivity to India from Europe. So this year we were really excited to retest and see what difference that made in terms of performance, and what we found was that there wasn't really any difference. It was almost identical to what we saw last year, which was really surprising to us, because if you go to Google's published infrastructure and network map, you can see that they did in fact add direct connectivity to India. So they had the infrastructure, they had the fiber, they leased it or they owned it, but we were still seeing this circuitous path. Users, again, were getting pulled into Google's network in Europe, and then they were riding the backbone around the world to India.

So we actually reached out to GCP and asked them about this. They said that even though they had the infrastructure in place, they were still in the process of rolling out these routes to all of their regions globally. What they said was that it's still kind of in beta and they expect to do that over the next few months. They did say, though, that this route was available for users in the Middle East. We were testing from Dubai, and we were not seeing that for our vantage point: the user in Dubai was also getting routed around the globe, so they weren't taking this new path. But we do expect, as Google says, that they will be rolling out this new route to more regions over the next few months, so when we test later this year we'll hopefully see some improvement in performance.

This next one is not explicitly performance-related, but I think it's worth noting because it was a change from 2018 to 2019. For the paths that we were mapping out when we were testing from locations outside of GCP to a region hosted in GCP (it didn't really matter where it was), we were suddenly losing visibility into the reverse path. This was across the board, so it wasn't specific to a particular region; it was just a change in the visibility that we had. So what was the root cause? Well, if you go to GCP's network help pages, they've done some publication on this, and they basically say that they've made some changes where, for any traceroute to an Internet destination (so this is not the case across their own network or for anything that's inbound, really just anything that's out on the Internet), they are adding to the TTL counter, and that's effectively breaking traceroute. They said that, depending on the number of hops, you may lose some visibility. But from the result that we've seen (and I don't know where they're setting that counter; my guess would be some really high value like 255 or 128 or something like that), it didn't really matter where the user was located: there was never enough decrementing of the TTL for us to actually get any response. Because of that, we can't see the reverse path, which I think is interesting to note, because there is this trend toward cloud providers starting to monetize their backbone, and if you don't have visibility into how traffic is routed, then it's hard to know whether or not you're getting the specific service that you expect. And we're running kind of low on time, so I'm going to breeze through some of this
stuff. So, performance from China. I would say that overall all of the providers do pay a toll. It's not as if Alibaba Cloud, or any other cloud provider, is exempt from the issues that affect any user connecting from China to a region outside of China. This is packet loss, so pretty high across all of them. Again, we don't really have time to look at this too much, but there are some viable locations to host in outside of China, not only from a packet loss standpoint but also from a latency standpoint. Hong Kong is a good option: Azure, Alibaba Cloud, all look pretty decent.

So, moving on from China, we wanted to look strictly at US broadband providers connecting to regions that are in North America. As you would expect, overall connectivity is really strong. From Chicago, connecting to Azure East, of course, latency is going to be a lot lower than to anything on the West Coast. But we did see some really odd stuff. For example, from Verizon on the West Coast, connecting to a GCP region hosted in LA, there were really significant levels of latency, just really bizarre. So we looked more closely at that, and what we found is that traffic was entering GCP's network in New Jersey (these are West Coast locations) and then hairpinning along GCP's backbone to get to LA. So it didn't really matter that you were connecting into their network, or that they are more backbone-centric: this was still a routing issue that was impacting users, and obviously it wasn't optimal. We alerted them to this and they very quickly made a change, so you can see this dramatic drop in latency for users on the West Coast, and then it just went back to what you would expect to see if you're in that region.

Global Accelerator is a really interesting one, so I'm going to try to get through it in a minute. This is effectively designed to pull users onto AWS's backbone very quickly, so closer to where the user is located, and that works bidirectionally, so you're going to have more of the service delivery done by AWS versus over the Internet. So how did it perform? I mean, as a premium-tier service: it depends. It wasn't uniform in terms of improving performance. Green is where performance was improved, orange is where it was effectively the same as the regular way that they route traffic, and red is where it actually performed more poorly than their standard way of routing. So it really depends on the location, and they are in fact working on optimizations since they saw some of this. As an example of this, we tested in October, they said they were going to be making some changes, and we did in fact see an improvement in some of the locations we looked at where it wasn't as great as we expected it to be. So they are responsive. I would say that, across the board, cloud providers are in general very responsive: if you show them where there's a routing or performance issue, they will usually do their best to make it right.

So, just as a summary, very quickly. Cloud provider preferences vary; it's not uniform. They could be more Internet-centric or more backbone-centric; it's really up to the cloud provider, and that can change over time. Inter-region connectivity is pretty good and really just rides the backbone across all of them, except for Alibaba Cloud. With AWS Global Accelerator, you want to know that you're getting the performance you expect; you can't necessarily assume that just because you're paying for a higher level of network service, that's exactly what you're going to get, so definitely make sure that you have some visibility there. And then the GCP Europe-to-India backbone route is still rolling out across most of their geos, so that should be something that we will see over the course of this year.

Takeaways: things change. Trust, but verify. Of course the cloud providers want to do right by
their customers. They operate massive global networks, and sometimes, in optimizing things or making changes, there can be unintended consequences, so you want to make sure that you can see what the impact of those changes is. It's really important to get visibility into how your cloud provider is performing for you. If you want to see the fuller report on this, you can download it at this web address. And the obligatory social media plugs are here: my Twitter handle is @bitprints, and my colleague Archana Kesavan, who is the lead author of this research, can be reached at @archana_k7. If you have any questions, we can be reached there. I don't know if we have time for questions. We don't? Okay. Well, if there are any questions... Okay, who wants to go first?

David Paul Zimmerman, LinkedIn. Did you capture and/or look at any data on IPv4 versus IPv6?

We did not, no. We were just looking at IPv4. We have done IPv6 for some other research that we've done, and that may be something that we cover in the next one. It actually was on our agenda to look into and potentially include, but we were pulling in so many new things that we just decided to hold off, so we might do that later this year.

Hi, Avi Freedman. Question: every ten minutes you're doing these tests. How many packets are actually sent between each of the zones or external regions? So not the trace, but packets that are actually hitting each end.

So you're asking how many packets are being sent in each direction? Roughly 50 packets.

Okay, that'll hit each end?

Yes.

Okay, thanks.

[Applause]
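
The baseline comparison described in the talk (latency between two of a provider's regions versus latency between vantage points outside the provider in the same two cities) can be sketched roughly as follows. This is a minimal illustration, not the tooling used for the research: the function names are hypothetical, the numbers are made up, and averaging the two one-way measurements is an assumption about what "aggregating" the bidirectional results means.

```python
from statistics import mean


def bidirectional_latency_ms(forward_ms, reverse_ms):
    """Combine one-way latency samples measured independently in each direction.

    Assumption: the two directions are aggregated by averaging their means.
    """
    return (mean(forward_ms) + mean(reverse_ms)) / 2


def share_beating_baseline(cloud_ms, baseline_ms):
    """Fraction of region pairs whose cloud latency is lower than the Internet baseline."""
    pairs = [pair for pair in cloud_ms if pair in baseline_ms]
    better = [pair for pair in pairs if cloud_ms[pair] < baseline_ms[pair]]
    return len(better) / len(pairs)


# Hypothetical numbers for illustration only; not data from the report.
cloud = {
    ("san-jose", "ashburn"): bidirectional_latency_ms([30.0, 32.0], [31.0, 33.0]),
    ("frankfurt", "mumbai"): bidirectional_latency_ms([120.0, 124.0], [118.0, 122.0]),
}
baseline = {
    ("san-jose", "ashburn"): 35.0,
    ("frankfurt", "mumbai"): 110.0,
}

print(share_beating_baseline(cloud, baseline))
```

With these made-up inputs, one of the two region pairs beats its Internet baseline. A figure such as "97% of inter-region pairs performing better than the baseline" would be the same kind of ratio computed over a provider's full matrix of region pairs.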

Info

Channel: NANOG

Views: 14,175

Id: 8Ojw4-EWIBU

Length: 31min 47sec (1907 seconds)

Published: Fri Feb 21 2020