Carl Meyer about Django @ Instagram at Django: Under The Hood 2016

Reddit Comments

Also Pinterest And pretty sure Yelp
u/wskyindjar · Oct 23 2020 · 7 points

Guys on the Facebook side are probably jealous.
u/Nerdenator · Oct 24 2020 · 2 points

3 billion devices run Django
u/[deleted] · Oct 24 2020 · 5 points

They moved to in-house stuff a long time ago.
u/Natrium83 · Oct 23 2020 · 1 point
Captions
- Okay. So yes, Instagram. You may recognize us from Nadia's talk. (laughing) I'm Carl Meyer. I've been a Django core team member for about six years and an Instagram employee for about six months. This is me on Instagram. I'm gonna talk today some about how Django was part of Instagram's growth and how we still use Django today and how we make Python and Django work for us at scale. I know this is the fifth hour-long deep dive talk of the day so I've tried to do a layered slide deck. For those of you who are completely asleep, there are an adequate number of cat photos. (laughing) So I hope that wherever you are in your day, in your sort of mental process, there will be something here that you can appreciate. So, what's Instagram? We are officially about an app for capturing and sharing the world's moments. This means of course cat photos, lots and lots of cat photos. Every day, Instagrammers upload about 95,000,000 photos and videos to Instagram. This means in less than two weeks we get enough photos that if you printed them out and lined them up side by side, you would reach the moon. And these are not just any cat photos. These are truly excellent cat photos. People really like these cat photos. Roughly 4.2 billion times every day, someone likes a video on Instagram. All in all, our data store currently stores 2.3 trillion likes. That's roughly one like for every person on earth for every day of the year. I could go on with these analogies, it's kinda fun coming up with them, lots of zeroes. The point is, we don't just have cat photos, we have cat photos at web scale. (laughing) And every single one of those 95,000,000 photos and videos and 4.2 billion likes every day goes through this guy. But of course 4,000,000,000 daily likes is way too much to expect just one jazz guitarist to handle all alone, so we have, I'm not supposed to give exact numbers, but tens of thousands of Djangos, servers running Django, although internally we don't call them servers. We don't call them the web tier. We call them the Djangos. (laughing) Feel free to join me in visualizing this as a fleet of tens of thousands of mustachioed jazz guitarists. (laughing) Before we get too deep into the current state of things at Instagram I want to step back and do a quick run through Instagram's history. Starting back in October 2010 when Instagram began, coincidentally the same month that I joined the Django core team. I'm not sure what that says about my life choices versus some other life choices. But anyway, October 2010, when we had not tens of thousands of Djangos but I think just one. I talked to Mike Krieger, the cofounder and CTO of Instagram, about their early experience using Django. His first comment was that it was super easy to get started with, didn't require a lot of decisions, didn't require a lot of set up, made testing easy. All of these things allowed Kevin and Mike to build the first version of Instagram in about two weeks and in only a couple months get to their first million users. During those first few months they of course hit some of their first scaling pain points with Django, probably things that many of you have also encountered. Like, for instance, 404s or 500 errors sending an individual email for every single error. (laughing) If I'm not mistaken, this is an archival photo from the early days of Django at the Lawrence Journal-World. Maybe Jacob or Adrian can explain why Simon was so excited about the emails. 
Anyway, apparently at one point Mike and Kevin were literally getting an email for every browser request that came in to Instagram.com because they had a missing favicon, so every request generated a 404 and every 404 generated an email. Right around this time in fall 2010 is when Russ committed logging support to Django. Although the default behavior of Django is still to send an email for every error, at least it's now much easier to configure. So I think we can wave our hands a little bit and call this one fixed, thank you Russ. (applauding) While I'm on this slide I should also mention the easy testing that Mike referenced a couple slides back. As of the Django 1.3 time frame, that was also Russ's work. So thanks Russ for that as well. Also, like many of us I think, Kevin and Mike had to hack around the admin wanting to do a count-star of every row in a very large table in order to provide pagination. In 2010 they had to hack around this with monkey patches; today in Django 1.8 we now have an option to disable this, so I think we can also call this fixed. Of course managing static files was a pain. I think it still is, but around this time is when Jannis contributed contrib.staticfiles, which improved the situation quite a lot. So thank you Jannis for that. Like many of us again, Instagram had trouble with Django's transaction management, which used to be pretty bad. It would wrap reads in unnecessary transactions. It would do extra round trips to the database. It would leave postgres connections hanging in an idle-in-transaction state. For those of you who use postgres you may remember that. But of course, since Django 1.6 this one is also fixed, thanks to Aymeric. By June, July 2011 Instagram had about 5,000,000 users. At this point all of these 5,000,000 users, all of their data, was stored in PostgreSQL and accessed via the Django ORM. They did use vertical partitioning by model to allow storage in multiple databases. So there was a database for users, one for media, one for likes, one for comments, et cetera. This kind of vertical partitioning by model is actually very easy to implement using the Django ORM's multi-DB support and routers, and that's exactly how it was done. This isn't the exact code but it's the moral equivalent, updated a little bit for some API changes in modern Django. But it's quite easy to establish a mapping of model to database name. I think in Instagram's case there were also some cases where it was an entire app mapped to a database, but you can easily return the correct database from the db_for_read / db_for_write methods and then only allow relations between models on the same database. Of course, you have to do some fake foreign keys. You can no longer have referential integrity between models on different databases but it's not that hard to implement. So it made sense as a first step for scaling the ORM for Instagram. Shout out to Alex Gaynor for implementing multi-DB and once again, to Russ for mentoring that project. So, the Django ORM was managing 5,000,000 users just fine but at a certain point it became clear that the vertical partitioning wasn't gonna scale, there were just too many likes to store in one database. I'm gonna start saying we, it's a little weird cause I joined Instagram six months ago, so none of this involved me but it's simpler if I can just say we. So we picked a pretty typical horizontal sharding scheme to deal with this problem. 
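[Editor's aside, before the sharding story continues: a minimal sketch of the kind of database router described above. The model-to-database mapping and all names here are invented for illustration and are not Instagram's actual code; the router methods themselves (db_for_read, db_for_write, allow_relation) are standard Django.]

```python
# Hypothetical vertical-partitioning router; model labels and database aliases
# are invented. Each alias needs an entry in settings.DATABASES, and the router
# is activated via DATABASE_ROUTERS = ["path.to.VerticalPartitionRouter"].
MODEL_TO_DB = {
    "accounts.User": "users",
    "media.Photo": "media",
    "media.Video": "media",
    "likes.Like": "likes",
    "comments.Comment": "comments",
}

def _db_for_model(model):
    label = "%s.%s" % (model._meta.app_label, model.__name__)
    return MODEL_TO_DB.get(label, "default")

class VerticalPartitionRouter(object):
    def db_for_read(self, model, **hints):
        return _db_for_model(model)

    def db_for_write(self, model, **hints):
        return _db_for_model(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects stored in the same database;
        # cross-database "foreign keys" have to be plain integer fields.
        return _db_for_model(type(obj1)) == _db_for_model(type(obj2))
```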
So you take all of those many likes, too many for one postgres server, and you split them up into a whole bunch of logical shards. In our case, we split them up. We started with likes and we split them up according to the user ID of the user whose media, whose photo or video, had been liked. I have 10 here for illustration but actually several thousand logical shards, and we mapped these to postgres schemas. If you're not familiar with postgres, a schema in postgres is not like the SQL schema of a table. It's more like a namespace that can contain tables within it. So each logical shard corresponded to a postgres schema and you can of course have multiple postgres schemas on a single physical server. So this allowed a mapping from logical shards to physical servers that meant that the number of physical servers could grow over time as traffic grew without having to re-bucket the data. It just required updating the mapping from a logical shard to a physical server. So horizontal sharding like this is a little harder to implement inside the Django ORM. Today there are a couple of third party apps, reusable apps, libraries that will implement something like this using some complex model base classes and manager base classes. In 2011 those didn't exist. In fact, some of them are based on some of Instagram's work that was publicized in blog posts. So that wasn't available as an option. Also, Instagram wanted a solution where they could take a set of IDs that might correspond to objects contained in multiple different shards, fan those out with asynchronous queries, get results back, put them all together into one result set for the application, and that seemed especially difficult to squeeze into the ORM at the time. So rather than using the Django ORM, Mike sat down and wrote his own minimalist sharded ORM, committing the first version of it with the only appropriate commit message. (laughing) Work in progress. This ORM was really a fairly light wrapper around raw SQL. It included a lot of code that looked like this. Find the right shard, get a connection to the shard, put together some raw SQL and send it off to the database. This new storage layer was not the end of the Django ORM at Instagram. Initially it was only used for likes. It took over two years for all of the user data to slowly be migrated from the Django ORM into the new sharded system. Even today, we still use the Django ORM for some of our internal admin tools. But you could say that this commit in 2011 was the beginning of the end for the Django ORM at Instagram. Also just kind of an interesting side note, Instagram engineers at the time came up with a scheme for sharded unique IDs that's been much replicated since. One of the issues with sharding of course is that you can no longer rely on a simple database sequence or auto number in MySQL to generate your object IDs, because obviously each individual database is gonna have its own sequence of IDs and so you'll have clashes between them. So the scheme that Instagram settled on was beginning with a time stamp, then a unique ID for the shard, and then a unique sequence ID tacked onto that, and this gives you unique IDs that also preserve the property that if you order objects by their ID you get a chronological ordering, which was important to Instagram. And this can also be implemented in postgres as a stored procedure, so you can use it as a default for the ID field in your table just like you would an ordinary sequence. 
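[Editor's aside: a rough sketch of the two ideas just described, mapping a user ID to a logical shard and from there to a physical server and postgres schema, and composing a time-ordered 64-bit ID from a timestamp, shard ID, and per-shard sequence. The shard count, server names, epoch, and the 41/13/10 bit split follow the widely circulated Instagram engineering blog post rather than anything stated in the talk, so treat the specifics as assumptions.]

```python
import time

NUM_LOGICAL_SHARDS = 4096        # assumed; the talk says "several thousand"
EPOCH_MS = 1314220021721         # assumed custom epoch, from the blog post

# Assumed mapping of logical shard -> (physical server, postgres schema).
# Moving shards onto new hardware only requires updating this table.
SHARD_TO_PHYSICAL = {
    shard: ("pg%03d" % (shard % 8), "shard%04d" % shard)
    for shard in range(NUM_LOGICAL_SHARDS)
}

def logical_shard_for_user(user_id):
    return user_id % NUM_LOGICAL_SHARDS

def physical_location(user_id):
    """(server, schema) holding this user's media, likes and comments."""
    return SHARD_TO_PHYSICAL[logical_shard_for_user(user_id)]

def next_sharded_id(shard_id, per_shard_sequence):
    """64-bit, roughly chronological ID: 41 bits of milliseconds since a
    custom epoch, 13 bits of shard ID, 10 bits of a per-shard sequence."""
    millis = int(time.time() * 1000) - EPOCH_MS
    return (millis << 23) | (shard_id << 10) | (per_shard_sequence % 1024)
```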
The new sharded ORM worked very well but it did have its problems. Among those problems was Justin Bieber. (laughing) Justin Bieber was an early adopter of Instagram, and the new sharding scheme grouped everything by user, so all of a user's media and all of the likes on their media and all of the comments on their media were all stored in the same shard. This made sense for the data access patterns but I think you can guess what happens next. So, the Biebs posts a photo, everyone in the world, roughly, stampedes to like and comment on that photo, and you have a very hot postgres database on your hands. So for a while there, there are still war stories told today about how for quite a while every new Biebergram constituted a genuine ops emergency. (laughing) Memorizing Justin Bieber's user ID was actually a condition of employment at Instagram. You needed to be able to do things very quickly. So, this problem was eventually solved not with any radical changes but just by a lot of remapping of logical shards to physical servers, to make sure that, at the very least, Justin Bieber did not share a physical database server with any other popular account, and also some caching techniques. Anyway, by the following April, Instagram had passed 40,000,000 users, April 2012, and that's also when we were acquired by Facebook. So up till this point Instagram had been hosted on Amazon Web Services. Once we were acquired by Facebook, there began an effort to move off of Amazon Web Services and onto Facebook infrastructure, which we did; by early 2014, everything had been moved into a Facebook data center. And then a month or two after that Facebook conducted one of their regular disaster recovery exercises and cut the network to that data center. So everything had to be moved from that data center to a different Facebook data center and nobody really wanted to repeat that experience a third time. And of course we also wanted Instagram to be, itself, disaster ready in case something would take out a data center. So that began the effort to move Instagram to a multi region or multi data center architecture. One of the big challenges of this move was caching. So, previously in a single data center architecture, like many Django apps, Instagram's Django application managed caching manually. So a request comes in for some data, we go see if it's in memcached, if it is, great, if it's not, we have to go to the database, we get it back from the database, we put it in memcache, we send it back to the user; a write comes in, we have to go invalidate the cache and then we send it to the database, et cetera. Moving to a multi data center architecture we can use postgres replication to take care of syncing up the main data store between the centers. But we also have to figure out how to sync up the caches. If we just leave independent caches, well of course, we do need a cache in each region because for latency reasons you can't go across data centers to talk to memcache, but if we just have a separate memcache in each region, if Pat posts a photo in data center A, and then Jan posts a comment on that photo over in data center B, over in A memcache still has cached that there are no comments on this photo and so Jan isn't gonna see that comment for quite some time. That's a problem. 
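[Editor's aside: the single-region caching just described is the classic cache-aside pattern; a minimal sketch, with a hypothetical Comment model and key scheme, might look like the following. This is exactly the pattern that stops working once each region has its own independent memcache, which is the problem the next part of the talk addresses.]

```python
# Minimal cache-aside sketch; the Comment model and key format are invented.
from django.core.cache import cache
from myapp.models import Comment   # hypothetical model

def get_comment_count(media_id):
    key = "comment_count:%d" % media_id
    count = cache.get(key)
    if count is None:                                    # miss: hit the database
        count = Comment.objects.filter(media_id=media_id).count()
        cache.set(key, count, timeout=60)                # populate the cache
    return count

def add_comment(media_id, user_id, text):
    Comment.objects.create(media_id=media_id, user_id=user_id, text=text)
    cache.delete("comment_count:%d" % media_id)          # invalidate on write
```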
So the solution we eventually came up with for this involved using PGQ, which is an implementation of a queue inside postgres, to actually keep a queue of cache invalidation events inside postgres; that would get synced as part of postgres replication, and then a separate cache invalidator process in each data center would read from that queue and invalidate the appropriate cache entries in each data center. The point of all this really is just to say that this stuff is hard. And while we were busy trying to figure out how to solve these problems for our Instagram architecture, in the meantime Facebook had built TAO. TAO is, now here I have to reach for my notes, an eventually consistent distributed graph object store and write-through cache. It's built on top of MySQL and memcache and it had already been proven by Facebook at more than 10 times Instagram's scale. I won't get into a lot of deep details here on the design of TAO cause it's a little far afield from the topic of Django. If you're interested in the design of big data stores, there's a white paper and a blog post; if you search for Facebook TAO you'll certainly find those. Effectively, TAO takes all of this complexity, the complexity doesn't go away, but it wraps it up into a service that somebody else maintains. So it's not our problem anymore. So, effectively, with TAO, the picture from Instagram's perspective could look like this. We just ask for the data we need and send the data we need to write to TAO, and all the caching, all of the multi region stuff, all of the sharding stuff, all of that is just handled invisibly for us. Let's talk just briefly about what the data model looks like for TAO. It's a graph store. So for a typical Instagram case, let's say Jan follows Pat, so we have two nodes, Pat and Jan, and edges between them, Follows and Followed by. Then Pat posts a photo and so we have another node for the photo and edges Posted by and Posted between Pat and the photo. A comment, another node and a couple more edges. And then Pat likes the comment and we get a couple more edges there. That's really all there is to TAO's data model, and the operations that are exposed, as is typically the case with data stores that have to operate at a very large scale, are very few. You can basically count them on one hand. You can get a node by ID. You can get a range of nodes by creation time stamp. You can ask for all of the outgoing edges from a node and you can get the edges between any two given nodes. And that's, effectively, all that you can do. As it happens this data model fits Instagram's needs pretty well. So, in 2015, we began the process of migrating Instagram from postgres over to TAO, and I think just about a month ago we decommissioned our last PostgreSQL cluster. This cat is here in honor of postgres to express my personal sadness; many people at Instagram love postgres and we're very sad to see it go. But on the other hand, Instagram's data access code is very much simpler now. Nobody is really sad to see all of that hairy caching and sharding code go. So in June of this year we hit 500,000,000 monthly active people on Instagram, shortly after I joined, I'm sure it's not a coincidence. (laughing) So this summer we finally decided that it was time to fix the fact that we were still running on a heavily patched Django 1.3. (laughing) So, our approach to this problem was completely wrong. 
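[Editor's aside, stepping back to TAO's data model for a moment: its client API isn't public, but based on the handful of operations listed above, a hypothetical Python interface to a TAO-like graph store might look roughly like this. Every name here is invented; the point is only how few operations such a store exposes.]

```python
# Hypothetical interface to a TAO-like graph store; all names are invented.
class GraphStore(object):
    def get_object(self, object_id):
        """Fetch a single node (user, media, comment, ...) by ID."""

    def get_objects_by_time(self, object_type, start_ts, end_ts):
        """Fetch nodes of one type by creation-timestamp range."""

    def get_associations(self, object_id, assoc_type, limit=100):
        """All outgoing edges of one type from a node."""

    def get_association(self, source_id, assoc_type, target_id):
        """The edge (if any) of the given type between two specific nodes."""

# Usage sketch for the example in the talk:
store = GraphStore()
pat = store.get_object("user:pat")                        # a node
photos = store.get_associations("user:pat", "posted")     # Pat's photos
likers = store.get_associations("media:42", "liked_by")   # who liked one photo
edge = store.get_association("user:jan", "follows", "user:pat")
```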
If you visit the Django documentation it will tell you that if you want to upgrade Django you should go version by version, so 1.3 to 1.4 to 1.5, and at each step you should carefully read the release notes and check your app for where you're doing something that needs to be updated, and that's a great process. I've done it before. If we had done that I'm not sure we ever would have actually gotten around to the upgrade. So instead, this is what we did. We went straight from 1.3 to 1.8, just installed it and then ran the tests. And look, some of them failed, so we fixed things and then, rinse and repeat, basically, for several months. (laughing) Obviously also this is far too risky and large a change to either land in master in one go or to deploy all in one go. That'd be far too risky. So, rather than being able to simply convert the entire code base to Django 1.8 compatibility, we had to convert the code base to simultaneous compatibility with both Django 1.3 and 1.8. So as you can imagine that led to some pretty ugly version conditionals and compatibility code and in some cases entire 30 line methods copy-pasted, and here's the Django 1.3 version and here's the Django 1.8 version. So it was not pretty but it worked, and once we had the entire test suite passing, we ran it on some development servers to flush out anything that wasn't caught by the test suite, pushed it to some pre-release boxes and then started rolling it out cluster by cluster, watching carefully for regressions, performance regressions or errors or anything like that, and once we were confident that it was working and we had it rolled out everywhere, I got the very satisfying task of going through in one big diff and ripping out all of the Django 1.3 compatibility hacks. So now we are entirely on 1.8 and we couldn't go back to 1.3 if we wanted to, which we don't. So, we only really had two performance regressions to speak of with Django 1.8 compared to 1.3. Does anyone want to guess where in Django we might have experienced a performance issue going from 1.3 to 1.8? We don't use the ORM anymore. I think if we had, if we did still use the ORM, this whole process would have been more challenging. - [Man] Swappable TET notes? - Swappable model? - [Man] Replaceable templates. - Oh no, not really. We actually already use Jinja so there wasn't a lot of difficulty there. Actually the two issues that we ran into both related to internationalization. The first one: somewhere between 1.3 and 1.8, I'm not sure exactly in which version, but somewhere the feature of internationalized URLs was added, so you can actually have URLs that change depending on the active language. It was a very cool feature, but as a side effect of that feature, Django now will compile every URL regex in your project once per active language rather than just once, period. We have a lot of URLs and also a lot of active languages, and so even though this is something that only happens at start up, our uWSGI processes restart often enough that this was a serious issue for us. It's also something that we were able to fix pretty easily with a monkey patch, and I think Django could be smarter here, particularly in the case where you don't actually use the internationalized URLs feature. So I plan to look at this in the sprints. Similarly, somewhere along there, Django gained the ability to load translations from any installed app instead of from just one directory, which is, again, a great feature, especially for reusable apps that want to ship translations. 
But for us we don't keep any translations in our installed apps and we have over 100 installed apps, and it turns out that the way it currently works, for every installed app, even if there's no locale directory there, Django will ask gettext to load the translations and gettext will create a null translations object and do a bunch of work that really there's no point in doing. So again, that was a problem for us, and again I think it could be fixed. I plan to look at it in the sprints. The third monkey patch that we still carry is actually not a new one in 1.8; it's one we carried over from 1.3. So Django's lazy settings implementation means that every time you access settings.FOO it actually goes through a Python __getattr__ function call, cause it's dynamically going down to a wrapped object to get the actual value. Turns out that at scale Python function calls are pretty slow compared to attribute accesses and so this was a problem. So, this is the silliness that we do to take care of that. At start up, we loop through every setting and we force it into the dictionary of the outer wrapper settings object so that every future access of a setting will just pull it directly from that object's dict as a normal attribute access instead of going through the whole lazy settings thing. This probably breaks the override_settings decorator, which we don't use, we just use mock.patch, so that's not an issue for us. So, this probably maybe breaks other things too, so I can't really recommend it, but it was very effective for us. We had some views that accessed a setting in a hot loop where we saw a 10% CPU instruction gain just from doing this thing. I think this would be a little trickier, but I think it's also probably fixable in Django. The whole settings implementation is a little bit complex for what it actually does. So the conclusion of all that is with a few monkey patches we're now on Django 1.8 and we experienced effectively zero performance regression from 1.3 to 1.8. So that's, I think, a credit to the Django team really. Again, if we used the ORM, that might have been a different story. I don't know. So that brings us up to the present, and let's take a quick look at what the Instagram stack looks like today. If a request comes into Instagram, the first thing it'll hit is Proxygen, which is a Facebook-developed open source HTTP load balancer and proxy server, and it'll actually go through several layers of Proxygen, first at the edge routing, then the data center, then the cluster; eventually it'll hit a Django, which as I mentioned we run under uWSGI, and then Django will talk to a number of different back end services. TAO is the primary data store. We use Cassandra, it used to be Redis, we migrated at one point from Redis to Cassandra, but Cassandra is where we store counters, seen state, various other things that fit the Cassandra data model well. Everstore is a Facebook large blob store or file store where all the actual media goes. And then we do also put tasks into RabbitMQ for Celery for delayed execution. So we've already looked a bit at how we scaled the back end data store. So I'm gonna talk a little bit now about how we work on performance of Python and Django today. First thing to know about performance is that the word doesn't actually mean anything, or more accurately it doesn't mean anything useful, until you can attach it to a specific metric that you're trying to optimize. So what do we want to optimize? 
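[Editor's aside, before the metrics discussion: a minimal sketch of the settings-flattening hack just described, copying every setting into the lazy wrapper's own __dict__ at startup so later attribute lookups never reach LazySettings.__getattr__. As the talk notes, this likely breaks override_settings; it is shown purely to illustrate the idea, not as a recommendation.]

```python
# Illustrative only: flatten Django's lazy settings into plain attributes.
from django.conf import settings

def flatten_settings():
    settings.DEBUG  # any attribute access forces the lazy wrapper to load
    wrapped = settings._wrapped
    for name in dir(wrapped):
        if name.isupper():
            # Values planted directly in the wrapper's __dict__ are found by
            # normal attribute lookup, so __getattr__ is never invoked again.
            settings.__dict__[name] = getattr(wrapped, name)

# Called once at process start-up, e.g. from wsgi.py.
```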
We would like to have more happy Instagrammers posting cat photos and we would like to do it with fewer servers running Django. As Adrian can tell you, jazz guitarists don't come cheap, so we need to be efficient there. This suggests some possible metrics. For instance, cats per jazz guitarist. (laughing) Unfortunately, both cats and jazz guitarists have a tendency to wander off when you're trying to measure them, so instead, we measure users per server. So the numerator here is straightforward. What we measure is Active Last Minute: in our peak minute of daily traffic, how many users were active. The denominator is a little trickier. Obviously we trivially know how many Django servers we have in production, but that's not a responsive or useful metric. I mean, if I push a change that makes everything 10% more efficient we're still gonna have the same number of servers in production. What we really want to know is not how many servers we actually used to serve our peak minute traffic, we want to know what's the minimum number of servers we could have theoretically gotten away with, given our current efficiency, if the servers were fully utilized. So, we determined this experimentally in production using the Linux perf tool. This is some C code showing some basic usage of perf. perf is a tool that allows you to access hardware counters on Linux, one of which is a CPU instructions counter. So using this, we actually wrap this in ctypes so we can access it from Python, and then we instrument our Django uWSGI workers with this so that we can measure how many CPU instructions are used by Django to serve a request. And then what we do is in production we take a server, actually a set of servers, and using our load balancer, we push more and more traffic in their direction just until they start to fall over, and then stop. And we have some metrics where we can tell when a server is under stress and about to start failing requests, so we stop before it gets to that point. But you can see here a traffic chart, or actually a CPU load chart, showing in green a typical server in our fleet, which will have sort of a curve of the daily peak traffic. And then in yellow, one of these servers that's under our load testing, where we keep the CPU usage pegged as high as we can go without falling over. What's interesting about this is although we're measuring CPU instructions, that doesn't mean that we only care about CPU instructions on the server, because a server may fall over for a different reason. It may fall over, depending on how we've got things balanced, it may start to fall over because it runs out of uWSGI workers, cause we don't have enough memory to run more uWSGI workers, and so we get too many requests that are too slow to process and there's no uWSGI worker on the server free to handle more requests. So that yellow line is not necessarily at like 100% CPU utilization. There's a number of possible bottlenecks that we could hit, but regardless of which bottleneck we actually hit, this gives us not a hypothetical or extrapolated or theoretical, but an actual experimentally determined measure of how many CPU instructions per second we can expect one of our Django servers to use in actually serving requests to users. And then of course we can also determine how many CPU instructions per second our entire fleet uses and divide that by our number of users to get CPU instructions per second per server over CPU instructions per second per user. And this is what we actually measure. 
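[Editor's aside: the talk shows C code for the hardware instruction counter and mentions wrapping it with ctypes; below is a hedged, Linux x86-64-only sketch of doing that directly from Python via the perf_event_open syscall. The constants and struct layout are from the kernel headers as best I recall them, reading the counter may require a permissive perf_event_paranoid setting, and Instagram's real wrapper is certainly more involved.]

```python
import ctypes, os, struct

PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_EVENT_IOC_ENABLE = 0x2400
PERF_EVENT_IOC_RESET = 0x2403
SYS_perf_event_open = 298          # x86-64 syscall number

class PerfEventAttr(ctypes.Structure):
    # First 64 bytes of struct perf_event_attr (PERF_ATTR_SIZE_VER0).
    _fields_ = [
        ("type", ctypes.c_uint32), ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64), ("sample_period", ctypes.c_uint64),
        ("sample_type", ctypes.c_uint64), ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),   # bit 0 disabled, 5 exclude_kernel, 6 exclude_hv
        ("wakeup_events", ctypes.c_uint32), ("bp_type", ctypes.c_uint32),
        ("bp_addr", ctypes.c_uint64),
    ]

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def open_instruction_counter():
    attr = PerfEventAttr()
    attr.type = PERF_TYPE_HARDWARE
    attr.size = ctypes.sizeof(attr)
    attr.config = PERF_COUNT_HW_INSTRUCTIONS
    attr.flags = (1 << 0) | (1 << 5) | (1 << 6)   # start disabled; user-space only
    # pid=0 (this process), cpu=-1 (any CPU), group_fd=-1, flags=0
    fd = libc.syscall(SYS_perf_event_open, ctypes.byref(attr), 0, -1, -1, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "perf_event_open failed")
    return fd

def read_instructions(fd):
    return struct.unpack("q", os.read(fd, 8))[0]

# Usage: bracket the work you want to measure.
fd = open_instruction_counter()
libc.ioctl(fd, PERF_EVENT_IOC_RESET, 0)
libc.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
total = sum(i * i for i in range(100000))      # the work being measured
print("CPU instructions:", read_instructions(fd))
os.close(fd)
```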
This is our top line metric for all our efficiency work. Now of course this is what we measure, but we can simplify it, arithmetically get rid of the common CPU instructions per second factor, do a little inverse flipping, and get back to exactly what we wanted to measure, which is how many Instagram users can we ... Hello? Can we serve with a fully utilized server. You might wonder why we choose CPU instructions per second as the sort of arbitrary common factor here as opposed to something like requests per second or CPU time or other things that are commonly measured. One reason is that saying a server is a simplification; we actually run a number of different generations of CPU hardware in production with different capabilities, and CPU instructions per second is a metric that we're able to normalize across all those generations of hardware. The other reason is just that it's a very fine-grained metric. Requests per second is much more coarse-grained, not all requests are created equal. So CPU instructions per second gives us more fine-grained gradations there. We call this metric AppWeight. As you can see here from the green box, when I took this screenshot towards the end of October we were at an 18% improvement in AppWeight from the beginning of 2016. So this is our primary focus, optimizing this metric. Personally I think AppWeight is kind of a weird name cause it sounds like more AppWeight would weigh down your servers and you'd want less AppWeight. Actually, higher is better for this metric because we're measuring users we can handle per server, so we want more, not less. So up is good. If I could name it I'd call it Django power. (laughing) Cause it's like measuring the power of one Django. I don't get to name things. I don't think I've been there long enough. So, one cool thing about this metric is that effectively it measures the definition of scalability, right? Because scalability means a web service is scalable if, as your traffic grows, your hardware needs scale linearly or sublinearly with your user growth, right? If your hardware needs go up superlinearly with your user growth, that's the definition of not being scalable. And that's exactly what AppWeight measures. If AppWeight is constant, that means that our hardware needs and our user growth will track linearly. If we improve AppWeight, our hardware needs will actually be sublinear with our user growth. Yeah. We also do continuous deployment at Instagram. If I make a change to the Django code base and commit it, within about 10 minutes it will be deployed to all of our tens of thousands of jazz guitarists and serving 4.2 billion likes per day. We do an average of 30 to 50 deploys per day. Each one usually contains somewhere between one and three commits. So with engineers pushing this many deploys per day, how do we keep AppWeight under control? So with a large team, the key is not only to have good metrics that you can measure, but also to make them visible to engineers whose primary concern is pushing cool features. So they need to have good visibility into the efficiency metrics. Our primary performance data set is something we call Dynostats, which we gather from a normal Django middleware. It's a little more complex but looks very much like this. We sample a fairly small percentage of production requests. We have enough requests that we don't need to measure all of them. We can sample and get good data. 
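[Editor's aside, before getting into Dynostats: one way to read the AppWeight arithmetic described above, as a toy calculation in which every number is invented. Divide the fleet's instructions per second at the peak minute by the active users to get a per-user cost, then divide one fully loaded server's sustainable instructions per second by that cost to get users per fully utilized server, the "Django power" number.]

```python
# Toy AppWeight-style arithmetic; all numbers are invented.
fleet_instr_per_sec = 4.0e13        # whole fleet, at the daily peak minute
active_last_minute = 2.0e7          # users active in that peak minute
server_instr_per_sec = 2.5e10       # one server, load-tested to saturation

instr_per_sec_per_user = fleet_instr_per_sec / active_last_minute
users_per_full_server = server_instr_per_sec / instr_per_sec_per_user

print(int(users_per_full_server))   # -> 12500 users per fully utilized server
```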
If Dynostats is enabled for a particular request, we measure a number of things at the start of the request, the CPU instruction counter, wall time, CPU time, the RSS memory usage of the process, a number of other things, and then as the response goes out we collect all that, check all those counters again to get the change, and send that off to Scribe, which is Facebook's data and statistics pipeline project, also open source, you can look it up. We send that off along with a bunch of request metadata like the URL path, the view, the HTTP response code; anything you can imagine about the request gets sent off along with that data. So that allows us then in our data analysis system to slice and dice that data in all kinds of different ways. Here we see CPU instructions by view over time and we can see very clearly where a regression happened. So if a regression like that happens, we'll get an alert, of a regression either in CPU instructions or in wall time. And so how do we find the source of the regression once we get an alert? So the most obvious is to look for a code roll out that matches up in the timeline. Usually that allows us to narrow it down to one or two commits, and that's if we're lucky. Now, we push a lot of code that's hidden behind feature gates, and so often the regression doesn't actually jump up nice and neat like that, and often it doesn't correspond with the code roll out because it actually corresponds with somebody turning up a feature gate slowly to expose the code path to more and more users. So we also can show those feature gate changes on our timeline and try to match them up with the regression. If we can't immediately find an obvious cause using one of those techniques, we dig deeper using a tool that probably many of you have used, cProfile, from the standard library. So in the past when I've used cProfile I've typically run it in a controlled environment on my development server or my laptop or whatever. But for us, getting realistic data anywhere other than production is pretty hard. So we sample cProfile in production as well. Again, we do it using a middleware which is very similar. We sample an even smaller percentage of requests because cProfile has more of an impact. When you instrument with cProfile it does slow things down, so we don't want to do it on a lot of requests, but we do it on enough to get the data we need. And we just create a cProfile profiler object, attach it to the request, enable it, and then on the way out we disable it, generate the statistics and send them off to Scribe. And that allows us to then generate tables like this, among many other things. I had to cut off the leftmost column with the function names here to get it on the slide in a readable way, but each row in this table is a function in our code base and we can do time-based comparisons. So like, if we know roughly when a regression happened we can say compare the time before that to the time after that, and in this particular case we can see that one function is now using 70% more CPU instructions than it did previously. So that allows us to very quickly narrow down to exactly what the source of the regression is and where to look to fix it. So this is a key tool and we use it daily. We also pipe cProfile data into gprof2dot, which will generate graphs like this which show the hot paths. So again, we can see quickly where we need to focus optimization efforts. 
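[Editor's aside: a rough sketch of middleware in the spirit of what's described above, sampling a small fraction of requests, recording deltas of wall time, CPU time, and peak RSS, and on an even smaller fraction wrapping the request in cProfile. It uses old-style (Django 1.8 era) middleware hooks, a logger as a stand-in for Scribe, and invented sampling rates and paths; it is not Instagram's Dynostats.]

```python
import cProfile, logging, pstats, random, resource, time

logger = logging.getLogger("dynostats")   # stand-in for the Scribe pipeline
SAMPLE_RATE = 0.01                        # invented: fraction of requests measured
PROFILE_RATE = 0.0001                     # invented: fraction also cProfiled

def _cpu_seconds():
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_utime + ru.ru_stime

def _peak_rss_kb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

class DynostatsLikeMiddleware(object):
    """Add to MIDDLEWARE_CLASSES (old-style middleware, as in Django 1.8)."""

    def process_request(self, request):
        if random.random() > SAMPLE_RATE:
            return
        request._dyno = (time.time(), _cpu_seconds(), _peak_rss_kb())
        if random.random() < PROFILE_RATE / SAMPLE_RATE:
            request._profiler = cProfile.Profile()
            request._profiler.enable()

    def process_response(self, request, response):
        dyno = getattr(request, "_dyno", None)
        if dyno is not None:
            wall, cpu, rss = dyno
            logger.info("dynostats path=%s status=%s wall=%.3f cpu=%.3f rss_kb=%d",
                        request.path, response.status_code,
                        time.time() - wall, _cpu_seconds() - cpu,
                        _peak_rss_kb() - rss)
        profiler = getattr(request, "_profiler", None)
        if profiler is not None:
            profiler.disable()
            pstats.Stats(profiler).dump_stats("/tmp/req-%d.prof" % int(time.time()))
        return response
```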
And as yet another step in the aim to make this data visible to every engineer, not just those of us who are focused on efficiency, in Phabricator, which is our code review tool and code browser, also open source, some of you may use it, if you hover over any function, you get a pop up hover card which will tell you what percentage of our global CPU that function consumes, exactly how many servers would, in theory, be needed solely in order to power that function, how many views, which views use this function, which other functions call this function, and you can also drill down to see what does this function call that's using the most CPU. So again, the goal is to make this data very visible to every engineer. One interesting thing about cProfile that I didn't learn until I went to Instagram: you may have noticed, if you've used cProfile before, that I keep talking about CPU instructions. Normally, that's not what cProfile measures. cProfile, by default, measures CPU time. But you can actually pass any function you want into cProfile, any function that returns a number, and cProfile will measure that, or consider that the clock. So we don't use the default cProfile timer at all. We use two different timers. One that uses perf as I showed before to get CPU instructions, and another one that actually has no relationship to time, but actually measures RSS memory at the beginning and end of each function call. So you can use cProfile to measure anything, even if it's not time related at all. Another interesting side note about cProfile: it doesn't record the whole call stack. What it records is caller-callee pairs. So if you have, say, two functions, A and B, that call functions X and Y, cProfile can tell you that of all the calls to Y, 10% came from A and 90% from B, which is great, that's very useful information. But if X and Y are both decorated with some common decorator that wraps them, cProfile ends up seeing a picture like this instead, and all we can find out now is that 100% of the calls to Y came from cached_property.get, for example. So we lose that useful caller-callee information. It would be nice if there was a feature in cProfile to deal with this. There isn't, so we actually run a custom fork where we've hacked up the C code to tell it to ignore certain common wrappers and just essentially pretend that step in the call stack doesn't exist. So we essentially trick cProfile into seeing this picture again. If the wrapper, the decorator wrapper, used a significant amount of CPU itself, this would give us misleading data, because we're essentially rolling all of the wrapper's CPU instructions up into the caller function. But in general these kinds of decorators don't tend to use a lot of CPU internally. It's not particularly misleading and it gives us better caller-callee associations. Alright, enough on cProfile. So what do we do when we find an efficiency regression, we've narrowed it down to the source, and we need to fix it? I had a ruder version of this one but I took it out. Many of the cases are simply silly things that should be fixed, like an N-cubed algorithm somewhere or concatenating a bunch of strings in a tight loop, or fairly obvious things that once you see them you go, oh yeah, that's pretty clear how we can fix that. A close relative of the first one is not doing useless work, and these two together account for the vast majority of our performance regressions. 
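[Editor's aside on the cProfile point above: Profile() really does accept an arbitrary zero-argument callable returning a number as its clock, so you can profile things that aren't time at all. The timers below (process CPU seconds, and peak RSS via getrusage) only illustrate the mechanism; they are not the perf- and RSS-based timers Instagram actually plugs in.]

```python
import cProfile, pstats, resource, time

def cpu_seconds():
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_utime + ru.ru_stime

def peak_rss_kb():
    # A "clock" that isn't a clock: peak resident set size in KB (Linux).
    return float(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def work():
    return sum(len(str(i)) for i in range(200000))

for timer in (time.time, cpu_seconds, peak_rss_kb):
    prof = cProfile.Profile(timer)    # first argument is the timer callable
    prof.runcall(work)
    print("== timer:", timer.__name__)
    pstats.Stats(prof).sort_stats("cumulative").print_stats(3)
```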
Recently we fixed a case where we realized that we were fetching a set of media, photos and videos, and fetching all of the comments on every one, in order to be used in a view that didn't actually show the comments at all. So, there are sometimes obvious cases where you just don't do useless work and you can gain a lot of efficiency. Getting into slightly more difficult areas, we do a lot of caching, not only in memcache but also just in process, and often just for the duration of one request. If there's something that would otherwise be accessed multiple times, we can cache it and just do it once. Getting even more in-depth sometimes, how many of you in here have used Cython? Okay, quite a few, that's cool. It tends to get used a lot in the scientific community, not quite as much in the web community, but it's actually very easy to use and extremely handy. It'll take a Python file and, often with no changes, although you can get additional speed-ups by adding some type annotations, it will compile it to C code and then run it as native code, often much faster. And so when we have a hot spot in our code where we can't find any more optimizations at the Python level, we'll just change the extension to .pyx, our build chain will automatically compile that using Cython, and we can see big gains from that. And of course sometimes handwritten C can outperform what Cython is able to do, and so we do sometimes take an extremely hot spot and just rewrite it in C and call it as a C extension. Stepping back to Django, as we heard yesterday, one phrase that's sometimes used in describing Django's design aims is this: tightly integrated, loosely coupled. It's also occasionally been the subject of mockery by those who didn't think we were achieving it or that it was contradictory. But I think that the success of Instagram with Django is really a case study in the success of this very design goal. Or a similar one, a phrase that, often paraphrased, originally comes from Perl creator Larry Wall: make the easy things easy and the hard things possible. As we heard earlier, Django is tightly integrated out of the box, all the pieces work together, you don't have to figure a lot of things out or make a lot of decisions, and that tight integration, that making the easy things easy, allowed Instagram to get up off the ground very quickly. But when we had to start replacing components, we outgrew the ORM after 5,000,000 users or so. We were able to switch to our own homegrown ORM and later to TAO while continuing to use the rest of Django. There's a lot of Django that still is used every day at Instagram. Among other things, we still use contrib.sessions. We still use contrib.auth. Their pluggable backends allowed those applications to scale with us from one user to 500,000,000, along with many other things. So long story short, Instagram has been very happy with their choice of Django. And on that note, as a follow up to Nadia's excellent talk, I'm happy to be able to announce that Instagram is joining the Django Software Foundation as a corporate member at the gold level to help support the Django open source community. (applauding) We've also now, just in the last few weeks, we've gotten a contributor license agreement, a corporate CLA, in place which will allow Instagram and Facebook employees to contribute to Django as part of our jobs. (applauding) So as we at Instagram build towards our next 500,000,000 users, we are counting on Python and Django to get us there. 
We have no plans to switch to Rust or anything else that comes along. We're planning to stick with Python and Django and we want to do our part to help keep the community strong. Looking a little bit more towards our future, some things we want to look at in the next year. We intend to be on Python 3 sooner rather than later. In fact, this was a big motivation for getting to Django 1.8 in the first place, is that we wanted to be on a Django version that would support Python 3. We'll probably do this upgrade just as badly as we did the first one. (laughing) But hopefully also as effectively. Like many people we want to do more with async and specifically asyncio. We actually already use asyncio, the Python 2 backport called Trollius, to do a fan out of back end data requests from the web server. But we know that we're wasting a lot of web server capacity by using synchronous uWSGI workers to serve all requests, and so we want to explore async web serving. There's a lot of question marks there about how or if we can do that with Django. But I think it's possible, and stay tuned for maybe a future talk in that area. One project I'm personally working on right now is traffic replay. So the idea is that we can record production traffic, production requests, store them, and then set up test servers and replay those requests against the test servers. For instance, we can have a control server and then a server with some diff applied, and actually collect some performance data from realistic traffic before we ever launch it into production. So again, maybe a future talk on how we manage that. I will be very surprised if this hasn't already shown up in the Slack channel as a question. It did? - [Man] Twice. - Thank you, I knew I could count on all of you. (laughing) We have looked at PyPy in the past, several years ago before I was there. Apparently there were issues with memory usage, and also we do use a lot of C extensions, not only for performance but also simply because there are C libraries that we need to be able to make use of, and not all C extensions written for CPython work well with PyPy's garbage collection without some modifications or improvements. So we had some troubles with some of that stuff the first time we tried PyPy. PyPy has come a long way, so have we. It's something we'd like to look at again. Again, a lot of question marks there, both for memory reasons and for C extension reasons, about whether we'll be able to get it working well for us, but it's something we want to look into. Alternatively, if that doesn't work out, there is some work going on right now by Brett Cannon and I think Dino Viehland to integrate into CPython itself the hooks necessary for a just-in-time compiler. That's probably a long way out still but if that happens that could be huge for us. Lastly, I've been saying we all along for all kinds of things that I had nothing to do with, I'm just here talking about them, so I wanted to take a slide to acknowledge at least the team that I work with, the efficiency and reliability team at Instagram, and all of the awesome work that they have done, and many other teams that I didn't have room to fit on the slide. A lot of the stuff I talked about here, there's more depth on it at the Instagram Engineering blog, engineering.instagram.com. Feel free to check that out. We are hiring if anybody's looking for employment with Python and Django, and if you want to follow up with me on anything I'd be happy to chat here or afterwards. Thank you. 
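[Editor's footnote on the asyncio fan-out mentioned above, which Instagram does on Python 2 via the Trollius backport: in modern asyncio syntax, fanning a batch of backend fetches out concurrently and gathering the results looks roughly like this; fetch_from_backend is a fake stand-in for a real async backend client.]

```python
import asyncio

async def fetch_from_backend(key):
    """Stand-in for a real async backend call (TAO, Cassandra, ...)."""
    await asyncio.sleep(0.01)                 # simulate network latency
    return {"key": key, "value": "data for %s" % key}

async def fan_out(keys):
    # Issue all backend requests concurrently and collect the results.
    return await asyncio.gather(*(fetch_from_backend(k) for k in keys))

if __name__ == "__main__":
    print(asyncio.run(fan_out(["user:1", "media:42", "likes:42"])))
```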
(applauding) You need the network which is not there. - No it was after mine. Oh okay. So let's try that. Feed my laptop. Excellent. So, unsurprisingly we have an extensive number of questions. Quite a lot of them around the scaling problems that you have and the other ones. I'm gonna start with the most important question, are there more photos of cats or more photos of dogs? - I'll have to get back to you on that. (laughing) I need to see if we're authorized to release that data. Obviously we all know that by heart, but I mean, it's a little bit sensitive. - So, let's start with two. You mentioned a bunch of things that you were using, a bunch of profiling tools, internal things, monitoring tools and so on. You know, within your Instagram Engineering bubble or something, do you have somewhere where you list all of these that are open source, or which bits are open source? I probably had about 10 questions which were is this thing open source, is this thing open source, or are you going to? - We do have some posts like that in the backlog of our engineering blog but I don't think we have anything recent or up-to-date that kind of lists out everything. So that's probably something that we should look at in terms of an updated blog post and kind of the full stack and the pieces and what's open source. So yeah, I'll put that on my to-do list to talk to somebody about getting that done. - And similarly with sort of monitoring tools and so on as well, when you're dealing with your errors, are you now all using the Facebook stack? - Yeah. We're using an internal Facebook stack for monitoring and alerting. I mean I know Scribe, which is sort of the pipeline that gets the data off the individual servers and into the data analysis system, that's open source. I'm not sure off the top of my head which other pieces of the data analysis and alerting stack are open source. Some of them may be but I need to look into it. - All the talk of these enormous scaling problems certainly makes for some entertainment; how much of it do you think is actually relevant to normal size projects and which of the things should we be considering using ourselves? Is it worth us looking at doing this cProfile extension? Should we be looking at using TAO for a sort of moderate size website? - Yeah, I think a lot of it is not relevant to most websites. I mean, I think Django does a very good job of hitting the 80% use case and focusing on that. And very few web applications ever make it to the level of needing to worry about these concerns. I think it's appropriate for Django not to, certainly not to make things more difficult for the projects just starting out, in order to make it easier for very large scale projects. But that said, I mean, certainly some of these things, some of the profiling stuff, could potentially be useful even to midsize projects. I don't know how much of it belongs in core. We could maybe have some better hooks in core for detailed performance monitoring of what's going on within Django. That's definitely been discussed before in connection with the Django debug toolbar. - [Man In Black] And also with Opbeat. - Yeah. Yeah. - [Man In Black] And these other services that provide that sort of tooling and all-- - And they're all monkey patching, yeah. So yeah, I think there's some things that can be done in core there. I think the bulk of the work could be external packages, it doesn't really involve changes in core. 
- Do you think that some of your developers or whatever may be able to contribute some documentation about, you know, here's a collection of moderate size scaling pitfalls that you probably shouldn't do, that are quite specific to Django? - Yeah, possibly, we could look into that. There'd certainly be other people who are more familiar with that stage of the growth than I am personally, but yeah. - Last question on the performance per se, do you use any sort of staging environment or is it more in a sort of roll it out to small parts of the ecosystem and hope? - Yeah. I mean at this scale it's pretty hard to have a useful staging environment without making it an impractical size. So no, not really. - [Man In Black] Running 100,000 Djangos. - Right. (chuckling) - [Man In Black] Staging Djangos is a relatively expensive operation. - So I mean, what we tend to do more is roll things out to production gated and then turn them up very slowly and carefully. - So sort of internally, I guess, Django, Instagram's a sort of project where predominantly I imagine a high proportion of traffic comes over the API rather than the web directly. - [Carl] Yeah. - Does that kind of influence the choice of technologies and choice of engineering that you're doing much? Do you distribute things carefully and have like API servers run their own thing and you've got separate services to do other things, or is it kind of more monolithic? - It's all the same service. I mean, I don't think we're doing anything particularly interesting in that regard compared to most other Django projects. I mean we have a lot of views that return JSON is what it boils down to. Yeah. - With that monolithic architecture then, is it just somewhere on your laptop is a checkout of Instagram and everyone's got access to everything and pretty much anyone can work where they feel like, or what's your sort of internal structure? - Well, in case anyone was thinking of making off with my laptop. (laughing) I don't actually. We tend to do our work on development VMs over the network. But yeah, no, I mean one of the cool things about working at Facebook and Instagram is that it's a very open culture inside the company and so there's a lot of freedom for a developer to go poking around in the things that interest them and beyond that actually work on whatever interests them. I mean, there's a very strong culture of development being driven by engineers, and managers are there to support the engineers. So if you think that something needs to be done or would have a big impact, you're supposed to go ahead and go do it and not ask for permission, and then see if it does. - Roughly how big is this code base? Significant lines of code or something? Do we have some sort of idea of how big we are? - I don't have numbers on that. I'd need to look into it. - Your internal TAO based-- - It feels big to me compared to my past experiences. But that's about as precise as I can get without research. I'm sorry. - Your internal usage of TAO and so on, do you use something now that's like a Django ORM or do you just deal with sort of lower level objects for speed reasons, you know, namedtuples, dictionaries and so on? - So there is a Python client for TAO which has been developed very much sort of by the TAO engineers at Facebook but in very close collaboration with Instagram engineers. There is other Python at Facebook but we are by far the biggest user of TAO from Python. And it's different from the Django ORM. 
I would say one of the key differences, that maybe relates to some things Aymeric talked about yesterday, is that, compared to the Django ORM, it's much more explicit about when you are going to the network, going to the data store, versus when you're just constructing some objects, so there's certainly nothing like, look, accessing this attribute on an object magically does another query. Some of those pitfalls you just can't afford. But I mean other than that sort of thing, yeah, it's similar to the Django ORM in that you have objects representing your nodes. - [Man In Black] The gating of features, do you do that within Django? Is it all at routing level? - Yeah, that's done within Django. We have a system that actually uses ZooKeeper to push out sort of configuration information regularly to all the web servers and then features are gated by checking configuration values that are pushed out that way. - With your actual Djangos themselves, what operating system are they running on and what's your uWSGI set up? Are you using processes, threads, gevent? What are the limits on what you're doing? Is it CPU, is it IO, memory? - Obviously running on Linux. Actually I don't personally know more than that. I mean, I'm not on the ops team. So I'd need to check on details there. We're not using gevent or any kind of monkey-patching async; like I said, we are using asyncio on the back end for some fan out data store queries, but the uWSGI workers are just regular synchronous uWSGI workers. What was the other part of the question? - That was pretty much it. - Okay. Yeah. - I think, hopefully. Sorry? Oh yeah, it's what's the limits that you hit, yeah. - Oh, what's the bottleneck? Yeah, interestingly, I mean, that actually varies some with different hardware generations, which I think is a very rough signal that we're more or less finding the right balance, but we do have some hardware generations where we are actually memory constrained and some others where we aren't. - Obviously it's still a while ago, so you may not know all the details, but when you're going through this migration from a single database to vertical sharding or vertical sharding to horizontal sharding and so on, do you use any particular tools for the process or was it all kind of manual configuration and hope? - So for the migration from one data store to another? - From migration to like multiple. You know, migrating from one massive shard to lots of little shards and from the vertical sharding to your horizontal sharding. - Yeah, well, I mean obviously I wasn't around for a lot of that, so the slides got pretty close to the extent of my knowledge about some of the historical stuff. I mean I was around for the latter half of the transition from the horizontally sharded ORM to TAO and I mean that was done, again, really no magic, just a lot of elbow grease. I mean, like all of these things when you're working at a large scale, it was a slow and iterative process where you take one type of data at a time and one view at a time and you often have the data stored in both places with some background jobs running to make sure things are not getting out of sync, and then eventually you cut it the whole way over and you can get rid of the old data store, and then you just keep iterating on that with one type of data after another until you're done. - You were saying about moving towards Python 3, do you have any estimates for the time scale? How far along are you? 
- I don't have estimates but I'm sure you will all find out when we're there. (laughing) - A couple of things looking forwards. I think, obviously, we are-- - You should know better than to ask a software developer for estimates on stage. (laughing) - I will point out the honor of my question. And so looking a bit more towards the future, obviously as the DSF we are delighted to welcome Instagram as a corporate member, and someone has asked exactly how much the contribution is, what is the gold level, you said gold level sponsoring, that might not mean anything to most people in this room. - So gold level I believe is $25,000 per year and up. Yep. - Within that context are you intending to stay on 1.8 for a while? Are you wanting a longer LTS on 1.8? Are you wanting to move upwards to the next LTS when it comes out? - That's a great question. I don't think we totally know the answer yet. I mean, the move from 1.3 to 1.8 was a lot of work and I don't think people are really eager to put in that kind of work again. I'm hopeful that the next time around will be easier. I think it's very unlikely that we'll go version to version. I think we probably will try to go LTS to LTS, but we'll have to see. I mean, Python 3 was a big carrot because of async and because of type annotations. We really wanted to be on Python 3 and so we needed to upgrade Django to get there. So that was a really important carrot for getting us onto Django 1.8, and at this point, I'm not sure what the carrot will be for the next Django upgrade. And so there may be some temptation. I mean, the work of backporting security patches can actually look tempting, relative to the work of upgrading. So, I think there's some uncertainty there at this point. - Do you feel that Channels, or other projects that are kicking around, do you think Channels is gonna be useful for your asynchronous efforts, or are you thinking you're gonna be at a much lower level than that? - Isn't Andrew here? Andrew. I actually discussed that very question with Andrew over breakfast this morning. I think Channels is really solving different problems than the ones that we have. I mean, we do web sockets or real time communication, but we already have a whole separate system set up for that that integrates with Facebook infrastructure and uses totally different technologies, and it's pretty unlikely that we're gonna want to push all that back into Django, and that's really the primary problem that Channels solves. And the problem we have is wanting to make the entire web request, from client all the way to back end data stores and back, all async, which would allow our workers to serve many more requests concurrently. And that's not a problem that Channels in its current design even tries to solve, because it keeps all of the Django view logic in a synchronous worker process. So yeah, I think Channels is very cool, but it's solving different problems than the ones that we have. - Final question, what surprised you most about everything when you joined Instagram? - I mean this is gonna sound a little bit saccharine, but honestly, just how nice my team was. I mean, I had never in my career worked for a big Silicon Valley company. Like, I ran my own company in the Midwest for eight years and I was a little bit apprehensive about what it would mean to work at a big Silicon Valley firm. But my coworkers have been awesome, I mean, yeah, in every way. So very smart and very easy to work with and just a really fun and collaborative work environment. 
- [Man In Black] Excellent. Well, thanks very much Carl, and thank you from Instagram. - Thank you. (applauding)
Info
Channel: Django Under The Hood
Views: 30,826
Keywords: django, python
Id: lx5WQjXLlq8
Length: 64min 33sec (3873 seconds)
Published: Sun Nov 13 2016