- Okay. So yes, Instagram. You may recognize us from Nadia's talk. (laughing) I'm Carl Meyer. I've been a Django core team member for about six years and an Instagram employee
for about six months. This is me on Instagram. I'm gonna talk today some about how Django was part of Instagram's growth and how we still use Django today and how we make Python and
Django work for us at scale. I know this is the fifth
hour long deep dive talk of the day so I've tried
to do a layered slide deck. For those of you who
are completely asleep, there are an adequate
number of cat photos. (laughing)
So I hope that wherever you are in your day
in your sort of mental process there will be something here
that you can appreciate. So, what's Instagram? We are, officially, an app for capturing and sharing the world's moments. This means, of course, cat photos, lots and lots of cat photos. Every day, Instagrammers
upload about 95,000,000 photos and videos to Instagram. This means in less than two weeks we get enough photos that
if you printed them out and lined them up side by
side, you would reach the moon. And these are not just any cat photos. These are truly excellent cat photos. People really like these cat photos. Roughly 4.2 billion times every day, someone likes a video on Instagram. All in all, our data store currently stores 2.3 trillion likes. That's roughly one like
for every person on earth for every day of the year. I could go on with these analogies, it's kinda fun coming up
with them, lots of zeroes. The point is, we don't
just have cat photos, we have cat photos at web scale. (laughing) And every single one of those 95,000,000 photos and videos and 4.2
billion likes every day goes through this guy. But of course 4,000,000,000 daily likes is way too much to expect
just one jazz guitarist to handle all alone, so we have, I'm not supposed to give exact numbers, but tens of thousands of Djangos, servers running Django, although internally we don't call them servers. We don't call them the web tier. We call them the Djangos. (laughing) Feel free to join me in visualizing this as a fleet of tens of thousands of mustachioed jazz guitarists. (laughing) Before we get too deep into
the current state of things at Instagram I want to step back and do a quick run through
Instagram's history. Starting back in October
2010 when Instagram began, coincidentally the same month that I joined the Django core team. I'm not sure what that
says about my life choices versus some other life choices. But anyway, October 2010, when we had not 10's of thousands of Djangos
but I think just one. I talked to Mike Krieger, the
cofounder and CTO of Instagram about their early experience using Django. His first comment was that it was super easy to get started with, didn't require a lot of decisions, didn't require a lot of
set up, made testing easy. All of these things allowed Kevin and Mike to build the first version of
Instagram in about two weeks and in only a couple months get to their first million users. During those first few
months they of course hit some of their first scaling
pain points with Django, probably things that many of
you have also encountered. Like, for instance, 404s or 500 errors sending an individual email
for every single error. (laughing) If I'm not mistaken,
this is an archival photo from the early days of Django
at the Lawrence Journal World. Maybe Jacob or Adrian can explain why Simon was so excited about the emails. Anyway, apparently at one point Mike and Kevin were
literally getting an email for every browser request
that came in to Instagram.com because they had a missing favicon, so every request generated a 404 and every 404 generated an email. Right around this time, in fall 2010, is when Russ committed
logging support to Django. Although the default behavior of Django is still to send an email for every error, at least it's now much
easier to configure. So I think we can wave
our hands a little bit and call this one fixed, thank you Russ. (applauding) While I'm on this slide
I should also mention the easy testing that Mike
referenced a couple slides back. As of the Django 1.3 time frame,
that was also Russ's work. So thanks Russ for that as well. Also, like many of us I think, Kevin and Mike had to
hack around the admin wanting to do a COUNT(*) of every row in a very large table in order to provide pagination. In 2010 they had to hack around this with monkey patches; today in Django 1.8 we now have an admin option to disable this, so I think
we can also call this fixed. Of course managing
static files was a pain. I think it still is, but around this time is when Jannis contributed contrib.staticfiles, which improved the situation quite a lot. So thank you Jannis for that. Like many of us again,
Instagram had trouble with Django's transaction management, which used to be pretty bad. It would wrap reads in
unnecessary transactions. It would do extra round
trips to the database. It would leave Postgres connections hanging around, idle in transaction. For those of you who use Postgres, you may remember that. But of course, since Django 1.6 this one is also fixed, thanks to Aymeric. By June, July 2011 Instagram
had about 5,000,000 users. At this point all of
these 5,000,000 users, all of their data, was
stored in PostgreSQL and accessed via the Django ORM. They did use vertical
partitioning by model to allow storage in multiple databases. So there was a database for users, one for media, one for likes,
one for comments, et cetera. This kind of vertical
partitioning by model is actually very easy to implement using the Django ORM's
multi-DB support and routers, and that's exactly how it was done. This isn't the exact code, but it's the moral equivalent, updated a little bit for some API changes in modern Django. It's quite easy to establish a mapping of model to database name. I think in Instagram's case there were also some cases where an entire app was mapped to a database, but you can easily return the correct database from the db_for_read/db_for_write methods and then only allow relations between models on the same database. Of course, you have to do some fake foreign keys. You can no longer have referential integrity between models on different databases, but it's not that hard to implement. So it made sense as a first step for scaling the ORM for Instagram.
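A minimal sketch of that kind of router, in the spirit of the slide; the model-to-database mapping and the database alias names here are made up for illustration, not Instagram's actual configuration.

```python
# A minimal sketch of vertical partitioning with a Django database router.
# The model-to-alias mapping and the alias names are illustrative only.

MODEL_TO_DB = {
    "user": "users",
    "media": "media",
    "like": "likes",
    "comment": "comments",
}


class VerticalPartitionRouter:
    """Route each model to its own database; forbid cross-database relations."""

    def _db_for_model(self, model):
        return MODEL_TO_DB.get(model._meta.model_name, "default")

    def db_for_read(self, model, **hints):
        return self._db_for_model(model)

    def db_for_write(self, model, **hints):
        return self._db_for_model(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects stored in the same database.
        return self._db_for_model(type(obj1)) == self._db_for_model(type(obj2))
```

You'd point the DATABASE_ROUTERS setting at a class like this and define one entry per database in DATABASES.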
Shout out to Alex Gaynor for implementing multi-DB support and, once again, to Russ for mentoring that project. So, the Django ORM was managing
5,000,000 users just fine but at a certain point it became clear that the vertical partitioning
wasn't gonna scale, there were just too many likes
to store in one database. I'm gonna start saying
we, it's a little weird cause I joined Instagram six months ago, so none of this involved me but it's simpler if I can just say we. So we picked a pretty typical horizontal sharding scheme
to deal with this problem. So you take all of those many likes, too many for one postgres server, and you split them up into a
whole bunch of logical shards. In our case, we split them up. We started with likes and we split them up according to the user ID
of the user whose media, whose photo or video had been liked. I have 10 here for illustration but actually several
thousand logical shards, and we mapped these to Postgres schemas. If you're not familiar with Postgres, a schema in Postgres is not like the SQL schema of a table. It's more like a namespace that can contain tables within it. So each logical shard corresponded to a Postgres schema, and you can of course have multiple Postgres schemas on a single physical server. So this allowed a mapping from logical shards to physical servers that meant that the number of physical servers could grow over time as traffic grew, without having to re-bucket the data. It just required updating the mapping from a logical shard to a physical server.
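The shard math itself is simple. Here's a rough sketch of the idea; the shard count, the schema naming convention, and the shard-to-host mapping are all hypothetical.

```python
# A rough sketch of mapping a user ID to a logical shard, a Postgres schema,
# and a physical server. The numbers and names here are made up.

NUM_LOGICAL_SHARDS = 4096

# Ranges of logical shards mapped to physical hosts. Updating this mapping is
# how capacity grows over time without re-bucketing any data.
SHARD_TO_HOST = {
    range(0, 2048): "likes-db-1.example.com",
    range(2048, 4096): "likes-db-2.example.com",
}


def logical_shard_for(user_id):
    return user_id % NUM_LOGICAL_SHARDS


def schema_for(shard_id):
    # Each logical shard is a schema (a namespace) inside a Postgres database.
    return "shard_%04d" % shard_id


def host_for(shard_id):
    for shard_range, host in SHARD_TO_HOST.items():
        if shard_id in shard_range:
            return host
    raise LookupError("no host mapped for shard %d" % shard_id)
```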
So horizontal sharding like this is a little harder to implement inside the Django ORM. Today there are a couple of third-party apps, reusable apps, libraries that will implement something like this using some complex model base classes and manager base classes. In 2011 those didn't exist. In fact, some of them are based
on some of Instagram's work that was publicized in blog posts. So that wasn't available as an option. Also, Instagram wanted a solution where they could take a set of IDs that might correspond to objects contained in multiple different
shards, fan those out with asynchronous
queries, get results back, put them all together into one result set for the application, and that
seemed especially difficult to squeeze into the ORM at the time. So rather than using the Django ORM, Mike sat down and wrote his
own minimalist sharded ORM, committing the first version of it with the only appropriate commit message. (laughing) Work in progress. This ORM was really a fairly light wrapper around raw SQL. It included a lot of code that looked like this: find the right shard, get a connection to the shard, put together some raw SQL, and send it off to the database.
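In spirit, the data access code looked something like the sketch below; the helper names, connection handling, and schema/table layout are invented for illustration, not the real implementation.

```python
# A rough sketch of sharded data access over raw SQL: pick the logical shard,
# get a connection to its physical server, and query the right schema.
# logical_shard_for() is from the sketch above; connection_alias_for() is a
# hypothetical helper that maps a shard to a configured database alias.
from django.db import connections


def get_likes_for_media(owner_user_id, media_id):
    shard_id = logical_shard_for(owner_user_id)
    conn = connections[connection_alias_for(shard_id)]
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT user_id, created_at FROM shard_%04d.likes "
            "WHERE media_id = %%s" % shard_id,
            [media_id],
        )
        return cursor.fetchall()
```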
This new storage layer was not the end of the Django ORM at Instagram. Initially it was only used for likes. It took over two years for all of the user data to slowly be migrated from the Django ORM into the new sharded system. Even today, we still use the Django ORM for some of our internal admin tools. But you could say that this commit in 2011 was the beginning of the end for the Django ORM at Instagram. Also, just an interesting side note: Instagram engineers at the time came up with a scheme for sharded unique IDs that's been much replicated since. One of the issues with sharding, of course, is that you can no longer rely on a simple database sequence, or auto-increment in MySQL, to generate your object IDs, because obviously each individual database is gonna have its own sequence, its own set of IDs, and so you'll have clashes between them. So the scheme that Instagram settled on was beginning with a timestamp, then a unique ID for the shard, and then a unique sequence ID tacked onto that, and this gives you unique IDs that also preserve the property
that if you order objects by their ID you get a
chronological ordering, which was important to Instagram. And this can also be implemented in Postgres as a stored procedure, so you can use it as a default for the ID field in your table, just like you would an ordinary sequence.
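Sketched in Python, the layout looks roughly like this. The 41/13/10 bit split and the custom epoch below are assumptions for illustration, not necessarily the exact production values.

```python
import time

# A sketch of the sharded ID scheme: a timestamp in the high bits, then the
# shard ID, then a per-shard sequence number. Bit widths and epoch are assumed.

CUSTOM_EPOCH_MS = 1293840000000  # assumed epoch: January 1, 2011, in milliseconds
SHARD_BITS = 13                  # up to 8192 logical shards
SEQUENCE_BITS = 10               # 1024 IDs per shard per millisecond


def make_id(shard_id, sequence):
    millis = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    new_id = millis << (SHARD_BITS + SEQUENCE_BITS)   # timestamp in the high bits
    new_id |= shard_id << SEQUENCE_BITS               # then the shard ID
    new_id |= sequence % (1 << SEQUENCE_BITS)         # then the sequence number
    return new_id
```

Because the timestamp sits in the high bits, sorting by ID sorts chronologically.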
The new sharded ORM worked very well, but it did have its problems. Among those problems was Justin Bieber. (laughing) Justin Bieber was an early adopter of Instagram, and the new sharding scheme grouped everything by user, so all of a user's media and all of the likes on their media
data access patterns but I think you can
guess what happens next. So, the Biebs posts a photo, everyone in the world, roughly, stampedes to like and
comment on that photo, and you have a very hot
Postgres database on your hands. So for a while there, there are still war stories told today about how for quite a while
every new Biebergram constituted a genuine ops emergency. (laughing) Memorizing Justin Bieber's user ID was actually a condition
of employment at Instagram. You needed to be able to
do things very quickly. So, this problem was eventually solved not with any radical changes but just by a lot of remapping of logical
shards to physical servers to avoid, to make sure
that, at the very least, Justin Bieber did not share
a physical database server with any other popular account and also some re-caching techniques. Anyway, by the following April, Instagram had passed
40,000,000 users, April 2012, and that's also when we
were acquired by Facebook. So up till this point Instagram had been hosted on Amazon web services. Once we were acquired by Facebook, there began an effort to move off of Amazon web services and
onto Facebook infrastructure, which we did by early 2014,
everything had been moved into a Facebook data center. And then a month or two after that Facebook conducted one of their regular disaster recovery exercises and cut the network to that data center. So everything had to be
moved from that data center to a different Facebook data center and nobody really wanted to repeat that experience a third time. And of course we also wanted Instagram to be itself, disaster ready in case something would take out a data center. So that began the effort to move Instagram to a multi region or multi
data center architecture. One of the big challenges
of this move was caching. So, previously in a single
data center architecture, like many Django apps, Instagram's Django application managed caching manually. So a request comes in for some data; we go see if it's in memcache; if it is, great; if it's not, we have to go to the database, we get it back from the database, we put it in memcache, and we send it back to the user. A write comes in, we have to go invalidate the cache and then we send the write to the database, et cetera.
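That's the classic cache-aside pattern. A minimal sketch using Django's cache API; the key format and the database helpers are hypothetical.

```python
# A minimal sketch of the manual cache-aside pattern described above.
# fetch_comment_count_from_db() and save_comment_to_db() are hypothetical.
from django.core.cache import cache


def get_comment_count(media_id):
    key = "media:%d:comment_count" % media_id
    count = cache.get(key)
    if count is None:
        count = fetch_comment_count_from_db(media_id)
        cache.set(key, count)
    return count


def add_comment(media_id, user_id, text):
    save_comment_to_db(media_id, user_id, text)
    # Invalidate so the next read repopulates the cache from the database.
    cache.delete("media:%d:comment_count" % media_id)
```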
Moving to a multi data center architecture, we can use Postgres replication to take care of syncing up the main data store between the data centers. But we also have to figure out how to sync up the caches. If we just leave independent caches, well of course, we do need
a cache in each region because for latency reasons
you can't go across data center to talk to memcache, but if
we just have separate memcache in each region, if Pat posts
a photo in data center A, and then Jan posts a comment on that photo over in data center B, over in data center A memcache still has cached that there are no comments on this photo, and so Jan isn't gonna see that
comment for quite some time. That's a problem. So the solution we eventually
came up with for this involved using PGQ, which is an implementation of a queue inside Postgres, to actually keep a queue of cache invalidation events inside Postgres. That queue would then get synced as part of Postgres replication, and a separate cache invalidator process in each data center would read from that queue and invalidate the appropriate cache entries in that data center. The point of all this really is just to say that this stuff is hard. And while we were busy
trying to figure out how to solve these problems
for our Instagram architecture in the meantime Facebook had built TAO. TAO is, now here I have
to reach for my notes, an eventually consistent distributed graph object store and
write-through cache. It's built on top of MySQL and memcache, and it had already been proven by Facebook at more than 10 times Instagram's scale. I won't get into a lot
of deep details here on the design of TAO cause it's a little far afield from the topic of Django. If you're interested in the
design of big data stores, there's a white paper and a blog post if you search for Facebook TAO
you'll certainly find those. Effectively, TAO takes
all of this complexity, the complexity doesn't go
away, but it wraps it up into a service that
somebody else maintains. So it's not our problem anymore. So, effectively, with TAO, the picture from Instagram's perspective
could look like this. We just ask for the data we need and send the data we need to write to TAO and all the caching, all
of the multi region stuff, all of the sharding stuff, all of that is just handled invisibly for us. Let's talk just briefly about what the data model looks like for TAO. It's a graph store. So for a typical Instagram case, let's say Jan follows
Pat, so we have two nodes, Pat and Jan and edges between
them, Follows and Followed by. Then Pat posts a photo and
so we have another node for the photo and edges Posted by and Posted between Pat and the Photo. A comment, another node
and a couple more edges. And then Pat likes the comment and we get a couple more edges there. That's really all there
is to TAO's data model, and the operations that are exposed, as is typically the case with data stores that have to operate at a very large scale, are very few. You can basically count them on one hand. You can get a node by ID. You can get a range of nodes
by creation time stamp. You can ask for all of the
outgoing edges from a node and you can get the edges
between any two given nodes. And that's, effectively,
all that you can do. As it happens this data model fits Instagram's needs pretty well. So, in 2015, we began
the process of migrating Instagram from Postgres over to TAO, and I think just about a month ago we decommissioned our last PostgreSQL cluster. This cat is here in honor of Postgres, to express my personal sadness; many people at Instagram love Postgres and we're very sad to see it go. But on the other hand, Instagram's data access code is very much simpler now. Nobody is really sad to see all of that hairy caching and sharding code go. So in June of this year we hit 500,000,000 monthly active people on Instagram, shortly after I joined, I'm
sure it's not a coincidence. (laughing) So this summer we finally decided that it was time to fix the fact that we were still running on
a heavily patched Django 1.3. (laughing) So, our approach to this
problem was completely wrong. If you visit the Django documentation it will tell you that if
you want to upgrade Django you should go version by version, so 1.3 to 1.4 to 1.5 and at each step you should carefully
read the release notes and check your app for where you're doing something that needs to be updated and that's a great process. I've done it before. If we had done that I'm
not sure we ever would have actually gotten around to the upgrade. So instead, this is what we did. We went straight from 1.3 to 1.8, just installed it and then ran the tests. And look, some of them
failed so we fixed things and then so rinse and repeat
basically for several months. (laughing) Obviously also this is far too risky and large a change to either
land in master in one go or to deploy all in one go. That'd be far too risky. So, rather than being able to simply convert the entire code base
to Django 1.8 compatibility, we had to convert the code base to simultaneous compatibility
with both Django 1.3 and 1.8. So as you can imagine that
led to some pretty ugly version conditionals
and compatibility code and in some cases entire 30 line methods, copy, pasted, and here's
the Django 1.3 version and here's the Django 1.8 version. So it was not pretty but it worked and once we had the
entire test suite passing, ran it on some development servers to flush out anything that
wasn't caught by the test suite, pushed it to some pre-release boxes and then started rolling
it out cluster by cluster watching carefully for regressions, performance regressions or
errors or anything like that and once we were confident
that it was working and we had it rolled out everywhere I got the very satisfying task of going through in one big diff and ripping out all of the
Django 1.3 compatibility hacks. So now we are entirely on 1.8 and we couldn't go back
to 1.3 if we wanted to, which we don't. So, we only really had two
performance regressions to speak of with Django
1.8 compared to 1.3. Does anyone want to guess where in Django we might have experienced a performance issue going from 1.3 to 1.8? We don't use the ORM anymore. I think if we had, if we did still use the ORM, this whole process would have been more challenging. - [Man] Swappable models? - Swappable models? - [Man] Replaceable templates. - Oh, no, not really. We actually already use Jinja, so there wasn't a lot of difficulty there. Actually the two issues that we ran into both related to internationalization. The first one, somewhere
between 1.3 and 1.8, I'm not sure exactly in which version, but somewhere the feature
of internationalized URLs was added so you can actually have URLs that change depending
on the active language. It was a very cool feature, but as a side effect of that feature, Django now will compile every
URL regex in your project once per active language
rather than just once period. We have a lot of URLs and
also a lot of active languages and so even though this is
something that only happens at start up, our uWSGI
processes restart often enough that this was a serious issue for us. It's also something that
I think we were able to fix it pretty easily
with a monkey patch and I think Django could be smarter here, particularly in the case where you don't actually use the
internationalized URLs feature. So I plan to look at this in the Sprints. Similarly, somewhere along there, Django gained the ability
to load translations from any installed app instead
of from just one directory, which is, again, a great feature, especially for reusable apps
that want to ship translations. But for us we don't keep any translations in our installed apps and we
have over 100 installed apps and it turns out that the
way it currently works for every installed app, even if there's no locale directory
there, Django will ask, gettext to load the translations and gettext will create a
null translations object and do a bunch of work that really there's no point in doing. So again, that was a problem for us, and again I think it could be fixed. I plan to look at it in the Sprints. The third monkey patch that we still carry is actually not a new one in 1.8, and then we carried over from 1.3. So Django's lazy settings implementation means that every time
you access settings.foo it actually goes through a
Python getattr function call cause it's dynamically going down to a wrapped object to
get the actual value. Turns out that at scale
Python function calls are pretty slow compared
to attribute accesses and so this was a problem. So, this is the silliness that
we do to take care of that. At start up, we loop through every setting and we force it into the dictionary of the outer wrapper settings object, so that every future access of a setting will just pull it directly from that object's dict as a normal attribute access, instead of going through the whole lazy setting thing.
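A simplified sketch of the idea (not the exact production patch), relying on the fact that values placed in the wrapper's instance dict are found before __getattr__ ever runs:

```python
# A sketch of the settings-flattening trick: copy every setting into the
# LazySettings wrapper's __dict__ at startup so later lookups are plain
# attribute accesses. Note that _wrapped is private Django API.
from django.conf import settings


def flatten_settings():
    # Touch one setting so LazySettings loads the real settings module.
    settings.DEBUG
    for name in dir(settings._wrapped):
        if name.isupper():
            settings.__dict__[name] = getattr(settings._wrapped, name)
```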
This probably breaks the override_settings decorator, which we don't use; we just use mock.patch, so that's not an issue for us. So, this probably maybe
breaks other things too so I can't really recommend it, but it was very effective for us. We had some views that accessed
a setting in a hot loop where we saw 10% CPU instruction gain, just from doing this thing. I think this would be a little trickier, but I think it's also
probably fixable in Django. The whole settings implementation is a little bit complex
for what it actually does. So the conclusion of all that
is with a few monkey patches we're now on Django 1.8 and
we experienced effectively zero performance
regression from 1.3 to 1.8. So that's I think a credit
to the Django team really. Again, if we used the ORM, that might have been a different story. I don't know. So that brings us up to the present, and let's take a quick look at what the Instagram stack looks like today. If a request comes into Instagram, the first thing it'll hit is Proxygen, which is a Facebook-developed open source HTTP load balancer and proxy server, and it'll actually go through several layers of Proxygen, first at the edge, then the data center, then the cluster; eventually it'll hit a Django, which as I mentioned we run under uWSGI, and then Django will talk to a number of different back end services. TAO is the primary data store. We use Cassandra, used to be Redis, we migrated at one point
from Redis to Cassandra, but Cassandra is where we store counters, seen state, various other things that fit the Cassandra data model well. Everstore is a Facebook large blob store or file store where all
the actual media goes. And then we do also
put tasks into RabbitMQ for Celery for delayed execution. So we've already looked a bit at how we scaled the back end data store. So I'm gonna talk a little bit now about how we work on performance
of Python and Django today. First thing to know about performance is that the word doesn't
actually mean anything or more accurately it
doesn't mean anything useful until you can attach
it to a specific metric that you're trying to optimize. So what do we want to optimize? We would like to have
more happy Instagrammers posting cat photos and we
would like to do it with fewer servers running Django. As Adrian can tell you, jazz
guitarists don't come cheap so we need to be efficient there. This suggests some possible metrics. For instance, cats per jazz guitarist. (laughing) Unfortunately, both
cats and jazz guitarists have a tendency to wander off when you're trying to measure them so instead, we measure users per server. So the numerator here is straightforward. What we measure is Active Last Minute in our peak minute of daily traffic how many users were active. The denominator is a little trickier. Obviously we trivially know how many Django servers
we have in production, but that's not a responsive
or useful metric. I mean, if I push a change
that makes everything 10% more efficient we're still gonna have the same number of servers in production. What we really want to know is not how many servers we actually used to serve our peak minute traffic, we want to know what's the
minimum number of servers we could have theoretically
gotten away with, given our current efficiency, if the servers were fully utilized. So, we determined this
experimentally in production using the Linux PERF tool. This is some C code showing
some basic usage of PERF. PERF is a tool that allows you to access hardware counters on Linux, one of which is a CPU
instructions counter. So using this, we actually
wrap this in ctypes so we can access it from Python, and then we instrument
our Django uWSGI workers with this so that we can measure how many CPU instructions are used by
Django to serve a request.
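If you're curious what that looks like, here's a rough sketch of reading the hardware instruction counter from Python via ctypes and the perf_event_open syscall. This is an illustration of the technique, not Instagram's actual wrapper; the constants are for x86-64 Linux.

```python
import ctypes
import fcntl
import os
import struct

PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_ATTR_SIZE_VER0 = 64
PERF_EVENT_IOC_ENABLE = 0x2400
PERF_EVENT_IOC_RESET = 0x2403
NR_PERF_EVENT_OPEN = 298  # x86-64 syscall number


class perf_event_attr(ctypes.Structure):
    # Truncated to the fields covered by PERF_ATTR_SIZE_VER0 (64 bytes).
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),          # bit 0: disabled, bit 5: exclude_kernel
        ("wakeup_events", ctypes.c_uint32),
        ("bp_type", ctypes.c_uint32),
        ("config1", ctypes.c_uint64),
    ]


def open_instruction_counter():
    libc = ctypes.CDLL(None, use_errno=True)
    attr = perf_event_attr()
    attr.type = PERF_TYPE_HARDWARE
    attr.size = PERF_ATTR_SIZE_VER0
    attr.config = PERF_COUNT_HW_INSTRUCTIONS
    attr.flags = (1 << 0) | (1 << 5)  # start disabled, count user space only
    # pid=0, cpu=-1: count this process on whatever CPU it runs on.
    fd = libc.syscall(NR_PERF_EVENT_OPEN, ctypes.byref(attr), 0, -1, -1, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "perf_event_open failed")
    return fd


fd = open_instruction_counter()
fcntl.ioctl(fd, PERF_EVENT_IOC_RESET, 0)
fcntl.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
# ... handle a request here ...
instructions = struct.unpack("Q", os.read(fd, 8))[0]
```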
And then what we do is, in production, we take a server, actually a set of servers, and using our load balancer, we push more and more traffic in their direction just until they start to
fall over and then stop. And we have some metrics where we can tell when a server is under stress and about to start failing requests so we stop before it gets to that point. But you can see here a traffic chart or actually a CPU load chart showing in green a typical
server in our fleet which will have sort of a curve
of the daily peak traffic. And then in yellow, one of these servers that's under our load testing, where we keep the CPU usage pegged as high as we can go without falling over. What's interesting about this is although we're measuring CPU instructions, it's not actually, that doesn't mean that we only care about CPU instructions on the server because a server may fall
over for a different reason. It may fall over, depending on how we've got things balanced,
it may start to fall over because it runs out of uWSGI workers cause we don't have enough memory to run more uWSGI workers and
so we get too many requests that are too slow to process and there's no uWSGI workers on the server free to handle more requests. So that yellow line is not necessarily at like 100% CPU utilization. There's a number of possible
bottlenecks that we could hit but regardless of which
bottleneck we actually hit, this gives us not a hypothetical or extrapolated or theoretical, but an actual experimentally
determined measure of how many CPU instructions per second we can expect one of our
Django servers to use in actually serving requests to users. And then of course we can also determine how many CPU instructions per second our entire fleet uses and divide that by our number of users to get CPU instructions per second per user. And what we actually measure is CPU instructions per second per server over CPU instructions per second per user. This is our top-line metric for all our efficiency work. Now of course this is what we measure, but we can simplify it,
arithmetically get rid of the common CPU instructions
per second factor, do a little inverse flipping, and get back to exactly
what we wanted to measure, which is how many
Instagram users can we ... Hello? Can we serve with
a fully utilized server.
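The arithmetic, with made-up numbers purely to show how the units cancel; none of these values are real Instagram figures.

```python
server_instr_per_sec = 2.0e10      # what one server can sustain, from load testing with perf
fleet_instr_per_sec = 1.0e14       # what the whole fleet uses at peak
active_users_last_minute = 5.0e6   # "Active Last Minute" in the peak minute

instr_per_sec_per_user = fleet_instr_per_sec / active_users_last_minute

# (instructions/sec per server) / (instructions/sec per user) = users per server
users_per_fully_utilized_server = server_instr_per_sec / instr_per_sec_per_user
print(users_per_fully_utilized_server)  # 1000.0 users per server, for these made-up numbers
```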
You might wonder why we choose CPU instructions per second as the sort of arbitrary common factor here, as opposed to something like requests per second or CPU time or other things that are commonly measured. One reason is that saying "a server" is a simplification: we actually run a number
of different generations of CPU hardware in production
with different capabilities and CPU instructions
per second is a metric that we're able to normalize across all those generations of hardware. The other reason is just it's
a very fine-grained metric. Requests per second is
much more coarse-grained, not all requests are created equal. So CPU instructions per second gives us more fine-grained
gradations there. We call this metric AppWeight. As you can see here from the green box, when I took this screenshot towards the end of October, we were at 18% improvement in AppWeight from the beginning of 2016. So this is our primary focus,
is optimizing this metric. Personally I think AppWeight
is kind of a weird name cause like it sounds like more AppWeight would like weigh down your servers and you'd want less AppWeight. Actually, higher is better for this metric because we're measuring users
we can handle per server, so we want more not less. So up is good. If I could name it I'd
call it Django power. (laughing) Cause it's like measuring
the power of one Django. I don't get to name things. I don't think I've been there long enough. So, one cool thing about this metric is that effectively it measures the definition of scalability, right? Because scalability means
a web service is scalable if as your traffic grows, your
hardware needs scale linearly or sublinearly with
your user growth, right? If your hardware needs go up superlinearly with your user growth, that's the definition
of not being scalable. And that's exactly what
AppWeight measures. If AppWeight is constant, that
means that our hardware needs and our user growth will track linearly. If we improve AppWeight,
our hardware needs will actually be sublinear
with our user growth. Yeah. We also do continuous
deployment at Instagram. If I make a change to the
Django code base and commit it, within about 10 minutes
it will be deployed to all of our tens of thousands of jazz guitarists and serving 4.2 billion likes per day. We do an average of 30
to 50 deploys per day. Each one usually contains somewhere between one and three commits. So with engineers pushing
this many deploys per day how do we keep AppWeight under control? So with a large team, the key is not only to have good metrics that you can measure, but also to make them visible to engineers whose primary concern is
pushing cool features. So they need to have good visibility into the efficiency metrics. Our primary performance
data set is something we call Dynostats which we gather from a normal Django middleware. It's a little more complex
but looks very much like this. We sample a fairly small
percentage of production requests. We have enough requests that we don't need to measure all of them. We can sample and get good data. If Dynostats is enabled
for a particular request, we measure a number of things
at the start of the request, the CPU instruction counter,
wall time, CPU time, the RSS memory usage of the process, a number of other things, and
then as the response goes out we collect all that, check
all those counters again to get the change, and
send that off to Scribe, which is Facebook's data and statistics pipeline project, also open source, you can look it up. We send that off along with a bunch of request metadata, like the URL path, the view, the HTTP response code; anything you can imagine about the request gets sent off along with that data.
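A simplified sketch of what such a middleware can look like, in the old-style middleware form we'd use on Django 1.8; the sample rate and the log_to_scribe() transport are hypothetical, and CPU time plus peak RSS stand in for the full set of counters (including the perf instruction counter shown earlier).

```python
import random
import resource
import time


class DynostatsMiddleware(object):
    SAMPLE_RATE = 0.01  # measure roughly 1% of requests

    def process_request(self, request):
        if random.random() > self.SAMPLE_RATE:
            return
        # Snapshot counters at the start of the request.
        request._dynostats = {
            "wall": time.time(),
            "cpu": time.process_time(),
            "rss": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
        }

    def process_response(self, request, response):
        start = getattr(request, "_dynostats", None)
        if start is not None:
            usage = resource.getrusage(resource.RUSAGE_SELF)
            log_to_scribe({  # hypothetical transport into the stats pipeline
                "path": request.path,
                "status": response.status_code,
                "wall_time": time.time() - start["wall"],
                "cpu_time": time.process_time() - start["cpu"],
                "rss_delta": usage.ru_maxrss - start["rss"],
            })
        return response
```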
So that allows us then, in our data analysis system, to slice and dice that data in all kinds of different ways. Here we see CPU instructions
by view over time and we can see very clearly
where a regression happened. So if a regression like that
happens, we'll get an alert, of a regression either in CPU
instructions or in wall time. And so how do we find the source of the regression once we get an alert? So the most obvious is to
look for a code roll out that matches up in the timeline. Usually that allows us to narrow it down to one or two commits and
that's if we're lucky. Now, we push a lot of code that's hidden behind feature gates, and so often the
regression doesn't actually jump up nice and neat like that, and often it doesn't correspond
with the code roll out because it actually
corresponds with somebody turning up a feature gate
slowly to expose the code path to more and more users. So we also can show those
feature gate changes on our timeline and try to match them up with the regression. If we can't immediately
find an obvious cause using one of those
techniques, we dig deeper using a tool that probably
many of you have used, cProfile, from the standard library. So in the past, when I've used cProfile, I've typically run it in a controlled environment on my development server or my laptop or whatever. But for us, getting realistic data anywhere other than
production is pretty hard. So we sample cProfile in production as well. Again, we do it using a middleware, which is very similar. We sample an even smaller percentage of requests, because cProfile has more of an impact. When you instrument with cProfile it does slow things down, so we don't want to do it on a lot of requests, but we do it on enough to get the data we need. And we just create a cProfile profiler object, attach it to the request, enable it, and then on the way out we disable it, generate the statistics, and send them off to Scribe.
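A minimal sketch of that kind of profiling middleware; the sample rate and the send_stats_to_scribe() helper are hypothetical stand-ins.

```python
import cProfile
import pstats
import random
from io import StringIO


class ProfilingMiddleware(object):
    SAMPLE_RATE = 0.001  # profile a much smaller slice of requests

    def process_request(self, request):
        if random.random() > self.SAMPLE_RATE:
            return
        request._profiler = cProfile.Profile()
        request._profiler.enable()

    def process_response(self, request, response):
        profiler = getattr(request, "_profiler", None)
        if profiler is not None:
            profiler.disable()
            out = StringIO()
            pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(50)
            send_stats_to_scribe(request.path, out.getvalue())  # hypothetical transport
        return response
```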
And that allows us to then generate tables like this, among many other things. I had to cut off the leftmost column with the function names here to get it on the slide in a readable way, but each row in this table is a function in our code base, and we can do time-based comparisons. So like, if we know when a
regression happened roughly we can say compare the time before that to the time after that,
and in this particular case we can see that one function
is now using 70% more CPU instructions then it did previously. So that allows us to
very quickly narrow down to exactly what the source
of the regression is and where to look to fix it. So this is a key tool and we use it daily. We also pipe Cprofile data into gprof2dot which will generate graphs like this which show the hot paths. So again, we can see quickly where we need to focus optimization efforts. And as yet another step in the
aim to make this data visible to every engineer, not
just those of us who are focused on efficiency: in Phabricator, which is our code review tool and code browser, also open source, some of you may use it, if you hover over any function, you get a pop-up hover
card which will tell you what percentage of our global
CPU that function consumes exactly how many servers would, in theory, be needed solely in order
to power that function, how many views, which
views use this function, which other functions call this function and you can also drill down to see what does this function call
that's using the most CPU. So again, the goal to make this data very visible to every engineer. One thing, interesting
thing, about Cprofile that I didn't learn until
I went to Instagram, is that you may have noticed
if you've used Cprofile before you may have noticed that I keep talking about CPU instructions. Normally, that's not
what Cprofile measures. Cprofile, by default, measures CPU time. But you can actually pass
any function you want into Cprofile any function
that returns a number and Cprofile will measure that
or consider that the clock. So we don't use the default
cProfile timer at all. We use two different timers: one that uses perf, as I showed before, to get CPU instructions, and another one that actually
has no relationship to time, but actually measures RSS
memory at the beginning and end of each function call. So you can use cProfile to measure anything, even if it's not time-related at all.
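A small sketch of that custom-timer hook: pass any function that returns a number and cProfile treats it as the clock. Here the "clock" is the process's current RSS, so per-function deltas come out in bytes rather than seconds (and reading /proc on every call is slow; this is purely for illustration).

```python
import cProfile
import resource

_PAGE_SIZE = resource.getpagesize()


def rss_bytes():
    # Current resident set size, read from /proc (Linux-specific).
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1]) * _PAGE_SIZE


profiler = cProfile.Profile(timer=rss_bytes)
profiler.enable()
# ... code under measurement ...
profiler.disable()
profiler.print_stats()
```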
Another interesting side note about cProfile: it doesn't record the whole call stack. What it records is caller/callee pairs. So if you have, say, two functions, A and B, that call functions X and Y, cProfile can tell you that of all the calls to Y, 10% came from A and 90% from B, which is great; that's
very useful information. But if X and Y are both decorated with some common
decorator that wraps them, Cprofile ends up seeing a
picture like this instead and all we can find out now
is that 100% of the calls to Y came from
cached_property.get for example. So we lose that useful
caller callee information. It would be nice if there was a feature in Cprofile to deal with this. There isn't so we
actually run a custom fork where we've hacked up the C code to tell it to ignore
certain common wrappers and just essentially pretend that step in the call stack doesn't exist. So we essentially trick Cprofile into seeing this picture again. If the wrapper, the decorator wrapper, used a significant amount of CPU itself, this would give us misleading data because we're essentially
rolling up all of the wrapper's CPU instructions
into the caller function. But in general these kinds of decorators don't tend to use a lot of CPU internally. It's not particularly misleading and it gives us better
caller callee associations. Alright, enough on Cprofile. So what do we do when we find
an efficiency regression, we've narrowed it down to the source, and we need to fix it? I had a ruder version of
this one but I took it out. Many of the cases are simply silly things that should be fixed, like an N cubed algorithm somewhere or concatenating a bunch
of strings in a tight loop or fairly obvious things
that once you see them you go oh yeah that's pretty
clear how we can fix that. The close relative of the first one is not doing useless work, and these two together account for the vast majority of our
performance regressions. Recently we fixed a case where we realized that we were fetching a set of media, photos and videos, and
fetching all of the comments on every one in order to be used in a view that didn't actually
show the comments at all. So, there are sometimes obvious cases where you just don't do useless work and you can gain a lot of efficiency. Getting into slightly more difficult areas we do a lot of caching,
not only in memcache, but also just in process, and often just for the duration of one request. If there's something that would otherwise be accessed multiple times, we can cache it and just do it once.
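A small sketch of request-scoped caching: memoize a value on the request object so repeated lookups within one request do the work only once. The decorator, the attribute name, and expensive_lookup() are hypothetical.

```python
import functools


def per_request_cache(func):
    @functools.wraps(func)
    def wrapper(request, *args):
        cache = getattr(request, "_local_cache", None)
        if cache is None:
            cache = request._local_cache = {}
        key = (func.__name__,) + args
        if key not in cache:
            cache[key] = func(request, *args)
        return cache[key]
    return wrapper


@per_request_cache
def get_viewer_settings(request, viewer_id):
    return expensive_lookup(viewer_id)  # hypothetical expensive call
```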
Getting even more in-depth: how many of you in here have used Cython? Okay, quite a few, that's cool. It tends to get used a lot in the scientific community, not quite as much in the web community, but it's actually very easy to use and extremely handy. It'll take a Python file, often with no changes, although you can get additional speed-ups by adding some type annotations, and it will compile it to C code and then run it as native code, often much faster. And so when we have a hot spot in our code where we can't find any more optimizations at the Python level, we'll just change the extension to .pyx, our build chain will automatically compile that using Cython, and we can see big gains from that.
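As a small, made-up illustration of the kind of hot spot this helps with: a plain Python function like the one below is also valid Cython, so renaming the file to .pyx and compiling it (for example with cythonize in the build) runs it as native code, and adding Cython type declarations typically speeds it up further.

```python
# Plain Python, and also valid Cython; compiled as a .pyx module it runs as
# native code. The function is just an example, not Instagram code.

def weighted_sum(values, weights):
    total = 0.0
    for i in range(len(values)):
        total += values[i] * weights[i]
    return total
```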
And of course sometimes, handwritten C can outperform what Cython is able to do, and so we do sometimes take an extremely hot spot and just rewrite it in C and call it as a C extension. Stepping back to Django, as we heard yesterday, one
phrase that's sometimes used in describing Django's design aims, is this tightly integrated,
loosely coupled, it's also occasionally
been the subject of mockery by those who didn't think
we were achieving it or that it was contradictory. But I think that the success
of Instagram with Django is really a case study in the success of this very design goal. Or a similar one, a phrase
that's often paraphrased; it originally comes from Perl creator Larry Wall: make the easy things easy
and the hard things possible. As we heard earlier, Django
is tightly integrated out of the box although
pieces work together, you don't have to figure
a lot of things out or make a lot of decisions,
and that tight integration, that making the easy things easy, allowed Instagram to get up
off the ground very quickly. But when we had to start
replacing components, we outgrew the ORM after
5,000,000 users or so. We were able to switch
to our own homegrown ORM and later to TAO while continuing
to use the rest of Django. There's a lot of Django that still is used everyday at Instagram. Among other things, we
still use contrib.sessions. We still use contrib.auth. They're pluggable backends, so that allowed those applications to scale with us from
one user to 500,000,000, along with many other things. So long story short,
Instagram has been very happy with their choice of Django. And on that note, as a follow
up to Nadia's excellent talk I'm happy to be able to announce that Instagram is joining the
Django Software Foundation as a corporate member at the gold level to help support the Django
open source community. (applauding) We've also now, just
in the last few weeks, we've gotten a contributor
license agreement, a corporate CLA in place which will allow Instagram and Facebook
employees to contribute to Django as part of our jobs. (applauding) So as we at Instagram build towards our next 500,000,000 users, we are counting on Python
and Django to get us there. We have no plans to switch to Rust or anything else that comes along. We're planning to stick
with Python and Django and we want to do our part to help keep the community strong. Looking a little bit
more towards our future some things we want to
look at in the next year. We intend to be on Python
3 sooner rather than later. In fact, this was a big motivation for getting to Django 1.8 in the first place: we wanted to be on a Django version that would support Python 3. We'll probably do this upgrade just as badly as we did the first one. (laughing) But hopefully also as effectively. Like many people we want
to do more with async, and specifically asyncio. We actually already use asyncio, the Python 2 backport called Trollius, to do a fan-out of back end data requests from the web server. But we know that we're wasting
a lot of web server capacity by using synchronous uWSGI
workers to serve all requests and so we want to explore
async web serving. There's a lot of question marks there about how or if we can
do that with Django. But I think it's possible and stay tuned for maybe a
future talk in that area. One project I'm personally working on right now is a traffic replay system. So the idea is that we can
record production traffic, production requests, store them and then set up test servers and replay those requests
against the test servers. For instance, we can have a control server and then a server with some diff applied and actually collect some performance data from realistic traffic before we ever launch it into production. So again, maybe a future
talk on how we manage that. I will be very surprised if
this hasn't already shown up in the slack channel as a question. It did? - [Man] Twice. - Thank you, I knew I
could count on all of you. (laughing) We have looked at PyPy in the past, several years ago, before I was there. Apparently there were
issues with memory usage and also we do use a lot of C extensions not only for performance
but also simply because there are C libraries
that we need to be able to make use of and not all C extensions written for Cpython work well
with pypy's garbage collection without some modifications
or improvements. So we had some troubles
with some of that stuff the first time we tried pypy. Pypy has come a long way, so have we. It's something we'd like to look at again. Again a lot of question marks there about both for memory reasons
and for C extension reasons whether we'll be able to
get it working well for us, but it's something we want to look into. Alternatively, if that doesn't work out, there is some work going on right now by Brett Cannon and I think Dino Viehland to integrate into CPython itself the hooks necessary for a just-in-time compiler. That's probably a long way out still, but if that happens, that
could be huge for us. Lastly, I've been saying we all along for all kinds of things that
I had nothing to do with, I'm just here talking about them, so I wanted to take a slide to acknowledge at least the team that I work with, the efficiency and
reliability team at Instagram and all of the awesome
work that they have done, and many other teams that I didn't have room to fit on the slide. A lot of the stuff I talked about here, there's more depth on it at
the Instagram Engineering blog, engineering.instagram.com. Feel free to check that out. We are hiring if anybody's looking for employment with Python and Django and if you want to follow
up with me on anything I'd be happy to chat here or afterwards. Thank you. (applauding) You need the network which is not there. - No it was after mine. Oh okay. So let's try that. Feed my laptop. Excellent. So, unsurprisingly we have an extensive number of questions. Quite a lot of them around
the scaling problems that you have and the other ones. I'm gonna start with the
most important question, are there more photos of
cats or more photos of dogs? - I'll have to get back to you on that. (laughing) I need to see if we're
authorized to release that data. Obviously we all know that by heart, but I mean, it's a little bit sensitive. - So, let's start with two. You mentioned a bunch of
things that you were using, a bunch of profiling
tools, internal things, monitoring tools and so on. You know, within your
Instagram Engineering bubble or something do you have somewhere where you list all of
these that are opensource or which bits are opensource? I probably had about 10 questions which were is this thing opensource, is this thing opensource,
or are you going to? - We do have some posts like that in the backlog of our engineering blog but I don't think we have anything recent or up-to-date that kind
of lists out everything. So that's probably something
that we should look at in terms of an updated blog
post and kind of the full stack and the pieces and what's opensource. So yeah I'll put that on my to-do list to talk to somebody
about getting that done. - And similarly with
sort of monitoring tools and so on as well do you use when you're dealing with your areas are you now all using like Facebook stack? - Yeah. We're using an
internal Facebook stack for monitoring and alerting. I mean I know Scribe which
is sort of the pipeline that gets the data off
the individual servers and into the data analysis
system, that's opensource. I'm not sure off the top of my head which other pieces of the data analysis and alerting stack are opensource. Some of them may be but
I need to look into it. - All the talk of these
enormous scaling problems certainly makes for some entertainment, how much of it do you
think is actually relevant to normal size projects
and which of the things should we be considering using ourselves? Is it worth us looking at doing this see your profile extension? Should we be looking at using TAO for a sort of moderate size website? - Yeah I think a lot of it is
not relevant to most websites. I mean, I think Django
does a very good job of hitting the 80% use
case and focusing on that. And very few web applications ever make it to the level of needing to
worry about these concerns. I think it's appropriate
for Django not to, certainly not to make
things more difficult for the projects just starting out, in order to make it easier
for very large scale projects. But that said, I mean,
certainly some of these things, some of the profiling stuff could potentially be useful
even to midsize projects. I don't know how much
of it belongs in core. We could maybe have some
better hooks in core for detailed performance monitoring of what's going on within Django. That's definitely been discussed before in connection with the
Django debug toolbar. - [Man In Black] And also with Opbeat. - Yeah. Yeah. - [Man In Black] And these other services that provide that sort
of tooling and all-- - And they're all monkey patching, yeah. So yeah, I think there's some things that can be done in core there. I think the bulk of the work
could be external packages, it doesn't really involve changes in core. - Do you think that some of
your developers or whatever may be able to contribute
some documentation about how, you know, here's a collection of moderate size scaling pitfalls that you probably shouldn't do that are quite specific to Django? - Yeah possibly we could look into that. There'd certainly be
other people who are more familiar with that stage of the growth than I am personally but yeah. - Last question on the
performance per se, do you use any sort of staging environment or is it more in a sort of roll it out to small parts of the ecosystem and hope? - Yeah. I mean at this
scale it's pretty hard to have a useful staging environment without making it an impractical size. So no, not really. - [Man In Black] Running 100,000 Djangos. - Right. (chuckling) - [Man In Black] Staging Djangos is a relatively expensive operation. - So I mean, what we tend to
do more is roll things out to production gated, and then turn them up very slowly and carefully. - So sort of internally, I guess, Django, Instagram's a sort of
project where predominantly I imagine a high proportion of traffic comes over the API rather
than the web directly. - [Carl] Yeah. - Does that kind of influence
the choice of technologies in choice of engineering
that you're doing much? Do you distribute things carefully and have like API servers
run their own thing and you've got separate
services to do other things or is it kind of more monolithic? - It's all the same service. I mean, I don't think we're doing anything particularly interesting in that regard compared to most other Django projects. I mean we have a lot of
views that return JSON is what it boils down to. Yeah. - With that monolithic architecture then is it just somewhere on your laptop is a checkout of Instagram and everyone's got access to everything
and pretty much anyone can work where they feel like or what's your sort of internal structure? - Well in case anyone was thinking of making off with my laptop. (laughing) I don't actually. We tend to do our work on development VMs over the network. But yeah no I mean one
of the cool things about working at Facebook and Instagram is that it's a very open
culture inside the company and so there's a lot of
freedom for a developer to go poking around in the
things that interest them and beyond that actually work
on whatever interests them. I mean, there's a very strong culture of development being driven by engineers and managers are there
to support the engineers. So if you think that
something needs to be done or would have a big impact you're supposed to go ahead and go do it and not ask for permission
and then see if it does. - Roughly how big is this code base? Significant lines of code or something? Do we have some sort of
idea of how big we are? - I don't have numbers on that. I'd need to look into it. - Your internal TAO based-- - It feels big to me compared
to my past experiences. But that's about as precise
as I can get without research. I'm sorry. - Your internal usage of TAO and so on, do you use something now
that's like a Django ORM or do you just deal with
sort of lower level objects for speed reasons, you know, some main tuples dictionaries and so on? - So there is a Python client for TAO which has been developed very much sort of by the TAO engineers at Facebook but in very close collaboration
with Instagram engineers. There is other Python at Facebook but we are by far the biggest
user of the TAO from Python. And it's different from the Django ORM. I would say one of the key differences that maybe relates to some
things Emrich talked about yesterday is that,
compared to the Django ORM, it's much more explicit about when you are going to the network,
going to the data store, versus when you're just
constructing some objects, so there's certainly nothing like look, accessing this attribute on an object magically does another query. Some of those pitfalls
you just can't afford. But I mean other than that sort of thing, yeah it's similar to the Django ORM in that you have objects
representing your nodes. - [Man In Black] The gating of features, do you do that within Django? Is all at routing level? - Yeah that's done within Django. We have a system that
actually uses Zookeeper to push out sort of
configuration information regularly to all the web servers and then features are gated by checking configuration values that
are pushed out that way. - With your actual Djangos themselves, what operating system are they running on and what's your uWSGI set up? Are you using process, threads, Gevent? What are the limits on what you're doing? Is it CPU, is it IO, memory? - Obviously running on Linux. Actually I don't personally
know more than that. I mean, I'm not on the ops team. So I'd need to check on details there. We're not using Gevent or any
kind of monkey-patched async; like I said, we are using asyncio on the back end for some fan-out data store queries, but the uWSGI workers are just regular synchronous uWSGI workers. What was the other part of the question? - That was pretty much it. - Okay. Yeah. - I think, hopefully. Sorry? Oh yeah, it's what's
the limits that you hit, yeah. - Oh, what's the bottleneck? Yeah, interestingly, I mean,
that actually varies some with different hardware generations which I think is a very rough signal that we're more or less
finding the right balance, but we do have some hardware generations where we are actually memory constrained and some others where we aren't. - Obviously it's still awhile ago, so you may not know all the details, but when you're going
through this migration from single details to vertical sharding or vertical sharding to
horizontal sharding and so on, do you use any particular
tools for the process or was it all kind of manual
configuration and hope? - So for the migration from
one data store to another? - From migration to like multiple. You know, migrating from one massive shard to lots of little shards and
from the vertical sharding to your horizontal sharding. - Yeah well I mean
obviously I wasn't around for a lot of that so the
slides got pretty close to the extent of my knowledge about some of the historical stuff. I mean I was around for the latter half of the transition from the
horizontally sharded ORM to TAO and I mean that was done, again, really no magic just a
lot of like elbow grease. I mean like it was sort of
like all of these things when you're working at a large scale, it was a slow and iterative process where it's like you take
one type of data at a time and one view at a time and
you often have the data stored in both places
with some background jobs running to make sure things
are not getting out of sync and then eventually you
cut it the whole way over and you can get rid of the old data store and then you just keep iterating on that with one type of data after
another until you're done. - You were saying about
moving towards Python 3.0 do you have any estimates
for the time scale? How far along are you? - I don't have estimates
but I'm sure you will all find out when we're there. (laughing) - A couple of things looking forwards. I think, obviously, we are-- - You should know better than to ask a software developer
for estimates on stage. (laughing) - I will point out the
honor of my question. And so looking a bit
more towards the future, obviously as the DSF we
are delighted to welcome Instagram as a corporate member, and someone has asked exactly
how much the contribution, what is the gold, you said
gold level sponsoring, that might not mean anything
to most people in this room. - So gold level I believe
is $25,000 per year and up. Yep. - Within that context are you intending to stay on 1.8 for a while? Are you wanting a longer LTS on 1.8? Are you wanting to move upwards to the next LTS when it comes out? - That's a great question. I don't think we totally
know the answer yet. I mean, the move from 1.3
to 1.8 was a lot of work and I don't think people are really eager to put in that kind of work again. I'm hopeful that the next
time around will be easier. I think it's very unlikely that we'll go version to version. I think we probably will
try to go LTS to LTS, but we'll have to see. I mean, Python 3 was a big carrot, because of async and because of type annotations. We really wanted to be on Python 3, and so we needed to upgrade
Django to get there. So that was a really important carrot for getting us onto Django
1.8 and at this point, I'm not sure what the carrot will be for the next Django upgrade. And so there may be some temptation. I mean, the work of
backporting security patches can actually look tempting, relative to the work of upgrading. So, I think there's some
uncertainty there at this point. - Do you feel that
Channels or other projects that are knocking about, do you think Channels is gonna be useful for your asynchronous efforts, or are you thinking you're gonna be at a much lower level than that? - Isn't Andrew here? Andrew. I actually discussed that very question with Andrew
over breakfast this morning. I think the Channels is really solving different problems
than the ones that we have. I mean, we do web sockets
or real time communication but we already have a
whole separate system set up for that that integrates
with Facebook infrastructure and uses totally different technologies and it's pretty unlikely that we're gonna want to push all that back into Django, and that's really the primary
problem that Channels solves. And the problem we have is wanting to make the entire web request from client all the way to back end
data stores and back all async which would allow our workers to serve many more requests concurrently. And that's not a problem that Channels in its current design even tries to solve because it keeps all of
the Django view logic in a synchronous worker process. So yeah, I think Channels is very cool, but it's solving different problems than the ones that we have. - Final question, what surprised you most about everything when
you joined Instagram? - I mean this is gonna sound like a little bit saccharine but honestly like just how nice my team was. I mean, like I had never in my career worked for a big Silicon Valley company. Like, I ran my own company in
the Midwest for eight years and I was a little bit apprehensive about what it would mean to work at a big Silicon Valley firm. But like, my coworkers have been awesome, I mean, yeah, in every way. So very smart and very easy to work with and just a really fun and
collaborative work environment. - [Man In Black] Excellent.
Well, thanks very much Carl, and thank you from Instagram. - Thank you. (applauding)