An Introduction to Computational Social Science

Video Statistics and Information

Captions

Hello. In this video I'll be providing an introduction to computational social science. One of the questions I'm asked frequently is: what is computational social science? Since I'm asked this question so much, I thought about it and came up with the following definition: computational social science is anything that's cool. Now, of course, this is kind of a joke, but it's also kind of serious. That is, I really think that as a field, as computational social scientists, we should try to resist the urge to define what the field is at this moment, and I think that for two reasons.

First, the field is changing really quickly. I've been in conversations about computational social science for a long time, and, let's say, ten years ago I was in conversations about how I should try to define the field. If we had tried to come up with a definition ten years ago, that definition would almost certainly not be the right definition for today. So in a field that's changing really quickly, I think we should resist the urge to try to build boundaries around the field.

The second reason why I think we should try to resist a formal definition, at least for now, is that we can also think of computational social science as a kind of social movement: it's people who are trying to change the way certain kinds of research take place. For social movements it's very helpful to have a big, open, welcoming tent that everyone can participate in. For example, the organic food movement in the United States in the 70s was initially made up of people who had very different goals. Some of those people wanted food that tasted better, some wanted to support local family farms, and some were really worried about whether the food had pesticides. All of those people worked together to create the organic food movement, and it was only after that movement had grown substantially that they sat down and figured out a set of rules that were necessary in
order for food to have an organic sticker on it. So by being open to the contributions of many people, they helped the field grow. For those two reasons, I think it's best not to have a formal definition of computational social science.

However, we still come back to the question: what is computational social science? And "anything that's cool" is maybe not the most helpful answer. So what I'd like to do instead is give you an example of a study that I think illustrates many of the important themes that come up over and over again in computational social science. This is a study by Josh Blumenstock and colleagues that was published in Science in 2015, and let me tell you a little bit about the problem they were trying to solve.

Global efforts to end poverty run into many, many problems. One of those problems is that it's actually hard to know how much poverty there is, where it is, and how it's changing over time. Many people in wealthy countries are used to having national statistical systems that collect these kinds of data in a regular and ongoing way and at high quality. However, in many of the world's poorest countries, exactly where this information is most needed, it's often not available. So what Blumenstock and colleagues were interested in was seeing if they could come up with better ways of measuring the amount of poverty and wealth in developing countries by using mobile phone metadata.

Let me tell you a little bit more about how their study works. They started with a bunch of call records: records from the calls of 1.5 million customers of Rwanda's largest mobile phone provider. They have records of outgoing calls and incoming calls, and they have the location where the call took place, the cell tower. They don't have what was actually said in the call, but they have the call metadata. These call records provide a lot of information about these 1.5 million customers, but they don't provide exactly a
measure of poverty and wealth. So what the researchers did is they took a random sample of these people and actually called them and gave them a survey, a traditional social science survey that's designed to measure poverty in countries. Then they linked up these two very different kinds of data to use them together.

First they took the call records and went through a process called feature engineering, where they converted these records, where there is one row for each call, into something where there is one row for each person and one column for each feature. These columns could also be called variables, but I'm going to use the word feature. A feature could be something like the number of outgoing calls or the number of incoming calls, but it could also be more complicated things, like: of the people that you call, what percentage of them call each other? Things like this could all be features.

Once they have a big matrix where each row is a person and each column is a feature, they can build a machine learning model to link these two things together. They train the model, given the features they have about your calling behavior, to try to predict what you would say on this survey about your poverty and wealth. Once they've built that model, they can use it to impute the survey responses for all the other customers. So by giving a survey to about a thousand people and combining it with the call records, they were able to impute, or guess, the survey answers for the other 1.499 million people.

Then they were also able to infer everyone's residential location, and the way they did this is roughly by looking at where people made their phone calls at night. What they did is a little bit more complicated, but that's the main idea. You can now put these estimated levels of wealth together with the estimated residential locations to produce high-resolution maps of poverty. So this map
here, the red square with the tiles, that's a one kilometer by one kilometer square, and you can see that within it they're able to produce estimates for extremely small geographic areas. In this graph, darker areas are wealthier.

You may be thinking: wow, how accurate are those estimates? And the answer is, we don't know. In fact, no one has ever estimated poverty at that small a geographic scale before in Rwanda, so it's actually very hard to know if these estimates are good or not. This is a problem we see frequently in computational social science, where we're estimating something that's never been estimated before, and so it's hard to know how well the new technique is working.

But fortunately, Blumenstock and his colleagues were working in a setting where they did have something else they could compare to. They were able to compare to the Demographic and Health Survey, which is a large-scale, probability-based sample with in-person interviews that's funded by the US government. These are done all over the developing world, and they are considered gold-standard data about health and demographic information in developing countries. To be clear, these are not perfect, but this is one of the best estimates we can feasibly make.

So let's see how this new technique compared to this existing gold-standard technique. Blumenstock and colleagues aggregated their estimates to produce one estimate for each of the 30 regions in Rwanda, and these are the estimates that they produced. Now here are the estimates that come from the Demographic and Health Survey. As you can see, these two maps look quite similar. In the paper, of course, they do a very detailed comparison, but for our purposes we can say these two things are basically the same.

So now you may be thinking: wow, it looks like Blumenstock and colleagues figured out a way to make estimates for
something that we already knew how to do. We can already do the Demographic and Health Survey, so why do we need this new approach? What I haven't told you yet is that the approach by Blumenstock and colleagues is ten times faster and 50 times cheaper. Now, 50 times cheaper: it's not ten percent cheaper, it's not twenty percent cheaper, it's 50 times cheaper. This is qualitatively different. For example, right now we do Demographic and Health Surveys roughly once every five years. If we could cut the cost by a factor of 50, we could do these surveys every month, and so we wouldn't have to wait five years to collect this important information about these countries. We could collect it every month, much like it's collected every month in many wealthy countries.

This is another thing that comes up a lot in computational social science: often the advantages of new techniques are that they are faster and cheaper. But researchers are often not really excited by faster and cheaper, so it's very important to take these improvements in speed and cost and map them back to a scale that researchers are more excited about, something like going from Demographic and Health Surveys every five years to every month.

Also, the researcher in me wants to be clear that this is not totally a fair, apples-to-apples comparison. I said the approach by Blumenstock et al. is 10 times faster and 50 times cheaper. However, the approach from the Demographic and Health Survey has formal theoretical guarantees that come from the nature of probability sampling, and there's substantial know-how that's gone into deploying these surveys all over the world. There's a lot of practical know-how that we've learned over the last 50 years about how to do these effectively. The approach by Blumenstock and colleagues doesn't have any of that. It doesn't have this 50 years of know-how, and it doesn't have formal theoretical guarantees yet. And "yet" is the key
word. Learning from these big data sources is one of the fastest-growing areas that I see in computational social science, and I think eventually we will start to solve many of these problems.

So let's take a step back from the Blumenstock et al. study and think a little bit more about what computational social science is. The Blumenstock et al. study combines things that are computational: for example, taking the call records and converting them into that feature matrix, where you have one row for each person and one column for each feature, and building that machine learning model. There are a number of complex computational steps involved there. It also involves social science, in terms of figuring out what is an important research question to ask, how we collect the data on poverty, and how we then make estimates that are meaningful to policymakers. So it's a blend of computational and social science.

This study also involves ethical and privacy questions that are now considered complex. In this particular case, they had access to the metadata of the call records for 1.5 million customers of this mobile phone company. To be clear, they did not have the calls themselves, but they had access to the metadata, and in other research we've learned that the metadata from call records is actually quite identifying. So even though they didn't have the names and addresses of these people, it probably would have been possible to figure out who these people were and learn things about them. Now, to be clear, Blumenstock and his colleagues didn't do that, but they had to be very, very careful because of the privacy risk that the data carried.

Finally, this study combines two different types of data in a way that comes up a lot in computational social science: it combines data that are ready-made and data that are custom-made. Let me illustrate these two styles of data with an example from art history. This is a urinal, but it's not just any urinal. This is a very special urinal. This is
Fountain by Duchamp. This is one of my favorite pieces of art because it's so creative. What Duchamp did is he saw this urinal and he said, this is not a urinal, this is art. This is a process of repurposing: taking something that was created for one purpose and turning it into something different. And I think some of the best data science research in the field of computational social science involves this creative repurposing. For example, the call records used by Blumenstock and colleagues were not created for the purposes of research; they were created to help run a phone company. Blumenstock and colleagues said, no, we can take those records and repurpose them for something else. This repurposing is practiced very well in the most creative and beautiful data science research in computational social science.

If we had to come up with another style of art to illustrate the type of data that's more common in social science, it would perhaps be David by Michelangelo. When Michelangelo wanted to create David, he didn't look around for something that kind of, sort of looked like David. He said, I'm going to labor on this marble for three years to create David. That is an example of a custom-made piece of art.

So we have this example of the ready-made, which is more illustrative of the style of data scientists, and we have this example of the custom-made, which is more illustrative of the style of social scientists. I think what we will see increasingly in computational social science is the idea of combining these two styles, just as Blumenstock and his colleagues did. The call records alone were not enough, and the survey data alone were not enough; it was only by combining them that they could make these high-resolution estimates of poverty.

I think increasingly we will see that each of these pure approaches will run into limits. For the pure ready-made approach, I think people who are accustomed to using ready-mades will
realize that there really aren't that many Fountains out there; there are a lot of urinals. In other words, there are real limits to what we can learn by repurposing data that was never created for research. Likewise, for researchers who are more accustomed to working with custom-made data, it will become increasingly difficult to ignore all the ready-made data that exists in the world. If our goal is to learn about the social world as quickly and accurately as possible, then it becomes harder and harder to ignore all the ready-made data that exists. And now try to imagine all the ready-made data that will exist five years from now and ten years from now. Even if that data doesn't have the precise measurement properties that we want, even if it's not collected in the way that we're used to, we can't simply ignore it. If we want to learn as much as we can, we're going to have to figure out how to take advantage of these ready-made data sources. So I think increasingly, just like in the study we saw from Blumenstock, we'll see combinations of ready-made and custom-made data.

So what is computational social science? I think it often has these three characteristics: it involves a computational element and a social science element, it often involves ethical and privacy questions that are now considered complex, and it often combines ready-made and custom-made data.

Now that I've said what computational social science is, I'll tell you another question that I get asked frequently, which is: isn't computational social science just a fad? Here I have a much simpler answer, and the answer is no. Let me tell you why I don't think computational social science is a fad. It's not a fad because it's being driven by a fundamental change in the world. This graph shows the amount of information stored in the world, and there are two patterns that you should take away from it. First, the amount of information in the world is expanding very
quickly, and second, more and more of this information is now stored digitally. It is this transition from the analog age to the digital age that is the fundamental driver of a lot of the opportunities in computational social science.

Now, this graph shows the trend from 1986 to 2007, but information storage follows exponential growth similar to Moore's Law, where things double every two years, and there is similar Moore's-Law-type behavior in our ability to store and transmit data. So this graph goes up to 2007, but from 2007 to 2009 the amount of data doubled, from 2009 to 2011 it doubled again, 2011 to 2013 is another doubling, 2013 to 2015 is another doubling, 2015 to 2017 is another, and 2017 to 2019 is another doubling. This kind of doubling will proceed for the foreseeable future. So as social scientists, we have the opportunity to take advantage of all of this, or we can be left behind.

When people think about all this information, they often tend to gravitate towards thinking about online sources: social media data, search data, things like that. But increasingly we should think about digital data as being created everywhere. One of the trends in computing now is the so-called Internet of Things. Maybe you have an Alexa device in your house; that's an example of a sensor built into the physical world, and increasingly we're seeing more and more of these sensors. Just as all your behavior at an online store is now instrumented and they have the ability to run experiments on you, it will increasingly be the case that in the physical world our behavior will automatically be recorded, and the physical world will be instrumented to enable more experimentation. So when you think about all of this data that's created in the digital age, don't just think about online; think about everywhere.

So I don't think computational social science is a fad, but I do think it has fad-like elements, and
so the way I like to think about this process of fads within the field is this graph here, which is called the hype cycle. This was created by the consulting firm Gartner. The x-axis here is time, and the y-axis is visibility. Initially there's a technology trigger, and then we move to this peak of inflated expectations. This is where people say, oh, big data is going to change the way we learn about the world, and it's going to end poverty and cure cancer and unrequited love. Then people start to realize that, actually, there are a lot of problems with these new data sources and some of these capabilities don't work as well as we thought, and they move into this trough of despair. But over time we figure out how to solve some of these problems, and then we move into the plateau of productivity. I think our goal as a field should be to push down this peak of inflated expectations, pull up the trough of despair, and get to the plateau of productivity as quickly as possible.

So what will that plateau of productivity look like? As a sociologist, I can think of two things in my own field. First, computational social science and digital techniques do not displace other techniques; they complement them. Think about a doctor and an x-ray machine. Just because the doctor has an x-ray machine doesn't mean that she wouldn't talk to her patients; that would be a crazy doctor, right? So it's something that complements the other tools that we have; it doesn't replace them. Second, computational social science doesn't change our goals. Our goals as researchers are the same as they were before: to understand the social world. Computational social science just helps us achieve those goals more quickly. Again, think of a doctor with an x-ray machine: the x-ray machine does not change the doctor's goals. Her goals are still to help us get better; this is just another tool that she has to help her achieve those goals. So that's a little bit about what I think the plateau of
productivity will look like now how do we create a computational social science community imagine you're about to take part in a community that brings together people to help learn about computational social science something like the summer institutes in computational social science so I think the key idea is data scientists and social scientists working together with each group realizing that they have something to learn and with each group realizing they have something to contribute so now you may be wondering well what is a data scientist just as I talked before about what is a what is computational social science you may wonder what is it data scientist when is data science and again I think the answer is anything that's cool so again I have a more serious answer what is data science I think this paper is excellent for answering that question 50 years of data science by David Donoho I strongly recommend this paper if you haven't read it so he traces data scientist not to be defines it not to be people with laptops and hoodies but he traces it to go back to people like John Tukey who were focused on learning from data and so that includes two six it includes certain parts of computer science includes certain parts of other engineering fields it also includes certain parts of social science and includes certain parts of things that aren't currently happening inside of universities so data science I think is really about learning from data and include statistics and much much more so the idea of social scientists and data scientists learning from each other and also teaching each other is a great way that as computational social science the community can grow but this kind of exchange will not be always so easy but we have to remember first that data science alone is not enough if we want to study social behavior if the questions are fundamentally social in nature then ideas from social science will be key in defining what the important and interesting questions 
are and in helping answer those questions. Likewise, social science alone is not enough if we want to be able to take advantage of these new digital-age data sources. So it's really only going to be through a coming together of these communities that computational social science will thrive.

However, the coming together of these communities also creates some challenges. Hanna Wallach, who is a computational social scientist, reports hearing this kind of conversation: "I don't get it. Why is that research?" What you might find, as you start engaging with people who come from different disciplinary backgrounds, is that there is a mismatch in what people expect and in what people think is good, important, or interesting research. Hanna Wallach divides people into two broad communities, the computer science community and the social science community, and roughly summarizes the characteristics of research in these fields. Computer scientists can study almost anything; the work is largely methods driven, it often uses large found datasets, and the focus is often on prediction. Social science, on the other hand, often studies social things; it's often question driven, where those questions come out of a long research tradition; it often uses small designed data; and it focuses on explanation.

So if you find yourself in a community that mixes together social scientists and data scientists, or even social scientists from different fields, you're going to have this experience of misunderstanding, or not quite understanding, what other people are doing and why they're doing it. Again, the key thing to keep in mind is that each community has something to give to the other and each community has something to learn from the other. So if you find yourself having more of a background in computer science and you're talking to a social scientist, try to understand more about how they see the world and why they're doing the research that they're doing.
Likewise, if you're a social scientist and you see a computer scientist doing something you don't quite understand, again, try to see the world through their perspective and try to understand how their discipline approaches research. It's through a combination of these styles that we'll be able to do the most interesting research going forward.

Another difference that you'll see if you start engaging more in a computational social science community, like the Summer Institutes in Computational Social Science, is that there is a slightly different emphasis between the fields. Let's look at this picture here: is this glass half-empty or is this glass half-full? If I gave this as a test question to a bunch of social scientists and data scientists, I think we would see a difference in their responses based on their training. I've noticed that when evaluating and thinking about computational social science research, data scientists are often more glass half-full: they look at the things about the study that are exciting and they focus on those. Social scientists, on the other hand, are generally more glass half-empty: they look at a piece of research, find the parts of it that they don't like, and emphasize those without paying as much attention to the things that are cool about it. I think increasingly a healthy computational social science will focus on the fact that this glass is really both half-full and half-empty. So if you find yourself being more of a glass-half-full person and you engage with a glass-half-empty person, try to look at the world from their perspective. Likewise, if you're more of a glass-half-empty person, try to take the glass-half-full approach for a while. Only by being able to switch back and forth between these views will we be able to make as much progress as we can in computational social science.

So I think the future of creating a computational social science community will involve blending
insights from social scientists and data scientists. That's not always easy. When people from different backgrounds interact, it can be difficult, but that interaction is really where a lot of learning happens, and the friction from that interaction can often be a spark for really exciting and interesting research. Thank you.
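
The feature engineering and imputation steps described in the captions for the call-records study can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the method used in the study: the call records, the person IDs, the survey scores, and the function names `build_features` and `impute_survey` are all hypothetical, and a 1-nearest-neighbour rule stands in for whatever machine learning model the researchers actually trained.

```python
from collections import defaultdict

# Hypothetical toy call records, one row per call: (caller_id, receiver_id).
# Real call metadata would also carry timestamps and cell-tower locations.
calls = [
    ("a", "b"), ("a", "c"), ("a", "b"),
    ("b", "a"),
    ("c", "a"), ("c", "b"), ("c", "d"), ("c", "d"),
    ("d", "c"),
]


def build_features(call_rows):
    """Feature engineering: one row per call -> one row per person.

    Each person gets two features: (outgoing call count, incoming call count).
    """
    outgoing = defaultdict(int)
    incoming = defaultdict(int)
    for caller, receiver in call_rows:
        outgoing[caller] += 1
        incoming[receiver] += 1
    people = sorted(set(outgoing) | set(incoming))
    return {person: (outgoing[person], incoming[person]) for person in people}


def impute_survey(features, survey):
    """Guess a survey score for everyone who was not surveyed.

    Stand-in model (an assumption, not the study's model): copy the score of
    the surveyed person with the closest features (1-nearest neighbour,
    squared Euclidean distance).
    """
    def distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(features[p], features[q]))

    imputed = {}
    for person in features:
        if person in survey:
            continue  # already has a real survey answer
        nearest = min(survey, key=lambda surveyed: distance(person, surveyed))
        imputed[person] = survey[nearest]
    return imputed


features = build_features(calls)
# Hypothetical wealth index collected by surveying a small sample of people.
survey = {"a": 0.8, "b": 0.2}
imputed = impute_survey(features, survey)

print(features)  # one row per person, one column per feature
print(imputed)   # guessed survey answers for the people who were not surveyed
```

The design mirrors the description in the talk: a small surveyed sample supplies the labels, the call records supply features for everyone, and the fitted rule fills in the missing survey answers. A real analysis would use many more features and a properly validated model.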

Info

Channel: Summer Institute in Computational Social Science

Views: 6,029

Rating: 5 out of 5

Id: zGG9wPl1C5E

Length: 27min 18sec (1638 seconds)

Published: Thu May 28 2020