Introduction to Python Programming | Data Science Summer School 2023

[Music] Hello everyone, welcome to another session of the Data Science Summer School. My name is Huy Dang and I am the manager of the Data Science Lab at the Hertie School. We're very happy to welcome everyone today to another session of the summer school. The Data Science Summer School is our annual effort to contribute to the open source community by opening up scientific learning to the general public, and we are very happy to see so many of the audience, coming from many different countries, industries and disciplines, joining us here today for the purpose of learning and expanding your technical skill sets. Our summer school is made possible by the generous funding of the Hertie School and the digital foundations in Germany, and I'd like to thank them for their continuing support for this effort. Our session today is an introduction to Python programming, which will be held by Dr Musashi Jacobs-Harukawa. He's currently a postdoctoral researcher at the Data-Driven Social Science Initiative at Princeton University, where he works on applied machine learning and computational social science methodology. Prior to joining Princeton, Musashi was a pre-doctoral researcher at University College London, where he worked on the Mental Models of Political Economy project, and he received his Doctor of Philosophy in Politics from the University of Oxford, where he wrote his thesis on novel applications of machine learning to the study of political campaigns. Also supporting the session as teaching assistant today is Johannes Hawkenheiser, a recent graduate of our Master of Data Science for Public Policy program, and I very much thank him for his support for the session today. Musashi, Johannes, the floor is now yours. I hope everyone will have fun and enjoy the session today. Thank you very much. All right, thank you. Just quickly checking everyone can hear me okay. I'm going
to assume that's fine. Good, thank you. All right, great. So, let's see. Oh, one second, there are always the initial tech issues. Right, sorry about that. All right, thank you Huy, and thank you everyone for coming today. I'm sorry, I didn't quite catch that. That's just some echo. Cool, all right, good. So hi, my name is Musashi Jacobs-Harukawa. As was introduced, I'm a postdoc at Princeton University. Today I have the pleasure of giving the introduction to Python course as part of this Hertie Data Science Summer School initiative. I taught the introduction to Python course at the University of Oxford's political science department for a few years when I was a graduate student there. Today I'm trying a few new things, a few departures from that course, and I hope it's enjoyable for everyone. Before we get started I just want to have a quick tech check. If everyone could go to this link, which I will put in the chat, I just want to quickly check that everyone knows how to use Colab, because we're going to be using it today. When you go to that link you should get something that looks like this, and all you need to do is click on the little button here. It's going to give you a warning that the notebook was authored by someone else. That's me, and I promise I won't do anything bad, so you should say run it anyway. [Music] With any luck you should get some result. I'll put the link in the chat. Oh, there we go, thank you. Okay, I'm going to assume that everyone is getting on all right with Colab. Then a few other things, just the ground rules about this session. I've written this session keeping in mind that people can ask questions throughout. Johannes is kindly in the chat and will be helping answer your questions, so you can definitely direct your questions to him. Or if they seem like something
that's not clear in my explanation, I'm happy to spend a little bit more time on it. If it seems like we're running over time then I'll just say we need to move on, but in general I want people to be able to take advantage of the fact that there's an interactive element to this. It's not just watching a YouTube video. For people watching the VOD, enjoy that, but I want to give people a chance to interact with me. I'll also provide my contact details; if they're not up somewhere I'll provide them in some other way, so please do feel free to get in touch with me afterwards. I'm always keen to talk about these subjects. Okay, so the format that I'm trying out today is a lecture with slides, where I'm going to be showing you code and explanation. I'm thinking both of the people watching the recording of this and of the people here, so there's also a code notebook that I'm going to share in a moment, which has basically the exact same content as the slides. Feel free to execute the code as you go along. If you're watching the recording, feel free to pause and play around; that's the point of being able to do this in an interactive notebook. And if you're here right now, ask me questions about what's happening, like why things are doing what they're doing. I've also provided a few additional take-home exercises. If there's time in the session we can work on them together; if not, I'll put them up, and I'll also put up answers. I'm not saying you should just work from the answers, but for your own learning there are heavily commented answers that you can work from. I don't think I need to introduce myself too much, Huy already did a great job, and the only thing I haven't put on my bio
that I would mention is that most of my research is in natural language processing, so the last year has been a bit weird and crazy for me. My work is applied ML and social science methodology in the text-as-data area, and in the past year I've been working a lot on generative language models and on various robustness and causal inference designs applied to those. Oh, and one last thing: once upon a time I was a data scientist in a finance firm, and that's where I originally learned Python. Okay, cool. So the learning objectives today. The first thing is how Python is used, primarily in social science research, because that's what I do, and hopefully that's an interesting domain for people. From looking at the people who signed up, I see there's a really wide variety of backgrounds, so it's hard to cater to everyone, but I'll try to give a theoretical foundation that's interesting and useful for everyone. Then the first part is going to be basic base Python programming. I'm going to... okay, I don't know who's annotating on the screen, but if you could stop that would be great. Sorry, could participants not draw on the screen? If you drew something, you can delete it by going to the annotate button and then clear drawings. Cool, thank you. All right, and then the last thing is we're going to cover basic data analysis in Python, and that's primarily going to be using the pandas library. Okay, so without further ado I'll get into it. The first thing is talking about Python for research. For people who are complete beginners, who haven't used Python before or haven't even programmed before: what is Python, and how is it relevant to research? One way to describe Python is that it's an open source, general purpose scripting language. Breaking that down
bit by bit, each part of that description is important. Firstly, the open source aspect: it's a community project, built and maintained by a community of people who donate their time. There are also various charitable organizations involved, but fundamentally it's free to use for all. That aligns with a lot of academic ideals, and it also means you can always use Python; there's no barrier to using it. In terms of general purpose, I like to think of it this way: if there is something you're doing on a computer and there's any repetitive element to it, and I'm sure you can think of loads of things like that, then you can automate it in Python. It's extremely general. It's not just used for data science, even though that's what we're going to be talking about today, but the fact that it's general, and that a lot of people think it's easy to read and write, means it's very popular with data scientists. Regarding the slides: there are no slides for this initial bit, sorry. The slides start once we start coding. Next, scripting. There's not really a strict definition of what a script is in coding terms. Again, for people who are unfamiliar with programming, you can think of a script as a series of commands to automate some task. Like a script that an actor reads from, you do things in order; it's a series of instructions. Another way to think of it, in terms of how you're going to use it, is that it's like a pipeline: when we write data analysis scripts, the script is going to take in some inputs, do some things to those inputs, and give back some outputs. And finally, language. Prior to learning to code, most of my interaction with the computer was through applications, things like Microsoft Word, etc. Python is a language, not an application. Practically, what that
means for you is that applications, as we usually interact with them on the computer, give you options to select from: there'll be drop-down menus, and you'll choose one from a series of options. With coding languages, there's a set of rules that are interpretable by the language interpreter, but from those you can build anything that you want. The upshot is that you can do nearly anything with Python. The downside is that you can't see all of your options, so you have to be a lot more creative to get what you want done. Now, speaking to researchers: what's the research use case for Python? Again, any repetitive task that you're doing can be automated with Python. In my own research I use it for data collection by web scraping, I use it extensively for cleaning, analyzing and visualizing my data, I also do a considerable amount of machine learning, and for the past few years I've been using Python a lot for deep learning tasks. To illustrate with one thing that brought together a lot of these: one of my doctoral papers investigated the effects of micro-targeted advertising in the 2020 US election. I ran an experiment where I targeted people with anti-Biden ads run by the Trump campaign, and to do this I had to build a special website with a machine learning algorithm built into the back end that would optimize the allocation as people arrived on the website. Most of that was actually done in Python, so everything from building a web-facing application to having a machine learning back end to it was doable with Python. I know some people who signed up are engineers, so I just wanted to briefly say, as an aside, how the research use case is different from the standard
development use case. Just some observations here. In research, the usage is more focused on scripting and interactive usage, and the main concern researchers tend to have when they're using a programming language is whether it's easy to use, because a lot of them aren't dedicated engineers; they use these coding tools because they have some research need for them. So the focus is mostly on time efficiency: how much time is it going to take me to learn these new skills? That's more important than how quickly the code will run. Whereas in development, when you're developing full-on applications, there's a very different set of concerns. You're worried more about portability (is your application going to work in different settings?), deployment, and resource efficiency, so how efficient you are being with computing resources, memory, etc. And finally, because I know this comes up: Python versus alternatives in the data science ecosystem. Full disclosure, I use both languages, Python and R, so here are some observations I have about the differences. This is just a lukewarm take from me. There are differences in functionality, absolutely. The R ecosystem for statistical analysis is much better than Python's; if you've ever tried to run a regression in Python, it's quite painful. Python has the general purpose programming aspect, but also, especially in the past few years as NLP and deep learning have really taken off, the cutting-edge things exist more quickly in Python than in R, because a lot of these cutting-edge tools in NLP are being developed in Python. [inaudible] In terms of contributions, there's a lower bar in R for what counts as a library, a package, something
that you can really tell people to use. In Python, not through any formal mechanism, because you can put anything up, but just through community standards, people are a little bit more reluctant to use libraries that only contain a few convenience functions. Related to that, the standards around coding are quite different in the two communities. In my view, people in the Python community have a little bit more of an engineering mindset about things, and they care more about what some people might think of as good coding standards. Of course there are people like [inaudible] in the R ecosystem, and there's a lot of variation, but I think the fact that R caters a little bit more to people for whom programming isn't the main thing that they do also shows in the tools that appear in R. The last point I will make about this is that if you're involved in academic research, or in general, I don't really think there's any reason to use a closed source data analysis language in this day and age. R and Python are just much better alternatives to things like Stata or SPSS. [Music] Some suggestions for things to use if you're starting research in Python. As an editor, VS Code has become quite a good one. We're going to be using a variation of Jupyter today, Colab, which Google provides for free. For package management, if you're coming from R you just use install.packages; in Python it's pip, but if you're doing research, conda and mamba are a bit more mature. Okay, then finally, before we start coding, a brief, more theoretical aside. Because for most of the next four hours I'm just going to be dumping a lot of basic intro-level Python on you, I want to cushion this and contextualize
this with why I'm choosing what to show you, how learning Python for this application is a little bit different, and so on. So I want to start with a brief theoretical and methodological aside about what it is that we're trying to do as social scientists, or as scientists in general, and how that relates to how we use programming languages, especially Python. The first question is: why automate? That may be a trivial question in some ways. There are advantages of cost, scale and scope, and there are questions that you can't answer as a social scientist if you can't automate them. A good, kind of extreme example would be regression analysis: doing all of it by hand. I don't think you could do an MRP by hand. Likewise, trawling through millions of texts or tweets, if people still do that. In theory it's doable with a lot of humans, but the end result is quite different from automating aspects of it. However, the cost of automating things is that we need to represent the observations we have about the real world in a way that our computers, algorithms and programs can use. And fundamentally, this process of quantifying and structuring observations, even without taking a programmatic or quantitative approach, just trying to structure observations and say what things have in common while eliminating other things, usually entails some loss of information. So how we choose to represent the information, and what information we choose to lose, is quite important. The first thing is that we want to choose some way to represent our observations that retains the properties that are relevant to our analysis. There's good debate about that, but what I'm leading to
with all this is data types: when we have observations, how do we represent them, and how do we categorize them? Here are some statistical data types: logical (that's true or false), numerical, categorical, text, and date and time. Data structures, which we're also going to study in a moment, are concerned with the relations between observations. Are the data points members of the same set? Are they members of an ordered sequence? Are they different aspects of the same thing that we're looking at? The good news is that Python, like most modern programming languages, can represent most of the types that were on the slide before: logical, numerical, categorical. The bad news is that at a fundamental level everything on a computer is stored in zeros and ones, and we're going to see that some odd things happen when we try to represent, as zeros and ones, things that can't be represented exactly that way. What I'm leading to with all this is that we need to think along a long arc and understand the relationship between the empirical observation that we have as scientists (what is the thing that we observe?) and the representation of it in our mental, theoretical or mathematical model, and then, furthermore, there's another layer of abstraction and approximation when we put this onto a computer. This is not to say that these things are bad, but the more we pay attention to the information we lose and the choices we make at each of these steps, the closer we bring our computational model to the thing that we're actually trying to study. Okay, so there are going to be two coding tutorials today. The first one is going to be on base Python. But I can take a moment here if people want to ask any questions about the theoretical aspect. If people want to jump straight into coding, I'll put this link in the chat right now. Okay: best
platform to learn about Python? I don't have a good answer for that, actually. I think it's a very personal thing. I think some people find things like DataCamp really useful. I personally enjoy just trawling people's GitHubs from the leading labs. I actually just keep tabs on Facebook AI, Stanford NLP, Hugging Face and things like that, read their documentation and their notes, and follow literally the whole engineering workflow of what they're doing. I think that's an incredibly inefficient way to do it. I don't know what to say about platforms other than that the best thing to do is to have a project: if there is simply a project that needs Python, you will learn how to use Python. To Johannes, on choice of compiler: I use CPython for the most part, but I don't have very strong feelings about it. Nowadays most of my programming is on the HPC at Princeton University, so a supercomputer. For this we're going to be using Colab. Okay, and then the last thing, there's a question about scripting versus programming language. There's not really a sharp distinction; it's more about the way in which the language is used. I'd just say Python is fairly easy to write, and it's fairly quick to go from having something you want to do to having a short program that will do it. I think that's the key distinction that I'm trying to get at. All right, cool, I will keep an eye on the chat. Will this workshop help you to analyze big data? No. Well, it depends on how big your big data is. We're only covering pandas, which I think is really good up to about a million observations; past that you might want to start thinking about using something else. That's not a hard rule. Oh, a nice question about LLMs and ChatGPT. It's a super good question. I
actually thought about whether to write this course with LLM-assisted coding as part of it. We're still at a point where, if you just use the program that ChatGPT or these LLMs give you, it'll be 90% of the way there. Unfortunately, the remaining 10%, getting it to work or knowing that it's not going to do the thing that you intended, is the part that you only learn by learning from zero to one hundred. I certainly think that it's a super useful tool, and it's certainly a multiplier on your ability to learn coding and also to write programs quickly. However, it doesn't really replace the knowledge of how to know what you need to do. Okay, I hope that's useful. All right, there was a notebook; it should look like this. The content of these slides is essentially the same as that notebook, and I'm going to be going through it and talking about it. So we're going to do a crash course in base Python. I've written this with the assumption that you've never programmed in Python before, so bear with me if you already know all of this. Then again, I would wonder why you're coming to an intro Python course if you already know how to do introductory Python, so there we are. Okay, so what do I mean when I keep on saying base? What is base Python? As with a lot of programming languages, one of the great things about Python is that it's extensible: it's enhanced by a lot of contributions in the form of libraries. They add to the functionality, the things that we can do. Python is the vehicle, the language through which we do it, but the tool is actually the library. So when I talk about base, I just mean the part that's core to the Python programming language. You need
to know some amount of the core programming language to use these tools. This question of how much base Python I need to know is something I wondered about for a long time, because I started out in a data science setting coming from R, and I was more worried about getting the analysis done than about learning the core elements of a new programming language. How much base Python you need to learn really varies with your application. I've made this course with the minimum needed for a data analysis workflow. I would say that if you're doing deep learning or app development, you're going to need a lot more base Python knowledge. Interestingly, the PyTorch library for deep learning in particular uses a lot of integral elements from base Python that make it really nice and efficient. And as I've become a better Python programmer, I do find that knowing the core functionality of the language helps me write better, more elegant code. So learning more base Python will never make you worse off. However, we're going to do the bare minimum today. The things we're going to cover are very dry: variables, data types, data structures, control flow, and functions, and we'll go through each of these. Buckle in, because we're going to be here for a little while. Again, for people who are joining late: I'm keeping half an eye on the chat, and I've written this with enough time to answer questions as I go along, especially if there's a question I think would be helpful to discuss. There will also be time at the end. I have some exercises that we'll be working on, but if there's time I can answer questions and so on. So please do feel free to take advantage of the fact that I'm live here in front of you. The first thing is values, variables and types. If this is your first time
programming, you might wonder, as I did when I first started: why am I writing these lines, these instructions? What's happening, and what am I doing? One metaphor or mental image that might help when starting out is this. When we start a Python session, which on Colab is when you connect your runtime, you can imagine that you're creating an empty box. We're going to use commands written in the Python language, written in the cells of the notebook in front of you, to interact with that box. These commands can create objects that persist inside of that box (I'm being purposely vague there), and we can also interact with and modify things that are already inside that box. The first thing we're going to look at is the print function, which is our tool or command to display things in that box in our output. In every programming language, for some reason that I don't know, the first program you ever write is Hello World. In Python it looks like this, and if you execute that cell by pressing the little arrow next to it, you should get that. Great. If this is your first time writing Python, congratulations, you have now written a Python program. We're going to go much deeper. The next thing is variables and assignment. You might have some preconceptions about what a variable is from mathematics, etc. The most concise way to think about it in a programming context is that variables are names that point to some object in the box. That's really all that they are. The things that they're pointing to could be values, they could be functions, they could be all sorts of things, but a variable is just a name with a pointer leading to the object. We use the equals operator to assign variables, so here it's just like this, and you'll see you get that. Still pretty easy. It might be too basic. There are some rules for variable
assignment worth mentioning. You can't put spaces in variable names, so if you try to run this, "a variable", you're going to get a syntax error. The first character of a variable name also can't be a number or a symbol. Moving on now to data types. We're going to talk about four data types in base Python: integers, booleans, floats and strings. Integers are just whole numbers, positive or negative. You create them by writing a number without a full stop in it. Unlike the R programming language, which assumes every number is actually a vector of floats, in Python if you write just a number you will get just an integer. What a data type does is determine what operations can be done on that particular object. As we saw, integers can be assigned with the equals operator, and arithmetic operations with whole numbers are pretty much as you'd expect: plus, minus, star for multiply, slash for divide. Execute these commands and you will see that what you get is very much what you'd expect. Meanwhile, the double equals tests equivalence: is a equal to b? No, false. We will get to character strings. Void? I'm not aware of something like that. Python types are mutable and a little bit weird, but okay. Cool, moving on to booleans. These are just true/false values. You've actually already seen them. There are two booleans, True and False; those are the only two values. And just so you don't confuse it with other languages, it's always a capital T and a capital F in every single case, and everything else is lowercase. The main operators for these are "and", which is logical conjunction (if you remember logic tables, maybe from high school, it acts exactly like you'd think), and disjunction, which is "or". A brief aside, though: booleans in Python actually behave like the integers zero and one. So if you use them in an operation that wouldn't work on a boolean but would work on the integers zero and one, it would just
implicitly behave like that. So if I do True plus True I get two, and if I do True times False I get zero. Does anyone want to guess what happens when I do False divided by False? Error? Yep, we get a zero division error, exactly. Good. All right, now moving on to floats, floating point numbers. Being very technical, they're a representation of the set of real numbers in base two. More practically, we're going to use floats to represent any non-whole number, so any decimal number we're going to represent with a float. We can construct them just by using a dot in a number, so if we write 1.0 that would be a float with the value of one, and so on. Another way you can create floats is that Python will automatically convert the output of division between integers to a float. So if you do one divided by three it will automatically give you a float, and likewise if you divide something by one it will still give you a float as the output. Getting back to the challenges we have when we try to represent real observations on a computer: floats can actually behave in pretty weird ways, and we get to some fairly fundamental issues in numerical programming pretty early on. So, looking at these examples, I'll just give you a moment. 10 times 0.1 times 3 equals 3.
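The notebook cells shown on screen for integers, booleans and floats are not preserved in the captions. The following is a minimal sketch of what such cells might look like, not the speaker's actual code: the variable names and the values 7 and 2 are illustrative assumptions, the exact form of the third float comparison is a guess, and the `math.isclose` check is an addition that does not come from the talk.

```python
import math

# Integers: whole numbers written without a decimal point.
a = 7  # illustrative values, not taken from the slides
b = 2
int_sum = a + b        # 9
int_product = a * b    # 14
same = a == b          # double equals tests equivalence

# Booleans behave like the integers 1 and 0 in arithmetic.
bool_sum = True + True         # 2
bool_product = True * False    # 0
try:
    False / False
except ZeroDivisionError as err:
    division_error = err       # dividing by False is dividing by zero

# Floats: written with a dot, or produced by dividing with /.
one = 1.0
one_third = 1 / 3
still_float = 4 / 1            # a float even though it divides evenly

# The float comparisons being discussed at this point in the session.
first = 10 * 0.1 * 3 == 3
second = 0.1 * 3 == 0.3
third = 0.1 + 0.1 + 0.1 == 0.3  # assumed form of the third example

# One common workaround (not from the talk): compare with a tolerance.
close_enough = math.isclose(0.1 * 3, 0.3)

print(first, second, third, close_enough)
print(0.1 * 3)
```

Running the cell and looking at the last printed value is a quick way to see why the second and third comparisons do not come out the way the first one does.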
That's true. This one, any guesses if it's going to be true or false? True? It is false, and there's something really weird going on here. This is also false. If we look at why, we'll see that 0.1 times 3 is not exactly 0.3. I won't get into the explanation because it's a bit long and technical, but fun fact: you can just put that number into a browser and you will get to a website that explains this exact problem and talks about the various solutions that different languages have for it. Very briefly, the reason this happens is the same reason that we can't represent one third with a finite representation in base 10: there are certain real-valued numbers that don't have a finite representation in base two, and 3 divided by 10 is one of those. Okay, so moving on now to strings. Strings are a little bit different: they are actually a sequence of characters. To create them, you just put any sequence of letters, numbers or other characters between single or double quotes. Both are valid, but if you start with a single quote you need to close with a single quote. So here are three examples. There's not really any functional difference between using single or double quotes. Some Python style guides insist on using one or the other, but the language itself is okay with both. Are there advantages or disadvantages of double versus single quotes? No, I think you just want to be consistent as much as possible; that's the main thing. Okay, I'll come back to some of these questions in a moment. So, one operation we can do on strings: as before, we can use double equals to test equivalence. You can see the first word is equivalent to itself but not to the second word. Quick reminder, the first word is "hello world" with an exclamation point. We can also combine strings. This is called operator overloading, but we can combine strings in Python by using... No, you don't define types when you instantiate a variable,
it just assumes what it is based on what you've done with it, which can be kind of... yeah. So the way you combine strings is you just add them together, so that's kind of neat. However, you might notice that that didn't give me exactly what I was hoping for. The thing is, we can use variables and values in the same calculation, so I can just put a string that is just a space into this. There we go, we got what I was looking for. Now remember that I mentioned that strings are a sequence. Like all sequences in Python, they can be indexed, which is to say we can access ordered values in them by using these square brackets. One thing to keep in mind, which will trip you up if you're coming from R, is that Python counts from zero. So I'm going to look at a bit of code here. I'm defining a new variable, word, and it's bird; this first line, word is bird. And then here, square bracket zero is the first letter of this, so that's b, and here is the second letter. So remember, zero is the first value, one is the second. One way that I will probably talk about it, if it helps make sense to people, is you could think of it as starting at the beginning of the sequence, and this number is the number of steps forward you have to take to get where you want to get to. I don't know if that explanation helps anyone; it doesn't help me per se, but it's nice. Indexing is basically just pulling out something at a certain location in the sequence, so if I want the first value in this sequence of four letters I do square bracket zero, and if I want the second I do square bracket one. We can also do slices over sequences, so we can define ranges by using a colon: square bracket m colon n returns from the (m plus one)th letter to the nth letter. So here, from 0 to 2 will give you the first and the second letter of the word, and 1 to 3 will be the second and the third. Anyone want to put in the chat what this is going to return? And briefly, briefly answering the
question about how we avoid calculation errors based on floating point: it depends on what you're doing. To a certain point it can be somewhat inevitable. One thing, though, is to just keep things as integers for as long as possible, because integers don't face problems like that. So you could represent everything as fractions for as long as you can. Sorry, getting over a cold. Yes, good, ird. Now, ird is not a word, but there you go. Another thing you can do is negative indexing. Yeah, you can try that, but if you try to index with a number that's larger than the length of the sequence, you will get an error. However, you can put in negative numbers, and this just means it counts backwards from the end of the sequence. So if we do word, square bracket minus one, we'll get the last letter of the word. We can also do this with slices to get the last n values of the sequence, so this would be the last three letters of the sequence. Any questions about all this indexing and slicing stuff? I'm not familiar with the IEEE notation, so, maybe. Why is it ird and not ord? Oh, because the word is bird, not word. And so characters, we're going to get to this, but characters are actually sequences. Integers are not sequences; integers, floats and booleans are what we call scalars, so they're a single point value. They're not a sequence of values, they're not multiple values in a structure. The thing is, strings are actually kind of a data structure, and a single-character string, or even a string with no characters, an empty string, is still a sequence; it's just an empty sequence or a length-one sequence. Okay, now finally... Looking at it forward and backward at the same time? I don't think so. No, I don't think there's a minus zero. I mean, I haven't tried it, but surely that would just be the same as zero. You can play around with this, but yeah, you can put a third number after a second colon if you want to do steps, so you can do one to
three, and then, actually it's easier to see with longer sequences, and then colon, another number, and you do steps. And if the third number is -1 then it's a backward sequence. Okay, so now finally, checking and coercing types. There's a function, type, and that will give you the type of an object, so this unsurprisingly is an int. Something you can do, though, is coerce a value into another type by calling that type on the object. So here I call float on 15, and now it's 15.0. Sometimes, though, that's not possible. For example, if I try to turn meow into an integer I just get an error, because there's no sensible way to interpret meow as an integer. Why does 15 become an int? Does anyone remember why 15 is an int? I mean, yes, but to answer your question, in Python if you just write a number without a period or a dot in it, Python will assume that it's an integer. Exactly, Elizabeth, that's exactly right. If you wanted 15 as a float you could write 15.0. Also, I should point out that sometimes coercion is a little bit unpredictable, so we can guess what we think these three things will be. What happens when I turn meow into a bool? What's the int value of minus 0.99? And is the string of False equal to False? Okay, Gotham gave me three predictions. I'll give you a few seconds to look at it. So the first thing, someone's executed it, cool. Yeah, meow has the value True. However, if we coerce minus 0.99 to an int we'll get zero, and finally the string of False is not equal to False. I can go over a little bit why each of these does what it does. In general, if you call bool on something, as long as it's not zero or empty it will be True. That's the general rule of thumb. You can write some really not great code by taking advantage of this behavior. So basically, as long as there's a something, it's going to be True; that generally holds true. Why is it zero here? When you call int on a float
it doesn't round it, it actually truncates it, so if you call int on 1.9 you will get one. I think; now as I say that, I'm fairly sure that's right. And then finally, and this one should be a little bit less puzzling to people, the string of False is in fact just f-a-l-s-e, the letters. It's a sequence, and as we mentioned before, that's a something, so it doesn't have the boolean value False. The string f-a-l-s-e has a boolean value of True. That's probably thoroughly confused you now, which was not my intention, but this is all to say: with Python types you can move things around. Sometimes you will have errors in your code, you'll dig in and go, oh, this does not behave the way I expected. Personally I think... well, it's because negative zero is still just zero. Okay, moving on to data structures. Oh, and a quick side note: if you actually want to round a number in Python, there's a function, round, so if you do round on 1.6 it will give two. Okay, data structures. Whereas data types were about the representation of individual points, data structures are about the relations between these values. So we asked these questions before. We're going to cover two; there are a lot more data structures in base Python, but these are the two most useful ones, and you can get by with these a lot of the time. First are lists, which are called list, and these are an ordered array. And then you have dictionaries, which are key-value mappings. So lists are a type of data container. A few important things about them. One is that they're one-dimensional, so they just put things one in front of the other. There's only a forward or backward on them. It's not a matrix, it's more like a vector. Second, it's ordered, so it's not a collection of objects without order; there is an order to it, and also implicitly on every list there is an index of every item in the list based on its position. And then finally, sorry, thirdly,
it's mutable. This matters more to you maybe if you're coming from a language with immutable data structures. Python does have them, but the thing about lists is that you can modify them in place. You don't need to create a new copy of a list to change something about it; you can switch out individual values, you can grow them, you can shrink them. And then finally, they can fit any type of object. This is another thing that might surprise you if you're coming from some other programming languages: you can mix every kind of data type in the same array. They're just very flexible, very general purpose, just an ordered array of stuff. To create one, you write a sequence of values separated by commas between square brackets. That's a very verbal description of what you can just see on your screen in front of you: you have an open square bracket and then some values, and they're separated by commas, as you can see. If there are no lists in the Colab, then my bad, I might have deleted them by accident when I was preparing the material. Hopefully they are there; if not, please feel free to write in the Discord. Okay, cool. So, indexing and slicing lists. We've actually already dealt with this, because as it turns out strings are ordered arrays as well, so we can do the exact same thing. my_list[0] will return the first value of the list. We can use the m-to-n slicing notation, so here my_list[0] will give you the first element, and 1 to 3 will give you the first value and the second value, sorry, the second value and the third value. You can change the size of the list as you're using it. First, to get the size of the list, or how many elements it has, there's a function, len, so the len of that list we're working on, my_list, is three. You can use the .append method, and I'll talk more about methods as we move along, to add an element to the end of the list, so you can
sort of append it, attach it to the end. So here I append four, and you'll see I'm not assigning that value to anything. This is an in-place operation: I have my list, and if I do .append, whatever is the argument of that function, append, is added to the end of the list. It modifies the list in place. If you use a plus symbol between two lists, it concatenates them, which is to say it combines the two lists into one. Mind you, for that to work, both sides of that plus sign need to be lists. To remove objects we can use pop; that will remove the final element. It works a lot like append, it's an in-place operation. Also, when you call pop it will actually return that value, so if you imagine a stack, you're taking one value off the stack at a time. Adding and appending: in a sense you could get to the same place by doing either. Appending is just appending a single value, whereas adding is adding two lists together, so if you add a list of length one to another list, that's the same as appending the value in that length-one list. Okay, changing values in the list. We can just reassign values in place. So here I'm creating a new list called x. I should say right now, in practice try not to use single-letter variables. They can cause a lot of pain for other people when they're trying to search for them, but just for this example we're limited for space, so we're using a single-letter variable. So here I've created a list, a, t and e, all strings. Now you can see x[0], so I'm taking the first element of x and now assigning it the value a, doing the same with the second value, and you can see it just changes it in place. You can also do this with slice notation, so I can take the first and second value and replace them with another list of the same length. Again, it lets you do what you would hope you can do. There are no surprises here, which is nice. Okay, that's lists. It's nothing too
exciting, but they're just an extremely useful and flexible thing, and it's nice to have an ordered sequence of items for a lot of the things that we're going to do. Next thing, and I'll say this is something I really miss when I'm using R, is a dictionary. Dictionaries are an unordered mapping of keys to values. Sorry, unordered is untrue; it was true when I first wrote this. Sorry, yeah, I checked it, they are ordered actually. Okay, so to create a dictionary, it's easiest to see an example. I'm creating a variable called favorite ice cream. Here I'm saying chocolate, and I asked my friend Maria what her favorite flavor of ice cream is, and what she said is a good choice. But you'll see the way that I construct it is I open this curly brace and I do key, colon, value. Here everything's a string, but it doesn't have to be. To get the values out of a dictionary, again we use the square bracket, this indexer, excuse me, from the last slide. With favorite ice cream, if we want the value corresponding to this entry, we put Musashi in there. There we go, we've retrieved the value out of the dictionary. Pretty straightforward. Dictionaries are also mutable, so you can create new pairs by using equals, and you can also modify them in place by using equals. So I asked my other friend Chris what his favorite flavor is: pistachio. And I actually changed my mind about what my favorite flavor is, so I've accessed my own value and I'm assigning a new value. These are both perfectly legitimate things to do in the Python programming language. And now we get to the last thing, which will be quite useful if you want to access all of the elements of a dictionary at once. So, the .keys method: you take the object, or the variable pointing to the object, and then the dot accesses its methods, and there's a function called keys that will return the keys of the dictionary. values will return the corresponding values, and then finally items will return pairs corresponding to the key and value. Sometimes it's
useful to be able to pull out these items from the dictionary and see the whole thing, especially when you're iterating over it, which we'll get to in a moment. All right, answering a few questions real quickly. We can do matrices in Python. You can do lists of lists if you're masochistic, or you could use a matrix library like NumPy, for example. So yeah, you can do matrices. What I mean to say, though, is that a single list by itself does not behave like a matrix, and a list of lists can be a ragged array, so the inner lists don't all have to be the same length, unlike a matrix. The keys, answering Ivy, and just going back here: the keys are the thing that we use to look things up. When we put in the key, we get back the value; that's the way to think about it for a dictionary. And yeah, instead of a string here you can put other things; the technical limitation is that it has to be a hashable value. So for example you can't use lists as keys, however you can use lists as values. What's also quite common is to have nested data structures, so you can map from this to some value, and you can put in another dictionary, and you can put a dictionary inside a dictionary inside a dictionary and so on. If you've ever worked with Twitter data, or any sort of JSON data format, you'll know there's a lot of nesting of data structures. But yeah, that being said, if you're familiar with JSON... oops, sorry, one second. Sorry about that. Honestly, I don't know if anyone else finds this, but I find that Linux is not supported as well as some other things when it comes to... that's annoying, sorry. It is breaking a little bit hard, so I'm just going to restart that. Okay, great, sorry about that. Okay, cool. I think that that's not a short explanation, so that's something I'll come back to later, but it's just very literally that the way we have to represent numbers on computers is in binary, in zeros and ones, so this isn't
getting really into the nuts and bolts of, like... there we go. Yeah, it's probably better to go read something than have me explain off the cuff. Okay, control flow. You might know these by other names, but control flow structures are commands in the language that allow us to determine whether and in what order we run blocks of code. Effectively, what that means is we can reduce the amount of code we write, because we can say, okay, do this thing multiple times, changing one little value at a time. We're focusing on two particular control flow structures today. Again, there are a lot more in the base Python language; these are the two most important ones, I guess. One is conditional execution and the other is iteration, specifically for loops. So, conditional execution lets you execute code conditionally. I'm using this word, but it's going to be much more obvious once we actually see it. You can say: if this condition is true, then run this command. And this is how you write it in Python: you do if, space, and then some condition that's evaluated. If that returns the value True, then the code underneath that's indented is executed. So in the example we have here, let's say you want to calculate the absolute value of some number. You could think of the logic for calculating the absolute value as: if the number is negative, then multiply it by negative one, otherwise do nothing. So you could write that in Python like this: I'm setting the number to minus two, and if number is less than zero, then number equals number times minus one. There we go, the absolute value of minus 2 is 2.
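Written out, the absolute-value example reads roughly as follows. This is a sketch reconstructed from the spoken description, so the exact layout of the original cell is an assumption.

```python
number = -2

# The indented line runs only when the condition evaluates to True.
if number < 0:
    number = number * -1

print(number)  # 2
```

For real work the built-in abs() already does this; the hand-written version is only there to show how the if statement is laid out.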
Now, in addition, we can do if-else conditions, so we can also specify what to do if the condition evaluates as False. Let's go with this example: what language are we here to learn today? I'm setting it to Python. If the language to learn is Python, then you're in the right place. When we execute this, we in fact get that. However, if this were equal to some other value... well, we'll see. So elif is the third one. The way to describe it is that it goes down one by one. First, setting language to learn to Julia, it's going to evaluate this condition first: is the language to learn Python? False. Then it's going to move on to the next elif condition, and if that evaluates to True, then it's going to execute the command that's inside there. There we go. And finally, if it evaluates all of these and none of them are true, then this command down here will be executed. Yeah, you could put in anything that evaluates... actually, do you remember earlier this whole thing about everything coercing to bools in really weird ways? This is exactly the sort of thing you could do. If this is zero, then that's going to evaluate as False; if it's an empty sequence, then that evaluates as False; if there are any elements in the sequence, it evaluates as True. Ideally, for the sake of clear programming, you actually want it to be something that evaluates to a boolean value, but when Python evaluates this it's actually going to say: whatever comes out of this, I'm going to try to interpret it as a bool, and then I'm going to use that to determine what I want to do. Yes, it is; you don't actually need an elif. It's just syntactic sugar, something that makes it a little bit cleaner, a little bit less code, but it is in fact a redundant operation. Moving on to for loops. This is something that, if you're beginning programming, you'll underestimate the importance of. I remember my fiancée,
she also does data analysis, it's a big part of what she does. I remember the first time she was learning R, and I saw... she's going to hate me for mentioning this, but you know the thing that we all do, which is just copy-pasting the same command 10 times in a row and then changing one value in each one. For loops save you from doing things like this. I obviously did that as well when I first started programming. The way that we do that is going to become a little bit clearer, but the syntax is again pretty straightforward in Python. We say: for some element in an iterable, we're going to run a command. Implicitly, what's happening here is that the iterable is some non-scalar, any sort of sequence, anything with multiple values that are ordered; actually, I think they need to be ordered. And then what we're going to do is, for each element of that, we're going to run this command, but we will actually take that value: i is going to be equal to the element of that iterable in that go. So, to illustrate what I was talking about: for number in one, two, three, four, five, six, seven, print number plus number. What's going on here is I'm running this line of code seven times. The first time, I assign the variable number the value 1.
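The loop being read out here looks roughly like this. It is a sketch based on the spoken description; the `doubled` list is an addition made here so the results can be inspected afterwards, and is not part of the session's example.

```python
doubled = []  # extra bookkeeping added for this sketch only

for number in [1, 2, 3, 4, 5, 6, 7]:
    # On each pass, `number` is bound to the next element of the list.
    print(number + number)
    doubled.append(number + number)
```

The indented body runs seven times, once per element, which replaces seven copy-pasted print statements.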
And then I execute this line and I get one plus one equals two. The second time I assign it the value two, and so on. This is a for loop. Again, here I've used integers and a list, but it can be lots of other things. We can also have some external counter. Here you see that I am just doing something to number, and it's being assigned the values one, two, three, four, five, six, seven as I go through the iterations, but I can define a variable outside this and I can change that variable inside the loop, and this lets us get at some pretty powerful stuff. Here again it's a trivial example: you can see in each loop I'm setting the counter to the product of counter and the number, so you can see that it's growing. Okay, that's all on for loops. Okay, finally we have functions. The easiest way to get to it is again just looking at an example, so here I'm going to define a function that adds one to the input. I'll take it line by line. To create a function in Python there's a command, def. Whatever comes after def is the name of the function, so it's a variable pointing to that function. And then I put some parentheses, or brackets, depending on what part of the world you're in, and the things inside those brackets are the arguments of the function, so these are the inputs that we can put into the function. And then once we put the colon, it's saying that the definition line is done. You can have this go on to multiple lines; there's no need here, but you can imagine that for a function with a lot of arguments you might want to have it go onto multiple lines. The definition will continue until you get to the colon. The following line needs to be indented, by four spaces or two or a tab; it needs to be indented, and it needs to be indented consistently. If you don't have strong feelings, four spaces is very standard. And I need to make a quick detour into something a bit esoteric, maybe a
bit complicated, or just the idea of namespaces. So Python, and I mean most languages, have this idea... ah, sorry, yeah, return: I didn't talk about the inside of the function. Inside the function I'm defining the variable y, I'm saying it's equal to the input x plus the number one, and then the output of the function is whatever follows this command, return. If I don't put a return then this function doesn't have an output, which is still valid. You can have functions without outputs; their output is just None. So, talking about namespaces: it's going to be complicated, so I'm going to go through a few examples. There are a lot more levels to this, but we're going to talk today just about the idea of a local and a global namespace. When we define a function, any variable that we create inside that function belongs to the namespace of that function, so that variable is only defined inside the confines of the function. If you try to access that variable outside the function, you can't, because it's only in the local namespace. On the other hand, if you define something outside a function, you're defining it in the global namespace. Everybody can access the global namespace. There is, however, a priority to this: if you're writing a function and you make a reference to some variable, Python is going to first check whether that variable exists in the local namespace, and then it's going to check the global one. So there's a priority for local things before global things. There are default values; I'll come back to that. Okay, if you're in your notebook and you're coding along, make sure to run this command here. It's just going to clear the environment, so it's actually going to clear the namespace of what you're doing. Going back to the box metaphor, you're just emptying out the box. So in this first example, again, I have no global variables defined at the moment. I'm defining a function f with the single argument x; inside it
I'm setting y to the value five and I'm returning x plus y, so this is just doing add five to the input. Then later on in my code, outside of the function definition, I try to print the value y. I'm going to get an error saying y is not defined. The reason is that I only defined it inside the function, so it's only in the local namespace of this function. y only exists within this indented code; outside of it, it doesn't exist. Moving on to this idea of priority. So now I do define y globally, let's set it to zero. Again, same function, and I'm going to do f of five. If we just look at it quickly: is it going to add five here and be five plus five, or five plus zero? And we get 10. And the reason for that is that this definition inside the function has priority over this definition outside the function; or, I should say, assignment rather than definition. And finally, you can do this, though I would say do it with caution: you can define a global variable and then use it inside a function. In this scenario I haven't actually defined y inside this function; it's a function where I return x plus y. So as long as there's a y in the global namespace, this does actually return something. The use case here would be things like, for example, the idea of global parameters in some program that you're running. For example, let's say I'm training a language model or something. There will be things like the name of the model that I'm training, the date that I'm training it on, et cetera. Those things I might want to be accessible to everything throughout the script, so every function might want to be able to access those, because they'll be true for that run of the script. Can you make a local value global inside a function? Right, yeah, I think so; I think you can use the global keyword on things to make them global. I wouldn't recommend it, though; you should just define them outside. And finally, this would normally be not really an intro topic, but it gets used a lot in pandas, which we're going to
look at: lambda functions. They sound kind of scary, but they're actually very trivial. It's just another way to define a function, and it's a much more concise notation for it. So these two things are the same: here I'm defining a function that takes one argument and returns that plus one, and this is the same thing. One thing to note, though: this exact thing, if you wanted to use it, you could wrap it in parentheses and then put parentheses after it. So I could put parentheses around this and do parentheses two, and that would call this function on that object. It's a little bit weird, but basically this here ends up being equivalent just to f, the variable name, because functions are objects. Okay, so we've covered a bunch of material. I'm going to quickly recap the things that we looked over. There are four data types we looked at: integers, which are whole numbers; floats, which represent non-whole numbers, and remember they do some weird things, some unexpected things... Sure, I could show you the lambda, I missed that. This is what I mean, you can do that. The place where it gets weird is that sometimes I might have a function that takes a function as an argument. Oops. So for example, this would be a really pointless thing to do... oh, sorry. So now, if you remember, I defined f as x plus one. I'll also do this. Okay, I don't know if that answers the question; let me do it all. Okay, so I've defined f of x. Now I've defined g as a function that takes a function as its first argument and an input as its second argument, and it just returns that function applied to that thing. This would be a really weird thing to do. So now what I end up getting is, if I do that, that's the same as just doing this, if that makes sense. The other thing I wanted to illustrate with this is what the lambda is here. The lambda is not actually this, it's actually just that: it's a function that hasn't been called. I'm not going to get into objects necessarily, because I haven't really, like,
settled on an explanation that's good at a point where you're still getting into the language. Generally, what I mean to say is that some languages wouldn't allow functions to exist independently of some sort of class. I'm trying to think of an explanation that doesn't go into things like this. Can you assign names to the function? Yeah, I mean, you can assign it to a variable, so I could do that, if that's what you're asking. The thing is, I'm not getting into the whole class bit of Python, because you can do most of this stuff without getting into classes. But yeah, the point is that in Python, functions are allowed to exist without being a method of some class, and they have a lot of the properties of anything else in Python. That's probably what I want to say, but it's not too important, so I'm going to move on a little bit; we can come back to it in a moment. So we've covered quite a bit. We have four data types. Again, just remember that there are some unexpected things about these data types in Python, things to keep in mind when you're using them, because often the bugs that you have will be due to things not behaving in the way that you expect them to. That's the whole point of a data type: it defines some predictable, regularized things you can do with the value. We've also looked at two data structures, lists and dictionaries. Lists are an ordered sequence, and they're extremely flexible. Dictionaries are a mapping from key to value. The values can be anything; the keys have to be hashable, which we can talk about maybe later, but in essence you can work it out by trial and error, and you'll see, for example, that you can't use a list as a key in a dictionary. We also learned two control flow structures: one for conditional execution, and the other for iteration, the for loop. And finally, we looked at how to write functions. I've been
answering questions as we go along. If anyone's been holding on to a question, you can ask it now. Otherwise, since we're doing okay on time, I'm going to share a problem set, if people want to work on that, discuss it in the chat, or ask me questions about it. We can start that now. Yeah, so I'll give a couple... I can't see the raised hands, so just go ahead. Oh, okay. So hi, thank you for conducting this session. I wanted to ask if you could explain how indexing in Python works again, because I'm confused. When we were doing the bird variable, in that, b is zero, right? Could you just explain the whole indexing part again? I mean, it's basically to say that in Python you start counting from zero, which is the maybe slightly weird thing. So let's just look at, I don't know, what's some reasonably ordered thing... let's just go with three cities in Japan. So here I have a list with three values. If I want to get the first value of it, I use a square bracket and I ask for the zeroth value of it. That returns the first thing in the list. That's just how things are defined in Python. I don't think there's necessarily a deeper motivation to it; some languages are this way, others are another way. So that's all there is to say about that. Okay, so if I were to use slices with indexing, if I wanted to recall all of the three names, how would I do that? Would I use 0 to 3 or 0 to 2?
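The question can be tried out directly. A small sketch follows; the city names are placeholders chosen here, since the transcript does not record which ones were typed in the session.

```python
cities = ["Tokyo", "Osaka", "Kyoto"]  # placeholder values, not from the session

first = cities[0]                   # index 0 is the first element
first_two = cities[0:2]             # indices 0 and 1; the right bound is excluded
all_three = cities[0:3]             # indices 0, 1 and 2
also_all = cities[0:len(cities)]    # the same slice without hard-coding 3
last = cities[-1]                   # negative indices count from the end

print(first, first_two, all_three, last)
```

Because the right-hand bound is excluded, a slice from m to n always contains n minus m elements, as long as both bounds fall inside the list.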
Well, we can see what happens with both. The thing that might be slightly confusing here is that it's inclusive of the left value but not inclusive of the right value, so this only returns the zeroth and first values but not the second value. If I wanted all three, I'd have to use the length of the list; I have to go one beyond the last index. Again, this is just the way that it's done in Python. It's something you get used to as you use the language.

Okay, makes sense, thank you.

No problem. And to answer people's questions from before: can functions have default values? Yes, they can. The way we would do that is just like this: if you assign the argument a value inside the function definition, it gets a default value. So now if I call func without anything, I get the default value. Okay, I'm going to share the problem set with people. What am I using for Python coding? This is just a terminal; the terminal is Kitty, and this is what happens when you just type python into a terminal: you get the Python prompt. Is it best practice to use lambda? If you see other people using it in a place, then use it there. In general it's a bit of a weird thing to do, and apparently controversial. Okay, I'll keep on looking at people's questions.

Just to stick with the time that I had in mind: if you don't want to work on this right now, feel free to take a break and walk around. I'm going to give about 15 minutes to look at these questions now. They'll definitely take you more than 15 minutes if you want to work on them, and you can ask me questions as you're working through them. You can also chat in the Discord or in the Zoom chat while working on it. The link to this Colab file is just above in the chat. Oh sorry, I only sent it to the waiting room participants, that's my bad. Ah, thank you. But yeah, in the meantime I'll also
be here. So let's see; sorry, I'm dealing with time zones. Right now it's 5:40 where I am. I'll say 5:55; at that point I'm going to take a 15-minute break and then get ready for the second part. But for these next 15 minutes I'll be online answering questions, chatting, etc.

I mean, I don't want to give an authoritative answer to that question. I can say I used to work a job that had the title data scientist, but it's definitely an overloaded term. A lot of things are called data scientist, and it encompasses a lot of different tasks. Which tools you need to learn will depend on what kind of work you do. Python, R, SQL, Power BI, Tableau: that's very close to business intelligence, reporting, analysis. Things like SQL and Power BI will be less useful if you're doing forecasting, for example, or some sort of predictive modeling; they also wouldn't be so relevant if you're, for example, working on language models. So it depends on what your work is. If you're trying to get into that career, obviously the more skills you have the better, but I'm not really an expert on which one will get you a higher-paying job, because it's not always obvious which specific skills are more highly paid.

Yes, you should create a copy of the notebook if you want to modify it; saving it into your Google Drive is, I think, the default option. I should mention, if you haven't figured it out, you'll need to create new cells to write the code into. You can do that by clicking here, and you can start writing code.

What time will we start again? I'll be online for the next 10 minutes and then I'm going to be taking a 15-minute break. So let me think, that's 55, then 10 past. Sorry, the time zones; I
don't know where all of you are. It's currently 5:45 p.m. here. I'll be starting in 25 minutes; that's when we'll be starting part two. Oh, there we go, thank you, Johannes. And I can also mention now (I'll share this afterwards) that all the material for the course I used to teach is online. That's an eight-week course where I go into a lot more topics than I can get into in four hours right now. So if you enjoy the self-study workflow and want some free materials to work from: they're slightly out of date at this point, since it's been a couple of years since I taught the course, but I'm happy to share that material and happy to answer questions. I'll make sure to send that out.

So, there are formal rules for alignment in Python. It's a weird language in that the alignment actually matters. A function definition, for example, needs to have the indentation, so indentation is explicitly part of the language. More broadly, though, to do with readability, there are a lot of different thoughts or approaches, and they're cemented in different standards for code formatting. I can point you to some code formatting things: one is PEP 8, another is Black, another is YAPF. If you look into these, they are all code formatting standards or formatters. I'm giving you a lot of options; if I had to choose one, I'd probably go with Black, just because Black was invented in response to there being too many options and too many arguments about which code formatter to use. Black is supposed to be the most uncompromising option: while everything else allows for customization, Black is just, there is one way to do it. It's not the best way, but it's the least controversial way to do it. Yeah, so PEP 8: autopep8 will give you that. Depending on what sort
of IDE you use, for example if you're using VS Code, you can have all these things built into it, so that every time you save your code it'll automatically reformat your code to align with PEP 8 or Black or whatever.

For people that are working on the problem set as well: if you're interested particularly in the third problem, I keep, well, used to keep, a blog, and I have a blog post specifically about that problem. So I wouldn't read this blog post before you try the question, but I have a little blog post on optimizing your Python code, and it looks specifically at calculating prime numbers as an example of a lot of the techniques that exist. I guess I should also say, it would be unusual for you to be in an intro Python class and ask these questions, but if you have any questions about more advanced applications in Python, I'm also happy to answer them during these moments, particularly if you're interested in using it for things like machine learning, deep learning, web scraping, and that sort of area. That's where I've spent the most hours in the Python programming language, so I'm happy to answer questions about that as well. Right now I assume people with those questions will also be turning up to the other courses in this Hertie summer school, given by some very brilliant people.

So, I would say they're kind of the same thing, except one you run on your computer and the other you run on Google's computer. I used to feel more strongly about running things on my own hardware. Increasingly I'm in favor of a cloud-based approach, and that reflects that the more I collaborate with other people, the more frustrating it is to try to ensure that they have the same development environment that I do. I spent a lot of time trying to build Docker into an academic workflow; if you want to hear more about that, I'm happy to talk about it. But
ultimately, getting people to install things on their computer is really hard; getting them to visit a website is easy. So in that way Google Colab is wonderful, because I can just get people to open a link.

How crucial is Python for machine learning? So, for classic machine learning, in a sense everything non-deep-learning, the ecosystem in R is quite mature. R has a library called tidymodels; I think Julia Silge is one of the main people on it. I'm a big fan of that. However, I always do all my machine learning in Python. scikit-learn, while it is sort of an educational library, is also just a really fantastic library and has a lot of examples of how to build something like this really well. I think data.table actually exists in Python as well; I do love data.table. There's also something called Polars, which I've been meaning to check out, which is supposed to be a Rust-based implementation of DataFrame functionality. I just haven't been working with data frames much recently; the closest I come to it is Hugging Face's datasets, which are not held in memory.

Dashboards: I find Plotly quite acceptable. Hugging Face has something called Gradio; those were super nice. They're really easy to get up and running, but they're not for everything; they're very specific to NLP-type stuff. But yeah, it's quite good. Oh, I understand your question, though, sorry. Yeah, I mean, you could.

Okay, I'm going to call a short break now. I'm just trying to check: I said I'd be back in 25 minutes. Okay, so I'll be back in 12-ish minutes; it'll be 6:10 here. The stream will still be live, I think. I'm going to be AFK, but I'll try to answer questions when I come back.
All right, great. So I posted the solutions, if anyone wants to look at them. Likewise, if you're coming back to this, I've tried to explain the solutions as much as possible in comments as I go over them, so there's no need to look at them right now; hopefully they're a useful self-study guide going forward. I'll wait just one more minute before getting started again, because I ended up only being away for 10 minutes.

If anyone is curious, I'm using Quarto for a lot of the material that you're looking at today. I'm quite a big fan. Everything in these slides is actually originally written as a text file that I can render automatically, so that's kind of a cool thing. Do I have a template? Not for this one. Well, I have a YAML file; it'll be in the GitHub. Which reminds me, now's a good moment to also share that. I should say I haven't completely finished tidying up this GitHub, so if you go there and can't find the slides yet, it's just because I'm still cleaning it up and figuring out exactly how to go about sharing them, because currently I'm only hosting the slides locally, on a web server for myself. I don't know if it works to just download them. I didn't used to be a Quarto person; I used to use an underlying library called Pandoc, where I could just share standalone files. It
doesn't seem to work quite the same with Quarto. Okay, great. So the Colab notebook for part two is up here; I've put it into the chat as well. Maybe some of you came to this knowing base Python but not knowing data analysis, and this will be new stuff for you. Full disclosure, I don't know who signed up; I'm trying to target as broad an audience as I think this will be useful for, and trying to create another opportunity for learning and getting into a lot of this. But don't hesitate to ask me more advanced questions, etc. I'm enjoying the participation and engagement, so thank you very much, everyone in the audience.

All right, first things first: why pandas, or what is pandas? It's a popular library for analyzing tabular data. Tabular data is, I would say, the main sort of data that we deal with as social scientists; I don't know if that's strictly true, and it depends on your application. But it's rectangular: it has rows and columns. An Excel spreadsheet is essentially tabular data. A lot of the data that we deal with is of this type, and pandas is probably the most popular library for analyzing it or playing with it. There are a lot of different libraries for dealing with tabular data; pandas is very much on the Swiss army knife side of the spectrum. The syntax is quite expressive, it's quite readable, and there are a lot of features implemented in it that in theory you could write your own code for, using base Python and NumPy and things like that. So really what pandas is providing you is convenience. And also, why is it called pandas? I actually did go ahead and look this up. The answer was quite boring: it actually just stands for panel data, because the early versions were
supposed to be a library specifically providing things for dealing with panel data. I was hoping it'd be something a little bit more cute or exciting, but yeah. Okay, so as I said, there are a lot of options, and before we get into it I'll just quickly list the pros and cons, because I am acutely aware of them and I wasn't sure whether to choose pandas at this moment. The main reasons I think it's still an important library: it's been dominant for a long time, though people have a lot of complaints about it. It has support for a lot of file types; a truly ridiculous number of file formats are supported by pandas. It's very much integrated into this whole data analysis ecosystem or stack; it's the default package that's installed for those data analysis workflows. And just aesthetically, subjectively, personally, I find it a nice balance between verbosity and function. I don't find it quite as extreme as R or some of the tidyverse things, where I find it a little bit too much writing, but I also find it still expressive enough that it's not like writing awk scripts, if any of you have ever done that. These are all the entry-level reasons to use pandas. There's also advanced pandas, where you can get into really complex indices and things like that, and there it actually provides some functionality that'd be really hard to implement on your own. But we're definitely not going to cover that today, because it gets really gnarly once we get that far into the pandas library. It's worth noting that it's not all super simple convenience stuff; they have some really gnarly stuff in there as well.

The slides and the Colab notebook are directly above you.

Reasons not to use pandas. One is if your dataset's like more than a million
rows; it depends on your computer, but it's not really the fastest or most memory-efficient. Dask is one alternative for big datasets. There are a few problems with pandas. One is significant overhead, which is to say that it's so full-featured that every time you want to do something really simple, you have to break out the giant Swiss army knife when you just want to open a letter; it would be quite convenient to just use a letter opener for that, right? So it's complete overkill in terms of its functionality for most simple things. And it's also not written in a way that's particularly fast; I mean, it's not terrible, but it's not great either in terms of performance. Dask is one alternative. I think nowadays I find myself mostly writing just straight NumPy, but I'm doing a lot more numerical programming nowadays, and as a result NumPy, JAX, that kind of thing is where I am these days. And that's the last thing to mention: pandas only runs on CPUs. Increasingly, GPU-accelerated programming is a big part of a lot of data science workflows, and we have a lot more access to GPUs, so this is a relevant thing.

Okay, so having thoroughly talked about why you might not want to do this, let's go ahead and start doing it. We're going to use a running scenario that's similar to something I actually did at one point. Back when I was a grad student, one gig that I had was that every year the department would hire me to analyze the scores that they gave to the undergrads on their final exams. Mainly my job was to check whether some professors gave unusually high grades or unusually low grades, but in essence it was just to put together a big report about the patterns
of grading. It also completely ruined my impression of objectivity in the university system, but that's a different rant. But yeah, so we're going with this running scenario where you're analyzing student scores, and we have two datasets here. I made both of these up, or to be specific, I actually asked ChatGPT to make up these datasets based on a description, so these are mostly ChatGPT-generated data; who knows, they might be real. So there are student scores in five subjects, and also their contact details.

The functionality that we're going to talk about, I'm going to think about more in terms of questions or tasks. One is: someone sends me a file; how do I read the data in from it? How do I select and filter values? We have this table of data; how do we actually point to the thing that we want, and how do we extract it? How do I do calculations on tabular data? And how do I merge and combine data from multiple sources? And then finally, since I think we've been going at a good pace and there'll probably be time for this, we're going to deal a little bit with exploratory data analysis.

First things first, because we didn't cover this, and honestly, even if you do a bit of Python, this is quite basic stuff, but it tends to get overlooked when you're first learning a language: importing libraries. So again, we're experts in base Python now. It might be fair to say that it's a little bit limited for more complex data analysis. Writing a CSV parser in base Python is not really the most fun thing to do; it's totally doable, it's just not much fun. So pandas is this library for data analysis: for all these tasks that we might want to do but don't want to do in base Python, they've put together all these convenient tools. To use a library in Python, we need to import it. The
command just looks like this: import pandas. Specifically if you're in Jupyter, so this wouldn't work in a script but it works in a notebook, you can use the magic command %who to list the objects in the global namespace. It's going to look different on Colab compared to the compiler for Quarto documents, but you'll see there are a couple of things in there, and one of them is pandas. So pandas has been imported as an entire module into the global namespace: there is a variable called pandas that points to that module. We can now access the tools within pandas as attributes of the module. For example, if I just want to check the version, most libraries will have this: if you do the library name, dot, double underscore, version, double underscore, you usually get the version that you have. It's just a convention. Sometimes, though, you don't want to import the whole module; you just want to import one specific thing from it. So there's a from command. Let's say, and this wouldn't make much sense to do, I just wanted to import the version name from pandas; I can use this syntax: from pandas import __version__. I kind of got at this before, but in general the approach in Python is to try to keep the global namespace clean. Again, if you're an R user, you might have faced the problem before. 1.5.3 should be fine for this; that's interesting that Colab has 1.5.3, it's kind of updated, but that's okay, that's totally fine. So yeah, in R you've probably had the issue before where the function select exists in every single library, so if you import dplyr and MASS you're going to get some messages, and you're not actually going to know which select command you're using. In the Python world we like to avoid this problem, so people tend to import things wrapped by the entire module name. That being said, typing out pandas every
time is a little bit verbose, so we tend to use this thing, as: we'll say import pandas as pd, and it will alias it to pd. Now pd points to the module, so if I want to use the DataFrame function from the pandas library I can do pd.DataFrame. It's not a function, it's a class, but again, we're not going into that today. The version thing is just a convention in Python libraries, to have the version of a library as module, dot, double underscore, version, double underscore; that should give you the version.

So if you wanted to do the workflow of just importing everything into the global namespace, you could do it like this: from pandas import, and then star is the wildcard. So you can import every single thing from some module or script, and sometimes you might want to do this: if you have your little utility script where you've just defined a couple of functions and you want to keep it in a separate file, you might do this. However, you can see why everything should be wrapped: there are really a lot of things inside the pandas library at the top level. And a lot of these are classes, so they'll have their own members; some of these will be modules as well. So it gets really crowded. We're going to use this %reset command, and if you're following along in Colab, make sure you run this command, just so that we don't have all of those things in your global namespace.

Okay, great. So, getting your data. You can construct the DataFrame manually, as it were, as a dict of equal-length lists. Earlier I mentioned the example: we have a bunch of students and their scores on a bunch of exams. This is all made up; as I mentioned, I used ChatGPT to help me generate this, because typing all this out would have been very annoying. And you can see, if you remember from, I don't know, 30 minutes ago, we define a dictionary, and then each of the keys is a column name and each of
the values is a list of length five. And then we can just call DataFrame and pass the dict of lists as the data argument, and get a DataFrame. Here we go. We'll talk about it much more, but I should say you probably won't be manually constructing your DataFrames usually; especially once you start getting to even hundreds of observations, typing those out in a dict of lists becomes practically impossible.

So, a little trick in case you didn't know about it: there's a thing called tab completion in a lot of code editors. If you type pd.read_ into the cell and then hit the Tab key, you should get a list of options popping up, an auto-completion. It might be different on Colab, but by default it's usually Tab. And from this we can see all of the read functions inside pandas. Briefly, because there's a bunch here, I should mention a few that are worth noting, specifically for social science applications. You see SPSS and Stata are actually supported, so you can read in SPSS and Stata files if you're thus inclined. Excel is also supported. If you work a lot in R and Python, you might want to check out Feather, which was created by Wes McKinney and Hadley Wickham, the creators of pandas and ggplot respectively, to have a data format that's usable by both R and Python. So a lot of things are supported, and it's quite useful. There are downsides to the CSV format, but often we'll be dealing with CSVs in general, so we can use the pandas read_csv function and pass a path. Typing pd. does work? That's fantastic, thank you. And pass a path. Oh, it doesn't work, that's a shame. I can show you in the terminal real quick: this is hitting Tab. But regardless, the list is here; it's also in the documentation. The relationship between dictionaries and pandas? Yeah, exactly, thank you.
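The two ways of getting data described above, building a DataFrame from a dict of equal-length lists and reading a CSV, might look like the sketch below. The student IDs, column names, and scores are invented for illustration and are not the session's dataset. The CSV is read from an in-memory string rather than a file path, purely so that the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical data: each key becomes a column name,
# each value is a list of the same length.
data = {
    "student_id": ["S01", "S02", "S03", "S04", "S05"],
    "math": [92, 85, 78, 95, 86],
    "history": [70, 88, 91, 64, 79],
}
df = pd.DataFrame(data=data)
print(pd.__version__)
print(df)

# read_csv usually takes a path; it also accepts a file-like object,
# so a StringIO stands in for a file here.
csv_text = "student_id,math,history\nS01,92,70\nS02,85,88\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.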
Cool. So this is the DataFrame we're going to be working with today. The first questions that you might have of a DataFrame are just about accessing specific values, or specific ranges or collections of values. So: what were everyone's math scores? What did student 5A12 get on all subjects? And you can also have multiple queries simultaneously.

The first thing with a DataFrame is to talk about the columns and the indices. It's tabular, it's a grid, so every cell in that grid has a location that is referred to by a column and a row. Here the columns are at the top, and the rows, which we call indices or indexes in pandas, are along here. So you can see, for example, that this value right here is located by its row, zero, and its column. You can directly access the column names by doing df.columns, and you'll get an object; it's not a list, it's technically a pandas Index, but basically you get the list of columns. Likewise, you can find the index by doing df.index. By default, when you create a pandas DataFrame it's going to have a numerical index, which is just the numerical position of the thing. In this application we're going to actually set our index to the student ID, because it works a little bit better for the analysis that I'm going to demonstrate; it's going to be a little bit cleaner and simpler. So here I'm using the .set_index function, and I'm passing the column name, student ID, and I get back a copy of the DataFrame with student ID now as the index. Here you can check the index, and we see... oops, I did not mean to click on that. So now that I've set that, you can see the indices on the left here are the student IDs, and the columns are the subject names. I've changed the presentation a little bit to make it more concise, but this is the name of the index; it's not a column.

So there are a lot of ways in pandas to access values. I'm going to
talk primarily about the one that I think you should use by default. .loc is the one that we use for name-based indexing. In general you're going to do DataFrame, dot, loc, and then you'll pass the index names and the column names. Here you could give a single value, a scalar; you could pass a list of values; and again, you can use the slice notation.

So let's go through some examples. Let's say we want to see what student 5A01 got on all exams. The syntax would be df.loc, and then here is the rows, so I'm passing the scalar, just the single student 5A01; over here I'm passing the colon to indicate the slice of everything. I've provided some highlighting to help visualize the intuition: here you can see 5A01 is selected along the indices, and then here we're selecting all the columns. So what we get back, and you'll see this in your Colab, is this range of values.

We can also select by column. Here we could ask the question: what did everyone, so all the students, get on the history exam? Exact same idea: we use the colon to indicate everybody, a comma to end the bit referring to the rows or the indices, and then we pass the name of the subject that we're interested in, history. Now you can see here we're saying everything, and then among the columns we're selecting history, and you will get back these values. So again, "all" corresponds to here and the second value corresponds to here. It's pretty intuitive.

You can start getting a little bit funky with it. Let's say I want to see what two students got on all exams: I could pass into that first space a list of both of the students' names. That's literally what this is right here: a list of two strings, 5A01 and 5A12, which are both indices within the DataFrame, and this is exactly what you get. Likewise, and I don't think I need to belabor the point too much, here I'm passing a list, art and history, and I'm saying all rows. You'll notice, though, that the order of the labels does actually matter, so when
you call it in your notebook you actually see something like this: the order of the columns is changed according to the order that I call the columns in. No, the spaces are optional; they're just for readability. Your Python crashed? I'm not sure. Thank you.

Okay, and finally, the thing that we usually want to do is getting specific cells, right? Here I think it's very obvious at this point, but I'm going slowly to avoid confusion: student 5A01, history. There we go, it looks exactly like this. I can do the same thing and pass lists in here; now we get 5A12 on history and art. I can do as much as I want with this, to my heart's content: three students, two subjects, sure. Okay, so I think I've thoroughly covered it.

It's quite powerful, actually, because we can also use it to filter rows. Let's say I want to see all the scores for students who got 90 or higher in math: for any student that got 90 or higher on the math exam, I want to see their scores on all of the exams. What we're going to do is create a Boolean array. Here I'm getting everyone's math score and asking: is it greater than or equal to 90? What I get back is an array of Booleans, one corresponding to each student: True, False, False, True, False. Up here I've assigned it to a variable called good_math, and just to remind you, good_math looks like this. I can put that into .loc, right? So now I can say: give me the rows where that condition is true. There we go; these people are both at 90 or above.
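The `.loc` selections and the Boolean filter walked through above can be sketched as follows. The student IDs and most of the scores are invented for illustration; the math column is chosen only so that the mask matches the True, False, False, True, False pattern read out in the session, and the variable name `good_math` follows the spoken name.

```python
import pandas as pd

# Hypothetical scores, indexed by made-up student IDs.
df = pd.DataFrame(
    {
        "math": [92, 85, 78, 95, 86],
        "art": [88, 90, 72, 81, 94],
        "history": [70, 88, 91, 64, 79],
    },
    index=pd.Index(["S01", "S02", "S03", "S04", "S05"], name="student_id"),
)

one_student = df.loc["S01", :]                # scalar row label, all columns
one_subject = df.loc[:, "history"]            # all rows, scalar column label
two_subjects = df.loc[:, ["history", "art"]]  # column order follows the list
one_cell = df.loc["S01", "history"]           # scalar row and column: one value

# Boolean mask: one True/False per student.
good_math = df.loc[:, "math"] >= 90
high_scorers = df.loc[good_math, :]           # rows where the mask is True

print(good_math.tolist())
print(high_scorers)
```

Passing the mask in the row position of `.loc` keeps the full set of columns for the matching students, which is the "all the scores for students who got 90 or higher in math" query.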
Do spaces matter at all in Python? I want to say, other than indentation, it doesn't matter; it's mostly just for ease of readability. I can't think of any situation where spaces do matter; someone might correct me.

That being said, creating a new variable for that is a little bit annoying, so we can do it with a shorter syntax. And actually we don't need to use this greater-than-or-equal-to operator, even though it's sort of more logical; we can use the .ge method, which is the greater-than-or-equal-to method, and call it with 90. So this gives the same thing as up here. We can combine these as well. Let's say we want two conditions, so both math and art are over 85, and I want to see their history scores. I can pass this, and then here's the binary and operator, so it's going to do this pairwise operation: first True and True, then True and False, and so on going down, and I'm going to get a single array of five Booleans, and then I'll check out the history scores. So this is the general syntax in pandas if you want to do a lot of filtering: you can do df.loc and then put a long list of filtering conditions on the rows. It's best to put them into parentheses, just so that the order of execution is correct, and then join them by ands. Likewise, you could do that in the columns as well.

I should also mention that .loc is name-based: you're going by the names of the columns and rows. But there's also locational indexing; that's literally just if I want the first two rows. Less than or equal to is le, less than is lt, greater than is gt. So here with .iloc we have the first two rows and the last three columns. You should note that you can't mix the two in the same command: .iloc only takes numerical arguments, and .loc will only accept things that correspond to actual names of indices.

Okay, now getting to the things that are slightly more exciting than just accessing our data: operations on
our data. So we want to do things like ask if this score is over 90. We already did that, but that was an operation on our data. We might want to find out what the average math score was, or what the maximum score was for some student, and so on. We're going to cover broadly how to do all of these things in pandas. Some quick terminology. I've mentioned this before: a scalar is a single value. A non-scalar is a data structure that contains, or can contain, more than a single value; you can have a non-scalar that is empty. And again, when I say argument, I'm talking about the input to a function. You might have noticed this already: I've just been talking about DataFrames, but pandas actually contains two data structures. One is a pandas Series, which is one-dimensional, and the other is the DataFrame, which is two-dimensional. There are some other differences, but that's basically the key one. You can get a Series out of a DataFrame by indexing it with a scalar. What do I mean there? If I put in biology, which is a scalar, what I actually get is a Series. If I did the same thing but just wrapped it in a list, this is no longer a scalar, this is a list with just one element, and now what I get back is a DataFrame. The difference is this is one-dimensional, so this is just length five in the first dimension; this is two-dimensional, but it's length five in one dimension and length one in the other. Now think about arguments. We could have arguments that are scalar, so looking at individual values. In this case we could have some command: is the score greater than 90?
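A rough sketch of the shorter filter syntax, combined conditions, positional indexing and the Series versus DataFrame distinction described above. The data is made up for illustration, and "biology" stands in as the second condition column, which is an assumption rather than the column used in the session.

```python
import pandas as pd

# Hypothetical scores; IDs, subjects and numbers are assumptions for illustration.
df = pd.DataFrame(
    {
        "math": [95, 70, 82, 91, 60],
        "history": [80, 88, 75, 93, 67],
        "biology": [90, 65, 86, 89, 72],
    },
    index=pd.Index(["5a01", "5a02", "5a03", "5a04", "5a05"], name="student_id"),
)

# .ge(90) is the method form of ">= 90", so no intermediate variable is needed.
same_rows = df.loc[df["math"].ge(90), :]

# Two conditions: wrap each in parentheses and join them with & (element-wise "and").
both = df.loc[(df["math"] > 85) & (df["biology"] > 85), "history"]

# .iloc is positional: the first two rows and the last two columns.
corner = df.iloc[:2, -2:]

# A scalar label returns a one-dimensional Series ...
as_series = df.loc[:, "biology"]
# ... while a one-element list returns a two-dimensional DataFrame.
as_frame = df.loc[:, ["biology"]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python, which is the order-of-execution point made above.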
You'll see this function assumes that x is a scalar. Next is one with a Series, a one-dimensional array, as the argument. So I'm calling the maximum: if I want the maximum of the scores, I'm lining up all the scores and finding the maximum there. So that's going from a Series and returning a single value. And then here I can ask at the DataFrame level as well, so a query over a multi-dimensional structure. A lot of these standard operations that you might want to do are built into pandas. So, as we mentioned, not-equals is .ne, greater-than is .gt, and so on. There are a lot of standard statistical summary functions: there's .mean, .median, .std for standard deviation, and so on. Pretty much any standard summary function you might want is already implemented in Python, sorry, in pandas. It's also worth saying that they tend to be optimized, so if it exists, it's generally better to use it, unless you really know what you're doing, in which case, yeah. So now we're just going to calculate the average score on the math exam. Again, with our indexer, we're choosing everyone, getting their math score, and then we're calling the mean function on it. So here we're getting a Series out of here, because there's a scalar here and everything there, and I'm saying .mean. So that Series has a function mean, and the average is 87.2. And it turns out that where lambda functions become important, and where pandas becomes quite powerful, are these apply and applymap functions. Apply and applymap allow you to apply any function, really, to a Series or a DataFrame. Series.apply takes functions with scalar arguments, so it'll apply the function to each element of the Series. I'll say it again: when you do Series.apply, you're going to take that function and it's going to be applied to every single element in that Series. If you do DataFrame.apply, it's going to take each row or column as an argument. So let's say you want the average on each
exam: you could do DataFrame.apply and then take the average of each exam. The way we specify whether we want row-wise or column-wise is there's an argument called axis. And then finally there's a function applymap. This basically just applies a function to everything in a DataFrame. So here I'm going to get what the lowest score for each exam and each student was. For the lowest per exam, I can use the .min function with axis equal to zero; it's saying go across the rows, so for each exam. And for the lowest per student, again, same idea. Except you'll notice here I'm using .min with axis as an argument, and here I'm using the apply function and passing min, which exists just in base Python. So basically, when we're applying a function to a DataFrame, we can apply it to each row or we can apply it to each column. Here I want the lowest score per exam. Axis equals zero is saying which direction to do this operation in: 0 means apply the function within each column of the DataFrame; axis equals one means apply it within each row of the DataFrame. So here I'm getting each student's lowest score, and here I'm getting the lowest score in each exam. Okay, now finally, let's say that we want to convert the scores to an A-to-F ranking; this is an American-style one. So here I've written a function for you. There's a lot of different ways to do this; I've just done it with if-else conditions because those are the tools that we've seen before. You can see we're going through: we get in some value and I'm just waiting for it to trip a condition. I agree, it actually is confusing, and I often mix it up, and I try both and figure out which one is the one that I want. So I've defined this function here and I'm using the applymap function. You can see it applies to every single element in the DataFrame. Okay, alrighty. So next, talking about combining data. This is another common use case, where you'll have multiple data files and you need to combine the data from them in some way.
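Before moving on to combining data, the summary functions, the axis argument and the element-wise grading function described above can be sketched as follows. The data and the grade cut-offs are assumptions made up for illustration. Recent pandas releases rename `applymap` to `DataFrame.map`, so the sketch uses whichever of the two is available.

```python
import pandas as pd

# Hypothetical scores; IDs, subjects and numbers are assumptions for illustration.
df = pd.DataFrame(
    {
        "math": [95, 70, 82, 91, 60],
        "history": [80, 88, 75, 93, 67],
        "biology": [90, 65, 86, 89, 72],
    },
    index=pd.Index(["5a01", "5a02", "5a03", "5a04", "5a05"], name="student_id"),
)

# Series -> single value: the average math score.
math_mean = df.loc[:, "math"].mean()

# axis=0 works down each column (one result per exam);
# axis=1 works across each row (one result per student).
lowest_per_exam = df.min(axis=0)
lowest_per_student = df.min(axis=1)

# The same per-exam result via apply, handing each column to Python's built-in min.
lowest_per_exam_apply = df.apply(lambda column: min(column), axis=0)


def to_letter(score):
    """Map a numeric score to a letter grade (the cut-offs are assumptions)."""
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    else:
        return "F"


# Element-wise application: DataFrame.map in newer pandas, applymap in older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap
grades = elementwise(to_letter)
print(grades)
```

As noted in the session, the built-in `.min(axis=...)` is generally preferable to `apply` when it exists, since the built-in methods tend to be optimized.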
For example, you might want to merge voting records with other things like credit records and things like that. The list goes on. We want to talk about two different ways of joining data together. One is concatenation. This basically just means sticking the data together, not in any principled way; you just kind of put it together. The other is what you would call a more principled way of combining data, which is joining: you find some key that they have in common, for example a social security number, and then use that to join records. So let's say there's two extra students that we forgot. I create another DataFrame with two more students; they look like this. I can concatenate these: I can basically add these on to the end of the old DataFrame by using this pandas.concat function, which takes a list of DataFrames as the argument, and then again this confusing thing, axis. Axis equals zero means I'm stacking them on top of each other, so I'm stacking them this way, one on top of the other. Axis equals one would be stacking them horizontally, so making the DataFrame wider; axis equals zero is making the DataFrame taller, or longer. And there you go, now we have seven elements in this DataFrame. Mind you, this hasn't actually changed either of these; this function returns a new DataFrame. I tend not to say row-wise and column-wise, I think because I find them confusing as well. One note about concatenation is what happens when you have different columns. So here I'm going to reset the index. What the reset_index function does is it takes the index of the DataFrame, makes it a new column, and then creates a new numerical index. So we can see when we call reset_index, the old index, which is the student ID, gets converted into a column, and now we have a new index without a name that's just the location numerically. So if we try to combine DataFrames with different
columns, if we try to concatenate them, you'll see that we get NA values. What's going on here is... oh, so to make it slightly less confusing, to make it fit on the page, I combined the same DataFrame with itself twice: once with the student ID as the index, the other time with the student ID shifted out into a column. The top one, df_extra, doesn't have any values for student ID in the columns, so it just puts an NA into place. I hope that's relatively straightforward and clear. The same intuition applies if you try to do it horizontally, sorry, if you wanted to stick them together wide, which I would call column-wise stacking, I guess, because you're stacking together the columns of the units. Okay, so that's concatenation, kind of a brute-force thing to do. Now this is a slightly more complicated thing, which is that we want to match up records. So let's say that until now we've been calling students by their code, but we'd like to have their first names as well. So GPT made up this thing for me, which is basically a bunch of made-up people, etc. We have their student ID, we have their name, and we have some other information about them. I'm going to be using the student ID to join the two DataFrames together. And notice, for one thing, that they're called different things in the two DataFrames. I did this on purpose, just because it's reflective of the reality that is data analysis. And also, just for a bit of visual simplicity, for the original DataFrame I'm going to be changing the index back to a numerical one and making student ID a column. This is not something you necessarily need to do when you're doing the analysis; it just made the visualization of this a little bit neater. So again I'm recreating these: I'm creating one DataFrame called df_contact, and the other one is the DataFrame that we've been working with until now, with the index reset. So you might notice, actually, if
you were reading the table on the screen, the two DataFrames don't have the same values of student ID. So the contact list has, for example, 5a02, which isn't in here. You also see 5a12 isn't in here either. There are multiple mismatches between these; there are also multiple matches. That means there's a bunch of different ways that we could merge these things together. How do we deal with the things that are in one key but not the other? So this is called an SQL-style join, and the syntax is like this. There's the pandas.merge function, and the first argument I want to talk about is how. Here, that's not what you're going to write in your code; this code will fail if you try to run it. But there's four options for this: inner, left, right, and outer. We're going to look at each of those, because the easiest way to understand them is to just see them. So on the left we have the contact details, and I'm just taking two columns from that DataFrame, first name and student ID. On the right-hand side I'm taking the scores, and again, just for visual simplicity, I'm just taking the student ID and their history score. You might also notice I'm being a little bit sneaky here: I'm not actually using .loc. You can just pass a list to square brackets after a DataFrame and you will get those columns. If you go further into pandas, you will find that there's a little bit of weirdness with this, which is that it may return a view rather than a copy of the DataFrame. So if you try to change values of a view, it becomes very ambiguous what should happen to the underlying data structure, and you'll get a lot of annoying warning messages. So if you're ever assigning a value on the left-hand side, like outside of this example, you probably will want to use .loc, not just this plain indexing. Then I'm saying what my key is: on the left-hand side, student ID written in camel case, and on the right, student ID written in snake
case. So if we set how to inner, what we're left with is only the rows of the DataFrames where there's a matching key in both DataFrames. There's only four keys in common between the two DataFrames. If we set how to inner, the merge returns something without any NAs, so only matches on both sides. Left, on the other hand, will keep all of the values from the left-hand DataFrame and will merge on things from the right-hand DataFrame where there is a matching key; otherwise it will just insert NA values. Right works the same way but in the opposite direction: you keep the entire right-hand DataFrame and you add values from the left-hand side where there's a matching key. And then finally, outer is just keep everything from both sides. So, if you remember, yeah, so we have NA values: we have all the student IDs from contact, and then we also have a score for some whose contact details we don't have. Okay. Um, yeah, so that cell won't work. I was writing that as pseudocode, just to tell you what could go into that spot, but no, that code shouldn't work. However, the cells after that you should be able to execute. No worries. Okay, I'll pause again right here for a moment, because SQL-style joins for some reason are considered a very tricky topic. I think it's relatively straightforward, but it's straightforward when you visualize it, I think. That's not to say... if you're totally confused, please feel totally free to ask any questions. But if not, I'll just pose a question: how is inner different to outer? Does anyone want to give a shot at that in the chat, or if you want to speak, I think you're allowed. I heard someone for a moment. Okay. So inner, you only keep the rows from the two DataFrames, so the left-hand and right-hand side... yeah, good. Yeah, so in set notation it would be the intersection, whereas outer, I guess, would be the union of the sets. Is that right? Yeah. So inner, you only get the rows where there are
matching keys in both DataFrames; outer, you get everything, regardless of whether it is matching or not. Exactly, it's perfect. All right, finally, some brief notes on a bit of exploratory analysis with pandas. This is a little bit about functionality, but this is also a little bit about practice. So we're going to be pulling some data from the British Election Study. This is data I worked with more when I lived in the UK, but the British Election Study is a nationally representative survey of the British public. Uh, answering Morgan: so you see here I have left_on and right_on. If you look at the documentation for the merge function, there's also an argument on. So if those two columns have the same name then I could just use on, but because they're different I have to say left_on to name the column in the left-hand DataFrame and right_on to name the column in the right-hand DataFrame. Okay, so, the British Election Study: we're using a small subset of it that I've just cut out for demonstrating exploratory data analysis, but it's a huge survey, hundreds of questions, I think, and tens of thousands of people in each wave. So there's a lot of data and it's quite complicated. So when you first have the data, what do you do with it? There's this amorphous idea that when you're working with data, even though using programming languages puts a layer of abstraction between you and your data, you know, you don't actually have to look at the data itself, you should still be trying to qualitatively get to know your data. That means you might still want to open it up in Excel and look at it. I wouldn't say that, because we can also do it without ever having to open Excel, but it's the idea that, first things first, you just want to understand what's in there and understand some high-level patterns. So things that you'll immediately be able to check that should sort of flag
things to you: you will get a sample of the data; are the rows what you expect them to be? You'll check the dimensions: are there too many rows, too many columns, too few? You'd always check that before doing any further analysis. Data types: pandas is going to make a lot of assumptions about the data types when it reads in data, and that can cause problems, so you should check the data types as well. A really common failure mode is people's phone numbers being written as integers, or IDs that are really just strings, like social security numbers, being recorded as integers. We don't actually want to use them as integers, because you don't ever want to actually calculate the average of social security numbers; really they should be treated like strings. And finally, once we've looked at broad patterns of the entire DataFrame, we're going to want to do analysis on individual variables in the data, and the first thing you can do here is tabulating; that's, I think, the best quick look at individual variables. So to look at the first five rows of the DataFrame you can use the .head function, and we will get this. And you can see, looking at it: did they vote, did they vote for the Conservative Party, how interested are they in the election, how do they score on a civic duty scale, which party they identify with, their ideology on a left-right scale, and whether they voted Leave in the Brexit referendum. This data is from 2019, so it's not long after, when people were still really trying to wrap their heads around Brexit. I think people still are trying to understand it in political science; it was pretty relevant then. So the first thing we want to check are the dimensions of the data. After we've looked at it, we want to say: okay, does this have what we expect? I happen to know that the 2019 wave of the British Election Study should have about 40,000 observations and a few hundred
variables. So when we look at the shape, this tells me the number of rows and the number of columns. That should flag to me that I'm looking at a small subset of the data, but in this case I constructed it, so I knew that's the case. In general, this .shape attribute is useful for telling you the dimensions of your data. The next thing you want to look at are the data types, and pandas here has a .info function; it gives you a lot of useful information. So first it tells you what the index is. In this case it's a RangeIndex; we've seen that before. It's just the numbers zero to 2193. Then we have information on each of the columns. So looking at this, I should go through each of them and see: is this actually what I think it should be? One thing you might notice is, first, why is it int64? The short answer is that these aren't actually base Python data types; it's using NumPy data types. I don't need to say too much more about this; if you need to work on this, there's plenty of documentation. But this is just to say this is a kind of integer. And then object is kind of anything in NumPy data types, but it's basically a string. So again, here, if we see object, we should usually understand it to be a string. There are some things that seem a little bit strange, like why is female an integer? Maybe it should be a Boolean, for example, or maybe it should be a string. So I can look into each of these things: from the column I can use the value_counts function and tabulate it. So I was wondering, okay, why is female an integer? I see, they represent it as one or zero. So one must indicate that the respondent is female; zero indicates that they're not female. That could include non-binary in the category; that might be a way of operationalizing it. And then here I want to see, okay, how many people from each region do I have in this subsample? I don't know if
British geography means much to you, but here we have the different regions of the United Kingdom, and I can see how many respondents from each region I have. Okay, all right, so that's pretty much where I'm going to wrap up the material for today. I see we could have probably gone on a little bit longer, but I wasn't sure how much I could fit into four hours. I'll still be here for a little bit longer, but I'll do a quick recap, take questions, and then we'll go on from there. So first, just what we want to do with pandas. One is reading in files, so I/O stands for input/output: we learned how to read data from disk. We learned how to index, slice, and filter that data. We wanted to perform functions on it. We learned how to combine and merge it, and we learned a little bit about what the first steps are that you should take with a DataFrame. I should say that here you actually have pretty much most of the functionality you will need for most analyses. Like 99 percent of what you do will be composed of these four things. There are a few other operations that don't fall under this. One of the main ones is group-by functions, which I haven't included because I wasn't sure if there was time, but that's a form of combining filtering plus operations: doing operations conditional on the value of a column. If you want to go to further resources, particularly for pandas, there's a textbook written by the author of the package. It's still reasonably recent, and again it also covers a lot about data analysis principles, which are as important as learning the commands themselves, actually, if anything, more important than learning the commands themselves. So here are the sections that are relevant to the stuff that we've done today. If you are a university student, a lot of universities have free access to O'Reilly's Safari Books, so that's one way you might be
able to access this book for free. There's also a couple of advanced topics that I didn't bring up today. One is dealing with strings, so regular expressions and things like that: super important functionality in a lot of programming, how to create patterns that flexibly match patterns in strings. However, I could probably spend an hour teaching regex alone; that's sort of a lifelong journey, to learn regex. And finally categorical data, which I'm not going to be talking about. If you want a bit more information on things, there are some blogs and documents in here. I also wanted to share, I mentioned this during the break, for people who might not have been there, and I'll make sure to share this later: here's the repository from back when I was teaching this course at Oxford. It's an eight-week course. We've basically covered the first three weeks already. Other things that you'll be able to look at in here as well: there'll be data visualization, there will be introductory machine learning from week six, and then you're going to learn web scraping in weeks seven and eight. Feel free to use this; feel free to contact me if you have questions about the material. Will do. And yeah, so in these four hours we basically covered weeks one to three of this course. Yeah, thank you. I'll also make sure to send some additional stuff there. Okay, great. So now I'll just take some questions before I give you the problem set for this, and continue to take questions, but are there any immediate things that you want to ask, for everyone? Advantages or disadvantages of Python versus R with data wrangling? Oh, not terribly. I think whatever you get used to, whatever you're the most comfortable with, is the best tool for you. It's a time investment to get good at using any of these libraries. When I'm using R, I'm a really big fan of data.table. I think data.table is pretty fantastic. But yeah,
yes, I'm glad that there are other data.table fans in the chat. Let's see, there's a problem set as well for the materials we just covered; there's a few things that weren't covered in here as well, if people want to start looking at that. Oops. I'll stay on for a few more minutes to answer questions; you can also start asking questions about that. Otherwise, again, the solutions to this are available as well, on GitHub, etc. So yeah, okay, and thank you very much to everyone for coming and participating. Obviously, I'll stick around for another 10 minutes to answer questions. Otherwise, thank you very much for coming. All right. I have a question. Yeah, please go ahead. Yeah, so I was wondering, because I work with pandas as well, and mostly with data.table when it comes to work, because I work with transaction-level data. The problem with Dask and pandas is that pandas cannot load a DataFrame with more than a million rows, and secondly, grouping also takes a little time. Is there a way to address that in the same manner as data.table does? So when it comes to aggregations, should we use chunks in pandas? But then again it can take some time. I would just like to know what the other workarounds are, because currently my work is divided into completing the cleaning in R and then bringing it to [unclear], because aggregations can take a little bit of time. Yeah, no, it's a good question. The first thing that I'd be curious to try would be Polars, a new library that people are quite excited about. Sorry, let me just quickly give people the solutions. Let me think. Another thing I've found in general is that base Python is actually quite fast, depending on the operation you need to do. So sometimes, if it's not a terribly complicated thing you're trying to do on the data, you might just want to not use pandas at all. The last
thing is, again, it depends on the operation. If it's all only going to be row-wise, like element-wise, with no group-by operations, then there's a lot of libraries that allow you to do super-fast parallelization on that. Even the Hugging Face datasets library is pretty powerful for this, because you can just point as much computing power as you want towards it. Yeah, I guess it depends a little bit more on your exact use case. The main use case would be group-by aggregations, because sometimes group-by operations... Yeah, yeah. Group-by operations, which cause problems in pandas. I believe pandas also does not use multithreading to a great degree, the way data.table in R does, so yeah, that causes the problems for me personally. Yeah, I've only gotten around this in really stupid ways, like only using group-by to identify the indices for the group-by operations and then dropping everything down to NumPy or lower-level libraries to do the operations on the indices. NumPy arrays do indexing incredibly fast and efficiently, so that sometimes works. But yeah, I don't have a clean solution for that. Again, I would just say check out Polars, because I think that's the exact kind of problem it's trying to fix, and I see on data analysis subreddits that people seem to be very fond of it. Thank you, I'll check Polars, I'm just reading on it now. Thank you for suggesting that; I may also eventually migrate from pandas to Polars. Checking for other questions in the chat. [inaudible] if you're still here, but: how can you merge them so there's only one column left? I think, if I understand you correctly, by default it doesn't do that. You can just drop the column afterwards, so just do it in two commands, I think, would be the way to do it. Here's the link again. Oh, that's very cool, working with that field house. Yeah, no, the crew
at Northfield pretty well when I was there, when BS moved in that field. Yeah, you're very welcome. To everyone, thank you so much for coming. To Hui, thank you; Johannes, also thank you; to you two, for organizing this and helping run it. It's been great to do this. And yeah, I'll make sure that all this stuff is sent as a follow-up. Yeah, happy coding everyone. [Music] Yeah, thank you Musashi. Thank you Musashi. Thank you everyone. [Music] Thank you, thank you so much. [Music]
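For reference, the concatenation and SQL-style joins walked through earlier in the session can be sketched as follows. This is a minimal illustration under stated assumptions, not the session notebook: the names, IDs and scores are made up, and the key columns are deliberately named differently (`studentID` and `student_id`) to mirror the camel-case versus snake-case mismatch described in the session.

```python
import pandas as pd

# Hypothetical data; every name, ID and score here is an assumption for illustration.
scores = pd.DataFrame(
    {"student_id": ["5a01", "5a03", "5a04", "5a12"], "history": [80, 75, 93, 71]}
)
extra = pd.DataFrame({"student_id": ["5a13"], "history": [64]})
contact = pd.DataFrame(
    {
        "studentID": ["5a01", "5a02", "5a03", "5a04"],
        "first_name": ["Ada", "Ben", "Cleo", "Dev"],
    }
)

# Concatenation: axis=0 stacks the frames on top of each other (taller).
# pd.concat returns a new DataFrame and leaves the inputs unchanged.
taller = pd.concat([scores, extra], axis=0)

# Concatenating frames whose columns differ fills the gaps with NaN.
mixed = pd.concat([scores, contact], axis=0)

# Joins: the same merge call with each of the four options for "how".
joined = {
    how: pd.merge(
        contact,
        scores,
        how=how,
        left_on="studentID",    # key column in the left-hand DataFrame
        right_on="student_id",  # key column in the right-hand DataFrame
    )
    for how in ["inner", "left", "right", "outer"]
}

for how, frame in joined.items():
    print(how, len(frame))
```

With these made-up keys, three IDs appear in both frames, so the inner join keeps three rows, left and right keep four each, and outer keeps five. Because the key columns have different names, both appear in the result; dropping one of them afterwards is a separate step.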
Info
Channel: Hertie School Data Science Lab
Views: 2,304
Id: TAg5tiIC7Yk
Length: 209min 21sec (12561 seconds)
Published: Tue Aug 15 2023