Advanced Data Prep Tutorial - Discovery Copenhagen 2019

Captions
Let's get started. My hope is that, with the scope for today — I'm calling it advanced data prep — even if you've never worked with JMP much, or never worked with data preparation much, you get value out of it. I also hope that those of you who have done extensive data preparation see some ways in JMP where maybe we do it a little differently. If you're new to JMP but you've used other tools, you may have noticed that there are hundreds of different ways to do things in JMP, and one may be more advantageous in a particular circumstance. My hope is that I can give you a little context around why one might choose a particular path or a particular tool. So if you are advanced and I'm showing basic things, I hope you pay attention to those little nuances where I say this may be more advantageous in this circumstance or not — but I hope everyone comes away with some real value.

Now, to get started on today, let me just give you a sense of what I want to talk about, and then we'll get right into JMP really quickly; I don't like a lot of preamble. My thinking is I'll start with some high-level discussion of types of data preparation methods. I'd like to distinguish what the methods are doing, just to give us context around the tools themselves — some framing helps to group them. Then I'll go into ten-plus examples. I think in the description I said ten tools, but I started writing the talk six months ago and I got past ten really quickly, so it's ten-plus — it's really going to be ten classes of data preparation tools, and a lot of them, actually, because I want you to see the scope. So much of data preparation is having the imagination to get from your messy data to your finished data and seeing how you're going to string together multiple tools; it's really uncommon that any single process or any single tool will prepare your data completely. So we're going to see a lot of them, so that you can see the constellation of tools you can string together. Then, at the end, before we open up time for questions and back-and-forth discussion, I want to work through a few case studies where we have to do exactly that: string together multiple tools so that we can get from our messy data set into something that we can actually analyze. That's my thought process for today. I realize it's a very pithy outline, but I like to keep things open. Please interrupt me — I have a tendency to just keep talking if no one stops me — so if there's anything you want more elaboration on, or if you have questions about why I did something, or if you just need more time to look at the consequence of something I did, please interrupt me.

Okay, so this idea of types of data preparation methods: I'm going to start with three classes of data preparation. Actually, I'll pause already and say something. I am recording this, and I'll make the recording available just to you all, since you signed up for this tutorial. Don't feel like you have to write anything down, because you'll have the slides — these will all be available — but only you will get the recording, so if you want to go back, you can.

So, types of data preparation. I like to think of these in three big classes, and some of this comes from my background in graphic design — I sort of stole these terms. The first kind of method type I'll talk about is constructive methods of data preparation. To give you a sense of what these might look like: constructive in the sense that you're building on top of data and adding something to it. This could be creating variables, or deriving variables from other variables — we'll see this today with formulas. Things like recoding a variable: taking a variable and the categories within it and transmuting them into a new variable — taking categories, combining them, splitting them apart, things like that.
Setting column properties: metadata within a data set is data itself, so when you add properties to columns — whether it be value ordering, value labels, or coloring — all of that is adding data; in a sense you're adding stuff that allows you to see new things. Joining, merging, concatenating: in a sense that's constructive. You're taking data that exists in different sources, putting them together, and establishing relations between things. You don't just take data and squish it together — you align them based on rows that match, or on characteristics of the data — so you are adding value, adding information. And imputation: I have a little star there, because we can have a big argument about whether imputation actually adds information. From an information-theory standpoint it really doesn't, but from a modeling standpoint you may regain the ability to use certain rows if you impute. With imputation, where there's missing data, you are adding values back in based on some structure that you're presuming about the data set — oftentimes this means assuming linear relationships between variables and adding in values where you expect them to be if they're missing. So we can argue about that being constructive or not, but we'll leave that for now.

Okay, so contrasting with those are destructive methods of data preparation, and some of these are necessary for the process of analysis. We can think of things like summarizing: taking variables and calculating means on them, maybe means broken apart by certain categories. Sometimes you just need to report a table of means. That's a destructive method in the sense that once you have just that table of means, you can't reconstruct the original data. I stole this framing from photo editing: if you're applying filters to a layer, or making changes to brightness and contrast in certain ways, you can't go backwards — you have destroyed, in a sense, the original data and end up with something new. Tabulating, in the same way that summarizing is destructive, leaves you with something you can't go back from — if you don't store the original data, of course. Deleting records: something you shouldn't do in JMP — we can always exclude instead. Recoding in place: again, something you probably shouldn't do. Taking a variable and writing over the original variable makes it very hard to go backwards; not recommended. And then there's a tool in JMP I specifically call out here: Update. This is a type of importing or merging method where you have a data set and you want to replace values in your original table — maybe you get new sales figures and you want to write over the existing sales in a data table — and you can't really go backwards from that. Now, I mean destructive assuming, of course, that you haven't saved backups. There's good workflow where you save backups of your data set, create versions, and all that; version control is outside the scope of today, but from the standpoint of thinking about these tools, these are what I would call destructive.

And then finally, methods that are neither constructive nor destructive: data prep methods that just change the way your data is oriented. We'll talk about some of these today, like stacking and splitting. These reorient the way that observations are stored in your data set — changing units of analysis, maybe from across columns to across rows, or from rows to columns; we'll look at those. Transposing: literally flipping the rows and columns of a data table. You're not destroying any data; you are reorienting it, and that has certain affordances within JMP — all statistical software expects data to be stored in a certain way so that it can operate on it. Excluding and hiding rows: non-destructive — you're simply telling JMP which things to use and not use. And then taking subsets and views of the data — data view is a really nice thing in JMP we'll talk about. So those are non-destructive.
Just as a superclass of thinking about tools when I talk about them today, I like these categories because they change what sort of things you should think about when you're doing them. Okay, so that's actually all the PowerPoint — I know, it's not much — so we're going to pop right into JMP. I'll certainly share this journal with you today; it'll be on the community page for the whole Discovery conference, and all the data sets that I reference in here will actually be inside, so you'll be able to play with them yourself too. Just a point of question: is the font size okay — can everyone see? I'll zoom in when necessary; I'll try not to do it too fast, it makes people sick.

Okay, so to start, let me just talk about these classes of tools. Again, there are going to be more than ten, but there are ten big classes. We'll talk about table formatting and import-type tools. Data restructuring — those are the tools I talked about as non-destructive, like stacking and splitting. Data aggregating tools — tabulating and summarizing. Data merging — Join, Update, Concatenate, and Query Builder, which can perform these operations all at once using JMP's implementation of SQLite. Table views — subsetting, the data view function in JMP, which is really powerful, and the row editor, which is also really powerful. Things we do with missing values — I'm not going to talk about imputation per se, but I will talk about ways to visualize which data are missing and how you might identify which rows are problematic, so we'll use Missing Data Pattern and Explore Missing Values. Getting into constructive methods, we'll talk about the formula methods in JMP — instant formulas, temporary formulas, column formulas — and a couple of specific things: making indicator columns, and binning. We'll see a bit about extreme values — this is sort of a separate idea in data preparation: values you don't trust in your data set and how you might identify them. We'll talk about univariate, multivariate, and model-based methods for this — different approaches to how we think about outlyingness and some of the ways we can work with such values. Straight-up wrong values — so recoding, and, don't overlook it, Find and Replace, which can sometimes be really powerful in JMP, especially since it supports regular expressions. And then finally some basic interface and data set tools: data filters, the unbelievably powerful Select Matching Cells — something you can use a lot of the time — row state columns, and Text to Columns. So we'll talk about all of these in, I assume, an hour and a half.

Okay, going right into it: table formatting and import. I'm going to pull up a table here. We can already see it has information about cars — we have Volkswagen, Acura — we can tell based on the values in the columns that that's what we have. We have information here, but we have no column names. I should say, as background, I did a lot of research psychology — I'm an experimental psychologist — and often people would send me tables from surveys they did that just had no column names. It's an unbelievably strange way to store data, because we haven't identified what our variables are, and they would often send me a different file that has the data dictionary — what the column names should be. This was always a frustration to me, especially when I was working in JMP or in R and needed to align those names with the table. So, if you haven't noticed, JMP has this columns section on the left, and these sections are all actually independent pieces of the JMP table, and there are affordances to this — most notably that you can actually paste a list of column names into your columns list. So if you have these column names and somebody has sent them to you, then with the columns selected here, you can just paste them right in.
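Pasting into the columns list is a JMP GUI operation, but the underlying idea — applying a separately supplied data dictionary of names to an unnamed table — can be sketched in plain Python. The rows and names below are invented for illustration; they are not the talk's actual car table.

```python
# Sketch: applying a "data dictionary" of column names to an unnamed table.
# JMP does this by pasting into the columns list; here we mimic the idea
# with a list of unnamed rows and a separate list of names (both invented).
rows = [
    ["Volkswagen", 1985, 5],
    ["Acura", 1987, 3],
]
data_dictionary = ["Make", "Year", "Doors"]  # names sent in a separate file

# Zip each row with the names to get records keyed by the proper variable names.
table = [dict(zip(data_dictionary, row)) for row in rows]
print(table[0]["Make"])  # Volkswagen
```

The same mapping works in reverse: the keys can be copied back out, edited elsewhere, and reapplied — which mirrors the copy-out/paste-back workflow described above.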
That adds in all your column names, and it ends up being an incredibly powerful prep tool, because you can now just take those data-dictionary column names and put them in. You can also copy them back out: if you copy from there and need to make edits, you can actually do that in other software and then copy them back in. If you want to see that again, I'll just undo. With them selected in the columns list — you'll see when I select, I can tell this section is selected based on the indicator at the top — you are now highlighting not the data but the columns themselves. I'll just copy out the blank column names here, and you can see I can paste them back in. So I can go in either direction, and it becomes incredibly powerful — something to know and certainly something to use. Let me just hide PowerPoint so I don't keep doing that — there we go. All right, so pasting in column names ends up being incredibly useful; I'll do that one last time for you here.

All right. Within data import there are, of course, plenty of other things we could talk about; I'm just going to keep going because I want to get to some other tools. But be aware that there are lots of things JMP does to make importing easy. Of course you've seen the data preview when you import from a text file, and hopefully you've also seen that there's a multiple file import. I'm not going to cover that today, but if you do have multiple files of the same type, you can let JMP pull them all in and stack them together for you — you can do really powerful things with that. We can come back to it if it's of interest to anyone later, but I want to keep going.

So: data restructuring. This is a big class of non-destructive methods, and it ends up being one that will enable you to do a lot of things in JMP that you couldn't do previously if your data come in a certain way. I want to talk about Stack and Split, and some options with them, in ways you might not know to use them. Just to give you an example, I built this really simple base table, and I've labeled things in a particular way because I want you to notice where observations go. I have four rows — I actually have a column called Row, because once we change how the table is oriented, I want you to see where values go — and we have columns A, B, and C, with each value in each column labeled by its numerical position in the data set: one, two, three, four for each. What if we needed, for some analysis, a variable that identified these A, B, and C columns? For instance, what if A, B, and C were different attributes of the same thing, and we wanted a variable in some analysis that we could call "attribute"? Stacking is an operation that takes data like this — which we call wide data, where these variables represent some units and are spread across columns — into a table that's called tall, where we have a longer table but also a variable that identifies the A's, B's, and C's. Let me show you what it looks like; hopefully that will make it clearer if you've never seen a stack. I'll go to Tables > Stack, and I'm not going to change anything yet, but I'm going to check Keep Dialog Open, because I want you to see what happens when I do the simplest of stacks. I'll take A, B, and C, put them into Stack Columns, and click Create. If we look at the two tables and at how the observations were moved: row one here, with a1, b1, and c1, is now represented across three rows — A, B, and C. The previous column headers are now part of a variable called Label, and the data — a1, b1, c1 — are now in tall format across the rows, in a column called Data. Now, Label and Data are just the names that were specified here. Let me delete this table and do it all over again with Tables > Stack.
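JMP performs the stack through the Tables > Stack dialog; as a language-neutral sketch of what the operation actually does to the rows, here is the same wide-to-tall reshaping in plain Python. The `stack` helper and its default `Label`/`Data` names are my own hypothetical mirror of the dialog, not a JMP API.

```python
# Sketch of the simple stack: wide data (columns A, B, C) to tall data
# with one Label/Data pair per former column, mirroring Tables > Stack.
wide = [
    {"Row": 1, "A": "a1", "B": "b1", "C": "c1"},
    {"Row": 2, "A": "a2", "B": "b2", "C": "c2"},
]

def stack(rows, stack_cols, label="Label", data="Data"):
    tall = []
    for r in rows:
        kept = {k: v for k, v in r.items() if k not in stack_cols}
        for col in stack_cols:  # each stacked column becomes its own row
            tall.append({**kept, label: col, data: r[col]})
    return tall

tall = stack(wide, ["A", "B", "C"])
# Row 1 is now represented across three rows, one per former column:
# {"Row": 1, "Label": "A", "Data": "a1"}, then B, then C.
```

Renaming `label`/`data` to, say, `Chemical`/`PPM` corresponds to filling in the Stacked Label Column and Stacked Data Column fields in the dialog.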
What if we knew that these particular values — a1, b1, and c1 — were some specific type of outcome? Maybe they're parts per million of some measurement we have, but of different chemicals. So we have PPM as the data and Chemical as the label — that's what A, B, and C actually stand for. If I create this stack, with Keep Dialog Open again, JMP will actually label the columns for me appropriately. So notice that Stacked Data Column and Stacked Label Column just define the names in the data set that comes out; really, the only thing you need to specify is which columns you want to rotate and put across rows. Now, this becomes very powerful, because if you need to use that variable within an analysis — for instance, in Fit Y by X you'd need a variable that defined the chemical, and parts per million would need to be the response (of course, it would be numeric in a real example) — that wouldn't be something the data table would afford you in the original format. If I had the original table and wanted to analyze how parts per million varied across chemicals, there's no way I could define that: I don't have a variable, like I do on the right-hand side, that lets me define the different chemicals. So when you start thinking about how your data is oriented, you may ask yourself: do I need a variable that defines this thing?

Let me go to an actual example of this simple kind of stack that gives us some real variables, before we do a multiple-series stack, which is a different kind of stack. I'm just going to pull in the sample file — I guess I linked it wrong here — it's called Dogs, and it's in the sample data library. If you don't know where that is, it's under the Help menu, under Sample Data. I'll pull up Dogs here. These are measurements of histamine levels for dogs that are on different drugs, and what you have is histamine at different time periods. The measurement is numeric, and we have these different dogs — you can see the ID numbers increment — so dog one is all across this row. If we wanted to ask whether histamine changed across time, we could probably, with Graph Builder — although it's a little harder — craft a graph where we have histamine 0, 1, 3, and 5 for each dog and draw a line between them. It's possible, because Graph Builder is very flexible, but it would be a lot easier if we just had a variable that said time period — 0, 1, 3, or 5 — and then had histamine level as a single outcome variable. So let's try this one. I'll go to Tables > Stack and take Histamine 0, 1, 3, and 5. We know the data here is actually histamine level, and the label — what those levels define — is time period. I'm going to keep the dialog open, and now I'll click Create. But I want you to see that it gets complicated, because I have all these other variables in the table that are just replicated — dog one had a bunch of other values that we're not actually using for this analysis, and JMP has to put them across rows as well: if we're going to represent 0, 1, 3, and 5 for dog one, all those other variables have data, and JMP doesn't just want to eliminate it unless you tell it to. But for simplicity, let's actually do that. I'm going to show you another option within the Stack window: I can choose to keep only certain columns in the new table. I just want to keep Drug and ID — and of course JMP will keep the histamine columns, because I'm stacking those — but I want to make a simpler table, so I'll tell JMP to keep just those. I can also do the inverse — drop the ones I select — but I'm going to say keep these. You'll see this makes a much simpler table: dog one is right here, there are the 0, 1, 3, and 5 time periods, and there's a histamine level. Now, just so you can see what I mean about the affordances of this structure, if I go to Graph Builder I can say: show me the histamine level on the
basis of time, and for ID — let's make that nominal — I'll put that as an overlay and do some lines. So now I have, for each dog, across each time, the histamine levels, and I can actually visualize them. If I were working with the original table, that becomes a lot trickier. There are ways to do it, but notice that instead of having a variable I can put on the x-axis, I have to sort of convince JMP to use all four of those columns and draw across them, and you have to do some work combining the scales, making them parallel, and some other things — we won't do it. It's a lot more work. If you have a variable that identifies time period, it's as simple as putting it on the X. So: affordances of a stacked structure.

Now that we're in the stacked structure, let me just minimize this one and go backwards. Remember, if we hadn't eliminated columns, this would be non-destructive, but for simplicity I'll stick with the version where we did eliminate those columns. Moving backwards, I can reconstruct that wide-format table pretty simply. It's under Tables > Split. Split is the inverse of the stack operation, and the way it works is that you have a Split By column and a Split Columns role, which sound very similar, but once you've used it enough you'll see why they're labeled exactly that. Split Columns is the data you want represented in the new table — which column actually contains the histamine levels you want to reorient — so that's hist level. Split By says: what is the variable that will identify the new column labels — what variable held the column headers before — and in this case it's time. Now, Group is for when you have something that identifies the individual units and you want to enforce that it's stored in the new table; it also aligns the responses. In this case we have ID. ID identifies the units, because dog one is what's represented across those four rows. Almost always JMP will figure it out on its own if your tables are well formed, but I like to always specify it if I have it — saying what identifies the units. You will again have the option of keeping all the other columns of data or dropping them; I usually select Keep All, and I usually select Keep Dialog Open, just so I can always go back and make changes if I need to. So let me hit Create, and you'll see that we get back to the original table — it's just been reoriented. And that missing value was always missing, by the way; it didn't just lose a data point out of nowhere. So you can go back and forth all day: stack your split table, then split your stacked table again. There's no need to do this — no one has fun like that — but you could. Okay, any questions on what I'd call the simple stack and split operations, before we look at a slightly different version?

Good. Let's look at a multiple series. Sticking with this base table idea, first I want you to take a second to soak in what this looks like. Again, imagine we have chemicals, and these are parts per million we're measuring, so we have chemicals A, B, and C. What if, instead of just chemicals A, B, and C, we had two slight variants of each? They were really the same chemical — A1 and A2 — just manipulated in some very tiny way. So we would say a1 and a2 are really the same thing, just slight variants; b1 and b2 are the same thing but slight variants; and c1 and c2 are the same thing with slight variants. We could stack this whole table — and I'll just do this quickly: if we stack, taking all the columns, without doing any of the labels — we could stack, because it really is true that these are different columns. But what if we wanted something slightly different: we actually wanted series one — a1, b1, and c1 — to be all in one column with the data, and we wanted series two — a2, b2, and c2 — to be aligned in a column as well.
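Split, like Stack, lives in a JMP dialog; to make the inverse relationship concrete, here is a minimal sketch of the same tall-to-wide logic in plain Python, with an invented dogs-like table. The `split` helper and its argument names are hypothetical, chosen only to echo the dialog's Split By / Split Columns / Group roles.

```python
def split(rows, split_by, split_col, group):
    """Tall-to-wide: values of `split_by` become the new column headers,
    `split_col` holds the data to spread, `group` identifies the units."""
    out = {}
    for r in rows:
        # one output row per unit, aligned by the group variable
        unit = out.setdefault(r[group], {group: r[group]})
        unit[r[split_by]] = r[split_col]
    return list(out.values())

# Invented tall data: two dogs, two time periods each.
tall = [
    {"ID": 1, "Time": 0, "Hist": 0.04},
    {"ID": 1, "Time": 1, "Hist": 0.20},
    {"ID": 2, "Time": 0, "Hist": 0.06},
    {"ID": 2, "Time": 1, "Hist": 0.38},
]
wide = split(tall, split_by="Time", split_col="Hist", group="ID")
# wide: one row per dog, with a column per time period
```

Note how supplying `group` is what keeps each dog's responses aligned on one row — the same role the Group field plays in the dialog.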
That way we end up with a1-1 and a2-1 on the same line. If you can imagine that, it would actually be nice, because sometimes we want to treat those in a multivariate way — they're the same instance, the same unit, across that other type of variable. So let's see how we would do this. The simple stack gives us a tall table with 24 rows, because we've just moved all the data from columns to rows. But if we want to treat them as multiple series, there's an actual option in the Stack platform: the Multiple Series Stack option. Let's take a second to look at it. Before I change anything, I'm just going to take our columns — there are six of them — and put them into the Stack Columns section. The way JMP asks you to define a multiple-series stack is by asking how many series you have; it can then figure out which orientation the columns are in. If I'm considering my series to be this block of ones — this is all together a series, and this is a series — then we have two series. Or we can say that we have three series: a1 and a2, b1 and b2, c1 and c2. So depending on which orientation we're thinking about, those are either two series or three series. In the situation where we have two series, notice that all the series-one columns are contiguous — they're next to each other — and all the twos are contiguous; in the other orientation they're staggered: that's one series, this is one series, this is one series. To make this a little more concrete: in the definition here I can actually specify how many series I have and whether the columns are contiguous. Let's try that first option — we have two series, the ones and the twos, and they're contiguous, next to each other — and let's see what kind of table that gives us. There it is. Now I'm going to go back really quickly and do the same thing the other way: I'll hit Recall, and I'll say that we have three series that are not contiguous — that's treating a1 and a2 as the same thing, b1 and b2 as the same thing, and c1 and c2 as the same thing — and I'll hit OK. These are both perfectly legitimate stacks, just treating the data a little differently. Let's look at what happened. At the top, where we said we had two series, we now have three rows representing that first row of the table, and a1 and a2 are on the same row — we've treated a1 and a2 as a series, so they actually get stacked, and we have a1-1 and a2-1 together. Compare that to the situation where we said we had three series — which is really saying that all the a1, b1, and c1's are one thing and the a2, b2, and c2's are another — that's down here: we have a1-1, b1-1, and c1-1 on one row, and then the twos as a separate row. So depending on what you're after, these are both perfectly fine stacks; they just give you data in different orientations.

The multiple-series stack — I'll be totally honest — is one that, when you first start using it, you end up playing around with a few times. You'll end up keeping that dialog box open; I highly recommend you always check that button, because you frankly will end up stacking multiple times to see what's going to give you the data you need. The big thing to remember is whether you're talking about contiguous columns or non-contiguous ones. Contiguous means that each series has all its columns next to each other, and then the next series starts as a group of columns. Now, I should say it doesn't really reflect the table itself, just how you enter the columns into the Stack Columns section. It's nice to take them as a block and put them in, but you can also enter them individually — so contiguous versus not contiguous doesn't really depend on the table so much as on how you define them. All right, so that one's a little more complicated. Any questions about the setup of that, at least, or about what we ended up with?
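The contiguous-versus-staggered distinction is easier to see spelled out. Below is a hypothetical sketch, not a JMP API, of how one row of the base table gets rearranged under each definition; the column order follows the table as described in the talk — the block of ones (a1, b1, c1) followed by the twos (a2, b2, c2).

```python
# Sketch of the multiple-series stack applied to one row of the base table.
def multi_stack(row, stack_cols, n_series, contiguous):
    if contiguous:
        # each series is a consecutive block of columns
        size = len(stack_cols) // n_series
        series = [stack_cols[i * size:(i + 1) * size] for i in range(n_series)]
    else:
        # series are staggered/interleaved through the column list
        series = [stack_cols[i::n_series] for i in range(n_series)]
    # align position j of every series on one output row
    return [tuple(row[s[j]] for s in series) for j in range(len(series[0]))]

row = {"a1": "a1-1", "b1": "b1-1", "c1": "c1-1",
       "a2": "a2-1", "b2": "b2-1", "c2": "c2-1"}
cols = ["a1", "b1", "c1", "a2", "b2", "c2"]

two = multi_stack(row, cols, n_series=2, contiguous=True)
# a1 and a2 land on the same output row:
# [("a1-1", "a2-1"), ("b1-1", "b2-1"), ("c1-1", "c2-1")]
three = multi_stack(row, cols, n_series=3, contiguous=False)
# the ones together, the twos together:
# [("a1-1", "b1-1", "c1-1"), ("a2-1", "b2-1", "c2-1")]
```

Both calls consume the same six columns; only the series definition changes which values end up aligned on a row — exactly the choice the dialog's series count and Contiguous checkbox express.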
Okay, let me show you a real-world example of that; maybe it works a little better once you see it, rather than just these abstract columns. I'm going to pull up this teaching survey. This one has the situation where the column names as entered are ridiculous. Actually, here's another nice thing: you can take the first row of a table — often, from Qualtrics especially, that row has the actual field names — so instead of those ridiculous column names, I'll just copy the first row, select all the column names, and paste. Now that fixes it. There are duplicate columns, because two were identified the same, so I can just delete one. All right. Without getting into it too much, this is a variant of a teaching survey where we asked people a bunch of questions about the kinds of classes they taught. In here you have essentially five courses that people could be asked about, and for each we asked what they're teaching this semester, how many students they have, what level the course is, and so on — basically, for each of the five classes they entered, they were asked a series of questions. So we have five series if we think of the courses as the series unit, or we have as many series as there are questions. Just to make this a little more concrete, let me get rid of a bunch of these columns — normally I wouldn't delete columns, but let's do it for this situation — so we make it really simple: we have Response ID, which identifies the individual, and then three things we asked: the course name, how many students they had, and the level, graduate or undergraduate. So we have three separate series here for each individual. Some people only had three classes, so they only entered a few, and some people had ten classes, so they entered a bunch — I guess they could only enter up to five. In this case we have three series, and if we want to stack it, I'll go to Stack here, drop them all in — I just took them as a clump — and then say we have a multiple-series stack with three series. And we would say not contiguous, because the series themselves are in separate places. Let's keep the dialog open to make sure I did this correctly... these should all be "tell us about the course" — course one, two, three, four — actually, these are contiguous, I think. Let's try it again; like I said, I get this wrong all the time, and these are such long column names, oh my god. Yes, they were contiguous — I apologize. On the orientation: going back to the survey itself, the series identifies what we're asking about. One series was the course name, the second was students, the third was the level of the course, and when we define them in here, they're next to each other — those three series all have their data in clumps. And the order that matters is the order we put them in here — the actual order in the dialog. If we had entered them separately, we could take them like this — let's actually do it; it's a lot easier if your table is already oriented, because then you're not defining it this way — but if you do this, now it's three series that are not contiguous, and that should — fingers crossed — produce the same table, with these unruly long column names. There we go. Yep, those are all the same question, with the data aligned. All right: the multiple-series stack. Sometimes that's how your data is oriented, and being able to handle it with a single operation in JMP becomes really valuable. Okay, any questions on that one before we keep going?

Great. Data aggregating. Now, there are times when you have to apply a destructive method to make your analysis simple, and one I like to talk about often is when you have multiple observations across, let's say, something like years, and you want to do an analysis, but you don't want to treat those separate observations as independent — you just
want to average over them to produce an analysis. Here's a simple example, also from the sample data. These are US states, and we've measured things like SAT scores (a standardized test), and we've measured them across years. For any given state, let's just pick one, we have eight observations, because the data were collected over eight years. If you wanted to perform some kind of analysis, like how is the verbal score on this standardized test related to, let's say, the salary of teachers, and you looked at that in a scatterplot right now, the clumps you see are really the states; each state is represented eight times in the data set. So if you were to do this as a regression and actually try to get a p-value for it, that's going to vastly overestimate your degrees of freedom. You are treating non-independent observations as independent, and that's not kosher statistically. What you really need is one observation per state, so unless you want to treat this as a mixed model and do something more complicated, what you really need is a summarization or a tabulation. There are two platforms in JMP that make this really easy: Tables > Summary and Analyze > Tabulate. They can do subtly different things, but in most cases one of the two will be what you're going to use. Let's look at Tabulate first and then Summary and some of the good affordances there. Tabulate is generally used for pivot tables, but you can also use it for summarization of variables. Just to give us a sense of where we're heading: what we're going to have is a situation where we have states across rows, and (I'm going to do this more slowly) we'll have verbal scores as a mean, and we're also going to have salary for each state as a mean, and this is going to give us a table that we'll be able to analyze just like before. I'll do it in Graph Builder again, where we have verbal scores on
the basis of salary, with a regression, but this time we'll only have one data point per state, so we won't be overestimating the degrees of freedom. Let me back up. The way Tabulate works, which is really powerful, is that it's drag and drop, like Graph Builder, and there are drop zones that identify the locations where you can put columns. Depending on where you drop things, it'll do different things, and there's always this Undo button, so it's easy to go back if you put something in the wrong place. For this type of restructuring, this type of tabulation, I always like to start with something that identifies the rows, because it gives you a table that's easier to operate on, and in this case we want our units of analysis to be at the state level: one row per state. You'll see when I hover in different places it highlights, and the drop zone for rows will identify the states for me, so I'll just drop it there. The first thing JMP will do is identify how many rows you have, counting within that grouping, and there are eight rows per state here. Notice that Tabulate is interactive, too: if I select here, it selects in the data table. But that's not good enough for us, right? We now need columns that represent the variables of interest, but with some operation over them such that we end up with one observation per state, and the variables we're going to use are SAT Verbal, that standardized test, and Salary. We can take them both at once and drag them into the middle (I don't know why it shows all the columns; we're just dragging two of them). If we do that, JMP will pick a summary statistic. It says, okay, I have eight observations for each state; I need to aggregate them in some way. I can't represent all eight values; if it were to represent all eight values, that would just be our data table again. Now, the simplest, the default, summary statistic is the sum, so it added up all the values in the
table. Sometimes that's useful; in this case it's not. We want to use the mean. There are two ways to change the summary statistic. I can drag Mean on top of that statistic; in JMP's mind, dragging on top of something else almost always means replace. Just so you're clear, you can actually drag to the left or to the right. If I zoom in so you can see it (I can't zoom in while I'm doing that, so let me zoom in first), if I drag to the left, notice there's this tiny drop zone there; in JMP's mind, dragging to the left or to the right of something is the append or prepend operation. So I can have both summary statistics; I could actually drag a whole set of them, if you want to have all these summary statistics for each of those variables, and then undo. So you can do a lot with Tabulate; it makes it pretty straightforward. In our case we just want the mean, so I drag Mean on top of that. Now, if you wanted a table that also retained another attribute of the variable, for instance you wanted to keep, let's say, what region they're in (Region here is listed as a categorical), you can use that appending or prepending drop zone next to State. Remember, I could take Region and drop it on top of State, and that would replace, so now we've summarized at the region level; let me undo that. We have that prepending or appending drop zone; I'll drop it just to the right of State, and since there's only one region in which each state lives, it just gives me that variable appended to this table, which is kind of nice. So we're almost there. We've made this tabulation; I can click Done, and this is a fun little interactive tabulation. It's nice to have; you can print it out; but I think one of the more powerful things is that you can make this into a proper data table. Under the red triangle, Make Into Data Table returns this nice table that I can now operate on, and this is where I can do that proper regression. Again, this is destructive: if
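The Tabulate step above, one row per state with the mean taken over years, maps to a pivot-table aggregation in pandas. A minimal sketch, assuming hypothetical column names for the SAT sample data:

```python
import pandas as pd

# Hypothetical SAT-by-year data: each state appears once per year
df = pd.DataFrame({
    "state":  ["AL", "AL", "AK", "AK"],
    "verbal": [560, 570, 520, 530],
    "salary": [31.0, 32.0, 48.0, 49.0],
})

# Tabulate-style summary: one row per state, mean over the years,
# so a regression of verbal on salary uses one point per state
summary = df.pivot_table(index="state", values=["verbal", "salary"],
                         aggfunc="mean")
```

Running a regression on `summary` instead of `df` avoids treating the eight yearly observations per state as independent.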
you were to get rid of the original table, you can't recover the individual years over which you've now averaged; you're not going to be able to get that back. But sometimes this is what you need for the analysis you're presenting. Okay, so that's Tabulate. If we use Summary, we're going to do something similar, but it's not interactive in the same way. Tables > Summary, albeit very powerful, is a dialog-based method of doing this. Just so you're clear, in Tabulate there is also a dialog-based version, so if you want to use Tabulate and end up with an interactive table, but you like specifying things the way Summary specifies them, you can do that. By the way, Graph Builder, you may not have noticed, also has a dialog option, so there are multiple ways you can pop into these things if you need them. Okay, so Summary. Summary will do the same things Tabulate can do, but it produces tables by default: instead of a report with a table drawn in it, what you get from Summary is an actual JMP data table directly. To produce the exact same thing we just did, we would take Verbal and Salary, and we're going to define a statistic that will operate on them when we group or aggregate over them. If we click on Statistics, you can see the options; I'll select Mean here, so mean for salary and mean for SAT verbal. Now I'm going to keep the dialog open and just click Create, so you can see that without defining anything else, this just gives me the mean across the entire table, which sometimes is really handy; it's a fast way to get the average salary or the average verbal score. It's almost always better to go through Distribution, but if that's all you need, you can do that. But we wanted it across states, averaging over time, so we'll take State and define it as a Group. Group defines rows; Subgroup defines columns. I want to show you what that looks like, so I'll put State in Group and create the table, just so you can see we end up with the table we wanted. But I'm not
going to save that, because I want you to see how some of these other options work. Subgroup defines column operations, so if I take State and put it into Subgroup, it's going to be a kind of useless table for us: now we have one row, with the mean for every state, for salary and also for SAT, all the way across this table. Not super valuable. You could do a multiple series stack at this point; this is a crazy thing to do, you wouldn't want to, but you could put them all in here as multiple series. I think we actually have 51, because it's the states plus the District of Columbia, and we end up with... I didn't do that one right; we have two series, rather, that run all the way across, so you'd have to do Stack a bunch of times. Just so you can see this works, because I just thought of it, let's do multiple series, contiguous. There we go. So you could end up with, for each row, every state for salary and every state for verbal. A crazy way to go about forming that table, but that's what Subgroup does: Subgroup stretches across columns. Now, you may actually want to use both Group and Subgroup, so if we have states here, but we want each region to be separate, if Region were crossed with State, you could have them across different columns. Or, really simply, maybe what you want is something like this: a table where you have regions across columns, just one row, because you want to copy that out. So Subgroup is sometimes useful; when you have lots of levels of that variable, it's not going to be that handy. Okay, one final thing I'll say about Summary, which I think is very valuable. I'll bring back my mean here, put State back in as my Group, and keep the dialog open, and that's fine. The table you produce with Summary is a linked table by default, which means when you select observations in this table, it links back to the original data table. Now, you saw in Tabulate we could do that as well; that was the default report-
level table, which was linked. But one nice thing is that in Summary it's a proper table that also has some special affordances. There are times when you want to aggregate over a variable, like we want to know the mean SAT verbal for Alabama, but we want it back in the original table; we want a new variable in this table that says what the average is for Alabama, for Alaska, for all of those. We'll look at formulas in a minute, which make that easy, but there's a pretty simple thing you can do: by grabbing the column in a summary table, you can drag it back to the original data table, and JMP knows how to do the update operation, so it's actually updated and matched based on the linked column. Just so you have a sense of what I just did there: in the summary table, you click, wait for a second, and now you can drag that column; you just drag it across, and it takes that average but spreads it across the different rows. So within Alabama here, which is these eight rows, the verbal average is the average of those values; it's just taken as a summarization. So Summary is great; Tables > Summary is a really valuable way to create these summary tables, have them be interactive, and it's also very fast. One more point about this, if you're new to JMP: almost all the time, when you do prep methods like this, a source script is saved to the table, so even if you don't write JSL, JMP has written it for you to recreate this summary table. If the original table were to change, say I realize, oh gosh, that's actually 620 for that year, JMP will warn you that your final table has changed; if I just rerun the source script, the table is remade using the new data. It's a really handy thing to have saved. Okay, so that is data aggregation. Let's check on time; it's 9:17, okay, cool. Questions on that before we move on? Okay, data merging. There are a number of these tools under the Tables menu: Join, Update, Concatenate, and Query Builder. They all do slightly
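Dragging the summary column back into the original table is, in code terms, a group mean broadcast back onto the original rows. A rough pandas analogue (hypothetical column names) of both the summary table and the drag-back:

```python
import pandas as pd

df = pd.DataFrame({
    "state":  ["AL", "AL", "AK", "AK"],
    "verbal": [560, 570, 520, 530],
})

# The summary table: one mean per state (like Tables > Summary, Group = state)
means = df.groupby("state", as_index=False)["verbal"].mean()

# Dragging the summary column back into the original table is a broadcast:
# every row gets its own group's mean alongside the raw value
df["verbal_state_mean"] = df.groupby("state")["verbal"].transform("mean")
```

`transform` keeps the original table's shape, which is exactly what the drag-back update gives you in JMP.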
different things, but they all come down to the same basic idea: you have data in different tables and you want to push them together; you want them in the same table. That's very useful in JMP, because JMP requires operations to be performed on data in the same table, with one exception: virtual joins, which we won't really talk about much today, but you can have data stored in several tables and virtually connect them; maybe I'll show that quickly with one of them. Just to give you a sense of what this looks like, I'll show you a final table here. We have these world demographics, and for each region of the world we have all these different variables. It's highly unlikely that this was the way the table was found online; most of the time these are formed by joining tables from different sources. So let's imagine, for the sake of this demo, that you have all these individual tables: a table with population, a table with education statistics, a table with health statistics. You have all these different tables, and what's nice is that something links them all; that is, there's something in each row that allows you to say the data from this table matches the data in the other table, and that's Territory. Almost always, you need some unique identifier that allows you to make this relational system, and if you're familiar with relational databases, that's the fundamental idea: something identifies the relations between the data in different tables. So JMP has ways to pull these data together, and this is a problem where any one of these tables identifies all the rows, all 239 of them, but you have variables you're pulling together from different sources, so this is an operation best served by joining or updating. Concatenating, which I'll talk about a little later, is where you have the same columns but rows from
different sources, and you want to put them together tall-wise. Right now we're talking about expanding the width of the table. So let's start with any one of these and look at how a platform like Join works. Here's a table with just life expectancy, so we have just four columns, and let's imagine we want to attach a table to it with the health statistics; we want to bring these health measures in. Now, this problem is actually pretty simple, because the rows are all in the same order, but for the sake of this demonstration, let's imagine they're either all out of order, or you only have some subset of the rows in one or the other. We're not going to simply do a copy/paste; here we probably could and be fine, but let's imagine it was more difficult. Tables > Join is a great place to start. You start with the table selected, the final table you want to add the data to (or you can produce a new table, but we're going to add the data here). If you go to Tables > Join, it asks first: what would you like to join this life expectancy data with? What table do you want to join it with? Let's pick Health; that was the other table I had open most recently. If I were to do nothing else right now, I'll keep the dialog open and hit Create, JMP is going to yell at me and say I need to match in some way. It's not just going to willy-nilly put the data together without you defining something about it. So under matching specifications we need to identify to JMP something that's going to relate these two tables. Now, if it really is just row number, you can tell it Row Number and JMP needs nothing else; it'll make a new table where, row by row, it just pushes the data together, and we end up with this table, now with eight columns, that actually has the data from both tables. In this case, that actually works; the data were in the same order, so it's fine. But let's again imagine this was more difficult, so we're not
going to use row number. Cartesian Join: I'd venture to guess one out of every hundred JMP users has ever used Cartesian Join. It's a great little tool, and I'll show it in a second, but it's not what we're going to do here, and I won't even define what it is until I show you, because it's interesting. By Matching Columns is what we're going to do, because we have columns in each of these tables, Territory and Territory, that identify the units; they are what match them. The way you define the match for JMP: in each of the tables I can select a column (those are the columns in each of those tables), and I just click Match, and JMP defines the identity: Territory in the first table equals Territory in the second table. Sometimes you have more than one column that identifies the unique units; maybe you have first name and last name as separate columns, and the matching is based on both of those things matching, so you would specify another pair of columns and click Match. In this case that isn't what I would do, but if you have a table where two columns identify the unique units, that's what you'd do. There are checkboxes here for how you want to handle circumstances that occur when you do a matching. If you have multiples, let's say a certain region, Albania, is represented multiple times in the match table (the table you're joining with the original table), what do you want to do? Do you want to drop the multiples, or keep them? Right now I don't have any checked, so it's going to keep any combination or any multiple that it finds, any duplicated rows, but I can toggle which ones I want to have. What about non-matches? What if you have a really big first table that has all sorts of countries in the world, and your combining table only had a few? What if you only wanted to keep the complete records, so you only wanted to keep the few rows that matched when you merged the tables with the small second table added? So
you can say include non-matches from the first table, and this little diagram helps you see that it keeps everything on the left, the first table, and adds just what it can from the right. You could do just the with table, keeping what you can from the right and joining what you can from the left, or do the full: keep everything, whether or not it matches, so you have a table where some data is complete from the left-hand table, some data is complete from the right-hand table, and some data you actually have for both tables; that's the inner join. So again, this is one where I would say keep the dialog open and play around with it. Most often I leave these as is and keep it an inner join; I like to have the complete data. Sometimes it makes more sense to have all the data, whether or not it's complete across the two tables. Okay, so those are the options under matching. Finally, there are some options for how you want to produce the table: do you want to preserve the first table's order or the second table's? Do you want to merge the data into the same table or produce a new table? I can update the first table, which is what I'm going to do here. If there are columns in both tables that have the same name, what do you want to do with them: match them or merge them? Territory, in this case, appears in both tables; I'm just going to merge them by default. If you toggle that, you get something called a match flag: an indicator of one, two, or three that JMP will write to a column, saying what it did with the data in this row. Did I find it in both tables, did I find it in the first table, or did I find it in the second table? That's nice to have. And then some things about how you work with formulas: do you suppress them before or after they get copied, or do you copy them in the first place? Let's just leave all that as it is. I'll hit Create and we'll see what happens. I only have one Territory column, because I merged them; I have
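The non-match options above correspond directly to join types, and the match flag has a pandas counterpart too. A hedged sketch with made-up territory data (not the actual world demographics values):

```python
import pandas as pd

life = pd.DataFrame({"territory": ["Albania", "Belgium", "World"],
                     "life_exp":  [78.5, 81.4, 72.6]})
health = pd.DataFrame({"territory":  ["Belgium", "World", "Chad"],
                       "physicians": [3.1, 1.5, 0.04]})

# Inner join: only territories found in both tables (complete records)
inner = life.merge(health, on="territory", how="inner")

# Full outer join with indicator=True, pandas' version of JMP's match flag:
# each row is marked left_only, right_only, or both
full = life.merge(health, on="territory", how="outer", indicator=True)
```

`how="left"` and `how="right"` cover the "include non-matches from the first/second table" checkboxes.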
all the columns from each table, whether or not there were data in the first or second table; in fact, some of these are missing altogether, but it kept them, and I now have the columns from both tables included. And the match flag here, notice, is writing a 3, 2, or 1; in JMP 14 we added a value label, which is a good idea, because before it just said 1, 2, or 3, and now it says it was found in both tables. So that means World, as a row, occurred in both tables, and here it is in this final table. Okay, before I do anything else, let me show you a quick representation of what a Cartesian join is, because it's going to come in useful sometimes in your work. Let's say you have two tables, and what you want is a new table that has the combination of all their levels: I want a table that says, in one column, A, and in the second column, 1; then A and 2 in another row, A and 3 in another row; then B 1, B 2, B 3; then C 1, C 2, C 3. That's called the Cartesian join. So if I take one of these tables, I'll take the first one, and do Tables > Join (let me open that back up; there we go): I'm going to join Untitled 78 with Untitled 77, and for the matching I'll do Cartesian. I'm going to do nothing else and just see what it produces for me: I have A 1, A 2, A 3, B 1, B 2, B 3, and C 1, C 2, C 3. You may not think you'll ever need this, but it comes in handy sometimes when you're generating column names, say you have something collected from a machine where you have these combinatorial combinations; instead of writing a script, you can just make tables and have it produce this list of columns. We'll actually see that in one of our examples later. It's a surprisingly useful thing, occasionally. Okay, so let me back up. We just did one join, and if you were watching, we have a bunch of these tables that we're going to have to merge together, and that could be done
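The Cartesian join demo, every level of one table paired with every level of the other, is a cross join in code. A minimal pandas sketch of the same A/B/C by 1/2/3 example:

```python
import pandas as pd

letters = pd.DataFrame({"letter": ["A", "B", "C"]})
numbers = pd.DataFrame({"number": [1, 2, 3]})

# Cartesian join: every row of one table paired with every row of the other,
# producing 3 x 3 = 9 rows (A1, A2, A3, B1, ..., C3)
cart = letters.merge(numbers, how="cross")
```

This is handy for exactly the use the talk mentions: generating all combinations of labels without writing a loop.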
with a bunch of joins iteratively, because Join only does one table with one other table at a time, but there's a much faster way, and that's JMP's Query Builder. This was introduced in JMP 13 and got better in 14. Query Builder uses JMP's implementation of SQLite on the backend, so it's really just a local version of a relational database, and we can use tables as if they're tables in a database. It'll find the relations between them; it does a pretty good job of guessing what the keys are, what the rows and columns are that match, such that it can produce a table, and you can do the things you would do with a database. So I'm just going to specify, arbitrarily, Education as my primary table, and I'm going to say that the rest of these are all secondary tables: tables that have data I would like to add to Education. My recommendation: whatever your primary table is, it works best if that primary table is the tallest, the one that has all the rows you could possibly match on. I'll add the tables; Education is a duplicate, so let's add everything but Education, and clear these all out. Okay, there we go. I don't know why I can now only cancel... Dogs! I do not want to join Dogs. It has found, very noticeably for me, that there wasn't a column relating Dogs to my world demographics, which is good, because there's no way those could join; I didn't have world regions in the dogs data set, so I'll delete that out. Now I have the option to build a query. You can click Run Now and JMP will just make guesses about what you want; I'm going to click Build Query. What we end up with now is the section where we can define, among the available columns, what the final data table we want should look like. I'm just going to say Add All: I want all the columns that were in all those tables to be the columns in my final table. And if you look, I get a preview here; it's 200-something rows long,
well, actually a hundred, because it limits the preview, but now we have the columns from every table, and in fact we also have that Territory column duplicated multiple times, because JMP, agreeably, since I said I wanted all the columns, is putting them all in there. If I want, I can remove them from here, but I'm just going to leave it be. Just so you can see: JMP has actually also written the SQL it's forming, so if you want to give it to somebody who knows how to write SQL, you can have JMP write it for you with Query Builder. I'm just going to click Run Query, and you can see that in one step JMP has produced this final table with all the columns from all those tables, matched on those rows. A real time-saver. One final point I'll make: the scripts that JMP writes here include some really powerful ones. This joined table, this query-built table, gives you the option to modify the query; I won't save that Query Builder window, but if I click Modify Query, it brings back the Query Builder window so I can redefine how I did the joins and how I did the work. And, really powerfully, if all those tables lived on your computer, or on a remote drive somewhere, and people were updating them, Update From Database will call those tables again and redo the join, so if there are new tables or new data that people have put in there, it'll rebuild it for you. Query Builder is really powerful; it's beyond our scope to talk about everything it can do, but if you're constantly finding yourself joining, I would highly recommend learning more about it. All right, I have lots of windows to close. Just so you know, if you're ever in this situation, you can do a Close Multiple; if you have hundreds of windows open, it's a nice thing to have, and you can select only the data tables, or only the scripts. That's Window > Close Multiple. I've already gone this far... all right, any questions about
those joins? [An audience member asks whether you can see the sequence of prep steps visually, like a schema or flow diagram.] Yeah, that's something I find a little difficult as well. Literally everything is scripted, so whatever I'm doing can probably be captured and customized with a script, but you don't get the visual, and that's something we've really wanted; I definitely echo your desire for that, because it would be beautiful to see, in a visual way, here are the steps I did and here's the order. There have been some attempts; I've seen people build add-ins that sometimes try to do that. My typical method is to build up a script window: for every step I do within a process, my favorite thing in scripting is to call something, say, dataPrep, and wrap it in an expression handler with Expr(), and if you toggle the right things in JSL you end up with these little foldable code snippets, so you can run them. It's not as visual as what you're describing, but you end up with something you can look at as a totality: you may have data prep as one section, imports as another, and so on, and you can run them all in a series. I actually have a webinar called JSL Scripting for Reproducibility, and I talk about a lot of this; I'm happy to send the link out to everyone if you want. It covers a lot of these data prep methods and ways you can store your steps and procedures for reproducibility purposes, which is exactly that. It's not as visual, but I echo the need; we would like that too. Okay, I'm going to skip Concatenate for a second; we'll
come back to it, and to Update as well, with a separate table, but I'm going to leave data merging temporarily, just so I can make sure to show you some additional things. I think we're good on time until 10 o'clock, right? Okay, we're about halfway through, and some of these go rather quickly; they're things I just want you to be aware of that can really improve what you can do with JMP. So, table views. I'm going to pull up a table here; this is just restaurant tips. In case you've never seen it, JMP has lots of ways you can look at pieces of your data, and this ends up being very useful when you want either to simplify the problem in your head, when you have too many columns, or you literally need JMP to ignore certain rows or columns completely. For example, what if the first ten rows of this table are all you cared about, and the first two columns are all you cared about? Under Tables > Subset you can define how you want to pull out this little bit of the table. Subset is very powerful; it can do lots of things we won't cover, but the simple operations concern what you want to do with the rows you have selected and the columns you have selected, and I can tell JMP to give me a subset table of only the selected rows and selected columns. I'm going to keep the dialog open here, and now I have this little subset of the table. It's actually linked... well, notice I didn't link it here; let me do it again with the Link option. When I do that, this is the same table: if I change data in one, it changes the data in the original, and this ends up being really useful if you're typing data back into a table and you just want to focus your attention on one piece. So as a data prep method it's quite useful: operations you perform in the linked table go back to the original. We can also have them not linked, so it'll create a subset, but with only the two columns selected at
that point it'll create a subset that doesn't affect the original table, something you may want to modify or work with separately. You can also specify just the rows you have selected but keep all the columns. So Subset is a really powerful way to pull out that little bit of your table. One additional thing I'll mention about Subset: Subset By ends up being very useful. Sometimes you have something in your table that you want to break out into multiple tables. We have Server here, A, B, and C; I can tell JMP to subset by Server, keep all the rows and all the columns, and it'll produce three tables for me: one each with the rows from servers A, B, and C. I find this very valuable when you have to split apart a table and send different versions to different people. And if you want to bring them back together, that's the table operation we didn't look at yet: Concatenate. That's taking tall versions of tables that have the same columns and squishing them together, so I can say servers A, B, and C, those are separate tables with the same columns, and I want to pull them back together into a single table. Subset By and Concatenate are inverse operations, as long as you're using the same variable to split and combine them. All right, so subsets are really useful. Sometimes, though, you don't want the hassle of actually creating subsets like this; you can do something even simpler. This is one of my favorites. Let's say you have ten rows selected. Your rows panel down here has some operations you may not have known about: you can see, of course, how many rows you have excluded and selected, and if you right-click on Selected and go to Data View, the data view shows just those rows as a linked subset. So at any point, if you just want to view certain rows, and this ends up being really valuable: let's say you're in Distribution, looking at a bunch of columns, and you've identified a few that
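Subset By and Concatenate being inverses is easy to see in code. A rough pandas analogue with made-up tips data: splitting by a grouping column into separate tables, then concatenating them back:

```python
import pandas as pd

tips = pd.DataFrame({"server": ["A", "A", "B", "C"],
                     "bill":   [20.0, 35.0, 18.0, 42.0]})

# "Subset by" server: one table per level of the grouping column,
# each keeping all the columns (like JMP's Subset By)
by_server = {name: grp for name, grp in tips.groupby("server")}

# Concatenate is the inverse: squish the tall pieces back together
rebuilt = pd.concat(by_server.values(), ignore_index=True)
```

As in JMP, the round trip recovers the original rows as long as the same variable is used for both the split and the combine.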
were weird, so you're like, wow, those had high bills; you can go back to the table, right-click, Data View, and look at just those really extreme rows you had selected. You can actually do that in Distribution as well: if you wanted to look at just the credit card cases, double-clicking the bar gives you the data view of Credit Card: Yes. Be aware that double-clicking a histogram bar honors which columns you have selected. Right now I have no columns selected; if you select the first two columns and double-click the bar, it'll only show you those two. Another trick you may not know: this section here, if you click it, clears your column selection, and if you shift-click it, that selects all. Same thing with the rows: clicking in this section deselects, shift-clicking selects all. I can't tell you how many times people ask me how to deselect columns; it's just that little section. But again, double-clicking a histogram bar anywhere will also produce the data view. Finally, the row editor; I think this is so handy. Let's say something about this row is really interesting, the person who tipped $14 or $15. If we find that in the table, I'll press Command-Shift-right-bracket, which jumps to that row; that's the operation under Rows > Row Selection > Go to Row, or Next Selected. Those are really good things to know. If I double-click the row label, I get the row editor, which shows every variable's value for that row, and these are editable, so if I wanted to correct something, say I found it wasn't $15 but actually $5, I can change it right here. The row editor is great: you can toggle through rows and look at each one; it's a really nice way, when you have a wide table you don't want to scroll across, to edit that particular row. Just double-click the row label: that's the row editor. So
Subsetting, data views, and row editors: all really powerful ways of looking at your data in a way that focuses your attention, without changing it. I find that very, very useful. Next, missing values. We'll get through this one fairly quickly. In that World Demographics data there's lots of missingness, and you can scroll through the table and kind of get a handle on it, but with very large tables you need an automated way of looking at the missingness that's useful. Two things I'll point out that I love. Under Tables > Missing Data Pattern: incredibly powerful. The way this works is you give it all the columns you want to interrogate for missingness; I'll just select everything and keep all the defaults, and JMP will produce a table representing all the different ways the columns had missing data. Those are called patterns. Actually, let me do this with only a few columns first so you can see, then we'll look at it for the whole table. Here's the missing data pattern for just the first three columns, and there were four ways those first three columns had missing or non-missing data. The 0 0 0 refers to the columns: this was the situation where 212 rows of the original table had nothing missing on any of those columns, so 0 0 0 means not missing, not missing, not missing. I selected that pattern here, and notice that in the original table those rows are selected. Those are the complete data, and if maybe I wanted to analyze just those, I could go to Data View, and now I have a table with just complete data. Cool. Let me go back. There was one region that had one missing value in the second column, crude death rate, and you can see that because its pattern is 0 1 0. There were 6 with pattern 0 1 1, missing on the second and third columns, and then 11 that were missing on all three. This is handy: if I select that pattern and say
hey, exclude: this is a linked table, so excluding here excludes in the original table the rows whose data are missing on everything. Maybe I actually want to delete those, since they're not useful cases, but excluding is a safer way of doing that. That's Missing Data Pattern. There are also some saved scripts here that I find really cool. The treemap shows you proportionally which patterns were represented: most of the data was complete, 0 0 0, and if I hover it even tells me that pattern covers 92 percent of the cases; some were completely missing, 1 1 1; and then we had that mixture of missingness. Finally the cell plot, which is really cool: it shows you what those patterns look like, based on the missingness in each of them. The complete 0 0 0, some missing certain variables, some missing all of them. That's Missing Data Pattern, a really cool thing, and it generalizes really well. If you run it with all the columns, it's a little more complicated to look at in either the cell plot or the treemap, but it's showing you the distribution of missingness across all the columns. The patterns are now very complicated, but you can see we have 88 rows with complete data. If you want the fastest way to find how many rows of complete data you have, it's Missing Data Pattern, because it finds, across all the columns, which rows are complete. And there are 11 here that are completely missing, so those 11 rows are wasted on us; we don't need them in the table. And sometimes you just care about the rows that aren't missing on certain variables, and for that you can use Select Matching Cells, which I'll show later: you can right-click on any cell in JMP and select matching, and now I've selected all the rows in this data set that are missing on that one column, or any others I want. Very useful. The other option I'll point
out is Explore Missing Values, which is a bit more elaborate a platform, under Analyze > Screening > Explore Missing Values. This will require numeric data; if I try to give it categorical columns it's going to yell at me, because it's essentially a stepping stone to imputation methods. Those are beyond our scope here, so I'll just let it use the numeric columns. It does a lot of really nice things. I love the missing value report: it simply tells you how many observations are missing for each of the different columns. The missing value clustering is so cool. It's a two-way clustering method that shows which rows tend to co-occur in how they miss data, and the dendrogram, just like in hierarchical clustering, expresses closeness by position in the tree. So if two rows are close to each other (those are territories of the world in this case), their missingness is very similar. If I grab a cluster that's similar and do a data view, these territories have missingness that's pretty similar; why that is, who knows, but we can investigate it later. That at least is kind of cool. And since it's two-way, if you imagine rotating the plot, the other direction shows which variables have missingness that is similar. Some of these make sense: health expenditure per capita tends to be missing when the health expenditure per capita rating is missing, because one is the derivative of the other; if we're missing the first, we tend to miss the second. So it gives you a sense of which variables are related to each other based on the missingness in the data table. (Missing or not? Yes: for the rows, red just means the data weren't actually available.) That's really cool. The other one that I really love is the missing value
snapshot, which lets you look at the table from a really zoomed-out level, with black marking what's missing; that's what your table looks like. That's sometimes really valuable: if your table is ordered in a particular way, maybe by machine type or something, you could spot really quickly that, oh my god, one of these machines was not spitting out data. You'd spot that about as fast as anything. That's the missing value snapshot. The rest of these options are the imputation methods, where you impute under a presumed structure of the data; that's beyond us here, but note that if you do favor imputation methods, they're all here and you can do some really powerful stuff. [Audience question] Sure, you mean if you want to call an R function? Yes. I'll just point to it here: if you look at the Scripting Index, there's an entire section just for how you work with the R commands. I'm already in Functions here, and I'm happy to chat more about it later, but you can do what's called an R Submit: you throw a function over the fence to R, and then you can pull objects back from R. It's really cool, and I like to do it with wrappers; you can write a wrapper in JMP that does that for you. So that's Explore Missing Values. Very cool. Missing variables... oh, how did I get to it? Sorry about that: it's under Analyze > Screening > Explore Missing Values, and once you're there you just define which columns you want to use. Right, missing variables. Unlike missing values, this is where variables you need just don't exist in your table. I'll use Cereal really quickly here, another sample data set. There are lots of ways to make variables, and I want to show a couple of them. Instant formulas in the table are really cool. Let's say you wanted to take, I don't know, protein times fat. I don't know why you'd want to do
that, but let's just say you want to. Select the columns, right-click, and you may not have noticed there's this New Formula Column option, with lots of ways to work with the columns you've selected. In this case, Combine is the operation I want, and I'll take a Product, and this instantly writes a column formula to the table. With two columns selected you don't always have to operate on them jointly; you can do something distributional with them instead, say Standardize, which takes the z-score, and that writes two columns that are the standardized versions. Here's how these work: all three new columns are column formulas. JMP didn't just write the values; in fact, if I right-click and go to Formula, you can see it wrote the actual column formula that operates on those columns, and that's a really powerful thing, because when you add or change data, it updates. That's an instant formula; it's instant in the sense that it's in the data set. I can also right-click, go to an instant transform, and, say, do a log; it just applies it to the data table instantly. (Yes, JMP 12; it was added then.) So that's instant formulas. They also exist in launch windows. Let's say you're going to Distribution and you realize, oh, I didn't do the log of calories yet, and you don't want to go back to the table because you've come too far. Right-click on Calories and you can do it right here as a temporary formula: right-click, Transform, Log. You know it's temporary because it's italicized; it just exists here, in the ether, and if I use it, it doesn't get added to the table, but it is utilized within the platform. And JMP does this in a really clever way: if we look at the script for this output, it's using something called the Transform Column function, so all it's really done is make a transform column, a temporary column that it uses internally and doesn't
do anything to the table; it just calls on the values in the table that way. Really cool, and you can do it anywhere. Let's say you're in Graph Builder and you really wanted to look at fat on the basis of the log of sodium: form the log and put that in the Y. It's a weird thing to do, but you can do it, and the other operations are also there: distributional, combining, transforming, it's all there. So those are instant formulas and temporary formulas; they all work through the operation of a column formula. For any column that carries a formula, you'll see a big plus sign you can click; that's the column formula for that column. If you make a new column, the way you add a formula is to right-click and go to Formula. JMP has a huge number of functions it can work with, some incredibly powerful. The character functions let you do a lot; we could do a whole talk just on the character functions for cleaning up data, but they're worth looking into. Like the Word function, for stripping out particular words of text. Do we have any here? We do: within the names, let's say we want the second word from each of the cereal names. That is a function, and it'll return the second word of all the cereals' names. You can write your own custom functions that way; there's a lot of stuff in there. So column formulas are very powerful. Another thing I find really valuable is making indicator columns from Text Explorer. This is one of those off-label uses: Text Explorer is good for pulling out the words within particular columns. I'll pass the name over here, and let's say you want an indicator column for any time a cereal name mentions wheat. You could write a function that searches; you could go through a lot of work writing custom JSL to step through and look for the word wheat. Or you can just go to Text Explorer, right-click, and
say Make Indicator Column. There you go: Save Indicators. With it selected, I now have the wheat indicator, which is 1 if the cereal mentioned wheat and 0 if it didn't. Pretty useful; sometimes you need to pull out something granular like that. Or whether they contain corn: right-click, Save Indicator. You can select several of these and save all the indicators at once. So maybe you want to know whether they said wheat, corn, or honey. Instant formula: each indicator is a 1 or a 0, so right-click, New Formula Column, Combine, and take the Maximum. That says if any of those was a 1, then I now have a 1 here, so I have a column based on several characteristics of something I pulled from Text Explorer. Kind of useful. Maybe you want to look at only those, so right-click and do a Data View, and now here are the cereals that mention wheat, corn, or honey: a nice way to really quickly dig down to subsets of your data that are interesting. That's Text Explorer; don't overlook it. Even if you're not exploring the text for the purpose of defining a structure numerically, it's a good way to pull apart your data. All right, finally, binning, and I'll mention an add-in here. Let's say you're looking at fat, and instead of keeping it numeric you want it to be, like, low, medium, or high fat; you want to make categorical data out of numeric data. I don't always recommend that, but sometimes it's defensible and it's what you want to do. There's a function for this in the table: go to the Columns menu with Fat selected, then Utilities > Make Binning Formula. A binning formula says how you want to cut up the numeric data so that it has categorical labels or outcomes. (Five minutes? Unbelievable, I talk this much. I'll finish this one quickly.) Offset just says where you want to start the first bin, at negative one, and Width says how wide you want each bin to be; let's say the bin is two milligrams wide, and let's just use all the
defaults. I'll hit Make Formula Column, and now I have this binned version: negative 1 to 1, 1 to 3, 3 to 5, 5 to 7, 9 to 11. So I have a categorical, binned version of that numeric data, and you'll notice it honors the original structure, because it's just made bins of equal size. If we look at it compared to fat, it has roughly the same shape; the smaller the bins, the more it matches the original shape. More often, though, I find that people want to bin based on quantiles: 25% of the data in the first bin, 25% in the second, 25% in the third, and 25% in the fourth. That's where I'd recommend an add-in that one of our colleagues, Brady Brady, put together: the Interactive Binning add-in (I think Jeff originally defined it, or made it). It's really nice because it lets you define the bins based on distributional characteristics. So I'll launch it; I have the add-in here, Interactive Binning, and I'll say let's bin based on fat. It gives me the histogram, and one of the really nice options is setting the cut points for the bins at percentiles, at steps of the 25th percentile, so I want four bins, each with a quarter of the data. (I guess it wants that as a whole number, which is funny; so, 25.) You'll see it gives me the little sections here; they're actually draggable, but I won't change them. Now I go to the red triangle again and save a group column, and I have four bins that are as equal as possible given the cut points I've defined: low to 0.5, 0.5 to 1, 1 to 2, and 2 to high. So that's binning. Again, it's called the Interactive Binning tool, or Interactive Binning v2; very powerful if you're constantly making categorical out of numeric data. Okay, well, I've talked a lot. I was going to leave a lot of time for questions, because in my head I was going to get through this in about an hour. That didn't happen.
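Both flavors of binning just shown, equal-width bins defined by an offset and a width, and quantile bins with a quarter of the data in each, can be sketched in plain Python using only the standard library. The numbers here are invented for illustration; this is an analogy to what Make Binning Formula and the Interactive Binning add-in compute, not their actual code:

```python
import statistics

# Invented stand-in for the fat column.
fat = [0, 0, 1, 1, 1, 2, 2, 3, 3, 5, 5, 6, 9, 10, 10, 11]

def equal_width_label(x, offset=-1, width=2):
    """Equal-width bin label for x, e.g. '-1 to 1', '1 to 3', '3 to 5', ..."""
    lo = offset + ((x - offset) // width) * width  # left edge of x's bin
    return f"{lo} to {lo + width}"

def quantile_cuts(values, n_bins=4):
    """Interior cut points that put roughly 1/n_bins of the data in each bin."""
    return statistics.quantiles(values, n=n_bins)  # returns n_bins - 1 cuts

print(equal_width_label(0))   # '-1 to 1'
print(equal_width_label(4))   # '3 to 5'
print(quantile_cuts(fat))     # cut points near the 25th/50th/75th percentiles
```

Dragging a cut point in the add-in corresponds to editing one entry of the returned cut list. The equal-width version preserves the shape of the original histogram; the quantile version equalizes the bin counts instead.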
Just to give you a quick sense: I want to answer questions, and I can stick around as long as anybody's interested, until they kick us out. I will mention there were other things I was going to talk about, and maybe I'll just record them and put them in the recording later: looking for extreme values in different ways, and wrong values, for things like recoding. Please know there is an amazing Recode tool in JMP. If you have a column where things are stored incorrectly, like people entering the days of the week all wonky, there's this great tool under Columns > Recode, and it's just so powerful. You can basically tell JMP to take a shot at recoding on its own: Group Similar Values will figure out most of them, it may miss a few, and you can group those and then recode to a new variable. You never have to do find-and-replace again, unless you want to. So know that it's there, and learn more about it. Then there are some interactive table tools; we looked at a few of these, like Select Matching Cells, but also remember that the data filters are your friend. In any data set, the Data Filter under the Rows menu gives you a really powerful way to look at subsets of your data. If you want to filter based on calories, you can say only include, and show in reports, calories from 50 to 100, to subset or look at pieces of your data. It's so powerful, and it's available within every platform too: if you're in Graph Builder and you're looking at calories against protein, you can toggle a Local Data Filter on and say, let's filter based on sodium. Show me this output for just the low-sodium cereals, and I can look at how that changes across the range of sodium. So filters are just incredibly powerful. And then there were some case studies, which it was very ambitious of me to think we would ever do, so I'm sorry. All right, I want to pause and actually stop talking in the last few
minutes. Are there questions I can answer for you, or things that sparked your interest? Hopefully you found value, or people learned things; this wasn't all stuff you already knew, right? Okay. [Audience question about filters] Yeah, that's a really good question. Here's the way you're going to do it; I'll use Cereal here, and calories and protein are fine. When you do a filter, what you're talking about are conditional filters. Let's say you wanted to filter on fat but you also wanted a filter on sodium, and you wanted your selection on fat to affect what could be selected for sodium. Actually, let's do it with categorical variables, because it'll be a lot easier to see; I'll switch to the Consumer Preferences data. With categorical you'll see it more easily. Say we're looking at salary on the basis of somebody's job satisfaction, and we have a histogram or box plots. I'll add a Local Data Filter with age, then an And, and, let's say, years in current position. Under the red triangle, if I toggle Conditional, then as I select here it changes what's selectable there, because it's the subset of those rows. Notice what happens; focus your attention on this number, the 15. As I squeeze down the age range, if I'm looking at people up to age 27, there's nobody who's been in their current position for more than seven years, because, you know, we're not running child labor. But if I allow people up to 74 years old, some have been in their position 20 years. So the selection on the second filter is actually changing what's available based on the first. I can go the other direction too: if I'm looking at people who are older, there's nobody with fewer than eight years in their current position. So that's a hierarchical filtering system, and it works the same way for categorical variables. Let me remove this and do it
again. I don't know if we'll be able to see it as easily, but if I pick a few of these and toggle Conditional, you can have a situation (I guess we don't have one where it came up) where some categories disappear because they just don't exist in the subset. Actually, US Demographics is probably a better data set for this. I'll do it here this way: Region, and then I'll add State. Okay, this is what I should have shown first. With it made conditional, if I click on Midwest, only the Midwest states show; if I click on Northeast, only the Northeast states. So this gives you the ability to drill down into it conditionally. If you wanted to script this, and you know that the ordinal position in the data table will always be the same (the first variable is the first one you want to filter by, the second variable is the second), there's a nice construct in JMP where you don't have to call the column name: you can just say Column(1), and that will use the first column. Where is it? Column, I think you spell it out: Column 1, there we go. But I wouldn't suggest that as a general practice, because the order of columns is often something people don't pay attention to. You can get a lot more creative with it and make a dialog where people select the columns. [Audience question] Oh, absolutely; there are interactive ways for that. For instance, I'm going to talk later about health data, and here's a dialog I built that actually calls an R script in the background. So you can make dialogs to do whatever you want; JMP is actually a very good environment for that kind of thing. Great questions. Other questions? Then we'll probably need to clear out; I think the next talk starts at 10:30. Okay, well, we have a little bit of time, then. I will continue
to stick around as long as they will let me, or as long as you're interested. Other things that sparked your interest, or challenges you've faced in your own data preparation? The question about stringing together multiple things is a really good one, actually. I'll just point to a webinar: search for JMP scripting reproducibility and you should get there. I made it mostly for research scientists in academia; it's called The Scientific Workflow in JMP: Creating Reproducible Analyses, and the whole idea is how you script a workflow that you need to reproduce. For a research scientist that's super critical, because they have to defend all the steps they took and show they didn't improperly hunt through the data for output, so creating reproducible and documented analyses is critical there. But it applies to every industry. I always say, only work as hard as you have to, and if you are doing the same series of steps to prepare data every day, there's no reason you shouldn't automate it. And JMP makes it really simple: almost every operation you perform in JMP will write a script, and every output you can produce in JMP has a script; they're always under those red triangles. So with a minimal amount of JSL you can string together a series of operations and make it basically one click for yourself. It's worth spending an hour or two to learn those things to save yourself hours every day; put it that way. Any additional questions?
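To make the automation point concrete: a repeated prep sequence like the ones discussed above (drop incomplete rows, save a text indicator, bin a numeric column) reduces to one callable step once it's scripted. In JMP you'd chain the saved JSL from the red triangles; the sketch below does the analogous thing in plain Python, with a made-up cereal table, purely as an illustration of the idea:

```python
import statistics

def drop_incomplete(rows, cols):
    """Complete cases only: keep rows with no missing (None) values in cols."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

def add_indicator(rows, col, word, out):
    """Indicator column: 1 if `word` appears in the text column, else 0."""
    for r in rows:
        r[out] = 1 if word in r[col].lower() else 0
    return rows

def add_quartile_bin(rows, col, out):
    """Bin a numeric column at its quartile cut points (labels 0..3)."""
    cuts = statistics.quantiles([r[col] for r in rows], n=4)
    for r in rows:
        r[out] = sum(r[col] > c for c in cuts)
    return rows

def prepare(rows):
    """The 'one click' pipeline: the same steps, in the same order, every time."""
    rows = drop_incomplete(rows, ["name", "fat"])
    rows = add_indicator(rows, "name", "wheat", "mentions_wheat")
    return add_quartile_bin(rows, "fat", "fat_bin")

cereal = [
    {"name": "Cream of Wheat", "fat": 0},
    {"name": "Corn Flakes",    "fat": 1},
    {"name": None,             "fat": 2},   # incomplete row, gets dropped
    {"name": "Wheat Chex",     "fat": 3},
    {"name": "Granola",        "fat": 6},
    {"name": "Oat Crunch",     "fat": 2},
]
prepared = prepare(cereal)
```

Each helper stands in for one saved step (a complete-case filter like Missing Data Pattern gives you, Text Explorer's Save Indicators, a binning formula); stringing them together in `prepare` is the reproducible-workflow idea in miniature.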
Info
Channel: Julian Parris
Views: 1,621
Rating: 5 out of 5
Keywords:
Id: faRPP8RRqcM
Length: 91min 13sec (5473 seconds)
Published: Mon Jun 17 2019