Introduction to R Programming for Excel Users

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so I am your host this evening I'm presenter my name is Dave Langer I am the VP of data science at data science dojo which doesn't mean much it's a small company so I wear many hats it's an impressive title it looks good on my card what's more important is I have an extensive background in technology more than 20 years I've had many jobs in that time I've been a developer I've been an architect I've been a manager and most importantly I've spent quite a bit of time over the years doing bi data warehousing and analytics so traditional what these days is called descriptive analytics a last job before coming De Santis dojo that was a senior director of Microsoft where I managed a team of technical program managers that had accountability for all of the data platforms used to run Microsoft's 10 billion dollar-plus supply chain apparently they're making cool notebooks and laptops now too I got hooked on data science about I guess it's almost six years ago now when I was getting my master's degree in computer science I had the privilege of taking an introductory machine learning course from a gentleman who was a member of the second place Netflix prize team and what he taught me was these assets that I've been working with my OLAP cubes my data marts my data warehouses all that stuff was great it allowed me to look backwards in time what happened in the business figure out why understand trends all that kind of stuff but he also taught me that I could also use those assets as a kind of a crystal ball and look forward in time what's popularly called predictive analytics these days and that got me super stoked essentially I got totally addicted and he became an obsession of mine so much so that my wife called it my second part-time job doing cattle competitions reading books studying doing all kinds of stuff one aspect of that obsession was a couple years ago I started a YouTube channel where I put data science tutorials and in fact this meetup tonight was inspired by an actual tutorial I have oddly enough titled our programming for Excel users so if you like what you hear tonight you want more depth more more content I've got about six and a half hours of videos up right now there'll be one last half-hour video next week so seven hours of training total if you're interested in that so you may be wondering why I would leave such an esteemed position at such a great and company to come work at a small startup and the answer is to democratize data science my YouTube channel was started under the idea that you don't need a PhD in machine learning you don't need a PhD in statistics to learn some data science tools and techniques and then apply them in your day job and get business value you don't need that you don't even if that you don't even need a bachelor's degree in those two things so data science dojos motto is data science for everyone and that is the company's philosophy that is our philosophy this meetup tonight as an example of that so it was a good meshing of my philosophy in the company's philosophy so I said this is place I want to be and it gives me an opportunity to talk to all of you nice folks on a sunny sunny Wednesday evening in April who would have thought that okay so what's the motivation for this talk this graphic is the motivation for this talk so the I Triple E if you're not familiar with it is a easiest way to think about it is it's an international standards body one of the things that they worry about or their concern themselves with is computer science out of that concern each year they rank programming languages which are the most popular programming languages in the world and what you'll see here is that our is fifth on the list now you may say so at Dave fifth who cares but this is indicative of this thing that I have a written over here our has been experi interesting rapid year-over-year growth in popularity now if you're not a computer scientist you may not know this but just trust me on this although these six things are is distinctly different these are all general-purpose programming languages for example you can build operating systems in C like Windows and Mac OS X but Python Python is also very popular data science is a lot of you may know but Python actually started as a general-purpose programming language they bolted the data science stuff on later but R was never that way R was built from the ground up for statisticians by statisticians it's not a general-purpose programming language it is a specific language for working with data and that's what makes this so impressive what that tells me because it agrees with my bias is that R is awesome there's a reason why it's a rising in popularity because it is so useful because it's cool and it makes me more productive and allows me to do new and interesting things and deliver business value that's why it's now more popular than c-sharp now if you don't know the c-sharp is Microsoft spent billions of dollars on this sad face somewhere over in Redmond right sad face so I would offer this up that as an excel user this is indicative of why you may want to be interested in R because since it's designed to only work with data basically and it's the fifth most popular programming language in the world that probably means something probably means something ok so now that I've given you some some hard sell on our let's talk about the expectation settings of this this particular presentation so I'm going to assume the following which is what I was asking earlier that your experience with Excel but the only level of experience I'm going to assume is that you're familiar with tables in formulas and using functions and pivot tables no vlookups no VBA no macros nothing fancy like that just basic out-of-the-box clicky clicky clicky in Excel I'm also going to assume that you're not familiar with our a lot of you are not excellent and that you're interested in learning our which a lot of you are and hopefully after you saw the previous slide you're even more interested now so here's the thing we've only got an hour and 15 minutes if I don't get too excited and don't go over to talk about that so we're going to use all excels the framework for a quick intro to our programming but because of the time frame I'm going to gloss over a lot of stuff I'm just going to say don't worry about that don't worry about that don't let me know it's okay don't worry about that and I'm also going to illustrate some art of the possible I'll have some fancier code towards the end of the demo and I'll just kind of wave my hand and say don't worry about this you can learn more about that if you're interested on the YouTube channel but I just want to show you what some of the art of it possible is some of the things for example you could do very easily an R that you cannot do in Excel out of the box at all and here's the thing and this is the most important thing and believe it or not I'm not being cliche I really really really really believe this my goal is to get you excited and confident about learning are confident that as an Excel user you can learn art it's not actually that difficult because you're going to use R for very specific things and the things that you're going to use it for are conceptually exactly the same as the things that you do in Excel and you let me say that over and over again and then hopefully if I show you the art of the possible you're going to go ooh I want some of that too and that's gonna get you excited to learn R that's my goal I'll let you be the judge by the end okay so here's the prerequisites if you're going to follow along you're going to need it Excel obviously you're also going to need the R programming language itself and that is sufficient if you want just use the arc command line but I would also recommend that you get our studio I will be using our studio and essentially it's a graphical environment for doing our it makes it a lot easier makes it a lot more productive if you're not if you're not familiar with using command line tools and if you're going to follow along you're going to need also these two are packages so how many people here are familiar with the concept of Excel add-ins okay almost everybody right you can think of these just as add-ins for our you can reach across the internet you can download it install it in our it gives you additional functionality just like if you put the data analysis toolpak for example which is an add-in for Excel all of a sudden you get a new thing on the ribbon same idea and as we saw earlier there's a github repo the link is in the meetup page it's also in the deck and you can get the source code the data file and the slides as well okay so let's talk a little bit about the data we're going to use this evening we're going to use the Titanic training data set from the Kaggle website if you don't know what cattle is don't worry about it it doesn't really matter what's most important is why would I pick to use this data set as an example and the first answer is everyone's familiar with the problem domain so let's just double check that real quick how many people here are not familiar with the Titanic and what happened awesome second it's actually a good proxy for business data as an example it's a good proxy for customer profile data in a scenario like churn analysis right where maybe I have an indicator variable in my Excel table that says did this customer churn or not and then the rest of the columns are the characteristics the profile of my customers how much they bought how long they've been around where they live you know all that kind of thing so it's a good proxy for that but most importantly everyone's familiar with the problem domain so here's the data in a nutshell since you're all data professionals I figure you'd be interested in the data dictionary or the metadata or the glossary or vocabulary without how would have a term you use so it's a very simple data set so first and foremost there's a column that tells you whether or not somebody survived or they perished here's the class of the ticket so your first second third class their gender their sex their age and years and then two interesting demographic columns this one I call sip spa and that stands for the the count the number of siblings and or spouse spouses although technically no one was married to more than one person on the Titanic when you get the idea it's it's the combination of hey how many people in my family was I traveling with and they happen to be my siblings or my spouse next up is parch which is the number of parents and/or children that I was traveling with aboard the Titanic so this essentially is a demographic variable to tell you something about the family structure and the nature of the family breakdown ticket number how much the the passenger paid in pounds obviously because British so British pounds their cabin number which is interesting because not everyone on the Titanic actually had a cabin assigned to them and where they got on the boat the port of embarkation Cherbourg Queenstown or Southampton so as I said earlier you can see if you squint a little bit you can see how this is a good proxy for a customer profile data for things like churn analysis right binary indicator and then demographic information profile information of the customer okay so as I mentioned earlier I am a recovering Microsoft program manager so because I worked there for eight years it warped my brain and I think of everything in terms of personas and scenarios it's just the way I think about the world so for the sake of this presentation I'm going to do the same thing we are data analysts and in particular we have asked to analyze the Titanic data set so that we can understand more about some of the underlying reasons why certain passengers survived in certain passengers did not typical data analysis tasks nothing fancy going on like I said this has an analogy in business customer churn analytics very popular very important spot-on analogy to what we're doing right now and here's the kicker throughout this presentation I'm going to start in Excel I'm going to start exactly how I would do analytics in Excel maybe your mileage will vary maybe you won't do things exactly the same way I do but hopefully you'll identify that everything I do in Excel is probably something you do in your day job as an analyst we'll go through the thought process we'll do it in Excel and then we'll see how you do exactly the same thing in R and this is super important I mentioned this earlier the most important thing the what what do I do to analyze my data does not change it's exactly the same the only thing that changes is the how and if you want to get in really gross terms in Excel i clicky clicky clicky and in RI typey typey typey although you type in Excel to your obviously if formula isn't that sort of thing but you'll get the idea so that's the gist of it nothing fancy before I dive in please don't hesitate to stop me and ask questions okay so first up what we're going to do is I've got the files from the github and are you going to work please sorry sorry for the AV problems here folks this is going to be problematic there we go hook up so I've got the files from the github here and not surprisingly I've got a CSV file very common right next port from another system a data warehouse who knows so we double click whoops double click it as with most operating systems including Mac the operating system automatically Associates dot CSV files with Excel and it just pops it open now here's the thing I don't know about you but here's the first thing that I do when I get a data like data file like this I turn into a table I turn into a table here and here's the reason why Excel loves tables Excel gives you lots of functionality when you have tables Excel loves itself some tables guess what r also loves tables so that's the first conceptual similarity between the two actually not even similarity the same exact Mis between the two of them Excel works on data in tables well that's a pivot table app our pivot table or just a regular old worksheet table R is the same thing it works on data in table format and gives you a lot of benefit for example notice the first thing here even though I didn't do anything I just loaded up the data and put into a table our said hey this this column e labeled sex the sex column has has text data in it string data in it and it's only got two values male and female guess what I'll let you go ahead and filter on those two categories automatically now those of you out of you're saying Dave that's just the way Excel works but you'll see that's actually pretty awesome not everything works that way and R doesn't actually work that way exactly the same way which is why I'm bringing it bringing it out but if you notice I can do really cool things right I can just say look I don't want males and I only want the females now that I'm in a table and I've got a numeric column I can filter on number ranges the sale I want the females that are older than 16 years of age so on and so forth I get a lot of power I get a lot of benefit out of the box with using Excel tables so I don't know about you but that's my default right I load up a data if it's in CSV format for Sandy I slap it into a table now here's another thing that I commonly do when I look at when I'm analyzing data I take a look at things that may not necessarily play real well with my intended audience so the survived column here is a1 and a0 is binary right one the person lived 0 they did not however if I decide to create a chart and then I present that to some other folks that maybe aren't so necessarily analytically bent I probably don't want my chart in terms of ones and zeroes probably not going to look very good it's probably going to give me more questions than answers what does the one mean Dave what is the zero mean Dave how does should I think about this so one thing that I do a lot is I say okay look I want to make certain I want to transform certain pieces of data in my table to make it more amenable to things like present presenting it in chart format so Excel makes that pretty easy so I can add a column called survive label notice once again that Excel is super handy it's super nice to me it says okay Dave you want to add a column to this table I know what to do I'll go ahead I'll go ahead and add it to the column add the column to the table and I late they add entire column all the way to the bottom of the table with rank data empty data but I know which I know what you want to do you want to add data to this table you haven't told me what you want it in there yet but I've assumed that that's what you want to do and sure enough we can easily populate this with a formula so for example if I say if B 2 is equal to 1 maybe I want the text survived because now my charts will say survived instead of 1 and that's probably a good thing depending on the audience that's looking at the chart otherwise I want died now all of you know what's going to happen when I hit the enter key red magic no not the swirly ball of death ok there you go magic once again notice what Excel implicitly understood Dave you're putting a formula in the first blank cell of a column so I'm going to auto populate all the way down right this is super important I know a lot of you're saying Dave why are you telling me the obvious here we'll get there we'll get there so another thing that we that I do frequently and I'm sure you do as well is I enrich the tight the tables data I enrich the table data so this is purely semantics right or shouldn't say somatic this is purely you know graphical sugar right to make my charts a little more interpretable than this I haven't really added anything to the data I haven't enriched it so that I could do more powerful analysis but that's something that we commonly do in Excel so for example I have variables that we taught that we saw earlier well I have variables earlier that talked about the family the demographics of the family how blings and spouses how many parents and children but what I don't have is any data actually talking about the family size in aggregate because here's an interesting analysis how did family size potentially affect survivability on the Titanic maybe my hypothesis is is that larger families didn't survive at the same rate because it was harder to get everybody together get them through the ship make sure they all got on survival boats so that's hypothesis the data currently doesn't allow me to do that very easily because the family size is actually scattered across multiple columns right now so I can fix that so I know that my family size starts with the individual passenger right because these columns are the number and types of other people the passenger was traveling with it doesn't include the passenger themselves so I have to start with a 1 and then I can add to it the number of siblings and spouses I'm traveling with and then the number of parents and children that I'm traveling with and for some reason my Excel is running exceedingly slow right now but notice it auto populates once again right same thing now these are two very calm yeah oh sorry it's good point so the code is correct if b1 equals one survived died Wow what's going on here man Oh b2 sorry look at that thank you no it's not b3 Dave it is b2 oh that one's b3 sir and that one should be good thank you for pointing that out okay why is that not working that's fine let's do this g1 plus h1 plus that's wrong to you Wow so my apologies folks there we go okay and just to make sure we'll copy this all the way down just so that it's right so imagine if you will you actually typed in your formulas correctly unlike what I did Wow what is going on here okay you know what I'll just do that again real quick this time I'll do it correctly yeah survive label and that equals if B 2 equals 1 then survived survived their wise died alright that looks good and then once again family size if you spell correctly 1 plus G 2 oh my goodness okay there we go all right polls are over the world my apologies on that my bugs notwithstanding the thought process is sound trust me okay now we can see here we have some quite a bit of variability right we have some sixes and sevens and a lot of ones that sort of thing so that may be indicative of what's going on in the data now here's the thing that thought process right the thought process of what I was doing there fixing up the labels so that my charts look pretty adding new features that enrich my data that allow me to do new and interesting analyses exactly 100% the same exactly one percent this thing you do exactly the same thing and are the only difference as I mentioned earlier is the how so let's take a look at our unfortunately all the our code is pre coded so no bugs no bugs all right so I double click on the our file and as with as with Excel the operating system has a linkage so it knows to open up our studio directly from the our file and what we have here essentially is an embodiment of the same thought process that we just looked at in Excel just implemented using R so the first thing that we notice is that we want to load up the Titanic data into what's known as an R data frame in our data frame is just the same it's just the our specific terminology for a table any time you hear the word data frame just think in your mind's eye an excel worksheet table you're in the right spot now here's the thing in some ways R is a little bit dumb well compared to Excel let's be honest in a lot of ways R can be is a little bit dumb so one of the things that we can do one of these we have to do I should say in R is we have to tell it look it's not smart enough to know where the data is that we want to work with so we have to tell it here's the location where the data I want you to work with is on my hard drive so that's actually pretty easy to do so the first thing we do is we go to the working directory go to the session excuse me set the working directory and choose the directory this is how we tell our look on my hard drive this is where the data is located now in my quick access this is where it's located on my hard drive so I just say ok select that folder and you'll notice that it spits out a big long line of ugly code right down here you could type this out by hand if you wanted to but I mean why would you use the GUI it's just easier okay now that we've primed our and we tell it hey this is where the data is located this next line of code is actually going to be super intuitive I would argue for you to look at which is this right here I mean especially look at this well what's that Titanic at sea is we know what that is that's the file we want to look at right that was the thing we double clicked in annex it out and it opened up in Excel we also intuitively understand what this means as well because we created a table in Excel and we said yes please it has headers make sure you treat the first row headers that's exactly what this says - there are headers in this CSV file so that part very easy to understand so the only thing from our that's interesting is this right here now Excel has lots of functions in it hundreds of functions hundreds of functions one of the cool things about Excel is that has a rich help system you can look up all these different kinds of functions in Excel and it provides you a nice help file good news is is that R has the same capability it has a built-in rich help system that you can use out of the box the bad news is is that while Excel has hundreds of functions are has tens of thousands of functions total in the entire ecosystem so good news is that you don't have to worry about all ten thousands of those all at once you can just deal with a smaller subset but you will be using the help system a lot in R and it's really easy to use in our studio you just come down here at the bottom type in the question mark and then type in the name of the function that you're interested in so every dot CSV and easily enough you get the help file on read CSV I won't drain this because you can look at it at your leisure but as you might imagine read CSV obviously reads a CSV file from disk and loads it up into R now the next thing in this piece of code that we should look at is this thing right here this is the name that we want to give the data frame this is the name that we want to give to a table now here's the thing this concept is 100% analogous to excel every table in Excel has a name usually it's implicit most people don't bother naming the table so they usually default to table 1 table 2 table 3 on a per worksheet basis but they thought that all tables have names in Excel and in fact you can actually reference them in your formulas I can be in one worksheet I can type equal sign on a cell and then type in another worksheet name and then access the table by name and then select data out of it and put it in that worksheet so the idea of a table name is integral in Excel you may not use it a lot but it's still there in our that same concept is actually super important you deal with it all the time my tables have names my data frames have names so what this line of code says is our please read this CSV file and assign it to a table with the name Titanic and all this little arrow means is assignment where as I like to say cram it into because it's a very technical term take all that stuff on the right side all that sous-vide data goodness cram it into the thing called Titanic so if we run this line of code this is what we get you can see up here in the up here in the data section I now have a table called Titanic it's 891 observations and twelve variables what that actually means is just think of in this terms that means 891 rows of twelve columns so here's some synonyms for you in the art world an observation is the same thing as a row a variable is the same thing as a column which is the same thing as a feature and when you talk to people that you are whether you're reading a blog post or watching a video they'll often use these terms interchangeably they'll use the term table they'll use the term data frame they'll use a term row then use the term observation they'll use the term column variable feature just know the mapping in your mind data frame worksheet table observation route feature a variable column and you're golden so all the same stuff it's nothing complicated nothing complicated and just to illustrate that if you actually click on this in our studio you actually get a spreadsheet view because it's exactly the same thing it's a table of data there's absolutely no difference okay so this should look pretty familiar to you as well I mean if you squint a little bit it almost looks exactly like the formula code that we typed in an excel it's got a little bit added added extra stuff but conceptually it's pretty close all right this part right here the highlighted blue that's exactly the same as the code that we typed in in excel that's pretty close it's even got the same first two letters but as usual we'll go ahead and open this up in the help file and it says hey if else is conditional elements selection in particular if I run a test and yes the test is true do this otherwise no do that it's the same thing as the F function in Excel exactly the same thing it's just called if in excel they just called if-else in art it's the only difference only difference now it's a little hinky from an Excel perspective is this this chunk of code right here which is the test it looks kind of like what we saw earlier when I typed it in correctly b2 equals 1 but this is a little bit more I had to type a little bit extra stuff in but conceptually it's basically the same thing what this says is our please go to the Titanic table access the Titanic table that's what the dollar sign means access it this is the equivalent of dot or the exclamation point in Excel and go look at the survived column the survived variable the survived feature and then check to see if it's equal to 1 if it's equal to 1 cool that's what I'm interested in now this is one of the things I'm just going to gloss over like I mentioned earlier in programming languages you use equal equal for equivalent I wonder why it just is be sad and accept that this the way it is for right now in our it's just our exceeding excel it's just one equal sign in an R it's two but this says essentially what look give me only check to see if each individual value of survived of the survived column is equal to 1 and if it is then I want this value into place others I want died and if I run that and then I pull up the spreadsheet view I get exactly what we saw in Excel now again the what the thought process is the data analyst between R and Excel exactly the same the only thing that's different is just a little bit different way of doing it that's all yeah so what you're asking for a scanner me close this one um you asked about this part I haven't got there yet yeah then we got there yet thank you for reminding me yeah is our case sensitive yes very very important question I didn't want to bring that up because I didn't want me to people too many people sad but yes our is case sensitive our is case sensitive so the question was about this right here this is super important and my apologies for not bringing this up earlier we know what this does we just went through that right go through the survived column everywhere where it's won a string survived everywhere where it's not one give me the string died but what do I do with that thing you've done all that work for me are what do you want me to do what I want you to do with it and with a highly technical term right we cram it into this thing called survived label now this is how we interpret it we say our please go to the Titanic table access it using the dollar sign operator and then I want you to go look at the survived label now are smart enough to know wait a second Dave at that point in time there was no column on the data frame no column on the table called survived label and it says oh I won't you want to add it no problem I will add a new column called survive label because it doesn't exist so the question was if I had actually put an existing column name instead of one that was brand new what if would it have overwritten the existing column with the values and the answer is yes you might have overwritten it your devoted written very good question okay next up is the family size feature notice once again that conceptually it's not very different from excel start with the Titanic table access it please work with the column the variable the feature called family size our says Dave there is no column called that on this you want me to add it for you okay cool what do you want to be stored in that brand new column and that's everything to the right take the value one add it to each value of SIP spa and parch coming from the Titanic data frame notice that once again this is conceptually exactly the same as Excel right if you squint this almost looks kind of like what you did sell instead of it being G 2 and H 2 which is a little bit you know it's not as verbose it's the same idea right I'm still accessing particular columns of the table and I'm adding them together same conceptual idea same conceptual idea now this line of code right here is just how you programmatically open the spreadsheet view so if I run these two lines of code together and I scroll over to the right you get my family sized column now this probably doesn't seem like rocket science to you but here's the dirty little secret of data science work and are you spend an awful lot of time doing just this working with tables manipulating data and tables an awful lot of time exactly the same way in analytics with excel you spend a lot of time working with tables it's exactly the same thing okay now you men if you remember earlier you know I talked about how Excel tables gave us a lot of really cool stuff out of the box how it automatically recognized that the sex column only had two values and a male and female and it created a nice smart filter for us to filter based on male or female how it recognized that the age column was numeric and it gave us smart filters instead greater than or equal to or less than or equal to and all that that's because behind the scenes Excel takes a look at the data and tries to understand what kind of data it is and alter its behavior automatically so sex it said look you don't get greater than than less or less than filters because that's string data that makes no sense but you can get them on the age column because it understands that the data type the kind of data in those columns is different now that's what makes Excel really really nice it's pretty smart it actually infers a lot of stuff for you automatically but as we all know as Excel users that's not the only way that data actually can be used in excel you can actually format your data this column is currency this column is date this column is numeric this column is actual text data and when you do that you're actually telling Excel more about what the data is supposed to be like and therefore it changes its behavior based on that you tell Excel the types of data that are in the in the in the table yeah so the question is if I format this in Excel and I read it into R do I lose the formatting and the answer is as I'm fond of so fond of saying as you well know it depends because if you save this off as a CSV file CSV files don't store that information right CSV files are relatively dumb if you save this off as a dot xlsx file an excel file that is all preserved if you read it into R if you want to read an excel file into art you can do that it's a lot more complicated it's way beyond the scope of tonight's presentation but if you do do that yes all of that formatting is preserved but it makes a lot more complicated so the short answer for tonight is no the formatting does not persist because we're using CSV files and they don't store formatting now this idea of data formatting data typing is super important in are like most programming languages are isn't smart enough to just kind of whoa with the data as you give it and just need just kind of in first things on its own it wants to know exactly how to treat the data at all times at all times so if we actually look at this function here called STR this gives us a way to actually understand how it does our think of our data right now it says display the structure of an arbitrary are object when I apply it to a table to a data frame this function will tell me how is the data currently formatted how does our see the cell formatting essentially right is it currency is it numeric is a text is it this is it that is it the other thing so this is super important in arm well super important in Excel too I mean how many of you here have ever worked with an ill formatted excel file before yeah sucks doesn't it I mean it's terrible so even in Excel data formatting obviously makes a lot of makes a lot of importance okay so when we asked our hey tell me what the data formatting is at this table it provides us a lot of rich information it says first of all Titanic's a data frame 891 observations 891 rows 14 variables because we added two they used to be 12 but now it's 14 because we added two new ones and then it tells us by column by variable what's the current data format integers integers now here's what's important notice when I get down to name there's this thing called a factor and factor is ours way of telling you that I'm treating this thing as a category as a category I'm explicitly understanding this to be a category now what we see in excel is exactly the same thing but it's implicit because excels a little bit smarter than our it just does this for you automatically and let me demonstrate if we go back to excel and I widen the name column a little bit and I open up the filter and to make the filter a little bit bigger how many check boxes do you think I have here 891 because each one of these names is unique so excel automatically treats each one as a distinct category and allows you to filter on same concept exactly the same concept by default when you read in a CSV file in our it interprets string values as categories just like Excel does and it creates for you automatically the cat the appropriate categories in this case 891 unique values just like the Excel filter for the name column gives you 891 to stink check boxes that you can filter on as you will know in Excel that's not particularly useful because we are going to deal with 891 individual clicks and it's not going to do that generally speaking but it's how Excel works by default are works exactly the same way and you can see the other types here factor female and male just like we saw in the Excel filter so on and so forth but I do want to call out this one right here survive label notice we added that after the fact when we load it up the CSV file are automatically looked at every string and said look I'm going to make that a category by default but we added that later on and what are says is okay until you tell me otherwise I'm going to interpret that as text CHR stands for character and R as an Excel user just think of that as being formatted as text I went to the cell formatting and I said this is text data same idea now here's another difference between R which makes it a little more complicated in Excel Excel is smart enough most of the time to treat text data as categories implicitly when it needs to for example in a pivot table it will just do that automatically for you don't have to do anything in our it's a little bit more hit or miss depending on which function that you're using some functions will say hey I'll automatically transform text data into categories for you just like Excel does but not all functions in R do that so you have to be really really careful so generally speaking when you're doing our programming it's best to say look anytime I'm creating a category change it to be a factor and how you do that beyond the scope of tonight if you're interested check out my YouTube videos okay the notice once again the similarities right how they work Excel and are they work almost exactly the same in many many circumstances okay so let's say hypothetically speaking that I'm interested in analyzing male fares I have a hypothesis about males on the Titanic and whether they lived or died based on how much they paid for their tickets okay so these common hypothesis that you might want to take a look at so how would you do that in Excel well you know the default least for me anyway is the mighty pivot table put in a pivot table put in the SEC's filter I said hey I'm only interested in the male's and I want to get a high level intuition about fare I want to know something about it so one of the things I may do for example is take a look at the min maybe I took at the average so on and so forth or I can put the standard deviation I put the max I could put a bunch of stuff up there all the stuff that pivot tables give me out of the box and that can give me a general indication okay look Wow some men pay zero for their fare that's probably interesting from an analytical perspective why why are some people paying nothing for their ticket are they employees traveling on the boat or something I don't know a lot to figure that out but you'll notice that the average fare is 25 that could be potentially interesting but without more summary statistics it's kind of hard to understand now the problem with pivot tables is at least out of the box they're actually quite limited in what they can do so Excel actually provides you a better option which is what's known as the data analysis toolpak how many people here are familiar with the data analysis toolpak a few so if you want it if you want to activate the data analysis toolpak this is how you can do it you go to file you go to options you click on Add Ins okay click on Add Ins click on go and you'll see I already have mine activated by by default I believe this is actually turned off so you have to actually go in and turn it on and the data analysis toolpak actually gives you access to a lot of good stuff from analytical perspective one of the things that allows you to do is create rich summary statistics over numeric data but here's the problem if I only wanted to get rich summary statistics over the mail fares if I just go into the table here and filter on females it doesn't work because like most excel functions it's not smart enough to realize that the hidden rows should be exempted from the calculations right let me just show you this real quick right so if I only look at the male's this works okay but if I actually use the data analysis toolpak it's not smart enough to actually discard the rows the female rows which are implicitly hidden right now so one technique to make that work is I create a brand new table so that I've got my filtered male data I just copy it paste it in to a noose worksheet make it a table and now I've just got male data no hidden female rows it's just the male data it's its pristine it's it's pure and now I can use the analysis toolpak to create some rich summary statistics so I go to the data tab in the ribbon and you'll notice that when you activate the in data analysis toolpak we get this cool new feature right here so if I click that I get a whole bunch of Awesomeness too much to go into obviously what we care about right here is descriptive statistics so we say okay we say what's the input range it's right here I want these these values and I want you to create summary statistics and go ahead and put it out put it on my current worksheet and just spit it out right here now what this does it gives me a lot of rich statistics this is this is richer than I can get out of the box with a pivot table that it gives me a much it gives me a much better intuition about what's going on with male fares for example the average is twenty five and a half pounds that's what the average price of the male ticket is but notice that the median value is only ten and a half so what that tells me is that we got skew there is a few people that pay a lot a few males that paid a lot for their tickets most people paid probably quite a bit less because if the mean is 25 and the meetings only ten and a half there's going to be something going on I don't know exactly what yet no there's probably something going on there later defer I couldn't get this with out-of-the-box pivot table it doesn't do median calculations for example and you can see the range that's the big one some guy paid 512 pounds for his ticket 512 pounds yeah hmm the skew yet so yeah so I'm kind of skipping over kurtosis and skewness because they're kind of advanced topics and kind of sticking with the basics but yes you're absolutely right if you just look at the range the median and the mean that tells you a lot tells you there's skew in this data skew in this data now let's see how we could do exactly the same process and our because once again and I'm going to beat this dead horse the what the what of the data analysis process between R and Excel exactly the same it's only the how that differs only the how so not surprisingly we're going to ask our to go to our Titanic data frame here and notice that I don't have a dollar sign here remember before we said the dollar sign was access the table please the dollar sign or excuse me the square bracket means filter the table please that's the syntax for applying a filter now the first thing that you need to know about R is that you can apply both row filters and column filters at the same time and that's what this comma is for everything before the comma is a row filter and everything after the comma is a column filter so what this says is hey R I would like you to filter the Titanic table since there's nothing after the comma I want all the columns I'm not filtering any of the columns at all just give me all the columns but I would like to filter the rows and in particular what I'd like you to do is use a filter by going to the Titanic data frame accessing it grabbing the sex column and checking to see if it's equal to male if it is I want it this gives me the males and if you think about it this is exactly the same in a worksheet table in Excel going to the filter on the sex column opening it up in unclicking the female so that only the male is clicked and that gets rid of all the females and leaves you only the males same idea conceptually the same thing and then lastly please cram in all those rows that I just filtered down to only the males into a table called males and sure enough 577 observations and to just a double check we can go back to excel and notice that the count here is also 577 so they match up conceptually the same thing now here's where our really really shines now not surprisingly as a programming language built by statisticians for statisticians every statistic that you can think of in the world has an implementation in our everything and stuff you've never heard of and Bobbie will never care about all available anything you can do this is going to be an underlying theme going off going forward and rest the presentation if you learn how to use our you are unconstrained you are unconstrained in your data analysis because any any data set any data file any collection of data that will fit into Excel will easily fit into our and you can do anything with it because you have access to I said tens of thousands of functions to do anything you can think of any kind of visualization you can think of any kind of mathematical calculation machine learning anything clustering you name it so it's why it's super awesome for an excel analyst you already know a lot of the basics because you do excel analytical work already all you need to do is just learn how to do an R and then a whole universe opens up to you you are unconstrained in your work yeah question does our support JSON the answer is yes can I read in Excel files can I read in JSON files can I read an XML files can I read an HTML files can I read in blah files the answer is yes because that's the power of our because basically generally speaking anything that you need somebody else needed before you and someone created a package with functions to allow you to do it yeah seriously I mean Google is your friend you just type in our blah and you get back a bunch of particles and Stack Overflow posts and all kinds of things about how to do exactly what you want to do it's awesome so I'll just run all this code and you can see here what I get is a lot of summary statistics I won't drain the individual functions you could of course use question mark summary question mark SD but the summary function creates some nice summary statistics for us the minimum the maximum the median the mean the first quartile the third quartile this is the variance this is the standard deviation this is the total of all the fares and this is the number affairs that they were notice how once again this conceptually aligns to the summary statistics output that we got from the data analysis toolpak but I would argue this is actually easier and faster and more intuitive than actually using the data analysis toolpak yeah no problem okay so summary statistics and then of course anything that you can dream of that Excel doesn't have ours ours got it yeah yeah so the question is data Dave everything you talked about is a structured file what about something like a log file now log files are also structured because that be read by something yeah so as long as you can find some way to apply structure to the file which you would have to you're not gonna be able to work with a log file if you can't apply some sort of structure to it right even if you say this column or this chunk of the data in the log file is freeform text and it's unstructured you at least there is some structure are can read it just like anything else it's a programming language so that can read anything just like that and any it can actually even work with truly unstructured data as well like images because images are by definition structured otherwise you wouldn't be able to get it off of your phone on your computer so the files themselves have structure the binary content the pixels in the image that's unstructured you can even work with that in r2 so yeah it works with everything good question if you have varying columns across rows so most programming languages are going to have a problem with that so you have to have some way of actually identifying that so JSON is a different story right so JSON is actually semi structured Excel Excel HTML is semi structured our works with that because yes even though it's even though it's semi structured it still has to hold to some rules right well formatted JSON has a format and therefore you can interpret it a runtime because it has a format and there are our packages that allow you to work with JSON good question yeah yeah so the answer is Dave okay look that's not even an Excel scenario by the way just so you know it's a little unfair by the way for this presentation but I'll answer it anyway because as you well know I'm an AR aficionado the answer is yes you can read in files line by line there are even packages that you can get that will actually stream data from disk if necessary and you can actually say look this file is huge I want you to do a calculation on it and it will say okay look I will only load up so much in memory flush the stuff that I just read read some more in continue the calculations so on and so forth in fact Microsoft actually bought a company called revolution analytics which actually has a really awesome capability for doing exactly that yeah lengths so that essentially says how long is that variable how long is that column how many how many cells are in it and there are 577 cells yes floodgates are oh no that's not a question oh yeah then I'm running 3.32 so this one's a little bit behind so the latest version of ours I think three three point three point three right now so this one's a little bit behind but not a lot yeah yeah so the question is Dave why is this blue why is this blue that's just color coding that's a common thing that you see in programming languages and the tools that you use for programming languages certain keywords certain things that are important to the actual programming language not the stuff you make up yourself as the programmer but stuff that the language is typically has a different color so true is a constant it comes with the language it's not something I make up myself as the programmer it comes with a language and fern is another example you'll see here one is blue because because it's the number one I don't make that up as a programmer and you'll see here this is also blue because that's a number range good question an excellent segue by the way thank you because that's the next line of code all right so what is the colon do in excel slice range range of cells if I tell you a 1 colon a 7 as Excel users you know automatically what I'm talking about in your mind's eye you see the top left of a worksheet and I'm highlighting a 1 a 2 a 3 a 4 a 5 a 6 a 7 or I'm selecting those cells it's a range guess what exactly the same thing in R if I even run this line of code one two three four five ranges work exactly the same Excel just has this this the syntax where you use a one you know all that sort of thing you have a you have the columns specifically lettered with you know using letters for the columns in our colleges numbers it's just one two three all the way unlike in Excel what's a through Z and then it's a A through Z Z and so on and so forth the know they're just numbers which actually makes things a little bit easier because you know when you're dealing with 333 columns in an Excel table who knows what the actual letter combination for the 330 third column is I don't do you and then there does numbers in R so they're easier to work with so we know implicitly what we're doing here already right we're going to the Titanic data frame and we're asking to filter it and specifically we're saying hey I want the first five rows because one through five the range one through five is before the column and there's nothing after exceed me before the comma before the comma and there's nothing after the comma so give me the first five rows give me all the columns and then please store that into a new table called first five first top five so if I run that line of code sure enough I get five observations of 14 variables now this is just kind of I mean argue this is kind of silly but I just want to demonstrate how we can also do a column filter go to this table filter it please I want all the rows because there's nothing here but I only want the first five columns I want the first five columns there you go right so in some ways certain things are easier for example slicing and dicing data is actually easier in our than it is in Excel especially programmatically yeah sorry say one more time the dot Oh Dave why is it first dot five what is this craziness that's purely Dave's preference so unfortunately one of the thing when you get into are you'll start to become programmers and one of the unfortunate things that programmers get super religious about is coding style and convention so so folks would say Dave name that thing the right way with a dot other people would say blasphemy that should be an underscore or something else so in the art community for example a common common thing that people do is they go look at Google's coding style and they say that's the style that we follow I'm not so I'm not so beholding to Google so I just kind of do what I like so that's the reason why it's a dot so did you want it so I just want make sure I understand the question are you asking now that I have first five can I add just the column values there right right mm-hmm yeah so the question is can we only add the new column for males and the answer is no think of it in Excel in a worksheet table I can't add a column just for the male's in Excel either right the best I could do is on the female rows the values would be blank right that that you can do an r2 but you can't just say look this column is only for the 577 males and that column doesn't exist for the females because they're all part of the same table and that works just like Excel you can do you can't do that in Excel either we did make the new table called males but that wasn't her question her question if I understood her question was on the original table in my correct on that yeah if we had the new columns will that be updated in the CSV very good question and the answer is no and here's the way to think about it I can change that CSV file all I want in Excel until I hit control s it doesn't get saved back to disk same thing n are you have to explicitly save the changes using some code right there's a function called write dot CSV write rewrite that would essentially save it back to disk so just like it Excel you have to save it yeah good question yes exactly so think of it this way when I double-click on a CSV in Windows it automatically loads up into excels memory but until I save it change that I make is not permanent yet same thing in our it's not permanent until I save it back to disk just like an Excel so I can excel okay so now let's go back we created this family size column right we create this family size column and we thought we could do some interesting analyses with it so there's an Excel user what's the first thing we're going to do we want to put in a pivot table right because this is what you do in Excel you put in a table so let's go back our pivot table and I'm going to change this up a little bit I don't care about few more so I'll get rid of that and let's say I'm particularly interested in analyzing family size by the combination of gender by sex and by P class all right that's the power the pivot table right drill down yeah okay so I'm going to put P class lips put P class in there and I'm also interested in survivability right I'm interested in analyzing things in terms of survived and died so one way I could do that to make this the column so now I'm analyzing females by class males by class by diet and survived and what I'm interested in looking at is family size now that's not particularly useful from an analytical perspective it's just basically a big table of numbers what does it mean how so one thing we may do is you may say look you know what let's make that the average actually and I'm getting there I mean it's still table numbers but you know Excel is pretty cool I can insert this thing called a pivot chart now we're now we're talking about something now we're talking about something I can see blues died and of course I could change that I could go into it and configure the chart but just just roll with me here that blue is dyed and oranges survive and we can see based on the averages lat Wow if you're a female and your family size is bigger on average your chances of survival seem to be worse smaller average family sizes seem to indicate better chance of survival right notice the difference in the bars males and second-class it's the opposite oddly enough you survive if you had a bigger family in general this is what is saying kind of sort of because here's the problem this is just averages what I really what I would really really want to know essentially is given a distribution of family size right the distribution of family size for females in first second and third class the distribution of family size for males in first second and third class what is the proportion of survived versus died in those distributions that would be more important because this is just averages right all I can say to somebody with this chart is well females in third class had an average family size of three and a half and it appears that they perished a little bit more frequently than those that had an average family size of two not super useful actually analytically okay no problem cool cool cool so we go back to excel and we say okay cool you know what I'm going to going to get rid of that I want to do a pivot chart I want something that shows me the distribution of the numbers I want a histogram because that's what shows me the distribution of data wha-wha-wha no dice Excel cannot do this it cannot do this with a pivot chart it cannot do this with the data analysis toolpak it cannot do this out of the box in doing it will not allow me to drill down and say by gender and by class what is the survivability rates as it varies across family size because my hypothesis is remember I think that larger families probably didn't survive as much as smaller families I have no way of answering this question easily in Excel now some of you are thinking Dave not to worry I have the mighty tableau I have power bi I have Spotfire yes you could do that in those tools but they're not because schools are you know this to be true you know it's you know it empirically to be true it is not they're not as cool as are and the reason for that is once again those tools are great tableau is great power bis grade and they do a lot of cool interesting things but once again you are eventually constrained by those tools and in are you are not constrained you can do whatever you want for example you can use a package the our version of an add-in called ggplot2 ggplot2 is quite possibly the most single awesome data visualization in the history of the planet and even though I'm biased I think a lot of people would agree with me on that so if we run this line of code let me prove it to you here we have an awesome plot across the top females across the bottom males first class second class third class oranges dyed this turquoise color is survived oh there's a whole lot information right there a whole lot of information first thing you notice is hope for the gentleman's request my apologies first thing you notice if you're a female in first row second class it kind of doesn't matter what your family size is it's it's good it's good to be a female in first second class in the Titanic that'd be very few parish not so much for dudes in third class it's pretty much sadness but a couple things that are interesting is that you'll notice that starting about here for females in third class and for males in third class it's overwhelmingly orange died so it does seem to support the hypothesis in with a modification yeah family larger family size seems to end give you a a poor indication of survivability mainly in third class mainly in third class not so much for females at first and second and not really for males in first and second class either this kind of visualization you cannot do out of the box in our our seeming excel you can't do it out of the box to do in our but here's the thing some folks say look Dave this is this is super this a lot of typing there's a lot of typing why would I want to do this is a lot of typing right it's easier for me just to click around an excel and create a chart well here's the thing you're actually super wildly productive because look how easy it is for me to create a new chart less than 10 seconds but now I'm looking at fair and this one has this one has a lot valuated - first and foremost good Lord it's small what does that tell you it tells you there's a lot of skew I mean look at that the reason why it's small is because of this big long line right here and notice although it's a little hard to tell if you had it on your computer you would see that if the vast majority that's orange so tells you a lot right there's hue in the Feres right we know this already right we knew by looking at it earlier that the range was really high 512 I mean the medium is only 10 and a half for males for example and now you know why yeah and then look at this guy right here it's like Bill Gates right there yeah oh yeah he's green oh yeah he's great he's tasty oh well you know Bill Gates got off its all right he's all good right so even this tells you a lot just just you just buy the shock and awe of how small and tiny everything is it tells you a lot already right in Italy it took less than 10 seconds to create this plot this I would argue this takes less time than clicking around and Excel to create a chart not to mention that you can't even create this chart in Excel oh and I would also tell you by the way it's actually faster than using power bi - because I use power bi I've timed it trust me I've timed it it's faster so that's the more the possible but wait there's more I've got more using another package another add-in called deep liar you can actually create your own pivots so for example if I run this code I can now create a pivot by P class first second third drill down by just sex drill down by labels average family size passenger counts if you squint a little bit this looks exactly like what you would get out of a pivot table in Excel right but here's the thing once again I am unconstrained any mathematical calculation that I want I can put in this pivot table anyone any of them I'm unconstrained for example if I want median to be added let's say I want the median of fare to be added it's just that simple I'm unconstrained I can add any mathematical calculation in one the data analysis toolpak gives me a lot but data analysis toolpak doesn't have everything in it guess what it guess what does have everything in it ours got everything in it oh yeah oh yeah okay so that's the more out of the possible now you can create pivots you can create beautiful visualizations quickly and easily stuff you can't do an out of the box of excel stuff quite frankly that you can't do in tools like power bi and tableau and if you're interested we actually have a recorded meetup that actually demonstrates some are visualizations that you cannot do in power bi so we put them in power bi using the are integration so if you're interested in that we have that up on our channel as well but there you have it there you have it okay come on now okay there we go questions can you pull from two sources and join them the answer is yes so if anybody here how many people here are familiar with like working with relational databases at least a little bit so you can actually simulate anything that you do in a relational database you can simulate in our did the caveat that it ought to fit memory can also fit memory on the computer but yeah you can load up one table like this one table like that and you can join them together as long as they have to have like pieces of data that line up you can join them absolutely absolutely so let me ask you let me ask you guys a question how many people are interested in learning more about our after this yes this how many people here think it's not as hard as they thought it might be yeah yeah it's not as hard as you might not think because here's the thing most most of the stuff that you use there's a whole bunch of are that you'll never use as an analyst because you don't need it all that crummy computer science stuff don't worry about it just use the stuff that works with the analysis right it's a lot easier yeah figure it out so what do you mean so the question the question Oh automatically cleaning all your data for you why would you want that dude so the question is Dave is there a magical art package that totally destroys my job security thank God the answer is no it does not destroy your job security the answer man someone's always got to be a negative Nelly on this little thing man yes so the current answer is there are a lot of package there are packages that can help there are specific packages that do automate a lot of standard data cleaning and data transformations however is it going to work in your specific scenario with your specific data set it depends it depends and I would argue that's a good thing because job security but I'm a computer scientist by training I'm a programmer in a very real way most of my career was to put people out of work and get computers to do it for them I'm a data scientist now I don't want that happen to me so it's possible that it will happen at some point but not right now not at this stage yeah can you give some examples of where ours being used at scale and really large scales on war wasn't really oh yeah so me personally are at scale so when you say at scale what do you mean yeah sorry to be pedantic when you say a really large data set what do you mean because one person's really large data set is another person's not very large to instance exabytes I don't me personally I don't know of anybody that exabytes scale and if you're really talking about exabytes scale more than likely you're talking about some sort of custom solution anyway right I mean even even Hadoop itself I think probably would start choking on multiple exabytes so let me ask do you work at Google do you have an exabyte problem I don't yeah yeah if you work with custom if you have you have customers that have exabyte problems more than likely you're going to have a hard time finding off-the-shelf solutions of anything that's going to work at that right yeah yeah so for example they give you an example you can go to Microsoft's cloud obviously obviously I'm familiar with this because I used to work there you go to Microsoft's cloud you can spin up then I do cluster in the cloud called HD insight and you can actually provision a spark cluster on top of that Hadoop cluster and they will actually run algorithms in parallel using R because of the aforementioned purchase of revolution Alex I talked about and push that down across all the nodes in the cluster so you can easily talk about terabytes hundreds of terabytes of data doing that now what I have what I showed in my laptop that I got 16 gigs of RAM in that laptop so that's that's not going to work that's not going to work but I can hook that up to a cluster in the cloud write my code right here hit enter it gets pushed to the cloud it gets distributed and works a bunch of data it aggregates the results and brings it back to me so that's absolutely good but if you're talking about exabytes I mean there need to be honestly there's not many companies in the world that really have an X of a problem not really yeah so they so I may interpret your question Dave I'm not a statistician can I still use our productively then my answer would be if you're not a statistician or using Excel productively the answer is yes right because there are plenty of statistical functions in Excel there's lots of them doesn't mean everyone I mean most people don't use them right tool for the right job but there's you can do actually pretty sophisticated statistical analysis in Excel if you want not many people do because they don't it's not necessary same thing in our just because all that power is there it doesn't mean you have to use it I'm sorry say one more time using our yes it depends depends on the problem yeah so first thing that you would do is you would say this is my problem domain I would go you would go to what's known as cran which is the internet repository for all of the packages all of the add-ins Ferrar and you would find what's known as a task view and what you look for is the financial task view and you would say which packages are available that fit my problem domain unless your code unless you're coding all the calculations from scratch where are you trying to build a machine learning model you feelin me all right so let me give you a let me give you a specific example there is a so the bootcamps that I teach working for data science dojo there are a collection of students that are currently working on a very similar type of financial problem around currencies using R as their platform and they found a bunch of packages out of the box that work for them well I can't you can't talk too much about it that if you're finance you know right secret sauce you can't talk about only bringing that up as an example of what what's possible now again given the specific nature of your problem because I'm not a finance guy I'm not a banking financial person so you'd have to go look and see what's available my guess is there's probably some stuff that'll work for you now spinning it up as a production data product a bunch of different options one easy way to do that is to look at what's known is as your ml as your machine learning it has the ability to essentially run our code and our models in the azure cloud and quickly turn them into rest web services that can be called and then you for example you could sell your API to people for example calls your API any other questions or depends yeah is are able to work with a messaging service like for example I'm sure if it's the right packages but is it able to inject stable yeah yeah yeah so the question is Dave I've got streaming data most prototypically yeah exemplified by I have a Kafka bus or I got Kinesis an AWS or I've got an event hub in in measure can are read from those and the answer is yes with the caveat hopefully is that some ones are in a package for you to use now if you have to you can write it yourself for example it's actually not that difficult to write up a Java jar and then wrap it in our package and then you can call from our for example and basically if you can get access to Java young bass achill get access to spout anything Kafka for example because it's written in John good question yes good question Dave this is a programming language and I'm not a computer scientist how difficult is it remember that the people who originally built our weren't computer scientists either they were statisticians so the answer is you don't need a computer science background right because it's not a general-purpose programming language for example there's no heavy-duty constructs and are like pointers if you don't what a pointer is don't worry about it just just know that it's something terrible you don't want it it's terrible you don't want it nr doesn't have them so you're not to worry about so it's not it's it's not that difficult because again it was built by statisticians for statisticians it wasn't built for computer scientists for computer scientists which is why a lot of computer scientists thumb their nose at it because they where are your pointers oh no pointers so but it will take effort I mean don't get me wrong we'll take effort but you should not feel intimidated by it because you know the computer science background is there any prerequisite to do before doing this i if you asking me I would say no and in fact I believed it so much that I like I said earlier I put quite a few hours and building a tutorial on my YouTube channel to teach our programming to people using the Excel based thing that we just saw here right you do this in Excel this is out looks an R you do this in Excel this out looks an R and then eventually we did the tutorial just goes to say okay look we're leaving Excel now but by that point you're pretty comfortable with R and say look here's some our specific stuff so I would argue no okay so I'm not familiar with prism to be honest with you I have coded in MATLAB that cutter did you might during my master's degree a single basic biggest advantage of our has over MATLAB is that R is free and never discount the power of free not only to mention the fact is that MATLAB as well as SAS also suffer from the same problem they don't innovate nearly as fast and there is a worldwide Advanced Research community that builds stuff an R so it's constantly growing constantly getting new stuff same thing with Python by the way right which is also open source and free so that's the primary those are the two primary advantages over MATLAB and in SAS for example is there a single repository the answer is yes it's called the interwebs and it's called Google Scholar so the answer the answering question is no there's you talk about academic research of like I know I want I want no you want finance I know what your time yeah look at you try to deflect with the image processing come on now yeah yeah no I'm with you integration machine that's actually a really excellent point it actually dovetails with your question it's called introduction to statistical learning yeah is El Hass tha STI e that's one of the authors but this is actually dovetails with your question which is is there one single repository for all academic research in a particular area the answer is now Google Scholar helps a lot with that but what's actually interesting is is exactly what this lady brought up which is if there's actually a ton of books on a wide range of subjects about our to be honest with you wind surprised me if you went to Amazon and did a search that you may not fall you may find a book closely related to your subject matter where they used R as the language to talk about it so for example you can find books that are specific to are for like financial time series forecasting that's all they talk about how you do that in our right so you may find a book that's are based on that subject so for so on yeah so you you may want to check out Amazon there may be some someone may have brought a book that's in that space it's possible I'm not saying it's happened but it's possible because there are so many books written about these various topics and values are because it's open source it's free and there's lots of packages yeah this scope oh man I know scope is so you're asking why Microsoft built scope well that's a really good question I've been wondering cosmos for that matter right you know we're probably Ballmer I would imagine because when cosmos was originally built Hadoop existed at the time it is but the code the question is why would Microsoft build a custom proprietary big data store when Hadoop already existed and open source tools already existed and the answer was at a certain point in time that wasn't allowed at Microsoft it was what's known as competitive technology so wasn't allowed started for now Satya genius totally different company now as far as that stuff goes but when cosmos was originally built yeah there was a different kind of company given kind of culture now that's one man's opinion by the way don't take that as gospel that's one small little cog in a big machines opinion so for example well so so you so you can implement stochastic gradient descent in Excel and if you don't know what stochastic gradient descent is it's just really complicated math basically you can do it in Excel you can actually MIT implement neural networks in Excel if you really really want to I don't know why you would write but you can as long as you're willing to wait for it to process for Excel do all the calculations and to your data small enough that it will all fit inside of the Excel oh yeah absolutely absolutely I'm done not saying do it messing do it but is it possible yes but here's the thing right and this this is a fundamental truth I would argue just about anything in them just because you can do something doesn't mean you should do something and typically what determines whether you should do it or not is cost if you could implement stochastic gradient set in Excel you would probably take you so long to do it that it doesn't make any sense for you to do it just use an R function or Python scikit-learn or something like that right it's a cost it's a cost problem yeah yeah so the question is hey this seems like David the impression I get is this seems to be a client-side technology right it was built for statisticians working on their local computers it's not a enterprise class server system and the answer is that is true however companies like our studio and Microsoft have now transformed that so Microsoft sells something called our server now which actually provides you a server side solution if you're lucky enough to have sequel Server 2016 you can have our services which is pure goodness because what that allows you to do is run our inside stored procedures right in your database engine which is so awesome you can do so many cool things with that yeah yeah not surprised that they're adding Python but right now the 2016 going to get element another bias I've worked with sequel server both as a Microsoft employee in both as a non Microsoft employee for most of my career so I like that database product so it's one of my biases other database products also incorporate our for example sa piece hana database also runs our in the engine as well so that's another server-side deployment option for you as well but the ubiquity of relational databases makes it an excellent deployment vehicle for production systems it co locates all the advanced processing with data in your data Mart's and your data warehouses in your transactional databases it also makes it very easy for your application developers to invoke that code as well because if it's wrapped as a stored procedure there know how to work with store procedures yeah I'm not surprised as you might imagine that is familiar with Amazon is maybe you would think given my background yeah well so the thing to remember to those that if you run our server-side which you're going to want from an air price setting you're going to want thought of stuff that you normally care about how reliable is it how secure is it all that kind of stuff so you can always run are just plain vanilla open-source are on the server you can do that you can buy a big box and just put R on there but are you going to get all the enterprise characteristics from an operational perspective that you want that's where things good that's where things get more difficult hey I will stay as long as people have questions when's the next session so the next session is going to be - June 7th Blair something like that so I think it's June 7th and the topic is going most likely going to be machine learning with our in the carrot package ok and the care package is a is a meta wrapper that makes doing machine learning and are much easier I'm not going to say easy because that would be true but make it easier to what level will that be for us newbies that are you'll be advanced it will be advanced I will try to be entertaining as always in my style hopefully not so many bugs as we saw with Excel but it will be relatively advanced because we're talking about machine learning right so if you look back at our history of our meetups some of them are more advanced like we did event log mining and our that's a little more advanced and then past few have been a little bit more not as advanced and now we're kind of swinging back to the yeah because we have a diverse community of folks right some folks are more interesting the bad stuff so okay I'm being apologetic I'm so sorry yeah okay yeah so we'll talk about things like how you do stratified random sampling how you build models how to use cross-validation to tune the hyper parameters make your models better and stuff like that so yeah no problem anybody else I'll also stay after if people just want to ask me questions one on one as well so that is nothing else thank you for coming [Applause] you
Info
Channel: Data Science Dojo
Views: 105,170
Rating: 4.931973 out of 5
Keywords: Introduction to R Programming for Excel Users, R Programming for Excel Users, R Programming, R Programming Language, Programming, Microsoft Excel, Excel, R Programming Training, R Tutorial, data visualisation in r, excel tutorial, data science excel, r programming excel, excel r programming, r programming for beginners, r programming 101, r programming basics
Id: Ekp2mfxQSzw
Channel Id: undefined
Length: 105min 57sec (6357 seconds)
Published: Wed May 03 2017
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.