Lecture70 (Data2Decision) Factorial Design in R

Captions
Hello and welcome to Lecture 70 of my course, From Data to Decisions. I'm Chris Mack, your instructor, and this is our lecture on factorial design, one of a series of topics we're covering under the overall topic of design of experiments. Here we're going to use R to create a factorial design and to analyze data from a factorial design. As always, the R script that I'm using in this lecture will be on the course website.

You've hopefully already listened to the lecture on factorial design, the previous lecture in this course. The first thing you want to do when doing any kind of design of experiments is, in fact, do the design. There are a few steps: we design our experiment, we carry it out, and then finally we analyze the data that was generated by the experiment. There are some fairly simple rules for the simpler factorial designs, such as a full factorial design or a half factorial design. There are some more complicated designs possible, especially if you don't quite have enough data to create a full design, or if you want to do, say, a quarter design or something like this. Rather than memorizing or learning all of these rules, maybe an easier way to go is simply to use a piece of software that can generate designs for you. I think out in the world of industry the most popular piece of software for doing design of experiments is JMP, from SAS. We're going to use R, which is also quite popular, one reason being that it is free and open source.

So how do we create our designs for a factorial design? As you might expect, there are a couple of different packages out there in R that can do that. Here I'll use one called AlgDesign. I've already installed it, so now I'll just load the library. It has a function called gen.factorial, which generates the factorial design. You tell it the number of variables, nVars, 3 in this case for our example (that's three factors), and then
for each factor there's a certain number of levels. So what are the levels? I'm going to say the first variable has three levels, the second variable has two, and the third variable has three. It's just an example. If I set center equal to TRUE, then I have a symmetric design where the levels are labeled as -1, 0, and 1, for example. And then I can supply variable names for the factors; I just call them F1, F2, and F3. When I run that, it prints out the design here in the console. We see that we're going to have to run 18 experiments, which is 3 times 2 times 3, the full combination, plus replicates if we do any replicates. I can output this into a data frame (that's the output of gen.factorial), and then I have everything set up. All I have to do when I make the measurements is add another column of the measured results, or multiple columns if I have replicates.

Here's another simple example, just a plain old 2^k design, a two-to-the-k design. The levels equal 2 and the number of variables equals 3. If I run that, I can output the results to this data frame called dat, and if I print out the data frame I see the conventional eight experiments for a two-to-the-three full factorial design with two levels. This might be the kind of thing we use for a screening experiment.

Once you have your design, you go out and run your experiments, you fill in a last column with the actual numbers you got from your measurements, and you're ready to do some analysis. So let's go down a little further in this file. I have put together a factorial design CSV file with some results that I took out of the classic textbook on experimental design by Box, Hunter, and Hunter, Statistics for Experimenters. It happens to be the textbook I used when I was in college and took a class on experimental design, and it's still one of the best texts out there, although there are certainly a lot more as well. So I will load this up into a data frame called yield.
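As a minimal sketch of the design-generation step described above, the R code below builds the 3 x 2 x 3 design and the plain 2^3 design. It assumes the AlgDesign package is installed and falls back to base R's expand.grid() if it is not. The object names (mixed_levels, design_mixed, design_2k) and the response column are illustrative and are not taken from the course script.

```r
# Sketch only: object names are illustrative, not from the course script.
mixed_levels <- c(3, 2, 3)  # number of levels for factors F1, F2, F3

if (requireNamespace("AlgDesign", quietly = TRUE)) {
  # Centered full factorial: 3-level factors coded -1, 0, 1; 2-level coded -1, 1
  design_mixed <- AlgDesign::gen.factorial(
    levels = mixed_levels, nVars = 3, center = TRUE,
    varNames = c("F1", "F2", "F3")
  )
  # Plain 2^3 design: two levels for each of three variables
  design_2k <- AlgDesign::gen.factorial(levels = 2, nVars = 3)
} else {
  # Base-R fallback (assumption: same -1/0/1 coding as the centered designs)
  design_mixed <- expand.grid(F1 = c(-1, 0, 1), F2 = c(-1, 1), F3 = c(-1, 0, 1))
  design_2k <- expand.grid(X1 = c(-1, 1), X2 = c(-1, 1), X3 = c(-1, 1))
}

nrow(design_mixed)  # number of runs: 3 * 2 * 3
nrow(design_2k)     # number of runs: 2^3

# After running the experiments, the measured response goes in a new column
# (NA is a placeholder until measurements exist).
design_2k$response <- NA_real_
```

Each row of the resulting data frame is one experimental run, so the row count is the number of experiments to carry out before any replicates are added.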
Let's take a look at what that is. yield is a simple table. It's got eight runs, and if you look at the results you'll see it's a two-to-the-third full factorial design. For every run, yield is the output. Temperature, you see, has only two values, 160 and 180 degrees C. Concentration, the percent concentration of some material, also has only two values, 20 percent and 40 percent. And then catalyst: I have two different kinds of catalysts, so there's an indicator variable, 0 or 1, for what type of catalyst I use. The yield is the percent yield of the reaction, of the chemical I'm trying to produce in this manufacturing process, and the idea here is to maximize the yield. So that's our data.

Notice that I have this column called run, which I don't really need. So the first thing I'm going to do is remove column 1 by indexing yield with -1, and when I look at it now I see it's gone. Now we only have the actual data. Also, catalyst is either a 0 or a 1, but those are just numbers in the file that I read in. What I want to do is turn them into factors, and that way I can use this as an indicator variable in my analysis. To do that I use the factor command: it will take the variable yield$catalyst and turn it into a factor, and then I'll stick it right back into yield$catalyst. When I do that I can't see any difference when looking at the data, but internally R knows that this is a factor rather than a numerical value.

As always, one of the first things you want to do when you look at data is to plot it up and see what you can intuit about the data before you do any statistics. When you have multiple predictor variables it gets very complicated to plot things up, and it's especially true for factorial designs because you don't necessarily have a dense array of values in a particular predictor variable. Sometimes we have only two, as in this data set. That can make it a little bit difficult to look at the graph and see what's
going on. Nonetheless, we can do the standard plot, which gives us these kinds of plots, and because there are only two levels for each factor, again it's not necessarily easy to see what's going on. But these marginal plots with yield as the output are really the most important thing. We have yield along the y-axis, and here, for example, I have temperature, only two temperatures, 160 and 180. The basic thing I see is that the lower temperature has lower yield and the higher temperature has higher yield: an overall general trend. For concentration, I can't tell which data point is which in terms of temperature and catalyst, but I see a spread of yields at each concentration and I don't see a very general trend going on. Likewise catalyst: there's catalyst number zero and catalyst number one, and again there's no obvious trend with catalyst all by itself. So temperature might be effective in improving yield all by itself. The other two might have interactions, but as individual parameters it doesn't look like they're going to affect the yield very much.

Well, we can get that much from that, but we can also plot things up using interaction plots. For example, there's an interaction plot of temperature and concentration with yield as the output, et cetera. I can generate three different ones, and I'll plot them all up on one page. I'll do this par(mfrow = c(2, 2)), which gives me a 2 by 2 array of graphs, and when I'm all done I'll switch it back to the normal single graph on a page. Let's do that and look at what these graphs look like. When I zoom in, I see the mean of the yield as a function of temperature for two different concentrations. This one is averaging the two different catalysts together, so I don't see any of the catalyst behavior. I do see yield and concentration, and these are the two different concentrations. The legends are a little bit messed up; let me see if I can fix that a little bit without
fighting code. If I stretch this up and over and run it again, it sticks my legends in a little bit more readable fashion. So now I see this is the concentration of 20 and that's the concentration of 40. Higher temperature is good for either concentration; that's one thing we learn. Here's yield versus temperature again, for the two different catalysts. This is the average of the two concentrations. I see that the slope is what's changing when I change catalysts. So a temperature rise raises the yield, but it works differently for the two catalysts: catalyst number one is better than zero. And here I look at the mean of the yield versus concentration with these two different catalysts, and I'm averaging the two temperatures together. I see that the higher concentration actually is producing a lower result, and there doesn't seem to be an interaction between catalyst and concentration: these lines look pretty much parallel to each other.

So that's one way to plot. Another way to plot the same thing is using the lattice library, which gives me other ways of doing these kinds of interaction plots. Let's load up the lattice library and do a plot of yield versus temperature, given the catalyst. What I have here is catalyst 0 and catalyst 1, and I have yield versus temperature, but rather than showing me the average yield it actually shows the two points. I don't know which concentration goes with each of the two points, but nonetheless I see the same thing we saw before: for catalyst 0 I get kind of a gentle slope up, and for catalyst 1 I get a greater slope up, for higher yields at higher temperatures. You can play with different combinations. Here's yield versus concentration for the two different catalysts, 0 and 1, for example. Instead of showing me the average yield, it shows the actual data points. All of these are interesting ways of exploring the data. Another way is with box
plots. I'll show you three box plots. I can zoom in to see, and this shows the mean and range of the yield for two different temperatures. Remember, that's going to have both concentrations and both catalysts. So here's the mean and the range when the temperature is 160, and the mean and the range when the temperature is 180. I see that no matter what combination of concentration and catalyst we use, we always do better at 180 versus 160, and I see more ambiguous things when I'm looking at yield versus concentration or yield versus catalyst type. So you can study these box plots, box-and-whisker plots, as well.

There's another way to do the box plot, where I put multiple box plots together in one graph. For example, instead of yield versus catalyst I can do yield versus the combination of temperature and concentration, or the combination of temperature and catalyst. Let's take a look at what one of those looks like. You see I'm showing these combinations: temperature 160 and concentration 20 percent, so in the label the first number is the temperature and the second number is the concentration. Here's the yield, so these two data points will be the two different catalysts. Now I show four different box plots. The higher temperatures are these two higher ones; the lower temperatures are the two lower ones. I can look at the mean and the range for the two different catalysts, though I don't really know which catalyst is which just looking at this plot. Likewise I can look at temperature and catalyst. Here's catalyst 0 and catalyst 1, and then the top and bottom of these box-and-whisker plots will be the two different concentrations.

You try to look at the data with graphing to see what you can learn from it, and then you do the analysis. What you want is for your understanding of what's going on with the data, based on the graphs, to match the numerical values you get when you do your modeling and your analysis.

I just discovered this little command, an options command, that allows me to turn off the significance stars
when doing linear modeling or ANOVA tables. I hate those significance stars. All right, I don't hate them; I'm irritated by them. They're annoying and I don't want to see them, so I'm going to turn them off with this little options command. If you're irritated by those stars, you can do the same.

How do we analyze factorial data? There are two basic approaches. I'm a regression guy. I do lots of linear regressions, and I'm used to thinking about the world in terms of regressions. For a two-level factorial data set we are only going to be able to do straight-line models, linear models. So an easy way to understand what's going on is to use the linear model function, lm, and create a model of yield versus whatever. This allows me to look at effects a bit at a time. For example, if I only cared about the main effects and didn't want to include any interactions, I would do yield as a function of period. The period tells R to give me all the main effects but nothing else. If I run that and then look at the summary (let me squeeze this back so I can see the summary in more detail), I see that I have the intercept, temperature, concentration, and catalyst1 as my main effects, and no interaction terms. It tells me that the temperature is significant, but that the concentration and catalyst are maybe not statistically significant. The R-squared of my model is 0.84, so I am explaining 84% of the variance of the data with this, but maybe concentration and the catalyst type are not really explaining any of that variance.

But that's only when we include just the main effects. We'll begin to see that interactions can be significant. So let me do another model, call it model two, where I do yield as a function of period squared. The period means everything, and squared means give me all the two-factor interactions as well as the main effects. I will run that model and look at the summary. I see temperature, concentration, and catalyst, plus the temp concentration combination, the temp
catalyst combination, and the concentration catalyst interaction as well. What do I see? I see that my main effects have become less significant, but the interaction of temperature and catalyst is very significant, even though the interaction of concentration and catalyst is not. So now I might be thinking that the best model maybe includes the temperature catalyst combination.

If I wanted to compare these two models, I could decide if including the two-factor interactions is any significant advantage over including just the main effects. First, our R-squared jumped way up to 0.9996, so I'm explaining almost all the variance with just the two-factor interactions. Let's do the ANOVA comparison of the two. This is the partial F test, because one model is a subset of the other, and here I see a p-value of 0.06. It's marginally better. If I had a 0.05 significance level I'd say no, maybe not: it is explaining more of the variance, but some of that explanation might be due to chance. There's a 0.06 probability of getting a difference this large between the models just by chance.

The third thing we could do is to include all the interactions. This would be a full factorial data set with a full factorial model, all the way out to the three-factor interaction. If I do yield as a function of period cubed, that means I'm including the three-factor interaction. Here I can't get any standard errors, because I've got exactly as many parameters to fit as I've got data points. So I will explain 100 percent of the variance, and R-squared will be 1. That doesn't really tell me how important the three-factor interaction is compared to the two-factor interactions, et cetera. Another thing we could do would be to search through all the models up to three-factor interactions and then look at something like the BIC or Mallows' Cp, as we've discussed in the past for finding the best subset model. But I kind of think that, from looking at the
two-factor interactions, concentration, temperature, and catalyst plus the temperature catalyst interaction is what's really significant. So I can create a model that only includes those terms, and the way I do that is with this equation here. I say yield is a function of concentration plus temperature times catalyst. Just remember that in R this is not a mathematical equation; it's shorthand notation for explaining what you want. I want concentration as a main effect, and I want the interaction between temperature and catalyst. When I use the multiplication sign, that's telling R I want the interaction and I want the main effects individually, so I want temperature individually and catalyst individually plus the interaction. That's just the notational form in R for creating a formula.

So let's do that. I'll run that model and then look at the summary, and I'm able to see exactly what's in the model. The model includes concentration, temperature, catalyst, and the temperature catalyst interaction. Now I see that every single one of these terms is significant. Concentration is possibly not significant at the 0.01 level, but probably good enough. It's possible that we'd leave concentration out altogether, but it's close; it's kind of on the borderline. If I leave it in, you see I've explained 99.6 percent of the variance. So adding the other interaction terms and the three-term interaction is only adding a few tenths of a percent to the variance explained. I think probably we don't need that. We could also look at the values of the coefficients and what they mean, what they tell you, et cetera. But this is how you would look at the problem from a linear modeling perspective, creating a least squares regression of a straight-line fit because we only have two levels.

Oh, by the way, I can do the same ANOVA comparison, comparing this model to model two, and model two is the one with all the two-factor
interactions. Compared to this one, there's not a statistically significant difference: it's a p-value of 0.3. So I might as well go with the simple model of only one interaction term versus having all the interaction terms present. They're almost the same.

Another way to analyze this is with ANOVA, just doing an ANOVA properly. Instead of looking at slopes and coefficients, I'm going to look at average values of the yield for different combinations. To do that I have to treat concentration and temperature as categorical variables, as factors, rather than as numbers, and I'm going to use aov for modeling. So, like I did with catalyst, first I'll copy the data into a new data set so I don't have to mess with my original. Then I'm going to change concentration to a factor with levels called 20 and 40, and temperature to a factor with levels called 160 and 180. Instead of numbers, those are just the labels of those factors. Now I have a data set where I have only factors, and if I do ANOVA it will simply look at these as if they're categories rather than numbers.

So how do I do this? I can create an ANOVA table for yield versus all the main effects, just like we did before. If I look at the summary, that shows me the sum of squares explained by temperature, the sum of squares for the concentration, the sum of squares for the catalyst, and the sum of squares for the residuals, and F values and p-values for each of those. I see that the temperature is the only one that has a p-value less than a significance level that might be reasonable. Concentration and catalyst don't seem to matter. That's the same kind of finding we came up with when we did our linear modeling. Likewise I can do yield versus the main effects plus all the two-factor interactions, using the same kind of formula we did before and using the aov analysis function. I summarize and, going straight to the p-values, I see
the p-values for the main effects and all the interactions. I see the p-values are the same. Are they exactly the same? Let's see if we can answer that question. I'll just print out the summary of model two as well. Here are the p-values: for example, temperature and catalyst was 0.0318 when I did the linear modeling, and it's 0.0318 here as well. For temperature and concentration, the p-value is 0.2048, and when I did the linear modeling it was 0.2048, just like we had before. So the p-values that we get from the ANOVA analysis are identical to the p-values that we got in our linear modeling. It's essentially the same analysis, just being done in a different structure, a different process, but we're doing exactly the same kind of modeling work, so you should get the exact same results.

I can compare those two models together again. Let's go back to our ANOVA. I can look at the three-factor interaction term as we did before. Here I can't get p-values because I have exactly as many parameters as data points. And like before, I could fit a model of concentration plus temperature times catalyst, so all the main effects are included and the only interaction term included is temperature and catalyst. It's my fourth kind of model. Here's the summary of those results, and we get the same kind of answers that we got before. If you're more comfortable doing ANOVA analysis, that's great. If you're more comfortable doing linear modeling, like me, that's great too. They both give you the same results.

I can plot some of the ANOVA results. Here are the residuals for the case of this fourth model, with concentration, temperature, and catalyst as main effects and the temperature catalyst interaction included. We don't see any specific trends. I can also plot the Q-Q plot. There are not very many data points, but they all seem to be falling along the slope equals
one line as well.

Another thing we can do is create a full table of means. With model.tables, I give it the output of the ANOVA and say I want to show all the means of all the main effects and the interaction terms. Here I'll use model number two, which included all the main effects and all the two-factor interaction terms. I run model.tables and get the full table of means: the grand mean, and the means for the two different temperatures. For example, if I average all the data that was collected at 160, I get this mean value for yield, 52, and at 180 I get 75. The same thing goes for the two different concentrations and the two different catalysts. I also get the tables for the temperature concentration interaction, the temperature catalyst interaction, and the concentration catalyst interaction, and standard errors for the differences of the means.

So there are lots of different ways of looking at this, but the two main approaches are a linear modeling approach (a regression approach) or an ANOVA-based approach, and with either one you can decide what these factors are telling you about how they affect yield.

Well, that's the lecture. There's plenty more to do when it comes to full factorial analysis, partial factorial analysis, et cetera. I'm not going to go any further in this class; of course, you could take an entire class on design of experiments. What we'll do next time is move on to response surface modeling. Till then.
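The two analysis routes described in this lecture, regression with lm and ANOVA with aov, can be sketched in R roughly as follows. The eight yield values are typed in as stand-ins that are consistent with the figures quoted in the lecture; they are an assumption, since the course CSV file is not reproduced here. The object names (pilot, pilot_f, m_main, m_two, m_four, a_two) are illustrative and not from the course script.

```r
# Stand-in data (assumption): a 2^3 design with temperature, concentration,
# catalyst, and percent yield. Not copied from the course CSV file.
pilot <- data.frame(
  temp     = rep(c(160, 180), times = 4),
  conc     = rep(c(20, 20, 40, 40), times = 2),
  catalyst = factor(rep(c(0, 1), each = 4)),  # indicator variable as a factor
  yield    = c(60, 72, 54, 68, 52, 83, 45, 80)
)

options(show.signif.stars = FALSE)  # suppress significance stars in summaries

# Regression approach
m_main <- lm(yield ~ ., data = pilot)                       # main effects only
m_two  <- lm(yield ~ .^2, data = pilot)                     # plus two-factor interactions
m_four <- lm(yield ~ conc + temp * catalyst, data = pilot)  # one interaction kept

summary(m_main)$r.squared  # variance explained by main effects alone
summary(m_two)$r.squared   # variance explained with two-factor interactions
anova(m_main, m_two)       # partial F test: do the interactions help?
anova(m_four, m_two)       # is the reduced model good enough?

# ANOVA approach: same data with every predictor treated as a factor
pilot_f <- pilot
pilot_f$temp <- factor(pilot_f$temp)
pilot_f$conc <- factor(pilot_f$conc)

a_two <- aov(yield ~ .^2, data = pilot_f)
summary(a_two)                       # sums of squares, F values, p-values
model.tables(a_two, type = "means")  # grand mean, main-effect and interaction means
```

Because every term in this two-level design has one degree of freedom, the t-test p-values from summary(m_two) and the F-test p-values from summary(a_two) should agree term by term, which is the equivalence pointed out in the lecture.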
Info
Channel: Chris Mack
Views: 19,862
Rating: 4.9642859 out of 5
Keywords: statistics, data analysis, linear regression
Id: vpIcOOYYw3c
Length: 30min 27sec (1827 seconds)
Published: Sun Nov 13 2016