Hello everyone, welcome to the course on Biostatistics
and Design of Experiments. Today, we are going to talk about design of
experiments, it is also called DOE. Design of experiments are extremely important
if you want to do a well-planned out study of a very complicated system. If you do not plan your study properly, then
whatever data you collect, will be completely wrong. You will not have a statistical basis for
analysis and statistical basis for coming to a conclusion. So, design of experiments is very important
and it is not taught in many courses. Many of the software, but have facility to
give out designs of various types and most of you might not be aware how each software
spews out these different types of designs. So, we are going to talk about in the next
few classes, how one goes about designing or planning the experiments and how do you
vary the variables and so on, actually. So, some of these references, and I have listed
out here. So, if you have access to these references
that will be very useful for you. There is a reference relates to life sciences
and then, there is one on design and analysis; there is also understanding industrial designed
experiments. I do make use of this book also, this is quite
simple and very practical and so, I think you should have a book for yourself so that
you do not just rely on a software all the time. You should get the philosophy of how the designs
are done and if you understand it, it is extremely interesting and very fascinating actually. So, why do we need to do experiments? Actually, this is a fundamental question,
why should I do experiments? Why should I have a design of experiments? So, design of experiments is a statistical
methodology for systematically investigating input-output. So, you may have several inputs and you may
have several outputs also. Like, for example, my carbon concentration,
nitrogen concentration, the pH, the temperature, the agitator, rpm, the amount of oxygen bubbled,
these could be input. My output could be, amount of, say, biopolymer
produced, amount of biomass produced, amount of secondary metabolites produced, so lot
of outputs. So, you could have several inputs, several
outputs and each of them may behave differently for different inputs. These inputs are called x’s, independent
variables, parameters and so on. The output is called generally the dependent
variable, the y. So, we do these experiments to identify important
design variables. You may have hundreds of variables, but only
few of them may be important. So, if you are running a plant, you are interested
to know, which ones I should focus on. Which x’s should I think about having a
good control on? So, I do not have to spend money on looking
at other x’s, so I focus only on the important x’s. Optimize my product and process design, this
is very important. Ultimately, you want to get the best out of
your plant, you want to minimize the energy usage, raw materials usage and get maximum
amount of your desired product. Whether it is a biopolymer or whether it is
a secondary metabolite or whether it is an antibiotic, I want to maximize its production
and minimize my raw material usage, that is obvious, right. That is called optimization. And similarly, if I am doing a product design,
I want to improve the quality of the product. Product which will have the best, say, tensile
strength or compressive strength or flexural strength or maximum reliability and so on. So, that is called the optimizing the product
design. Achieve robust performance, ultimately we
want the, say, bioreactor to be robust. It should not go out of control for small
changes in your x’s. You know, the temperature changes by one degree,
we do not want a very large change in my product amount and quality. So, that is called a robust design. How the process is able to absorb small, small
changes in your inputs. For example, raw materials can have different
amounts of impurities, will that affect too much on my product concentration, product
purity? If it affects too much, then I need to have
a very pure raw material. So, even for small variations in the raw material
concentration, if my product concentration or yield changes a lot, then it is not very
robust. But whereas, if it can absorb the concentrations
of the impurity present in the raw materials and still give me the desired amount of product,
desired quantity and concentration, then that is called a robust design. This is, design of experiments is very, very
important in product process development. So, if you are moving from a small scale,
that is, lab scale going right up to a manufacturing scale without performing a design of experiments,
you cannot just jump and start making in a large scale. This is very commonly used by chemical engineers,
by bioprocess engineers in any manufacturing. Whether you are manufacturing a chemical,
whether you are manufacturing antibiotics, whether you are manufacturing metabolites,
secondary metabolites, whatever be it, unless you do a proper design of experiments, you
cannot move from small scale to large scale. You cannot expect to have an optimum process
with the minimum raw material and energy usage and maximum product yield and desired product
concentration. So, that is what we are going to talk and
I will be talking about how one varies the various x’s or various input parameters
to achieve the maximum information as well as maximum output, desired output. So, we are going to controlled changes to
input variables to gain maximum amount of information, this is called a cause-effect
relationship. We need to have design of experiments performed,
so that we can develop regression relationship. We will talk about regression also later. So, we want to develop equations like, yield
of my desired product is equal to function of various input parameters, right. So, in order to derive such an equation, I
need to perform experiments so that gives you a cause and effect. You know, I may develop equations like this,
right, the yield is equal to function of temperature, pressure, dissolved oxygen and so on. It may be a linear relation, non-linear relation,
it could be anything actually. Now this is more efficient, design of experiment
is more efficient then changing one variable at a time. Imagine I want to look at temperature, pH
and rpm, that is, agitator rpm. It is not very intelligent just to do experiments
by changing temperature alone, few experiment changing temperature alone, then keep everything
constant, then now keep temperature also constant, change pH alone, different values of pH, then
keep all of them constant, then change rpm alone, different values of rpm. That is called one variable at a time or one
factor at a time and that is not very, very efficient because it will not be able to identify
interactions. You know what is interactions? I talked about interactions many times in
ANOVA, two way ANOVA, three way ANOVA. So, when you change only one factor, you will
not be able to identify whether there is an interaction between two factors like temperature
and pH maybe having interaction. Unless you simultaneously change this, you
will not able to study those effects, ok. Also, statistical software will also have
in the market these design of experiments. I, like I said, you know, it can spin out
different types of designs, these packages can do that actually. So, it does not require much intelligence
at all. So, what are the activities involved in DOE? First, you need to prepare the design. We will talk about it in the next few classes,
how do you prepare. Once we have the design, which gives you the
different levels of the input parameters, then you go to the lab or plant and collect
the data. If your output or desired dependent variable
is biomass, so you measure biomass at different input values or input variables, then you
statistically do the analysis of the data. You may use T test, F test, we looked at so
many tests in the past, say about 30 classes and then you derive conclusions. Based on that we will say, we will accept
null hypothesis or we agree to reject null hypothesis then. So, we agree on alternate hypothesis. Then, we develop mathematical relation between
various input parameters with the output parameter and then we formulate recommendation because
of all these actually. So, we decide, that temperature should be
only between 35 and 37, pH should be always . So, these types of recommendations we make
based on our design study actually. These are the basic steps in design of experiment. If we look at design of experiments historically,
it has been there from 1920s, early 1920s. So, it was used in agricultural and factorial
designs were developed during agricultural studies. For example, studies were carried out to see
whether this particular fertilizer is better than that or this treatment of pesticides
was better than that and how they performed on different types of land areas and how they
performed with different plants. So, we had many parameters and you cannot
do too many experiments, so design of experiments was thought of at that point of time, that
is, 20s. Then, came sequential designs in the area
of defense and of course, by around 50s chemical industries started using these different types
of designs. This is called response surface designs, which
was used for process optimization because ultimately in chemical industries, they want
to maximize the production of the desired product, minimize the usage of chemicals. So, the design is called response surface
designs were incorporated in the early 50s. Then, came the robust parameter design. As I said, I do want my product quality or
product performance to change too much with respect to my input values. So, it should be able to absorb these variations
and that is called the robust design that came into manufacturing and quality control,
ok. Even if, for example, the quality of my fuel
varies in a range, the performance of the car should be so robust enough to give you
the same mileage per liter of the fuel. That is called a robust design. Then, came virtual experiments using computational
models design of experiment were also used in computer simulation, especially for simulating
semiconductor performance, aircraft performance, automotive performance. So, design of experiments was also started
being used in mathematical modeling and simulation also. So, it has been there, it is being used in
almost many fields of science and engineering. And biological engineering also has taken
it and they have started using the various design of experiments tools in the biological
research. Let us go forward. So, good experiments are always comparative,
you know. If you are, say, comparing BP in subjects
treated with placebo to BP in new drug. So, if we are looking at a drug, I will always
compare it with the placebo. We talked about it in many times in the course
of these weeks, so either placebo or existing drug. So, if I want to say, this new drug is better
or as good with respect to placebo or existing drug, so we need to do that. So, you may compare say male volunteers with
female volunteers on the performance of a drug. So, always good experiments are comparative. We never take historical controls and then
compare it that is very, very rare. So, if I want to introduce a new drug into
the market, I will always carry out clinical trials with the old drugs, with the set of
volunteers and new drug with set of volunteers and make a comparison, ok. That is always done. I will never take historical data. The data performance of the old drug is given
in the literature, so I will take that and do it; that is not a good idea at all. So, it is always good to have set of volunteers
for control or for old drugs if you want to introduce a new drug into the market. So, comparison and control are very, very
essential. We have being looking at many problems in
this idea. Never, never compare with the historical controls. That is not a very good idea unless you do
not have a control. For example, you can say, the life span of
people have increased from, say, 40 years in the 19th century to almost 70 years. So, if I want to do that sort of study, I
may get volunteers in the current age, but I will be not able to get volunteers from
the 90s, 90s, 19th, right, so that is a problem. So, in such situations, of course, we cannot
have a comparison. The current, concurrent controls, we have
to make use of the historical controls only in such situations, but otherwise it is always
good idea to have concurrent control, be it placebo, be it old drug, old assay, old volunteers
and so on, actually. So, then next comes replication. We talked about replication or reproduction
that is very, very important. That means, you carry out the entire experiment
not just once, may be twice, thrice, four because that gives you an idea about error
and if you want to get error sum of squares without replication, it is very, very difficult. So, suppose I am looking at blood pressure
on control group and those we treated, it is very bad idea to just do experiment with
only one volunteer, one of each, that is very bad because we have no idea about the error
involved. But it is always a good idea, say you take
10 volunteers per group, so the blood pressure may vary of the control from, say, 85 to 97
and the treated could vary between 90 to 115. So, we have a range of a values. So, we can calculate variances for the control,
we can calculate variances for the treated, we can perform F test and so many things we
can do. But with this we cannot do anything. Actually, it is just a single point control. So, replication of experiments is extremely
crucial. And I also showed you before, that when you
do not have replication, it becomes very, very difficult to understand error sum of
squares or even sometimes it is very difficult to understand confounding or interactions. Why replicate? Reduce the effect of uncontrolled variation. So, we increase a precision, quantify uncertainties
because say, any assay, any methodology will always have an error. So, replication helps you to find out what
is the error margin. So, replication is same as reproduce like
I said, but it is not same as repeat. Repeat is just taking a sample and repeating
the measurement in the instrument three times, but replication is by performing the entire
experiment with the x’s; that is replication. Randomization, this is also very important. We have to randomize otherwise we will always
have a bias. If I am going to take, say, 20 volunteers,
I will put some of them into placebo and some of them in the drug. I will randomly pick volunteers and put into
these two groups. I will not go with certain bias, I will not
take people who look healthy and put them into placebo or vice versa, that is not correct,
that is called biasing. So, we can randomize using a, there is a random
number generator software was there, table was there. So, if there are 20 volunteers, you can make
them, ask them to stand in a queue and then, use a random number generator or even toss
a coin and pick them randomly and put them assigned them into these two groups. That is the correct way of doing it rather
than bringing in a bias, otherwise that is very, very dangerous. So, randomization is very important when we
perform experiments. Why randomize? It avoids bias. So, randomly selected volunteers for control
and test group rather than based on physical features, like as I said, you know, we look
at people who look healthy and put them in control. That is not correct; that is bias and if you
look at healthy volunteers or unhealthy volunteers and put them into test where we are going
to give the drug again, that is not correct actually. That way we have the chance. Randomization allows you to use the probability
theory because the entire probability theory is based on random tossing of coins, tossing
of dice and so on, actually. So, entire statistical analysis techniques
can be applied if we use a random method rather than a biased method. Next comes blocking or stratification. So, for example, I am taking some, say, blood
glucose measurement or blood pressure measurement of volunteers with test group and control
group. These may data will be made in the say, morning
or afternoon. So, if you think there is going to be some
differences when I take data in the morning or in the afternoon that is true with blood
pressure or even with glucose. For example, blood pressure may be low in
the mornings, whereas it could be high in the afternoon. So, in such a situation we can have equal
number of subjects in each group, you know, that is called blocking. That way we can take account of the differences
between periods in your design. So, you do not have to worry their morning
data collected and afternoon data collected is going to give you problems. For example, you are testing a fertilizer
in a field, there are different types of field. So, you do not, you are not very sure, that
whether that is going to affect your, the performance of a fertilizer, then we can sort
of, different types of lands could be blocked. Similarly, if you have different bags of raw
materials for performing bioprocess experiments, suppose I take samples from one bag and do
some experiments and take samples from another bag and do experiments. If I am worried, that each bag may have some
variations, which may affect your results, then I can use bag as block. So, I will control, I mean, sorry I will perform
a, measurement, calculations only in each individuals block and we can also later on
do between block analysis to see whether block has a effect, that is called blocking. So, look at this, 20 males and 20 females. I have, half of them are going to be treated
with drug, other half left untreated or with placebo or old drug. I can do the treatment only for 4 volunteers
per day. So, Monday to Friday only I am going to do
the work. So, how will you assign individuals to the
treatment groups in two days? So, I have 20 males, 20 females and half of
them in each group will be controlled, half of them in each group will be the test. So, how am I going to perform this design
plan? One design plan, Monday I will have a control,
control, control female and then control, control, control, again female on Tuesday,
like that. And then, later on, in the next week I may
have the treated, treated, treated male. This is a very bad design; this is extremely
bad because you are completing all of one set and then all of second set. There is no randomization; there could be
bias coming into the picture. So, that is a very bad design. So, another alternate will be randomize design. So, what we do is, we may take a treated person,
a drug female and then we could take a control male, then we could take a control female
and then we could take a treated male, drug treated male. So, we have different types. We have a female and male taken here because
you have pink and pink and blue and blue, but you also have treated control, control
treated, that is, on Monday. It is quite random. Next day, we may take two treated male and
two treated control male. Next day, we may have two treated female,
two control female, like that. Now, this is quite random. As you can see, it is randomly done. There is no pattern at all coming into the
picture. This is called a randomized design. If you want to block it also, then we can
do it like this. So, we will have the female control and test
together, then we have a male control and test together, like that, you know, we have
some blocking. So, this is a block design, like that we can
do. So, as you can see, never, never have a design
like this where the complete one set of all the female control, then you go into treated
and so on. This is a very bad approach to do, whereas
this a much better randomization and this is blocking of the data of male and female
together. So, if you can fix a variable, like if you
want to do only adult male, then it is ok, but if you do not fix a variable, then block
it, that is, if you are going to take both adult and old volunteers, then we can block
with respect to age. So, we, and have some group of volunteers
adult, some group of volunteers who are old and then you perform the experiments and then,
later on, you can also look at effect of age also. That is a good , but if you can get only a
adult male between the age of 30 to 45, then no problem, age will not come into the picture. If you can neither fix nor block a variable,
then better to randomize it, because there could be situation where you might not be
able to get all adult and old people. Suppose, if you are testing some drugs for
sudden treatment, most, some disease may happen only in certain type of population and so
on. Then, say, you just randomize it. So, this is how we do plan the experiments. Now, there is something called factorial experiments. We will look at these factorial, you are going
to come across this word factorial quite often. So, imagine, I am looking at a drug and diet
for cholesterol lowering, so you could have no drug, drug and then normal diet, high fat
diet. So, you can have four different treatment
strategies, right. No drug, normal diet; no drug, high-fat diet. . So, we can have a drug, normal diet. Then, finally, drug, high fat diet. So, we have four different situations because
we have two factors: no drug, no drug, normal diet, high-fat diet. So, 2 into 2, 4. So, by doing this we can learn more, we can
look at effect of the diet, we can look at effect of drug, we can even look at effect
of, that is, each one is a single factor and then we can even look at effect of drug and
diet combined together also. So, that is an advantage. So, this is called the factorial experiment. We have two factors, that is, drug is one
factor, diet is another factor and each at two levels, that is, no drug, drug; other
one is normal diet, high-fat diet. So, 2 into 2, 4. So, we will be doing 4 experiments. So, it is always better to look at four different
types of experiments. How do you do? We will take, first experiment will be no
drug, normal diet, that will the first experiment and see the performance; no drug, normal diet. Next experiment could be: no drug, high-fat
diet. Third experiment could be drug and normal
diet. Fourth experiment could be with drug and high-fat
diet. So, we are combining both these factors and
getting four experiments. So, it is much better than doing single factor
experiment. For example, single factor experiment could
be, one experiment be no drug, next experiment could be drug, next, third experiment could
be only with normal diet, fourth experiment could be high-fat diet, no change in the drug
pattern. Whereas, the factorial experiment, we are
changing both simultaneously in some situations, that way we will be able to look at even interactions
very efficiently. So, many design of experiments makes use of
factorial experiments or factorial designs, so we are going to look at factorial designs. So, this is called a two-level factorial design
because we have two, two levels: no drug, drug or normal diet, high-fat. And we have two variables here or two parameters
here: one is called the drug parameter, other one is called the normal diet, high-fat diet
that is another parameter, that is, diet as another parameter. So, we will talk about this factorial experiment
in the subsequent classes. Thank you very much. Key words,
Design of experiments, variable,factor, ANOVA, Interaction, experiments, Excel, replication
of experiments, Randomization, blocking or stratification, One design plan, factorial
experiments