As a chemical engineer with roots as an R&D process developer, the appeal of design of experiments (DOE) is its ability to handle multiple factors simultaneously. Traditional scientific methods restrict experimenters to one factor at a time (OFAT), which is inefficient and does not reveal interactions. However, a simple-comparative OFAT often suffices for a process improvement. If this is all that’s needed, you may as well do it right statistically. As industrial-statistical guru George Box reportedly said “DOE is a wonderful comparison machine.”

A fellow named William Sealy Gosset developed the statistical tools for simple-comparative experiments (SCE) in the early 1900s. As Head Experimental Brewer for Guiness in Dublin, he evaluated hops from various regions soft resin content—a critical ingredient for optimizing the bitterness on preserving their beer.1 To compare the results from one source versus another with statistical rigor, Gosset invented the t-test—a great tool for DOE even today (and far easier to do with modern software!).

The t-test simply compares two means relative to the standard deviation of the difference. The result can be easily interpreted with a modicum of knowledge about normal distributions: As t increases beyond 2 standard deviations, the difference becomes more and more significant. Gosset’s breakthrough came by his adjustment of the distribution for small sample sizes, which make the tails on the bell shape curve slightly fatter and the head somewhat lower as shown in Figure 1. The correction, in this case for a test comparing a sample of 4 results for one level versus 4 at the other, is minor but very important to get the statistics right.

Figure 1. Normal curve versus t-distribution (probabilities plotted by standard deviations from zero)

To illustrate a simple comparative DOE, consider a case study on the filling of 16-ounce plastic bottles with two production machines—line 1 and line 2.2 The packaging engineers must assess whether they differ. To make this determination, they set up an experiment to randomly select 10 bottles from each machine. Stat-Ease software makes this easy via its Factorial, Randomized, Multilevel Categorical design option as shown by the screen shot in Figure 2.

Figure 2. Setting up a simple comparative DOE in Stat-Ease software

The resulting volumes in ounces are shown below (mean outcome shown in parentheses).

- 16.03, 16.04, 16.05, 16.05, 16.02, 16.01, 15.96, 15.98, 16.02, 15.99 (16.02)
- 16.02, 15.97, 15.96, 16.01, 15.99, 16.03, 16.04, 16.02, 16.01, 16.00 (16.01)

Stat-Ease software translates the mean difference between the two machines (0.01 ounce) a t value of 0.7989, that is, less than one standard deviation apart, which produces a p-value of 0.4347—far above the generally acceptable standard of p<0.05 for significance. Its Model Graph in Figure 3 displays all the raw data, the means of each level and their least significant difference (LSD) bars based on a t test at p of 0.05—notice how they overlap from left to right—clearly the difference is not significant.

Figure 3. Graph showing effect on fill from one machine line to the other

Thus, from the stats and at first glance of the effect graph it seems that the packaging engineers need not worry about any differences between the two machine lines. But hold on before jumping to a final conclusion: What if a difference of 0.01 ounce adds up to a big expense over a long period of time? The managers overseeing the annual profit and loss for the filling operation would then be greatly concerned. Before doing any designed experiment, it pays to do a power calculation to work out how many runs are needed to see a minimal difference (signal ‘delta’) of importance relative to the variation (noise ‘sigma). In this case, the power for sample size 10 for a delta of 0.01 ounce with a sigma (standard deviation) of 0.028 ounces (provided by Stat-Ease software) generates a power of only 11.8%—far short of the generally acceptable level of 80%. Further calculations reveal that if this small of a difference really needed to be detected, they should fill 125 or more bottles on each line.

In conclusion, it turns out that simple comparative DOEs are not all that simple to do correctly from a statistical perspective. Some keys to getting these two level OFAT experiments done right are:

- Randomizing the run order (a DOE fundamental for washing out the impact of time-related lurking factors such as steadily increasing temperature or humidity).
- Performing at least 4 runs at each level—more if needed to achieve adequate power (always calculate this before pressing ahead!).
- Blocking out know sources of variation via a paired t-test,3 e.g., when assessing two runners, rather than them each running a number of time trials one after the other, race them together side-by-side, thus eliminating the impact of changing wind and other environmental conditions.
- Always deploying a non-directional two-tailed t-test4 (a fun alliteration!)—as done by default in Stat-Ease software; the option for a one-tailed t-test requires an assumption that one level of the tested factor will certainly be superior to the other (i.e., directional), which may produce false-positive significance; before going this route consult with our StatHelp consulting team.

- For more background on Gosset and his work for Guiness, see my 8/9/24 StatsMadeEasy blog on The secret sauce in Guinness beer?
- From Chapter 2, “Simple Comparative Experiments”, problem 2.24,
*Design and Analysis of Experiments, 8th Edition*, Douglas C. Montgomery, John Wiley and Sons, New York, NY, 2013. - “Letter to a Young Statistician: On ‘Student’ and the Lanarkshire Milk Experiment”,
*Chance Magazine*: Volume 37, No. 1, Stephen T. Ziliak. - Wikipedia, One- and two-tailed tests.

- One Factor tutorials in Program Help.
- Stat-Ease Academy eLearning PreDOE course, which includes a t-statistic software tutorial.

In a previous Stat-Ease blog, my colleague Shari Kraber provided insights into Improving Your Predictive Model via a Response Transformation. She highlighted the most commonly used transformation: the log. As a follow up to this article, let’s delve into another transformation: the square root, which deals nicely with count data such as imperfections. Counts follow the Poisson distribution, where the standard deviation is a function of the mean. This is not normal, which can invalidate ordinary-least-square (OLS) regression analysis. An alternative modeling tool, called Poisson regression (PR) provides a more precise way to deal with count data. However, to keep it simple statistically (KISS), I prefer the better-known methods of OLS with application of the square root transformation as a work-around.

When Stat-Ease software first introduced PR, I gave it a go via a design of experiment (DOE) on making microwave popcorn. In prior DOEs on this tasty treat I worked at reducing the weight of off-putting unpopped kernels (UPKs). However, I became a victim of my own success by reducing UPKs to a point where my kitchen scale could not provide adequate precision.

With the tools of PR in hand, I shifted my focus to a count of the UPKs to test out a new cell-phone app called Popcorn Expert. It listens to the “pops” and via the “latest machine learning achievements” signals users to turn off their microwave at the ideal moment that maximizes yield before they burn their snack. I set up a DOE to compare this app against two optional popcorn settings on my General Electric Spacemaker™ microwave: standard (“GE”) and extended (“GE++”). As an additional factor, I looked at preheating the microwave with a glass of water for 1 minute—widely publicized on the internet to be the secret to success.

Table 1 lays out my results from a replicated full factorial of the six combinations done in random order (shown in parentheses). Due to a few mistakes following the software’s plan (oops!), I added a few more runs along the way, increasing the number from 12 to 14. All of the popcorn produced tasted great, but as you can see, the yield varied severalfold.

A: | B: | UPKs | ||
---|---|---|---|---|

Preheat |
Timing |
Rep 1 | Rep 2 | Rep 3 |

No | GE | 41 (2) | 92 (4) | |

No | GE++ | 23 (6) | 32 (12) | 34 (13) |

No | App | 28 (1) | 50 (8) | 43 (11) |

Yes | GE | 70 (5) | 62 (14) | |

Yes | GE++ | 35 (7) | 51 (10) | |

Yes | App | 50 (3) | 40 (9) |

I then analyzed the results via OLS with and without a square root transformation, and then advanced to the more sophisticated Poisson regression. In this case, PR prevailed: It revealed an interaction, displayed in Figure 1, that did not emerge from the OLS models.

Figure 1: Interaction of the two factors—preheat and timing method

Going to the extended popcorn timing (GE++) on my Spacemaker makes time-wasting preheating unnecessary—actually producing a significant reduction in UPKs. Good to know!

By the way, the app worked very well, but my results showed that I do not need my cell phone to maximize the yield of tasty popcorn.

To succeed in experiments on counts, they must be:

- discrete whole numbers with no upper bound
- kept with within over a fixed area of opportunity
- not be zero very often—avoid this by setting your area of opportunity (sample size) large enough to gather 20 counts or more per run on average.

For more details on the various approaches I’ve outlined above, view my presentation on Making the Most from Measuring Counts at the Stat-Ease YouTube Channel.

My colleague Richard Williams just completed a very thorough three-part series of blogs detailing experiment designs aimed at building robustness against external noise factors, internal process variation, and combinations of both. In this follow up, I present another, more simplistic, approach to achieve on target results with minimal variation: Model not only the mean outcome but also the standard deviation. Experimenters making multiple measurements for every run in their design often overlook this opportunity.

For example, consider the paper helicopter experiment done by students of my annual DOE class at South Dakota Mines. The performance of these flying machines depends on paper weight, wing and body dimensions and other easily controlled factors such as putting on a paper clip to stabilize rotation. To dampen down variability in launching and air currents, students are strongly encouraged to drop each of their ‘copters three times and model the means of the flight time and distance from target. I also urge them to analyze the standard deviations of these two measures. Those who do discover that ‘copters without paper clips exhibit significantly higher variability in on-target landings. This can be seen in the interaction plot pictured, which came from a split plot factorial on paper helicopters done by me and colleagues at Stat-Ease (see this detailed here).

Putting on a paper clip dramatically decreased the standard deviation of distance from target for wide bodied ‘copters, but not for narrow bodied ones. Good to know!

When optimizing manufacturing processes via response surface methods, measuring variability as well as the mean response can provide valuable insights. For example, see this paper by me and Pat Whitcomb on Response Surface Methods (RSM) for Peak Process Performance at the Most Robust Operating Conditions for more details. The variability within the sample collection should represent the long-term variability of the process. As few as three per experimental run may be needed with the proper spacing.

By simply capturing the standard deviation, experimenters become enabled to deal with unknown external sources of variation. If the design is an RSM, this does not preempt them from also applying propagation of error (POE) to minimize internal variation transmitted to responses from poorly controlled process factors. However, to provide the greatest assurance for a robust operating system, take one of the more proactive approaches suggested by Richard.

The goal of robustness studies is to demonstrate that our processes will be successful upon implementation in the field when they are exposed to anticipated noise factors. There are several assumptions and underlying concepts that need to be understood when setting out to conduct a robustness study. Carefully considering these principles, three distinct types of designs emerge that address robustness:

**I.** Having settled on process settings, we desire to demonstrate the system is insensitive to external noise-factor variation, i.e., robust against Z factor influence.

**II.** Given we may have variation between our selected process settings and the actual factor conditions that may be seen in the field, we wish to find settings that are insensitive to this variation. In other words, we may set our controlled X factors, but these factors wander from their set points and cause variation in our results. Our goal is to achieve our desired Y values while minimizing the variation stemming from imperfect process factor control. The impact of external noise factors (Z’s) is not explored in this type of study.

**III.** Given a system having controllable factors (X’s) and non-controllable noise factors (Z’s) impacting one or more desirable properties (Y’s), we wish to find the ideal settings for the controllable factors that simultaneously maximize the properties of interest, while minimizing the impact of variation from both types of noise factors.

The idea here is to simultaneously involve the process factors (X’s) and the noise factors (Z’s) in the same DOE so as to identify the right process (controllable) factor settings to deliver the intended responses (Y’s) with the minimal variation when both the process and noise factors vary.

One of the first to consider this holistic approach was Taguchi in the 1980’s. Taguchi envisioned a factorial space whereby the controllable factors are changed in the usual way, and the noise factors are studied at each corner of the factorial space as a secondary factorial design. In his vocabulary, there was an inner array (of controllable factors) and an outer array (of noise factors). The design principle was as shown below, with 16 data points collected as indicated in blue (for 2 process factors, and two noise factors).

The principle of Taguchi’s approach is sound. There were, however, challenges made regarding the analytical approach and that led to further efforts by others to advance the science. Several concepts have been proposed. A solid candidate would be a dual response surface approach, where process factors and noise factors are combined in the same study, and two responses are measured: the process mean (predicted Y values), and also the variance for the predicted Y values at any given point within the design space. Armed with this knowledge the experimenter can seek regions within the design space where the desired Y values are achieved but are also relatively insensitive to the variation of both process and noise factors.

How are these dual response surface studies done? Essentially the same as described before under Type II Robustness studies, with one key exception. The external noise factors are included as factors within the study, and their influence evaluated as though they were controllable factors (which they are, of course, during the DOE itself). And the propagated error from all factors upon the responses are evaluated as per Type II.

The difference comes during the numeric optimization. Since Z factors cannot be controlled in the field, these factors are set to their nominal value (usually the center of the range studied) in numeric optimization. The standard deviation for these factors will still influence the POE assessment for responses.

Then, the experimenter can use numeric optimization to achieve the desirable Y value criteria while simultaneously minimizing the POE response, resulting in the identification of a robust region that is relative insensitive to variation in X and Z factor variation.

For additional information on using a combined array, and an introduction to using POE to address both process factor and noise factor variation, read the 2002 white paper by Mark Anderson and Shari Kraber on Cost-Effective and Information-Efficient Robust Design For Optimizing Processes And Accomplishing Six Sigma Objectives.

The goal of robustness studies is to demonstrate that our processes will be successful upon implementation in the field when they are exposed to anticipated noise factors. There are several assumptions and underlying concepts that need to be understood when setting out to conduct a robustness study. Carefully considering these principles, three distinct types of designs emerge that address robustness:

**I.** Having settled on process settings, we desire to demonstrate the system is insensitive to external noise-factor variation, i.e., robust against Z factor influence.

**II.** Given we may have variation between our selected process settings and the actual factor conditions that may be seen in the field, we wish to find settings that are insensitive to this variation. In other words, we may set our controlled X factors, but these factors wander from their set points and cause variation in our results. Our goal is to achieve our desired Y values while minimizing the variation stemming from imperfect process factor control. The impact of external noise factors (Z’s) is not explored in this type of study.

**III.** Given a system having controllable factors (X’s) and non-controllable noise factors (Z’s) impacting one or more desirable properties (Y’s), we wish to find the ideal settings for the controllable factors that simultaneously maximize the properties of interest, while minimizing the impact of variation from both types of noise factors.

In this type of analysis, we aim to find the process factor settings that satisfy our requirements and be the most insensitive to expected variations in those settings. For example, we may decide baking temperature and baking time impact the rise height of bread, per the results of a lab-scale DOE. But we anticipate that on an industrial scale, changes in conveyor speed could impact baking time and perhaps that large ovens may cycle in temperature, giving rise to variation.

We can use propagation of error (POE) in our time-temperature DOE to find the sweet spot where such fluctuations in process factors yield the smallest amount of variation in results, i.e., the most robust settings to for success in the field.

For additional detail on using POE as a tool for robust design see Pat Whitcomb’s 2020 Overview of Robust Design, Propagation of Error and Tolerance Analysis.