Design of Experiments

The goal of an experiment is to determine a causal relationship between the factors changed during the experiment and an outcome. Determining a causal relationship through observational studies alone is generally not possible because of confounding. Depending on the industry, this process is also referred to as A/B testing, hypothesis testing, or simply testing. These types of experiments support human decision-making; model-based design of experiments (DoE) is covered elsewhere on this site.

To remove the influence of confounding factors, an experimenter changes a relatively small number of factors for a portion of the overall population, while the rest of the population receives the "normal" treatment (control). The performance and behavior of the two groups are then compared using statistics to determine whether the differing treatment had an effect on the outcome. The most basic approach is to change one factor at a time; while this is not strictly necessary, it usually produces the most interpretable results and carries less risk of executing the experiment incorrectly.
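Random assignment is what keeps the two groups comparable. As a minimal sketch (the function name and parameters are illustrative, not a prescription), assignment can be done by shuffling the units and splitting them:

```python
import random

def assign_groups(units, treatment_fraction=0.5, seed=42):
    """Randomly split units into treatment and control groups."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(units)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * treatment_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]  # (treatment, control)

# 20% of 100 units receive the treatment; the rest form the control group.
treatment, control = assign_groups(range(100), treatment_fraction=0.2)
print(len(treatment), len(control))  # 20 80
```

Because every unit has the same chance of landing in either group, any confounding factor is (on average) balanced across the two groups.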

For the output of the experiment to be actionable, the experimenter needs to collect enough samples to be confident in the results. The more samples in the experiment, the lower the risk of declaring an effect when in reality there was none (Type I error) or of missing an effect that is really there (Type II error). Please feel free to use the sample size and minimum detectable effect calculator to the right.

Sample Size Calculator

[Interactive calculator: computes sample size or minimum detectable effect for one-tailed and two-tailed tests.]

Definitions

Sample Size Calculations

alpha: The probability of a false positive (Type I) error, i.e. the experimenter declares the effect met their thresholds when in reality noise drove the outcome. Useful to think of it as 1 - confidence level. Commonly set to 0.05 or 0.1.

beta: The probability of a false negative (Type II) error, i.e. the experimenter declares the effect did not meet their thresholds when in reality noise suppressed it. Useful to think of it as 1 - power. Commonly set to 0.2 (i.e. 80% power).

variance: A statistical measure of how variable the population is: the average of the squared differences between the individual observations and the mean. (A sample estimate divides by n - 1 rather than the number of observations.)

delta: The absolute difference between the treatment and control group means that the experiment is designed to detect.
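Putting these definitions together under the standard two-sample z-test approximation (an assumption; the calculator on this page may differ in details such as rounding), the required sample size per group can be sketched in Python using only the standard library:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(alpha, beta, variance, delta, two_tailed=True):
    """n per group = 2 * (z_alpha + z_beta)^2 * variance / delta^2."""
    tail = alpha / 2 if two_tailed else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(1 - beta)   # critical value for power
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Two-tailed test, alpha = 0.05, power = 80% (beta = 0.2),
# variance = 1, smallest effect of interest delta = 0.5:
print(sample_size_per_group(0.05, 0.2, 1.0, 0.5))  # 63 per group
```

Note how the sample size scales: halving delta quadruples the required n, since delta enters the formula squared.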

Minimum Detectable Effect Calculations

alpha: The probability of a false positive (Type I) error, i.e. the experimenter declares the effect met their thresholds when in reality noise drove the outcome. Useful to think of it as 1 - confidence level. Commonly set to 0.05 or 0.1.

beta: The probability of a false negative (Type II) error, i.e. the experimenter declares the effect did not meet their thresholds when in reality noise suppressed it. Useful to think of it as 1 - power. Commonly set to 0.2 (i.e. 80% power).

variance: A statistical measure of how variable the population is: the average of the squared differences between the individual observations and the mean. (A sample estimate divides by n - 1 rather than the number of observations.)

proportion: The fraction of the total sample that receives the treatment.

n: The total sample size, i.e. the sum of the sizes of all of the treatment and control groups.
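Inverting the same relationship gives the minimum detectable effect for a fixed total sample size n, with a fraction `proportion` of the sample in the treatment group. This sketch assumes the same two-sample z-test approximation (the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(alpha, beta, variance, proportion, n,
                              two_tailed=True):
    """MDE = (z_alpha + z_beta) * sqrt(variance / (n * p * (1 - p)))."""
    tail = alpha / 2 if two_tailed else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(1 - beta)
    # Variance of the difference in means is
    # variance/(n*p) + variance/(n*(1-p)) = variance/(n*p*(1-p)).
    return (z_alpha + z_beta) * sqrt(variance / (n * proportion * (1 - proportion)))

# 126 total samples split evenly between treatment and control,
# alpha = 0.05 (two-tailed), 80% power, variance = 1:
print(round(minimum_detectable_effect(0.05, 0.2, 1.0, 0.5, 126), 2))  # 0.5
```

The p * (1 - p) term in the denominator is largest at p = 0.5, which is why an even split between treatment and control detects the smallest effect for a given total n.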