Comparing two methods in a chemical manufacturing plant in an attempt to understand if one has a significantly better yield than the other
Based on the book by Box, Hunter, and Hunter: Statistics for Experimenters
Chapter 3: Comparing Two Entities: Reference Distributions, Tests, and Confidence Intervals
In Statistics for Experimenters we are shown an example of comparing two methods in a chemical manufacturing plant to determine whether one method has a significantly better yield than the other. I will present the data and the original analysis, and then we'll model this comparison using Bayesian methods.
An experiment was performed at a manufacturing plant by making, in sequence, \(10\) batches of a chemical using a standard production method \((A)\), followed by \(10\) batches using a modified method \((B)\). What evidence do the data provide that method \(B\) gives higher yields than method \(A\)?
using DataFrames, CairoMakie, Statistics
# prepare dataframe
time = 1:1:20
method = repeat(["A", "B"], inner = 10)
yield = [89.7, 81.4, 84.5, 84.8, 87.3, 79.7, 85.1, 81.7, 83.7, 84.5, 84.7, 86.1,
         83.2, 91.9, 86.3, 79.3, 82.6, 89.1, 83.7, 88.5]
data = DataFrame(time = time, method = method, yield = yield)
first(data, 5)
time | method | yield |
---|---|---|
1 | A | 89.7 |
2 | A | 81.4 |
3 | A | 84.5 |
4 | A | 84.8 |
5 | A | 87.3 |
The averages for both methods are shown below, giving a crude difference in yield of about \(1.3\). But is this average difference due to random chance?
\[ \bar y_A = 84.24, \bar y_B = 85.54 \]
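These averages can be checked directly from the data frame (a quick snippet, not part of the original listing, using the `data` defined above):

# mean yield per method; the difference is about 1.3
combine(groupby(data, :method), :yield => mean => :avg_yield)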
# plot attributes
CairoMakie.activate!(type = "svg")
CairoMakie.set_theme!(theme_minimal())
fig = Figure(resolution = (650, 300))
ax = fig[1, 1] = Axis(fig, xlabel = "time", ylabel = "yield", xticks = 1:1:20)

# calculate the mean for each group
datagroup = groupby(data, :method)
datamean = transform(datagroup, :yield => mean)

methoda = datamean[datamean.method .== "A", :]
methodb = datamean[datamean.method .== "B", :]

# plot it out
scatter!(methoda.time, methoda.yield, label = "Process A")
lines!(methoda.time, methoda.yield_mean)
scatter!(methodb.time, methodb.yield, label = "Process B")
lines!(methodb.time, methodb.yield_mean)
fig[2, 1] = Legend(fig, ax, framevisible = false, orientation = :horizontal)
fig
Let the sample averages for each method be \(\bar y_A\) and \(\bar y_B\), the difference between the population means be \(\delta = \eta_B - \eta_A\), the pooled estimate of \(\sigma\) be \(s\), and the number of observations for each method be \(n_A\) and \(n_B\). Then we can calculate the \(t\)-statistic:
\[ t_0 = \frac{(\bar y_B - \bar y_A) - \delta_0}{s \sqrt{1 / n_B + 1 / n_A}} \]
Under the null hypothesis \(\delta_0 = 0\), so \(t_0 = 1.30 / 1.47 = 0.88\) with \(\nu = 18\) degrees of freedom. The significance level is then:
\[ \Pr(t \ge 0.88) = 19.5\% \]
This reference distribution assumes the random sampling model; under it the observed difference has a significance probability of \(19.5\%\), which provides little evidence against the hypothesis that \(\eta_B - \eta_A = 0\).
using Distributions
# number of obs
n_a = nrow(methoda)
n_b = nrow(methodb)

# average yield per method
avg_a = mean(methoda.yield)
avg_b = mean(methodb.yield)
diff_ab = avg_b - avg_a

# sum of squared deviations about each method's mean
s2_a = sum((methoda.yield .- avg_a).^2)
s2_b = sum((methodb.yield .- avg_b).^2)

# degrees of freedom
nu = n_a + n_b - 2

# pooled estimate of σ^2
pools2 = (s2_a + s2_b) / nu

# estimated standard error of y_B - y_A
std_err = sqrt(pools2 * (1 / n_b + 1 / n_a))

# t-statistic
tstat = diff_ab / std_err

# significance level: Pr(t >= t0)
sig_level = 1 - cdf(TDist(nu), tstat)
We can visualize the result of this test by scaling a \(t\)-distribution by the standard error of the difference between \(\bar y_B\) and \(\bar y_A\) and shading the probability density above our estimated mean difference of \(1.3\); approximately \(19.5\%\) of the probability mass sits above this difference:
# plotting attributes
fig2 = Figure(resolution = (600, 300))
ax2 = Axis(fig2[1, 1], title = "Scaled t-distribution")

# a t-distribution scaled by the standard error of y_B - y_A
scaled_dist = LocationScale(0.0, std_err, TDist(nu))

# density line and the values clipped below the observed difference
xs2 = -5:0.001:5
post_range = pdf.(scaled_dist, xs2)
post_clip = [x < diff_ab ? 0 : pdf(scaled_dist, x) for x in xs2]

# plot
lines!(xs2, post_range)
band!(xs2, 0, post_clip)
vlines!(ax2, [diff_ab], color = :black, linestyle = :dash)
hideydecorations!(ax2)
fig2
Choosing our priors is the first step in this approach. Since we’ll be modeling the average yield for each method I will begin by considering how yield is defined and what that can tell us about its order of magnitude.
Percent yield is a measurement of how successful a chemical reaction has been:
…the theoretical yield, the maximum amount of product that can be formed from the given amounts of reactants. The actual yield is the amount of product that is actually formed when the reaction is carried out in the laboratory. The percent yield is the ratio of the actual yield to the theoretical yield, expressed as a percentage. \[ \text{Percent Yield} = \frac{\text{Actual Yield}}{\text{Theoretical Yield}} \times 100\% \]
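For example, if a reaction could at most produce \(50\) g of product but only \(42\) g is actually recovered, the percent yield is \(42 / 50 \times 100\% = 84\%\).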
Considering that yield here is a percentage, I think a reasonable prior probability distribution for the average yield is a Beta distribution:
# plotting attributes
fig3 = Figure(resolution = (550, 250))
ax3 = Axis(fig3[1, 1])

# candidate Beta priors
beta_dist = Beta(1, 1)
beta_dist2 = Beta(2, 2)
beta_dist3 = Beta(2, 5)

# density lines over [0, 1]
xs3 = 0.0:0.001:1.0
post_range2 = pdf.(beta_dist, xs3)
post_range3 = pdf.(beta_dist2, xs3)
post_range4 = pdf.(beta_dist3, xs3)

# plot
lines!(xs3, post_range2, label = "Beta(1, 1)")
lines!(xs3, post_range3, label = "Beta(2, 2)")
lines!(xs3, post_range4, label = "Beta(2, 5)")
hideydecorations!(ax3)
axislegend()
fig3
Since percent yield could reasonably be anywhere between \(0\) and \(1\), but with values near those extremes being very unlikely, I think a Beta(2, 2) is an adequate weakly informative prior.
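As a quick check (a small snippet, not in the original code), the central \(95\%\) of a Beta(2, 2) spans roughly \(0.09\) to \(0.91\), so extreme average yields are downweighted without being ruled out:

# 2.5% and 97.5% quantiles of the Beta(2, 2) prior
quantile.(Beta(2, 2), [0.025, 0.975])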
We will model the variance as pooled between the two methods: two means but one shared variance. The standard deviation must be positive, and since the yields will be rescaled to \([0, 1]\) for modeling, it should also be well below \(1\).
# plotting attributes
fig4 = Figure(resolution = (550, 250))
ax4 = Axis(fig4[1, 1])

# candidate Exponential priors for σ
exp_dist = Exponential(0.50)
exp_dist2 = Exponential(0.25)
exp_dist3 = Exponential(0.10)

# density lines over [0, 1]
xs4 = 0.0:0.001:1.0
prior_range = pdf.(exp_dist, xs4)
prior_range2 = pdf.(exp_dist2, xs4)
prior_range3 = pdf.(exp_dist3, xs4)

# plot
lines!(xs4, prior_range, label = "Exponential(0.50)")
lines!(xs4, prior_range2, label = "Exponential(0.25)")
lines!(xs4, prior_range3, label = "Exponential(0.10)")
hideydecorations!(ax4)
axislegend()
fig4
Given that the Exponential(0.1) distribution places almost all of its probability mass below \(1\), it will be an adequate prior for \(\sigma\).
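We can confirm how little prior mass sits above \(1\) (a quick check, not in the original code):

# prior probability that σ exceeds 1 under an Exponential(0.10); about 5e-5
ccdf(Exponential(0.10), 1.0)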
\(\mu_A\) and \(\mu_B\) are the estimated population means for each of the methods and they have a shared variance term \(\sigma^2\). The priors for these are the following:
\[ \begin{aligned} &\mu_A \sim \text{Beta}(2,2) \\ &\mu_B \sim \text{Beta}(2,2) \\ &\sigma \sim \text{Exp}(0.10) \end{aligned} \]
The likelihood is broken into two parts for each of the methods:
\[ \begin{aligned} &y_A \sim \text{Normal}(\mu_A, \sigma) \\ &y_B \sim \text{Normal}(\mu_B, \sigma) \end{aligned} \]
Multiplying Beta and Exponential priors with a Normal likelihood is a non-conjugate case, so in this example it makes more sense to use Markov chain Monte Carlo (MCMC) methods to sample from our joint probability model.
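Up to a normalizing constant, the joint posterior that the sampler targets is the product of the priors and the two Normal likelihoods:
\[ p(\mu_A, \mu_B, \sigma \mid y) \propto p(\mu_A)\, p(\mu_B)\, p(\sigma) \prod_{i=1}^{n_A} \text{Normal}(y_{A,i} \mid \mu_A, \sigma) \prod_{j=1}^{n_B} \text{Normal}(y_{B,j} \mid \mu_B, \sigma) \]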
Markov Chain Monte Carlo gets part of its name from the Markov property:
Given the present, the future is independent of the past.
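Formally, for a chain of states \(\theta^{(1)}, \theta^{(2)}, \ldots\) the Markov property can be written as:
\[ \Pr(\theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \ldots, \theta^{(1)}) = \Pr(\theta^{(t+1)} \mid \theta^{(t)}) \]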
I decided to implement the Metropolis algorithm as a pedagogical exercise below, but I'd definitely recommend using probabilistic programming languages like Stan, Turing, or PyMC in practice.
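For reference, here is a minimal sketch of how the same model might be written in Turing.jl (my own addition, not part of the original analysis; it is not run here and uses the rescaled `methoda_yield` and `methodb_yield` defined below):

using Turing

@model function yield_model(y_a, y_b)
    # priors
    μ_a ~ Beta(2, 2)
    μ_b ~ Beta(2, 2)
    σ ~ Exponential(0.10)

    # likelihood for each method, sharing σ
    y_a .~ Normal(μ_a, σ)
    y_b .~ Normal(μ_b, σ)
end

# posterior_samples = sample(yield_model(methoda_yield, methodb_yield), NUTS(), 2_000)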
Using the Metropolis algorithm to sample the target distribution and removing burn-in samples results in the following summary for our parameters:
# rescale yields to the [0, 1] scale assumed by the Beta priors
methoda_yield = methoda.yield / 100
methodb_yield = methodb.yield / 100

function likelihood(param)
    mu1 = param[1]
    mu2 = param[2]
    sd = param[3]

    singlelike1 = logpdf.(Normal(mu1, sd), methoda_yield)
    singlelike2 = logpdf.(Normal(mu2, sd), methodb_yield)
    singlelikelihood = vcat(singlelike1, singlelike2)
    sumll = sum(singlelikelihood)

    return(sumll)
end

function prior(param)
    mu1 = param[1]
    mu2 = param[2]
    sd = param[3]
    mu1prior = logpdf(Beta(2, 2), mu1)
    mu2prior = logpdf(Beta(2, 2), mu2)
    sdprior = logpdf(Exponential(0.10), sd)
    return(mu1prior + mu2prior + sdprior)
end

function posterior(param)
    return likelihood(param) + prior(param)
end

function proposalfunction(param)
    # reflect proposals to enforce positive support; PPLs typically handle
    # such constraints with parameter transformations instead
    return abs.(rand(MvNormal(param, [0.001 0 0; 0 0.001 0; 0 0 0.0005])))
end

function run_metropolis_MCMC(startvalue, iterations)
    chain = Array{Float64,2}(undef, iterations + 1, 3)
    chain[1, :] = startvalue

    for i in 1:iterations
        proposal = proposalfunction(chain[i, :])

        # Metropolis acceptance probability, computed on the log scale
        probab = exp(posterior(proposal) - posterior(chain[i, :]))
        r = min(1, probab)

        if rand(Uniform()) <= r
            chain[i + 1, :] = proposal
        else
            chain[i + 1, :] = chain[i, :]
        end
    end
    return chain
end

startvalue = [0.7, 0.7, 0.1]
chain_length = 100_000

using Random: seed!
seed!(555)
chain = run_metropolis_MCMC(startvalue, chain_length)

# keep the second half of the chain; the first half is treated as burn-in
burnIn = convert(Int, chain_length / 2)
chain1 = last(chain[:, 1], burnIn)
chain2 = last(chain[:, 2], burnIn)
chain3 = last(chain[:, 3], burnIn)

DataFrame(
    mu_a = mean(chain1),
    mu_astd = std(chain1),
    mu_b = mean(chain2),
    mu_bstd = std(chain2),
    sigma = mean(chain3),
    sigma_std = std(chain3)
)
mu_a | mu_astd | mu_b | mu_bstd | sigma | sigma_std |
---|---|---|---|---|---|
0.841692 | 0.0112674 | 0.854442 | 0.0113898 | 0.0352205 | 0.00627626 |
After using MCMC to sample this model we should throw away some portion of our chain. A rule of thumb for Metropolis sampling is to treat the first half of the samples as the burn-in (exploratory) phase, discard them, and keep only the second half.
figs = Figure(resolution = (800, 400))
ga = figs[1, 1] = GridLayout()
gb = figs[2, 1] = GridLayout()
gc = figs[3, 1] = GridLayout()
axs1 = Axis(ga[1, 1])
axs2 = Axis(gb[1, 1])
axs3 = Axis(gc[1, 1])

# trace plots for each parameter
lines!(axs1, 1:1:(chain_length + 1), chain[:, 1])
lines!(axs2, 1:1:(chain_length + 1), chain[:, 2])
lines!(axs3, 1:1:(chain_length + 1), chain[:, 3])

# shade the discarded burn-in half of each chain
vspan!(axs1, 0, chain_length / 2, color = (:gray, 0.2))
vspan!(axs2, 0, chain_length / 2, color = (:gray, 0.2))
vspan!(axs3, 0, chain_length / 2, color = (:gray, 0.2))

Label(ga[1, 1, Top()], L"\mu_A", valign = :bottom, textsize = 20)
Label(gb[1, 1, Top()], L"\mu_B", valign = :bottom, textsize = 20)
Label(gc[1, 1, Top()], L"\sigma", valign = :bottom, textsize = 20)
hidexdecorations!.((axs1, axs2))
figs
Here are the marginal distributions for each of our estimated parameters:
figs2 = Figure(resolution = (1920 / 2, 1080 / 2))
ga2 = figs2[1, 1] = GridLayout()
gb2 = figs2[2, 1] = GridLayout()
gc2 = figs2[3, 1] = GridLayout()
axs11 = Axis(ga2[1, 1])
axs21 = Axis(gb2[1, 1])
axs31 = Axis(gc2[1, 1])

# histograms of the post-burn-in samples
hist!(axs11, chain1)
hist!(axs21, chain2)
hist!(axs31, chain3)
linkxaxes!(axs11, axs21)
xlims!(axs31, (0.0, 0.1))
hideydecorations!.((axs11, axs21, axs31))

Label(ga2[1, 1, Top()], L"\mu_A", valign = :bottom, textsize = 30)
Label(gb2[1, 1, Top()], L"\mu_B", valign = :bottom, textsize = 30)
Label(gc2[1, 1, Top()], L"\sigma", valign = :bottom, textsize = 30)
figs2
Now that we have samples for the mean of both methods, how can we compare them? Simply taking the difference of the samples gives us a distribution for the difference between the two means.
What's gained from this approach is that we can now see the probability of method \(B\) having a higher yield than method \(A\). Counting the proportion of samples above \(0\) shows that roughly \(80\%\) of the probability mass is above zero.
# difference in means, rescaled back to the original percentage units
meandiff = (chain2 - chain1) * 100
# @show probgt0 = mean(meandiff .> 0) # just for checking

fig5 = Figure(resolution = (550, 250))
ax5 = Axis(fig5[1, 1], title = L"\Pr(\mu_B - \mu_A \geq 0) = 80\%")
density!(meandiff, bandwidth = 0.4, color = (:grey, 0.5), strokewidth = 1)
vlines!(ax5, [0.0], color = :black, linestyle = :dash)
xlims!(-8, 8)
hideydecorations!(ax5)
fig5
The frequentist \(t\)-test approach results in a non-significant value: we fail to reject the null hypothesis that these samples come from a single common population distribution.
The Bayesian approach gives us an \(80\%\) probability that method \(B\) improves overall yield compared to method \(A\).
In a business context I could see the Bayesian result being far more useful than a failure to reject a null hypothesis. A company can weigh its options between method \(B\) and method \(A\), decide what's more economical, and has a concrete probability of improvement to wrap its head around.
This particular example was given in the text by Box, Hunter, and Hunter to demonstrate a case where the \(t\)-test may be an inappropriate statistical test and the \(t\)-distribution a poor choice of reference distribution. The reason is autocorrelation in the data: Student's \(t\)-test assumes the errors are independent and identically distributed (iid).
The exact reason behind the correlated errors was given:
In this particular process the yield was determined by measuring the volume of liquid product in each successive batch, but because of the design of the system, a certain amount of product was left in the pipes and pumps at each determination. Consequently, if slightly less was pumped out on a particular day, the recorded yield would be low but on the following day it would tend to be high… Such an effect would produce a negative correlation between successive yields such as that found.
Moving forward we should account for this and attempt to model the autocorrelation that exists between successive yields.
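As a starting point (a small sketch of my own, assuming the StatsBase package is available), we can estimate the lag-1 autocorrelation of the deviations of each batch from its own method's mean, kept in run order:

using StatsBase

# deviation of each batch from its method's mean, in run order
resid = [row.method == "A" ? row.yield - avg_a : row.yield - avg_b for row in eachrow(data)]

# lag-1 autocorrelation of successive batches
autocor(resid, [1])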