Comparing two methods in a chemical manufacturing plant in an attempt
to understand if one has a **significantly better** yield
than the other

Based on the book by Box, Hunter, and Hunter: *Statistics for
Experimenters*

Chapter 3: Comparing Two Entities: Reference Distributions, Tests, and Confidence Intervals

In *Statistics for Experimenters* we are shown an example of
comparing two methods in a chemical manufacturing plant in an attempt to
understand if one method has a **significantly better**
yield than the other. I will present the data and the original model and
then we’ll model this comparison using Bayesian methods.

- The original question and the frequentist approach to comparing means
- A Bayesian approach to the \(t\)-test
- Summary of our analysis results and potential improvements

An experiment was performed on a manufacturing plant by making

in sequence\(10\) batches of a chemical using a standard production method \((A)\) followed by \(10\) batches using a modified method \((B)\). What evidence do the data provide that method \(B\) gives higher yields than method \(A\)?

```
using DataFrames, CairoMakie, Statistics
# prepare dataframe
= 1:1:20
time = repeat(["A", "B"], inner = 10)
method = [89.7, 81.4, 84.5, 84.8, 87.3, 79.7, 85.1, 81.7, 83.7, 84.5, 84.7, 86.1,
yield 83.2, 91.9, 86.3, 79.3, 82.6, 89.1, 83.7, 88.5]
= DataFrame(time = time, method = method, yield = yield)
data first(data, 5)
```

time | method | yield |
---|---|---|

1 | A | 89.7 |

2 | A | 81.4 |

3 | A | 84.5 |

4 | A | 84.8 |

5 | A | 87.3 |

The averages are shown for both of the methods so we have a crude difference in yield of around \(1.3\) between these two methods. But is this average difference due to random chance?

\[ \bar y_A = 84.24, \bar y_B = 85.54 \]

```
# plot attributes
activate!(type = "svg")
CairoMakie.set_theme!(theme_minimal())
= Figure(resolution = (650, 300))
fig = fig[1, 1] = Axis(fig, xlabel = "time", ylabel = "yield", xticks = 1:1:20)
ax
# calculate the mean for each group
= groupby(data, :method)
datagroup = transform(datagroup, :yield => mean)
datamean
= datamean[datamean.method .== "A",:]
methoda = datamean[datamean.method .== "B",:]
methodb
# plot it out
scatter!(methoda.time, methoda.yield, label = "Process A")
lines!(methoda.time, methoda.yield_mean)
scatter!(methodb.time, methodb.yield, label = "Process B")
lines!(methodb.time, methodb.yield_mean)
2,1] = Legend(fig, ax, framevisible = false, orientation = :horizontal)
fig[ fig
```

Where the sample averages for each method are \(\bar y_A\) and \(\bar y_B\), the population difference is given as \(\delta = \eta_B - \eta_A\), the pooled estimate of \(\sigma\) is given as \(s\), and the number of observations for each method are \(n_A\) and \(n_B\). Then we can calculate the \(t\)-statistic:

\[ t_0 = \frac{(\bar y_B - \bar y_A) - \delta_0}{s \sqrt{1 / n_B + 1 / n_A}} \]

For the *null hypothesis* we have \(\delta_0 = 0, t_0 = 1.30/1.47 = 0.88\) with
\(\nu = 18\) degrees of freedom. Then
we want the following for our significance level:

\[ \Pr(t \ge 0.88) = 19.5\% \]

This reference distribution assumes the random sampling model and the observed difference then has a significance probability of \(19.5\%\) which would provide little evidence against the hypothesis that \(\eta_B - \eta_A = 0\).

```
using Distributions
# number of obs
= nrow(methoda)
n_a = nrow(methodb)
n_b
# average yield per method
= mean(methoda.yield)
avg_a = mean(methodb.yield)
avg_b = avg_b - avg_a
diff_ab
# sample variance
= sum((methoda.yield .- avg_a).^2)
s2_a = sum((methodb.yield .- avg_b).^2)
s2_b
# degrees of freedom
= n_a + n_b - 2
nu
# pooled estimate of σ^2
= (s2_a + s2_b) / nu
pools2
# estimated standard error of y_B - y_A
= sqrt(pools2/5)
std_err
# t-statistic
= diff_ab / std_err
tstat
# significance level
= 1 - cdf(TDist(nu), tstat) sig_level
```

Scaling a \(t\)-distribution by the standard error of the difference between \(\bar y_B\) and \(\bar y_A\) and looking at the probability density that exists above our estimated mean difference of \(1.3\) we can visualize the results of our test. Seeing approximately \(19.5 \%\) of the probability mass above this difference:

```
# plotting attributes
= Figure(resolution = (600, 300))
fig2 = Axis(fig2[1, 1], title = "Scaled t-distribution")
ax2
# A scaled t-distribution by the standard error of y_B - y_A
= LocationScale(0.0, std_err, TDist(nu))
scaled_dist
# get density line and clipped values
= -5:0.001:5
xs2 = pdf.(scaled_dist, xs2)
post_range = [x < diff_ab ? 0 : pdf(scaled_dist, x) for x in xs2]
post_clip
# plot
lines!(xs2, post_range)
band!(xs2, 0, post_clip)
vlines!(ax2, [diff_ab], color = :black, linestyle = :dash)
hideydecorations!(ax2)
fig2
```

Choosing our priors is the first step in this approach. Since we’ll
be modeling the average yield for each method I will begin by
considering how *yield* is defined
and what that can tell us about its order of magnitude.

Percent yield is a measurement of how successful a chemical reaction has been:

…the

theoretical yield, the maximum amount of product that can be formed from the given amounts of reactants. Theactual yieldis the amount of product that is actually formed when the reaction is carried out in the laboratory. Thepercent yieldis the ratio of the actual yield to the theoretical yield, expressed as a percentage. \[ \text{Percent Yield} = \frac{\text{Actual Yield}}{\text{Theoretical Yield}} \times 100\% \]

Considering that yield here is a *percentage* I’d think a
decent prior probability distribution for the average yield would be a
*Beta* distribution:

```
# plotting attributes
= Figure(resolution = (550, 250))
fig3 = Axis(fig3[1, 1])
ax3
= Beta(1, 1)
beta_dist = Beta(2, 2)
beta_dist2 = Beta(2, 5)
beta_dist3
# get density line and clipped values
= 0.:0.001:1
xs3 = pdf.(beta_dist, xs3)
post_range2 = pdf.(beta_dist2, xs3)
post_range3 = pdf.(beta_dist3, xs3)
post_range4
# plot
lines!(xs3, post_range2, label = "Beta(1, 1)")
lines!(xs3, post_range3, label = "Beta(2, 2)")
lines!(xs3, post_range4, label = "Beta(2, 5)")
hideydecorations!(ax3)
axislegend()
fig3
```

Since percent yield could reasonably be anywhere between \(0\) and \(1\) but with those outcomes being very
unlikely I think a Beta(2, 2) is an adequate *weakly informative*
prior.

We will be modeling the variance as being pooled between both methods: we will have two means but one shared variance. The variance must be positive and since our means are scaled between \([0,1]\) then the standard deviation should be close to this range.

```
# plotting attributes
= Figure(resolution = (550, 250))
fig4 = Axis(fig4[1, 1])
ax4
= Exponential(.50)
exp_dist = Exponential(.25)
exp_dist2 = Exponential(.10)
exp_dist3
# get density line and clipped values
= 0.:0.001:1.
xs4 = pdf.(exp_dist, xs4)
prior_range = pdf.(exp_dist2, xs4)
prior_range2 = pdf.(exp_dist3, xs4)
prior_range3
# plot
lines!(xs4, prior_range, label = "Exponential(0.50)")
lines!(xs4, prior_range2, label = "Exponential(0.25)")
lines!(xs4, prior_range3, label = "Exponential(0.10)")
hideydecorations!(ax4)
axislegend()
fig4
```

Given that the Exp(0.1) distribution is almost completely under \(1\) it will be an adequate prior for \(\sigma\).

\(\mu_A\) and \(\mu_B\) are the estimated population means for each of the methods and they have a shared variance term \(\sigma^2\). The priors for these are the following:

\[ \begin{aligned} &\mu_A \sim \text{Beta}(2,2) \\ &\mu_B \sim \text{Beta}(2,2) \\ &\sigma \sim \text{Exp}(0.10) \end{aligned} \]

The likelihood is broken into two parts for each of the methods:

\[ \begin{aligned} &y_A \sim \text{Normal}(\mu_A, \sigma) \\ &y_B \sim \text{Normal}(\mu_B, \sigma) \end{aligned} \]

Multiplying Beta and Exponential priors with a Normal likelihood results in a non-conjugate case. So in this example it may make more sense to utilize Markov Chain Monte Carlo (MCMC) methods to estimate our joint probability model.

*Markov Chain Monte Carlo* gets part of its name from the
*Markov property*:

Given the present, the future is independent of the past.

I decided to implement the *Metropolis* algorithm as a
pedagogical exercise below but I’d definitely recommend using
probabilistic programming languages like Stan, Turing, or PyMC in practice.

Using the Metropolis algorithm to sample the target distribution and
removing *burn-in* samples results in the following summary for
our parameters:

```
= methoda.yield / 100
methoda_yield = methodb.yield / 100
methodb_yield
function likelihood(param)
= param[1]
mu1 = param[2]
mu2 = param[3]
sd
= logpdf.(Normal(mu1, sd), methoda_yield)
singlelike1 = logpdf.(Normal(mu2, sd), methodb_yield)
singlelike2 = vcat(singlelike1, singlelike2)
singlelikelihood = sum(singlelikelihood)
sumll
return(sumll)
end
function prior(param)
= param[1]
mu1 = param[2]
mu2 = param[3]
sd = logpdf(Beta(2,2), mu1)
mu1prior = logpdf(Beta(2,2), mu2)
mu2prior = logpdf(Exponential(0.10), sd)
sdprior return(mu1prior + mu2prior + sdprior)
end
function posterior(param)
return likelihood(param) + prior(param)
end
function proposalfunction(param)
# enforced positive support but not sure how other PPLs enforce this
return abs.(rand(MvNormal(param, [0.001 0 0; 0 0.001 0; 0 0 0.0005])))
end
function run_metropolis_MCMC(startvalue, iterations)
= Array{Float64,2}(undef, iterations + 1, 3)
chain 1,:] = startvalue
chain[
for i in 1:iterations
= proposalfunction(chain[i,:])
proposal = exp(posterior(proposal) - posterior(chain[i,:]))
probab = min(1, probab)
r
if rand(Uniform()) <= r
+1,:] = proposal
chain[ielse
+1,:] = chain[i,:]
chain[iend
end
return chain
end
= [0.7, 0.7, 0.1]
startvalue = 100_000
chain_length
using Random:seed!
seed!(555)
= run_metropolis_MCMC(startvalue, chain_length)
chain
= convert(Int, chain_length / 2)
burnIn = last(chain[:,1], burnIn)
chain1 = last(chain[:,2], burnIn)
chain2 = last(chain[:,3], burnIn)
chain3
DataFrame(
= mean(chain1),
mu_a = std(chain1),
mu_astd = mean(chain2),
mu_b = std(chain2),
mu_bstd = mean(chain3),
sigma = std(chain3)
sigma_std )
```

mu_a | mu_astd | mu_b | mu_bstd | sigma | sigma_std |
---|---|---|---|---|---|

0.841692 | 0.0112674 | 0.854442 | 0.0113898 | 0.0352205 | 0.00627626 |

After utilizing MCMC to sample this model we should throw away some
portion of our chain. A rule-of-thumb for Metropolis sampling is to only
use half of your total samples and consider the first half the
*burn-in* (exploratory) phase and discard those samples.

```
= Figure(resolution = (800, 400))
figs = figs[1, 1] = GridLayout()
ga = figs[2, 1] = GridLayout()
gb = figs[3, 1] = GridLayout()
gc = Axis(ga[1, 1])
axs1 = Axis(gb[1, 1])
axs2 = Axis(gc[1, 1])
axs3
lines!(axs1, 1:1:(chain_length+1), chain[:,1])
lines!(axs2, 1:1:(chain_length+1), chain[:,2])
lines!(axs3, 1:1:(chain_length+1), chain[:,3])
vspan!(axs1, [0, chain_length/2], [0, 1], color = (:gray, 0.2))
vspan!(axs2, [0, chain_length/2], [0, 1], color = (:gray, 0.2))
vspan!(axs3, [0, chain_length/2], [0, 1], color = (:gray, 0.2))
Label(ga[1, 1, Top()], L"\mu_A", valign = :bottom, textsize = 20)
Label(gb[1, 1, Top()], L"\mu_B", valign = :bottom, textsize = 20)
Label(gc[1, 1, Top()], L"\sigma", valign = :bottom, textsize = 20)
hidexdecorations!.((axs1, axs2))
figs
```

Here are the marginal distributions for each of our estimated parameters:

```
= Figure(resolution = (1920/2, 1080/2))
figs2 = figs2[1, 1] = GridLayout()
ga2 = figs2[2, 1] = GridLayout()
gb2 = figs2[3, 1] = GridLayout()
gc2 = Axis(ga2[1, 1])
axs11 = Axis(gb2[1, 1])
axs21 = Axis(gc2[1, 1])
axs31
hist!(axs11, chain1)
hist!(axs21, chain2)
hist!(axs31, chain3)
linkxaxes!(axs11, axs21)
xlims!(axs31, (0.0, 0.1))
hideydecorations!.((axs11, axs21, axs31))
Label(ga2[1, 1, Top()], L"\mu_A", valign = :bottom, textsize = 30)
Label(gb2[1, 1, Top()], L"\mu_B", valign = :bottom, textsize = 30)
Label(gc2[1, 1, Top()], L"\sigma", valign = :bottom, textsize = 30)
figs2
```

Now that we have samples for the mean of both methods how can we compare them? Simply by taking a difference of the samples we will get a distribution comparing the two means.

What’s gained from this approach is that we can now see the probability of method \(B\) having an increased yield to method \(A\). Seeing how many samples are above \(0\) and then taking the average of those results shows that \(80\%\) of the probability mass is above zero.

```
= (chain2 - chain1) * 100
meandiff # @show probgt0 = mean(meandiff .> 0) # just for checking
= Figure(resolution = (550, 250))
fig5 = Axis(fig5[1, 1], title = L"\Pr(\mu_B - \mu_A \geq 0) = 80 %")
ax5 density!(meandiff, bandwidth = 0.4, color =(:grey, 0.5), strokewidth=1)
vlines!(ax5, [0.0], color = :black, linestyle = :dash)
xlims!(-8, 8)
hideydecorations!(ax5)
fig5
```

The frequentist \(t\)-test approach
results in a *non-significant* value. We fail to reject our null
hypothesis that these samples come from a single *common*
population distribution.

The bayesian approach gives us an \(80 \%\) probability that method \(B\) improves overall yield compared to method \(A\).

In a business context I could see the bayesian result being far more useful than failing to reject our null hypothesis. A company can weigh their options between method \(B\) and method \(A\). Decide on what’s more economical and have some percentage to wrap their heads around about potential improvement.

This particular example was given in the text by Box, Hunter, and
Hunter to demonstrate a case where the \(t\)-test may be an inappropriate
statistical test. The \(t\)-distribution may be a poor choice for a
*reference* distribution. The reason why this is the case is due
to autocorrelation and Student’s \(t\)
test assuming errors are independent and identically distributed
(iid).

The exact reason behind the correlated errors was given:

In this particular process the yield was determined by measuring the volume of liquid product in each successive batch, but because of the design of the system, a certain amount of product was left in the pipes and pumps at each determination. Consequently, if slightly less was pumped out on a particular day, the recorded yield would be low but on the following day it would tend to be high…

Such an effect would produce a negative correlation between successive yields such as that found.

Moving forward we should account for this and attempt to model the autocorrelation that exists between successive yields.