Modeling the temperature along a pipeline using non-parametrics

Based on the book by James-A Goulet: *Probabilistic Machine
Learning for Civil Engineers*

A Gaussian process regression is a *non-parametric* bayesian
approach to regression modeling. With it we’re able to relax some of the
assumptions of linear regression and model more complex relationships.
This post works through an example of modeling the temperature of a
pipeline at specific locations along the pipeline: \(x\) km.

- Our modeling example and Gaussian process regression
- Our prior knowledge
- Our posterior knowledge
- Observation error
- Parameter estimation

Take for example the \(100\) km long pipeline illustrated below, for which we are interested in quantifying the temperature at any coordinate \(x \in (0,100)\) km. Given that we know the pipeline temperature to be \(y=8^\circ C\) at \(x= 30\) km, what is the temperature at \(x= 30.1\) km ? Because of the proximity between the two locations, it is intuitive that the temperature has to be close to \(8^\circ C\). Then, what is the temperature at \(x = 70\) km? We cannot say for sure that it is going to be around \(8^\circ C\) because over a \(40\) km distance, the change in temperature can reach a few degrees.

```
using DataFrames, CairoMakie
= DataFrame(x = [31.0, 70, 30], temperature = [-0.4, 3.2, -0.6])
data data
```

x | temperature |
---|---|

31.0 | -0.4 |

70.0 | 3.2 |

30.0 | -0.6 |

A Gaussian process is a multivariate Normal random variable deﬁned over a domain described by covariates.

In linear regression our observation model is the following and our observed covariates \(x\) are assumed to be exact and free of observation error. The model is \(g(x) = b_0 + b_1 x\) when using a linear basis function.

\[ \underbrace{y}_{\text{observation}} = \overbrace{g(x)}^{\text{model}} + \underbrace{v, v: V \sim \mathcal N(v; 0, \sigma_V^2)}_{\text{observation error}} \]

In Gaussian process regression our observation model is the following
where rather than having a mean and a covariance matrix we have a mean
*function* and covariance *function*.

\[ \underbrace{g_i}_{\text{observation}} = \overbrace{g(x_i)}^{\text{model}} \]

Where the Gaussian process \(g(\mathbf{x}): \mathbf{G(x)} \sim \mathcal N(g(\mathbf{x}); \mathbf{\mu_G, \Sigma_G})\) and \(\mathbf {\mu_G}\) is the mean vector \([\mathbf{\mu_G}]_i = \mathbf{\mu_G}(x_i)\) and the covariance matrix \(\mathbf {\Sigma_G}\) is: \([\mathbf{\Sigma_G}]_{ij} = \rho(x_i, x_j) \cdot \sigma_G(x_i) \cdot \sigma_G(x_j)\)

A Gaussian process regression is a *kernel* regression. Where
the *kernel* is a function that relates our covariates pairwise
by distance. Our predictions combine our observations with weights that
depend on observed and predicted input locations. This kernel function
was seen above as \(\rho(x_i, x_j)\) in
defining our covariance matrix \(\mathbf{\Sigma_G}\).

In this example we will use the *square exponential* function
as our kernel function:

\[ \rho(x_i, x_j) = \exp \left (-\frac{1}{2} \frac{(x_i - x_j)^2}{\mathcal \ell^2} \right) \]

What’s seen in the denominator is our *length-scale*, \(\ell\), which is a *hyperparameter*
(yes there are some parameters in this non-parametric approach) for the
Gaussian process and describes a decay of correlation between the
covariates.

In our example this could be seen as how much does the temperature at position \(x_i\) tell us about the temperature at position \(x_j\)? If they are \(1\)km apart vs being \(100\)km apart how much correlation should exist between the two points?

```
# define our kernel function
function square_exp(xi, xj; length_scale = 25.0)
return exp(-1/2 * (xi - xj)^2/length_scale^2)
end
# our covariate: distance
= 0:1:100
x
# covariance matrix under different length-scales
= square_exp.(x, x', length_scale = 1.0) * 2.5^2
cov_mat1 = square_exp.(x, x', length_scale = 10.0) * 2.5^2
cov_mat2 = square_exp.(x, x', length_scale = 100.0) * 2.5^2 cov_mat3
```

Our prior knowledge in this example is the following:

\[ \underbrace{\mathbf {G \sim \mathcal N(g(x); \mu_G, \Sigma_G)}}_{\text{prior knowledge}}, [\mathbf{\Sigma_G}]_{ij} = \rho(x_i, x_j) \sigma_G^2 \]

The prior mean and standard deviations are assumed to be:

- mean prior \(\mu_G = 0\)
- our scale parameter \(\sigma_G = 2.5 ^\circ C\)

With this info we can begin by sampling realizations from our Gaussian process prior under different length-scales.

```
using Distributions, LinearAlgebra, Colors
# add jitter to diagonal for linalg stability
= MvNormal(cov_mat1 + 1e-10*I)
d1 = MvNormal(cov_mat2 + 1e-10*I)
d2 = MvNormal(cov_mat3 + 1e-10*I)
d3
# obtain realizations from different length-scale
= rand(d1, 5)'
realizations1 = rand(d2, 5)'
realizations2 = rand(d3, 5)'
realizations3
# define some colors for our plot
function wongcolors()
return [
RGB(0/255, 114/255, 178/255), # blue
RGB(230/255, 159/255, 0/255), # orange
RGB(0/255, 158/255, 115/255), # green
RGB(204/255, 121/255, 167/255), # reddish purple
RGB(86/255, 180/255, 233/255), # sky blue
RGB(213/255, 94/255, 0/255), # vermillion
RGB(240/255, 228/255, 66/255), # yellow
]end
# plot attributes
activate!(type = "svg")
CairoMakie.set_theme!(theme_minimal())
# plot
= Figure(resolution = (1920/2, 1080/2))
figs1 = figs1[1, 1] = GridLayout()
ga2 = figs1[2, 1] = GridLayout()
gb2 = figs1[3, 1] = GridLayout()
gc2 = Axis(ga2[1, 1])
axs11 = Axis(gb2[1, 1])
axs21 = Axis(gc2[1, 1])
axs31 band!(axs11, x .+ 1, -5, 5, color = (:gray90, 0.7))
band!(axs21, x .+ 1, -5, 5, color = (:gray90, 0.7))
band!(axs31, x .+ 1, -5, 5, color = (:gray90, 0.7))
series!(axs11, realizations1, color = wongcolors())
series!(axs21, realizations2, color = wongcolors())
series!(axs31, realizations3, color = wongcolors())
linkyaxes!(axs11, axs21, axs31)
ylims!(axs31, (-7, 7))
hidexdecorations!.((axs11, axs21))
Label(ga2[1, 1, Top()], L"\ell = 1", valign = :bottom, textsize = 30)
Label(gb2[1, 1, Top()], L"\ell = 10", valign = :bottom, textsize = 30)
Label(gc2[1, 1, Top()], L"\ell = 100", valign = :bottom, textsize = 30)
figs1
```

Now if we introduce data, \(\mathcal
D\), then we can obtain the *posterior* PDF \(f(\mathbf{g_* | x_*}, \mathcal D)\). Our
prior knowledge changes given that we have new observed \(\mathbf{x}\) locations and prediction
locations: \(\mathbf{x_*}\)

\[ \left[\begin{array}{c} \mathbf{G} \\ \mathbf{G}_{*} \end{array}\right] \sim \mathcal{N} \left( \left[\begin{array}{c} \mu_{G} \\ \mu_{*} \end{array}\right], \left[ \begin{array}{cc} \Sigma_{G} & \Sigma_{G*} \\ \Sigma_{G*}^T & \Sigma_{*} \end{array} \right] \right) \]

Where \(\mathbf{\Sigma_G}\) is our kernel function comparing our observed \(\mathbf x\) locations against other observed \(\mathbf x\) locations and \(\mathbf{\Sigma_*}\) is the kernel function comparing predicted \(\mathbf{x_*}\) locations against other predicted \(\mathbf{x_*}\) locations. The \(\mathbf{\Sigma_{G*}}\) matrix is the cross-covariance between predicted and observed locations.

With this prior information we can condition on the data \(\mathcal D\) and obtain the posterior PDF \(f(\mathbf{g_* | x_*}, \mathcal D) = \mathcal N(\mathbf{g_*}; \mathbf{\mu}_{\mathbf * | \mathcal D}, \mathbf \Sigma_{\mathbf * | \mathcal D})\) analytically where the posterior mean and covariance are:

\[ \begin{aligned} \mathbf \mu_{\mathbf *|\mathcal D} &= \mathbf{\mu_{G_*} + \Sigma_{G*}^T \Sigma{_G}^{-1}(g - \mu_G)} = \mathbf{\Sigma_{G*}^T \Sigma{_G}^{-1} g}\\ \mathbf \Sigma_{\mathbf *|\mathcal D} &= \mathbf{\Sigma_* - \Sigma_{G*}^T \Sigma{_G}^{-1} \Sigma_{G*}} \end{aligned} \]

```
function gp(x_obs, x_pred, y_obs, σ_g)
# do the matrix algebra to get our posterior distribution
= square_exp.(x_obs, x_obs') * σ_g^2
ΣG = square_exp.(x_obs, x_pred') * σ_g^2
ΣG_star = ΣG \ ΣG_star
solved
= vec(solved' * y_obs)
μ_post = square_exp.(x_pred, x_pred') * σ_g^2
Σ_star = Σ_star - (solved' * ΣG_star)
Σ_post
return μ_post, Σ_post
end
```

```
# get our realizations
1, gp_Σ1 = gp(data.x[1], x, data.temperature[1], 2.5)
gp_μ= MvNormal(gp_μ1, Matrix(Hermitian(gp_Σ1 + 1e-10I)))
post1 = rand(post1, 5)'
post_real 2 = sqrt.(diag(gp_Σ1))
σ
# plot attributes
= Figure(resolution = (800, 400))
fig21 = fig21[1, 1] = Axis(fig21, xlabel = "x", ylabel = "Temperature")
ax21 ylims!(ax21, (-10, 10))
# plot
band!(x .+ 1, gp_μ1-2*σ2, gp_μ1+2*σ2, color = (:gray90, 0.7))
series!(post_real, color = wongcolors())
scatter!(data.x[1:1] .+ 1, data.temperature[1:1], marker = :xcross, markersize = 15, color = :black)
fig21
```

```
# get our realizations
2, gp_Σ2 = gp(data.x[1:2], x, data.temperature[1:2], 2.5)
gp_μ= MvNormal(gp_μ2, Matrix(Hermitian(gp_Σ2 + 1e-10I)))
post2 = rand(post2, 5)'
post_real2 22 = sqrt.(diag(gp_Σ2))
σ
# plot attributes
= Figure(resolution = (800, 400))
fig22 = fig22[1, 1] = Axis(fig22, xlabel = "x", ylabel = "Temperature")
ax22 ylims!(ax22, (-10, 10))
# plot
band!(x .+ 1, gp_μ2-2*σ22, gp_μ2+2*σ22, color = (:gray90, 0.7))
series!(post_real2, color = wongcolors())
scatter!(data.x[1:2] .+ 1, data.temperature[1:2], marker = :xcross, markersize = 15, color = :black)
fig22
```

```
# get our realizations
3, gp_Σ3 = gp(data.x, x, data.temperature, 2.5)
gp_μ= MvNormal(gp_μ3, Matrix(Hermitian(gp_Σ3 + 1e-10I)))
post3 = rand(post3, 5)'
post_real3 23 = sqrt.(diag(gp_Σ3))
σ
# plot attributes
= Figure(resolution = (800, 400))
fig23 = fig23[1, 1] = Axis(fig23, xlabel = "x", ylabel = "Temperature")
ax23 ylims!(ax23, (-10, 10))
# plot
band!(x .+ 1, gp_μ3-2*σ23, gp_μ3+2*σ23, color = (:gray90, 0.7))
series!(post_real3, color = wongcolors())
scatter!(data.x .+ 1, data.temperature, marker = :xcross, markersize = 15, color = :black)
fig23
```

Until this point we have assumed that our observations are perfect
and without any error. That they are the *exact* temperature at
the observed locations. However, we know that sensors have some
instrument precision and we’d like to include that uncertainty in our
model. Our model formulation becomes:

\[ \underbrace{y}_{\text{observation}} = \overbrace{g(x)}^{\text{model}} + \underbrace{v}_{\text{measurement error}} \]

Where \(v: V \sim \mathcal N(v; 0, \sigma_{V}^2)\). We have new prior knowledge so have to include the observation error:

\[ \left[\begin{array}{c} \mathbf{Y} \\ \mathbf{G}_{*} \end{array}\right] \sim \mathcal{N} \left( \left[\begin{array}{c} \mu_{Y} \\ \mu_{G*} \end{array}\right], \left[ \begin{array}{cc} \Sigma_{Y} & \Sigma_{Y*} \\ \Sigma_{Y*}^T & \Sigma_{*} \end{array} \right] \right) \]

We can reuse much of what was done before but model the noise by adding it to the covariance kernel of our observations \(\mathbf \Sigma_G\). It is added along the diagonal using the identity matrix \(\mathbf I\):

\[ \mathbf \Sigma_Y = \mathbf \Sigma_G + \sigma_{V}^2 \mathbf I = \rho(\mathbf x, \mathbf{x'}) + \sigma_{V}^2 \mathbf I \]

```
function gp_noise(x_obs, x_pred, y_obs, σ_g, σ_v)
= square_exp.(x_obs, x_obs') * σ_g^2 + (σ_v^2 * Matrix(I, 3, 3))
ΣY = square_exp.(x_obs, x_pred') * σ_g^2
ΣY_star = ΣY \ ΣY_star
solved
= vec(solved' * y_obs)
μ_post = square_exp.(x_pred, x_pred') * σ_g^2
Σ_star = Σ_star - (solved' * ΣY_star)
Σ_post
return μ_post, Σ_post
end
2_noise, Σ2_noise = gp_noise(data.x, x, data.temperature, 2.5, 0.5);
μ
= MvNormal(μ2_noise, Matrix(Hermitian(Σ2_noise + 1e-10I)))
post_noise = rand(post_noise, 5)'
post_real_noise 2_noise = sqrt.(diag(Σ2_noise))
σ
# plot attributes
= Figure(resolution = (800, 400))
fig3 = fig3[1, 1] = Axis(fig3, xlabel = "x", ylabel = "Temperature")
ax3 ylims!(ax3, (-10, 10))
# plot
band!(x .+ 1, μ2_noise - 2*σ2_noise, μ2_noise +2*σ2_noise, color = (:gray90, 0.7))
series!(post_real_noise, color = wongcolors())
scatter!(data.x .+ 1, data.temperature, marker = :xcross, markersize = 15, color = :black)
fig3
```

Although Gaussian process regression is a nonparametric regression
method we have come across a few parameters so far that we can consider
*hyperparameters* because they are all parameters of the prior.
So far these are the following:

\[ \mathbf{\theta} = [\ell, \sigma_G, \sigma_V]^T \]

Where \(\ell\) is our length-scale in the kernel function, \(\sigma_G\) is the gaussian process prior standard deviation or scale, and \(\sigma_V\) is the observation standard deviation.

Our conditional distribution before was solved analytically when we fixed \(\mathbf \theta\) assuming each of them were known. If we wanted to estimate these parameters from our data we could use maximum likelihood (MLE) or Markov chain Monte Carlo (MCMC) methods.

The length-scale is more intuitive as a positive number and we can think of it in our units of distance (kilometers) in this example. When we have a length-scale around \(100\) then we have almost parallel lines from \(0\) to \(100\) km. In our example it doesn’t align with intuition to say that the temperature at either end of the pipeline should be correlated as highly as locations closer to each other. So we will give more weight to lower length-scales.

```
= [pdf(LogNormal(4, 1), x) for x in 0:0.01:200]
expvar = Figure(resolution = (700, 300))
fp = fp[1, 1] = Axis(fp, xlabel = "x")
axp lines!(axp, 0:0.01:200, expvar, linewidth = 3, label = "LogNormal(4, 1)")
xlims!(axp, (0, 200))
hideydecorations!(axp)
axislegend()
fp
```

Assuming that \(\sigma_G = 2.5 ^\circ C\) and \(\sigma_V = 0.5\) as we did before but this time we want to use the data to estimate our length-scale \(\ell\) we have:

\[ \begin{align*} \ell &\sim \text{LogNormal}(4, 1) \\ \Sigma &= \rho(\mathbf{x}, \mathbf{x'}, \ell) \cdot \sigma_{G}^2 \\ f &\sim \text{GP}(0, \Sigma) \\ y_{n} &\sim \mathcal N(f(x_{n}), \sigma_V) \end{align*} \]

A limitation of this modeling procedure that I ran into rather quickly is that building the matrices described above and attempting to use MCMC was too computationally intensive. Estimating the single \(\ell\) parameter required the estimation of \(100\) other parameters \(f\) for each location and the simpler MCMC methods would not converge.

Due to this limitation it’s common in practice to use approximate methods instead of full bayes when the number of observations exceed \(1000\).

```
using Turing, FillArrays
@model function myGPnoncent(y, x)
# priors
~ LogNormal(4, 1)
length_par # covariance matrix
= square_exp.(x, x', length_scale = length_par) * 2.5^2 + 1e-10*I
cov_mat = cholesky(Symmetric(cov_mat)).L
L_cov
# non-centered param
~ Normal()
m ~ MvNormal(Fill(m, length(x)), 0.5 + I * 1e-10)
f_tilde = L_cov * f_tilde
f_predict
# likelihood
~ MvNormal(f_predict[[32, 71, 31]], 0.5 + I * 1e-10)
y end
= 0:1:100
x = myGPnoncent(data.temperature, x)
gpmod = sample(gpmod, NUTS(0.65), 200) chain
```

Including \(80 \%\) highest posterior density (HPD) region:

```
using KernelDensity
= vcat(chain[Symbol("length_par")]...)
post_length
function hpdi(x::Vector{T}; alpha=0.11) where {T<:Real}
= length(x)
n = max(1, ceil(Int, alpha * n))
m
= sort(x)
y = y[1:m]
a = y[(n - m + 1):n]
b = findmin(b - a)
_, i return [a[i], b[i]]
end
= hpdi(post_length, alpha = 0.2)
hpdi_range
# get index of hpdi range
= kde(post_length)
k = findall(x -> x >= hpdi_range[1], k.x) |> minimum
min_idx = findall(x -> x <= hpdi_range[2], k.x) |> maximum
max_idx
# plot
= Figure(resolution = (650, 300))
fig5 = fig5[1, 1] = Axis(fig5)
ax5 lines!(ax5, k)
band!(ax5, k.x[min_idx:max_idx], 0.0, k.density[min_idx:max_idx])
hideydecorations!(ax5)
xlims!(ax5, (0, 100))
fig5
```

The Gaussian process regression is a useful approach to modeling nonlinear relationships and exploiting the relationships between the covariates. In my opinion it strikes a balance between parametric and nonparametric methods like linear regression and neural networks/tree-based methods respectively.

Compared to linear regression, it does a better job at modeling uncertainty in interpolation/extrapolation and its extrapolation is more predictable than other methods like linear regression with polynomial basis covariates. Some other limitations of linear regression that this approach can handle better are the expectation of homoscedastic errors and sensitivity to outliers.

In practice when building a Gaussian process and utilizing full bayes it’s almost a necessity to use linear algebra tricks like cholesky decomposition of the covariance matrices and non-centered parameterizations to make the estimation procedure more efficient.

I’m unsure if using MCMC is the way to go with these methods and perhaps reviewing the approximate methods landscape would be beneficial. Particularly of interest to me is the laGP: Local Approximate Gaussian Process Regression R package implementation.