Implementing KDE with Julia

Why am I writing about KDE’s? At the recommendation of Will Kurt’s probability blog I’ve been reading a great book on data analysis by Philipp K. Janert:

This book has a great intro to KDEs that explain the motivation. My
own abbreviated version are that KDEs provide a useful technique to
visualize a variables distribution. Visualizing your data is an
important step to take early in the data analysis stage. In fact, there
are many metrics that we commonly use to understand a variable that have
an implicit assumption that your data are `unimodal`

(having
a single peak). If your data doesn’t have this structure then you may be
mislead by measures of central tendency (mean/median/mode), outliers, or
other statistical methods (linear regression, t-tests, etc.).

This post’s structure follows closely with how I commonly learn topics. I start at a high level, using a pre-canned solution for the algorithm, and then work backward to find out what’s going on underneath.

I use a small example I discovered on the Wikipedia page for KDEs. It uses a handful of data points for a single variable:

Sample | Value |
---|---|

1 | -2.1 |

2 | -1.3 |

3 | -0.4 |

4 | 1.9 |

5 | 5.1 |

6 | 6.2 |

Let’s start with a basic `dot`

plot of these points. I’ll
be using the `Julia Programming Language`

for these
examples.

```
using StatsPlots
# initialize an array of the samples
= [-2.1; -1.3; -0.4; 1.9; 5.1; 6.2];
x # plot it out
scatter(x, zeros(length(x)), legend = false)
```

Now applying the quick and easy solution: a package. This
package has a function named `kde`

that takes a one
dimensional array (or vector), a bandwidth argument, and a chosen kernel
(we’ll use the default). So let’s see it:

```
import KernelDensity
kde(x, bandwidth = sqrt(2.25)) |>
KernelDensity.-> plot!(x, legend = false) x
```

There we go, we’ve applied KDE to these data points and we can now
see the `bimodal`

nature of this data. If all we wanted to do
was visualize the distribution then we’re done. I’d like to dig a bit
deeper though.

What is the `kernel`

part of this about? What was the
default kernel we used in the previous section? The `kde`

function from the package used a default kernel associated with the
Normal distribution. But to understand what this all means we need to
take a look at the definition of Kernel Density Estimation:

\[ D_h(x; {x_i}) = \sum_{i=1}^n \frac{1}{nh} K\left(\frac{x - x_i}{h}\right) \]

Breaking down this formula a bit: The kernel is the function shown
above as `$K$`

and Janert describes it like so:

To form a KDE, we place a

kernel—that is, a smooth, strongly peaked function—at the position of each data point. We then add up the contributions from all kernels to obtain a smooth curve, which we can evaluate at any point along thexaxis

We are effectively calculating weighted distances from our data
points to points along the *x* axis. There is a great interactive
introduction to kernel density estimation here. I highly recommend
it because you can play with bandwidth, select different kernel methods,
and check out the resulting effects.

As I mentioned before, the default `kernel`

for this
package is the Normal (or Gaussian) probability density function
(pdf):

\[ K(x) = \frac{1}{\sqrt{2\pi}}\text{exp}\left(-\frac{1}{2}x^2\right) \]

Since we are calculating pdfs I’ll use the Distributions.jl package to create each distribution, calculate the densities, and sum the results.

```
using Distributions
= Normal.(x, sqrt(2.25)) dists
```

```
6-element Array{Distributions.Normal{Float64},1}:
Distributions.Normal{Float64}(μ=-2.1, σ=1.5)
Distributions.Normal{Float64}(μ=-1.3, σ=1.5)
Distributions.Normal{Float64}(μ=-0.4, σ=1.5)
Distributions.Normal{Float64}(μ=1.9, σ=1.5)
Distributions.Normal{Float64}(μ=5.1, σ=1.5)
Distributions.Normal{Float64}(μ=6.2, σ=1.5)
```

Here we see a neat feature of the Julia language. Any Julia function
can be vectorized (or broadcasted) by the application of the
`.`

(or “dot”) operator. See this
blog post if you want to learn more about it. Above we applied the
`Normal`

method element-wise creating an array of Normal
distributions. The mean of our individual distributions being our data
points and a variance of `2.25`

(aka our chosen bandwidth).
Let’s plot each of these distributions:

`plot(dists, legend = false)`

Summing up their probability densities across all of
`x`

.

```
# create an iterator
= range(-7, 11, length = 100)
x_d # find the kde with a gaussian kernel
= sum(pdf.(eachdist, x_d) for eachdist in dists)
dens
plot!(x_d, dens)
```

The resulting shape of the KDE is identical to the one we first calculated. We could stop here except this is really just a special case where we are using the gaussian kernel. Let’s extrapolate a bit so we could use different kernels.

To apply a new kernel method we can just write the KDE code from
scratch. Below I’ve defined the KDE function as `D`

and the
kernel argument as `K`

to mimic the math above.

```
# define some kernels:
# gaussian kernel
kgauss(x) = 1/sqrt(2π) * exp(-1/2 * x^2)
# boxcar
kbox(x) = abs(x) <= 1 ? 1/2 : 0
# triangular
ktri(x) = abs(x) <= 1 ? 1 - abs(x) : 0
# define the KDE function
D(x, h, xi, K) =
1/(length(xi) * h) * sum(K.((x .- xi) / h))
# evaluate KDE along the x-axis using comprehensions
= [D(xstep, sqrt(2.25), x, K) for xstep in x_d, K in (kgauss, kbox, ktri)]
dens
# visualize the kernels
plot(x_d, dens, label = ["Gaussian", "Box", "Triangular"])
```

In my example above I used some shorthand Julia syntax. The
`?:`

syntax is called a ternary operator and makes a
conditional `if-else`

statement more compact. I also used
Julia’s `Assignment`

form for my function definitions above
because it looks a lot more like the math involved. You could have
easily defined each of these functions to look more like so:

```
function foo(x)
if ...
...
else
...
end
return ...
end
```

A short post on cumulative distribution functions (cdf) using Julia
will likely follow this one. Janert introduces both kdes and cdfs in his
chapter **A Single Variable: Shape and Distribution** and
they complement each other really well. Thanks for reading!