Linear Regressions with Julia

julia
statistics
Author

Paulo Gugelmo Cavalheiro Dias

Published

September 21, 2024

The first time I heard about Julia was a few months ago, in an introductory programming class. Since then, I have come to know a bit more about the reputation of the language, supposedly very efficient and especially suited for scientific computing.

In this article, I will try to perform some basic linear regressions with Julia and to plot them.

Simple linear regression

Here, let us try to explain one continuous variable by another continuous variable.

First, these are the packages we are going to use:

using Plots
using GLM
  • The Plots package seems to be the most general package for plotting data with Julia. It is a simple interface to several underlying plotting backends, like GR or Plotly. In this sense, it lets us easily switch between frameworks that have different features, while keeping the same syntax.

  • The GLM package seems to be the main package for Generalized Linear Models (GLM) with Julia. It will be useful for computing our simple and multiple linear models.

Now, let us generate some data:

x = range(0, 1, length=100)
# Generate data:
y = rand(100) .* x

# Check the data:
size(y)
(100,)
  • The function range() in Julia allows us to create an array-like object with a starting and an ending value. Although it has properties similar to those of vectors, objects created with the range() function have their own “Range” type (StepRangeLen, to be exact). Here, I just initialise the x array to compute an easy-to-use y vector.
  • The function rand() in Julia generates random numbers. By default, the drawn values are of type Float64 and uniformly distributed between \(0\) and \(1\). Its argument, set to \(100\), indicates the number of random values to draw. Therefore, rand(100) returns a vector of 100 Float64 numbers.
  • The .* operator is a vectorized multiplication. In Julia, adding a dot . before a mathematical operator applies the operation element-wise to a vector, a mechanism called broadcasting. If we wanted to add \(2\) to all the elements of the y vector, we would write y .+= 2 (see the sketch after this list).
  • The last function, size(), returns the dimensions of the array we consider. In our case, it yields (100,) because y is a vector of \(100\) rows. We would get the same result with size(x).
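
To make the broadcasting and range behaviour concrete, here is a minimal sketch (the variables v and r are illustrative and not used elsewhere in this article):

v = [1.0, 2.0, 3.0]
v .+ 2                      # element-wise addition: [3.0, 4.0, 5.0]
v .^ 2                      # element-wise power: [1.0, 4.0, 9.0]

r = range(0, 1, length=5)   # a StepRangeLen, not a Vector
collect(r)                  # materialises it into a Vector{Float64}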

Now, let us plot the data we just created:

plot(
    scatter(x,y, label = "data"),
    title = "Generating Random Data",
    ylabel = "Random variable",
    xlabel = "x",
)

The function plot() in Julia comes from the Plots package. It is the default plotting function, with the GR backend used by default (we will get to that later on). Our call contains several elements:

  • The function scatter() draws a scatterplot, i.e. independent points that are not linked by a line. If we had not used scatter(), plot() would have drawn a line between all the points.
  • The label value is the one displayed in the legend box. Specifying “data” avoids the default “y1” label.
  • title, ylabel and xlabel are pretty straightforward to understand.

It is now time to run a simple linear regression. Two simple ways to run such a model are:

  • Using the GLM package,
  • Using the built-in matrix computation function of Julia.

GLM package

First, with the GLM package:

model = lm(@formula(y ~ x), (;y, x))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int32}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error     t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  0.00995436   0.03272    0.30    0.7616  -0.0549775  0.0748862
x            0.468727     0.0565302  8.29    <1e-12   0.356545   0.58091
──────────────────────────────────────────────────────────────────────────
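
Beyond the printed coefficient table, the fitted model object can be queried directly. A quick sketch using standard accessors from the GLM package:

coef(model)       # vector of estimated coefficients
stderror(model)   # standard errors of the coefficients
confint(model)    # 95% confidence intervals
r2(model)         # coefficient of determination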

Now, if we want to plot our model together with the data, we can run:

coefs = GLM.coef(model)
plot!((x) -> coefs[1] + coefs[2] * x, label="model", 
    title = "Simple linear regression")

If we want to include the equation of our linear model in the legend of the plot, we can use the LaTeXStrings.jl package:

using LaTeXStrings
plot(title = "Simple linear regression", 
    scatter(x,y, label = false),
    (x) -> coefs[1] + coefs[2] * x, label=latexstring("y=$(coefs[1]) + $(coefs[2]) * x"))

And if we want the legend to be outside the plot:

plot!(legend=:outerbottom, legendcolumns=1)

Matrix computation

Second, with Julia's built-in matrix computations:

X = [ones(length(x)) x] 
(X\y)
2-element Vector{Float64}:
 0.00995436086366171
 0.4687272109439466

The first line creates a matrix \(X\) such that \(X\in\mathbb{R}^{100\times 2}\), with its first column containing only ones, and its second column being the vector \(x\).

Without entering into the details, the second line does exactly what the OLS method does, but in a more rudimentary way: the backslash operator solves the least-squares problem and returns the OLS estimates, the same values we would obtain from the normal equations.
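
For reference, the normal equations give the closed-form OLS estimator:

\[
\hat{\beta} = (X^\top X)^{-1} X^\top y
\]

We can reproduce this closed form directly, as a sketch for illustration (in practice X\y is numerically preferable, since it relies on a QR factorization rather than forming \(X^\top X\)):

# Solving the normal equations explicitly:
beta_hat = (X' * X) \ (X' * y)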

Performance comparison

We see that both approaches yield the same result, but what are the differences between them? Apart from the syntax, we can compare their performance. In order to do that, we are going to use the @time macro, which reports the time taken and the memory allocated by a chosen expression.

@time begin
    vec_1 = ones(length(x))
    X = transpose([transpose(vec_1);transpose(x)])
    (X\y)
end
@time begin 
    model = lm(@formula(y ~ x), (;x, y))
    GLM.coef(model)[1:2]
end

Built-in matrix approach:

  0.000070 seconds (30 allocations: 40.266 KiB)
2-element Vector{Float64}:
 0.00995436086366171
 0.4687272109439466

GLM package approach:

  0.000120 seconds (146 allocations: 20.438 KiB)
2-element Vector{Float64}:
 0.009954360863661988
 0.46872721094394604

If we compare the information returned by the @time macro, we see that the matrix computation approach seems to be slightly faster. However, we have to take into account that the lm() function does not only compute the coefficients, but also other information not displayed here, such as the confidence intervals and the probabilities associated with the t-statistics. In this sense, this comparison does not do justice to the GLM package. We can however remember that if we only want the coefficients of such a model, the matrix computation approach may be preferable when speed matters.
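
Note also that @time measures a single run, which can include compilation time on a first call. For more reliable timings, the BenchmarkTools.jl package runs the expression many times and reports summary statistics; a minimal sketch, assuming the package is installed:

using BenchmarkTools
# Interpolate globals with $ so that access to untyped globals is not measured:
@btime $X \ $y
@btime lm(@formula(y ~ x), (; x = $x, y = $y))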

Multiple linear regression

Here, let us try to explain one continuous variable by multiple continuous variables.

Again, we first have to generate and plot the data:

# Generating random data:
x,y,w,z = (rand(100) for _ in 1:4)

# Plotting the data:
plot(scatter(x, y, z,
    title = "Static 3D plot", 
    marker_z = w, label="data"))

Rotatable plot

Now, we could continue with the plot function of the default GR backend of Plots, but to get rotatable 3D plots, we can use the PlotlyJS package. Since Plots.jl integrates Plotly as a backend, we can switch by simply calling plotlyjs():

# We switch from the gr() to the plotlyjs() backend within Plots:
plotlyjs()
# We plot:
plot(scatter(x, y, z;
        marker_z = w,
        markersize = 2, 
        title = "Rotatable 3D plot",
        label = "data", 
        xlabel = "x",
        ylabel = "y", 
        zlabel = "z"))

It is now time for a model. Let us assume that the variable z can be explained linearly by the variables x and y, and that w does not provide any supplementary information.

We are going to use the two methods we saw previously:

# Using matrix computations:
@time begin
    X = [ones(length(x)) x y]
    X\z
end
# Using GLM:
@time begin 
    model = lm(@formula(z ~ x + y), (;x, y, z))
    coefs = GLM.coef(model)
    coefs
end

Built-in matrix approach:

  0.000045 seconds (33 allocations: 42.531 KiB)
3-element Vector{Float64}:
  0.36233333368806564
 -0.0017472521580554313
  0.21145297793195228

GLM package approach:

  0.000265 seconds (205 allocations: 29.141 KiB)
3-element Vector{Float64}:
  0.362333333688066
 -0.0017472521580556221
  0.2114529779319519

As before, both results are approximately the same, with a small advantage for the matrix computation approach in performance terms.
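
To confirm numerically that the two coefficient vectors agree, we can compare them directly (a quick sketch reusing the objects defined above):

# Maximum absolute difference between the two estimates,
# expected to be on the order of machine precision:
maximum(abs.((X \ z) .- GLM.coef(model)))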

Plotting the multiple regression

Now let us try to plot the plane of this regression:

linear_model(x,y) = coefs[1] .+ coefs[2] * x .+ coefs[3] * y
plot(x,y,linear_model,
    st=:surface,
    title = "Plane of a two dimensional linear regression",
    label="model",
    xlabel ="x",
    ylabel = "y",
    zlabel = "z")

While I did not manage to plot the points and the surface simultaneously on the same plot, this seemed a satisfactory result for the moment.
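
One approach worth trying, sketched here without having been tested in this article, is to draw the surface first and then overlay the points with the in-place scatter! function:

# Draw the regression plane, then add the data points on top:
plot(x, y, linear_model,
    st = :surface,
    title = "Plane of a two dimensional linear regression",
    xlabel = "x",
    ylabel = "y",
    zlabel = "z")
scatter!(x, y, z, label = "data", markersize = 2)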