using Plots
using GLM
Linear Regressions with Julia
The first time I heard about Julia was a few months ago, in an introductory class to programming. Since then, I have come to know a bit more about the reputation of the language, supposely very efficient and especially suited for scientific computations.
In this article, I will try to perform some basic linear regressions with Julia and to plot them.
Simple linear regression
Here, let us try to explain one continuous variable by another continuous variable.
First, those are the packages we are going to use :
The Plots package seems to be the most general package to plot data with Julia. It is a simple interface to several underlying plotting packages, like gr or plotly. In this sense, it allows to easily switch between different frameworks that have different features, while keeping the same syntax.
The GLM package seems to be the main package for Generalized Linear Models (GLM) with Julia. It will be useful in the computation of our simple and multinomial linear models.
Now, let us generate some data :
= range(0, 1, length=100)
x # Generate data :
= rand(100) .* x
y
# Check the data :
size(y)
(100,)
- The function
range()
in Julia allows to create an array with a starting and an ending value. Although having similar properties than vectors, arrays created with therange()
function have the “Range” type (orStepRangeLen
to be exact). Here, I just initialise thex
array to compute an easy to usey
vector. - The function
rand()
in Julia allows to randomly generate numbers. The default settings of the function make it so that the drawn data is of typeFloat64
between \(0\) and \(1\). Its argument, set to the value of \(100\), indicates the number of random values that are to be drawn. Therefore,rand(100)
returns a vector of 100Float64
numbers. - The
.*
function is a vectorized multiplication. In Julia, adding a dot.
before a mathematical operator allows to apply this operation to all the elements of a vector. If we wanted to add \(2\) to all the elements of they
vector, we should have writteny .+= 2
. - The last function
size()
is useful to get more dimension of the array we consider. In our case, it yields(100,)
because it is a \(100\) rows vector. We would get the same if we didsize(y)
.
Now, let us plot the data we just created :
plot(
scatter(x,y, label = "data"),
= "Generating Random Data",
title = "Random variable",
ylabel = "x",
xlabel )
The function plot()
in Julia comes from the Plots package. It is the default plotting function, with the gr backend by default (we will get to that later on). It contains several elements :
- The function
scatter()
allows to plot a scatterplot, i.e. independent points that are not linked by a line. If we had not indicated thescatter()
function, it would have plotted a line between all the points. - The
label
value is the one that gets displayed within the box. Specifying “data” allows to avoid the “y1” default value. title
,ylabel
andxlabel
are pretty straightforward to understand.
It is now time to run a simple linear regression. Two simple ways to run such a model are :
- Using the GLM package,
- Using the built-in matrix computation function of Julia.
GLM package
First, with the GLM package :
= lm(@formula(y ~ x), (;y, x)) model
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int32}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.00995436 0.03272 0.30 0.7616 -0.0549775 0.0748862
x 0.468727 0.0565302 8.29 <1e-12 0.356545 0.58091
──────────────────────────────────────────────────────────────────────────
Now, if we want to plot our model with the data, we can run :
= GLM.coef(model)
coefs plot!((x) -> coefs[1] + coefs[2] * x, label="model",
= "Simple linear regression") title
If we want to include the equation of our linear model in the legend of the plot, we can use the LaTeXStrings.jl package :
using LaTeXStrings
plot(title = "Simple linear regression",
scatter(x,y, label = false),
-> coefs[1] + coefs[2] * x, label=latexstring("y=$(coefs[1]) + $(coefs[2]) * x")) (x)
And if we want it to be outside the plot :
plot!(legend=:outerbottom, legendcolumns=1)
Matrix computation
Second, with the built-in matrix computation function :
= [ones(length(x)) x]
X \y) (X
2-element Vector{Float64}:
0.00995436086366171
0.4687272109439466
The first line creates a matrix \(X\) such that \(X\in\mathbb{R}^{100\times 100}\), with its first columns being only ones, and its second column being the vector of \(x\).
Without entering into the details, it does exactly what the OLS method does, but in a more rudimentary way, using the normal equation to find the OLS estimates.
Performance comparison
We see that both approaches yield the same result, but what are the differences between them ? Apart from the syntax, we could check the performance of each of them. In order for us to do that, we are going to use the @time
function, which returns the time and space used by a chosen function.
@time begin
= ones(length(x))
vec_1 = transpose([transpose(vec_1);transpose(x)])
X \y)
(Xend
@time begin
= lm(@formula(y ~ x), (;x, y))
model coef(model)[1:2]
GLM.end
Built-in matrix approach :
0.000070 seconds (30 allocations: 40.266 KiB)
2-element Vector{Float64}:
0.00995436086366171
0.4687272109439466
GLM package approach :
0.000120 seconds (146 allocations: 20.438 KiB)
2-element Vector{Float64}:
0.009954360863661988
0.46872721094394604
If we compare the information returned by the @time
function, we see that the matrix computation approach seems to be slightly faster. However, we have to take into account that the lm()
function does not only compute the coefficients, but alswo other informations, that are here not displayed, such as the confidence interval, and the probability associated to the t-statistic. In this sense, this comparison does not make justice to the GLM package. We can however remember that if we only wanted the coefficients of such a model, we could prefer the matrix multiplication approach if we need time.
Multiple linear regression
Here, let us try to explain one continuous variable by multiple continuous variables.
Again, we first have to generate and plot the data :
# Generating random data :
= (rand(100) for _ in 1:4)
x,y,w,z
# Plotting the data :
plot(scatter(x, y, z,
= "Static 3D plot",
title = w, label="data")) marker_z
Rotatable plot
Now, we could continue with the plot function of the default gr backend package of Plots
, but to get rotable 3D plots, we can use the PlotlyJS package. Since Plot.jl integrates plotly, we could try just adding plotly()
:
# We switch from the gr() to the plotlyjs() backend within Plots :
plotlyjs()
# We plot :
plot(scatter(x, y, z;
= w,
marker_z = 2,
markersize = "Rotatable 3D plot",
title = "data",
label = "x",
xlabel = "y",
ylabel = "z")) zlabel
It is now time for some model. Let us consider that the variable z
can be explained linearly by the variables x
and y
, and that w
does not provide supplementary information.
We are going to use the two methods we saw previously :
# Using matrix computations :
@time begin
= [ones(length(x)) x y]
X \z
Xend
# Using GLM :
@time begin
= lm(@formula(z ~ x + y), (;x, y, z))
model = GLM.coef(model)
coefs
coefsend
Built-in matrix approach :
0.000045 seconds (33 allocations: 42.531 KiB)
3-element Vector{Float64}:
0.36233333368806564
-0.0017472521580554313
0.21145297793195228
GLM package approach :
0.000265 seconds (205 allocations: 29.141 KiB)
3-element Vector{Float64}:
0.362333333688066
-0.0017472521580556221
0.2114529779319519
As before, both results are approximatively the same, with a small advantage for the matrix computation approach in performance terms.
Plotting a multinomial regression
Now let’s try to plot the plane of this regression :
linear_model(x,y) = coefs[1] .+ coefs[2] * x .+ coefs[3] * y
plot(x,y,linear_model,
=:surface,
st= "Plane of a two dimensional linear regression",
title ="model",
label="x",
xlabel = "y",
ylabel = "z") zlabel
While I did not manage to plot simulatneously the points and the surface on the plot, it seemed a satisfactory result for the moment.