Learn Python from Scratch for Data Analysis & Machine Learning – linear regression and t-test
In this video, I will show how you can
get started with Python for data
analysis.
Note that I here assume that you have
some basic understanding in statistics.
I will first show how to install and use
packages and explain the difference
between functions and methods. Then we
will see how to compute linear
regression, a t-test, and how to read in
data from an Excel file.
You can use different IDEs or text
editors for Python. I will here use
Spyder, but other editors should work
just fine.
If you are an R user like me, Spyder is
nice to start with because it looks very
similar to RStudio and it is really
easy to install and get started with.
Here is where we create our script.
And here is the console that shows the
output.
And in this tab, you can view your
plots. You may download Spyder for free
from the following website. Spyder comes
with its own built-in Python.
To be able to run the code as shown in
this video, you first need to install
the following packages. If you have not
already done that, check how to install
these libraries on your system.
In Spyder, you can for example install
the following package like this, which
makes sure that the package is installed
into the same Python interpreter that
Spyder is currently running.
If you get an error when you try to
install a package that says something
like pip is not installed or no module
named pip, you can try to run the
following two lines and see if that
fixes the problem.
To see if you already have a package
installed, you can run this line
which shows the version of the installed
package.
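One way to do such a check from inside Python is sketched here. It uses the standard-library module importlib.metadata, which is not necessarily the line shown in the video, and the helper name installed_version and the list of package names are assumptions made for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names are assumed here; adjust the list to the libraries you need.
for name in ["numpy", "scipy", "matplotlib", "pandas", "scikit-learn", "statsmodels"]:
    print(name, installed_version(name))
```

A result of None means that the package is not installed in the interpreter that runs this code.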
Once you have successfully installed all
these libraries, you will be ready to
use the code in this tutorial.
Before we start with the tutorial, I
will first show how you can run a simple
code in Spyder and in Colab.
So when you open Spyder, it should look
something like this.
We can remove this text.
Then we assign the value five to our
variable x and use the function print.
To run this code, we select it and click
up here.
Note that the value of x will be printed
in the console.
Then we assign the value seven to the
variable Y
and add X and Y and store the sum in Z
and type print.
If you now run this code, the sum will
be printed in the console.
Another simple way to use Python is to
use Google Colab where you first need
to set up an account.
We first open a new notebook.
Make sure that the runtime type is set
to Python.
Then we write our code in blocks like
this.
and press here to run the code. To open
a new block, we click here.
We will first start with the basics. To
assign for example the value seven to a
variable that we call x, we write like
this. To print the value that is stored
in x, we use the function print
that outputs this value in the console.
If you now set x equal to five and run
the code again, we assign a new value to
the variable. We therefore replace its
previous value with a new one.
If you now print variable x, the value
five should be shown in the output.
In this code, we assign the values seven
and five to the variables x and y.
Then we add these values and assign the
resulting sum to the variable z and
print its value.
It is also possible to assign values to
the two variables on a single line like
this
where x will be set to seven and y to
five.
By using the function type inside the
function print, we can print the data
type of x.
Since five is a whole number, Python
will print class int, showing that x is
an integer. If we instead assign the value
5.2 to the variable and print its
type, Python prints class float, showing
that x is a floating point number. In
Python, numbers with decimals are called
floats while whole numbers are called
integers.
This code assigns a five with quotes to
the variable x. The quotes indicate that
this is a character and not a number. In
Python, anything inside quotes is
treated as a string, even if it looks
like a number, which explains why the
variable x is now interpreted as a
string or text instead of a number.
We may also use single quotes.
If you now try to add a number to the
variable, we will get an error because x
is no longer interpreted as a number.
Usually, we only use quotes when we like
to assign a text string to a variable.
This code assigns the value false to the
variable x. False is a boolean value in
Python which can either be true or
false. If you now print the data type of
X, we see that X is a Boolean. Booleans are
often used in Python for conditions and
logical operations.
This code changes the value of X from
false to true.
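The data types described so far can be tried out with a short sketch like this one. The values are the ones mentioned above, and the try/except part is added here only so that the error from adding a number to a string does not stop the script.

```python
x = 5
print(type(x))      # <class 'int'>

x = 5.2
print(type(x))      # <class 'float'>

x = "5"             # quotes make this a string, even though it looks like a number
print(type(x))      # <class 'str'>

try:
    x + 2           # adding a number to a string is not allowed
except TypeError as err:
    print("Error:", err)

x = False
print(type(x))      # <class 'bool'>

x = True            # change the value from False to True
print(x)
```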
We'll now see how to do some simple math
in Python.
To for example square the value of x we
can use two asterisks which means x to
the power of two.
So how do we compute the square root of
x?
To compute the square root of 16 we can
do it like this. But to do more advanced
math,
the easiest way is to import the math
module which is part of Python's
standard library. This module contains
many useful mathematical operations.
If you now run the following function, a
list of all the functions and constants
in the math module should be shown in
the console.
To compute the square root, we can use
the function sqrt.
To use the sqrt function in a math
module, we type the name of the module
followed by a dot
and then the name of the function we
like to use.
Then we can print the square root of 16.
To print the value of pi which is a
constant in the math module, we use the
following code.
This code uses the exp function from the
math module which calculates e raised to
a given power.
Similarly, we can compute the natural
log of two like this.
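The math operations described above could look like this in code. This is a sketch where the value of x and the use of 16 ** 0.5 for the square root without a module are assumptions, since the exact lines from the screen are not part of the transcript.

```python
import math

x = 4
print(x ** 2)          # x to the power of two: 16
print(16 ** 0.5)       # square root without any module: 4.0

print(math.sqrt(16))   # square root with the math module: 4.0
print(math.pi)         # the constant pi
print(math.exp(1))     # e raised to the power of 1
print(math.log(2))     # natural log of 2

# dir(math) returns the names of the functions and constants in the module.
print(dir(math))
```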
To use a function in a math module, we
have used the name of the module
followed by the name of the function in
that module.
Usually, we avoid using the full name of
the module
and instead give the module a shorter
name
which we then use to access the function
in a module.
So to use functions in a module, this is
the most common way
where we now use the short name of the
module.
Note that you can use any name here as
long as they are consistent.
Another way to use a function in a
module is to only import the function
that we like to use.
So from the math module
we import the function SQRT
after running this line we can use the
SQRT function directly
without the need to use the name of the
module.
Note that the following code will
produce an error because we have not yet
imported the log function from the math
module.
To fix this error, we run this line
where we now also import the log
function from the math module.
So we can either use the functions in
the math module like this
or like this.
This way is probably the most common
way.
When we import the entire math module, it is
always clear that the functions come
from the math module. This prevents
naming conflicts if other modules have
functions with the same name.
However, the advantage of importing the
functions directly
is that the code looks a bit cleaner.
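The two import styles can be compared side by side in a small sketch. The short name m is just an example, because any name works as long as it is used consistently.

```python
# Style 1: import the whole module under a short name.
import math as m

print(m.sqrt(16))
print(m.log(2))

# Style 2: import only the functions we need.
from math import log, sqrt

print(sqrt(16))   # no module name needed
print(log(2))
```

With the first style it stays visible where each function comes from, and with the second style the calls are shorter.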
We can create the following list by
putting the values inside square
brackets and assign it to x.
We can then print the list and see that
x is of data type list.
There is also something called a tuple,
written with parentheses, which is a list
of values that cannot be changed.
For example, if you like to change the
value of the first element of the list
from 1 to 7, we can use the following
code.
But if you try to do the same with a
tuple, we get an error because a tuple
cannot be changed.
When you multiply a list by a number,
Python repeats the elements of the list
that many times.
In data analysis, we usually work with
data in the form of vectors and
matrices.
That is why it is helpful to use the
numpy module.
So we import the numpy module and give
it the short name np which is the
standard convention.
Then we use the function array to create
a numpy array
of the list and assign it to x.
If you now multiply this array by two,
we see that each value in the elements
has been multiplied by two.
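The difference between lists, tuples and numpy arrays can be sketched like this. The values 1 to 4 and the variable names are assumptions chosen for the illustration.

```python
import numpy as np

values = [1, 2, 3, 4]
values[0] = 7                 # a list can be changed in place
print(values)                 # [7, 2, 3, 4]

fixed = (1, 2, 3, 4)
try:
    fixed[0] = 7              # a tuple cannot be changed
except TypeError as err:
    print("Error:", err)

repeated = [1, 2, 3, 4] * 2   # multiplying a list repeats its elements
print(repeated)               # [1, 2, 3, 4, 1, 2, 3, 4]

x = np.array([1, 2, 3, 4])
doubled = x * 2               # multiplying an array multiplies each value
print(doubled)                # [2 4 6 8]
```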
In Python, arrays and lists are stored
so that the first element has index
zero. Here, the first element of the
array stores the value one and its index
is zero.
The second element of the array stores
the value two and its index is one.
To access a value in an array,
we write the name of the array followed
by square brackets containing the index
of the element we want to print.
For example, this prints the value
stored at index zero, which is the first
element of the array.
This prints the value stored at index
one which is the second element of the
array.
If we try to print the value that is
stored at index 4, we'll get the
following error
because our array does not have an
element at index 4.
So remember that Python arrays start
counting at zero.
But what is happening here?
Well, in Python, a negative index counts
from the end of the array. So this
refers to the last element.
So -2 refers to the second to last
element of the array.
This code means that we like to print
the values in the array that start at
index zero and go up to but not
including index three.
We can think of this as printing the
first three elements of the array
given that we start at index zero
which prints the first three values of
the array.
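The indexing rules above can be tested with a short sketch, assuming an array that holds the values 1 to 4.

```python
import numpy as np

x = np.array([1, 2, 3, 4])

print(x[0])     # first element: 1
print(x[1])     # second element: 2
print(x[-1])    # last element: 4
print(x[-2])    # second to last element: 3
print(x[0:3])   # index 0 up to but not including 3: [1 2 3]

try:
    x[4]        # there is no element at index 4
except IndexError as err:
    print("Error:", err)
```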
We will now see how to create the
following matrix in numpy.
This code creates a 2x2 numpy array or a
matrix.
Each inner list creates the rows of the
matrix.
The double square brackets in the output
show that we now have a two-dimensional
array.
Another
way to create the same matrix is to
first create a one-dimensional array.
Then we use reshape to change the shape
of the array into a 2x2 matrix.
Note that reshape is not a function in
the numpy module. It is actually a
method which can only be called on
existing numpy arrays.
Since x is an existing NumPy array,
we can apply the method reshape on it.
I will later explain the difference
between a function and a method.
So with the method reshape, we convert
the one-dimensional array into a
two-dimensional array or matrix with two
rows and two columns.
One convenient way to create the matrix
is to use a negative one here which says
create a two-dimensional array that has
two columns
and figure out the number of rows
automatically.
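The three ways of creating the same 2x2 matrix could be written like this. It is a sketch that assumes the matrix holds the values 1 to 4.

```python
import numpy as np

# Each inner list becomes one row of the matrix.
a = np.array([[1, 2], [3, 4]])

# Start from a one-dimensional array and reshape it.
x = np.array([1, 2, 3, 4])
b = x.reshape(2, 2)

# -1 lets numpy work out the number of rows for two columns.
c = x.reshape(-1, 2)

print(a)
print(b)
print(c)
```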
Suppose that we like to print the value
on the first row first column of this
matrix.
Then we say that we like to print the
value in the element at row index zero
and column index zero.
This will print the value stored in the
following element.
If you like to print all values in the
first column,
we select indices zero and one, which
means that we select the first two rows
of the first column,
which will print all the values in the
first column.
We can think of these two here as we
like to print the first two rows
given that we start at index zero.
We may also write like this where we say
that we like to start at index zero and
go to the end which will print all rows
of the first column.
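Selecting values from the matrix by row and column index might look like this, again assuming the 2x2 matrix with the values 1 to 4.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

print(m[0, 0])     # row index 0, column index 0: 1
print(m[0:2, 0])   # the first two rows of the first column: [1 3]
print(m[0:, 0])    # from row index 0 to the end: [1 3]
print(m[:, 0])     # all rows of the first column: [1 3]
```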
To create a two-dimensional array that
has four rows and one column, we can
write like this
where we say that our two-dimensional
array should have four rows
and one column.
We can also do like this where we let
Python automatically figure out how many
rows the array should have
when we say that we like to have just
one column.
To check the shape of our
two-dimensional array, we can type the
name of the array followed by the attribute
shape,
which tells us that our two-dimensional
array has four rows and one column.
Suppose that we have created a
two-dimensional array and want to
convert it to a one-dimensional array.
Then we can use the method flatten which
converts a multi-dimensional array into
one-dimensional array.
So here we have a one-dimensional array
which is converted to a two-dimensional
array
and then back to one-dimensional array.
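The round trip from a one-dimensional array to a column and back could be sketched as follows, with the values 1 to 4 assumed as example data.

```python
import numpy as np

x = np.array([1, 2, 3, 4])

col = x.reshape(4, 1)        # four rows and one column
col_auto = x.reshape(-1, 1)  # same result, number of rows worked out by numpy
print(col.shape)             # (4, 1)

flat = col.flatten()         # back to a one-dimensional array
print(flat.shape)            # (4,)
```

Note that shape is written without parentheses, while reshape and flatten are called with parentheses.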
In this code, we create the following
row vector
and the following column vector.
To multiply these two matrices, which is
called a dot product, we use the
function dot in numpy.
We may also use the at symbol (@) for
matrix multiplication.
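A small sketch of this multiplication, with made-up values for the row vector and the column vector:

```python
import numpy as np

row = np.array([[1, 2, 3]])        # row vector, shape (1, 3)
col = np.array([[4], [5], [6]])    # column vector, shape (3, 1)

print(np.dot(row, col))   # 1*4 + 2*5 + 3*6 = 32, returned as [[32]]
print(row @ col)          # the @ operator gives the same result
```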
With the following function, we can
create a two-dimensional array with
zeros.
Note that the numbers are in the form of
decimal numbers or floating point
numbers.
If you instead set dtype to int,
the zeros will instead come as integers
or whole numbers.
We can then assign the numbers 1 2 3 and
four to each element by specifying their
row and column indices.
For example, this assigns the value two
to the first row second column.
However, assigning decimal numbers to
the array
will now convert them to integers by dropping
the decimal part.
So to store decimal numbers we must set
dtype to float, which is the default
setting for this function.
You may also see something like this
where float64 means that each decimal
number is stored using 64 bits for high
precision calculations.
Sometimes one uses float32 to save
memory and speed up computations, but
these decimal numbers are less precise
than float64.
Note the difference here.
When we store the value of pi in a
matrix,
float32 stores this value with only
about seven significant digits whereas
float64 stores this value with more
digits and therefore with higher
precision.
This is the default of the function.
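The behavior of dtype can be checked with a sketch like this one. The 2x2 shape, the value 1.9 and the variable names are assumptions made for the example.

```python
import numpy as np

z = np.zeros((2, 2))               # zeros stored as float64 by default
print(z.dtype)

zi = np.zeros((2, 2), dtype=int)   # zeros stored as integers
zi[0, 0] = 1
zi[0, 1] = 2                       # first row, second column
zi[1, 0] = 3
zi[1, 1] = 4
zi[0, 0] = 1.9                     # the decimal part is dropped, so 1 is stored
print(zi)

p32 = np.zeros((1, 1), dtype=np.float32)
p64 = np.zeros((1, 1), dtype=np.float64)
p32[0, 0] = np.pi
p64[0, 0] = np.pi
print(p32[0, 0])   # agrees with pi to about seven significant digits
print(p64[0, 0])   # stored with higher precision
```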
Suppose that we have stored the weight
in kilos of four individuals in a
one-dimensional array.
We can then use the function mean from
numpy to compute the average weight of
the four individuals.
With the function std, we can compute
the standard deviation.
Note that by default, the function
computes the population standard
deviation,
where we only have the sample size in
the denominator.
To instead compute the sample standard
deviation, we need to set the delta degrees of
freedom (ddof) to one, where we now have n minus
one in the denominator.
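The mean and the two versions of the standard deviation could be computed like this. The four weights are hypothetical values, not the ones from the video.

```python
import numpy as np

# Hypothetical weights in kilos of four individuals.
weight = np.array([60, 70, 80, 90])

mean_weight = np.mean(weight)
sd_population = np.std(weight)          # n in the denominator (ddof=0)
sd_sample = np.std(weight, ddof=1)      # n - 1 in the denominator

print(mean_weight)      # 75.0
print(sd_population)    # square root of 500 / 4
print(sd_sample)        # square root of 500 / 3
```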
We will now see how to plot some simple
data.
Suppose that we have the following data
of five individuals.
We can insert the data by using numpy
arrays like this.
To create the following simple scatter
plot, we can import the pyplot
submodule from the main plotting library
Matplotlib to access functions like
plot and scatter.
We will here use the scatter function.
And to see which parameters we can use
in this function, we can run the
following line.
To make a scatter plot, we use the
function scatter
and plug in the names of our numpy
arrays that were created previously.
To use bigger points in the plot, we set
the parameter s to 100.
S is called a parameter which is the
name the function expects
and the number 100 that we pass to it is
called an argument.
With this parameter we can choose a
certain color for the points. To add nice
labels and a grid to the plot,
we use these functions to modify its
appearance.
We add a grid, put labels on the axes,
and finally display the plot with show.
In Spyder, you need to run all these
lines at the same time to create the
plot
like this.
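A scatter plot along these lines could be written as follows. The ages and blood pressures of the five individuals are hypothetical stand-ins for the data in the video, and the color and label texts are assumptions as well.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for five individuals (not the values from the video).
age = np.array([20, 30, 40, 50, 60])
bp = np.array([118, 119, 122, 125, 126])

# help(plt.scatter) lists the parameters that the function accepts.
points = plt.scatter(age, bp, s=100, color="steelblue")  # s sets the point size
plt.grid(True)
plt.xlabel("Age")
plt.ylabel("Systolic blood pressure")
plt.show()
```

Here s and color are parameters of the function, while 100 and "steelblue" are the arguments that we pass to them.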
We will now have a look at the
difference between using functions and
methods in Python, which is crucial to
understand when using certain libraries
in Python. To explain the difference
between a function and a method, we will
here set up a code to perform linear
regression based on our previous data.
So we will use age
to predict the systolic blood pressure
where we will estimate the intercept
and the slope of the regression line.
To perform linear regression, we will
here use the function linregress that
we import from the stats module of SciPy.
Then we create one-dimensional arrays of
the age and the blood pressure of the
five individuals.
Next, we perform linear regression with
the imported function
and print the results.
We see that the slope of the fitted line
is about 0.22
which means that the blood pressure is
predicted to increase by 0.22
for one year increase in age.
The intercept is about 113
which can be interpreted as a person at
age zero has a blood pressure of 113.
However, since we have no data on
children, we should not put much trust
in this value. To add this regression
line to our plot,
we first extract the slope and the
intercept from the output stored in res.
Then we create an array that takes the
values 0 and 80, which gives the x
coordinates to draw the line between.
Then we compute the corresponding
y-coordinates for these two points with
the equation of our straight line.
So the values of the slope and intercept
will go in here.
And we will compute the corresponding y
values for the x values 0 and 80.
Finally, we run the same code as before
when we generated a scatter plot.
But we here now also use the function
plot to draw a line between the two
points.
So that we now show the regression line
in the plot.
We can now use our regression line to
predict the blood pressure for a person
of age 50 by drawing a vertical line
from 50
and then the horizontal line from the
regression line to the y-axis where we
see that the predicted blood pressure is
about 124.
A more accurate way to compute the
predicted value is to use the equation
of the line where we plug in the
computed slope and the intercept
and the age and do the math.
The problem with this calculation is
that we here use rounded values of the
intercept and the slope.
An even better way to do this would be
to compute this directly in Python
because we then avoid rounding the
values of the slope and the intercept.
So the importance of this exercise was
to show how we can use an imported
function like this.
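The whole exercise with the imported function can be sketched like this. The five data points are hypothetical values that were picked so that the fit lands close to the slope and intercept quoted above; they are not the data from the video.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data for five individuals (not the values from the video).
age = np.array([20, 30, 40, 50, 60])
bp = np.array([118, 119, 122, 125, 126])

res = linregress(age, bp)
print(res.slope)        # about 0.22 for this data
print(res.intercept)    # about 113.2 for this data

# End points of the regression line for the x values 0 and 80.
x_line = np.array([0, 80])
y_line = res.intercept + res.slope * x_line

# Predicted blood pressure at age 50, using the unrounded coefficients.
predicted_50 = res.intercept + res.slope * 50
print(predicted_50)     # about 124.2 for this data
```

The arrays x_line and y_line can then be passed to the plot function to draw the line on top of the scatter plot.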
We will now perform linear regression
again but this time we will use the
library scikit-learn by using its
submodule linear_model. The difference from
the previous example is that this is not
a function. It is actually a so-called
class. We therefore now import a class
and not a function.
So from the library scikit-learn we use
the module linear_model
from where we import the class
LinearRegression.
The name of a class usually involves
capital letters.
We add the data like this. But note that
the method that we will use here
requires that the data is in the form of
two-dimensional arrays.
which explains why we now reshape the
one-dimensional arrays to two-dimensional
arrays
which look like this. If you print them,
we can see that they are two-dimensional
arrays because there are two square
brackets around the numbers.
Since we here import a class instead of
a function,
we must first create an instance of this
class.
When we run this line, we will create an
object with a name model.
If we run the corresponding code in a
Python text editor,
we will create an object with a name
model.
The object we created named model has
attributes that store information about
the regression which will later store
the slope and the intercept of the
fitted line.
This class also provides functions that
we can apply on our object. When a
function belongs to an object, it is
instead called a method.
If we type model, we should see the
attributes that are associated with the
object.
These two attributes will later store
the slope and intercept of the
regression line. To perform linear
regression, we will use the method fit
which is included in this class.
If you now run the following line,
we will perform linear regression where
the values of the slope and the
intercept will be stored in the object
like this.
We can extract the values of the slope
and intercept with the following
attributes that are now associated to
the object. For example, we can print
the slope of the regression line like
this.
Since this object stores information
about the slope and the intercept of the
regression line, we can now use our
model to for example predict the
systolic blood pressure by using the
method predict.
So this was our previous code. When we
run this line, we create an instance of
the linear regression class, which means
that we create an object of that class
called model.
Then we use the fit method on the model
object to compute linear regression.
After we have used this method, the
model object will store the values of
the intercept and slope as its
attributes that we can extract from the
object
like this.
Note that we use the name of the object
followed by the name of the attribute.
Now to predict the blood pressure for a
person at the age of 50,
we simply use the method predict
and plug in the value 50 in a
two-dimensional array because the method
expects a two-dimensional array.
Although we only provide one number,
if we then print the predicted blood
pressure,
we see that we get about the same value
as we got previously when we calculated
this by hand.
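The steps with the class, the object, its methods and its attributes might look like this in code. The data is hypothetical and not taken from the video, so the printed numbers are only valid for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data, reshaped to two-dimensional arrays with one column.
age = np.array([20, 30, 40, 50, 60]).reshape(-1, 1)
bp = np.array([118, 119, 122, 125, 126]).reshape(-1, 1)

model = LinearRegression()   # create an instance (object) of the class
model.fit(age, bp)           # method: estimate the slope and the intercept

print(model.coef_)           # attribute: the slope
print(model.intercept_)      # attribute: the intercept

# The method predict expects a two-dimensional array, even for one value.
predicted = model.predict(np.array([[50]]))
print(predicted)             # about 124.2 for this data
```

The trailing underscore in coef_ and intercept_ is the scikit-learn convention for attributes that are filled in by fit.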
So how do we add the regression line to
our scatter plot?
We will first extract the slope and the
intercept.
But note that the slope is stored in a
two-dimensional array.
So to get plain numbers instead of
arrays, we first flatten the
two-dimensional array to a
one-dimensional array.
Then we extract the first element as a
plain number from the one-dimensional
array.
If you now print the slope and the
intercept, we see that we now have plain
numbers
to compute the start and the end of the
coordinates of the line. We use the same
code as before
and run the same lines to generate the
scatter plot with the regression line.
Instead of computing this line manually
like this,
we can use the predict method.
But note that we then first need to
create a two-dimensional array
because the predict method expects a
two-dimensional array as input.
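Both ways of getting the end points of the line can be compared in a short sketch. It assumes the same hypothetical data as a stand-in for the video's data, and the names x_line, y_manual and y_predicted are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data (not the values from the video).
age = np.array([20, 30, 40, 50, 60]).reshape(-1, 1)
bp = np.array([118, 119, 122, 125, 126]).reshape(-1, 1)
model = LinearRegression().fit(age, bp)

# Plain numbers from the stored arrays.
slope = model.coef_.flatten()[0]
intercept = model.intercept_.flatten()[0]

# Way 1: compute the line manually for the x values 0 and 80.
x_line = np.array([0, 80])
y_manual = intercept + slope * x_line

# Way 2: let predict do the work on a two-dimensional input.
y_predicted = model.predict(x_line.reshape(-1, 1))

print(y_manual)
print(y_predicted.flatten())
```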
We will now compute the predicted values
for each person by using the method
predict based on the variable age
which corresponds to the y values of the
line.
Then we can compute the sum of the
squared error or sum of squared
residuals.
The differences between the observed
blood pressure values and the predicted
values are called residuals,
which correspond to the vertical
distances between the data points and
the line.
If you square those distances and sum
them, we get the SSE value.
If we divide this by the sample size,
we get the mean squared error.
We can compute the same thing if we
import the following function
from the submodule metrics,
which we can use to compute the MSE.
Let's also use this function to compute
the R squared value
which is about 0.84.
This means that 84% of the variability
in the systolic blood pressure is
explained by age.
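The error measures can be computed by hand and with scikit-learn as in this sketch. The data is hypothetical, so the R-squared value here is not the 0.84 from the video, and the function names mean_squared_error and r2_score are assumed to be the ones meant above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data (not the values from the video).
age = np.array([20, 30, 40, 50, 60]).reshape(-1, 1)
bp = np.array([118, 119, 122, 125, 126]).reshape(-1, 1)
model = LinearRegression().fit(age, bp)

predicted = model.predict(age)
residuals = bp - predicted           # observed minus predicted values

sse = np.sum(residuals ** 2)         # sum of squared errors
mse = sse / len(bp)                  # mean squared error
print(sse, mse)

print(mean_squared_error(bp, predicted))   # same value as mse
r2 = r2_score(bp, predicted)
print(r2)                                  # about 0.968 for this data
```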
To better estimate the R-squared value in
machine learning,
we usually train the model based on some
training data
and plug in our test data here.
The scikit-learn library is a machine
learning library where the idea is that
you should split the data into training
data and test data to validate your
trained model. It will therefore not
compute any p-values.
Although our data set is way too small
to be split into training data and test
data. I will show how it works.
We import the train test split function
which is used to split the data into
training data and test data
where we here say that 40% of the
individuals should go into the test data
set.
The five individuals will be randomly
put into the two data sets. We here set
the random state to some value which
makes sure that every time we run the
code we'll get the same random split.
You may change this integer to another
value but note that you will then get a
different random split.
So when we run this line,
these three individuals happen to be
selected for the training data and
these two individuals happen to be
selected for the test data.
These values therefore go into X train
whereas these values go into X test
which are used to see how well the model
predicts the blood pressure of these two
individuals based on their age.
When we train the model, we use the
blood pressure of the three individuals
so that the model predicts the blood
pressure as well as possible.
And then we test the model based on the
age of the two individuals to see how
close we get to the observed blood
pressure of these two individuals.
So to train the model, we plug in the
training data here,
which means that we fit a regression
line to these three points.
Then we use the ages of the two
individuals that were included in the
test data set to predict the blood
pressure based on the regression line.
We therefore use these values in the
method predict which results in the
following predicted blood pressures of
the two individuals.
Then we compare these predicted values
with the observed values in the test
data.
to compute the R-squared value. So this is
how you can estimate the R-squared value
based on a test data set.
Another way to do this especially if we
have a small sample size would be to run
cross validation where all data points
will be used both in the training data
and the test data.
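The split into training data and test data could be sketched like this. The data is hypothetical and the value 1 for random_state is an arbitrary choice, so the individuals that end up in each data set will differ from the video.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data (not the values from the video).
age = np.array([20, 30, 40, 50, 60]).reshape(-1, 1)
bp = np.array([118, 119, 122, 125, 126]).reshape(-1, 1)

# 40% of the five individuals (two of them) go into the test data.
X_train, X_test, y_train, y_test = train_test_split(
    age, bp, test_size=0.4, random_state=1
)

model = LinearRegression().fit(X_train, y_train)   # train on three points
y_pred = model.predict(X_test)                     # predict the other two

print(X_train.flatten(), X_test.flatten())
print(r2_score(y_test, y_pred))   # R-squared based on the test data
```

With only two test points this R-squared value is very unstable, which is why such a small data set is not really suitable for a split.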
Let's see how to compute some
inferential statistics in Python.
One way is to use the library
statsmodels
where this line defines that we like to
estimate an intercept of our model. If
this is not included, the intercept will
be automatically set to zero.
If you print X, we see that our first
column now includes ones which is used
to estimate the intercept.
Whereas the second column includes the
ages.
If we run this code, we'll get the same
intercept and slope as we estimated
previously when we used the full data
set. But where we here now also see the
corresponding p values and the 95%
confidence intervals for these
coefficients.
Once you get the idea of how to use
libraries in Python, it is quite easy to
use other statistical tests. For
example, suppose that we like to use a t
test to analyze if there is a
significant difference in the systolic
blood pressure between two age groups.
We then import the following function
that can compute an independent t test
and plug in the data into numpy arrays,
one array for each age group.
Then we plug in these arrays in the t
test function.
and here say that we do not assume equal
variance in the two groups, which means
that we use Welch's t-test.
The outputs from this function will give
us the t statistic, the p-value, and the degrees
of freedom,
which can be printed like this.
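A sketch of such a test is given here under the assumption that the function is ttest_ind from scipy.stats, since the video may have used a function from another library. The blood pressure values of the two groups are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical blood pressures for two age groups (not the video's data).
group1 = np.array([118, 120, 121, 119, 122])
group2 = np.array([126, 131, 128, 133, 127])

# equal_var=False gives Welch's t-test.
res = ttest_ind(group1, group2, equal_var=False)

print(res.statistic)   # t statistic
print(res.pvalue)      # p-value

# Recent SciPy versions also store the degrees of freedom on the result;
# getattr avoids an error on older versions that lack this attribute.
dfs = getattr(res, "df", None)
print(dfs)
```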
Finally, I will show how you can import
some data in Python and create some nice
plots. I have here created a data set in
Excel and named the Excel file book one.
One way to import data in Python is to
use the library pandas.
From this library, we use the function
read_excel to read in the data.
This is the file name of the Excel file
and this is the path to the directory
where I saved the file. You need to
change this path to where you have saved
the file on your computer.
When we run this line, we'll create a
so-called data frame that can store
variables with different data types.
It is always good to print the first
lines of the data frame to make sure it
looks okay.
Here we use the method head from pandas
to print the first five rows.
It is also good to check the data types
of the variables.
The data type of the variable age group
is called object
which basically means that we have text
in this column.
This is what we want because this is a
categorical variable that defines the
groups.
The two other columns contain numeric
data where int64 means that the values
are integers.
This column shows how many non-missing
values we have in each column.
Since we have 10 rows which are indexed
from zero to 9,
we have no missing data.
By using the method describe, we can get
some descriptive statistics of the
columns with numeric data in a data
frame.
For example, we can see that the average
blood pressure of the 10 individuals is
124.
We can print the full data frame like
this.
Now suppose that we like to extract the
variable weight from the data frame.
Then we use the name of the data frame
and the name of the column with square
brackets.
If you now print the data type of the
extracted column,
we see that it is of type series which
is a one-dimensional labeled array. If
you print this, we see that the column
name has disappeared.
If we instead use double square brackets,
we see that the variable weight is still
a data frame and that the column name
has not been removed. It is therefore
good to keep separate variables as data
frames instead of series.
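These first steps with a data frame can be tried without the Excel file by building a small data frame directly in Python. The column names and all values in this sketch are hypothetical, and the commented read_excel line uses a placeholder path.

```python
import pandas as pd

# With a real file one would instead use something like:
# df = pd.read_excel("path/to/your/file.xlsx")   # placeholder path
df = pd.DataFrame({
    "age_group": ["20-35"] * 5 + ["36-55"] * 5,
    "weight": [62, 70, 75, 68, 80, 72, 85, 78, 90, 74],
    "blood_pressure": [118, 120, 121, 119, 122, 126, 131, 128, 133, 127],
})

print(df.head())       # the first five rows
df.info()              # data types and non-missing counts per column
print(df.describe())   # descriptive statistics of the numeric columns

weight_series = df["weight"]     # single brackets give a Series
weight_frame = df[["weight"]]    # double brackets keep a data frame
print(type(weight_series))
print(type(weight_frame))
```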
With this code, we can make a box plot.
Box plot is a method of the pandas data
frame object.
Pandas uses the library Matplotlib
behind the scenes to draw the plot.
Note that the column name is placed here
which is nice.
With the method hist, we can create a
histogram of the data.
For our categorical variable, we can for
example create a pie chart where we see
that we have an equal number of
individuals in the two age groups.
If you like to create a box plot of all
the numerical columns in our data frame,
we simply use the method box plot
directly on the data frame.
We may also extract which numerical
variables we like to include in the box
plot and determine the order of the
variables.
If you like to show the blood pressure
separate for the two age groups,
we may first extract these two variables
and then set by equal to the variable
age group.
It is also possible to do this in one
line where we here apply the box plot
method on the data frame and define
which column in the data frame we like
to display
To add a nicer title and a label on
the y-axis,
we import the Matplotlib library,
set an appropriate y label
and a title,
and use this code to avoid showing the
subtitle.
If you like to compute the t test as we
did before, we extract the variable blood
pressure only for the rows where the age
group is equal to 20 to 35.
Which means that we store these values
in the variable group one
and these values in group two.
Then we just use the same lines of code
as we used previously to compute the t
test.
Note that I here store the degrees of
freedom with the name DFS
so that I don't overwrite the data frame
with the name DF.
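Extracting the two groups from a data frame and running the test could look like this. The data frame, its column names and its values are hypothetical, and SciPy's ttest_ind is assumed as the test function.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical data frame standing in for the Excel data.
df = pd.DataFrame({
    "age_group": ["20-35"] * 5 + ["36-55"] * 5,
    "blood_pressure": [118, 120, 121, 119, 122, 126, 131, 128, 133, 127],
})

# Keep the blood pressure only for the rows of each age group.
group1 = df.loc[df["age_group"] == "20-35", "blood_pressure"]
group2 = df.loc[df["age_group"] == "36-55", "blood_pressure"]

res = ttest_ind(group1, group2, equal_var=False)
t_stat, p_value = res.statistic, res.pvalue
dfs = getattr(res, "df", None)   # named dfs so that df is not overwritten

print(t_stat, p_value, dfs)
```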
Okay, let's now try to do multiple
linear regression. We would like to
predict the blood pressure based on
weight and age group. This can be seen
as we check if there is a difference in
blood pressure between the two age
groups by controlling for the variable
weight which may influence the outcome
and act as a confounder.
So we begin to extract the predictor
variables that we store in a data frame
called X. Then we convert our variable
age group to a categorical variable
because this variable consists of two
groups.
If you print the data frame X, we see
that the variable age group has now been
converted to boolean variables, one for
each group.
By setting drop_first equal to true, we
will delete this redundant column
so that X now only has one column of
the categorical variable.
So this means that the first five
individuals do not belong to the age
group 36 to 55
but these do.
Then we convert this variable to a
numeric variable where the zeros define the
baseline group, which is the
age group 20 to 35, and the ones define the
age group 36 to 55.
Next, we add an intercept to our model
and fit the regression line to our data.
We can see that the individuals in the age
group 36 to 55 have an intercept that is
8.6 greater than those in the
baseline group, which is the
age group 20 to 35. So there is still a
significant difference in the blood
pressure between the two age groups when
we control for the variable weight.
Let's remove the variable weight from
the model so that we only use the
variable age group as a predictor
and run the code and print the P values.
This P value is the same
as the one we get from a t test by
assuming an equal variance of the two
groups.
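The dummy coding and the link to the t-test can be checked with a short sketch. To keep it compact, it uses linregress from SciPy in place of the regression library used in the video, and the data frame with its column names and values is hypothetical.

```python
import pandas as pd
from scipy.stats import linregress, ttest_ind

# Hypothetical data frame standing in for the Excel data.
df = pd.DataFrame({
    "age_group": ["20-35"] * 5 + ["36-55"] * 5,
    "blood_pressure": [118, 120, 121, 119, 122, 126, 131, 128, 133, 127],
})

# One dummy column per group; drop_first removes the redundant first one.
X = pd.get_dummies(df[["age_group"]], drop_first=True)
X = X.astype(int)   # convert True/False to 1/0
print(X)

# Regression of blood pressure on the 0/1 group variable.
dummy = X.iloc[:, 0].to_numpy()
fit = linregress(dummy, df["blood_pressure"].to_numpy())

# t-test that assumes equal variance in the two groups.
g1 = df.loc[df["age_group"] == "20-35", "blood_pressure"]
g2 = df.loc[df["age_group"] == "36-55", "blood_pressure"]
t_res = ttest_ind(g1, g2, equal_var=True)

print(fit.slope)                  # difference between the two group means
print(fit.pvalue, t_res.pvalue)   # the two p-values agree
```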
This was the end of this video about the
basics of using Python for data
analysis.
Thanks for watching.
This video provides a comprehensive introduction to Python for data analysis, covering installation, package management, and the distinction between functions and methods. It demonstrates how to perform linear regression, conduct statistical tests, and read data from Excel files. The tutorial explores Python environments like Spyder and Google Colab, and covers fundamental data types (integers, floats, strings, booleans). It introduces essential libraries such as NumPy for numerical operations and array manipulation, Matplotlib for data visualization (scatter plots, histograms, box plots), and Pandas for data frame manipulation and analysis. The video also explains machine learning with scikit-learn, including model training, prediction, and evaluation metrics like R-squared, as well as inferential statistics with libraries like statsmodels for hypothesis testing and confidence intervals.