Learn Python from Scratch for Data Analysis & Machine Learning – linear regression and t-test

Transcript

0:00

In this video, I will show how you can

0:02

get started with Python for data

0:04

analysis.

0:06

Note that I here assume that you have

0:08

some basic understanding in statistics.

0:11

I will first show how to install and use

0:14

packages and explain the difference

0:16

between functions and methods. Then we

0:20

will see how to compute linear

0:21

regression, a t-test, and how to read in

0:24

data from an Excel file.

0:28

You can use different IDEs or text

0:30

editors for Python and I will here use

0:34

Spyder, but other editors should work

0:36

just fine.

0:38

If you are an R user like me, Spyder is

0:43

nice to start with because it looks very

0:45

similar to RStudio and it is really

0:48

easy to install and get started with.

0:51

Here is where we create our script.

0:54

And here is the console that shows the

0:57

output.

0:58

And in this tab, you can view your

1:01

plots. You may download Spyder for free

1:05

from the following website. Spyder comes

1:08

with its own built-in Python.

1:11

To be able to run the code as shown in

1:14

this video, you first need to install

1:16

the following packages. If you have not

1:19

already done that, check how to install

1:22

these libraries on your system.

1:32

In Spyder, you can for example install

1:35

the following package like this, which

1:38

makes sure that the package is installed

1:41

into the same Python interpreter that

1:44

Spyder is currently running.

1:48

If you get an error when you try to

1:49

install a package that says something

1:52

like pip is not installed or no module

1:55

named pip, you can try to run the

1:57

following two lines and see if that

2:00

fixes the problem.

2:03

To see if you already have a package

2:06

installed, you can run this line

2:10

which shows the version of the installed

2:12

package.
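
As a rough sketch of such a version check: the exact line shown on screen is not part of this transcript, so the helper name `installed_version` and the package name "numpy" below are only assumptions for illustration. It uses `importlib.metadata` from Python's standard library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "numpy" is only an example name (an assumption, not the video's on-screen list).
print(installed_version("numpy"))
```

If this prints None, the package is not installed in the Python interpreter that ran the code.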

2:14

Once you have successfully installed all

2:16

these libraries, you will be ready to

2:19

use the code in this tutorial.

2:22

Before we start with the tutorial, I

2:25

will first show how you can run a simple

2:27

code in Spyder and in Colab.

2:31

So when you open Spyder, it should look

2:34

something like this.

2:37

We can remove this text.

2:40

Then we assign the value five to our

2:43

variable x and use the function print.

2:49

To run this code, we select it and click

2:53

up here.

2:57

Note that the value of x will be printed

3:00

in the console.

3:02

Then we assign the value seven to the

3:05

variable Y

3:08

and add X and Y and store the sum in Z

3:13

and type print.

3:17

If you now run this code, the sum will

3:20

be printed in the console.

3:26

Another simple way to use Python is to

3:29

use Google Colab where you first need

3:31

to set up an account.

3:34

We first open a new notebook.

3:39

Make sure that the runtime type is set

3:42

to Python.

3:48

Then we write our code in blocks like

3:51

this.

3:58

and press here to run the code. To open

4:01

a new block, we click here.

4:19

We will first start with the basics. To

4:21

assign for example the value seven to a

4:24

variable that we call x, we write like

4:28

this. To print the value that is stored

4:31

in x, we use the function print

4:37

that outputs this value in the console.

4:41

If you now set x equal to five and run

4:44

the code again, we assign a new value to

4:47

the variable. We therefore replace its

4:50

previous value with a new one.

4:55

If you now print variable x, the value

4:58

five should be shown in the output.

5:03

In this code, we assign the values seven

5:06

and five to the variables x and y.

5:11

Then we add these values and assign the

5:13

resulting sum to the variable z and

5:16

print its value.

5:19

It is also possible to assign values to

5:22

the two variables on a single line like

5:25

this

5:27

where x will be set to seven and y to

5:32

five.
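
Written out, the assignments described so far might look like the following minimal sketch, using the values mentioned in the narration:

```python
# Assign values to two variables and store their sum in a third one.
x = 7
y = 5
z = x + y
print(z)  # 12

# The same two assignments on a single line.
x, y = 7, 5
print(x, y)  # 7 5
```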

5:33

By using the function type inside the

5:36

function print, we can print the data

5:39

type of x.

5:42

Since five is a whole number, Python

5:45

will print class int showing that x is

5:49

an integer. If we instead assign the value

5:52

5.2 two to the variable and print its

5:55

type, Python prints class float, showing

6:00

that x is a floating point number. In

6:03

Python, numbers with decimals are called

6:07

floats while whole numbers are called

6:10

integers.
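
A small sketch of the type check described above. The value 5.2 is just an example of a number with decimals:

```python
x = 5
print(type(x))  # <class 'int'>

x = 5.2  # example value with a decimal part
print(type(x))  # <class 'float'>
```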

6:12

This code assigns a five with quotes to

6:16

the variable x. The quotes indicate that

6:20

this is a character and not a number. In

6:24

Python, anything inside quotes is

6:27

treated as a string, even if it looks

6:30

like a number, which explains why the

6:33

variable x is now interpreted as a

6:36

string or text instead of a number.

6:41

We may also use single quotes.

6:44

If you now try to add a number to the

6:47

variable, we will get an error because x

6:50

is no longer interpreted as a number.
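
The string behaviour above can be sketched like this. The `try`/`except` wrapper and the conversion with `int()` at the end are my additions for illustration and are not shown in the video:

```python
x = "5"  # quotes make this a string, not a number
print(type(x))  # <class 'str'>

# Adding a number to a string raises a TypeError.
try:
    result = x + 2
except TypeError as err:
    result = None
    print("Error:", err)

# Addition (not from the video): convert the string to an integer first.
number = int(x) + 2
print(number)  # 7
```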

6:55

Usually, we only use quotes when we like

6:58

to assign a text string to a variable.

7:04

This code assigns the value false to the

7:07

variable x. False is a boolean value in

7:11

Python which can either be true or

7:13

false. If you now print the data type of

7:17

x, we see that x is a Boolean. Booleans are

7:21

often used in Python for conditions and

7:24

logical operations.

7:28

This code changes the value of X from

7:31

false to true.

7:34

We'll now see how to do some simple math

7:36

in Python.

7:38

To for example square the value of x we

7:42

can use two asterisks which means x to

7:45

the power of two.

7:48

So how do we compute the square root of

7:50

x?

7:53

To compute the square root of 16 we can

7:56

do it like this. But to do more advanced

7:59

math,

8:01

the easiest way is to import the math

8:04

module which is part of Python's

8:07

standard library. This module contains

8:10

many useful mathematical operations.

8:14

If you now run the following function, a

8:17

list of all the functions and constants

8:19

in the math module should be shown in

8:22

the console.

8:24

To compute the square root, we can use

8:27

the function sqrt.

8:31

To use the sqrt function in a math

8:34

module, we type the name of the module

8:38

followed by a dot

8:41

and then the name of the function we

8:43

like to use.

8:46

Then we can print the square root of 16.

8:52

To print the value of pi which is a

8:55

constant in the math module, we use the

8:58

following code.

9:01

This code uses the exp function from the

9:04

math module which calculates e raised to

9:08

a given power.

9:11

Similarly, we can compute the natural

9:13

log of two like this.
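
Here is a sketch of the math operations mentioned in this part. The on-screen code is not in the transcript, so the exact lines are assumptions, for example computing the square root as a power of 0.5:

```python
import math

x = 4
print(x ** 2)         # 16, two asterisks mean "to the power of"
print(16 ** 0.5)      # 4.0, a square root written as a power (assumed form)

print(math.sqrt(16))  # 4.0, the sqrt function from the math module
print(math.pi)        # the constant pi
print(math.exp(1))    # e raised to the power 1
print(math.log(2))    # natural log of 2

# dir(math) lists all functions and constants in the module.
names = dir(math)
```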

9:18

To use a function in a math module, we

9:21

have used the name of the module

9:23

followed by the name of the function in

9:26

that module.

9:29

Usually, we avoid using the full name of

9:32

the module

9:35

and instead give the module a shorter

9:37

name

9:40

which we then use to access the function

9:42

in a module.

9:45

So to use functions in a module, this is

9:49

the most common way

9:52

where we now use the short name of the

9:54

module.

9:57

Note that you can use any name here as

9:59

long as they are consistent.

10:05

Another way to use a function in a

10:08

module is to only import the function

10:11

that we like to use.

10:14

So from the math module

10:17

we import the function SQRT

10:21

after running this line we can use the

10:24

SQRT function directly

10:28

without the need to use the name of the

10:30

module.

10:33

Note that the following code will

10:34

produce an error because we have not yet

10:37

imported the log function from the math

10:40

module.

10:43

To fix this error, we run this line

10:45

where we now also import the log

10:48

function from the math module.

10:52

So we can either use the functions in

10:55

the math module like this

10:58

or like this.

11:01

This way is probably the most common

11:03

way.

11:05

When we import the entire math module, it is

11:08

always clear that the functions come

11:11

from the math module. This prevents

11:14

naming conflicts if other modules have

11:17

functions with the same name.

11:20

However, the advantage of importing the

11:23

functions directly

11:26

is that the code looks a bit cleaner.
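
The three import styles discussed above, side by side, as a minimal sketch. The short name `m` is an arbitrary choice of mine:

```python
import math                 # full module name
import math as m            # module with a short name (any consistent name works)
from math import sqrt, log  # import only the functions we need

a = math.sqrt(16)
b = m.sqrt(16)
c = sqrt(16)
print(a, b, c)  # 4.0 4.0 4.0
print(log(2))   # works because log was imported directly
```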

11:30

We can create the following list by

11:33

putting the values inside square

11:36

brackets and assign it to x.

11:40

We can then print the list and see that

11:43

x is of data type list.

11:46

There is also something called a tuple,

11:49

written with parentheses which is a list

11:52

of values that cannot be changed.

11:56

For example, if you like to change the

11:58

value of the first element of the list

12:02

from 1 to 7, we can use the following

12:06

code.

12:09

But if you try to do the same with a

12:12

tuple, we get an error because a tuple

12:15

cannot be changed.

12:19

When you multiply a list by a number,

12:23

Python repeats the elements of the list

12:26

that many times.
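
A sketch of the list and tuple behaviour just described. The values 1 to 4 are assumed, since the narration only mentions that the first element is 1:

```python
x = [1, 2, 3, 4]  # a list (assumed values)
print(type(x))    # <class 'list'>

x[0] = 7          # lists can be changed
print(x)          # [7, 2, 3, 4]

t = (1, 2, 3, 4)  # a tuple cannot be changed
try:
    t[0] = 7
except TypeError as err:
    print("Error:", err)

# Multiplying a list repeats its elements.
repeated = [1, 2] * 2
print(repeated)   # [1, 2, 1, 2]
```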

12:28

In data analysis, we usually work with

12:31

data in the form of vectors and

12:34

matrices.

12:35

That is why it is helpful to use the

12:38

numpy module.

12:41

So we import the numpy module and give

12:44

it the short name np which is the

12:47

standard convention.

12:50

Then we use the function array to create

12:53

a numpy array

12:56

of the list and assign it to x.

13:01

If you now multiply this array by two,

13:04

we see that each value in the elements

13:07

has been multiplied by two.
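
To make the contrast concrete, here is a minimal sketch comparing a plain list with a NumPy array (the values are assumed):

```python
import numpy as np

values = [1, 2, 3, 4]
x = np.array(values)  # create a NumPy array from the list

print(values * 2)     # [1, 2, 3, 4, 1, 2, 3, 4], the list is repeated
print(x * 2)          # [2 4 6 8], each element is multiplied by two
```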

13:12

In Python, arrays and lists are stored

13:16

so that the first element has index

13:18

zero. Here, the first element of the

13:21

array stores the value one and its index

13:25

is zero.

13:28

The second element of the array stores

13:31

the value two and its index is one.

13:36

To access a value in an array,

13:40

we write the name of the array followed

13:43

by square brackets containing the index

13:46

of the element we want to print.

13:50

For example, this prints the value

13:53

stored at index zero, which is the first

13:57

element of the array.

14:00

This prints the value stored at index

14:02

one which is the second element of the

14:05

array.

14:11

If we try to print the value that is

14:13

stored at index 4, we'll get the

14:17

following error

14:19

because our array does not have an

14:22

index 4.

14:25

So remember that Python arrays start

14:28

counting at zero.

14:32

But what is happening here?

14:36

Well, in Python, a negative index counts

14:39

from the end of the array. So this

14:42

refers to the last element.

14:46

So -2 refers to the second to last

14:50

element of the array.

14:56

This code means that we like to print

14:58

the values in the array that start at

15:01

index zero and go up to but not

15:04

including index three.

15:08

We can think of this as printing the

15:10

first three elements of the array

15:13

given that we start at index zero

15:17

which prints the first three values of

15:20

the array.
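
The indexing rules above can be sketched as follows, assuming an array with the four values 1 to 4:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

print(x[0])    # 1, first element (index zero)
print(x[1])    # 2, second element
print(x[-1])   # 4, last element
print(x[-2])   # 3, second to last element
print(x[0:3])  # [1 2 3], index 0 up to but not including index 3

# Index 4 does not exist in an array with four elements.
try:
    x[4]
except IndexError as err:
    print("Error:", err)
```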

15:23

We will now see how to create the

15:25

following matrix in numpy.

15:29

This code creates a 2x2 numpy array or a

15:32

matrix.

15:36

Each inner list creates the rows of the

15:40

matrix.

15:42

The double square brackets in the output

15:45

show that we now have a two-dimensional

15:48

array.

15:49

Another

15:52

way to create the same matrix is to

15:54

first create a one-dimensional array.

15:59

Then we use reshape to change the shape

16:02

of the array into a 2x2 matrix.

16:06

Note that reshape is not a function in

16:09

the numpy module. It is actually a

16:12

method which can only be called on

16:14

existing numpy arrays.

16:17

Since x is an existing NumPy array,

16:22

we can apply the method reshape on it.

16:25

I will later explain the difference

16:27

between a function and a method.

16:31

So with the method reshape, we convert

16:34

the one-dimensional array into a

16:36

two-dimensional array or matrix with two

16:40

rows and two columns.

16:44

One convenient way to create the matrix

16:47

is to use a negative one here which says

16:52

create a two-dimensional array that has

16:54

two columns

16:57

and figure out the number of rows

16:59

automatically.

17:02

Suppose that we like to print the value

17:05

on the first row first column of this

17:08

matrix.

17:10

Then we say that we like to print the

17:12

value in the element at row index zero

17:17

and column index zero.

17:20

This will print the value stored in the

17:22

following element.

17:25

If you like to print all values in the

17:28

first column,

17:30

we select indices zero and one, which

17:34

means that we select the first two rows

17:38

of the first column,

17:40

which will print all the values in the

17:43

first column.

17:46

We can think of these two here as we

17:48

like to print the first two rows

17:52

given that we start at index zero.

17:56

We may also write like this where we say

17:59

that we like to start at index zero and

18:02

go to the end which will print all rows

18:06

of the first column.
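
Here is a sketch of the matrix steps described above. The values 1 to 4 and the names `A`, `B` and `C` are assumptions for illustration:

```python
import numpy as np

# A 2x2 matrix written directly: each inner list is one row.
A = np.array([[1, 2], [3, 4]])

# The same matrix from a one-dimensional array with the reshape method.
B = np.array([1, 2, 3, 4]).reshape(2, 2)

# With -1, NumPy figures out the number of rows for two columns.
C = np.array([1, 2, 3, 4]).reshape(-1, 2)

print(A[0, 0])    # 1, first row and first column
print(A[0:2, 0])  # [1 3], rows 0 and 1 of the first column
print(A[0:, 0])   # [1 3], from row 0 to the end of the first column
```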

18:10

To create a two-dimensional array that

18:13

has four rows and one column, we can

18:16

write like this

18:18

where we say that our two-dimensional

18:20

array should have four rows

18:24

and one column.

18:28

We can also do like this where we let

18:31

Python automatically figure out how many

18:34

rows the array should have.

18:37

When we say that we like to have just

18:39

one column

18:41

to check the shape of our

18:42

two-dimensional array, we can type the

18:46

name of the array followed by the attribute

18:49

shape

18:51

which tells us that our two-dimensional

18:53

array has four rows and one column.

18:59

Suppose that we have created a

19:01

two-dimensional array and want to

19:03

convert it to a one-dimensional array.

19:08

Then we can use the method flatten which

19:11

converts a multi-dimensional array into

19:14

one-dimensional array.

19:16

So here we have a one-dimensional array

19:21

which is converted to a two-dimensional

19:24

array

19:26

and then back to one-dimensional array.
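
As a minimal sketch of this round trip, with assumed values:

```python
import numpy as np

x = np.array([1, 2, 3, 4])  # one-dimensional array

col = x.reshape(4, 1)       # four rows and one column
col2 = x.reshape(-1, 1)     # same result, number of rows is inferred
print(col.shape)            # (4, 1)

flat = col.flatten()        # back to a one-dimensional array
print(flat)                 # [1 2 3 4]
print(flat.shape)           # (4,)
```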

19:31

In this code, we create the following

19:34

row vector

19:36

and the following column vector.

19:40

To multiply these two matrices which is

19:43

called a dot product, we use the

19:46

function dot in numpy.

19:50

We may also use the @ symbol for

19:53

matrix multiplications.
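
A sketch of the two ways to multiply a row vector by a column vector. The numbers are made up, since the on-screen vectors are not in the transcript:

```python
import numpy as np

row = np.array([[1, 2, 3]])      # row vector, shape (1, 3)
col = np.array([[4], [5], [6]])  # column vector, shape (3, 1)

print(np.dot(row, col))  # [[32]], since 1*4 + 2*5 + 3*6 = 32
print(row @ col)         # [[32]], the @ operator gives the same result
```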

19:56

With the following function, we can

19:58

create a two-dimensional array with

20:01

zeros.

20:03

Note that the numbers are in the form of

20:06

decimal numbers or floating point

20:08

numbers.

20:10

If you instead set dtype to int,

20:15

the zeros will instead come as integers

20:18

or whole numbers.

20:21

We can then assign the numbers 1 2 3 and

20:24

four to each element by specifying their

20:28

row and column indices.

20:32

For example, this assigns the value two

20:36

to the first row second column.

20:41

However, assigning decimal numbers to

20:44

the array

20:46

will now convert them to integers by dropping

20:50

the decimal part.

20:54

So to store decimal numbers we must set

20:58

dtype to float which is the default

21:01

setting for this function.

21:04

You may also see something like this

21:07

where float64 means that each decimal

21:10

number is stored using 64 bits for high

21:14

precision calculations.

21:17

Sometimes one uses float32 to save

21:21

memory and speed up computations, but

21:24

these decimal numbers are less precise

21:27

than float64.

21:31

Note the difference here.

21:34

When we store the value of pi in a

21:37

matrix,

21:39

float32 stores this value with only

21:42

about seven significant digits whereas

21:46

float64 stores this value with more

21:48

digits and therefore with higher

21:51

precision.

21:53

This is the default of the function.
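
The dtype behaviour above can be sketched like this. The variable names and the value 3.7 are my own choices for illustration:

```python
import numpy as np

Z = np.zeros((2, 2))              # 2x2 array of zeros, floats by default
print(Z.dtype)                    # float64

Zi = np.zeros((2, 2), dtype=int)  # zeros stored as integers
Zi[0, 1] = 2                      # first row, second column
Zi[1, 0] = 3.7                    # the decimal part is dropped, so 3 is stored
print(Zi)

# The same value of pi stored with two different precisions.
p32 = np.array([np.pi], dtype=np.float32)
p64 = np.array([np.pi], dtype=np.float64)
print(p32[0], p64[0])
```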

21:59

Suppose that we have stored the weight

22:01

in kilos of four individuals in a

22:04

one-dimensional array.

22:07

We can then use the function mean from

22:09

numpy to compute the average weight of

22:13

the four individuals.

22:16

With the function std, we can compute

22:19

the standard deviation.

22:22

Note that by default, the function

22:25

computes the population standard

22:27

deviation.

22:29

where we only have the sample size in

22:31

the denominator.

22:35

To instead compute the sample standard

22:37

deviation, we need to set ddof, the delta degrees of

22:40

freedom, to one, where we now have n minus

22:45

one in the denominator.
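
A sketch of the mean and the two standard deviations. The four weights are made-up numbers, not the ones shown in the video:

```python
import numpy as np

weight = np.array([70.0, 80.0, 65.0, 85.0])  # hypothetical weights in kilos

print(np.mean(weight))         # 75.0

pop_sd = np.std(weight)        # population SD, divides by n
sample_sd = np.std(weight, ddof=1)  # sample SD, divides by n - 1
print(pop_sd, sample_sd)
```

The sample standard deviation is always the larger of the two because it divides the same sum of squares by a smaller number.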

22:49

We will now see how to plot some simple

22:52

data.

22:55

Suppose that we have the following data

22:57

of five individuals.

23:00

We can insert the data by using numpy

23:03

arrays like this

23:07

to create the following simple scatter

23:09

plot. We can import the pyplot

23:13

submodule from the main plotting library

23:16

Matplotlib to access functions like

23:19

plot and scatter.

23:22

We will here use the scatter function.

23:24

And to see which parameters we can use

23:26

in this function, we can run the

23:28

following line.

23:30

To make a scatter plot, we use the

23:33

function scatter

23:35

and plug in the names of our numpy

23:38

arrays that were created previously.

23:42

To use bigger points in the plot, we set

23:46

the parameter s to 100.

23:49

S is called a parameter which is the

23:52

name the function expects

23:56

and the number 100 that we pass to it is

23:59

called an argument.

24:03

With this parameter we can choose a

24:05

certain color for the points. To add nice

24:09

labels and a grid to the plot,

24:13

we use these functions to modify its

24:16

appearance.

24:19

We add a grid, put labels on the axis,

24:27

and finally display the plot with show.

24:31

In Spyder, you need to run all these

24:33

lines at the same time to create the

24:36

plot

24:39

like this.
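
Put together, the plotting steps might look like the sketch below. The ages, blood pressures, color and axis labels are all hypothetical, because the data table from the video is not in the transcript:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for five individuals.
age = np.array([20, 30, 40, 50, 60])
bp = np.array([118, 119, 123, 122, 127])

plt.figure()
plt.scatter(age, bp, s=100, color="blue")  # s sets the point size
plt.grid(True)                             # add a grid
plt.xlabel("Age (years)")                  # label the axes
plt.ylabel("Systolic blood pressure")
ax = plt.gca()                             # keep a handle to the axes
plt.show()                                 # display the plot
```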

24:47

We will now have a look at the

24:49

difference between using functions and

24:51

methods in Python, which is crucial to

24:54

understand when using certain libraries

24:56

in Python. To explain the difference

25:00

between a function and a method, we will

25:02

here set up a code to perform linear

25:05

regression based on our previous data.

25:08

So we will use age

25:12

to predict the systolic blood pressure

25:14

where we will estimate the intercept

25:17

and the slope of the regression line.

25:21

To perform linear regression, we will

25:23

here use the function linregress that

25:26

we import from SciPy's stats module.

25:30

Then we create one-dimensional arrays of

25:33

the age and the blood pressure of the

25:35

five individuals.

25:38

Next, we perform linear regression with

25:41

the imported function

25:45

and print the results.

25:50

We see that the slope of the fitted line

25:52

is about 0.22

25:55

which means that the blood pressure is

25:57

predicted to increase by 0.22

26:00

for one year increase in age.

26:04

The intercept is about 113

26:07

which can be interpreted as a person at

26:10

age zero has a blood pressure of 113.

26:15

However, since we have no data on

26:17

children, we should not put much trust

26:19

in this value. To add this regression

26:22

line to our plot,

26:27

we first extract the slope and the

26:29

intercept from the output stored in res.

26:34

Then we create an array that takes the

26:37

values 0 and 80, which gives the x

26:40

coordinates to draw the line between.

26:45

Then we compute the corresponding

26:47

y-coordinates for these two points with

26:50

the equation of our straight line.

26:55

So the values of the slope and intercept

26:58

will go in here.

27:01

And we will compute the corresponding y

27:03

values for the x values 0 and 80.

27:09

Finally, we run the same code as before

27:11

when we generated a scatter plot.

27:15

But we here now also use the function

27:18

plot to draw a line between the two

27:21

points.

27:24

So that we now show the regression line

27:26

in the plot.

27:29

We can now use our regression line to

27:32

predict the blood pressure for a person

27:34

of age 50 by drawing a vertical line

27:38

from 50

27:40

and then the horizontal line from the

27:42

regression line to the y-axis where we

27:45

see that the predicted blood pressure is

27:47

about 124.

27:50

A more accurate way to compute the

27:52

predicted value is to use the equation

27:55

of the line where we plug in the

27:58

computed slope and the intercept

28:01

and the age and do the math.

28:06

The problem with this calculation is

28:08

that we here use rounded values of the

28:11

intercept and the slope.

28:15

An even better way to do this would be

28:18

to compute this directly in Python

28:20

because we then avoid rounding the

28:22

values of the slope and the intercept.

28:27

So the importance of this exercise was

28:30

to show how we can use an imported

28:32

function like this.
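
A sketch of this workflow with `linregress` from `scipy.stats`. The data are made up for illustration, so the slope and intercept will not match the video's numbers exactly:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data for five individuals.
age = np.array([20, 30, 40, 50, 60])
bp = np.array([118, 119, 123, 122, 127])

res = linregress(age, bp)  # an imported function, called directly
slope = res.slope
intercept = res.intercept
print(slope, intercept)

# Two points for drawing the regression line between x = 0 and x = 80.
x_line = np.array([0, 80])
y_line = intercept + slope * x_line

# Predicted blood pressure at age 50, without rounding the coefficients.
predicted_50 = intercept + slope * 50
print(predicted_50)
```

The arrays `x_line` and `y_line` could then be passed to the plot function to draw the line on top of the scatter plot.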

28:36

We will now perform linear regression

28:38

again but this time we will use the

28:41

library scikit-learn by using its

28:44

submodule linear_model. The difference from

28:48

the previous example is that this is not

28:50

a function. It is actually a so-called

28:54

class. We therefore now import a class

28:58

and not a function.

29:01

So from the library scikit-learn we use

29:04

the module linear_model

29:07

from where we import the class

29:10

LinearRegression.

29:13

The name of a class usually involves

29:16

capital letters.

29:18

We add the data like this. But note that

29:21

the method that we will use here

29:24

requires that the data is in the form of

29:27

two-dimensional arrays.

29:30

which explains why we now reshape the

29:32

one-dimensional arrays to two-dimensional

29:36

arrays

29:39

which look like this. If you print them,

29:43

we can see that they are two-dimensional

29:45

arrays because there are two square

29:47

brackets around the numbers.

29:51

Since we here import a class instead of

29:54

a function,

29:56

we must first create an instance of this

29:59

class.

30:01

When we run this line, we will create an

30:03

object with the name model.

30:07

If we run the corresponding code in a

30:09

Python text editor,

30:12

we will create an object with a name

30:15

model.

30:17

The object we created named model has

30:22

attributes that store information about

30:24

the regression which will later store

30:27

the slope and the intercept of the

30:29

fitted line.

30:32

This class also provides functions that

30:35

we can apply on our object. When a

30:38

function belongs to an object, it is

30:41

instead called a method.

30:46

If we type model, we should see the

30:49

attributes that are associated with the

30:51

object.

30:54

These two attributes will later store

30:56

the slope and intercept of the

30:58

regression line. To perform linear

31:01

regression, we will use the method fit

31:04

which is included in this class.

31:08

If you now run the following line,

31:12

we will perform linear regression where

31:14

the values of the slope and the

31:16

intercept will be stored in the object

31:19

like this.

31:20

We can extract the values of the slope

31:23

and intercept with the following

31:25

attributes that are now associated to

31:28

the object. For example, we can print

31:31

the slope of the regression line like

31:33

this.

31:37

Since this object stores information

31:39

about the slope and the intercept of the

31:42

regression line, we can now use our

31:45

model to for example predict the

31:47

systolic blood pressure by using the

31:49

method predict.

31:52

So this was our previous code. When we

31:56

run this line, we create an instance of

31:58

the LinearRegression class, which means

32:01

that we create an object of that class

32:04

called model.

32:07

Then we use the fit method on the model

32:09

object to compute linear regression.

32:14

After we have used this method, the

32:16

model object will store the values of

32:18

the intercept and slope as its

32:21

attributes that we can extract from the

32:24

object

32:25

like this.

32:28

Note that we use the name of the object

32:32

followed by the name of the attribute.

32:37

Now to predict the blood pressure for a

32:40

person at the age of 50,

32:44

we simply use the method predict

32:48

and plug in the value 50 in a

32:51

two-dimensional array because the method

32:54

expects a two-dimensional array.

32:56

Although we only provide one number,

33:01

if we then print the predicted blood

33:03

pressure,

33:06

we see that we get about the same value

33:08

as we got previously when we calculated

33:11

this by hand.
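
The class-based workflow might be sketched like this, again with made-up data. Here `model` is the object, `fit` and `predict` are methods, and `coef_` and `intercept_` are attributes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data, reshaped to two-dimensional arrays (one column each).
age = np.array([20, 30, 40, 50, 60]).reshape(-1, 1)
bp = np.array([118, 119, 123, 122, 127]).reshape(-1, 1)

model = LinearRegression()  # create an instance (an object) of the class
model.fit(age, bp)          # method: estimates slope and intercept

print(model.coef_)          # attribute: slope, here a two-dimensional array
print(model.intercept_)     # attribute: intercept

# predict expects a two-dimensional array, even for a single value.
pred = model.predict(np.array([[50]]))
print(pred)
```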

33:14

So how do we add the regression line to

33:16

our scatter plot?

33:19

We will first extract the slope and the

33:21

intercept.

33:24

But note that the slope is stored in a

33:26

two-dimensional array.

33:30

So to get plain numbers instead of

33:32

arrays, we first flatten the

33:35

two-dimensional array to a

33:37

one-dimensional array.

33:40

Then we extract the first element as a

33:43

plain number from the one-dimensional

33:45

array.

33:48

If you now print the slope and the

33:51

intercept, we see that we now have plain

33:53

numbers

33:55

to compute the start and the end of the

33:58

coordinates of the line. We use the same

34:01

code as before

34:05

and run the same lines to generate the

34:08

scatter plot with the regression line.

34:12

Instead of computing this line manually

34:14

like this,

34:16

we can use the predict method.

34:20

But note that we then first need to

34:22

create a two-dimensional array

34:26

because the predict method expects a

34:29

two-dimensional array as input.

34:34

We will now compute the predicted values

34:36

for each person by using the method

34:39

predict based on the variable age

34:43

which corresponds to the y values of the

34:46

line.

34:48

Then we can compute the sum of the

34:50

squared error or sum of squared

34:53

residuals.

34:56

The differences between the observed

34:58

blood pressure values and the predicted

35:00

values are called residuals.

35:04

which correspond to the vertical

35:06

distances between the data points and

35:08

the line.

35:12

If you square those distances and sum

35:14

them, we get the SSE value.

35:18

If we divide this by the sample size,

35:21

we get the mean squared error.

35:24

We can compute the same thing if we

35:27

import the following function

35:30

from the submodule metrics

35:33

which we can use to compute the MSE.

35:38

Let's also use this function to compute

35:40

the R squared value

35:45

which is about 0.84.

35:48

This means that 84% of the variability

35:51

in the systolic blood pressure is

35:53

explained by age.
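
A sketch of these error measures, computed both by hand and with functions from `sklearn.metrics`. The data are hypothetical, so the R-squared value here differs from the 0.84 in the video:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data for five individuals.
age = np.array([20, 30, 40, 50, 60]).reshape(-1, 1)
bp = np.array([118, 119, 123, 122, 127])

model = LinearRegression().fit(age, bp)
predicted = model.predict(age)   # y values on the regression line

residuals = bp - predicted       # observed minus predicted
sse = np.sum(residuals ** 2)     # sum of squared errors
mse = sse / len(bp)              # mean squared error by hand

print(mse, mean_squared_error(bp, predicted))  # the two should agree
r2 = r2_score(bp, predicted)
print(r2)
```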

35:57

To better estimate the R-squared value in

35:59

machine learning,

36:02

we usually train the model based on some

36:05

training data

36:07

and plug in our test data here.

36:11

The scikit-learn library is a machine

36:14

learning library where the idea is that

36:16

you should split the data into training

36:19

data and test data to validate your

36:22

trained model. It will therefore not

36:24

compute any p values.

36:28

Although our data set is way too small

36:31

to be split into training data and test

36:34

data. I will show how it works.

36:37

We import the train_test_split function

36:41

which is used to split the data into

36:43

training data and test data

36:47

where we here say that 40% of the

36:50

individuals should go into the test data

36:52

set.

36:54

The five individuals will be randomly

36:57

put into the two data sets. We here set

37:01

the random state to some value which

37:03

makes sure that every time we run the

37:06

code we'll get the same random split.

37:10

You may change this integer to another

37:12

value but note that you will then get a

37:14

different random split.

37:18

So when we run this line,

37:22

these three individuals happen to be

37:24

selected for the training data and

37:28

these two individuals happen to be

37:30

selected for the test data.

37:34

These values therefore go into X_train

37:38

whereas these values go into X_test

37:42

which are used to see how well the model

37:44

predicts the blood pressure of these two

37:46

individuals based on their age.

37:50

When we train the model, we use the

37:52

blood pressure of the three individuals

37:55

so that the model predicts the blood

37:57

pressure as well as possible.

38:00

And then we test the model based on the

38:02

age of the two individuals to see how

38:05

close we get to the observed blood

38:08

pressure of these two individuals.

38:11

So to train the model, we plug in the

38:14

training data here,

38:17

which means that we fit a regression

38:19

line to these three points.

38:22

Then we use the ages of the two

38:25

individuals that were included in the

38:27

test data set to predict the blood

38:29

pressure based on the regression line.

38:33

We therefore use these values in the

38:35

method predict which results in the

38:38

following predicted blood pressures of

38:40

the two individuals.

38:45

Then we compare these predicted values

38:49

with the observed values in the test

38:51

data.

38:53

to compute the R-squared value. So this is

38:57

how you can estimate the R-squared value

39:00

based on a test data set.
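
The split, train and test steps could be sketched as follows. The data and the `random_state` value are hypothetical, so which individuals end up in each set will differ from the video:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data for five individuals.
age = np.array([20, 30, 40, 50, 60]).reshape(-1, 1)
bp = np.array([118, 119, 123, 122, 127])

# 40% of the individuals go into the test set; random_state fixes the split.
X_train, X_test, y_train, y_test = train_test_split(
    age, bp, test_size=0.4, random_state=1
)

model = LinearRegression().fit(X_train, y_train)  # train on three points
y_pred = model.predict(X_test)                    # predict the other two

print(r2_score(y_test, y_pred))  # R-squared on the test data
```

With only two test points this R-squared is very unstable, which is why the data set here is really too small for a split.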

39:04

Another way to do this especially if we

39:06

have a small sample size would be to run

39:09

cross validation where all data points

39:12

will be used both in the training data

39:14

and the test data.

39:17

Let's see how to compute some

39:19

inferential statistics in Python.

39:24

One way is to use the library

39:26

statsmodels,

39:28

where this line defines that we like to

39:30

estimate an intercept of our model. If

39:34

this is not included, the intercept will

39:36

be automatically set to zero.

39:40

If you print X, we see that our first

39:43

column now includes ones which is used

39:47

to estimate the intercept.

39:50

Whereas the second column includes the

39:52

ages.

39:55

If we run this code, we'll get the same

39:58

intercept and slope as we estimated

40:01

previously when we used the full data

40:03

set. But where we here now also see the

40:06

corresponding p values and the 95%

40:10

confidence intervals for these

40:11

coefficients.

40:16

Once you get the idea of how to use

40:19

libraries in Python, it is quite easy to

40:22

use other statistical tests. For

40:25

example, suppose that we like to use a t

40:27

test to analyze if there is a

40:29

significant difference in the systolic

40:32

blood pressure between two age groups.

40:35

We then import the following function

40:38

that can compute an independent t test

40:42

and plug in the data into numpy arrays,

40:45

one array for each age group.

40:49

Then we plug in these arrays in the t

40:51

test function.

40:54

and here say that we do not assume equal

40:56

variance in the two groups which means

40:59

that we use Welch's t-test.

41:03

The outputs from this function will give

41:05

us the t statistic, p-value, and degrees

41:09

of freedom

41:12

which can be printed like this.
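
The transcript does not show which function is imported. One function that returns exactly these three outputs is `ttest_ind` from `statsmodels.stats.weightstats`, so the sketch below assumes that one. The blood pressure values are made up for illustration.

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Hypothetical systolic blood pressures, one array for each age group.
group1 = np.array([112, 118, 115, 121, 124])
group2 = np.array([126, 133, 128, 132, 131])

# usevar="unequal" means that equal variances are not assumed (Welch's t-test).
t_stat, p_value, dof = ttest_ind(group1, group2, usevar="unequal")

print("t statistic:", t_stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
```

With SciPy, `scipy.stats.ttest_ind(group1, group2, equal_var=False)` runs the same test.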

41:20

Finally, I will show how you can import

41:22

some data in Python and create some nice

41:25

plots. I have here created a data set in

41:29

Excel and named the Excel file book one.

41:35

One way to import data in Python is to

41:38

use the library pandas.

41:42

From this library, we use the function

41:44

read_excel to read in the data.

41:50

This is the file name of the Excel file

41:53

and this is the path to the directory

41:56

where I saved the file. You need to

41:59

change this path to where you have saved

42:01

the file on your computer.

42:05

When we run this line, we'll create a

42:08

so-called data frame that can store

42:10

variables with different data types.

42:14

It is always good to print the first

42:16

lines of the data frame to make sure it

42:18

looks okay.

42:21

Here we use the method head from pandas

42:24

to print the first five rows.
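
Reading the file and printing the first rows could look like the sketch below. The directory, the file name `Book1.xlsx`, and the column names are assumptions, so change them to match your own file. The fallback data frame contains made-up values and is only there so that the sketch also runs when the file is not found.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location and file name: change both to match your computer.
data_dir = Path.home() / "Documents"
excel_path = data_dir / "Book1.xlsx"

if excel_path.exists():
    # Reading .xlsx files also requires the package openpyxl.
    df = pd.read_excel(excel_path)
else:
    # Made-up fallback data with hypothetical column names.
    df = pd.DataFrame({
        "age_group": ["20-35"] * 5 + ["36-55"] * 5,
        "weight": [62, 70, 68, 75, 80, 72, 85, 78, 90, 82],
        "blood_pressure": [112, 118, 115, 121, 124, 126, 133, 128, 132, 131],
    })

# head() returns the first five rows by default.
print(df.head())
```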

42:29

It is also good to check the data types

42:31

of the variables.

42:34

The data type of the variable age group

42:37

is called object

42:39

which basically means that we have text

42:41

in this column.

42:44

This is what we want because this is a

42:46

categorical variable that defines the

42:48

groups.

42:50

The two other columns contain numeric

42:53

data, where int64 means that the values

42:56

are integers.

42:58

This column shows how many non-missing

43:01

values we have in each column.

43:05

Since we have 10 rows which are indexed

43:09

from zero to 9,

43:12

we have no missing data.

43:15

By using the method describe, we can get

43:18

some descriptive statistics of the

43:20

columns with numeric data in a data

43:23

frame.

43:26

For example, we can see that the average

43:28

blood pressure of the 10 individuals is

43:31

124.
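
These checks can be sketched as follows. The data frame is built in the code with made-up values and hypothetical column names, so that the example runs without the Excel file.

```python
import pandas as pd

# Made-up data with hypothetical column names.
df = pd.DataFrame({
    "age_group": ["20-35"] * 5 + ["36-55"] * 5,
    "weight": [62, 70, 68, 75, 80, 72, 85, 78, 90, 82],
    "blood_pressure": [112, 118, 115, 121, 124, 126, 133, 128, 132, 131],
})

# info() lists each column with its non-null count and data type.
df.info()

# describe() summarizes the numeric columns (count, mean, std, quartiles).
summary = df.describe()
print(summary)

# A single value can be picked out of the summary table.
print(summary.loc["mean", "blood_pressure"])
```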

43:38

We can print the full data frame like

43:40

this.

43:43

Now suppose that we would like to extract the

43:46

variable weight from the data frame.

43:49

Then we use the name of the data frame

43:51

and the name of the column with square

43:54

brackets.

43:57

If you now print the data type of the

44:00

extracted column,

44:02

we see that it is of type series which

44:06

is a one-dimensional labeled array. If

44:10

you print this, we see that the column

44:12

name has disappeared.

44:15

If we instead use double square brackets,

44:20

we see that the variable weight is still

44:22

a data frame and that the column name

44:25

has not been removed. It is therefore

44:28

good to keep separate variables as data

44:31

frames instead of series.
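
The difference between single and double square brackets can be sketched like this, using a small made-up data frame with hypothetical column names.

```python
import pandas as pd

# Made-up data with hypothetical column names.
df = pd.DataFrame({
    "weight": [62, 70, 68],
    "blood_pressure": [112, 118, 115],
})

# Single square brackets return a Series (one-dimensional).
weight_series = df["weight"]
print(type(weight_series))

# Double square brackets return a DataFrame with one column,
# so the column name is kept as a column header.
weight_frame = df[["weight"]]
print(type(weight_frame))
print(weight_frame.columns.tolist())
```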

44:34

With this code, we can make a box plot.

44:38

boxplot is a method of the pandas data

44:41

frame object.

44:43

Pandas uses the library Matplotlib

44:46

behind the scenes to draw the plot.

44:50

Note that the column name is placed here

44:53

which is nice.

44:56

With the method hist, we can create a

44:59

histogram of the data.

45:02

For our categorical variable, we can for

45:05

example create a pie chart where we see

45:08

that we have an equal number of

45:10

individuals in the two age groups.
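
The three plots could be sketched as below. The data values and column names are made up, and the exact calls in the video may differ. The `matplotlib.use("Agg")` line is an assumption that lets the sketch run as a plain script without a display. Remove it if you want the plots to appear in your editor.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, remove to show plots
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with hypothetical column names.
df = pd.DataFrame({
    "age_group": ["20-35"] * 5 + ["36-55"] * 5,
    "weight": [62, 70, 68, 75, 80, 72, 85, 78, 90, 82],
})

# Box plot: a one-column data frame keeps the column name as the tick label.
fig1, ax1 = plt.subplots()
df[["weight"]].boxplot(ax=ax1)

# Histogram of the same variable.
fig2, ax2 = plt.subplots()
df["weight"].hist(ax=ax2)

# Pie chart of how many individuals there are in each age group.
counts = df["age_group"].value_counts()
fig3, ax3 = plt.subplots()
counts.plot.pie(ax=ax3)
```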

45:17

If you like to create a box plot of all

45:19

the numerical columns in our data frame,

45:23

we simply use the method boxplot

45:25

directly on the data frame.

45:29

We may also extract which numerical

45:32

variables we like to include in the box

45:34

plot and determine the order of the

45:36

variables.

45:38

If you like to show the blood pressure

45:40

separate for the two age groups,

45:45

we may first extract these two variables

45:49

and then set by equal to the variable

45:52

age group.

45:56

It is also possible to do this in one

45:58

line where we here apply the box plot

46:01

method on the data frame and define

46:05

which column in the data frame we like

46:07

to display

46:10

To add a nicer title and a label on

46:14

the y-axis,

46:17

we import the Matplotlib library

46:22

and set an appropriate y label

46:26

and a title

46:29

and use this code to avoid showing the

46:31

subtitle.
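
A sketch of the grouped box plot with a custom title and y label is shown below. The data, column names, and label texts are made up. The extra heading that pandas adds above a grouped box plot is what Matplotlib calls the suptitle, and setting it to an empty string hides it. The `matplotlib.use("Agg")` line is an assumption so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, remove to show plots
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with hypothetical column names.
df = pd.DataFrame({
    "age_group": ["20-35"] * 5 + ["36-55"] * 5,
    "blood_pressure": [112, 118, 115, 121, 124, 126, 133, 128, 132, 131],
})

fig, ax = plt.subplots()

# One box per age group for the selected column.
df.boxplot(column="blood_pressure", by="age_group", ax=ax)

# Replace the default labels with our own (hypothetical texts).
ax.set_ylabel("Systolic blood pressure")
ax.set_title("Blood pressure by age group")

# Hide the automatic heading "Boxplot grouped by age_group".
fig.suptitle("")
```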

46:34

If you like to compute the t test as we

46:37

did before, we extract the variable blood

46:39

pressure only for the rows where the age

46:43

group is equal to 20 to 35.

46:47

Which means that we store these values

46:49

in the variable group one

46:54

and these values in group two.

47:00

Then we just use the same lines of code

47:02

as we used previously to compute the t

47:05

test.

47:08

Note that I here store the degrees of

47:10

freedom with the name DFS

47:13

so that I don't overwrite the data frame

47:15

with the name DF.
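
Extracting the two groups from the data frame and running the test could be sketched as follows. The data, the column names, and the group labels are made up, and `ttest_ind` from statsmodels is an assumption about which function is used.

```python
import pandas as pd
from statsmodels.stats.weightstats import ttest_ind

# Made-up data with hypothetical column names and group labels.
df = pd.DataFrame({
    "age_group": ["20-35"] * 5 + ["36-55"] * 5,
    "blood_pressure": [112, 118, 115, 121, 124, 126, 133, 128, 132, 131],
})

# Keep the blood pressure only for the rows that belong to each age group.
group1 = df.loc[df["age_group"] == "20-35", "blood_pressure"].to_numpy()
group2 = df.loc[df["age_group"] == "36-55", "blood_pressure"].to_numpy()

# The degrees of freedom are stored as dfs so that df is not overwritten.
t_stat, p_value, dfs = ttest_ind(group1, group2, usevar="unequal")
print(t_stat, p_value, dfs)
```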

47:19

Okay, let's now try to do multiple

47:21

linear regression. We would like to

47:23

predict the blood pressure based on

47:26

weight and age group. This can be seen

47:28

as checking whether there is a difference in

47:31

blood pressure between the two age

47:33

groups while controlling for the variable

47:36

weight which may influence the outcome

47:39

and act as a confounder.

47:42

So we begin by extracting the predictor

47:44

variables that we store in a data frame

47:48

called X. Then we convert our variable

47:51

age group to a categorical variable

47:54

because this variable consists of two

47:57

groups.

47:58

If you print the data frame X, we see

48:01

that the variable age group has now been

48:04

converted to boolean variables, one for

48:08

each group.

48:10

By setting drop_first equal to True, we

48:14

will delete this redundant column

48:17

so that X now only has one column for

48:20

the categorical variable.

48:24

So this means that the first five

48:26

individuals do not belong to the age

48:29

group 36 to 55

48:32

but these do.

48:35

Then we convert this variable to a

48:37

numeric variable, where the zeros define the

48:40

baseline group, that is, the ones in the age

48:42

group 20 to 35, and the ones define the

48:47

age group 36 to 55.

48:51

Next, we add an intercept to our model

48:56

and fit the regression line to our data.

49:02

We can see that the ones in the age

49:04

group 36 to 55 have an intercept that is

49:08

8.6 greater than the ones in the

49:10

baseline group which are the ones in the

49:13

age group 20 to 35. So there is still a

49:17

significant difference in the blood

49:18

pressure between the two age groups when

49:21

we control for the variable weight.

49:25

Let's remove the variable weight from

49:27

the model so that we only use the

49:30

variable age group as a predictor

49:33

and run the code and print the p-values.

49:40

This p-value is the same

49:44

as the one we get from a t test by

49:47

assuming an equal variance of the two

49:49

groups.

49:51

This was the end of this video about the

49:53

basics of using Python for data

49:56

analysis.

49:57

Thanks for watching.

Interactive Summary

This video provides a comprehensive introduction to Python for data analysis, covering installation, package management, and the distinction between functions and methods. It demonstrates how to perform linear regression, conduct statistical tests, and read data from Excel files. The tutorial explores Python environments such as Spyder and Google Colab, and covers fundamental data types (integers, floats, strings, booleans). It introduces essential libraries such as NumPy for numerical operations and array manipulation, Matplotlib for data visualization (scatter plots, histograms, box plots), and pandas for data frame manipulation and analysis. The video also explains machine learning with scikit-learn, including model training, prediction, and evaluation metrics like R-squared, as well as inferential statistics with statsmodels for hypothesis testing and confidence intervals.
