Machine Learning
Machine learning makes the computer learn from studying data and statistics.
It is a step in the direction of artificial intelligence (AI).
A machine learning program analyses data and learns to predict the outcome.
Data Set
In the mind of a computer, a data set is any collection of data.
It can be anything from an array to a complete database. Example of an array:
[99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
Data Types
The three main categories of data types:
Numerical
Categorical
Ordinal
Numerical data are split into two main categories:
Discrete Data - numbers that are limited to integers. e.g., The number of cars passing by.
Continuous Data - numbers that are of infinite value. e.g., The price of an item, or the size of an item
Categorical data are values that cannot be measured up against each other. e.g., a color value, or any yes/no values.
Ordinal data are like categorical data, but can be measured up against each other. e.g., school grades where A is better than B and so on.
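A minimal sketch of how the three categories might look as Python values (all values below are hypothetical, just to illustrate the distinction):

```python
# Hypothetical example values, one per category:
cars_passing = 7          # discrete numerical: limited to integers
item_price = 19.95        # continuous numerical: can take any value
color = "red"             # categorical: colors cannot be ranked
grades = ["B", "A", "C"]  # ordinal: grades CAN be ranked (A is better than B)

# ordinal values can be ordered, categorical values cannot
grades_ranked = sorted(grades)
print(grades_ranked)  # ['A', 'B', 'C']
```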
By knowing the data type of your data source, you will be able to know what technique to use when analyzing them.
You will learn more about statistics and analyzing data in the next chapters.
Mean Median Mode
Mean - The average value
Median - The mid-point value
Mode - The most common value
e.g., the speed of 13 vehicles:
Speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
Calculate Mean = (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
e.g., Use the NumPy mean() method to find the average speed:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
output: 89.76923076923077
Median is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
Use the NumPy median() method to find the middle value:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
output: 87.0
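If there is an even number of values, there is no single middle value; the median is then the mean of the two middle values. A sketch using the same speeds with one value removed:

```python
import numpy

# 12 values: sorted, the two middle values are 86 and 87,
# so the median is their mean
speed = [99, 86, 87, 88, 86, 103, 87, 94, 78, 77, 85, 86]
x = numpy.median(speed)
print(x)  # 86.5
```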
The mode is the value that appears the most times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 → mode = 86 (86 occurs three times)
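NumPy has no mode method, so a sketch using Python's built-in statistics module instead:

```python
from statistics import mode

# 86 occurs three times, more often than any other value
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
x = mode(speed)
print(x)  # 86
```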
Standard Deviation
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.
e.g., the speed of 7 cars: speed = [86, 87, 88, 86, 87, 85, 86]. The standard deviation is about 0.9, meaning most of the values are within 0.9 of the mean value (86.4).
Use the NumPy std() method to find the standard deviation:
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
output: 0.9
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you multiply the standard deviation by itself, you get the variance.
Use the NumPy var() method to find the variance:
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
output: 1432.24
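As a quick cross-check of the relationship above, squaring the standard deviation reproduces the variance:

```python
import numpy

speed = [32, 111, 138, 28, 59, 77, 97]
std = numpy.std(speed)
var = numpy.var(speed)

# the variance is the standard deviation multiplied by itself
print(std ** 2)
print(var)
```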
Percentiles
Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
Use the NumPy percentile() method to find the percentiles:
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 75)
print(x)
output: 43.0
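The 50th percentile is, by definition, the median, which makes a handy sanity check for the two methods:

```python
import numpy

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

# the 50th percentile and the median are the same value
print(numpy.percentile(ages, 50))  # 31.0
print(numpy.median(ages))          # 31.0
```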
Data Distribution
We use the Python module NumPy, which comes with a number of methods to create random data sets, of any size.
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
Histogram
To visualize the data set we can draw a histogram with the data we collected.
We will use the Python module Matplotlib to draw a histogram:
Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
Big Data Distributions
Create an array with 100000 random numbers, and display them using a histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()
Normal Data Distribution
In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for it.
A typical normal data distribution, with mean 5.0 and standard deviation 1.0:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
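With 100000 samples, the sample statistics should land very close to the mean and standard deviation passed to normal(); a quick check sketch:

```python
import numpy

x = numpy.random.normal(5.0, 1.0, 100000)

# the sample mean and standard deviation should be close to
# the 5.0 and 1.0 we asked for
print(numpy.mean(x))  # close to 5.0
print(numpy.std(x))   # close to 1.0
```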
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length: one for the values of the x-axis, and one for the values of the y-axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Here the x array represents the age of each car, and the y array represents the speed of each car.
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Random Data Distributions
A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
Regression
The term regression is used when you try to find the relationship between variables.
Linear Regression
Linear regression uses the relationship between the data points to draw a straight line through them. This line can be used to predict future values.
Start by drawing a scatter plot:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
e.g., Import scipy and draw the line of Linear Regression:
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
R for Relationship
It is important to know how well the relationship between the values of the x-axis and the values of the y-axis is; if there is no relationship, linear regression cannot be used to predict anything.
The relationship is measured with a value called r, the coefficient of correlation. The r value ranges from -1 to 1, where 0 means no relationship and 1 (or -1) means 100% related.
e.g., How well does my data fit in a linear regression?
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)
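The r returned by linregress() is the Pearson correlation coefficient, so numpy.corrcoef() gives the same number; a useful cross-check sketch:

```python
import numpy
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

# numpy.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is r
r_numpy = numpy.corrcoef(x, y)[0, 1]
print(r, r_numpy)
```

For this data set r is negative: older cars tend to be slower.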
Predict Future Values
e.g., Predict the speed of a 10 years old car:
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
speed = myfunc(10)
print(speed)
Polynomial Regression
If your data points clearly will not fit a linear regression (a straight line through all data points), it might be ideal for polynomial regression.
e.g., Import numpy and matplotlib then draw the line of Polynomial Regression:
import numpy
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
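To measure how well the polynomial fits, you can compute an R-squared score. A sketch using only NumPy (sklearn.metrics.r2_score would give the same number):

```python
import numpy

x = numpy.array([1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22])
y = numpy.array([100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100])

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

# R-squared = 1 - (residual sum of squares / total sum of squares);
# values close to 1 mean a good fit
ss_res = numpy.sum((y - mymodel(x)) ** 2)
ss_tot = numpy.sum((y - numpy.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)
```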
Multiple Regression
Multiple regression is like linear regression, but with more than one independent value, meaning that we try to predict a value based on two or more variables.
Start by importing the Pandas module:
import pandas
The Pandas module allows us to read csv files and return a DataFrame object.
df = pandas.read_csv("cars.csv")
Then make a list of the independent values and call this variable X. Put the dependent values in a variable called y.
X = df[['Weight', 'Volume']]
y = df['CO2']
We will use some methods from the sklearn module
from sklearn import linear_model
From the sklearn module we will use the LinearRegression() method to create a linear regression object.
This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship:
regr = linear_model.LinearRegression()
regr.fit(X, y)
Now we have a regression object that is ready to predict CO2 values based on a car's weight and volume:
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300ccm:
predictedCO2 = regr.predict([[2300, 1300]])
Complete example:
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300ccm:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)
Result: 107.208
Coefficient
The coefficient is a factor that describes the relationship with an unknown variable.
Example: if x is a variable, then 2x is x two times. x is the unknown variable, and the number 2 is the coefficient.
In this case, we can ask for the coefficient value of weight against CO2, and for volume against CO2. The answer(s) we get tells us what would happen if we increase, or decrease, one of the independent values.
Print the coefficient values of the regression object:
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
Result: [0.00755095 0.00780526]
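The coefficients can be read directly as the predicted change in CO2 per unit increase of each independent value. A sketch using the coefficient values printed above:

```python
# coefficient values from the regression result above
weight_coef = 0.00755095   # predicted change in CO2 per unit increase in Weight
volume_coef = 0.00780526   # predicted change in CO2 per unit increase in Volume

# if the weight increases by 1000 (volume unchanged), the model
# predicts CO2 to increase by about 7.55
increase = weight_coef * 1000
print(round(increase, 2))  # 7.55
```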
Scale
When your data has different values, and even different measurement units, it can be difficult to compare them. What is kilograms compared to meters? Or altitude compared to time?
The answer to this problem is scaling. We can scale data into new values that are easier to compare.
There are different methods for scaling data.
Standardization is one of them:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean and s is the standard deviation.
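The formula can be applied directly with NumPy. A sketch with made-up weight values (not taken from cars2.csv), producing the same z-scores StandardScaler would:

```python
import numpy

# hypothetical weight values for illustration
weights = numpy.array([790.0, 1100.0, 1300.0, 1500.0, 2300.0])

u = numpy.mean(weights)   # mean
s = numpy.std(weights)    # standard deviation
z = (weights - u) / s     # standardized values
print(z)

# standardized data always has mean 0 and standard deviation 1
print(numpy.mean(z), numpy.std(z))
```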
The Python sklearn module has a method called StandardScaler() which returns a Scaler object with methods for transforming data sets.
Scale all values in the Weight and Volume columns:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("cars2.csv")
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
print(scaledX)