Machine Learning
Machine learning makes the computer learn from studying data and statistics.
It is a step in the direction of artificial intelligence (AI).
A machine learning program analyses data and learns to predict the outcome.
Data Set
In the mind of a computer, a data set is any collection of data.
It can be anything from an array to a complete database. Example of an array:
[99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
Data Types
The three main categories of data types:
Numerical
Categorical
Ordinal
Numerical data are split into two main categories:
Discrete Data - numbers that are limited to integers. e.g., The number of cars passing by.
Continuous Data - numbers that are of infinite value. e.g., The price of an item, or the size of an item
Categorical data are values that cannot be measured up against each other. e.g., a color value, or any yes/no values.
Ordinal data are like categorical data, but can be measured up against each other. e.g., school grades where A is better than B and so on.
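A minimal sketch of how the three categories might look as Python values (all values below are hypothetical, just to illustrate the distinction):

```python
# Hypothetical example values, one per category:
cars_passing = 7          # discrete numerical: limited to integers
item_price = 19.95        # continuous numerical: can take any value
color = "red"             # categorical: colors cannot be ranked
grades = ["B", "A", "C"]  # ordinal: grades CAN be ranked (A is better than B)

# ordinal values can be ordered, categorical values cannot
grades_ranked = sorted(grades)
print(grades_ranked)  # ['A', 'B', 'C']
```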
By knowing the data type of your data source, you will be able to know what technique to use when analyzing them.
You will learn more about statistics and analyzing data in the next chapters.
Mean Median Mode
Mean - The average value
Median - The mid-point value
Mode - The most common value
e.g., the speed of 13 vehicles:
Speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
Calculate Mean = (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
e.g., Use the NumPy mean() method to find the average speed:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
output: 89.76923076923077
Median is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
Use the NumPy median() method to find the middle value:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
output: 87.0
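If there is an even number of values, there is no single middle value; the median is then the mean of the two middle values. A sketch using the same speeds with one value removed:

```python
import numpy

# 12 values: sorted, the two middle values are 86 and 87,
# so the median is their mean
speed = [99, 86, 87, 88, 86, 103, 87, 94, 78, 77, 85, 86]
x = numpy.median(speed)
print(x)  # 86.5
```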
The mode is the value that appears the most times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 → mode = 86 (86 occurs three times)
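NumPy has no mode method, so a sketch using Python's built-in statistics module instead:

```python
from statistics import mode

# 86 occurs three times, more often than any other value
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
x = mode(speed)
print(x)  # 86
```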
Standard Deviation
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.
e.g., the speed of 7 cars: speed = [86, 87, 88, 86, 87, 85, 86]. The standard deviation is about 0.9, meaning most of the values are within 0.9 of the mean value (86.4).
Use the NumPy std() method to find the standard deviation:
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
output: 0.9
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you multiply the standard deviation by itself, you get the variance.
Use the NumPy var() method to find the variance:
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
output: 1432.24
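As a quick cross-check of the relationship above, squaring the standard deviation reproduces the variance:

```python
import numpy

speed = [32, 111, 138, 28, 59, 77, 97]
std = numpy.std(speed)
var = numpy.var(speed)

# the variance is the standard deviation multiplied by itself
print(std ** 2)
print(var)
```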
Percentiles
Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
Use the NumPy percentile() method to find the percentiles:
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 75)
print(x)
output: 43.0
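The 50th percentile is, by definition, the median, which makes a handy sanity check for the two methods:

```python
import numpy

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

# the 50th percentile and the median are the same value
print(numpy.percentile(ages, 50))  # 31.0
print(numpy.median(ages))          # 31.0
```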
Data Distribution
We use the Python module NumPy, which comes with a number of methods to create random data sets, of any size.
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
Histogram
To visualize the data set we can draw a histogram with the data we collected.
We will use the Python module Matplotlib to draw a histogram:
Draw a histogram:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()
Big Data Distributions
Create an array with 100000 random numbers, and display them using a histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()
Normal Data Distribution
In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for it.
A typical normal data distribution, with mean 5.0 and standard deviation 1.0:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
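With 100000 samples, the sample statistics should land very close to the mean and standard deviation passed to normal(); a quick check sketch:

```python
import numpy

x = numpy.random.normal(5.0, 1.0, 100000)

# the sample mean and standard deviation should be close to
# the 5.0 and 1.0 we asked for
print(numpy.mean(x))  # close to 5.0
print(numpy.std(x))   # close to 1.0
```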
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length: one for the values of the x-axis, and one for the values of the y-axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Here the x array represents the age of each car, and the y array represents the speed of each car.
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Random Data Distributions
A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()
Regression
The term regression is used when you try to find the relationship between variables.
Linear Regression
Linear regression uses the relationship between the data points to draw a straight line through them. This line can be used to predict future values.
Start by drawing a scatter plot:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
e.g., Import scipy and draw the line of Linear Regression:
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
R for Relationship
It is important to know how well the relationship between the values of the x-axis and the values of the y-axis is; if there is no relationship, linear regression cannot be used to predict anything.
The relationship is measured with a value called r, the coefficient of correlation. The r value ranges from -1 to 1, where 0 means no relationship and 1 (or -1) means 100% related.
e.g., How well does my data fit in a linear regression?
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)
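The r returned by linregress() is the Pearson correlation coefficient, so numpy.corrcoef() gives the same number; a useful cross-check sketch:

```python
import numpy
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

# numpy.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is r
r_numpy = numpy.corrcoef(x, y)[0, 1]
print(r, r_numpy)
```

For this data set r is negative: older cars tend to be slower.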
Predict Future Values
e.g., Predict the speed of a 10 years old car:
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
speed = myfunc(10)
print(speed)
Polynomial Regression
If your data points clearly will not fit a linear regression (a straight line through all data points), it might be ideal for polynomial regression.
e.g., Import numpy and matplotlib then draw the line of Polynomial Regression:
import numpy
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
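To measure how well the polynomial fits, you can compute an R-squared score. A sketch using only NumPy (sklearn.metrics.r2_score would give the same number):

```python
import numpy

x = numpy.array([1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22])
y = numpy.array([100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100])

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

# R-squared = 1 - (residual sum of squares / total sum of squares);
# values close to 1 mean a good fit
ss_res = numpy.sum((y - mymodel(x)) ** 2)
ss_tot = numpy.sum((y - numpy.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)
```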
Multiple Regression
Multiple regression is like linear regression, but with more than one independent value, meaning that we try to predict a value based on two or more variables.
Start by importing the Pandas module:
import pandas
The Pandas module allows us to read csv files and return a DataFrame object.
df = pandas.read_csv("cars.csv")
Then make a list of the independent values and call this variable X. Put the dependent values in a variable called y.
X = df[['Weight', 'Volume']]
y = df['CO2']
We will use some methods from the sklearn module
from sklearn import linear_model
From the sklearn module we will use the LinearRegression() method to create a linear regression object.
This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship:
regr = linear_model.LinearRegression()
regr.fit(X, y)
Now we have a regression object that is ready to predict CO2 values based on a car's weight and volume:
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300ccm:
predictedCO2 = regr.predict([[2300, 1300]])
Complete example:
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300ccm:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)
Result: 107.208
Coefficient
The coefficient is a factor that describes the relationship with an unknown variable.
Example: if x is a variable, then 2x is x two times. x is the unknown variable, and the number 2 is the coefficient.
In this case, we can ask for the coefficient value of weight against CO2, and for volume against CO2. The answer(s) we get tells us what would happen if we increase, or decrease, one of the independent values.
Print the coefficient values of the regression object:
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
Result: [0.00755095 0.00780526]
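The coefficients can be read directly as the predicted change in CO2 per unit increase of each independent value. A sketch using the coefficient values printed above:

```python
# coefficient values from the regression result above
weight_coef = 0.00755095   # predicted change in CO2 per unit increase in Weight
volume_coef = 0.00780526   # predicted change in CO2 per unit increase in Volume

# if the weight increases by 1000 (volume unchanged), the model
# predicts CO2 to increase by about 7.55
increase = weight_coef * 1000
print(round(increase, 2))  # 7.55
```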
Scale
When your data has different values, and even different measurement units, it can be difficult to compare them. What is kilograms compared to meters? Or altitude compared to time?
The answer to this problem is scaling. We can scale data into new values that are easier to compare.
There are different methods for scaling data.
Standardization is one of them:
z = (x - u) / s
Where z is the new value, x is the original value, u is the mean and s is the standard deviation.
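The formula can be applied directly with NumPy. A sketch with made-up weight values (not taken from cars2.csv), producing the same z-scores StandardScaler would:

```python
import numpy

# hypothetical weight values for illustration
weights = numpy.array([790.0, 1100.0, 1300.0, 1500.0, 2300.0])

u = numpy.mean(weights)   # mean
s = numpy.std(weights)    # standard deviation
z = (weights - u) / s     # standardized values
print(z)

# standardized data always has mean 0 and standard deviation 1
print(numpy.mean(z), numpy.std(z))
```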
The Python sklearn module has a method called StandardScaler() which returns a Scaler object with methods for transforming data sets.
Scale all values in the Weight and Volume columns:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("cars2.csv")
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
print(scaledX)