Journey of Machine Learning

Journey of Machine Learning
There are so many algorithms available and it can feel overwhelming when algorithm names are thrown
around and you are expected to just know what they are and where they fit.
In this post I want to give you two ways to think about and categorize the algorithms you may come
across in the field.
 The first is a grouping of algorithms by the learning style.

 The second is a grouping of algorithms by similarity in form or function (like grouping similar animals
together).
Both approaches are useful, but we will focus in on the grouping of algorithms by similarity and go on a
tour of a variety of different algorithm types.
After reading this post, you will have a much better understanding of the most popular machine learning
algorithms for supervised learning and how they are related.
A cool example of an ensemble of lines of best fit. Weak members are grey, the combined prediction is
red.
Plot from Wikipedia, licensed under public domain.
Algorithms Grouped by Learning Style
There are different ways an algorithm can model a problem based on its interaction with the experience
or environment or whatever we want to call the input data.
It is popular in machine learning and artificial intelligence textbooks to first consider the learning styles
that an algorithm can adopt.
There are only a few main learning styles or learning models that an algorithm can have and we’ll go
through them here with a few examples of algorithms and problem types that they suit.
This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think
about the the roles of the input data and the model preparation process and select one that is the most
appropriate for your problem in order to get the best result.
Let’s take a look at four different learning styles in machine learning algorithms:
Supervised Learning
Input data is called training data and has a known label or result
such as spam/not-spam or a stock price at a time.
A model is prepared through a training process where it is required to make predictions and is corrected
when those predictions are wrong. The training process continues until the model achieves a desired
level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include Logistic Regression and the Back Propagation Neural Network.
Unsupervised Learning
Input data is not labelled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract general
rules. It may through a mathematical process to systematically reduce redundancy, or it may be
to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and k-Means.
Semi-Supervised Learning
Input data is a mixture of labelled and unlabelled examples.
There is a desired prediction problem but the model must learn the structures to organize the data as well
as make predictions.
Example problems are classification and regression.
Example algorithms are extensions to other flexible methods that make assumptions about how to model
the unlabelled data.
Overview
When crunching data to model business decisions, you are most typically using supervised and
unsupervised learning methods.
A hot topic at the moment is semi-supervised learning methods in areas such as image classification
where there are large datasets with very few labelled examples.
Algorithms Grouped By Similarity
Algorithms are often grouped by similarity in terms of their function (how they work). For example, tree-
based methods, and neural network inspired methods.
I think this is the most useful way to group algorithms and it is the approach we will use here.
This is a useful grouping method, but it is not perfect. There are still algorithms that could just as easily fit
into multiple categories like Learning Vector Quantization that is both a neural network inspired method
and an instance-based method. There are also categories that have the same name that describes the
problem and the class of algorithm such as Regression and Clustering.
We could handle these cases by listing algorithms twice or by selecting the group that subjectively is the
“best” fit. I like this latter approach of not duplicating algorithms to keep things simple.
In this section I list many of the popular machine leaning algorithms grouped the way I think is the most
intuitive. It is not exhaustive in either the groups or the algorithms, but I think it is representative and will
be useful to you to get an idea of the lay of the land.
Please Note: There is a strong bias towards algorithms used for classification and regression, the two
most prevalent supervised machine learning problems you will encounter.
If you know of an algorithm or a group of algorithms not listed, put it in the comments and share it with us.
Let’s dive in.
Regression Algorithms
Regression is concerned with modelling the relationship between

variables that is iteratively refined using a measure of error in the predictions made by the model.
Regression methods are a workhorse of statistics and have been cooped into statistical machine learning.
This may be confusing because we can use regression to refer to the class of problem and the class of
algorithm. Really, regression is a process.
The most popular regression algorithms are:
 Ordinary Least Squares Regression (OLSR)

 Linear Regression
 Logistic Regression
 Stepwise Regression
 Multivariate Adaptive Regression Splines (MARS)
 Locally Estimated Scatterplot Smoothing (LOESS)
Instance-based Algorithms
Instance based learning model a decision problem with instances or
examples of training data that are deemed important or required to the model.
Such methods typically build up a database of example data and compare new data to the database
using a similarity measure in order to find the best match and make a prediction. For this reason,
instance-based methods are also called winner-take-all methods and memory-based learning. Focus is
put on representation of the stored instances and similarity measures used between instances.
The most popular instance-based algorithms are:
 k-Nearest Neighbour (kNN)

 Learning Vector Quantization (LVQ)
 Self-Organizing Map (SOM)
 Locally Weighted Learning (LWL)
Regularization Algorithms
An extension made to another method (typically regression methods)
that penalizes models based on their complexity, favoring simpler models that are also better at
generalizing.
I have listed regularization algorithms separately here because they are popular, powerful and generally
simple modifications made to other methods.
The most popular regularization algorithms are:
 Ridge Regression
 Least Absolute Shrinkage and Selection Operator (LASSO)
 Elastic Net
 Least-Angle Regression (LARS)
Decision Tree Algorithms
Decision tree methods construct a model of decisions made based

on actual values of attributes in the data.
Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are
trained on data for classification and regression problems. Decision trees are often fast and accurate and
a big favorite in machine learning.
The most popular decision tree algorithms are:
 Classification and Regression Tree (CART)

 Iterative Dichotomiser 3 (ID3)
 C4.5 and C5.0 (different versions of a powerful approach)
 Chi-squared Automatic Interaction Detection (CHAID)
 Decision Stump
 M5
 Conditional Decision Trees
Bayesian Algorithms
Bayesian methods are those that are explicitly apply Bayes’

Theorem for problems such as classification and regression.
The most popular Bayesian algorithms are:
 Naive Bayes
 Gaussian Naive Bayes
 Multinomial Naive Bayes
 Averaged One-Dependence Estimators (AODE)
 Bayesian Belief Network (BBN)
 Bayesian Network (BN)
Clustering Algorithms
Clustering, like regression describes the class of problem and the

class of methods.
Clustering methods are typically organized by the modelling approaches such as centroid-based and
hierarchal. All methods are concerned with using the inherent structures in the data to best organize the
data into groups of maximum commonality.
The most popular clustering algorithms are:
 k-Means
 k-Medians
 Expectation Maximisation (EM)
 Hierarchical Clustering
Association Rule Learning Algorithms
Association rule learning are methods that extract rules that best
explain observed relationships between variables in data.
These rules can discover important and commercially useful associations in large multidimensional
datasets that can be exploited by an organisation.
The most popular association rule learning algorithms are:
 Apriori algorithm
 Eclat algorithm
Artificial Neural Network Algorithms
Artificial Neural Networks are models that are inspired by the structure
and/or function of biological neural networks.
They are a class of pattern matching that are commonly used for regression and classification problems
but are really an enormous subfield comprised of hundreds of algorithms and variations for all manner of
problem types.
Note that I have separated out Deep Learning from neural networks because of the massive growth and
popularity in the field. Here we are concerned with the more classical methods.
The most popular artificial neural network algorithms are:
 Perceptron
 Back-Propagation
 Hopfield Network
 Radial Basis Function Network (RBFN)
Deep Learning Algorithms
Deep Learning methods are a modern update to Artificial Neural

Networks that exploit abundant cheap computation.
They are concerned with building much larger and more complex neural networks, and as commented
above, many methods are concerned with semi-supervised learning problems where large datasets
contain very little labelled data.
The most popular deep learning algorithms are:
 Deep Boltzmann Machine (DBM)

 Deep Belief Networks (DBN)
 Convolutional Neural Network (CNN)
 Stacked Auto-Encoders
Dimensionality Reduction Algorithms
Like clustering methods, dimensionality reduction seek and exploit

the inherent structure in the data, but in this case in an unsupervised manner or order to summarise or
describe data using less information.
This can be useful to visualize dimensional data or to simplify data which can then be used in a
supervized learning method. Many of these methods can be adapted for use in classification and
regression.
 Principal Component Analysis (PCA)

 Principal Component Regression (PCR)
 Partial Least Squares Regression (PLSR)
 Sammon Mapping
 Multidimensional Scaling (MDS)
 Projection Pursuit
 Linear Discriminant Analysis (LDA)
 Mixture Discriminant Analysis (MDA)
 Quadratic Discriminant Analysis (QDA)
 Flexible Discriminant Analysis (FDA)
Ensemble Algorithms
Ensemble methods are models composed of multiple weaker models

that are independently trained and whose predictions are combined in some way to make the overall
prediction.
Much effort is put into what types of weak learners to combine and the ways in which to combine them.
This is a very powerful class of techniques and as such is very popular.
 Boosting
 Bootstrapped Aggregation (Bagging)
 AdaBoost
 Stacked Generalization (blending)
 Gradient Boosting Machines (GBM)
 Gradient Boosted Regression Trees (GBRT)
 Random Forest
Machine Learning with R
R is the most popular platform for applied machine learning. When you want to get serious with applied
machine learning you will find your way into R.
It is very powerful because so many machine learning algorithms are provided. A problem is that the
algorithms are all provided by third parties, which makes their usage very inconsistent. This slows you
down, a lot, because you have to learn how to model data and how to make predicts with each algorithm
in each package, again and again.
In this post you will discover how you can overcome this difficulty with machine learning algorithms in R,
with pre-prepared recipes that follow a consistent structure.
The R ecosystem is enormous. Open source third party packages provide this power, allowing academics
and professionals to get the most powerful algorithms available into the hands of us practitioners.
A problem that I experienced when starting out with R was that the usage to each algorithm differs from
package to package. This inconsistency also extends to the documentation, with some providing worked
example for classification but ignoring regression and others not providing examples at all.
All this means that if you want to try a few different algorithms from different packages, you must spend
time figuring out how to fit and make predictions with each method in turn. This takes a lot of time,
especially with the spotty examples and vignettes.
I summarize these difficulties as follows:
 Inconsistent: Algorithm implementations vary in the way a model is fit to data and the way a model is
used to generate predictions. This means that you have to study each package and each algorithm
implementation just to put together a working example, let alone adapt it to your problem.
 Decentralized: Algorithm are implemented across different packages and it can be hard to locate which
packages provide an implementation of the algorithm you need, let alone which package provides the
most popular implementation. Additionally, the documentation for one package may be spread across
multiple help files, website and vignettes. This means you have to do a lot of searching just to locate an
algorithm, let alone compile a list of algorithms from which you can choose.
 Incomplete: Algorithm documentation is almost always partially complete. An example usage may or
may not be provided, if it is, it may or may not be demonstrated on a canonical problem. This means you
have no obvious way to quickly understand how to use an implementation.
 Complexity: Algorithms vary in their complexity of implementation and description. This can take it’s toll
on you as you jump from package to package. You want to focus on how to get the most from the
algorithm and its parameters, and not burn energy on parsing reams of PDFs just to get a hello world.
Linear Regression in R
You can copy and paste the recipes in this post to make a jump-start on your own problem or to learn and
practice with linear regression in R.
Ordinary Least Squares Regression
Ordinary Least Squares Regression
Ordinary Least Squares (OLS) regression is a linear model that seeks to find a set of coefficients for a
line/hyper-plane that minimise the sum of the squared errors.
# load data
data(longley)
# fit model
fit<- lm(Employed~., longley)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley)
# summarize accuracy
rmse<- mean((longley$Employed - predictions)^2)
print(rmse)
Stepwize Linear Regression
Stepwise Linear Regression is a method that makes use of linear regression to discover which subset of
attributes in the dataset result in the best performing model. It is step-wise because each iteration of the
method makes a change to the set of attributes and creates a model to evaluate the performance of the
set.
# load data
data(longley)
# fit model
base<- lm(Employed~., longley)
# summarize the fit
summary(base)
# perform step-wise feature selection
fit<- step(base)
# summarize the selected model
summary(fit)
# make predictions
print(rmse)
Principal Component Regression
Principal Component Regression (PCR) creates a linear regression model using the outputs of a Principal
Component Analysis (PCA) to estimate the coefficients of the model. PCR is useful when the data has
highly-correlated predictors.
# load the package

library(pls)
# load data
data(longley)
# fit model
fit<- pcr(Employed~., data=longley, validation="CV")
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley, ncomp=6)
print(rmse)
Partial Least Squares Regression
Partial Least Squares (PLS) Regression creates a linear model of the data in a transformed projection of
problem space. Like PCR, PLS is appropriate for data with highly-correlated predictors.
# load the package

library(pls)
# load data
data(longley)
# fit model
fit<- plsr(Employed~., data=longley, validation="CV")
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley, ncomp=6)
print(rmse)
Multivariate Adaptive Regression Splines
Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression method that models
multiple nonlinearities in data using hinge functions (functions with a kink in them).
# load the package
library(earth)
# load data
data(longley)
# fit model
fit<- earth(Employed~., longley)
# summarize the fit
summary(fit)
# summarize the importance of input variables
evimp(fit)
# make predictions
print(rmse)
Support Vector Machine
Support Vector Machines (SVM) are a class of methods, developed originally for classification, that find
support points that best separate classes. SVM for regression is called Support Vector Regression
(SVM).
# load the package

library(kernlab)
# load data
data(longley)
# fit model
fit<- ksvm(Employed~., longley)
# summarize the fit
summary(fit)
# make predictions
print(rmse)
k-Nearest Neighbor
The k-Nearest Neighbor (kNN) does not create a model, instead it creates predictions from close data on-
demand when a prediction is required. A similarity measure (such as Euclidean distance) is used to
locate close data in order to make predictions.
# load the package

library(caret)
# load data
data(longley)
# fit model
fit<- knnreg(longley[,1:6], longley[,7], k=3)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
print(rmse)
Neural Network
A Neural Network (NN) is a graph of computational units that recieve inputs and transfer the result into an
output that is passed on. The units are ordered into layers to connect the features of an input vector to the
features of an output vector. With training, such as the Back-Propagation algorithm, neural networks can
be designed and trained to model the underlying relationship in data.
# load the package

library(nnet)
# load data
data(longley)
x <- longley[,1:6]
y <- longley[,7]
# fit model
fit<- nnet(Employed~., longley, size=12, maxit=500, linout=T, decay=0.01)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, x, type="raw")
rmse<- mean((y - predictions)^2)
print(rmse)
Classification and Regression Trees
Classification and Regression Trees (CART) split attributes based on values that minimize a loss function,
such as sum of squared errors.
The following recipe demonstrates the recursive partitioning decision tree method on the longley dataset.
# load the package

library(rpart)
# load data
data(longley)
# fit model
fit<- rpart(Employed~., data=longley, control=rpart.control(minsplit=5))
# summarize the fit
summary(fit)
# make predictions
print(rmse)
Conditional Decision Trees
Condition Decision Trees are created using statistical tests to select split points on attributes rather than a
loss function.
The following recipe demonstrates the condition inference trees method on the longley dataset.
# load the package

library(party)
# load data
data(longley)
# fit model
fit<- ctree(Employed~., data=longley, controls=ctree_control(minsplit=2,minbucket=2,testtype="Univariate"))
# summarize the fit
summary(fit)
# make predictions
print(rmse)
Model Trees
Model Trees create a decision tree and use a linear model at each node to make a prediction rather than
using an average value.
The following recipe demonstrates the M5P Model Tree method on the longley dataset.
Rule System
Rule Systems can be created by extracting and simplifying the rules from a decision tree.
The following recipe demonstrates the M5Rules Rule System on the longley dataset.
# load the package

library(RWeka)
# load data
data(longley)
# fit model
fit<- M5Rules(Employed~., data=longley)
# summarize the fit
summary(fit)
# make predictions
print(rmse)
Bagging CART
Bootstrapped Aggregation (Bagging) is an ensemble method that creates multiple models of the same
type from different sub-samples of the same dataset. The predictions from each separate model are
combined together to provide a superior result. This approach has shown participially effective for high-
variance methods such as decision trees.
The following recipe demonstrates bagging applied to the recursive partitioning decision tree.
# load the package

library(ipred)
# load data
data(longley)
# fit model
fit<- bagging(Employed~., data=longley, control=rpart.control(minsplit=5))
# summarize the fit
summary(fit)
# make predictions
print(rmse)
Random Forest
Random Forest is variation on Bagging of decision trees by reducing the attributes available to making a
tree at each decision point to a random sub-sample. This further increases the variance of the trees and
more trees are required.
# load the package

library(randomForest)
# load data
data(longley)
# fit model
fit<- randomForest(Employed~., data=longley)
# summarize the fit
summary(fit)
# make predictions
print(rmse)
Gradient Boosted Machine

Boosting is an ensemble method developed for classification for reducing bias where models are added
to learn the misclassification errors in existing models. It has been generalized and adapted in the form
of Gradient Boosted Machines (GBM) for use with CART decision trees for classification and regression
# load the package

library(gbm)
# load data
data(longley)
# fit model
fit<- gbm(Employed~., data=longley, distribution="gaussian")
# summarize the fit
summary(fit)
# make predictions
print(rmse)
Cubist
Cubist decision trees are another ensemble method. They are constructed like model trees but involve a
boosting-like procedure called committees that re rule-like models.
# load the package

library(Cubist)
# load data
data(longley)
# fit model
fit<- cubist(longley[,1:6], longley[,7])
# summarize the fit
summary(fit)
# make predictions
print(rmse)
Machine Learning Using Python

In the picture, we used the following definitions for the notations:
 a(j)iai(j) : "activation" of unit ii in layer jj

 Θ(j)Θ(j) : matrix of weights controlling function mapping from layer jj to layer j+1j+1
Here are the computations represented by the NN picture above:
a(2)0=g(Θ(1)00x0+Θ(1)01x1+Θ(1)02x2)=g(ΘT0x)=g(z(2)0)a0(2)=g(Θ00(1)x0+Θ01(1)x1+Θ02(1)x2)=g(Θ0Tx)=g(z0(
2))
2))
2))
hΘ(x)=a(3)1=g(Θ(2)10a(2)0+Θ(2)11a(2)1+Θ(2)12a(2)2)hΘ(x)=a1(3)=g(Θ10(2)a0(2)+Θ11(2)a1(2)+Θ12(2)a2(2))
In the equations, the gg is sigmoid function that refers to the special case of the logisticfunction and
defined by the formula:
g(z)=11+e−zg(z)=11+e−z
Sigmoid functions
One of the reasons to use the sigmoid function (also called the logistic function) is it was the first one to
be used. Its derivative has a very good property. In a lot of weight update algorithms, we need to know a
derivative (sometimes even higher order derivatives). These can all be expressed as products
of ff and 1−f1−f. In fact, it's the only class of functions that satisfies f′(t)=f(t)(1−f(t))f′(t)=f(t)(1−f(t)).
However, usually the weights are much more important than the particular function chosen. These
sigmoid functions are very similar, and the output differences are small. Here's a plot from Wikipedia-
Sigmoid function. Note that all functions are normalized in such a way that their slope at the origin is 1.
Forward Propagation
If we use matrix notation, the equations of the previous section become:
x=⎡⎣⎢x0x1x2⎤⎦⎥z(2)=⎡⎣⎢⎢⎢z(2)0z(2)1z(2)2⎤⎦⎥⎥⎥x=[x0x1x2]z(2)=[z0(2)z1(2)z2(2)]
z(2)=Θ(1)x=Θ(1)a(1)z(2)=Θ(1)x=Θ(1)a(1)
a(2)=g(z(2))a(2)=g(z(2))
a(2)0=1.0a0(2)=1.0
z(3)=Θ(2)a(2)z(3)=Θ(2)a(2)
hΘ(x)=a(3)=g(z(3))hΘ(x)=a(3)=g(z(3))
Back Propagation (Gradient computation)
The backpropagation learning algorithm can be divided into two phases: propagation and weight update.
- from wiki - Backpropagatio.
1. Phase 1: Propagation
Each propagation involves the following steps:
o Forward propagation of a training pattern's input through the neural network in order to
generate the propagation's output activations.
o Backward propagation of the propagation's output activations through the neural network
using the training pattern target in order to generate the deltas of all output and hidden
neurons.
2. Phase 2: Weight update
For each weight-synapse follow the following steps:
o Multiply its output delta and input activation to get the gradient of the weight.
o Subtract a ratio (percentage) of the gradient from the weight.
This ratio (percentage) influences the speed and quality of learning; it is called the learning rate.
The greater the ratio, the faster the neuron trains; the lower the ratio, the more accurate the
training is. The sign of the gradient of a weight indicates where the error is increasing, this is why
the weight must be updated in the opposite direction.
Repeat phase 1 and 2 until the performance of the network is satisfactory.
If we denote an error of node jj in layer ll as δ(l)jδj(l), for our output unit(L=3) becomes activation -actual
value:
δ(3)j=a(3)j−yj=hΘ(x)−yjδj(3)=aj(3)−yj=hΘ(x)−yj
If we use a vector form, it is:
δ(3)=a(3)−yδ(3)=a(3)−y
δ(2)=(Θ(2))Tδ(3)⋅g′(z(2))δ(2)=(Θ(2))Tδ(3)⋅g′(z(2))
where
g′(z(2))=a(2)⋅(1−a(2))g′(z(2))=a(2)⋅(1−a(2))
Note that we do not have δ(1)δ(1) term because that's the input layer and the values are the ones that we
observed and they are being used as a training set. So, there is no errors associate with the input.
Also, the derivative of cost function can be written like this:
∂∂Θ(l)ijJ(Θ)=a(l)jδ(l+1)i∂∂Θij(l)J(Θ)=aj(l)δi(l+1)
We use this value to update weights and we can multiply learning rate before we adjust the weight.
self.weights[i] += learning_rate * layer.T.dot(delta)
where the layer in the code is actually a(l)a(l).
Code
Source code is here.
importnumpy as np
def sigmoid(x):
return 1.0/(1.0 + np.exp(-x))
defsigmoid_prime(x):
return sigmoid(x)*(1.0-sigmoid(x))
deftanh(x):
returnnp.tanh(x)
deftanh_prime(x):
return 1.0 - x**2
classNeuralNetwork:
def __init__(self, layers, activation='tanh'):
if activation == 'sigmoid':
self.activation = sigmoid
self.activation_prime = sigmoid_prime
elif activation == 'tanh':
self.activation = tanh
self.activation_prime = tanh_prime
# Set weights
self.weights = []
# layers = [2,2,1]
# range of weight values (-1,1)
# input and hidden layers - random((2+1, 2+1)) : 3 x 3
fori in range(1, len(layers) - 1):
r = 2*np.random.random((layers[i-1] + 1, layers[i] + 1)) -1
self.weights.append(r)
# output layer - random((2+1, 1)) : 3 x 1
r = 2*np.random.random( (layers[i] + 1, layers[i+1])) - 1
self.weights.append(r)
def fit(self, X, y, learning_rate=0.2, epochs=100000):

# Add column of ones to X
# This is to add the bias unit to the input layer
ones = np.atleast_2d(np.ones(X.shape[0]))
X = np.concatenate((ones.T, X), axis=1)
for k in range(epochs):
i = np.random.randint(X.shape[0])
a = [X[i]]
for l in range(len(self.weights)):
dot_value = np.dot(a[l], self.weights[l])
activation = self.activation(dot_value)
a.append(activation)
# output layer
error = y[i] - a[-1]
deltas = [error * self.activation_prime(a[-1])]
# we need to begin at the second to last layer

# (a layer before the output layer)
for l in range(len(a) - 2, 0, -1):
deltas.append(deltas[-1].dot(self.weights[l].T)*self.activation_prime(a[l]))
# reverse
# [level3(output)->level2(hidden)] => [level2(hidden)-
>level3(output)]
deltas.reverse()
# backpropagation
# 1. Multiply its output delta and input activation
# to get the gradient of the weight.
# 2. Subtract a ratio (percentage) of the gradient from the
weight.
fori in range(len(self.weights)):
layer = np.atleast_2d(a[i])
delta = np.atleast_2d(deltas[i])
self.weights[i] += learning_rate * layer.T.dot(delta)
if k % 10000 == 0: print 'epochs:', k
def predict(self, x):

a = np.concatenate((np.ones(1).T, np.array(x)), axis=1)
for l in range(0, len(self.weights)):
a = self.activation(np.dot(a, self.weights[l]))
return a
if __name__ == '__main__':
nn = NeuralNetwork([2,2,1])
X = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
y = np.array([0, 1, 1, 0])
nn.fit(X, y)
for e in X:
print(e,nn.predict(e))
Output:
epochs: 0
epochs: 10000
epochs: 20000
epochs: 30000
epochs: 40000
epochs: 50000
epochs: 60000
epochs: 70000
epochs: 80000
epochs: 90000
(array([0, 0]), array([ 9.14891326e-05]))
(array([0, 1]), array([ 0.99557796]))
(array([1, 0]), array([ 0.99707463]))
(array([1, 1]), array([ 0.00090973]))
k-Means Clustering
The k-Means Clustering finds centers of clusters and groups input samples around the clusters.
k-Means Clustering is a partitioning method which partitions data into k mutually exclusive clusters, and
returns the index of the cluster to which it has assigned each observation. Unlike hierarchical clustering,
k-means clustering operates on actual observations (rather than the larger set of dissimilarity measures),
and creates a single level of clusters. The distinctions mean that k-means clustering is often more
suitable than hierarchical clustering for large amounts of data.
The following description for the steps is from wiki - K-means_clustering.
Step 1
k initial "means" (in this case k=3) are randomly generated within the data domain.
Step 2
k clusters are created by associating every observation with the nearest mean. The partitions here
represent the Voronoi diagram generated by the means.
Step 3
The centroid of each of the k clusters becomes the new mean.
Step 4
Steps 2 and 3 are repeated until convergence has been reached.
K-Means Clustering in OpenCV
In OpenCV, we use cv.KMeans2(() and it's defined like this:
cv2.kmeans(data, K, criteria, attempts, flags[, bestLabels[, centers]])
The parameters are:
 data : Data for clustering.

 nclusters(K) K : Number of clusters to split the set by.
 criteria : The algorithm termination criteria, that is, the maximum number of iterations and/or the
desired accuracy. The accuracy is specified as criteria.epsilon. As soon as each of the cluster
centers moves by less than criteria.epsilon on some iteration, the algorithm stops.
 attempts : Flag to specify the number of times the algorithm is executed using different initial
labellings. The algorithm returns the labels that yield the best compactness (see the last function
parameter).
 flags :
o KMEANS_RANDOM_CENTERS - selects random initial centers in each attempt.
o KMEANS_PP_CENTERS - uses kmeans++ center initialization by Arthur and
Vassilvitskii.
o KMEANS_USE_INITIAL_LABELS - during the first (and possibly the only) attempt, use
the user-supplied labels instead of computing them from the initial centers. For the
second and further attempts, use the random or semi-random centers. Use one of
KMEANS_*_CENTERS flag to specify the exact method.
 centers : Output matrix of the cluster centers, one row per each cluster center.
1-D data (one feature)

In this section, we'll play with a data set which has only one feature. The data is one-dimensional, and
clustering can be decided by one parameter.
We generated two groups of random numbers in two separated ranges, each group with 25 numbers as
shown in the picture below:
The code:
importnumpy as np
import cv2
frommatplotlib import pyplot as plt
x = np.random.randint(25,100,25) # 25 randoms in (25,100)

y = np.random.randint(175,250,25) # 25 randoms in (175,250)
z = np.hstack((x,y)) # z.shape : (50,)
z = z.reshape((50,1)) # reshape to a column vector
z = np.float32(z)
plt.hist(z,256,[0,256])
plt.show()
Now, we're going to apply cv2.kmeans() function to the data with the following parameters:
# Define criteria = ( type, max_iter = 10 , epsilon = 1.0 )

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
# Set flags
flags = cv2.KMEANS_RANDOM_CENTERS
# Apply KMeans
compactness,labels,centers = cv2.kmeans(z,2,None,criteria,10,flags)
In the code, the criteria means whenever 10 iterations of algorithm is ran, or an accuracy of epsilon = 1.0
is reached, stop the algorithm and return the answer.
Here are the meanings of the return values from cv2.kmeans() function:
 compactness : It is the sum of squared distance from each point to their corresponding centers.
 labels : This is the label array where each element marked '0','1',.....
 centers : This is array of centers of clusters.
For the current case, it gives us centers as:
centers: [[ 72.08000183] [ 217.03999329]]
Here is the final code for 1-D data:
importnumpy as np
import cv2
frommatplotlib import pyplot as plt
x = np.random.randint(25,100,25) # 25 randoms in (25,100)

y = np.random.randint(175,250,25) # 25 randoms in (175,250)
z = np.hstack((x,y)) # z.shape : (50,)
z = z.reshape((50,1)) # reshape to a column vector
z = np.float32(z)
# Define criteria = ( type, max_iter = 10 , epsilon = 1.0 )

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
# Set flags
flags = cv2.KMEANS_RANDOM_CENTERS
# Apply KMeans
compactness,labels,centers = cv2.kmeans(z,2,None,criteria,10,flags)
print('centers: %s' %centers)
A = z[labels==0]
B = z[labels==1]
# initial plot
plt.subplot(211)
plt.hist(z,256,[0,256])
# Now plot 'A' in red, 'B' in blue, 'centers' in yellow

plt.subplot(212)
plt.hist(A,256,[0,256],color = 'r')
plt.hist(B,256,[0,256],color = 'b')
plt.hist(centers,32,[0,256],color = 'y')
plt.show()
We can see the data has been clustered: A's centroid is 72 and B's centroid is 217 which was one of the
outputs from the cv2.kmeans() function.
Signal Processing with NumPy arrays in iPython
iPython
IPython currently provides the following features (wiki-iPython):
 Powerful interactive shells (terminal and Qt-based).

 A browser-based notebook with support for code, text, mathematical expressions, inline plots and
other rich media.
 Support for interactive data visualization and use of GUI toolkits.
 Flexible, embeddable interpreters to load into ones own projects.
 Easy to use, high performance tools for parallel computing.
 Basic Signals - boxcar
 We'll make a simple boxcar with np.zeros() and np.ones().
 We start with a simple command to get python environment using ipython --pylab:
 $ ipython --pylab
 Python 2.7.5+ (default, Feb 27 2014, 19:37:08)
 Type "copyright", "credits" or "license" for more information.

 IPython 0.13.2 -- An enhanced Interactive Python.
 ? -> Introduction and overview of IPython's features.
 %quickref -> Quick reference.
 help -> Python's own help system.
 object? -> Details about 'object', use 'object??' for extra details.

 Welcome to pylab, a matplotlib-based Python environment [backend:
TkAgg].
 For more information, type 'help(pylab)'.

 In [1]:
 Let's play with arrays first:
 In [1]: zeros
 Out[1]:

 In [2]: zeros(100)
 Out[2]:
 array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0.])
 In [3]: ones(10)
 Out[3]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
 Now, we want to concatenate these and make a boxcar:
 Also, we we know more about the command, we can add '?' after the command. Then, it will
show details about the command as well as some usage samples:
 In [4]: concatenate?
 The following information will be displayed in another window (I mean if we can close the help
window (by typing 'q'), our command shell will reappear:
 Type: builtin_function_or_method
 String Form:<built-in function concatenate>
 Docstring:
 concatenate((a1, a2, ...), axis=0)

 Join a sequence of arrays together.

 Parameters
 ----------
 a1, a2, ... : sequence of array_like
 The arrays must have the same shape, except in the dimension
 corresponding to `axis` (the first, by default).
 axis : int, optional
 The axis along which the arrays will be joined. Default is 0.
 Now, concatenation:
 In [5]: concatenate ( (zeros(100), ones(30), zeros(100)) )
 Out[5]:
 array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1.,
 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
 0., 0., 0., 0., 0., 0., 0., 0., 0.])
 Actually, we want to assign the array to a variable:
 In [6]: data = concatenate ( (zeros(100), ones(30), zeros(100)) )

 In [7]: plot(data)
 Out[7]: [<matplotlib.lines.Line2D at 0x3024450>]
 The we get the picture below:

 Note the values in the array is for y-values, for x-axis, the sample numbers are used: we have
100+30+100=230 samples.

 Hamming filter
 In [8]: hamming(40)
 Out[8]:
 array([ 0.08 , 0.08595688, 0.10367324, 0.13269023, 0.17225633,
 0.2213468 , 0.27869022, 0.34280142, 0.41201997, 0.48455313,
 0.55852233, 0.63201182, 0.70311825, 0.77 , 0.83092487,
 0.88431494, 0.92878744, 0.96319054, 0.98663324, 0.99850836,
 0.99850836, 0.98663324, 0.96319054, 0.92878744, 0.88431494,
 0.83092487, 0.77 , 0.70311825, 0.63201182, 0.55852233,
 0.48455313, 0.41201997, 0.34280142, 0.27869022, 0.2213468 ,
 0.17225633, 0.13269023, 0.10367324, 0.08595688, 0.08
])

 In [9]: plot(hamming(40))
 Out[9]: []

 Then, we get the following picture:


 We overwrote a new to the old one because we did not close the plot window before we made a
new draw.
 So, let's close this window, and plot the Hamming again:
 In [10]: plot(hamming(40))
 Out[10]: []

 Now, we have a plot with only the Hamming filter:


 However, if we want to apply the filter to the other signal, we need to normalize the filter.

 Normalizing the Hamming filter

 We'll normalize the filter by diving it by the sum of the filter values:
 In [11]: sum(hamming(40))
 Out[11]: 21.139999999999997

 In [12]: norm = hamming(40) / sum(hamming(40))

 In [13]: plot(norm)
 Out[13]: []

 The we get normalized Hamming filter:


 Note that the peak value is much lower than the previous Hamming filter.

 Convolve
 Now we do convolve the filter with the boxcar data:
 In [14]: convolve(data, norm)
 Out[14]:
 array([ 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0.0037843 , 0.00785037, 0.0127545 , 0.01903124, 0.0271796 ,
 0.03765012, 0.05083319, 0.06704896, 0.08653903, 0.10946018,
 0.13588035, 0.16577684, 0.19903693, 0.23546077, 0.27476658,
 0.31659794, 0.36053301, 0.40609548, 0.45276687, 0.5 ,
 0.54723313, 0.59390452, 0.63946699, 0.68340206, 0.72523342,
 0.76453923, 0.80096307, 0.83422316, 0.86411965, 0.89053982,
 0.90967668, 0.92510066, 0.93641231, 0.94331865, 0.94564081,
 0.94331865, 0.93641231, 0.92510066, 0.90967668, 0.89053982,
 0.86411965, 0.83422316, 0.80096307, 0.76453923, 0.72523342,
 0.68340206, 0.63946699, 0.59390452, 0.54723313, 0.5 ,
 0.45276687, 0.40609548, 0.36053301, 0.31659794, 0.27476658,
 0.23546077, 0.19903693, 0.16577684, 0.13588035, 0.10946018,
 0.08653903, 0.06704896, 0.05083319, 0.03765012, 0.0271796 ,
 0.01903124, 0.0127545 , 0.00785037, 0.0037843 , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. , 0. ,
 0. , 0. , 0. , 0. ])

 In [15]:

 If we plot, it looks like this:


 Random Signal and FFT

 We can also generate random signals.
 In [16]: randn(20)
 Out[16]:
 array([ 1.04943265, -1.10655086, 2.65115344, 0.37216869, -1.81576411,
 -0.72413097, -0.74942176, -0.38411864, 0.04484577, 0.31172929,
 -0.78896659, -0.47319888, -0.61743916, -0.77453274, -2.34677579,
 -0.61554551, 0.93594151, -1.06451913, -1.13374368,
1.16034994])
 If we do FFT with 100 random numbers:
 In [17]: random_data = randn(100)

 In [18]: plot(fft.fft(random_data))
 /usr/lib/python2.7/dist-packages/numpy/core/numeric.py:320:
ComplexWarning: Casting complex values to real discards the imaginary
part
 return array(a, dtype, copy=False, order=order)
 Out[18]: []

 In [19]: plot(random_data)
 Out[19]: []

 In [20]:

 blue: FFT, green = random data

 Video Capture
 The following code is to capture video which could be either live stream or file stream. To capture
a video, we need to create a VideoCapture object. Its argument can be either the device index or
the name of a video file. Device index is just the number to specify which camera. Normally one
camera will be connected (index = 0) or the 2nd camera (index = 1).
 # capture.py
 import numpy as np
 import cv2

 # Capture video from camera
 # cap = cv2.VideoCapture(0)

 # Capture video from file
 cap = cv2.VideoCapture('BlueUmbrella.webm')

 while(cap.isOpened()):
 # Capture frame-by-frame
 ret, frame = cap.read()

 # Our operations on the frame come here
 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

 # Display the resulting frame
 cv2.imshow('frame',gray)
 if cv2.waitKey(1) & 0xFF == ord('q'):
 break

 # When everything done, release the capture
 cap.release()

 cv2.destroyAllWindows()
Tracking Blue Object
In this section, we want to track blue umbrella from the Pixar movie clip. How?
After converting BGR image to HSV, we extract a blue colored object. The reason for convderting to HSV
colorspace is because it is more easier to represent a color than RGB colorspace. Here are the steps to
take:
1. Take each frame of the video.

2. Convert from BGR to HSV color-space.
3. We threshold the HSV image for a range of blue color.
4. Now extract the blue object alone, we can do whatever on that image we want.
Code - Tracking Blue Object

# cspace.py
import cv2
importnumpy as np
cap = cv2.VideoCapture('BlueUmbrella.webm')
while(1):
# Take each frame

_, frame = cap.read()
# Convert BGR to HSV

hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
# define range of blue color in HSV

lower_blue = np.array([110,50,50])
upper_blue = np.array([130,255,255])
# Threshold the HSV image to get only blue colors

mask = cv2.inRange(hsv, lower_blue, upper_blue)
# Bitwise-AND mask and original image

res = cv2.bitwise_and(frame,frame, mask= mask)
cv2.imshow('frame',frame)
cv2.imshow('mask',mask)
cv2.imshow('res',res)
k = cv2.waitKey(5) & 0xFF

if k == 27:
break
cv2.destroyAllWindows()
Here is the output from the code:
1. Original image
2. Threshold the HSV image to get only blue colors - mask
3. Bitwise-AND mask and original image - res
Original files in other formats:
 cspace_blue.mp4
 cspace_blue.webm
 cspace_blue.ogv

Journey of Machine Learning

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Journey of Machine Learning

Uploaded by

Copyright:

Available Formats

Journey of Machine Learning

 The first is a grouping of algorithms by the learning style.

Algorithms Grouped by Learning Style

Example algorithms include: the Apriori algorithm and k-Means.

Example problems are classification and regression.

Algorithms Grouped By Similarity

Regression is concerned with modelling the relationship between

The most popular regression algorithms are:

 Ordinary Least Squares Regression (OLSR)

The most popular instance-based algorithms are:

 k-Nearest Neighbour (kNN)

The most popular regularization algorithms are:

Decision tree methods construct a model of decisions made based

The most popular decision tree algorithms are:

 Classification and Regression Tree (CART)

Bayesian methods are those that are explicitly apply Bayes’

The most popular Bayesian algorithms are:

Clustering, like regression describes the class of problem and the

The most popular clustering algorithms are:

Association Rule Learning Algorithms

The most popular association rule learning algorithms are:

Artificial Neural Network Algorithms

The most popular artificial neural network algorithms are:

Deep Learning methods are a modern update to Artificial Neural

The most popular deep learning algorithms are:

 Deep Boltzmann Machine (DBM)

Like clustering methods, dimensionality reduction seek and exploit

 Principal Component Analysis (PCA)

Ensemble methods are models composed of multiple weaker models

I summarize these difficulties as follows:

Ordinary Least Squares Regression

Principal Component Regression

# load the package

Partial Least Squares Regression

# load the package

Multivariate Adaptive Regression Splines

Support Vector Machine

# load the package

# load the package

# load the package

# load the package

Conditional Decision Trees

# load the package

# load the package

# load the package

# load the package

Gradient Boosted Machine

# load the package

# load the package

Machine Learning Using Python

 a(j)iai(j) : "activation" of unit ii in layer jj

Here are the computations represented by the NN picture above:

If we use matrix notation, the equations of the previous section become:

Back Propagation (Gradient computation)

Repeat phase 1 and 2 until the performance of the network is satisfactory.

If we use a vector form, it is:

Also, the derivative of cost function can be written like this:

self.weights[i] += learning_rate * layer.T.dot(delta)

where the layer in the code is actually a(l)a(l).

Source code is here.

def fit(self, X, y, learning_rate=0.2, epochs=100000):

# we need to begin at the second to last layer

if k % 10000 == 0: print 'epochs:', k

def predict(self, x):