
Machine Learning for

Developers

Dr Prakash Goteti
Technology Learning Services
Agenda

 Big Picture: Introduction to Data Science

 Where does machine learning fit in?

 What is machine learning?

 Machine learning case studies

 Machine learning – key terminology

 Predictive Analytics and Recommendation Systems

 (Un)Supervised learning algorithms

Copyright © 2017 Tech Mahindra. All rights reserved. 2


 Introduction to Data Science



Big Picture – Data Science
Data Science process:

Establish research goal
– Define the research goal
– Prepare a project charter

Gather the data
– Internal data
– External data

Prepare the data
– Data cleansing
– Data transformation
– Data aggregation

Explore the data
– Visualization techniques: graphical techniques and non-graphical techniques

Build a model
– Model selection
– Model execution
– Model evaluation

Present the findings
– Presentation
– Automation and inferences


Big Picture – Data Science
Python libraries for each stage of the Data Science process:

– Data cleansing: NumPy and Pandas
– Data visualization and reporting: matplotlib package
– Machine learning algorithms: scikit-learn toolkit
– Natural language processing: NLTK framework
– Social network analysis: NetworkX library


 Introduction to Machine Learning



Machine Learning

Machine learning is an amalgamation of computer science, engineering, and statistics.

It is a tool that can be applied to many problems that require interpreting data and acting on it for the benefit of the business.

Machine learning uses statistics extensively.



Machine learning case studies (1-2)

GE already makes hundreds of millions of dollars by crunching the


data it collects from deep-sea oil wells or jet engines to optimize
performance, anticipate breakdowns, and streamline maintenance.

Outside North America:


In Europe, more than a dozen banks have replaced older
statistical-modeling approaches with machine-learning techniques
and, in some cases, experienced 10 percent increases in sales of
new products, 20 percent savings in capital expenditures, 20
percent increases in cash collections, and 20 percent declines in
churn.
This came through new recommendation engines for clients in retailing and in small and medium-sized companies, enabling more accurate forecasts.



Machine learning case studies (2-2)
 A Canadian bank uses predictive analytics to increase campaign response rates
by 600%, cut customer acquisition costs in half, and boost campaign ROI by
100%.

 A research group at a leading hospital combined predictive and text analytics to


improve its ability to classify and treat pediatric brain tumors.

 An airline increased revenue and customer satisfaction by better estimating the


number of passengers who won’t show up for a flight. This reduces the number of
overbooked flights that require re-accommodating passengers as well as the
number of empty seats.

 These use cases reflect an important fact: predictive analytics (PA) can significantly improve return on investment for organizations.

 PA can help companies achieve operational excellence through cost reduction and process improvement, better understand customer behavior, identify unexpected opportunities, and anticipate problems before they happen, so that risk mitigation and avoidance steps can be taken effectively.



Key Terminology
Features: individual measurements that, when combined with other features, make up a training example
• Identify the key properties describing the entities of interest.
• If the entities are represented as a table, each column is a feature (attribute).
• Each row in the table is an instance.
• Features are usually the columns of a training or test set.

Training set:
• The set of training examples (rows) used to train the algorithm.
• The target variable (the class a training example belongs to) is compared to the predicted value to understand how accurate the algorithm is.

Training example:
• Each training example has the features of one instance and a target variable.



Key Terminology
Data Types
• Numeric data (quantifiable things – discrete, continuous)
• Categorical (based on categories – enumerate the categories)
• Ordinal data (a mixture of the above: star ratings on products, movies, etc.)

Knowledge Representation:
• It is in the form of rules – like a probability distribution
• These rules are readable by the machine.

Classification: to predict what class an instance of data should fall into.

Regression: a best-fit line drawn through some data points to generalize them
• Regression is the prediction of a numeric value, for example predicting the price of an item.

Supervised learning:
• There is a target value given for the data

Un-supervised learning:
• There is no target value given for the data
Steps in Machine learning
Data collection      • RSS feeds, likes/dislikes, extracting from websites

Data cleansing       • Refining the data/columns

Analyze input data   • Recognize any patterns

Train the algorithm  • Feed the ML algorithm with clean data

Test the algorithm   • Infer the results



 Mathematical and Statistical Foundations



Binning
 Convert numeric data into categorical data (bins)
 Use pre-defined ranges as bins, e.g. age ranges:

    No. of people    Age range
    20               20-30
    33               31-40
    45               41-50
    41               51-60
    37               >60

 A classification algorithm can then use the age range as the class variable
 Indicator variables – convert categorical data (e.g. time zone) into Boolean data
 Centering and scaling
– Standardise the range of values for better comparison
– Values are "centered" by subtracting the mean from them
– Centered values are scaled by dividing them by the standard deviation (SD)
– ML algorithms give better results with standardized values

Mean:
– The sum of the values divided by the sample size
Variance:
– Describes the spread around the mean: the average of the squared differences from the mean
SD example:
– Sample: (2, 5, 6, 5, 9); mean = 27/5 = 5.4
– Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
– Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
– Average of the squared differences (variance) = 25.2/5 = 5.04
– SD = √5.04 ≈ 2.24
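The centering-and-scaling steps above can be sketched with NumPy, using the sample from the SD example (a minimal sketch; variable names are illustrative):

```python
import numpy as np

# Sample from the SD example: (2, 5, 6, 5, 9)
x = np.array([2.0, 5.0, 6.0, 5.0, 9.0])

mean = x.mean()       # 27 / 5 = 5.4
variance = x.var()    # population variance: 25.2 / 5 = 5.04
sd = x.std()          # sqrt(5.04) ≈ 2.24

# Centering and scaling (standardization): subtract the mean, divide by the SD
z = (x - mean) / sd
print(mean, variance, sd)
print(z.mean(), z.std())  # a standardized sample has mean 0 and SD 1
```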
Correlation
 The Pearson correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and -1.



Covariance and Correlation
 How much two attributes (X, Y) are correlated or separated

 Measuring covariance:

– Capture the data sets as n-dimensional vectors: (x1, x2, …, xn) and (y1, y2, …, yn)

– Convert them into vectors of deviations from their means: (x1 - x̄, x2 - x̄, …, xn - x̄) and (y1 - ȳ, y2 - ȳ, …, yn - ȳ)

– Take the dot product of these two vectors (related to the cosine of the angle between them) and divide by the sample size – this is the covariance

– Divide the covariance by the SDs of both sets to obtain the correlation

– -1 means perfect inverse correlation; 0 means no correlation; +1 means perfect correlation

– Correlation is not an indicator of causation; it helps decide what experiments to conduct
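The recipe above can be sketched in a few lines of NumPy. The two data sets below are hypothetical values chosen only for illustration, and the result is cross-checked against NumPy's built-in Pearson correlation:

```python
import numpy as np

# Two small illustrative data sets (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Covariance: dot product of the deviation vectors, divided by the sample size
cov = np.dot(x - x.mean(), y - y.mean()) / len(x)

# Correlation: covariance divided by the SDs of both sets
r = cov / (x.std() * y.std())

print(cov, r)
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in Pearson correlation
```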





Solving linear equation
• In machine learning we deal with training sets and test data, where algorithms are trained on large data sets, and matrices are a good representation for such data.

• Matrices help in dimensionality reduction of a data set through Principal Component Analysis (PCA).

• Training a classifier or a regression model by minimizing the error between the value calculated by the nascent model and the actual value from the training data can be done using linear algebra techniques.
Steps in solving linear equations:
Consider: -3x - 2y + 4z = 9;  3y - 2z = 5;  4x - 3y + 2z = 7
These can be expressed as AX = B, so X = A⁻¹·B, where

A = [ -3  -2   4
       0   3  -2
       4  -3   2 ]

B = [ 9, 5, 7 ]ᵀ    X = [ x, y, z ]ᵀ
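A sketch of solving this system with NumPy (using `np.linalg.solve`, which is generally preferred over forming A⁻¹ explicitly):

```python
import numpy as np

# Coefficient matrix and right-hand side from the system above
A = np.array([[-3.0, -2.0,  4.0],
              [ 0.0,  3.0, -2.0],
              [ 4.0, -3.0,  2.0]])
B = np.array([9.0, 5.0, 7.0])

# Solve AX = B for X = (x, y, z)
X = np.linalg.solve(A, B)
print(X)

# Verify the solution satisfies the original equations
print(np.allclose(A @ X, B))
```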
Working with Data Structures – Set

A | B  /  A.union(B)
  Returns a set which is the union of sets A and B.
A |= B  /  A.update(B)
  Adds all elements of B to the set A.
A & B  /  A.intersection(B)
  Returns a set which is the intersection of sets A and B.
A &= B  /  A.intersection_update(B)
  Leaves in set A only the items that also belong to set B.
A - B  /  A.difference(B)
  Returns the set difference of A and B (the elements included in A, but not included in B).
A -= B  /  A.difference_update(B)
  Removes all elements of B from the set A.
A ^ B  /  A.symmetric_difference(B)
  Returns the symmetric difference of sets A and B (the elements belonging to either A or B, but not to both sets simultaneously).
A ^= B  /  A.symmetric_difference_update(B)
  Writes into A the symmetric difference of sets A and B.
A <= B  /  A.issubset(B)
  Returns True if A is a subset of B.
A >= B  /  A.issuperset(B)
  Returns True if B is a subset of A.
A < B
  Equivalent to A <= B and A != B.
A > B
  Equivalent to A >= B and A != B.
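A quick demonstration of a few of the operators listed above (the values are illustrative):

```python
# Two small illustrative sets
A = {1, 2, 3, 4}
B = {3, 4, 5}

print(A | B)        # union
print(A & B)        # intersection
print(A - B)        # difference: in A but not in B
print(A ^ B)        # symmetric difference: in exactly one of A, B
print({1, 2} <= A)  # subset test
```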



Statistics
 Mean: the sum of the values in the sample divided by the size of the sample:
– (x1 + x2 + … + xn)/N

 Median: the middle value of the sorted set of values in the sample.
– The median is less susceptible to outliers than the mean
– The median is often a better indicator to look at than the mean

 Mode: the most common value in the data set
– It is indicative of frequency
– Ex. 0, 1, 3, 4, 0, 3, 6, 0: the mode is 0 – it occurred 3 times in the sample
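All three statistics are available in Python's standard `statistics` module; here they are applied to the mode example from the slide:

```python
import statistics

sample = [0, 1, 3, 4, 0, 3, 6, 0]  # example from the slide

print(statistics.mean(sample))    # (0+1+3+4+0+3+6+0) / 8 = 2.125
print(statistics.median(sample))  # middle of the sorted sample
print(statistics.mode(sample))    # most common value: 0 (occurs 3 times)
```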



Statistics

 For a normal distribution:
• 68% of the data falls within one SD of the mean
• 95% of the data falls within two SDs of the mean
• 99.7% of the data falls within three SDs of the mean



Statistics
 The probability density for a Gaussian distribution is given in terms of the mean (μ) and the variance (σ²) of the population as:

   p(x) = (1 / √(2πσ²)) · exp( -(x - μ)² / (2σ²) )

 The Central Limit Theorem states that


“Given a sufficiently large sample size from a population with a finite level of
variance, the mean of all samples from the same population will be approximately
equal to the mean of the population.

Furthermore, all of the samples will follow an approximate normal distribution


pattern, with all variances being approximately equal to the variance of the
population divided by each sample's size”.

https://www.youtube.com/watch?v=BO6GQkOjR50
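The Central Limit Theorem can be checked empirically with the standard library alone. This sketch draws many samples from a uniform population (population mean 0.5, population variance 1/12) and examines the distribution of the sample means:

```python
import random
import statistics

random.seed(0)  # make the experiment repeatable

sample_size = 100
num_samples = 2000

# Mean of each sample drawn from the uniform [0, 1) population
sample_means = [
    statistics.mean(random.random() for _ in range(sample_size))
    for _ in range(num_samples)
]

# The sample means cluster around the population mean (0.5) ...
print(statistics.mean(sample_means))
# ... with variance close to population variance / sample size = (1/12) / 100
print(statistics.variance(sample_means))
```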



 Working with Numpy –’NumpyNotebook1’ examples



 Cleansing the Data



Data Cleansing
 Issues with data quality
 Invalid values
 Formats of the data (dd-mm-yy); spelling issues
 Dependency – referential constraints, one-to-many unary relations
 Domain constraints, referential integrity constraints
 Duplicate records
 Missing values
 Values in wrong columns
 …
 Understanding data quality issues with Pandas:
• Outlier analysis
• Exploratory data analysis – charts, visualization tools
 Fixing the data quality issues
 Use a coding language (R, Python, …); fix the sources
 Find issues in data processing streams
Data Cleansing –Data imputation
 If a column is empty – what value do we fill in?

 Fixing null/empty values

 Unlike in an RDBMS, any value in ML is valid

 ML considers nulls as a 'class of data'

 Techniques:
– Populate with the mean, median, or mode
– Multiple imputation techniques (regression, mean, median, …)
– A prediction algorithm to predict the missing value
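A minimal sketch of the first technique, mean imputation with Pandas (the column name and values are hypothetical):

```python
import pandas as pd

# Hypothetical column with missing values
df = pd.DataFrame({"age": [25, None, 31, None, 40]})

# Impute missing values with the column mean; median/mode work the same way
df["age_imputed"] = df["age"].fillna(df["age"].mean())
print(df)
```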



Data Cleansing –Data Standardization
 Numeric data
– Logarithm
– Decimal places
– Floor, ceiling

 Date and time


– Time zone
– Fixing null, empty values

 Text data
– Name formatting
– Upper case /lower case



Python Libraries
Installation:

Approach 1: pip install numpy scipy matplotlib ipython jupyter pandas sympy

Approach 2: Python library bundles are available through environment platforms:


Anaconda: https://www.continuum.io
Canopy: https://www.enthought.com/products/canopy/

Numpy: It stands for 'Numerical Python'.


• Useful to perform operations on arrays (vectors), including multidimensional array objects; it supports several operations on these objects
• Other operations cover areas such as linear algebra and random number generation

Pandas: Pandas library provides two important data structures namely Series and DataFrame



Pandas (1- 4):
 A library that provides a way of processing tabular data supported by
two data structures: Series, DataFrame



Pandas (2- 4):
 Creating a Series:
– By passing a list of values
– pd.Series?  (shows the docstring in IPython/Jupyter)
– animals = ['Lion', 'Tiger', 'Bear', 'Mouse']
– pd.Series(animals)  # Pandas automatically assigns index values
  0    Lion
  1    Tiger
  2    Bear
  3    Mouse
  dtype: object
 Series from a dictionary
– city_cap = {'India': 'New Delhi', 'US': 'New York'}
– s = pd.Series(city_cap)  # the dictionary keys become the index
  India    New Delhi
  US       New York
  dtype: object
– for i in city_cap.keys(): print(type(i))  # to know the type of the keys
 Series from a list of indices and corresponding values
– Pandas overrides the automatic creation of index values when a list of index values is provided through the index parameter
– s = pd.Series(value_item_list, index=keys_list)
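A runnable version of the three Series-creation approaches above (note that a dictionary literal uses braces, and the dictionary keys become the index):

```python
import pandas as pd

# Series from a list: Pandas assigns a default integer index
animals = ['Lion', 'Tiger', 'Bear', 'Mouse']
s1 = pd.Series(animals)
print(s1)

# Series from a dictionary: the keys become the index
city_cap = {'India': 'New Delhi', 'US': 'New York'}
s2 = pd.Series(city_cap)
print(s2)

# Explicit index via the index parameter
s3 = pd.Series(['New Delhi', 'New York'], index=['India', 'US'])
print(s3)
```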
Pandas (3- 4):
 Working with DataFrame:

– A DataFrame is the second of the two Pandas data structures and provides a way of processing tabular data

 Series
– A Series is a cross-breed of array indexing and a dictionary:
Examples:
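A small DataFrame sketch to accompany the Series examples (the table contents are hypothetical):

```python
import pandas as pd

# A small hypothetical table: one row per animal
df = pd.DataFrame({
    'animal':    ['Lion', 'Tiger', 'Bear'],
    'legs':      [4, 4, 4],
    'weight_kg': [190, 220, 300],
})

print(df)
print(df['animal'])               # selecting one column yields a Series
print(df[df['weight_kg'] > 200])  # boolean filtering selects rows
```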





 Data Visualization in Python



Data Visualization (1 - 6):
 Data visualization: storytelling by means of visual patterns
– Create an interesting story before looking at the data

– The story tells us which specific tools are needed for visualization

1. Identify the tool (Excel/Tableau/Python, …)

2. Define the story clearly

3. Pick the right visual aid to tell the story

4. Assess the data visualization

a) Are there any distractions from the main story?
b) Do the visuals describe your story?

 Story-ink ratio = story ink / total ink used to print the graphic

– The portion of the graphic's ink devoted to the non-redundant display of the story information



Data Visualization (2 - 6):
– Pick the chart that communicates the story best!
– Bar chart: to make comparisons between categories, or comparisons across time intervals

– Two types:
 Horizontal (long lists of categories)
 Vertical (showing negative values, time periods)
 For comparing trends – line charts

– Pie chart:
 Best for showing a few categories
 The parts of a pie chart should add up to a meaningful whole

– Stacked areas (e.g. cumulative flow diagram)
 When cumulative proportions matter
 They are poor at showing specific values

– Histograms – to understand the spread in the data

– Box plots:
 Summarise the distribution (median, min, max) of the data
 Identify outliers in the data

– Scatter plots:
 Used to establish the relationship between two variables
Data Visualization (3 - 6):

(Chart-type examples shown as images in the original slides.)


Data Visualization (4 - 6):
 Comparing colours
– Use colour only if it communicates additional information
– Themes:

– Qualitative colours {contrast}: they don't carry an obvious relationship among them

– Sequential colours {range of values}: the same colour from a faded shade to a dark shade

– Diverging colours {obviously divided segments}: the same colour from a dark shade to a faded one



Data Visualization (5 - 6):
– Good practices

– A colour scheme should
 Add information
 Encode data well
 Accommodate colour blindness
 Print well – black-and-white and colour

– Colour scheme tools
 Color Brewer 2.0: http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
 A ColorBrewer implementation for Python is available at: https://pypi.python.org/pypi/brewer2mpl/1.4

– Selection of colours
 Light grey with dark lines: to show simple data
 Black and red: correlation
 Use legends: indicate what each component represents
 Paint labels directly on charts instead of relying on the axes
 Make sure the visualization stands by itself
 Use the squint test: can this visualization tell a story?



Data Visualization (6 - 6):
– matplotlib library:
– Steps
1. Create the data set and the figure
2. Plot the data
3. Configure axes
4. Add annotations/legends
5. Show the plot or save it as an image/PDF

– Implementation aspects

1. import matplotlib.pyplot as plt
2. plt.figure()
3. plt.plot(x_vals, y_vals)
4. plt.plot(x2_vals, y2_vals)
5. plt.xticks([list_of_values])
6. plt.yticks([list_of_values])
7. plt.xlim(lower_x, upper_x)
8. plt.ylim(lower_y, upper_y)
9. plt.xlabel('')
10. plt.ylabel('')
11. plt.legend()
12. plt.grid()
13. plt.show() / plt.savefig(<filename>)
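The steps above can be put together into a short script. This sketch uses the non-interactive Agg backend and `savefig` so it also runs headless; in a notebook or interactive session you would call `plt.show()` instead. The data series and filename are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; lets the script run headless
import matplotlib.pyplot as plt

# Two hypothetical series to compare as line charts
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 2, 4, 8, 16]

plt.figure()
plt.plot(x, y1, label="squares")
plt.plot(x, y2, label="powers of two")
plt.xlim(1, 5)
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.grid()
plt.savefig("comparison.png")  # or plt.show() in an interactive session
```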
 Classification of Algorithms



Supervised Learning
 It is the process of creating predictive models using a set of historical data that contains the results you are trying to predict.
 A supervised learning algorithm is one that is given examples containing the desired target value.

 Supervised learning approaches: use past results to train a model
 Classification: to identify which group a new record (i.e., customer or event) belongs to, based on its inherent characteristics.
 Regression: uses past values to predict future values; used in forecasting and variance analysis.

 Predictive analytics: the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends.
 Collaborative filtering – mining user behavior to make product recommendations
Un-Supervised Learning
 Unsupervised learning does not use previously known results to train its models.
 Unsupervised algorithms are not given the desired target answer; they must find something plausible on their own.

 Uses descriptive statistics to identify clusters (ex: market analysis)

 They can identify
 clusters or groups of similar records within a database (i.e., clustering)
 relationships among values in a database (i.e., association)



Tasks
 Supervised learning tasks
 k-Nearest Neighbors
 Naïve Bayes
 Support Vector Machines
 Decision Trees

 Un-supervised learning tasks:
 k-Means
 DBSCAN

Why do we have so many algorithms?



Choice of the Algorithm
 Consider your goal
 If you are trying to predict or forecast a target value – supervised learning
 If the target value is discrete {Yes/No, 1/2/3, A/B/C, red/yellow, …}, use a classification algorithm
 If the target value is continuous (a range of values such as 0.00-10.00, -99 to +99, or -∞ to +∞), use regression

 If you are NOT trying to predict or forecast a target value – un-supervised learning
 Try to fit the data into some discrete groups (clustering)



 Supervised Learning
– Classification



Introduction to classification: kNN Algorithm

for every point in our data set:
    compute the distance between inX and the current point
sort the distances in increasing order
take the k items with the lowest distances to inX
find the majority class among these k items
return the majority class as our prediction for the class of inX



Example -kNN
Consider a questionnaire survey on objective testing with two attributes – acid durability and strength – to classify whether a special paper tissue is good or not.
Four training samples (shown in the table).

Suppose the factory produces a tissue with test values X1 = 3, X2 = 7.

Without an expensive survey, can we guess what the classification of this new tissue is?
http://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html



Example -kNN
Let the number of nearest neighbours be K = 3. Compute the distance from the query instance to each of the training samples.



Example -kNN
Let the number of nearest neighbours be K = 3. Compute the distance from the query instance to each of the training samples and identify the 3 minima.



Example -kNN
Gather the category Y of the nearest neighbours.



Example -kNN

 Within k = 3, we have 2 good and 1 bad as per the survey input data
 Conclude that the new tissue paper that passed the laboratory tests with X1 = 3, X2 = 7 is included in the good category
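The worked example can be reproduced with a short kNN implementation of the pseudocode. The four training samples below are taken from the linked tutorial, since the original table image is not included in these notes:

```python
from collections import Counter

# Training samples ((X1 = acid durability, X2 = strength), label), as given
# in the linked tutorial (the table image is not reproduced here).
training = [
    ((7, 7), 'Bad'),
    ((7, 4), 'Bad'),
    ((3, 4), 'Good'),
    ((1, 4), 'Good'),
]

def knn_classify(query, samples, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    qx, qy = query
    # Squared Euclidean distance is enough for ranking neighbours
    dists = sorted(
        ((x - qx) ** 2 + (y - qy) ** 2, label)
        for (x, y), label in samples
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# The new tissue with X1 = 3, X2 = 7
print(knn_classify((3, 7), training))
```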



Naïve Bayes:
Naïve: it simplifies the probability computations by assuming that the predictive features are mutually independent.

Bayes: it maps the probabilities of observing the input features given the class to a probability distribution over classes, based on Bayes' theorem:

   P(A|B) = P(B|A) · P(A) / P(B)

 Probability of A given B is true: P(A|B)

 Probability of occurrence of A: P(A)

 Probability of occurrence of B: P(B)

 Probability of observing B given A occurs: P(B|A)

Naïve Bayes (2-3):
Example1: A doctor reported the following screening test scenario on
Cancer screening test :

Test Cancer No cancer Total


Test +ve 80 900 980
Test –ve 20 9000 9020
Total 100 9900 10000
 80 out of 100 cancer patients are correctly diagnosed, while the rest are not
 Cancer is falsely detected in 900 out of the 9900 healthy people
 If the result of this screening test on a person is positive, what is the probability that they actually have cancer?

   P(C|Pos) = P(Pos|C) · P(C) / P(Pos)

   P(Pos|C) = 80/100 = 0.8;  P(C) = 100/10000 = 0.01;  P(Pos) = 980/10000 = 0.098

   P(C|Pos) = 0.8 × 0.01 / 0.098 ≈ 8.16%, which is significantly higher than our general assumption of 100/10000 = 1%
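The same computation in Python, taking the counts directly from the table above:

```python
# Bayes' theorem applied to the screening-test table
p_pos_given_c = 80 / 100   # positives among the 100 cancer patients
p_c = 100 / 10000          # prior probability of cancer
p_pos = 980 / 10000        # overall probability of a positive test

p_c_given_pos = p_pos_given_c * p_c / p_pos
print(round(p_c_given_pos * 100, 2))  # ≈ 8.16 (percent)
```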
Naïve Bayes (3-3):
Example 2: spam mail detection. We observe a tendency that mails containing the word "gift" are spam. Classify a given new mail as spam or ham based on the probability:

   P(Spam|gift) = P(gift|Spam) · P(Spam) / P(gift)

 Probability of an email being spam if it contains the word "gift": P(Spam|gift)

 The numerator is the probability of a message being spam and containing the word "gift": P(gift|Spam) · P(Spam)

 The denominator is the overall probability of an email containing the word "gift", equivalent to: P(gift|Spam) · P(Spam) + P(gift|Ham) · P(Ham)

 Naïve: the presence of different words is assumed independent of each other



Naïve Bayes – notation for Example 1:
 Let the events of having cancer and of a positive test result be C and Pos respectively. The probability that the person has cancer, given that the test result is positive, is P(C|Pos).


 Un –Supervised Learning
– K Means clustering



K Means clustering (1-7):
 It is the process of grouping complex data into clusters
 Ex: demographics, movies
 K stands for the number of clusters, based on attributes of the data
 "Split the data into k groups"
 Which group a given data point belongs to – scatter plot
 Helps in categorization that we don't know a priori!
 Unlike supervised learning, it is not the case that we already know the correct groups; we try to converge the data into groups based on the data – the groups themselves are unknown (latent values)
 Ex: interesting clusters of songs based on the attributes of the songs

K Means clustering (2-7):



K Means clustering (3-7):
 Randomly we choose the following two centroids (k = 2) for two clusters.
 In this case the 2 centroids are: m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

K Means clustering (4-7):
We obtain two clusters containing: {1,2,3} and {4,5,6,7}.
Their new centroids are:



K Means clustering (5-7):

 Now using these centroids we compute the Euclidean distance of each object, as shown in the table.

 Therefore, the new clusters are: {1,2} and {3,4,5,6,7}

 Next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1)



K Means clustering (6-7):

 The clusters obtained are: {1,2} and {3,4,5,6,7}

 Therefore, there is no change in the clusters.

 Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
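The whole iteration can be sketched in plain Python. The seven point coordinates below are reconstructed from the centroids reported on these slides (the original data table image is not included), so treat them as illustrative:

```python
# Minimal k-means sketch (k = 2). Points reconstructed from the worked
# example: initial centroids (1.0, 1.0), (5.0, 7.0); final (1.25, 1.5), (3.9, 5.1).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids = [(1.0, 1.0), (5.0, 7.0)]

def closest(p, centroids):
    """Index of the centroid nearest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                             (p[1] - centroids[i][1]) ** 2)

for _ in range(10):  # iterate until the centroids stop moving (10 is plenty here)
    clusters = [[], []]
    for p in points:
        clusters[closest(p, centroids)].append(p)
    new_centroids = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids)                    # final centroids
print([len(c) for c in clusters])   # cluster sizes: 2 and 5
```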



K Means clustering (7-7):



Join Our community:

https://my.techmahindra.com/personal/pl73819/blog/Lists/Post
s/Post.aspx?ID=2
Thank you
Prakash.LNS@Techmahindra.com

Disclaimer
Tech Mahindra Limited, herein referred to as TechM provide a wide array of presentations and reports, with the contributions of various
professionals. These presentations and reports are for information purposes and private circulation only and do not constitute an offer to buy or sell
any services mentioned therein. They do not purport to be a complete description of the market conditions or developments referred to in the
material. While utmost care has been taken in preparing the above, we claim no responsibility for their accuracy. We shall not be liable for any direct
or indirect losses arising from the use thereof and the viewers are requested to use the information contained herein at their own risk. These
presentations and reports should not be reproduced, re-circulated, published in any media, website or otherwise, in any form or manner, in part or as
a whole, without the express consent in writing of TechM or its subsidiaries. Any unauthorized use, disclosure or public dissemination of information
contained herein is prohibited. Individual situations and local practices and standards may vary, so viewers and others utilizing information contained
within a presentation are free to adopt differing standards and approaches as they see fit. You may not repackage or sell the presentation. Products
and names mentioned in materials or presentations are the property of their respective owners and the mention of them does not constitute an
endorsement by TechM. Information contained in a presentation hosted or promoted by TechM is provided “as is” without warranty of any kind, either
expressed or implied, including any warranty of merchantability or fitness for a particular purpose. TechM assumes no liability or responsibility for the
contents of a presentation or the opinions expressed by the presenters. All expressions of opinion are subject to change without notice.

