A Report
by
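import matplotlib.pyplot as plt

# df is assumed to be the pandas DataFrame loaded earlier in the report
print(df.head())
print(df.columns)
print(df.shape)
print(df.nunique())  # distinct values per column (a DataFrame has no unique() method)
print("*********quantitative*********")
print(df.describe())
print("*********qualitative*********")
print(df.describe(include=[object]))
# Histogram of every numeric column
df.select_dtypes(include=["float64", "int64"]).hist(figsize=[11, 11])
plt.show()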
From this output we inferred that the dataset has 6 columns, the last two of which
hold categorical values. No column has missing values, so no data cleaning is
needed in this case.
3.2 Data cleaning
Not required as there were no missing values.
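The missing-value check behind this decision can be sketched as follows. This is a minimal illustration, not the report's actual code: the small DataFrame and its column names are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the report's DataFrame; column names are assumptions.
df = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Wednesday"],
    "Time": [9, 14, 20],
    "Members": [2, 0, 4],
})

# Count missing values per column; all zeros means no cleaning is needed.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# True if at least one column contains a missing value.
has_missing = bool(missing_per_column.any())
print("Data cleaning required:", has_missing)
```

If any count were non-zero, rows could be dropped with `df.dropna()` or filled with `df.fillna(...)` before modelling.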
3.3 Data Visualizations
Bar charts are used to analyze, for both houses, the relation between the day and
the number of family members present, and between the time and the number of
family members present.
For House A
For House B
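Charts of this kind could be produced along the following lines. This is only a sketch under stated assumptions: the sample data and the column names `MembersA` and `MembersB` are hypothetical, and the mean is used as the aggregate, which the report does not specify.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view the plots interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample in the shape the report describes; column names are assumptions.
df = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday", "Tuesday", "Sunday", "Sunday"],
    "Time": [9, 20, 9, 20, 9, 20],
    "MembersA": [1, 4, 0, 3, 4, 4],
    "MembersB": [2, 3, 1, 3, 3, 2],
})

# Keep the days in calendar order rather than alphabetical order.
day_order = ["Monday", "Tuesday", "Sunday"]

fig, axes = plt.subplots(2, 2, figsize=(11, 8))
for row, house in enumerate(["MembersA", "MembersB"]):
    # Average number of members present per day and per time slot.
    by_day = df.groupby("Day")[house].mean().reindex(day_order)
    by_time = df.groupby("Time")[house].mean()
    by_day.plot(kind="bar", ax=axes[row, 0], title=f"{house} by day")
    by_time.plot(kind="bar", ax=axes[row, 1], title=f"{house} by time")
fig.tight_layout()
# plt.show()  # uncomment when running interactively
```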
3.4 Model Building
In this section we build our model for prediction.
KNN Analysis
In this section we applied the K-Nearest Neighbors (KNN) algorithm and analysed the result.
# Data preprocessing: encode day names as integers (Monday = 1 ... Sunday = 7)
dataset['Day_encoded'] = dataset['Day'].map({'Monday': 1, 'Tuesday': 2,
    'Wednesday': 3, 'Thursday': 4, 'Friday': 5, 'Saturday': 6, 'Sunday': 7})
# Getting the prediction score for both house models using test data
scoreA = knnHouseA.score(x_testA, y_testA)
scoreB = knnHouseB.score(x_testB, y_testB)
Result from KNN: the accuracy is 90.47% for House A and 97.61% for House B.
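The steps between the day encoding and the scoring call are not shown in the report. One way they might look with scikit-learn is sketched below for a single house. Everything beyond the names `dataset`, `Day_encoded`, `knnHouseA`, `x_testA`, `y_testA` and `scoreA` is an assumption: the synthetic data, the `Members` target column, the 80/20 split and `n_neighbors=5` are illustrative choices, not the report's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

day_map = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,
           'Friday': 5, 'Saturday': 6, 'Sunday': 7}

# Synthetic stand-in for the House A data (assumption, not the report's dataset).
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    'Day': rng.choice(list(day_map), size=200),
    'Time': rng.integers(0, 24, size=200),  # hour of day, 0-23
})
# Toy target: 4 members at home outside 08:00-17:59, otherwise 1.
dataset['Members'] = np.where(
    (dataset['Time'] < 8) | (dataset['Time'] >= 18), 4, 1)

# Same encoding as in the report.
dataset['Day_encoded'] = dataset['Day'].map(day_map)

# The two predictors named in the report's conclusion.
X = dataset[['Day_encoded', 'Time']]
y = dataset['Members']
x_trainA, x_testA, y_trainA, y_testA = train_test_split(
    X, y, test_size=0.2, random_state=0)  # assumed 80/20 split

knnHouseA = KNeighborsClassifier(n_neighbors=5)  # k=5 is an assumed value
knnHouseA.fit(x_trainA, y_trainA)

# score() returns mean accuracy as a fraction in [0, 1].
scoreA = knnHouseA.score(x_testA, y_testA)
print(f"House A accuracy: {scoreA * 100:.2f}%")
```

Because `score` returns a fraction, it has to be multiplied by 100 to obtain a percentage figure like the ones reported above.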
Conclusion
We selected two columns as predictors: 'Day' and 'Time'.
After applying four ML algorithms, we concluded that KNN was the best algorithm
for our dataset.
Contributions of each member:
X
Amber Bhargava
X
Amritesh Rai
X
Raghav Maheshwari