Rossmann Store Sales
Ayari Nadhmi, 4 ERP-BI 3
2016-2017
Content
1 Problem Introduction
2 Exploratory Data Analysis
3 Models
4 Conclusion
4/29/17 2
Problem Introduction
Exploratory Data Analysis
Dataset Details
Train
Variables: store, day of week, date, sales, customers, open, promo, state holiday, school holiday
Store
Variables: store, store type, assortment, competition distance, competition open since month, promo2, promo2 since week, promo2 since year, promo interval
Dataset Details
Id - an Id that represents a Store
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the Store was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
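The PromoInterval encoding can be turned into a simple check of whether a given date falls in a month where Promo2 starts a new round. This is a minimal sketch: `is_promo2_start_month` is a hypothetical helper, not part of the slides, and it assumes the month abbreviations in the string match R's built-in `month.abb`.

```r
# Hypothetical helper: TRUE when the month of `date` is listed in a
# PromoInterval string such as "Feb,May,Aug,Nov".
is_promo2_start_month <- function(date, promo_interval) {
  months <- strsplit(promo_interval, ",", fixed = TRUE)[[1]]
  # month.abb is locale-independent: "Jan", "Feb", ..., "Dec"
  current <- month.abb[as.integer(format(as.Date(date), "%m"))]
  current %in% months
}

is_promo2_start_month("2015-02-10", "Feb,May,Aug,Nov")  # TRUE
is_promo2_start_month("2015-03-10", "Feb,May,Aug,Nov")  # FALSE
```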
Dataset Details
First description
Dataset Details
Proportion of the number of customers for closed and open stores
As we can see, there are some days when some stores were open without having any customers.
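The observation about open stores without customers can be checked by counting the matching rows. This is a minimal sketch, assuming a data frame with the `Open` and `Customers` columns described earlier; the toy values are made up for illustration and are not taken from the Rossmann data.

```r
# Toy stand-in for `train`; values are illustrative only
toy <- data.frame(
  Store     = c(1, 2, 3, 4, 5),
  Open      = c(1, 1, 0, 1, 0),
  Customers = c(520, 0, 0, 310, 0)
)

# Rows where the store was open but recorded no customers
open_no_customers <- toy[toy$Open == 1 & toy$Customers == 0, ]
n_open_no_customers <- nrow(open_no_customers)

# Cross-tabulate open/closed against "had at least one customer"
open_vs_customers <- table(Open = toy$Open, HasCustomers = toy$Customers > 0)
print(open_vs_customers)
```

On the real data the same filter would be applied to `train` instead of `toy`.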
Dataset Details
library(ggplot2)
# Keep only days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Sales with and without a promo
ggplot(testing, aes(x = factor(Promo), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)
Dataset Details
# Keep only days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales by day of the week
ggplot(testing, aes(x = factor(DayOfWeek), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)
Dataset Details
# Keep only days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales on school holidays versus regular days
ggplot(testing, aes(x = factor(SchoolHoliday), y = Sales)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "red", outlier.colour = NA, fill = NA)
Dataset Details
# Keep only days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Number of customers with and without a promo
ggplot(testing, aes(x = factor(Promo), y = Customers)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "hotpink", outlier.colour = NA, fill = NA)
Data Conversion and Preprocessing
Data Conversion
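# Parse the date and derive calendar features from it
train$Date <- as.Date(train$Date)
train$month <- as.integer(format(train$Date, "%m"))
# "%y" yields the two-digit year (e.g. 15); "%Y" would yield 2015
train$year <- as.integer(format(train$Date, "%y"))
train$day <- as.integer(format(train$Date, "%d"))
# Treat the binary indicators as categorical variables
train$SchoolHoliday <- as.factor(train$SchoolHoliday)
train$Promo <- as.factor(train$Promo)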
Data Preprocessing
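set.seed(123)
# Draw 70% of the row indices of train_store for training
trainsample <- sample(seq_len(nrow(train_store)),
                      floor(0.7 * nrow(train_store)))
# Draw the test indices (30%) from the rows not used for training
test <- sample(setdiff(seq_len(nrow(train_store)), trainsample),
               floor(0.3 * nrow(train_store)))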
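The slides do not show how `train_store` is built. A plausible reading, stated here as an assumption, is that it joins the train and store tables on the `Store` column. The sketch below uses made-up toy tables and takes the plain complement of the training indices as the test set, which is a small simplification of the sampling shown on the slide.

```r
# Toy stand-ins for the train and store tables; values are illustrative only
train <- data.frame(Store = rep(1:2, each = 5),
                    Sales = seq(100, 1000, by = 100))
store <- data.frame(Store = 1:2, StoreType = c("a", "c"))

# Assumption: train_store is the join of the two tables on Store
train_store <- merge(train, store, by = "Store")

set.seed(123)
n <- nrow(train_store)
trainsample <- sample(seq_len(n), floor(0.7 * n))
# Every row not drawn for training goes to the test set
test <- setdiff(seq_len(n), trainsample)
```

The two index vectors never overlap, so no test row leaks into training.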
Models Used
Linear regression Model
Model Formula
Linear regression Model
The RMSE obtained from the model above
We try to improve the model using stepAIC for variable selection.
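The model formula and the RMSE value are not recoverable from the slides, so the following is only a sketch of the workflow on synthetic data: fit a linear model, compute the RMSE on held-out rows, then run `stepAIC` from the MASS package. The predictors, the `noise_var` column, and the `rmse` helper are assumptions made for illustration.

```r
library(MASS)  # provides stepAIC

set.seed(123)
# Synthetic data for illustration only; not the Rossmann data
n <- 200
toy <- data.frame(
  Promo     = factor(sample(0:1, n, replace = TRUE)),
  DayOfWeek = sample(1:7, n, replace = TRUE),
  noise_var = rnorm(n)
)
toy$Sales <- 5000 + 2000 * (toy$Promo == "1") -
  150 * toy$DayOfWeek + rnorm(n, sd = 300)

train_idx <- sample(seq_len(n), floor(0.7 * n))
fit <- lm(Sales ~ Promo + DayOfWeek + noise_var, data = toy[train_idx, ])

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_full <- rmse(toy$Sales[-train_idx], predict(fit, toy[-train_idx, ]))

# Stepwise selection by AIC; trace = FALSE silences the step log
fit_step <- stepAIC(fit, direction = "both", trace = FALSE)
rmse_step <- rmse(toy$Sales[-train_idx], predict(fit_step, toy[-train_idx, ]))
```

Comparing `rmse_full` with `rmse_step` indicates whether the reduced model generalizes at least as well as the full one.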
Random Forest Model
library(randomForest)
# Fit a random forest with 20 trees on the training rows
rf <- randomForest(Sales ~ Promo + DayOfWeek + StateHoliday +
                     month + year + day + StoreType + SchoolHoliday,
                   data = train_store[trainsample, ], ntree = 20)
Random Forest Model
varImpPlot(rf)
This plot shows the importance of the variables used by the model.
plot(rf)
This plot shows how the error evolves with the number of trees.
SVM Model
Conclusion