Rossmann Store Sales Predictions

Data Mining Project: Rossmann Store Sales
Ayari Nadhmi 4 erp-bi 3
2016-2017
Content
1 Problem Introduction
2 Exploratory Data Analysis
3 Models
4 Conclusion
4/29/17 2
Problem Introduction
Forecast sales using store, promotion, and competitor data

1115 stores
Historical sales data from January 2013 to July 2015
Exploratory Data Analysis
Dataset Details
Train
Variables: Store, DayOfWeek, Date, Sales, Customers, Open, Promo, StateHoliday, SchoolHoliday
Store
Variables: Store, StoreType, Assortment, CompetitionDistance, CompetitionOpenSinceMonth, Promo2, Promo2SinceWeek, Promo2SinceYear, PromoInterval
Dataset Details
Id - an Id that represents a Store
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the Store was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
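The PromoInterval encoding described above can be decoded with a few lines of base R. This is only a sketch: the helper name starts_promo2_round is hypothetical and not part of the original project, and it compares the first three letters of each month name so that a longer spelling of a month would still match.

```r
# Hypothetical helper (not from the slides): TRUE when the month of `date`
# is one of the months listed in `promo_interval`.
starts_promo2_round <- function(date, promo_interval) {
  # Stores that do not take part in Promo2 have no interval
  if (is.na(promo_interval) || promo_interval == "") return(FALSE)
  # Split "Feb,May,Aug,Nov" into month names and keep their first 3 letters
  months <- substr(strsplit(promo_interval, ",", fixed = TRUE)[[1]], 1, 3)
  # month.abb is R's built-in vector "Jan", "Feb", ..., "Dec"
  current <- month.abb[as.integer(format(as.Date(date), "%m"))]
  current %in% months
}

in_feb <- starts_promo2_round("2015-02-10", "Feb,May,Aug,Nov")  # TRUE
in_mar <- starts_promo2_round("2015-03-10", "Feb,May,Aug,Nov")  # FALSE
```

A flag computed this way could be added as an extra column after merging train and store, if the promotion restart months turn out to matter for sales.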
Dataset Details
First description
Dataset Details
Test whether any stores are closed in the training data
As we can see, there are no stores closed in the train data
Proportion of open stores versus closed stores in the train data
Proportion of sales by whether the store was closed or open
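The checks listed on this slide can be sketched in base R as follows. The train_toy data frame and its values are made up for illustration; in the project the same calls would presumably run on train.

```r
# Toy stand-in for the train table (values are invented)
train_toy <- data.frame(
  Open  = c(1, 1, 0, 1),
  Sales = c(5263, 6064, 0, 8314)
)

# Number of open (1) and closed (0) store-days
open_counts <- table(train_toy$Open)

# Share of open versus closed store-days
open_share <- prop.table(open_counts)

# Total sales split by open/closed status
sales_by_open <- tapply(train_toy$Sales, train_toy$Open, sum)
```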
Dataset Details
Proportion of the number of customers for closed and open stores
As we can see, there are some days when stores were open without having any customers
Proportion of sales by whether there was a promo or not
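Days on which a store was open but had no customers can be isolated with a simple filter. The sketch below uses a made-up train_toy table, not the project data.

```r
# Toy stand-in for the train table (values are invented)
train_toy <- data.frame(
  Store     = c(1, 2, 3, 4),
  Open      = c(1, 1, 0, 1),
  Customers = c(555, 0, 0, 821),
  Sales     = c(5263, 0, 0, 8314)
)

# Rows where the store was open but no customer was recorded
open_no_customers <- subset(train_toy, Open == 1 & Customers == 0)
n_open_no_customers <- nrow(open_no_customers)
```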
Dataset Details
library(ggplot2)
# Keep only days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Sales distribution with and without a promo
ggplot(testing, aes(x = factor(Promo), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)
From the graphic we can see the effect of a promo on Sales
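A numeric summary can complement the boxplot. The sketch below computes the median sales per promo group on a small made-up table; the names sales_toy and median_by_promo are illustrative, not from the project.

```r
# Toy stand-in for the train table (values are invented)
sales_toy <- data.frame(
  Promo     = c(0, 0, 0, 0, 1, 1, 1),
  Sales     = c(0, 4000, 5000, 6000, 7000, 8000, 9000),
  Customers = c(0, 400, 500, 600, 650, 700, 800)
)

# Same filter as on the slide: drop days without sales or customers
open_days <- sales_toy[which(sales_toy$Sales != 0 & sales_toy$Customers != 0), ]

# Median sales for days without (0) and with (1) a promo
median_by_promo <- aggregate(Sales ~ Promo, data = open_days, FUN = median)
```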
Dataset Details
library(ggplot2)
# Keep only days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales distribution per day of the week
ggplot(testing, aes(x = factor(DayOfWeek), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)
The plot shows the effect of the day of the week on Sales: for days of the week 2, 3 and 4, sales are mostly the same
Dataset Details
library(ggplot2)
# Keep only days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales distribution with and without a school holiday
ggplot(testing, aes(x = factor(SchoolHoliday), y = Sales)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "red", outlier.colour = NA, fill = NA)
This plot shows the effect of a school holiday on the sales amount
Dataset Details
library(ggplot2)
# Keep only days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Customer counts with and without a promo
ggplot(testing, aes(x = factor(Promo), y = Customers)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "hotpink", outlier.colour = NA, fill = NA)
The effect of a promo on the number of customers
Data Conversion and Preprocessing
Data Conversion
# Parse the dates, then derive calendar features
train$Date <- as.Date(train$Date)
train$month <- as.integer(format(train$Date, "%m"))
train$year <- as.integer(format(train$Date, "%y"))  # two-digit year: 13, 14, 15
train$day <- as.integer(format(train$Date, "%d"))
# Treat the binary flags as categorical variables
train$SchoolHoliday <- as.factor(train$SchoolHoliday)
train$Promo <- as.factor(train$Promo)
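To make the date conversion concrete, the sketch below applies the same format() calls to a single example date. Note that "%y" yields a two-digit year, while "%Y" would yield the full year; the date itself is chosen only for illustration.

```r
# One example date inside the period covered by the data
d <- as.Date("2015-07-31")

month_num       <- as.integer(format(d, "%m"))  # 7
day_num         <- as.integer(format(d, "%d"))  # 31
year_two_digit  <- as.integer(format(d, "%y"))  # 15
year_four_digit <- as.integer(format(d, "%Y"))  # 2015
```

Either year encoding separates 2013, 2014 and 2015, so the choice mainly affects how readable the model output is.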
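Data Preprocessing
# Attach the store attributes to each daily record
train_store <- merge(train, store, by = "Store")

# Keep only the days on which the store was open
train_store <- train_store[train_store$Open != 0, ]
set.seed(123)
# 70% of the rows for training, a disjoint 30% sample for testing
trainsample <- sample(seq_len(nrow(train_store)), floor(0.7 * nrow(train_store)))
test <- sample(setdiff(seq_len(nrow(train_store)), trainsample),
               floor(0.3 * nrow(train_store)))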
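The 70/30 split can be checked on a small example. This sketch uses 10 row indices instead of the real table and takes the test set as everything not sampled for training, which is a simplification of the code above; the names train_idx and test_idx are illustrative.

```r
set.seed(123)
n <- 10  # stand-in for nrow(train_store)

# Row indices used for training (70% of the rows)
train_idx <- sample(seq_len(n), round(0.7 * n))

# All remaining rows form the test set
test_idx <- setdiff(seq_len(n), train_idx)

# The two index sets must not share any row
overlap <- intersect(train_idx, test_idx)
```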
Models Used
Linear regression Model
lr.model <- lm(Sales ~ Promo + DayOfWeek + StateHoliday + month + year + day +
               StoreType + CompetitionDistance, data = train_store[trainsample, ])
Model Formula
Residuals with a maximum value of 34941
Value close to 0, indicating that the model is bad
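The quantities mentioned on this slide can be read directly from a fitted lm object. The sketch below assumes that the value close to 0 is the R-squared, which the slide does not state, and it uses the built-in mtcars data as a stand-in for the Rossmann table.

```r
# Stand-in model on built-in data (not the Rossmann data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Largest residual: the biggest under-prediction of the model
max_residual <- max(residuals(fit))

# Share of variance explained; values near 0 indicate a poor fit
r_squared <- summary(fit)$r.squared
```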
Linear regression Model
The RMSE given by the above model
We try to improve our model using stepAIC for variable selection
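A minimal sketch of both steps follows. It makes two assumptions: the built-in mtcars data stands in for the Rossmann table, and base R's step(), which also performs AIC-based stepwise selection, stands in for stepAIC from the MASS package. The rmse helper is hypothetical and not taken from the slides.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Stand-in data and a 70/30 split (mtcars replaces the Rossmann table)
set.seed(123)
n <- nrow(mtcars)
train_idx  <- sample(seq_len(n), round(0.7 * n))
train_part <- mtcars[train_idx, ]
test_part  <- mtcars[-train_idx, ]

# Full model, then AIC-based backward selection with step()
full_model    <- lm(mpg ~ wt + hp + qsec + drat, data = train_part)
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Compare both models on the held-out rows
rmse_full    <- rmse(test_part$mpg, predict(full_model, newdata = test_part))
rmse_reduced <- rmse(test_part$mpg, predict(reduced_model, newdata = test_part))
```

Selection by AIC does not guarantee a lower test RMSE, so both values are worth reporting side by side.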
Random Forest Model
library(randomForest)
# Random forest with 20 trees, fitted on the training rows
rf <- randomForest(Sales ~ Promo + DayOfWeek + StateHoliday + month + year + day +
                   StoreType + SchoolHoliday,
                   data = train_store[trainsample, ], ntree = 20)
Random Forest Model
varImpPlot(rf)
This plot shows the importance of the variables used by the model
plot(rf)
This plot shows how the error evolves with the number of trees
Random Forest Model
Using h2o library
SVM Model
For the SVM regression model we used only store 1, because processing all the data took too long
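Restricting the data to one store is a plain row filter. The sketch below uses a made-up table; the names train_store_toy, train_store1_toy and train_rows are illustrative and do not come from the slides.

```r
# Toy stand-in for the merged train_store table (values are invented)
train_store_toy <- data.frame(
  Store = c(1, 1, 1, 1, 2, 2),
  Sales = c(5263, 5020, 4782, 5011, 6064, 5567)
)

# Keep only the rows of store 1
train_store1_toy <- train_store_toy[train_store_toy$Store == 1, ]

# Training row indices within the store 1 subset (70% of its rows)
set.seed(123)
n1 <- nrow(train_store1_toy)
train_rows <- sample(seq_len(n1), round(0.7 * n1))
```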
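SVM Model
library(e1071)
# train_store1 (the rows of store 1) and sample (the training row indices)
# are defined off-slide
model.SVM <- svm(Sales ~ Promo + DayOfWeek + StateHoliday + month + year + day +
                 StoreType + SchoolHoliday, data = train_store1[sample, ])
summary(model.SVM)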
Conclusion