Data Mining Algorithms and Statistical Analysis for Sales Data Forecast
Abstract—This paper develops and compares different models to forecast new product sales data with an increasing sales trend and multiple predictor inputs. To analyze a new product with an increasing sales trend, we developed and evaluated multiple time series forecasting methods, including the Exponential Smoothing model, Holt’s Linear model, the ARMA model, and the ARMA with linear trend model. Furthermore, we created multiple Causal Factor Forecasting models to incorporate various input factors, such as salespersons’ quotes, product pricing, and product seasonality factors, to further reduce forecasting error. We analyzed the original data regression model, the trend and residual regression model, and the ARMAV with linear trend model to consider input factors. We found that the ARMAV with linear trend model gives the best forecasting accuracy and the lowest RSS (Residual Sum of Squares). In conclusion, the ARMAV with linear trend method is the best benchmark model to forecast sales data for a new product with a trend and with salespersons’ inputs.

Keywords-Forecast; Time-Series Forecasting; Causal Factor Forecasting; ARMA; ARMAV

I. INTRODUCTION
In the consumer electronics industry, the normal product selling cycle is two to three years. During this selling period, a product goes through an initial product introduction stage and a mature selling period. Forecasting sales of consumer electronics products faces three challenges. First, in the new product introduction stage, demand may have an upward trend. Second, consumer electronics sales may experience a seasonal selling pattern. Third, the short selling period restricts the data set size, which is a big challenge in time series forecasting. Even at the end of the second year, monthly sales have only 24 data points, far fewer than in a traditional time series problem [1].
Prior research on sales data forecasting focuses heavily on forecasting with large historical data sets. However, new product sales forecasting with limited data points remains a new research area. In addition, prior papers on sales data forecasting mostly use time-series models with no input factors. This paper explores a new method for forecasting new product sales data using 24 months of historical data, and incorporates multiple input factors to improve the forecast accuracy [2]. We analyze both pure time-series forecasting models and time-series forecasting models combined with the causal factor forecasting method. We found that with causal factor inputs, the forecasting result can be greatly improved [3]. We concluded that causal factor inputs can compensate for the less accurate forecast due to limited data in new product sales forecasting.

II. TIME SERIES FORECASTING MODELS

A. Exponential Smoothing (ES) Method
The Exponential Smoothing method fits a trend model such that the most recent data are weighted more heavily than data in the early part of the series. It has a weight parameter $\alpha$, which is between 0 and 1. The larger $\alpha$ is, the more the new forecast is influenced by the recent data. It is called exponential smoothing because the weight of an observation is a geometric (exponential) function of the number of periods that the observation extends into the past relative to the current period [4]. The model for ES is:

$F_{t+1} = \sum_{j=0}^{\infty} \alpha (1-\alpha)^{j} Y_{t-j}$   (1)

Or

$F_{t+1} = \alpha Y_t + (1-\alpha) F_t$   (2)

Here, $F_t$ is the predicted data at time $t$, and $Y_t$ is the actual data at time $t$.
The ES method can be easily implemented. With the initial predicted data $F_1 = Y_1$, the predictions for all months can be generated by equation (2). $\alpha$ is a decision variable chosen to minimize the RSS. The whole model can be implemented in Excel and solved by Excel Solver:

Min RSS
By changing $\alpha$,
Subject to $0 \le \alpha \le 1$

The optimal $\alpha$ that minimizes RSS is $\alpha^* = 0.58$. The minimum RSS for the ES method is $RSS = 3.012 \times 10^6$.
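The ES recursion in equation (2) and the search for the RSS-minimizing weight can be sketched in a few lines of Python. This is a minimal illustration and not the Excel Solver setup used in the paper: the function names, the grid search over $\alpha$, and the sample sales figures are all assumptions made for the example.

```python
def es_forecasts(y, alpha):
    """One-step-ahead ES forecasts from equation (2), initialised with F_1 = Y_1."""
    f = [y[0]]
    for t in range(len(y) - 1):
        f.append(alpha * y[t] + (1 - alpha) * f[t])
    return f


def rss(y, f):
    """Residual sum of squares between actual data y and forecasts f."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f))


def best_alpha(y, steps=1000):
    """Grid search over alpha in [0, 1]; a simple stand-in for a numerical solver."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda a: rss(y, es_forecasts(y, a)))


# Hypothetical monthly sales (units); these are NOT the paper's data.
sales = [520, 610, 580, 790, 850, 820, 1080, 1100, 1050, 1300, 1450, 1390]
alpha_star = best_alpha(sales)
min_rss = rss(sales, es_forecasts(sales, alpha_star))
```

A grid search is used here only because it keeps the constraint $0 \le \alpha \le 1$ explicit; any bounded one-dimensional optimizer would serve the same purpose.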
523.975 + 73.279t. Figure 6 shows the original sales data fitting the linear trend line.

After fitting the linear trend line, the data model becomes $Y_t = 523.975 + 73.279t + X_t$. Next, the residual $X_t$ is fitted to an ARMA model. Using Matlab code, we found that ARMA(4,2) is the adequate model.

$X_t = 1.251X_{t-1} - 0.5207X_{t-2} + 0.6235X_{t-3} - 0.5193X_{t-4} + a_t - 1.799a_{t-1} + 0.8601a_{t-2}$

The four characteristic roots are:
$\lambda_{1,2} = 0.8901 \pm 0.2561i$,
$\lambda_{3,4} = -0.2647 \pm 0.7317i$

No seasonality is observed in this model.

Figure 3. Fit Actual Sales into a Trend Line

Figure 4 gives the actual data and forecast data comparison.

Figure 3.

The RSS is $2.691 \times 10^6$, which is lower than that of the ES method, but higher than those of Holt’s method and the ARMA model with no trend. This result indicates that the linear trend does not help the ARMA model in sales prediction [7].

data)’s impact in the sales data forecasting. Causal factors forecasting considers input factors, and uses them to improve the forecasting accuracy.

A. Original data regression
In this method, we directly regress the actual sales data on quotes data (from 1, 2, 3, 4, and 5 months prior to the sales month) and a seasonality index. The seasonality index is implemented by assigning dummy variables to represent each month. The result shows that the only significant factor is the quotes data from 3 months prior to the sales month, with t-stat=3.13. All other quotes data and seasonality data are insignificant.
The regression forecasting model is

$F_t = -829.536 + 0.795 \times Q_{t-3}$   (7)

Here, $Q_t$ is the quote data at time $t$, and $Q_{t-3}$ is the quotes data from 3 months prior to the sales month.
Figure 5 gives the actual data and forecast data using forecasting model (7). The RSS is $3.401 \times 10^6$, which is higher than that of all the previous methods. From Figure 8 we observe that the forecast over-forecasts in the earlier months and under-forecasts in the later months. This is because the regression method does not consider the increasing trend.

[Figure: Original Data Regression Forecast vs. Actual Sales. x-axis: Months (Feb-08 to Feb-10); y-axis: Units; series: Actual Sales, Forecast]
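A regression of sales on quotes lagged by three months, in the form of forecasting model (7), can be sketched with an ordinary least-squares fit. The sketch below is an illustration under stated assumptions: the function name and the quote series are hypothetical, and the sales series is generated noise-free from the coefficients of model (7) purely so that the fit can be checked; it is not the paper's data.

```python
import numpy as np


def fit_lagged_regression(sales, quotes, lag=3):
    """OLS fit of sales[t] = a + b * quotes[t - lag] (lag >= 1); returns (a, b, rss)."""
    y = np.asarray(sales, dtype=float)[lag:]      # sales from month lag+1 onward
    q = np.asarray(quotes, dtype=float)[:-lag]    # quotes sent out `lag` months earlier
    X = np.column_stack([np.ones_like(q), q])     # intercept column plus lagged quotes
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(coef[0]), float(coef[1]), float(resid @ resid)


# Hypothetical quote counts per month (assumption, not the paper's data).
quotes = [2000, 2300, 2100, 2600, 2900, 2700, 3200, 3400, 3300, 3600]
# Synthetic sales: the first three months are arbitrary (they have no lagged quote),
# the rest follow model (7) exactly, so the fit should recover its coefficients.
sales = [700.0, 800.0, 900.0] + [-829.536 + 0.795 * q for q in quotes[:-3]]

intercept, slope, rss_value = fit_lagged_regression(sales, quotes)
```

With real data the residuals would not vanish, and the significance of each candidate lag would still need to be judged from its t-statistic as described above.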
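The characteristic roots reported for the ARMA(4,2) residual model can be checked numerically from its autoregressive coefficients. The following sketch assumes the roots are those of the polynomial $\lambda^4 - 1.251\lambda^3 + 0.5207\lambda^2 - 0.6235\lambda + 0.5193$; the variable names are illustrative and the original Matlab code is not reproduced here.

```python
import numpy as np

# AR coefficients phi_1..phi_4 of the ARMA(4,2) residual model, as reported in the text.
phi = [1.251, -0.5207, 0.6235, -0.5193]

# Characteristic polynomial, highest power first:
# lambda^4 - phi_1*lambda^3 - phi_2*lambda^2 - phi_3*lambda - phi_4
char_poly = [1.0] + [-p for p in phi]

roots = np.roots(char_poly)
moduli = np.abs(roots)

# The AR part is stable when every characteristic root lies inside the unit circle.
is_stationary = bool(np.all(moduli < 1.0))
```

Under this assumption, the computed roots agree with the two complex-conjugate pairs listed in the text to about three decimal places, and all four have modulus below 1.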
data 3 months prior to the sales month), with t-stat=7.0. All other quotes data and seasonality data are insignificant.

$Y_t = 523.975 + 73.279t - 1413.74 + 0.495 \times Q_{t-3}$

Figure 6 shows the actual and forecast data using residual regression with the linear trend. The RSS is $0.815 \times 10^6$, which is much lower than that of all the previous methods.

Figure 6. Linear Trend + Residual Forecast

C. ARMAV with Linear Trend
Since the Trend + Residual Regression method greatly improves the forecasting accuracy, my hypothesis is that vector ARMA (the ARMAV model) with Linear Trend may work even better than the Trend + Residual Regression model. The reasons are as follows:
The ARMAV with Linear Trend method considers the increasing trend;
ARMAV has a more complicated algorithm than regression, and thus can capture more input factor impacts [9].

$X_t$ $Q_t$

[Figure: ARMAV + Linear Trend Forecast vs Actual Sales. x-axis: Months (Feb-08 to Mar-10); y-axis: Units; series: ARMAV+Trend Forecast, Actual Sales]

months quote data. Quote data is the number of quotes that the salespeople sent out each month. Some quotes end up as actual sales, but other quotes may be lost. Therefore, quote data gives an indication of final sales data, but is not completely correlated with the sales data. In addition, quotes data may not have an immediate impact on sales data [11]. Sales data may instead be affected by quotes that were sent out one or several months earlier. The purpose of this project is to identify the best sales forecasting model considering historical sales data and quote data [12].
Figure 8 and 9 give the actual sales data and quote data.

[Figure: Actual Sales]
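The trend and residual regression approach, which first removes a linear trend and then regresses the residual $X_t$ on the lagged quotes $Q_{t-3}$, can be sketched as a two-stage least-squares fit. This is a minimal sketch under assumptions: the function name and both data series are hypothetical, and the paper's own estimation procedure may differ in detail.

```python
import numpy as np


def fit_trend_plus_residual(sales, quotes, lag=3):
    """Two-stage fit: Y_t = c0 + c1*t + X_t, then X_t = d0 + d1*Q_{t-lag}."""
    y = np.asarray(sales, dtype=float)
    q = np.asarray(quotes, dtype=float)
    t = np.arange(1, len(y) + 1, dtype=float)

    # Stage 1: linear trend; np.polyfit returns the slope first, then the intercept.
    c1, c0 = np.polyfit(t, y, 1)
    x = y - (c0 + c1 * t)                        # detrended residual X_t

    # Stage 2: regress the residual on quotes sent out `lag` months earlier.
    A = np.column_stack([np.ones(len(y) - lag), q[:-lag]])
    (d0, d1), *_ = np.linalg.lstsq(A, x[lag:], rcond=None)

    fitted = c0 + c1 * t[lag:] + d0 + d1 * q[:-lag]
    rss = float(np.sum((y[lag:] - fitted) ** 2))
    return (float(c0), float(c1), float(d0), float(d1)), rss


# Hypothetical monthly data (assumption, not the paper's data).
quotes = [2000, 2300, 2100, 2600, 2900, 2700, 3200, 3400, 3300, 3600, 3900, 3700]
sales = [600, 650, 760, 800, 930, 950, 1040, 1180, 1170, 1300, 1400, 1420]

coeffs, two_stage_rss = fit_trend_plus_residual(sales, quotes)
```

The four returned coefficients play the same roles as the constants in $Y_t = 523.975 + 73.279t - 1413.74 + 0.495 \times Q_{t-3}$. Because the second stage is itself a least-squares fit with an intercept, its RSS over the fitted months cannot exceed that of the trend line alone.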
Traditional forecasting methodologies can be divided into two big categories: Time-Series Forecasting and Causal Factors Forecasting. In this paper, I will introduce several models in each category, and compare the forecasting accuracies of those models. The models discussed in this paper are listed below [13]:
1. Time series forecasting
Exponential Smoothing (ES) Method
Holt’s Linear Method
ARMA
ARMA with linear trend
2. Causal factors forecasting
Original data regression
Trend and residual regression
ARMAV with linear trend
I extend this paper to models beyond the ARMA model. Although ARMA may provide the best forecasting result because of its model complexity, the model itself is not as intuitive as other methods, e.g. ES and Holt’s method. In addition, an ARMA model is difficult to derive without programming and relevant software (e.g. MATLAB) [14]. Compared to ARMA, other methods (e.g. ES, Holt’s, Regression) can be easily implemented in Excel. Given this, ARMA may be attractive only if it shows a significant advantage in forecasting accuracy over other methods [15]. If ARMA shows only a slightly better result, companies may still prefer simple methods (e.g. ES, Holt’s, Regression) because they are easy to implement.
The RSS of different models are summarized in Table 1.

Table 1. RSS of Multiple Models
Models                                 RSS
Exponential Smoothing (ES)             $3.012 \times 10^6$
Holt’s Linear Method                   $2.638 \times 10^6$
ARMA                                   $2.567 \times 10^6$
ARMA + Linear Trend                    $2.691 \times 10^6$
Original Data Regression               $3.401 \times 10^6$
Residual Regression + Linear Trend     $0.815 \times 10^6$
ARMAV + Linear Trend                   $0.676 \times 10^6$

V. CONCLUSION
This paper studies different forecasting models, from the simple exponential smoothing model and Holt’s linear model to the ARMA model and the ARMAV + Trend model. Based on the above results, we draw the following 5 conclusions:
1. The ARMAV + Linear Trend model and the Residual Regression + Linear Trend model consider both the trend of the data and the input factor (quotes data). They give significantly better forecasting accuracies.
2. The ARMAV + Linear Trend model gives the best forecasting accuracy because ARMAV is a more comprehensive way to model input-output relationships than multiple regression.
3. Among the time-series models without input factors, both the ARMA model and Holt’s method give good results. However, Holt’s method is very simple to understand and easier to implement.
4. The forecast accuracy may improve if we have more data for the ARMA model.
5. ARMA with linear trend does not improve the forecasting accuracy.
In conclusion, to forecast new product sales with limited past data points, the ARMAV with linear trend model is the best model. This is because this model can use input factors to compensate for the less accurate forecast due to limited data in new product sales forecasting.

ACKNOWLEDGMENT
This work is supported by computing Science by The Communication of China (No.XNG1144), the National Natural Science Foundation of P. R. China (No. 60970127) and partly supported by Program for New Century Excellent Talents in University (NCET-09-0709).

REFERENCES
[1] H. L. Willis and J. V. Aanstoos, "Some unique signal processing applications in power system planning," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, 1979, pp. 685-697.
[2] C. Komninakis, "A fast and accurate Rayleigh fading simulator," in IEEE Globecom, vol. 6, 2003, pp. 3306-3310.
[3] W. Xie, L. Yu and S. Y. Xu, "A new method for crude oil price forecasting based on support vector machines," Lecture Notes in Computer Science, 3994, pp. 441-451, June 2006.
[4] E. Williams, "Energy intensity of computer manufacturing: hybrid assessment combining process and economic input-output methods," Environmental science technology, 2004.
[5] E. Williams and R. Kuehr, "Today's markets for used PCs - and ways to enhance them," in Kuehr, R. and Williams, E., Eds., 2003.
[6] A. Santhakumaran and V. Thangaraj, "A Single Server Queue with Impatient and Feedback Customers," Vol. 11, pp. 71-79, June 2000.
[7] F. Iravani and B. Balcoglu, "On Priority Queues with Impatient Customers," Vol. 58, pp. 239-260, July 2008.
[8] L. M. Liu and Z. Kong, "Data Mining in Sales Forecasting" [J], Business Times, 2007, pp. 8-9.
[9] C. He, "Design and Technique research on ETL system" [J], Computer Applications and Software, 2009, pp. 198-201.
[10] D. E. Allen, S. Cruickshank and N. Morkel-Kingsbury, "A Comment on 'The Information Content of Earnings and Prices: A Simultaneous Equations Approach' by W.H. Beaver, M.L. McAnally and C.H. Stinson (1997)," Working Paper, School of Finance and Business Economics, Edith Cowan University, and School of Business and Economics, Monash University.
[11] C. Lee and Tsai, "The time-series relation between monthly sales and stock prices," Proceedings of the 9th Joint Conference on Information Sciences.
[12] G. Agrawal, Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface Knowledge and Data Engineering, 2005.
[13] H. Du, B. Zhang and D. F. Chen, "Design and actualization of SOA-based data mining system," Computer-Aided Industrial Design and Conceptual Design, 2008.
[14] M. Quzzani and A. Bouguettaya, "Efficient Access to Web Services" [J], IEEE Internet Computing, 2004.
[15] G. Apostolikas, "On-line RBFNN based identification of rapidly time-varying nonlinear systems with optimal structure adaptation" [J], Mathematics and Computers in Simulation, 2003.