$$ d\left(l^{(i)}, l^{(j)}\right) = \sqrt{\frac{1}{H}\sum_{h=1}^{H}\left(l_h^{(i)} - l_h^{(j)}\right)^2} \qquad (2) $$
b) Distance between a representative load diagram and
the center of a set of diagrams
$$ d\left(r^{(k)}, L^{(k)}\right) = \sqrt{\frac{1}{n^{(k)}}\sum_{m=1}^{n^{(k)}} d^2\left(r^{(k)}, l^{(m)}\right)} \qquad (3) $$
The MIA [4] depends on the average of the mean distances
between each pattern assigned to the cluster and its center.
$$ MIA = \sqrt{\frac{1}{K}\sum_{k=1}^{K} d^2\left(r^{(k)}, C^{(k)}\right)} \qquad (4) $$
The CDI [4] depends on the distance between the load
diagrams in the same cluster and (inversely) on the distance
between the class representative load diagrams.
In (5) R is the set of the class representative load diagrams.
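$$ CDI = \frac{\sqrt{\dfrac{1}{K}\sum_{k=1}^{K}\left[\dfrac{1}{2\,n^{(k)}}\sum_{m=1}^{n^{(k)}} d^2\left(l^{(m)}, C^{(k)}\right)\right]}}{\sqrt{\dfrac{1}{2K}\sum_{k=1}^{K} d^2\left(r^{(k)}, R\right)}} \qquad (5) $$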
In our case study, three different clustering algorithms
were tested: the Two-Step algorithm, K-Means and Self-
Organizing Maps (SOM). The algorithm that produces the
smallest MIA and CDI values prevails over the others in terms
of partition performance. Indeed, a smaller MIA value
indicates more compact clusters.
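The distances and indexes in (2) to (5) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, each cluster is assumed to be a list of equal-length load diagrams, and the class representative diagram is assumed to be the point-wise mean of the diagrams in the cluster.

```python
from math import sqrt


def dist(a, b):
    """Eq. (2): distance between two load diagrams with the same number of points H."""
    if len(a) != len(b):
        raise ValueError("load diagrams must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def dist_to_set(x, diagrams):
    """Eq. (3): distance between one diagram and a set of diagrams."""
    return sqrt(sum(dist(x, l) ** 2 for l in diagrams) / len(diagrams))


def centroid(diagrams):
    """Representative diagram of a cluster (assumption: point-wise mean)."""
    n = len(diagrams)
    return [sum(column) / n for column in zip(*diagrams)]


def mia(clusters):
    """Eq. (4): mean index adequacy; smaller means more compact clusters."""
    reps = [centroid(c) for c in clusters]
    total = sum(dist_to_set(r, c) ** 2 for r, c in zip(reps, clusters))
    return sqrt(total / len(clusters))


def cdi(clusters):
    """Eq. (5): clustering dispersion indicator."""
    reps = [centroid(c) for c in clusters]
    k = len(clusters)
    # Numerator: average intra-cluster spread.
    intra = sum(
        sum(dist_to_set(l, c) ** 2 for l in c) / (2 * len(c)) for c in clusters
    ) / k
    # Denominator: spread among the class representative diagrams (set R).
    inter = sum(dist_to_set(r, reps) ** 2 for r in reps) / (2 * k)
    return sqrt(intra) / sqrt(inter)
```

Both indexes fall when the diagrams inside each cluster lie closer together, and CDI falls further when the class representatives lie farther apart, which is why the smallest values are preferred when comparing partitions.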
B. Classification Model
In classification problems, a set of pre-classified data points
is given and the classification algorithm tries to discover a
rule that mimics the observed classification as closely as
possible. A classification problem is a supervised learning
task in which the output information is a discrete
classification. Broadly speaking, the classification task
consists of building a classification model that can be
applied to unclassified records in order to assign them to
classes. In other words, it consists of examining the features
of an object and assigning it to one of the predefined classes
(supervised learning).
To obtain more relevant information to describe the
consumption patterns of each cluster population, we have used
a rule-based modeling technique, the C5.0 classification
algorithm. This algorithm was chosen because it is easy to
understand: the rules derived from the model have a very
straightforward interpretation.
The classification model should allow a new consumer to be
assigned to a certain cluster on the basis of the rules
generated by the classification model, so the rules must be
intelligible. For that reason, normalized shape indicators were
used as attributes in the classification model. As mentioned in
Section III, the commercial indexes have no relation with the
load curves, so used in isolation they cannot provide a good
consumer classification. Thus, it was necessary to define
indexes that yield meaningful rules with a satisfactory
interpretation and, therefore, express relevant information
about the electricity consumer behaviour.
These indexes are derived from the daily load diagrams,
and some of them are based on the set of indexes proposed in
[1], [4]. They give information about the daily load curve
shape and about the consumption pattern of each consumer.
These indexes are used as attributes in the classification
process. Table I presents the set of indexes that were used,
where Pmax is the maximum power demand, Pmin is the
minimum power demand and Pav is the average power
demand during a representative day.
TABLE I
NORMALIZED SHAPE INDEXES FOR CHARACTERIZING THE LOAD PROFILES
Parameter | Definition | Acquisition Period
Daily Pav/Pmax | $f_1 = P_{av,day} / P_{max,day}$ | 1 day
Daily Pmin/Pmax | $f_2 = P_{min,day} / P_{max,day}$ | 1 day
Daily Pmin/Pav | $f_3 = P_{min,day} / P_{av,day}$ | 1 day
Night impact | $f_4 = \frac{1}{3} \, P_{av,night} / P_{av,day}$ | 1 day (8 hours night, from 11 p.m. to 6 a.m.)
Lunch impact | $f_5 = \frac{1}{8} \, P_{av,lunch} / P_{av,day}$ | 1 day (2 hours lunch, from 12 p.m. to 2 p.m.; 16 hours daytime, from 6 a.m. to 11 p.m.)
Daily Pav/Pinst | $f_6 = P_{av,day} / P_{inst}$ | 1 day
These indexes were extracted directly from the
representative load curve and they represent the load diagram
shape.
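As an illustration of how the indexes of Table I could be computed, the sketch below takes a representative day as 24 hourly power values. Several things here are assumptions and not part of the paper: the hourly sampling, the function name, the exact hours counted as night and lunch (passed as parameters), and the treatment of Pinst as a reference power supplied by the caller, since this section does not define it.

```python
def shape_indexes(load, night_hours=(23, 0, 1, 2, 3, 4, 5, 6),
                  lunch_hours=(12, 13), p_inst=None):
    """Return the shape indexes f1..f5 (and f6 if p_inst is given).

    load        -- 24 hourly power values, index = hour of day (assumption)
    night_hours -- hours counted as night (assumed: 8 hourly samples)
    lunch_hours -- hours counted as lunch (assumed: 2 hourly samples)
    p_inst      -- reference power for f6 (hypothetical input, optional)
    """
    if len(load) != 24:
        raise ValueError("expected 24 hourly values")
    p_max = max(load)
    p_min = min(load)
    p_av = sum(load) / len(load)
    if p_max <= 0 or p_av <= 0:
        raise ValueError("load must contain positive power values")

    p_av_night = sum(load[h] for h in night_hours) / len(night_hours)
    p_av_lunch = sum(load[h] for h in lunch_hours) / len(lunch_hours)

    indexes = {
        "f1": p_av / p_max,                    # daily Pav/Pmax
        "f2": p_min / p_max,                   # daily Pmin/Pmax
        "f3": p_min / p_av,                    # daily Pmin/Pav
        "f4": (1 / 3) * p_av_night / p_av,     # night impact
        "f5": (1 / 8) * p_av_lunch / p_av,     # lunch impact
    }
    if p_inst is not None:
        indexes["f6"] = p_av / p_inst          # daily Pav/Pinst
    return indexes
```

For a perfectly flat diagram, f1, f2 and f3 all equal 1, while f4 and f5 reduce to the weighting factors 1/3 and 1/8, which gives a quick sanity check of the definitions.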
The model evaluation is performed using ten-fold cross
validation presented in [9]. As described in [5], the evaluation
is performed by randomly splitting the initial sample in 10
sub-samples. The model is trained using 9/10 of the data set
and tested with the 1/10 left. This process is performed ten
times on different training sets and, finally, the ten error
estimates are averaged to yield an overall error estimate.
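The ten-fold procedure described above can be sketched as follows. This is a generic illustration using only the Python standard library; the `fit` and `predict` callables, the helper names and the majority-class model are hypothetical stand-ins, not the C5.0 model used in the study.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Randomly split the indices 0..n-1 into k disjoint folds."""
    if n < k:
        raise ValueError("need at least k samples")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Average misclassification rate over k train/test rounds."""
    errors = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Train on the other k-1 folds, test on the held-out fold.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k


def fit_majority(X, y):
    """Toy model for the usage example: remember the most frequent class."""
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    return model
```

Calling `cross_validate(X, y, fit_majority, predict_majority)` returns the averaged error estimate; any classifier can be plugged in by replacing the two callables.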
The classification model can use all available input
attributes, choosing the most relevant ones for each rule.
Figure 2 shows the classification model structure.
Fig. 2- Classification model structure
IV. CASE STUDY
The case study uses data concerning 229 medium voltage
customers of the Portuguese distribution company. The sample
was collected over a period of 3 months in Summer and
3 months in Winter, for working days and weekends, and the
consumed power was recorded every 15 minutes.
In the first step, the typical daily load curve of each
customer was determined. During data preprocessing,
twenty-one customers were discarded from the initial data,
leaving 208 consumers to be analyzed.
During this stage, it was also necessary to fill in missing
measurement values. When the missing values spanned less than
1 hour, they were estimated by extrapolation; beyond that, a
multilayer perceptron (MLP) artificial neural network was used
to fill in the missing values so that those customers could be
included in the study.
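The handling of short gaps can be illustrated with a small sketch. Note the substitution: the paper states that extrapolation was used for gaps under 1 hour, whereas the code below uses plain linear interpolation between the neighbouring valid samples as a simple stand-in. With a 15-minute recording interval, a gap under 1 hour corresponds to at most 3 consecutive missing samples; the function name, the use of `None` for missing values and the returned list of long gaps (to be handled by another estimator such as the MLP) are assumptions.

```python
def fill_short_gaps(samples, max_gap=3):
    """Fill interior runs of at most max_gap missing values (None).

    Returns (filled, long_gaps), where long_gaps lists the (start, end)
    index ranges, end exclusive, that were left untouched.
    """
    filled = list(samples)
    long_gaps = []
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # first index after the run of missing values
        length = end - start
        if length <= max_gap and start > 0 and end < n:
            left, right = filled[start - 1], filled[end]
            for j in range(start, end):
                # Linear interpolation between the two valid neighbours.
                t = (j - start + 1) / (length + 1)
                filled[j] = left + t * (right - left)
        else:
            # Too long, or touching the edge of the record: leave as missing.
            long_gaps.append((start, end))
    return filled, long_gaps
```

For example, `fill_short_gaps([1.0, None, None, None, 5.0])` fills the three missing samples, while a run of four or more is reported in `long_gaps` and left for a different estimator.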
For this population, commercial data were also available:
the monthly energy consumption, the activity type and the
hired power.
All experiments that will be described in this section were
conducted using Clementine version 8.5 [Clementine Data
Mining System, web page - http://www.spss.com]. This is an
integrated DM toolkit, which uses a visual-programming
interface, and supports all KDD stages.
1. Data Preprocessing
A previous data-cleaning phase is essential to detect and
correct bad data (noise suppression), as well as a data-
treatment phase to derive data according to DM algorithms
that will be used [6].
In this first data-cleaning phase, some damaged files and
some customers without registered values were detected and
removed from the initial data sample. Missing measurement
values were also detected. These failures can be due to
transmission interruptions or damage in the measuring
equipment. A multilayer perceptron (MLP) artificial neural
network was used to estimate the missing power values detected
in data preprocessing. The neural network was trained, and the
missing values were estimated, using the data from similar
days of each customer's consumption [10].
[Fig. 2 diagram content: representative load diagrams; load shape indexes (each representative load curve is represented by a set of load shape indexes, f = [f1, f2, f3, f4, f5, f6]); classification model (C5.0 classification algorithm; input attributes: vector {f}; training set; test set; ten-fold cross validation); generation of rules (decision tree); evaluation of accuracy and analysis of the confusion matrix]
[Fig. 4 plot content: MIA and CDI, on a scale from 0 to 1, versus the number of clusters (3 to 15)]
Tables II and III show the distribution of the sample
population according to the hired power and the activity
type.
TABLE II
DESCRIPTION OF THE CONSUMERS DATA SET - HIRED POWER
Contracted Power (kW) | Up to 250 | 251 to 500 | 501 to 1000 | 1001 to 1500 | More than 1500
Consumers Distribution (%) | 52.4 | 18.3 | 13.5 | 7.7 | 8.1
TABLE III
DESCRIPTION OF THE CONSUMERS DATA SET - ACTIVITY TYPE
Activity Type | Consumers Distribution (%) | Activity Type | Consumers Distribution (%)
20 | 2.4 | 200 | 0.5
30 | 6.3 | 210 | 0.5
40 | 1.9 | 220 | 4.8
50 | 0.5 | 230 | 1.0
60 | 9.6 | 240 | 1.0
70 | 4.8 | 270 | 12.0
90 | 4.8 | 280 | 4.3
110 | 0.5 | 290 | 1.0
120 | 1.0 | 310 | 0.5
130 | 0.5 | 330 | 1.0
140 | 3.8 | 340 | 2.4
160 | 1.0 | 350 | 21.5
170 | 0.5 | 360 | 4.3
190 | 1.9 | 370 | 5.7
With all data completed, a representative load curve was
obtained for each customer by averaging its daily load
diagrams. Therefore, each customer is now represented by
one typical load curve. However, these representative load
profiles are expressed in terms of power consumption, which
means that the diagram amplitude is directly proportional to
the amount of electric energy consumed. Since the aim is to
compare the consumption pattern among customers, the power
consumption was normalized to the [0,1] range using the
peak power of each representative load diagram, which
preserves the information related to the initial load
profile shape. Each customer is now represented by a
normalized representative daily load curve. In order to reduce
the data volume, the representative load curves of all
customers were separated into two loading conditions, working
days and weekends.
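The averaging and peak normalization described above amount to two short steps, sketched below. The function names are hypothetical, and each daily load diagram is assumed to be a list with the same number of samples.

```python
def representative_curve(daily_diagrams):
    """Point-wise average of a customer's daily load diagrams."""
    n = len(daily_diagrams)
    return [sum(samples) / n for samples in zip(*daily_diagrams)]


def normalize_to_peak(curve):
    """Scale a load curve to the [0, 1] range using its peak power."""
    peak = max(curve)
    if peak <= 0:
        raise ValueError("peak power must be positive")
    return [p / peak for p in curve]


# Two (very short) daily diagrams of one customer, in kW.
days = [[10.0, 20.0, 40.0], [30.0, 20.0, 40.0]]
curve = normalize_to_peak(representative_curve(days))
```

Dividing by the peak keeps the ratio between any two points of the curve unchanged, so customers with very different consumption levels but the same daily pattern end up with the same normalized curve.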
2. Clustering Algorithms Application and Consumers
Characterization
In this stage, the customers are grouped into classes
according to a similarity criterion, so that the load patterns
are grouped on the basis of their distinguishing features. The
normalized representative daily load curve, illustrated in (1),
was used. The choice of the clustering algorithm is decisive.
Thus, three different algorithms were chosen and tested:
Two-Step Cluster Analysis
K-means
Kohonen Net - Self Organizing Features Maps
For each clustering result produced by each algorithm,
the clustering performance was compared using the indexes
described in (4) and (5). These indexes were also used to
choose the number of clusters.
As Figure 3 shows, the performance of the Two-Step cluster
algorithm and K-means is very close. The Two-Step algorithm
is suited to handling a large number of objects and, in the
future, with the installation of real-time measuring
equipment, there will be a huge amount of data to treat.
Thus, the Two-Step cluster algorithm was chosen.
[Bar chart: MIA and CDI values (axis from 0 to 0.9) for the Two-Step, K-means and SOM algorithms]
Fig. 3 - Clustering performance comparison, for weekends (9 clusters)
Before applying the clustering algorithms, the number of
classes had to be defined. The number of clusters is usually
defined a priori, and the expert's information (from the
distribution company) is essential at this stage. In [5] it is
stated that the number of classes must belong to
the range [2, M ], where M is the number of consumers in
the data set. Typically, however, the number of clusters must
be small enough to allow the definition of a different tariff
structure for each class, while, on the other hand, the
partition precision improves as the number of clusters
increases. Thus, the number of clusters must be a compromise.
The two indexes, (4) and (5), were computed to evaluate the
quality of the partition. Figure 4 shows that the indexes
decrease as the number of clusters increases and that, for
more than 9 clusters, the further reduction is not very
significant. Based on this, and on information from the
electricity utility that the number of clusters should belong
to the [6,9] range, 9 clusters were chosen.
[Plot residue for Figs. 5 and 6: two plots of power (p.u., 0 to 1) versus time (h, 1 to 23), one curve per cluster, Cluster 1 to Cluster 9]
Fig. 4- Evolution of the indexes with the number of clusters
After choosing the clustering algorithm and the intended
number of clusters, the normalized representative load curves
were used to obtain the clusters.
Then, with the clusters produced by the Two-Step algorithm,
the representative diagram of each cluster was obtained for
weekends and working days, for a period of one year, by
averaging the representative load diagrams of the clients
assigned to the same cluster.
Figure 5 shows the representative load diagram obtained
for each cluster, using the power measurements on working
days directly. Each curve represents the load profile of the
corresponding customer class.
Fig. 5 - Typical load profile for working days
The results showed that, for instance, cluster-1 has 28
consumers with the same consumption pattern, while cluster-8
has 4 customers with the same consumption shape, which
represents 1.92% of the sample. As can be seen in Figures 5
and 6, the clustering module separated the customer
population well, and the representative load diagrams created
have distinct load shapes.
Fig. 6 - Typical load profile for weekends
In [11], a trial was performed to search for associations
between the clusters and the components of the contractual
data (commercial indexes). The results showed that a
poor correlation exists between the main clusters and the
contractual data, which means that the contractual data are
highly ineffective for characterizing the electrical behavior
of the customers. Therefore, several shape indexes that
represent the shape of the load curve, described in Table I,
were extracted in order to use them in the classification
model.
3. Consumers Classification
To obtain more relevant information to describe the
consumption patterns of each cluster population, we have used
a rule-based modeling technique, the C5.0 classification
algorithm. In this first phase, we have chosen this algorithm
because it is easy to understand: the rules derived from the
model have a very straightforward interpretation. In the
future, we aim to use other algorithms and compare
their results with those obtained with C5.0.
The classification model should allow a new consumer to be
assigned to a certain cluster on the basis of the rules
generated by the classification model, so the rules must be
intelligible. For that reason, normalized shape indicators were
used as attributes in the classification model. The index
vector is formed by the indexes shown in Table I.
An artificial neural network was also used to analyze the
importance of each attribute (shape index). The f6 index was
found to have little importance and was therefore
removed from the classification.
Once again, the Two-Step algorithm was used to find the
new 9 clusters for working days and weekend days, but now
using the normalized shape indicators. Each data set was then
split into a training group and a test group. The training
group used by the classification model has 2/3 of the data,
and the remaining 1/3 of the data was used for testing.
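A 2/3 to 1/3 split of this kind can be sketched as below. The paper does not say how the records were assigned to each group, so the random, unstratified shuffle, the fixed seed and the function name are all assumptions made for illustration.

```python
import random


def split_two_thirds(records, seed=0):
    """Return (training, test) with about 2/3 and 1/3 of the records."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumed: random, unstratified
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Customer identifiers stand in for the records of shape indexes.
training, test = split_two_thirds(range(208))
```

With the 208 consumers of the case study, this yields 138 training records and 70 test records.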
Table IV presents an example of a rule set obtained from
the C5.0 algorithm, in this case for the working days data
set. The obtained rules are simple and easy to understand.
TABLE IV
RULE SET FOR THE WORKING DAYS CLASSIFICATION MODEL
If f3 <= 0.48 and f2 <= 0.13 and f5 <= 0.55 and f1 <= 0.35 and f4 <= 0.31 then cluster-8
If f3 <= 0.48 and f2 <= 0.13 and f5 <= 0.55 and f1 <= 0.35 and f4 > 0.31 then cluster-9
If f3 <= 0.48 and f2 <= 0.13 and f5 <= 0.55 and f1 > 0.35 then cluster-5
If f3 <= 0.48 and f2 <= 0.13 and f5 > 0.55 and f5 <= 0.6 then cluster-7
If f3 <= 0.48 and f2 <= 0.13 and f5 > 0.55 and f5 > 0.67 and f2 <= 0.06 then cluster-6
If f3 <= 0.48 and f2 <= 0.13 and f5 > 0.55 and f5 > 0.67 and f2 > 0.06 then cluster-7
If f3 <= 0.48 and f2 > 0.13 and f4 <= 0.24 then cluster-4
If f3 <= 0.48 and f2 > 0.13 and f4 > 0.24 then cluster-5
If f3 > 0.48 and f3 <= 0.78 and f2 <= 0.44 then cluster-3
If f3 > 0.48 and f3 <= 0.78 and f2 > 0.44 then cluster-2
If f3 > 0.48 and f3 > 0.7 then cluster-1
The classification model used all the available attributes,
selecting for each rule only the attributes that provided
the largest information gain.
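To show how such a rule set assigns a new consumer to a cluster, the rules of Table IV can be transcribed as ordered data and evaluated with a first-match policy. This is a sketch under stated assumptions, not the C5.0 output itself: comparisons that are not an explicit ">" are read as "<=", the thresholds are copied as printed (including 0.6 and 0.7), the first-match policy and the names `RULES` and `classify_working_day` are hypothetical, and `None` is returned when no printed rule applies.

```python
import operator

LE, GT = operator.le, operator.gt

# (conditions, cluster): each condition is (index name, comparison, threshold).
RULES = [
    ([("f3", LE, 0.48), ("f2", LE, 0.13), ("f5", LE, 0.55), ("f1", LE, 0.35), ("f4", LE, 0.31)], 8),
    ([("f3", LE, 0.48), ("f2", LE, 0.13), ("f5", LE, 0.55), ("f1", LE, 0.35), ("f4", GT, 0.31)], 9),
    ([("f3", LE, 0.48), ("f2", LE, 0.13), ("f5", LE, 0.55), ("f1", GT, 0.35)], 5),
    ([("f3", LE, 0.48), ("f2", LE, 0.13), ("f5", GT, 0.55), ("f5", LE, 0.6)], 7),
    ([("f3", LE, 0.48), ("f2", LE, 0.13), ("f5", GT, 0.55), ("f5", GT, 0.67), ("f2", LE, 0.06)], 6),
    ([("f3", LE, 0.48), ("f2", LE, 0.13), ("f5", GT, 0.55), ("f5", GT, 0.67), ("f2", GT, 0.06)], 7),
    ([("f3", LE, 0.48), ("f2", GT, 0.13), ("f4", LE, 0.24)], 4),
    ([("f3", LE, 0.48), ("f2", GT, 0.13), ("f4", GT, 0.24)], 5),
    ([("f3", GT, 0.48), ("f3", LE, 0.78), ("f2", LE, 0.44)], 3),
    ([("f3", GT, 0.48), ("f3", LE, 0.78), ("f2", GT, 0.44)], 2),
    ([("f3", GT, 0.48), ("f3", GT, 0.7)], 1),
]


def classify_working_day(indexes):
    """Return the cluster of the first rule whose conditions all hold."""
    for conditions, cluster in RULES:
        if all(compare(indexes[name], threshold)
               for name, compare, threshold in conditions):
            return cluster
    return None  # no printed rule covers this combination of indexes


consumer = {"f1": 0.30, "f2": 0.10, "f3": 0.40, "f4": 0.30, "f5": 0.50}
cluster = classify_working_day(consumer)
```

Keeping the rules as data rather than nested `if` statements makes each line directly comparable with the corresponding row of the table, which matches the paper's emphasis on intelligible rules.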
The model was tested and its overall accuracy was
94.83% for working days and 95.45% for weekend days, which
shows that the results are satisfactory.
V. CONCLUSION AND FURTHER WORK
This paper presents a methodology for the characterization
and classification of electric medium voltage consumers,
based on historical data. The performance of three different
clustering algorithms was compared in order
to obtain the representative load diagrams of each customer.
The Two-Step cluster algorithm was chosen and the typical
load profile of each class was obtained, taking into account
the criterion for the number of clusters. These clusters were
then characterized using the C5.0 classification algorithm.
The classification results show that the commercial
parameters are poorly connected to the load profiles.
The clustering algorithm was able to produce load profiles
with distinctly different load shapes, and the classification
algorithm achieves a good overall accuracy for both working
days and weekends loading conditions. Normalized shape indexes
were used as attributes in the classification model, which
generated a rule set. The shape indexes were extracted from
the representative load curve and express the shape of the
representative load diagrams. The rules are simple and easy
to understand.
By knowing the representative load diagram and following
the electrical behaviour of the consumers, it will be possible
to propose new tariff structures for each customer
class, according to its consumption pattern. These structures
must be sufficiently flexible to follow the variations in the
load patterns of the customers.
The distribution companies, as well as the consumers, can
take advantage of the knowledge of typical load profiles, and
this knowledge can improve the settlements between electric
power suppliers and consumers.
The development of new tariff structures, in articulation
with electricity market prices, will be a potential tool for
the retail companies.
VI. ACKNOWLEDGMENT
The authors would like to express their gratitude to EDP
Distribuição, the Portuguese Distribution Company, for
supplying the data used in this work.
The authors would also like to acknowledge FCT, FEDER,
POCTI, POSI, POCI and POSC for their support to R&D
Projects and GECAD Unit.
VII. REFERENCES
[1] Sérgio Ramos, Zita Vale, José Santana & Jorge Duarte, "Data Mining
Contributions to Characterize MV Consumers and to Improve the
Suppliers-Consumers Settlements", PES GM 07 - IEEE Power
Engineering Society, Tampa, Florida, USA, 24-28 July, 2007.
[2] Pitt, B. and D. Kirschen, Applications of Data Mining Techniques to Load
Profiling, in Proc. IEEE PICA, Santa Clara, CA, May, 1999.
[3] Gellings, Clark W., Emerging Energy Customers of the Twenty-First
Century, CIGRE/IEEE Technical Session, IEEE Power Engineering
Review, October, 1998.
[4] Chicco, G., Napoli, R., Postolache, P., Scutariu, M. and Toader, C.,
Customer Characterization Options for Improving the Tariff Offer, IEEE
Transactions on Power Systems, Vol. 18, No. 1, February, pp. 381-387,
2003.
[5] Figueiredo V., Rodrigues F., Vale Z. & Gouveia, B., An Electric Energy
Characterization Framework based on Data Mining Techniques. In the
IEEE Transactions on Power Systems, Vol. 20, No. 2, pp. 596-602, May
2005.
[6] Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, From Data
Mining to Knowledge Discovery: An Overview. In Advances in
Knowledge Discovery and Data Mining, pages 1-34. AAAI/MIT Press,
1996.
[7] Frawley, W.J., G. Piatetsky-Shapiro, C. Matheus, Knowledge Discovery
in Databases: An Overview, Technical Report, 1995.
[8] Sérgio Ramos, Zita Vale, Fátima Rodrigues, Raul Pinheiro, & Judite
Ferreira, "Decision Support System for Improving the Tariff Offer Based
on Patterns Extracted from MV Load Diagrams", ICKEDS06, in Proc. of
the International Conference on Knowledge Engineering and Decision
Support, pp. 107-115, Lisbon, Portugal, May, 2006.
[9] Witten, I. & Frank, E., Data Mining - Practical Machine Learning Tools
and Techniques with Java Implementations, Morgan Kaufmann
Publishers, Academic Press, 2000.
[10] Sérgio Ramos, Zita Vale, José Santana & Fátima Rodrigues, "An
Approach to the Consumer-Supplier Relationship Supported by Data
Mining Techniques for MV Costumers", WSEAS Transactions on Power
Systems, Issue 7, Volume 1, July, pp. 1350-1357, 2006.
[11] Figueiredo V., Duarte F.J., Rodrigues F., Vale Z., Ramos, C., Ramos, S.,
Gouveia B., 2003, Electric Energy Customer Characterization by
Clustering, Proceedings of ISAP 2003, Lemnos, Greece.
VIII. BIOGRAPHIES
Sérgio Ramos graduated from the Polytechnic Institute of
Porto in 1999 and received his MSc degree from the
Instituto Superior Técnico (Lisbon, Portugal) in 2006.
He is currently an Assistant Professor of Electric
Power Systems in the Polytechnic Institute of Porto. His
research interests include competitive electricity
markets, energy efficiency, load research and electrical
installations.
Zita Vale graduated from the University of Porto in 1986
and received her Ph.D. degree in Electrical Engineering
from the same University in 1993.
She is currently a Coordinator Professor in the Polytechnic Institute of Porto.
Her research areas include Power Systems Operation and Control, Electricity
Markets, Decision Support and Artificial Intelligence.