Data Mining Techniques Application in Power Distribution Utilities

Abstract-- This paper presents a framework for characterizing
electricity medium voltage (MV) consumers, supported on the
knowledge discovery in databases (KDD) process. Data mining
(DM) techniques are used to discover a set of typical MV
consumer load profiles and, from them, to extract knowledge
about electric energy consumption patterns. A hierarchical
clustering algorithm is used to form the different customer
classes. The framework includes several steps: data
pre-processing, application of DM algorithms, construction of a
classification model and, finally, interpretation of the
discovered knowledge. The proposed framework is validated with
a case study that uses real databases of MV consumers.

Index Terms-- Classification, clustering, data mining, load
profiles, electricity markets.
I. INTRODUCTION
In competitive electrical energy markets, suppliers are expected
to know, as well as possible, the electrical consumption
behaviour of their customers, in order to offer them suitable
electric energy services at the least cost.
Knowledge about customers' consumption patterns is extremely
important for reaching agreements on the price of electricity
between consumers and suppliers, and for the definition of
marketing policies and innovative contracts and services.
Supplier companies cluster consumers into representative classes
and use the representative load profile of each class to study
consumers' behaviour.
Knowledge about customers' behaviour can be a useful decision
tool, not only for the retail companies but also for the
consumers themselves. The retail companies can use the knowledge
resulting from the load profile study to identify the aspects
that increase the peaks of the load diagrams, to develop
consumer-specific contracts (allowing a better use of the
network) and, finally, to optimize their electric power purchase
offers. As far as electric energy consumers are concerned, the
knowledge obtained can support the choice of the electricity
supplier with the best tariff structure proposal, allow some
modulation of their electric consumption habits and, finally,
enable them to enter into interruptible electric energy supply
contracts [1].

For the suppliers, knowing how and when their customers use
electricity will be crucial for leadership in the electricity
market environment.
Customer classes can be conveniently extracted from the
customers' real electrical behaviour (real databases). In [2],
data mining techniques are used to determine load profiles
taking into consideration the effect of weather conditions.
A load profile can be defined as a pattern of electricity
demand for a consumer, or group of consumers, over a given
period of time.
For these kinds of studies, the quality of the databases is
essential to obtain good results, as is additional information
that can influence the electric energy consumption, such as
type of activity, hired (contracted) power, consumed energy,
weather conditions and tariff type [3].
In recent years, significant research effort has been devoted
to clustering techniques for obtaining daily load profiles
[4], [5].
Nowadays, a large amount of data is collected and recorded for
MV customers and, in the near future, the same will happen for
low voltage (LV) customers. This increase in data brings a new
challenge to consumer characterization: the new methodologies
and algorithms must handle huge databases. As the volume of the
databases grows, data preprocessing (the data-cleaning
operation) requires particular attention.
The objective of this paper is to present a framework to
characterize MV consumers that supports retail and distribution
companies, as well as consumers, in acquiring knowledge from
electricity consumption databases.
Hierarchical clustering algorithms were used to obtain the
typical load profiles.
The remainder of this paper is organized as follows. Section II
briefly summarizes the main steps of the medium voltage
consumer characterization, based on Knowledge Discovery in
Databases. Section III presents the data mining techniques used
to characterize the typical load profiles. Section IV presents
a case study and the typical load profiles obtained with the
clustering algorithms. Finally, the last section presents some
conclusions and possible directions for future work.
Sérgio Ramos, Zita Vale
978-1-4244-1904-3/08/$25.00 © 2008 IEEE
[Fig. 1 diagram blocks: Data base; Data pre-processing (data
cleaning, filling up missing values, estimation of values);
Formatted data (reduction of data volume, normalization); Data
mining techniques, comprising Clustering algorithms (Two-Step,
K-Means, SOM) and Classification model (C5.0 classification
algorithm, shape indices); Clusters; Typical load profile; Rule
sets, decision tree, overall accuracy; New tariff structures;
Relationship consumer/electricity supplier]
II. MV CUSTOMERS CHARACTERIZATION
Figure 1 shows the structure of the study process, based on
Knowledge Discovery in Databases (KDD) [5]-[7]. The framework
comprises several phases, which include the application of data
mining (DM) techniques.

Fig. 1- Study process structure of the MV consumer characterization framework

The framework is divided into steps with different degrees of
complexity.
1. Data and Features Selection.
This phase includes the creation and definition of a data
sample to which the whole KDD process will be applied. This
step defines which customers are chosen for the monitoring
campaigns (MV, LV), which parameters are collected (kW, kVA),
which period of time is covered (month, season) and the
recording cadence of the data (15 minutes, 30 minutes, etc.).

2. Data Preprocessing.
Data always contain problems, which is why a prior
data-cleaning phase is required to detect and correct bad data,
as well as a data-treatment phase to derive data suited to the
DM algorithms that will be used. The load curves of each
consumer are examined to verify whether there are missing
measurement values. Missing values can be filled in using a
neural net [8], with the historical data serving as support to
estimate the missing power values. Customers without registered
values must be removed from the initial database so as not to
degrade the clustering operation.
3. Formatted Data.
Each customer is represented by its daily load diagrams. The
number of daily load curves is directly proportional to the
number of collection days. To reduce the data volume, and
therefore speed up the computation, each customer can be
represented by just one representative load curve, built by
averaging the daily load diagrams of that customer. Each
customer is then defined by one representative load curve for
each of the loading conditions. The loading conditions, such as
the season of the year and the type of day (working day,
weekend day or national holiday), must be defined beforehand so
that the data can be separated into smaller sets. The loading
conditions must be chosen according to the type of customer, so
that they reflect the way in which they influence the
electricity consumption. Usually, the load diagrams use the
field measurements (electric power) directly. MV customers span
a vast range of power values; therefore, to group the customers
into classes while preserving the consumption pattern, the data
must be normalized. To maintain the shape of the diagrams and
compare them in terms of consumption pattern, each diagram can
be normalized to the [0,1] range using the peak power of the
respective representative load diagram.
4. Data Mining Techniques.
In this step a DM algorithm is chosen and applied to discover
patterns and interesting relations in the data; options include
regression, classification, neural nets and clustering.
Choosing the models and parameters is at times difficult, since
the choice should be consistent with the objectives of the KDD
process. The patterns discovered during data exploration are
usually represented as classification rule sets, decision
trees, association rules, regression models, etc.
5. Extracted Knowledge.
Once the models are obtained and validated, possibly with the
support of experts, the extracted knowledge and resulting
conclusions can be considered valid and sufficiently
satisfactory. This phase includes the interpretation and
assessment of the extracted data relations so as to transform
them into knowledge. It is also necessary to integrate the
obtained knowledge with previously existing knowledge.
III. DATA MINING TECHNIQUES APPLICATION
Data mining techniques are applied to characterize the typical
load profiles, starting from an initial data set. The DM
process uses algorithms to discover patterns in the data
according to a similarity criterion. The pattern recognition
operations combine unsupervised and supervised learning
techniques (in the clustering and classification tasks,
respectively).
After data preprocessing, with all data completed, reduced in
volume and normalized, each customer is represented by its
representative load curve (1).

$$l^{(m)}_{norm} = \left[\, l^{(m)}_{1,norm}, \ldots, l^{(m)}_{h,norm}, \ldots, l^{(m)}_{H,norm} \,\right], \quad h \in \{1, \ldots, H\},\ m \in \{1, \ldots, M\} \qquad (1)$$

where $l^{(m)}_{norm}$ is the normalized vector of the daily
load curve of each customer, m is the index of the consumer
under analysis, M is the number of consumers in the sample and
H = 96 is the number of 15-minute intervals in a day.
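The construction of the normalized representative load curve can be illustrated with a minimal sketch. It assumes that the daily diagrams of one customer are available as equal-length lists of power values; the function names and the sample values are hypothetical and only H = 4 points are used for brevity (the study uses H = 96).

```python
def representative_curve(daily_curves):
    """Average a customer's daily load diagrams point by point."""
    n_days = len(daily_curves)
    n_points = len(daily_curves[0])
    return [sum(day[h] for day in daily_curves) / n_days
            for h in range(n_points)]


def normalize_by_peak(curve):
    """Scale a load curve to the [0, 1] range using its own peak power."""
    peak = max(curve)
    if peak <= 0:
        raise ValueError("curve has no positive peak")
    return [p / peak for p in curve]


# Hypothetical customer: two metered days, four samples each (kW)
days = [[10.0, 40.0, 30.0, 20.0],
        [30.0, 40.0, 50.0, 20.0]]
rep = representative_curve(days)
norm = normalize_by_peak(rep)
```

Dividing by the peak of the representative curve keeps the shape of the diagram while removing the effect of the customer's size.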
Until now, distribution companies have classified their
customers by commercial indexes, such as the consumer's
activity type, hired power, tariff option and supply voltage
level. However, the correlation between these indexes and the
typical load profile is extremely poor or nonexistent. The
framework of Fig. 1 comprises two main models, according to the
techniques used and the results obtained. In the first model,
clustering algorithms (unsupervised operation) are applied to
the set of customers' load curves. After the clustering
algorithm is chosen and calibrated, the clusters of consumers
and the representative load profile of each cluster are formed.
The shapes of the different classes represent the different
consumption patterns detected in the data. In the second model,
a classification model (supervised operation) is built using a
decision tree so that, when applied to a new unclassified
record, it predicts the class to which the record belongs. In
the future, this will make it possible to assign to each new
consumer the consumption profile that best represents it.

A. Customers Clustering.
The main goal of load profiling is to group the data set into
classes in such a way that the objects of a cluster are highly
similar to each other and have low similarity to the objects of
other classes. When trying to discover knowledge from
databases, one of the first tasks is to identify groups of
similar objects, carrying out cluster analysis to obtain data
partitions. Several clustering methods can be used for cluster
analysis, yet for a given data set each method may identify
groups with different member objects. A decision must therefore
be taken on the clustering method that produces the best
partition for a given data collection. To support this
decision, we used indices that measure the quality of the data
partition. Different clustering algorithms were run, and the
quality of each partition was evaluated with two adequacy
measures, the Mean Index Adequacy (MIA) and the Clustering
Dispersion Indicator (CDI), as described in [4].
The distances (2) and (3) are defined to assist the
formulation of the adequacy measures:
a) Distance between two load diagrams

$$d(l_i, l_j) = \sqrt{\frac{1}{H} \sum_{h=1}^{H} \left( l_i(h) - l_j(h) \right)^2} \qquad (2)$$
b) Distance between a representative load diagram and the
center of a set of diagrams

$$d(r^{(k)}, L^{(k)}) = \sqrt{\frac{1}{n^{(k)}} \sum_{m=1}^{n^{(k)}} d^2\left( r^{(k)}, l^{(m)} \right)} \qquad (3)$$
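The two distances can be sketched in a few lines. This is only an illustration under the assumption that load diagrams are stored as equal-length lists of normalized power values; the function names and the toy diagrams are hypothetical.

```python
import math


def diagram_distance(li, lj):
    """Distance (2): root-mean-square difference between two load diagrams."""
    H = len(li)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(li, lj)) / H)


def distance_to_set(r, diagrams):
    """Distance (3): between a representative diagram r and a set of diagrams."""
    n = len(diagrams)
    return math.sqrt(sum(diagram_distance(r, l) ** 2 for l in diagrams) / n)


# Toy diagrams with H = 4 points
a = [0.0, 1.0, 1.0, 0.0]
b = [1.0, 1.0, 1.0, 1.0]
d_ab = diagram_distance(a, b)          # two of four points differ by 1
d_set = distance_to_set(a, [a, b])     # set made of a itself and b
```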
The MIA [4] depends on the average of the mean distances
between each pattern assigned to the cluster and its center.

$$MIA = \sqrt{\frac{1}{K} \sum_{k=1}^{K} d^2\left( r^{(k)}, C^{(k)} \right)} \qquad (4)$$
The CDI [4] depends on the distance between the load
diagrams in the same cluster and (inversely) on the distance
between the class representative load diagrams.
In (5), R is the set of the class representative load diagrams.

$$CDI = \frac{\sqrt{\dfrac{1}{K} \sum_{k=1}^{K} \left[ \dfrac{1}{2\, n^{(k)}} \sum_{m=1}^{n^{(k)}} d^2\left( l^{(m)}, C^{(k)} \right) \right]}}{\sqrt{\dfrac{1}{2K} \sum_{k=1}^{K} d^2\left( r^{(k)}, R \right)}} \qquad (5)$$

In our case study, three different clustering algorithms
were tested: the Two-Step algorithm, K-Means and
Self-Organizing Maps (SOM). The algorithm that produces the
smallest MIA and CDI values prevails over the others in terms
of partition quality. Indeed, a smaller MIA value indicates
more compact clusters.
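As an illustration of how the two adequacy measures rank partitions, the sketch below evaluates MIA and CDI for two partitions of four toy diagrams. It assumes that each cluster is a list of equal-length diagrams, that the representative diagram of a cluster is the point-by-point mean of its members, and that the distance to a set is computed as in (3); all names and data are hypothetical.

```python
import math


def d2(li, lj):
    """Squared distance (2) between two load diagrams."""
    return sum((a - b) ** 2 for a, b in zip(li, lj)) / len(li)


def d2_to_set(x, diagrams):
    """Squared distance (3) between diagram x and a set of diagrams."""
    return sum(d2(x, l) for l in diagrams) / len(diagrams)


def mean_curve(curves):
    """Point-by-point mean, used here as the cluster representative."""
    return [sum(c[h] for c in curves) / len(curves)
            for h in range(len(curves[0]))]


def mia(clusters):
    reps = [mean_curve(c) for c in clusters]
    total = sum(d2_to_set(r, c) for r, c in zip(reps, clusters))
    return math.sqrt(total / len(clusters))


def cdi(clusters):
    K = len(clusters)
    reps = [mean_curve(c) for c in clusters]
    # Numerator: dispersion of the diagrams inside each cluster
    intra = sum(sum(d2_to_set(l, c) for l in c) / (2 * len(c))
                for c in clusters) / K
    # Denominator: dispersion of the cluster representatives
    inter = sum(d2_to_set(r, reps) for r in reps) / (2 * K)
    return math.sqrt(intra) / math.sqrt(inter)


A, B, C, D = [0.0, 0.0], [0.0, 0.2], [1.0, 1.0], [1.0, 0.8]
good = [[A, B], [C, D]]      # similar diagrams grouped together
poor = [[A, B, C], [D]]      # one dissimilar diagram in the first cluster
```

On these toy data both indexes are smaller for the partition that groups similar diagrams together, which is the behaviour used to compare the algorithms.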
B. Classification Model.
In classification problems, a set of pre-classified data points
is given and the classification algorithm tries to discover a
rule that mimics the observed classification as closely as
possible. A classification problem is a supervised learning
task in which the output information is a discrete class.
Broadly speaking, the classification task consists of building
a classification model that can be applied to unclassified
records in order to assign them to classes. In other words, it
consists of examining the features of an object and assigning
it to one of the predefined classes (supervised learning).
To obtain more relevant information describing the consumption
patterns of each cluster population, we used a rule-based
modeling technique, the C5.0 classification algorithm. This
algorithm was chosen because it is easy to understand: the
rules derived from the model have a very straightforward
interpretation.
The classification model should allow a new consumer to be
assigned to a certain cluster on the basis of the rules it
generates, so the rules must be intelligible. For that purpose,
normalized shape indicators were used as attributes in the
classification model. As mentioned earlier in Section III, the
commercial indexes have no relation with the load curves, so if
used in isolation they cannot provide a good consumer
classification. Indexes therefore had to be defined that yield
meaningful, easily interpreted rules and express relevant
information about the electricity consumer's behaviour.
These indexes are derived from the daily load diagrams, and
some of them are based on the set of indexes proposed in
[1], [4]. They give information about the daily load curve
shape and about the consumption pattern of each consumer, and
they are used as attributes in the classification process.
Table I presents the set of indexes used, where Pmax is the
maximum power demand, Pmin is the minimum power demand and Pav
is the average power demand during a representative day.

TABLE I
NORMALIZED SHAPE INDEXES FOR CHARACTERIZING THE LOAD PROFILES
Parameter          Definition                                Acquisition Period
Daily Pav/Pmax     f1 = Pav,day / Pmax,day                   1 day
Daily Pmin/Pmax    f2 = Pmin,day / Pmax,day                  1 day
Daily Pmin/Pav     f3 = Pmin,day / Pav,day                   1 day
Night impact       f4 = (1/3) * Pav,night / Pav,day          1 day (8 hours night, from 11 p.m. to 6 a.m.)
Lunch impact       f5 = (1/8) * Pav,lunch / Pav,day          1 day (2 hours lunch, from 12 p.m. to 2 p.m.; 16 hours daytime, from 6 a.m. to 11 p.m.)
Daily Pav/Pinst    f6 = Pav,day / Pinst                      1 day

These indexes were extracted directly from the
representative load curve and they represent the load diagram
shape.
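The computation of the shape indexes of Table I can be sketched as follows. The sketch assumes a representative day of 96 samples starting at midnight, takes the night and lunch windows from the acquisition periods listed in the table, and treats Pinst as a value supplied by the caller; the function name, the parameter names and the sample curve are hypothetical.

```python
def shape_indexes(curve, p_inst, samples_per_hour=4):
    """Return the indexes f1..f6 of Table I for one representative day."""
    hours = [i / samples_per_hour for i in range(len(curve))]
    p_av = sum(curve) / len(curve)
    p_max, p_min = max(curve), min(curve)
    # Night window: from 11 p.m. to 6 a.m.; lunch window: from 12 p.m. to 2 p.m.
    night = [p for h, p in zip(hours, curve) if h >= 23 or h < 6]
    lunch = [p for h, p in zip(hours, curve) if 12 <= h < 14]
    p_av_night = sum(night) / len(night)
    p_av_lunch = sum(lunch) / len(lunch)
    return {
        "f1": p_av / p_max,
        "f2": p_min / p_max,
        "f3": p_min / p_av,
        "f4": (1 / 3) * p_av_night / p_av,
        "f5": (1 / 8) * p_av_lunch / p_av,
        "f6": p_av / p_inst,
    }


# Hypothetical day: 0.2 p.u. during the night window, 1.0 p.u. otherwise
curve = [0.2 if (i / 4 >= 23 or i / 4 < 6) else 1.0 for i in range(96)]
f = shape_indexes(curve, p_inst=2.0)
```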
The model evaluation is performed using ten-fold cross
validation presented in [9]. As described in [5], the evaluation
is performed by randomly splitting the initial sample in 10
sub-samples. The model is trained using 9/10 of the data set
and tested with the 1/10 left. This process is performed ten
times on different training sets and, finally, the ten error
estimates are averaged to yield an overall error estimate.
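The ten-fold procedure can be sketched in plain Python. A deliberately simple one-attribute threshold classifier stands in for C5.0, which is not reproduced here; the function names, the stand-in classifier and the synthetic data are assumptions made only for illustration.

```python
import random


def k_fold_error(samples, labels, fit, predict, k=10, seed=0):
    """Average the test error over k train/test splits of the sample."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # random split of the sample
    folds = [idx[i::k] for i in range(k)]     # k disjoint sub-samples
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        wrong = sum(predict(model, samples[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k


def fit_threshold(xs, ys):
    """Stand-in classifier: midpoint between the two class means."""
    a = [x for x, y in zip(xs, ys) if y == "A"]
    b = [x for x, y in zip(xs, ys) if y == "B"]
    return (sum(a) / len(a) + sum(b) / len(b)) / 2


def predict_threshold(threshold, x):
    return "B" if x > threshold else "A"


# Synthetic, well separated data: 20 low values (class A), 20 high (class B)
xs = [i / 100 for i in range(20)] + [0.8 + i / 100 for i in range(20)]
ys = ["A"] * 20 + ["B"] * 20
err = k_fold_error(xs, ys, fit_threshold, predict_threshold)
```

Each record is used exactly once for testing, so the averaged value estimates the error on records not seen during training.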
The classification model can use all the available input
attributes, choosing the most relevant ones for each rule.
Figure 2 represents the structure of the classification model.

Fig. 2- Classification model structure
IV. CASE STUDY
The case study used data concerning 229 medium voltage
customers of the Portuguese distribution company. The sample
was collected over 3 months in summer and 3 months in winter,
for working days and weekends, and the consumed power was
recorded with a cadence of 15 minutes.
In the first step, the typical daily load curve of each
customer was determined. During data preprocessing, twenty-one
customers were discarded from the initial data, leaving 208
consumers to be analyzed.
During this stage it was also necessary to fill in missing
measurement values. When values were missing for less than
1 hour they were estimated by extrapolation; for longer gaps, a
multilayer perceptron (MLP) artificial neural net was used to
fill in the missing values so that those customers could be
included in the study.
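The treatment of short gaps can be sketched as follows. The text does not detail the estimation formula, so this sketch assumes linear interpolation between the neighbouring measurements and marks missing samples with None; with a 15-minute cadence, a gap under 1 hour corresponds to at most 3 consecutive samples. Longer gaps are left untouched, since in the study they are handled by the MLP. The function name and data are hypothetical.

```python
def fill_short_gaps(curve, max_gap=3):
    """Fill runs of None no longer than max_gap by linear interpolation."""
    filled = list(curve)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        gap = i - start
        # Only interior gaps can be interpolated: both neighbours must exist
        if gap <= max_gap and start > 0 and i < n:
            left, right = filled[start - 1], filled[i]
            for j in range(gap):
                filled[start + j] = left + (right - left) * (j + 1) / (gap + 1)
    return filled


# Hypothetical measurements (kW): a 2-sample gap and a 4-sample gap
raw = [10.0, None, None, 16.0, None, None, None, None, 20.0]
clean = fill_short_gaps(raw)
```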
For this population, commercial data were also available: the
monthly energy consumption, the activity type and the hired
power.
All experiments described in this section were conducted using
Clementine version 8.5 [Clementine Data Mining System, web
page - http://www.spss.com]. This is an integrated DM toolkit
that uses a visual-programming interface and supports all KDD
stages.
1. Data Preprocessing
A previous data-cleaning phase is essential to detect and
correct bad data (noise suppression), as well as a data-
treatment phase to derive data according to DM algorithms
that will be used [6].
In this first data-cleaning phase, some damaged files and some
customers without registered values were detected and removed
from the initial data sample. Missing measurement values were
also detected. These failures can be due to transmission
interruptions or damage in the measuring equipment. A
multilayer perceptron (MLP) artificial neural net was used to
estimate the missing power values detected during data
preprocessing. The neural net was trained, and the missing
values estimated, on the basis of data from similar days of
each customer's consumption [10].

[Fig. 2 diagram blocks: Representative load diagrams; Load
shape indexes (each representative load curve is represented by
the vector f = [f1, f2, f3, f4, f5, f6]); Classification model
(C5.0 classification algorithm; input attributes: vector {f};
training set; test set; ten-fold cross validation); Generation
of rules; Decision tree; Evaluation accuracy; Analysis of the
confusion matrix]
[Fig. 4 plot: MIA and CDI (vertical axis from 0 to 1) versus
number of clusters (3, 6, 9, 12, 15)]
Tables II and III show the distribution of the sample
population according to hired power and activity type.
TABLE II
DESCRIPTION OF THE CONSUMERS DATA SET - HIRED POWER
Contracted Power (kW)         Up to 250   251 to 500   501 to 1000   1001 to 1500   More than 1500
Consumers Distribution (%)    52,4        18,3         13,5          7,7            8,1

TABLE III
DESCRIPTION OF THE CONSUMERS DATA SET - ACTIVITY TYPE
Activity Type   Consumers Distribution (%)   Activity Type   Consumers Distribution (%)
20              2,4                          200             0,5
30              6,3                          210             0,5
40              1,9                          220             4,8
50              0,5                          230             1,0
60              9,6                          240             1,0
70              4,8                          270             12,0
90              4,8                          280             4,3
110             0,5                          290             1,0
120             1,0                          310             0,5
130             0,5                          330             1,0
140             3,8                          340             2,4
160             1,0                          350             21,5
170             0,5                          360             4,3
190             1,9                          370             5,7

With all data completed, a representative load curve was
obtained for each customer by averaging its daily load
diagrams, so each customer is now represented by one typical
load curve. However, these representative load profiles are
expressed in terms of power consumption, which means that the
amplitude of each diagram depends on the amount of electric
energy consumed. Since the aim is to compare consumption
patterns among customers, the power consumption was normalized
to the [0,1] range using the peak power of each representative
load diagram, thereby preserving the information about the
shape of the initial load profile. Each customer is now
represented by a normalized representative daily load curve.
In order to reduce the data volume, the representative load
curves of all customers were separated into two loading
conditions, working days and weekends.
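The separation into the two loading conditions can be sketched as below. The sketch assumes that the daily diagrams of one customer are indexed by calendar date and that only the day of the week decides the loading condition (national holidays are ignored here); the function names and the sample values are hypothetical.

```python
from datetime import date


def split_by_day_type(daily_curves):
    """Group daily load diagrams into working days and weekend days."""
    groups = {"working": [], "weekend": []}
    for day, curve in daily_curves.items():
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        key = "weekend" if day.weekday() >= 5 else "working"
        groups[key].append(curve)
    return groups


def average_curve(curves):
    """Point-by-point average of a list of load diagrams."""
    return [sum(c[h] for c in curves) / len(curves)
            for h in range(len(curves[0]))]


# Hypothetical metered days with two samples each (kW)
metered = {
    date(2007, 1, 5): [10.0, 30.0],   # Friday
    date(2007, 1, 6): [4.0, 6.0],     # Saturday
    date(2007, 1, 7): [6.0, 8.0],     # Sunday
    date(2007, 1, 8): [20.0, 50.0],   # Monday
}
groups = split_by_day_type(metered)
representative = {k: average_curve(v) for k, v in groups.items()}
```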
2. Clustering Algorithms Application and Consumers
Characterization
In this stage the customers are grouped into classes according
to a similarity criterion; the load patterns are expected to be
grouped on the basis of their distinguishing features. The
normalized representative daily load curves, defined in (1),
were used. The choice of the clustering algorithm is decisive,
so three different algorithms were chosen and tested:
- Two-Step Cluster Analysis
- K-means
- Kohonen Net - Self-Organizing Feature Maps
For each clustering result, produced by each algorithm, the
clustering performance was compared using the indexes described
in (4) and (5). These indexes were also used to choose the
number of clusters.
As Figure 3 shows, the performance of the Two-Step cluster
algorithm and K-means is very close. The Two-Step algorithm is
suited to handling a large number of objects and, in the
future, with the installation of real-time measuring equipment,
there will be a huge amount of data to treat. The Two-Step
cluster algorithm was therefore chosen.
[Fig. 3 bar chart: MIA and CDI values (vertical axis from 0 to
0,9) for the Two-Step, K-means and SOM algorithms]

Fig. 3- Clustering performance comparison, for weekends (9 clusters)

Before applying the clustering algorithms, the number of
classes had to be defined. The number of clusters is usually
defined a priori, and the expert's information (from the
distribution company) is essential at this stage. According to
[5], the number of classes must belong to the range [2, M],
where M is the number of consumers in the data set. Typically,
however, the number of clusters must be small enough to allow
the definition of a different tariff structure for each class
while, on the other hand, the partition precision improves as
the number of clusters increases. A compromise must therefore
be reached when defining the number of clusters. The two
indexes (4) and (5) were computed to evaluate the quality of
the partition. Figure 4 shows that the indexes decrease as the
number of clusters increases and that, beyond 9 clusters, the
further reduction is not very significant. Given this, and the
information from the electricity utility that the number of
clusters should belong to the [6,9] range, 9 clusters were
chosen.
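The compromise described above can be expressed as a simple selection rule. The sketch below is only one possible formalization, not the procedure used in the study: it picks the smallest number of clusters inside the range accepted by the utility after which the index no longer decreases by a chosen minimum amount. The function name, the threshold and the index values are hypothetical and are not read from Fig. 4.

```python
def choose_number_of_clusters(index_by_k, allowed, min_gain):
    """Smallest allowed K after which the index improves by less than min_gain."""
    ks = sorted(index_by_k)
    candidates = [k for k in ks if allowed[0] <= k <= allowed[1]]
    if not candidates:
        raise ValueError("no tested K inside the allowed range")
    for k in candidates:
        later = [j for j in ks if j > k]
        if not later:
            return k
        # Reduction of the index obtained by moving to the next tested K
        gain = index_by_k[k] - index_by_k[later[0]]
        if gain < min_gain:
            return k
    return candidates[-1]


# Hypothetical index values for the tested numbers of clusters
index_by_k = {3: 0.60, 6: 0.45, 9: 0.35, 12: 0.33, 15: 0.32}
best_k = choose_number_of_clusters(index_by_k, allowed=(6, 9), min_gain=0.05)
```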

[Figs. 5 and 6 plots: power (p.u., from 0,0 to 1,0) versus time
(h, from 1 to 23), with one curve for each of Cluster 1 to
Cluster 9]

Fig. 4- Evolution of the indexes with the number of clusters
After choosing the clustering algorithm and the number of
clusters, the normalized representative load curves were used
to obtain the clusters.
From the clusters produced by the Two-Step algorithm, the
representative diagram of each cluster was obtained for
weekends and working days, for a period of one year, by
averaging the representative load diagrams of the clients
assigned to the same cluster.
Figure 5 shows the representative load diagram obtained for
each cluster, using the power measurements on working days
directly. Each curve represents the load profile of the
corresponding customer class.
Fig. 5 - Typical load profile for working days
The results showed that, for instance, cluster 1 has 28
consumers with the same consumption pattern, while cluster 8
has 4 customers with the same consumption shape, which
represents 1,92% of the sample. As can be seen in Figures 5
and 6, the clustering module separated the customer population
well, and representative load diagrams with distinct load
shapes were created.
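The averaging that yields the representative diagram of each cluster, and the share of the sample held by each cluster, can be sketched as follows. The sketch assumes that the cluster labels have already been produced by a clustering algorithm; the function name, curves and labels are hypothetical.

```python
from collections import defaultdict


def cluster_representatives(curves, labels):
    """Average the member curves of each cluster and compute cluster shares."""
    members = defaultdict(list)
    for curve, label in zip(curves, labels):
        members[label].append(curve)
    reps = {k: [sum(c[h] for c in cs) / len(cs) for h in range(len(cs[0]))]
            for k, cs in members.items()}
    # Share of the whole sample assigned to each cluster, in percent
    shares = {k: 100.0 * len(cs) / len(curves) for k, cs in members.items()}
    return reps, shares


# Hypothetical normalized curves (two points each) and cluster labels
curves = [[0.2, 1.0], [0.4, 1.0], [1.0, 0.5], [1.0, 0.7]]
labels = [1, 1, 2, 2]
reps, shares = cluster_representatives(curves, labels)

# Share of a cluster with 4 of the 208 customers, as reported for cluster 8
share_cluster_8 = round(100 * 4 / 208, 2)
```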

Fig. 6 - Typical load profile for weekends
In [11], a trial was performed to search for associations
between the clusters and the components of the contractual
data (commercial indices). The results showed that only a poor
correlation exists between the main clusters and the
contractual data, which indicates that the contractual data are
highly ineffective for characterizing the electrical behavior
of the customers. Several shape indexes representing the shape
of the load curve, described in Table I, were therefore
extracted for use in the classification model.

3. Consumers Classification
To obtain more relevant information describing the consumption
patterns of each cluster population, we used a rule-based
modeling technique, the C5.0 classification algorithm. In this
first phase, this algorithm was chosen because it is easy to
understand: the rules derived from the model have a very
straightforward interpretation. In the future, we aim to use
other algorithms and compare their results with those of C5.0.
The classification model should allow a new consumer to be
assigned to a certain cluster on the basis of the rules it
generates, so the rules must be intelligible. For that purpose,
normalized shape indicators were used as attributes in the
classification model. The attribute vector is formed by the
indexes shown in Table I.
An artificial neural net was also used to analyze the
importance of each attribute (shape index). The index f6 was
found to have little importance and was therefore removed from
the classification.
Once again, the Two-Step algorithm was used to find 9 new
clusters for working days and weekend days, this time using the
normalized shape indicators. Each data set was then split into
a training group and a test group: the training group used by
the classification model holds 2/3 of the data, and the
remaining 1/3 was used for testing.
Table IV presents an example rule set obtained from the C5.0
algorithm, in this case for the working days data set. The
obtained rules are simple and easy to understand.

TABLE IV
RULE SET FOR THE WORKING DAYS CLASSIFICATION MODEL
If f3 ≤ 0,48 and f2 ≤ 0,13 and f5 ≤ 0,55 and f1 ≤ 0,35 and f4 ≤ 0,31 then cluster-8
If f3 ≤ 0,48 and f2 ≤ 0,13 and f5 ≤ 0,55 and f1 ≤ 0,35 and f4 > 0,31 then cluster-9
If f3 ≤ 0,48 and f2 ≤ 0,13 and f5 ≤ 0,55 and f1 > 0,35 then cluster-5
If f3 ≤ 0,48 and f2 ≤ 0,13 and f5 > 0,55 and f5 ≤ 0,6 then cluster-7
If f3 ≤ 0,48 and f2 ≤ 0,13 and f5 > 0,55 and f5 > 0,67 and f2 ≤ 0,06 then cluster-6
If f3 ≤ 0,48 and f2 ≤ 0,13 and f5 > 0,55 and f5 > 0,67 and f2 > 0,06 then cluster-7
If f3 ≤ 0,48 and f2 > 0,13 and f4 ...
If f3 ≤ 0,48 and f2 > 0,13 and f4 ...
If f3 > 0,48 and f3 ≤ 0,78 and f2 ...
If f3 > 0,48 and f3 ≤ 0,78 and f2 ...
If f3 > 0,48 and f3 ...

The classification model used all the available attributes,
selecting for each rule merely the attributes that provided
larger information gain.
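To illustrate how such a rule set assigns a new consumer to a cluster, the sketch below encodes the six complete rules of Table IV as nested conditions. It assumes that every comparison printed without an operator is a "less than or equal to" test, keeps the thresholds exactly as they appear in the table, and returns None for any case covered only by the incomplete rules; the function name and the sample values are hypothetical.

```python
def classify_working_day(f1, f2, f3, f4, f5):
    """Apply the six complete rules of Table IV; None if no rule applies."""
    if f3 <= 0.48 and f2 <= 0.13:
        if f5 <= 0.55:
            if f1 <= 0.35:
                return 8 if f4 <= 0.31 else 9
            return 5
        # From here on f5 > 0.55
        if f5 <= 0.6:
            return 7
        if f5 > 0.67:
            return 6 if f2 <= 0.06 else 7
    # Remaining branches are incomplete in Table IV, so no cluster is assigned
    return None


# Hypothetical consumer described by its shape indexes
cluster = classify_working_day(f1=0.30, f2=0.10, f3=0.40, f4=0.20, f5=0.50)
```

Because each rule is a short chain of threshold tests on the shape indexes, the path that leads to a cluster can be read directly from the attribute values.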
The model was tested, and its overall accuracy was 94,83% for
working days and 95,45% for weekend days, which shows that the
results are satisfactory.

V. CONCLUSION AND FURTHER WORK
This paper presents a methodology for the characterization and
classification of electric medium voltage consumers based on
historical data. The performance of three different clustering
algorithms was compared in order to obtain the representative
load diagrams of the customer classes. The Two-Step cluster
algorithm was chosen, and the typical load profile of each
class was obtained, taking into account the criterion for the
number of clusters. These clusters were then described using
the C5.0 classification algorithm.
The results show that the commercial parameters are poorly
connected to the load profiles.
The clustering algorithm was able to produce load profiles with
distinctly different load shapes, and the classification
algorithm presents a good overall accuracy for both the working
days and the weekends loading conditions. Normalized shape
indexes were used as attributes in the classification model,
which generated a rule set. The shape indexes were extracted
from the representative load curve and express the shape of the
representative load diagrams. The resulting rules are simple
and easy to understand.
By knowing the representative load diagrams and following the
electrical behaviour of the consumers, it will be possible to
propose new tariff structures for each customer class according
to its consumption pattern; these must be sufficiently flexible
to follow the variations in the customers' load patterns.
The distribution companies, as well as the consumers, can take
advantage of the knowledge of typical load profiles, and this
knowledge can improve the settlements between electric power
suppliers and consumers.
The development of new tariff structures, in articulation with
electricity market prices, will be a potential tool for the
retail companies.
VI. ACKNOWLEDGMENT
The authors would like to express their gratitude to EDP
Distribuição, the Portuguese Distribution Company, for
supplying the data used in this work.
The authors would also like to acknowledge FCT, FEDER,
POCTI, POSI, POCI and POSC for their support to R&D
Projects and GECAD Unit.
VII. REFERENCES

[1] Sérgio Ramos, Zita Vale, José Santana & Jorge Duarte, "Data Mining
Contributions to Characterize MV Consumers and to Improve the
Suppliers-Consumers Settlements", PES GM 07 - IEEE Power
Engineering Society, Tampa, Florida, USA, 24-28 July, 2007.

[2] Pitt, B. and D. Kirschen, Applications of Data Mining Techniques to Load
Profiling, in Proc. IEEE PICA, Santa Clara, CA, May, 1999.

[3] Gellings, Clark W., Emerging Energy Customers of the Twenty-First
Century, CIGRE/IEEE Technical Session, IEEE Power Engineering
Review, October, 1998.

[4] Chicco, G., Napoli, R., Postolache, P., Scutariu, M. and Toader, C.,
Customer Characterization Options for Improving the Tariff Offer, IEEE
Transactions on Power Systems, Vol. 18, No. 1, February, pp. 381-387,
2003.

[5] Figueiredo V., Rodrigues F., Vale Z. & Gouveia, B., An Electric Energy
Characterization Framework based on Data Mining Techniques. In the
IEEE Transactions on Power Systems, Vol. 20, No. 2, pp. 596-602, May
2005.

[6] Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, From Data
Mining to Knowledge Discovery: An Overview. In Advances in
Knowledge Discovery and Data Mining, pages 1-34. AAAI/MIT Press,
1996.

[7] Frawley, W.J., G. Piatetsky-Shapiro, C. Matheus, Knowledge Discovery
in Databases: An Overview, Technical Report, 1995.

[8] Sérgio Ramos, Zita Vale, Fátima Rodrigues, Raul Pinheiro & Judite
Ferreira, "Decision Support System for Improving the Tariff Offer Based
on Patterns Extracted from MV Load Diagrams", ICKEDS06, in Proc. of
the International Conference on Knowledge Engineering and Decision
Support, pp. 107-115, Lisbon, Portugal, May, 2006.

[9] Witten, I. & Frank, E., Data Mining - Practical Machine Learning Tools
and Techniques with Java Implementations, Morgan Kaufmann
Publishers, Academic Press, 2000.

[10] Sérgio Ramos, Zita Vale, José Santana & Fátima Rodrigues, "An
Approach to the Consumer-Supplier Relationship Supported by Data
Mining Techniques for MV Customers", WSEAS Transactions on Power
Systems, Issue 7, Volume 1, July, pp. 1350-1357, 2006.

[11] Figueiredo V., Duarte F.J., Rodrigues F., Vale Z., Ramos, C., Ramos, S.,
Gouveia B., 2003, Electric Energy Customer Characterization by
Clustering, Proceedings of ISAP 2003, Lemnos, Greece.

VIII. BIOGRAPHIES
Sérgio Ramos graduated from the Polytechnic Institute of
Porto in 1999 and received his MSc degree from the
Instituto Superior Técnico (Lisbon, Portugal) in 2006.
He is currently an Assistant Professor of Electric
Power Systems at the Polytechnic Institute of Porto. His
research interests include competitive electricity
markets, energy efficiency, load research and electrical
installations.

Zita Vale graduated from the University of Porto in 1986
and received her Ph.D. degree in Electrical Engineering
from the same University in 1993.
She is currently a Coordinator Professor at the Polytechnic Institute of Porto.
Her research areas include Power Systems Operation and Control, Electricity
Markets, Decision Support and Artificial Intelligence.
