Professional Documents
Culture Documents
AbstractThe current trend in the industry is to analyze different variables without losing much of information
large data sets and apply data mining, machine learning [1].
techniques to identify a pattern. But the challenges with Artificial Neural Network (ANN) is an information
huge data sets are the high dimensions associated with it. processing computational model based on biological
Sometimes in data analytics applications, large amounts neural networks and is composed of several
of data produce worse performance. Also, most of the interconnected processing elements (neurons) that work
data mining algorithms are implemented column wise and in parallel to solve a generic problem. In general ANN are
too many columns restrict the performance and make it used to identify complex relations between input and
slower. Therefore, dimensionality reduction is an output or to find patterns in a given data set. ANN is an
important step in data analysis. Dimensionality reduction adaptive system that can change its structure based on the
is a technique that converts high dimensional data into flow of information through the network during the
much lower dimension, such that maximum variance is training phase.
explained within the first few dimensions. The two important learning paradigms in ANNs are
This paper focuses on multivariate statistical and supervised learning and unsupervised learning. In
artificial neural networks techniques for data reduction. supervised learning, there exists a training data that helps
Each method has a different rationale to preserve the in the construction of the model by specifying classes
relationship between input parameters during analysis. and by providing positive and negative objects that
Principal Component Analysis which is a multivariate belong to those classes. In unsupervised learning, there is
technique and Self Organising Map a neural network no preexisting taxonomy and the algorithm classifies the
technique is presented in this paper. Also, a hierarchical output into different classes. Kohonens Self Organising
clustering approach has been applied to the reduced data Map is an unsupervised learning technique that reduces a
set. A case study of Air quality measurement has been high dimensional data into a 2 dimensional space. This
considered to evaluate the performance of the proposed reduction in dimensionality helps in understanding the
techniques. relationship quickly and also SOM provides better
Keywords Air Quality Dimensionality reduction, visualization of components [2].
Hierarchical Clustering,Principal Component Analysis, The rest of the paper is organized as follows. Section 2
Self Organising Maps. discusses related work and their findings. Section 3
briefly explains the existing Principal Component
I. INTRODUCTION Analysis and Self Organising Map techniques used in our
Multivariate Statistical Analysis is useful in the case work. Section 4 consists of a case study on air quality to
where data is of high dimension. Since human vision is evaluate the performance of the proposed techniques,
limited to 3 dimensions, all application above 2 or 3 along with expected results. Finally the concluding points
dimension is an ideal case for data to be analyzed through are given in section 5.
Multivariate Statistical Analysis (MVA). This analysis
provides joint analysis and easy visualization of the II. RELATED WORK
relationship between involved variables. As a result, Some of the related work pertaining to Principal
knowledge that was unrevealed and hidden among vast Component Analysis and Self Organising Maps are
amounts of data can be obtained. PCA is a powerful discussed in this section. In [3] multivariate analysis used
multivariate technique in reducing the dimension of data to analyze and interpret data from a large chemical
set and revealing the hidden relationship among the process. PCA was used to identify correct correlations
between the variables to reduce the dimensionality of
where V is the current input vector The important task in PCA is deciding the number of
W is the nodes weight vector principal components to be considered for further
d) The radius of neighborhood of the BMU is analysis. Fig. 1. Indicates a scree plot for determining the
calculated. All nodes within this radius range number of PCs. The elbow point shows the number of
will be updated in the next iterations. Initially, components to be considered and beyond this the eigen
values are small and indicates negligible data loss.
REFERENCES
[1] Johnson, Richard Arnold and Wichern, Dean W and
others. Applied multivariate statistical analysis
s.l. : Prentice hall Englewood Cliffs, NJ, 1992. Vol.
4.
[2] Gurney, Kevin, An introduction to neural
networks, CRC press, 1997.
[3] Kosanovich, KA and Piovoso, MJ , Process data
analysis using multivariate statistical methods ,
IEEE, pp. 721724, 1991
[4] Barreto, Alexandre S., Multivariate statistical
analysis for dermatological disease diagnosis,
Biomedical and Health Informatics (BHI), IEEE-
EMBS International Conference IEEE, 2014, pp.
500-504.
[5] Harrou; Fouzi and Nounou; Mohamed Numan and
Nounou;Hazem Numan , Detecting abnormal
Fig. 6. Blue nodes indicate High Ozone, Orange colour ozone levels using PCA-based GLR hypothesis
nodes indicate Medium Ozone and Green color nodes testing, s.l. : IEEE, 2013, pp. 95-102.
indicate Low Ozone. [6] Raychaudhuri, Soumya and Stuart, Joshua M and
Altman, Russ B, Principal components analysis to
Fig. 6. shows the formation of clusters into low ozone, summarize microarray experiments: application to
medium and high ozone that is depicted by green, orange sporulation time series, Pacific Symposium on
and blue respectively. Biocomputing. Pacific Symposium on
In High Ozone class, the effect of solar radiation and Biocomputing, NIH Public Access, 2000, pp. 455.
temperature are above average and wind speed has very [7] Annas, Suwardi and Kanai, Takenori and Koyama,
low effect. The effect of solar radiation and temperature is Shuhei, Principal component analysis and self-
below average in Low Ozone class and wind speed has organizing map for visualizing and classifying fire
significant effect. In Medium Ozone cluster, solar risks in forest regions, Agricultural Information
radiation, temperature and wind have about average Research Journal, 2007,pp. 44-51.
effect. The clustering results obtained are satisfactory and [8] Astel, A., Tsakovski, S., Barbieri, P., & Simeonov,
Self Organising Maps can be used to classify the quality V. , Comparison of self-organizing maps
of air. classification approach with cluster and principal
components analysis for large environmental data
V. CONCLUSION sets Water Research,41(19), 2007, pp. 4566-4578.
In this paper, principal component analysis (PCA) and [9] Klobucar, Damir, and Marko Subasic. "Using self-
Self Organising Map (SOM) has been applied to organizing maps in the visualization and analysis of
demonstrate the dimensionality reduction of dataset. A forest inventory." iForest-Biogeosciences and
case study was considered to visualize and classify air Forestry 5.5, 2012 , pp. 216.
quality data. PCA explained most of the variance in data [10] Chattopadhyay, Manojit, Pranab K. Dan, and
but it was difficult to interpret the data pattern. However, Sitanath Mazumdar. "Principal component analysis
SOM appears to be the best fit to represent complex data. and self-organizing map for visual clustering of
Especially, the component planes of SOM provides machine-part cell formation in cellular
effective visualization and uncovers the correlation manufacturing system." Systems Research Forum.
between the input variables. Vol. 5., 2011,pp.25-51.
U-matrix can be used for data classification, but may not [11] Koua, E. L. "Using self-organizing maps for
be a good choice in all cases. Therefore, in this paper information visualization and knowledge discovery
hierarchical clustering technique is applied to the SOM in complex geospatial datasets." Proceedings of 21st
results obtained. The quality of air is partitioned into 3