Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016

On Realizing Rough Set Algorithms with Apache Spark


Kuo-Min Huang, Hsin-Yu Chen & Kan-Lin Hsiung
Innovation Center for Big Data and Digital Convergence &
Department of Electrical Engineering, Yuan Ze University
135 Yuan-Tung Road, Chung-Li, TAIWAN 32003
E-mail: convex@saturn.yzu.edu.tw
ABSTRACT
In this note, in line with the emerging granular computing paradigm for huge datasets, we consider a Spark implementation of rough set theory, a powerful mathematical tool for dealing with vagueness and uncertainty in imperfect data.

KEYWORDS
Data Mining, Granular Computing, Rough Sets,
Apache Spark, Hadoop MapReduce

1 INTRODUCTION
1.1 Apache Spark
Apache Spark [9], an alternative to Hadoop
MapReduce, is currently one of the most active open source projects in the big data world.
Hadoop's lack of suitability for many new big data applications has been noted [8]. For example, owing to the absence of long-lived MapReduce jobs and the overhead of fetching data from HDFS on each iteration, Hadoop is not well-suited to iterative computations, such as the repeated operations on high-dimensional matrices performed by the optimization algorithms of many real-time machine learning frameworks.
The major advantage of Spark over Hadoop is that Spark is a cluster computing framework that lets users perform in-memory computations, caching data in memory across iterations, in a fault-tolerant manner [10]. Supporting iterative algorithms out of the box, Spark has been adopted by many organizations as a replacement for MapReduce.
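As a concrete illustration, the following minimal PySpark sketch (our own, not code from the paper; the dataset, dimensions, and step size are assumed for the example) caches an RDD in memory and reuses it across gradient-descent iterations, precisely the access pattern for which a chain of MapReduce jobs would pay a per-iteration HDFS read:

```python
# A minimal, illustrative PySpark sketch (not the paper's code): an RDD
# cached in memory is reused across gradient-descent iterations; a chain
# of MapReduce jobs would instead re-read its input from HDFS each pass.
import math
import random

from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-caching-demo")

# Hypothetical data: (feature, label) pairs for 1-D logistic regression.
points = sc.parallelize(
    [(random.uniform(-1.0, 1.0), float(random.randint(0, 1)))
     for _ in range(100000)]
).cache()  # keep partitions in memory across iterations

w = 0.0
for _ in range(20):  # iterative refinement loop
    # Each pass reads the cached partitions rather than external storage.
    grad = points.map(
        lambda p: (1.0 / (1.0 + math.exp(-w * p[0])) - p[1]) * p[0]
    ).mean()
    w -= 0.5 * grad

sc.stop()
```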
1.2 Rough Set Theory

As a powerful and popular data mining tool, the theory of rough sets [5, 6, 7, 2], introduced by Pawlak in the early 1980s [4, 1], is specifically suited to information systems that exhibit data inconsistencies (i.e., objects with different values for the decision attribute but identical values for the condition attributes). Rough sets, which categorize objects based on the indiscernibility of their attribute values, make it possible to deal with incomplete, imprecise, or uncertain data.
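For reference, in standard rough set notation (a textbook definition, not reproduced from this paper), the indiscernibility relation induced by a subset B of condition attributes over a universe U of objects is

$$\mathrm{IND}(B) = \{(x, y) \in U \times U : a(x) = a(y) \ \text{for all } a \in B\},$$

and $[x]_B$ denotes the equivalence class of object x under IND(B).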
One of the main advantages of rough set models is that they require no preliminary or additional information about the data, such as the membership values of fuzzy set models or the probability distributions of statistics. Owing to their versatility, rough set methods and algorithms have been widely used in various fields, including voice recognition, audio and image processing, finance, process control, pharmacology and medicine, text mining and exploration of the web, and power system security analysis.
The rough set approach to processing inexact, uncertain, or vague knowledge is based on a pair of crisp sets (i.e., conventional sets) that give the lower and upper approximations of the original set. However, in a data analysis pipeline applying rough set methods to huge datasets (with possibly millions of objects), the required task of computing the upper and lower approximations (of concepts [3]) is quite demanding in terms of memory and runtime. In this note, a parallel and distributed implementation over Apache Spark for computing rough approximations (of decision classes) in huge information systems is briefly reported.
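In the same standard notation, the two crisp approximating sets just mentioned are

$$\underline{B}X = \{x \in U : [x]_B \subseteq X\}, \qquad \overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\},$$

so the lower approximation collects the objects certainly belonging to X, the upper approximation those possibly belonging to X, and the difference $\overline{B}X \setminus \underline{B}X$ is the boundary region produced by inconsistent data.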

2 NUMERICAL RESULTS
We have created a prototype implementation using Apache Spark that executes basic operations of rough set theory, including computing the rough approximations of a given crisp set; a simplified sketch of this computation is given after this paragraph.
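The following PySpark sketch is our own minimal reconstruction under stated assumptions (a toy in-memory dataset, binary decisions, a groupByKey over condition-attribute tuples); the paper does not publish its code, so the names and structure here are illustrative only:

```python
# A sketch of one way to compute rough approximations with Spark (our
# reconstruction, not the authors' code). Objects are grouped into
# indiscernibility classes by their condition-attribute tuple; a class
# lies in the lower approximation of the target decision class iff all
# of its members share that decision, and in the upper approximation
# iff at least one member does.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rough-approximation-demo")

# Hypothetical information system: (condition-attribute tuple, decision).
rows = sc.parallelize([
    ((1, 0), 1), ((1, 0), 1),   # consistent class -> lower approximation
    ((0, 1), 1), ((0, 1), 0),   # inconsistent class -> boundary region
    ((2, 2), 0),
])

target = 1  # the decision class X being approximated

# Indiscernibility classes: objects keyed by condition-attribute values.
classes = rows.groupByKey().mapValues(list)

lower = classes.filter(lambda kv: all(d == target for d in kv[1]))
upper = classes.filter(lambda kv: any(d == target for d in kv[1]))

print("lower:", sorted(lower.keys().collect()))  # [(1, 0)]
print("upper:", sorted(upper.keys().collect()))  # [(0, 1), (1, 0)]

sc.stop()
```

Because each indiscernibility class is processed independently, the two filter steps parallelize naturally across partitions.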
Our implementation was tested on a Spark cluster on Amazon AWS. Eight m4.4xlarge machines were used as workers, with 64 GB of memory each. Datasets of various problem sizes, with the number of condition attributes varying from 10 to 10^2 and the number of instances varying from 10^4 to 10^7, were randomly generated. Values of the decision attribute were randomly chosen between 0 and 1, and all values of the condition attributes were randomly generated between 0 and 10^3.
To highlight the scalability of our distributed implementation, the RoughSets package in R [11] was first used to perform rough set analysis on the randomly generated huge datasets. (Since no distributed R version of rough set algorithms could be found, the test of the RoughSets package was limited to a single worker.) Owing to its non-distributed nature (i.e., the whole indiscernibility matrix is stored in memory all at once), the RoughSets package quickly runs out of memory and cannot handle datasets with more than roughly ten condition attributes and tens of thousands of instances. By contrast, for datasets with 10^2 condition attributes and the number of instances varying from 10^4 to 10^7, the execution time of our Spark implementation grows approximately quadratically in the number of instances, ranging from 10^2 to 10^6 seconds.
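A back-of-the-envelope calculation (our own arithmetic, offered only to make the memory pressure plausible) shows why a materialized indiscernibility matrix cannot scale: the number of object pairs is

$$\binom{n}{2} = \frac{n(n-1)}{2} \approx 5 \times 10^{9} \quad \text{for } n = 10^{5},$$

so even a single byte per pair already requires on the order of 5 GB on one machine, before any attribute data are stored.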

FUTURE WORK

In the future, we would like to improve the performance of our Spark realization of rough set algorithms.

ACKNOWLEDGEMENT

This research was supported by the Ministry of Science and Technology, Taiwan, through Grant #MOST 103-2221-E-155-067.
REFERENCES
[1] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, USA, 1991.
[2] J. Stepaniuk, Rough-Granular Computing in Knowledge Discovery and Data Mining, volume 152 of Studies in Computational Intelligence, Springer, 2008.
[3] J. Bazan, H.S. Nguyen, and M. Szczuka, A view on rough set concept approximations, Fundamenta Informaticae, vol. 59, no. 2-3, pp. 107-118, April 2004.
[4] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341-356, 1982.
[5] Z. Pawlak and A. Skowron, Rough sets and Boolean reasoning, Information Sciences, vol. 177, no. 1, pp. 41-73, 2007.
[6] Z. Pawlak and A. Skowron, Rough sets: Some extensions, Information Sciences, vol. 177, no. 1, pp. 28-40, 2007.
[7] Z. Pawlak and A. Skowron, Rudiments of rough sets, Information Sciences, vol. 177, no. 1, pp. 3-27, 2007.
[8] V.S. Agneeswaran, Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives, Pearson FT Press, 2014.
[9] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, and I. Stoica, Spark: Cluster computing with working sets, in Proc. of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 10, Berkeley, CA, USA, 2010.
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, et al., Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proc. of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2, Berkeley, CA, USA, 2012.
[11] L.S. Riza, A. Janusz, C. Bergmeir, C. Cornelis, et al., Implementing algorithms of rough set theory and fuzzy rough set theory in the R package RoughSets, Information Sciences, vol. 287, pp. 68-89, 2014.
