Professional Documents
Culture Documents
ABSTRACT
Rapid growth of data requires systems that are able
to provide a scalable infrastructure for distributed
storage and processing of vast amount of data
efficiently. Hive is a MapReduce-based data
warehousing system for data summarization and
query analysis. This warehousing system can
arrange millions of rows of data into tables, where
its data placement structures play a significant role
in the performance of this warehouse. It also
provides SQL-like language called HiveQL, that
able to compile MapReduce jobs into queries on
Hadoop.
In this paper, we investigate the
performance of Hive's data placement structures
(RCFile and ORCFile). The experimental results
showed the effectiveness of RCFile and ORCFile
for data placement structure in MapReduce system.
KEYWORDS
Big Data; RCFile; ORCFile; Hive; Performance
INTRODUCTION
15
Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016
2.1 Hadoop
3
Hadoop is a Java based open source software
framework developed by Apache Software
Foundation, which can store, process, and
analyze alarge volume of datasets in parallel on
clusters of computers. Hadoop can scale up
easily from a single server to hundred servers,
where each server provides local computation
and storage capability. Hadoop has four core
modules: MapReduce, HDFS (Distributed
storage), Yarn framework, and common
utilities.
MapReduce is a programming model to process
a large volume of datasets in parallel on clusters
of computers [12].
Hadoop Distributed File System (HDFS) has
master-slave architecture, which introduces
master as NameNode, and slave as a number of
DataNodes [14]. NameNodes are responsible
for namespace operation such as opening,
closing, renaming files directories, and
mapping blocks to DataNodes [4]. DataNodes
store data in different blocks in HDFS and
allows read/write operations by clients [4].
DataNode is responsible for performing block
creation, deletion and replication according to
the instructions by the NameNode.
Apache Hive was developed by Facebook to
manage its growing volume of data sets [14].
Hive has been included in Hadoop as a data
16
Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016
The ORCFileStructure
17
Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016
3.2.2
18
Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016
19
Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016
CONCLUSION
20
Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016
REFERENCES
21