
VOL 11 NO 1, 2013
Infosys Labs Briefings

Big Data: Testing Approach to Overcome Quality Challenges

Validate data quality by employing a structured testing technique

By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
Testing Big data is one of the biggest challenges faced by organizations because of lack of knowledge on what to test and how much data to test. Organizations have been facing challenges in defining the test strategies for structured and unstructured data validation, setting up an optimal test environment, working with non-relational databases and performing non-functional testing. These challenges result in poor quality of data in production, delayed implementation and increased cost. A robust testing approach needs to be defined for validating structured and unstructured data, and testing needs to start early in the implementation life cycle to identify possible defects early and to reduce the overall cost and time to market.
Different testing types like functional and non-functional testing are required, along with strong test data and test environment management, to ensure that the data from varied sources is processed error-free and is of good quality for analysis. Functional testing activities like validation of the MapReduce process, structured and unstructured data validation, and data storage validation are important to ensure that the data is correct and of good quality. Apart from these functional validations, non-functional testing like performance and failover testing plays a key role in ensuring that the whole process is scalable and happens within the specified SLAs.
A Big data implementation deals with writing complex Pig and Hive programs and running these jobs using the Hadoop MapReduce framework on huge volumes of data across different nodes. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop uses MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes.
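To make the MapReduce model concrete, the following is a minimal sketch of a per-key aggregation written as Hadoop Streaming scripts in Python. Hadoop Streaming is the standard mechanism for running script-based map and reduce tasks; the tab-delimited input layout, the field positions and the customer/amount semantics are purely illustrative assumptions.

#!/usr/bin/env python
# mapper.py -- emits one key/value pair per input record (illustrative field layout).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                       # skip malformed records
    key, value = fields[0], fields[2]  # e.g., customer id and amount
    print("%s\t%s" % (key, value))

The reducer receives the mapper output sorted by key and aggregates it:

#!/usr/bin/env python
# reducer.py -- sums the values for each key produced by the mapper.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%.2f" % (current_key, total))
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print("%s\t%.2f" % (current_key, total))

A job of this shape would be submitted through the hadoop-streaming jar, with the two scripts passed as the map and reduce tasks.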
Figure 1 shows the Big data testing focus areas and the step by step process of how Big data is processed using the Hadoop ecosystem. The first step, loading source data into HDFS, involves extracting the data from different source systems and loading it into HDFS. Data is extracted using crawl jobs for web data and tools like Sqoop for transactional data, and is then loaded into HDFS by splitting it into multiple files. Once this step is completed, the second step, performing MapReduce operations, involves processing the input files and applying map and reduce operations to get the desired output. The last step, extracting the output results from HDFS, involves extracting the data output generated in the second step and loading it into downstream systems, which can be an enterprise data warehouse (EDW) for generating analytical reports or any of the transactional systems for further processing.
BIG DATA TESTING APPROACH
As we are dealing with huge volumes of data and executing on multiple nodes, there are high chances of bad data and data quality issues at each stage of the process. Functional data testing is performed to identify data issues arising from coding errors or node configuration errors. Testing should be performed at each of the three phases of Big data processing to ensure that data is getting processed without any errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of the Hadoop MapReduce process data output; and (iii) validation of data extract and load into the EDW. Apart from these functional validations, non-functional testing including performance testing and failover testing needs to be performed.
Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused: pre-Hadoop process validation, MapReduce process validation, ETL process validation and reports testing, along with non-functional (performance and failover) testing across the stack. In this architecture, source data such as web logs, streaming data, social data and transactional RDBMS data is loaded into Hadoop (HDFS, MapReduce, HBase NoSQL DB, Pig, Hive) using Sqoop, and the processed data flows through an ETL process into the enterprise data warehouse for reporting using BI tools.
Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social network sites, call logs, transactional data, etc. is extracted based on the requirements and loaded into HDFS before it is processed further.
Issues: Some of the issues that we face during this phase of the data moving from source systems to Hadoop are incorrect data captured from source systems, incorrect storage of data, and incomplete or incorrect replication.
Validations: Some high level scenarios that need to be validated during this phase include (an illustrative reconciliation script follows the list):
1. Comparing the input data files against source system data to ensure the data is extracted correctly,
2. Validating the data requirements and ensuring the right data is extracted,
3. Validating that the files are loaded into HDFS correctly, and
4. Validating that the input files are split, moved and replicated across different data nodes.
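The script below is a hypothetical sketch of the first scenario, written in Python: it reconciles a source extract with the corresponding file landed in HDFS using record counts and per-record checksums. The file paths, delimiter and key column are assumptions made for the sketch; the standard hdfs dfs -cat command is used to read the HDFS file.

# Hypothetical pre-Hadoop reconciliation sketch: compare a source extract with the
# file loaded into HDFS by record count and per-record checksum. Paths, delimiter
# and key column are illustrative assumptions.
import hashlib
import subprocess

def record_checksums(lines, key_index=0, delimiter="|"):
    """Return {key: md5-of-record} for every non-empty record in an iterable of lines."""
    checksums = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key = line.split(delimiter)[key_index]
        checksums[key] = hashlib.md5(line.encode("utf-8")).hexdigest()
    return checksums

def read_hdfs_file(hdfs_path):
    """Stream an HDFS file to the client using the standard 'hdfs dfs -cat' command."""
    out = subprocess.check_output(["hdfs", "dfs", "-cat", hdfs_path])
    return out.decode("utf-8").splitlines()

source = record_checksums(open("/data/extracts/transactions_20130101.dat"))
target = record_checksums(read_hdfs_file("/user/etl/landing/transactions_20130101.dat"))

missing    = set(source) - set(target)     # extracted but never landed in HDFS
unexpected = set(target) - set(source)     # landed in HDFS but absent in source
mismatched = [k for k in set(source) & set(target) if source[k] != target[k]]

print("missing: %d, unexpected: %d, corrupted: %d"
      % (len(missing), len(unexpected), len(mismatched)))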
Validation of Hadoop Map Reduce Process
Once the data is loaded into HDFS, the Hadoop MapReduce process is run to process the data coming from different sources.
Issues: Some issues that we face during this phase of the data processing are coding issues in MapReduce jobs, jobs working correctly when run on a standalone node but incorrectly when run on multiple nodes, incorrect aggregations, node configuration issues, and incorrect output formats.
Validations: Some high level scenarios that need to be validated during this phase include (a sample aggregation check follows the list):
1. Validating that data processing is completed and the output file is generated
2. Validating the business logic on a standalone node and then validating it after running against multiple nodes
3. Validating the MapReduce process to verify that key value pairs are generated correctly
4. Validating the aggregation and consolidation of data after the reduce process
5. Validating the output data against the source files and ensuring the data processing is completed correctly
6. Validating the output data file format and ensuring that the format is per the requirement.
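A sample check for the aggregation scenarios above is sketched below: the totals produced by the reduce step are recomputed independently from a source sample and compared key by key. The tab-delimited layouts, file names and numeric tolerance are illustrative assumptions.

# Hypothetical MapReduce output validation sketch: recompute the aggregation from a
# source sample and compare it against the job output key by key.
from collections import defaultdict

def expected_totals(source_path, delimiter="\t"):
    """Independently recompute per-key totals from the source file."""
    totals = defaultdict(float)
    with open(source_path) as f:
        for line in f:
            key, value = line.rstrip("\n").split(delimiter)[:2]
            totals[key] += float(value)
    return totals

def actual_totals(output_path, delimiter="\t"):
    """Read the key/value pairs emitted by the reduce step."""
    totals = {}
    with open(output_path) as f:
        for line in f:
            key, value = line.rstrip("\n").split(delimiter)
            totals[key] = float(value)
    return totals

expected = expected_totals("source_sample.tsv")
actual = actual_totals("part-00000")           # output file pulled from HDFS

for key, value in expected.items():
    if abs(actual.get(key, 0.0) - value) > 0.01:
        print("Mismatch for key %s: expected %.2f, got %.2f"
              % (key, value, actual.get(key, 0.0)))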
Validation of Data Extract and Load into EDW
Once the MapReduce process is completed and the data output files are generated, this processed data is moved to the enterprise data warehouse or other transactional systems depending on the requirement.
Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect loading of HDFS files into the EDW, and incomplete data extraction from HDFS.
Validations: Some high level scenarios that need to be validated during this phase include (a reconciliation sketch follows the list):
1. Validating that transformation rules are applied correctly
2. Validating that there is no data corruption by comparing target table data against HDFS file data
3. Validating the data load in the target system
4. Validating the aggregation of data
5. Validating the data integrity in the target system.
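One way to script the HDFS-to-EDW reconciliation described above is sketched below, assuming a DB-API connection to the warehouse (pyodbc is shown, but any driver would do). The DSN, table name, column and file layout are illustrative assumptions.

# Hypothetical HDFS-to-EDW reconciliation sketch: confirm the processed output was
# loaded without loss by comparing row counts and a column aggregate.
import subprocess
import pyodbc  # any DB-API driver for the target warehouse would do

hdfs_output = "/user/etl/output/daily_sales/part-00000"
lines = subprocess.check_output(
    ["hdfs", "dfs", "-cat", hdfs_output]).decode("utf-8").splitlines()

file_rows = len([line for line in lines if line.strip()])
file_total = sum(float(line.split("\t")[1]) for line in lines if line.strip())

conn = pyodbc.connect("DSN=EDW")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*), SUM(sale_amount) FROM daily_sales_fact")
table_rows, table_total = cursor.fetchone()

assert file_rows == table_rows, "Row count mismatch between HDFS output and EDW table"
assert abs(file_total - float(table_total)) < 0.01, "Aggregate mismatch after load"
print("EDW load reconciliation passed: %d rows, total %.2f" % (table_rows, table_total))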
Validation of Reports
Analytical reports are generated using reporting tools by fetching the data from the EDW or by running queries on Hive.
Issues: Some of the issues faced while generating reports are report definitions not set as per the requirement, report data issues, and layout and format issues.
Validations: Some high level validations
performed during this phase include:
Reports Validation: Reports are tested after the ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are fetching the exact data needed for the generation of the reports; an example of such a check is sketched below.
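A hypothetical example of the query-based check follows: the aggregate exposed by a reporting view is compared with the same aggregate computed from the underlying fact table. The view, table and column names are invented for illustration.

# Hypothetical report view check: the figures exposed by the reporting metadata view
# should match the same aggregate computed directly on the warehouse table.
import pyodbc

cursor = pyodbc.connect("DSN=EDW").cursor()

cursor.execute("SELECT region, SUM(sale_amount) FROM daily_sales_fact GROUP BY region")
base = dict(cursor.fetchall())

cursor.execute("SELECT region, total_sales FROM rpt_sales_by_region_vw")
view = dict(cursor.fetchall())

for region, total in base.items():
    if abs(float(view.get(region, 0)) - float(total)) > 0.01:
        print("Report view mismatch for region %s" % region)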
Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.
Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring that all objects are rendered properly and that the resources on the webpage are current. The data fetched from the various web parts is validated against the databases.
VOLUME, VARIETY AND VELOCITY:
HOW TO TEST?
In the earlier sections we have seen step by step details of what needs to be tested at each phase of Big data processing. During these phases the three dimensions or characteristics of Big data, i.e., volume, variety and velocity, are validated to ensure there are no data quality defects and no performance issues.
Volume: The amount of data created both inside corporations and outside the corporations via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year [3]. Huge volumes of data flow from multiple systems and need to be processed and analyzed. When it comes to validation, it is a big challenge to ensure that the whole data set processed is correct. Manually validating the whole data is a tedious task, so compare scripts should be used to validate the data. As data in HDFS is stored in file format, scripts can be written to compare two files and extract the differences using compare tools [4]. Even with compare tools, a 100% data comparison takes a lot of time. To reduce the execution time we can either run all the comparison scripts in parallel on multiple nodes, just as data is processed using the Hadoop MapReduce process, or sample the data while ensuring maximum scenario coverage.
Figure 3 shows the approach for comparing voluminous amounts of data. Data is converted into the expected result format and then compared against the actual data using compare tools. This is a faster approach but involves initial scripting time, and it reduces the cycle time of later regression testing. When there is no time to validate the complete data, sampling can be done for validation.
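A sketch of the sampling idea is shown below: a deterministic, hash-based sample is drawn so that the same records are selected from both the expected and the actual files, and only those records are compared. The sample rate, delimiter and key column are assumptions made for illustration.

# Hypothetical sampled comparison sketch: instead of comparing 100% of records, a
# deterministic hash-based sample is compared so the same keys are picked from both files.
import hashlib

SAMPLE_RATE = 0.05   # compare roughly 5% of records (assumption)

def in_sample(key):
    """Deterministically decide whether a key belongs to the sample."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return (digest % 100) < SAMPLE_RATE * 100

def load_sample(path, key_index=0, delimiter="\t"):
    """Load only the sampled records of a file into a {key: record} dictionary."""
    sample = {}
    with open(path) as f:
        for line in f:
            key = line.rstrip("\n").split(delimiter)[key_index]
            if in_sample(key):
                sample[key] = line.rstrip("\n")
    return sample

expected = load_sample("expected_results.tsv")
actual = load_sample("actual_results.tsv")

differences = [k for k in expected if actual.get(k) != expected[k]]
print("Compared %d sampled records, %d differences" % (len(expected), len(differences)))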
Variety: The variety of data types is increasing, namely unstructured text-based data and semi-structured data like social media data, location-based data, and log-file data.
Structured data is data in a defined format coming from different RDBMS tables or from structured files. Data of a transactional nature can be handled in files or tables for validation purposes.
Figure 3: Approach for High Volume Data Validation — MapReduce jobs are run in the test environment to generate the output data files; custom scripts convert unstructured data to structured data and convert raw data into the expected results format; a file comparison tool then performs a file-by-file comparison of the actual results against the expected results and produces a discrepancy report, supported by testing scripts that validate the data in HDFS. Source: Infosys Research
Semi-structured data does not have any defined format, but structure can be derived based on the multiple patterns of the data; an example is data extracted by crawling through different websites for analysis purposes. For validation, the data first needs to be transformed into a structured format using custom-built scripts. First the pattern needs to be identified, then copy books or pattern outlines need to be prepared; these copy books are then used in scripts to convert the incoming data into a structured format, and validations are performed using compare tools.
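As an illustration of this pattern-driven conversion, the sketch below turns web-log style lines into pipe-delimited structured records using a regular expression that plays the role of the copy book; the log layout and field names are assumptions.

# Hypothetical pattern-based conversion sketch: a web-log style line is converted
# into a delimited, structured record before comparison. Layout is illustrative.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def to_structured(line):
    """Return a pipe-delimited record, or None when the line does not match the pattern."""
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    f = match.groupdict()
    return "|".join([f["ip"], f["timestamp"], f["method"],
                     f["url"], f["status"], f["bytes"]])

with open("access.log") as raw, open("access_structured.psv", "w") as out:
    for line in raw:
        record = to_structured(line)
        if record:
            out.write(record + "\n")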
Unstructured data is data that does not have any format and is stored in documents or web content, etc. Testing unstructured data is very complex and time consuming. Automation can be achieved to some extent by converting the unstructured data into structured data using scripting, like Pig scripting, as shown in Figure 3. But the overall coverage using automation will be very low because of the unexpected behavior of the data; input data can be in any form and changes every time a new test is performed. We therefore need to deploy a business scenario validation strategy for unstructured data. In this strategy we identify the different scenarios that can occur in our day to day unstructured data analysis, and test data is then set up based on these test scenarios and executed.
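A minimal sketch of such a business scenario validation is given below; the scenarios, expected fields and the toy extraction routine are invented purely for illustration and would be replaced by the team's real conversion scripts.

# Hypothetical business-scenario validation sketch: each scenario pairs representative
# unstructured test data with the structured output the conversion is expected to produce.
import re

SCENARIOS = [
    {"name": "customer complaint",
     "input": "Order 1234 arrived damaged. Please arrange a refund.",
     "expected": {"order_id": "1234", "sentiment": "negative"}},
    {"name": "positive review",
     "input": "Order 5678 was delivered early, great service!",
     "expected": {"order_id": "5678", "sentiment": "positive"}},
]

def naive_extract(text):
    """Toy stand-in for the real unstructured-to-structured conversion script."""
    order = re.search(r"Order (\d+)", text)
    negative = any(w in text.lower() for w in ("damaged", "refund", "broken"))
    return {"order_id": order.group(1) if order else None,
            "sentiment": "negative" if negative else "positive"}

def run_scenarios(extract):
    """Run every scenario through the extraction routine and report failures."""
    failures = []
    for scenario in SCENARIOS:
        actual = extract(scenario["input"])
        if actual != scenario["expected"]:
            failures.append((scenario["name"], scenario["expected"], actual))
    return failures

print(run_scenarios(naive_extract))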
Velocity: The speed at which new data is being created -- and the need for real-time analytics to derive business value from it -- is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big data appliance to overcome performance problems. Performance testing plays an important role in identifying any performance bottleneck in the system and in verifying that the system can handle high velocity streaming data.
NON-FUNCTIONAL TESTING
In the earlier sections we have seen how functional testing is performed at each phase of Big data processing; these tests are performed to identify functional coding issues and requirements issues. Performance testing and failover testing need to be performed to identify performance bottlenecks and to validate the non-functional requirements.
Performance Testing: Any Big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, because of bad architecture and poorly designed code, performance is degraded. If the performance does not meet the SLA, the purpose of setting up Hadoop and other Big data technologies is lost. Hence, performance testing plays a key role in any Big data project due to the huge volume of data and the complex architecture.
Some of the areas where performance issues can occur are imbalanced input splits, redundant shuffles and sorts, and moving to the reduce process aggregation computations that could have been done in the map process [5]. These performance issues can be eliminated by carefully designing the system architecture and running performance tests to identify the bottlenecks.
Performance testing is conducted by setting up a huge volume of data and an infrastructure similar to production. Utilities like the Hadoop performance monitoring tool can be used to capture the performance metrics and identify the issues. Performance metrics like job completion time and throughput, and system level metrics like memory utilization, are captured as part of performance testing.
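The sketch below shows one way to capture job completion time and throughput by timing a standard hadoop jar submission and sizing the input with hdfs dfs -du; the jar, driver class and paths are illustrative assumptions, and system-level metrics would still come from the cluster monitoring tool.

# Hypothetical performance measurement sketch: time a MapReduce job submission and
# derive throughput from the input size reported by HDFS.
import subprocess
import time

input_path = "/user/perf/input"
output_path = "/user/perf/output"

start = time.time()
subprocess.check_call([
    "hadoop", "jar", "sales-aggregation.jar", "com.example.SalesAggregationJob",
    input_path, output_path,
])
elapsed = time.time() - start

# Input size in bytes: the first field of the standard 'hdfs dfs -du -s' output.
du = subprocess.check_output(["hdfs", "dfs", "-du", "-s", input_path]).decode("utf-8")
input_bytes = int(du.split()[0])

print("Job completed in %.1f s, throughput %.2f MB/s"
      % (elapsed, input_bytes / (1024.0 * 1024.0) / elapsed))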
Failover Testing: The Hadoop architecture consists of a name node and hundreds of data nodes hosted on several server machines, all of which are connected. There are chances of node failure in which some of the HDFS components become non-functional; such failures include name node failure, data node failure and network failure. The HDFS architecture is designed to detect these failures and automatically recover so that processing can proceed.

Failover testing is an important focus area in Big data implementations, with the objective of validating the recovery process and ensuring that data processing happens seamlessly when switched to other data nodes.
Some validations that need to be performed during failover testing are: validating that checkpoints of the edit logs and FsImage of the name node happen at defined intervals; recovery of the edit logs and FsImage files of the name node; no data corruption because of a name node failure; data recovery when a data node fails; and validating that replication is initiated when one of the data nodes fails or data becomes corrupted. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics are captured during failover testing.
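As a sketch, the check below inspects the standard hdfs fsck report after a simulated data node failure to confirm that no blocks are corrupt and to observe re-replication progress; the path is illustrative and the exact report labels can vary across Hadoop versions.

# Hypothetical failover health check: parse the 'hdfs fsck' report after a simulated
# data node failure to confirm block health and watch re-replication.
import subprocess

report = subprocess.check_output(["hdfs", "fsck", "/user/etl"]).decode("utf-8")

def metric(label):
    """Pull the numeric value from a labelled line of the fsck report, if present."""
    for line in report.splitlines():
        if label in line:
            return int(line.split(":")[1].strip().split()[0])
    return None

corrupt = metric("Corrupt blocks")
under_replicated = metric("Under-replicated blocks")

assert corrupt == 0, "Corrupt blocks detected after node failure"
print("Under-replicated blocks awaiting re-replication: %s" % under_replicated)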
TEST ENVIRONMENT SETUP
As Big data involves handling huge volumes and processing across multiple nodes, setting up a test environment is the biggest challenge. Setting up the environment on the cloud gives the flexibility to set it up and maintain it during test execution. Hosting the environment on the cloud also helps in optimizing the infrastructure and achieving faster time to market.
Key steps involved in setting up the environment on the cloud are [6]:
A. Big data test infrastructure requirement assessment
1. Assess the Big data processing requirements
2. Evaluate the number of data nodes required in the QA environment
3. Understand the data privacy requirements to evaluate private or public cloud
4. Evaluate the software inventory required to be set up on the cloud environment (Hadoop, file system to be used, NoSQL DBs, etc.).
B. Big data test infrastructure design
1. Document the high level cloud test infrastructure design (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the SLAs, communication plan, maintenance plan and environment refresh plan
4. Document the data security plan
5. Document the high level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third party tools required.
C. Big data test infrastructure implementation and maintenance
1. Create a cloud instance of the Big data test environment
2. Install Hadoop, HDFS, MapReduce and other software as per the infrastructure design
3. Perform a smoke test on the environment by processing sample MapReduce and Pig/Hive jobs
4. Deploy the code to perform testing.
BEST PRACTICES
Data Quality: It is very important to establish the data quality requirements for different forms of data like traditional data sources, data from social media, data from sensors, etc. If the data quality is ascertained, the transformation logic alone can be tested by executing tests against all possible data sets.
Data Sampling: Data sampling gains significance in Big data implementations, and it becomes the tester's job to identify suitable sampling techniques that include all critical business scenarios and the right test data set.
Automation: Automate the test suites as much as possible. The Big data regression test suite will be used multiple times as the database is periodically updated, so an automated regression test suite should be built and reused after each release. This will save a lot of time during Big data validations.
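A hypothetical shape for such an automated regression suite, using pytest so the same reconciliations re-run after every release, is sketched below; the HDFS paths, table names and DSN are assumptions made for illustration.

# Hypothetical automated regression sketch: the same HDFS-to-EDW reconciliation is
# re-run for every tracked data set after each release.
import subprocess
import pyodbc
import pytest

REGRESSION_CASES = [
    ("/user/etl/output/daily_sales/part-00000", "daily_sales_fact"),
    ("/user/etl/output/daily_returns/part-00000", "daily_returns_fact"),
]

def hdfs_row_count(path):
    """Count non-empty records in an HDFS output file."""
    lines = subprocess.check_output(["hdfs", "dfs", "-cat", path]).decode("utf-8")
    return sum(1 for line in lines.splitlines() if line.strip())

def table_row_count(table):
    """Count rows in the corresponding warehouse table."""
    cursor = pyodbc.connect("DSN=EDW").cursor()
    cursor.execute("SELECT COUNT(*) FROM %s" % table)
    return cursor.fetchone()[0]

@pytest.mark.parametrize("hdfs_path,table", REGRESSION_CASES)
def test_load_reconciliation(hdfs_path, table):
    assert hdfs_row_count(hdfs_path) == table_row_count(table)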
CONCLUSION
Data quality challenges can be countered by deploying a structured testing approach for both functional and non-functional requirements. Applying the right test strategies and following best practices will improve the testing quality, which will help in identifying defects early and reduce the overall cost of the implementation. Organizations need to invest in building skill sets in both development and testing. Big data testing will be a specialized stream, and the testing team should be built with a diverse skill set including coding, white-box testing and data analysis skills so that it can do a better job of identifying quality issues in the data.
REFERENCES
1. Big data overview, Wikipedia.org. Available at http://en.wikipedia.org/wiki/Big_data.
2. White, T. (2010), Hadoop: The Definitive Guide, 2nd Edition, O'Reilly Media.
3. Kelly, J. (2012), Big Data: Hadoop, Business Analytics and Beyond, A Big Data Manifesto from the Wikibon Community, Mar 2012. Available at http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond.
4. Informatica Enterprise Data Integration (1998), Data verification using File and Table compare utility for HDFS and Hive tool. Available at https://community.informatica.com/solutions/1998.
5. Bhandarkar, M. (2009), Practical Problem Solving with Hadoop, USENIX '09 annual technical conference, June 2009. Available at http://static.usenix.org/event/usenix09/training/tutonefile.html.
6. Naganathan, V. (2012), Increase Business Value with Cloud-based QA Environments. Available at http://www.infosys.com/IT-services/independent-validation-testing-services/Pages/cloud-based-QA-environments.aspx.
Authors' Profiles
MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached
at mahesh_gudipati@infosys.com.
SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be
contacted at Shanthi_Rao@infosys.com.
NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted
at naju_mohan@infosys.com.
NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be
contacted at Naveen_Gajja@infosys.com.