
IBM Software

The fundamentals of data lifecycle management in the era of big data
How data lifecycle management complements a big data strategy

Contents

1 Introduction
2 Big data, big impact: Dealing with the three Vs
3 Best practices: Putting data lifecycle management into action
4 The power of enterprise-scale data lifecycle management
5 Enhance data warehouse agility with IBM InfoSphere
6 Why InfoSphere?


Introduction
Organizations are eager to harness the power of big data. But as new big data opportunities emerge, ensuring that information is trusted and protected becomes exponentially more difficult. If these challenges are not addressed directly, end users may lose confidence in the insights generated from their data, which can leave them unable to act on new opportunities or address threats.

The tremendous volume, variety and velocity of big data mean that the old manual methods of discovering, governing and correcting data are no longer feasible. Organizations need to automate information integration and governance from the start. By automating information integration and governance and employing it at the point of data creation and throughout its lifecycle, organizations can help protect information and improve the accuracy of big data insights.


Information integration and governance solutions must become a natural part of big data projects. They must support automated discovery and profiling and they must facilitate an understanding of diverse data sets to provide the complete context required to make informed decisions. They must be agile enough to accommodate a wide variety of data and seamlessly integrate with diverse technologies, from data marts to Apache Hadoop systems. Plus, they must discover, protect and monitor sensitive information across its lifecycle as part of big data applications.

Understanding the context of data and being able to extract the precise information necessary to meet a business objective is key to utilizing big data to the fullest. Managing the data lifecycle so that data is accurate, appropriately used and correctly stored to meet the required service levels and retention needs has wide-ranging benefits: it reduces risk, improves performance and prevents an overload of useless information.

This e-book explores the challenges of managing big data, best practices for enterprise-scale data lifecycle management and how IBM InfoSphere Optim data lifecycle management solutions incorporate a comprehensive range of information integration and governance capabilities that enable companies to properly manage data over its lifetime.


Big data, big impact: Dealing with the three Vs

Without effective data lifecycle management, the increasing volume, variety and velocity of big data can reduce performance, erode margins and amplify risks.

Performance and time-to-market
As more users execute more queries on larger data volumes, slow response times and degraded application performance become major issues. If left unchecked, continued data growth will stretch resources beyond capacity and negatively impact response time for critical queries and reporting processes. These problems can affect production environments and hamper upgrades, migrations and disaster recovery efforts. Implementing intelligent data management of historical, dormant data is essential for avoiding these potentially business-halting issues.

Rapid data growth also makes testing more difficult. As data warehouses and big data environments grow to petabytes or more, testing processes are taxed by having to cull data for their specific needs. The results include longer test cycles, slower time-to-market and fewer defects identified in advance of release. Speeding up testing workflows and delivery of data warehouses requires organizations to automate the creation of realistic, rightsized test data while keeping appropriate security measures in place.

Margins
Exponential data growth also can drive up infrastructure and operational costs, often consuming most of an organization's data warehousing or big data budget. Rising data volumes require more capacity, and organizations often must buy more hardware and spend more money to maintain, monitor and administer their expanding infrastructure. Large data warehouses and big data environments generally require bigger servers, appliances and testing environments, which can also increase software licensing costs for the database and database tooling, not to mention labor, power and legal costs.


Risks
Following the "let's keep it in case someone needs it later" mandate, many organizations already keep too much historical data. According to the CGOC 2012 Summit Survey, 69 percent of data has no value. Opening the doors to excessive storage and retention only exacerbates the situation.

At the same time, organizations must ensure the privacy and security of the growing volumes of confidential information. Government and industry regulations from around the world, such as the Health Insurance Portability and Accountability Act (HIPAA), the Personal Information Protection and Electronic Documents Act (PIPEDA) and the Payment Card Industry Data Security Standard (PCI DSS), require organizations to protect personal information no matter where it lives, even in test and development environments.

Data breaches and attacks risk negative consumer sentiment

75 percent of IT risks impact customer satisfaction and brand reputation. 43 percent of organizations are increasing their focus on reputational risk because of growth in emerging technologies such as social media.

Maintaining compliance with data retention regulations, protecting privacy and archiving data are not just legal matters; they are essential for sustaining customer satisfaction and brand reputation. In recent IBM surveys, respondents indicate that data theft/cybercrime is the number-one threat to a company's reputation, a greater threat than system failures. Sixty-four percent of respondents say their company will be focusing more on managing and protecting their reputation than they did five years ago.1

Source: Insights from the 2012 Global Reputational Risk and IT Study.


The danger of treating a backup as an archive

Many organizations are confused about the difference between archiving and backing up data. Archiving preserves data, providing a long-term repository of information that can be used by litigation and audit teams. By contrast, backing up data involves copying production data and moving it to another environment to enable disaster recovery and the restoration of deleted files. Backups are often retained for a short time, until a fresh backup replaces the existing backup. Archiving complements backups by removing old, redundant and infrequently accessed data from a system and by reducing the size of databases and their backups. Approximately 75 percent of the data stored is typically inactive, rarely accessed by any user, process or application. An estimated 90 percent of all data access requests are serviced by new data, usually data that is less than a year old.2 With an effective archiving strategy, organizations can protect old data and comply with data retention rules while reducing costs and enhancing system performance.

In an attempt to meet archiving needs, some organizations simply back up data to a Hadoop environment. But this kind of backup will not ensure that data will be fully protected or remain queryable, the way a true archive would. With an effective data lifecycle management solution, companies can create an archive that protects data, meets compliance standards, and supports queries and reporting. An emerging trend is for organizations to use Hadoop as a lower-cost storage alternative for archives.
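
To make the idea of a queryable archive on low-cost storage concrete, here is a minimal Python sketch of landing an archive extract as an immutable, date-partitioned flat file in a directory that might sit on an HDFS or object-store mount. The path, layout and schema are hypothetical illustrations under those assumptions, not a format prescribed by InfoSphere Optim or Hadoop.

import csv
from datetime import date
from pathlib import Path

# Hypothetical mount point for the low-cost archive tier.
ARCHIVE_ROOT = Path("/mnt/hadoop/archive/orders")

def land_archive_extract(rows: list[tuple], header: list[str]) -> Path:
    """Write one extract per archiving run, partitioned by run date so
    downstream query tools can scan only the partitions they need."""
    out_dir = ARCHIVE_ROOT / f"run_date={date.today().isoformat()}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "part-00000.csv"
    with out_file.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)   # keep a header row so the file is self-describing
        writer.writerows(rows)
    return out_file

Because each run lands in its own dated directory and files are never rewritten, the extracts stay reportable long after the source rows have been purged from production.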


Best practices: Putting data lifecycle management into action

The data lifecycle stretches through multiple phases as data is created, used, shared, updated, stored and eventually archived or defensibly disposed of. Data lifecycle management plays an especially key role in three of these phases of data's existence: archiving, test data management and data masking.

Where management tasks fall in the data lifecycle

[Figure: The data lifecycle runs in a circle from create through use, share, update, store/retain, archive and dispose. Test data management, archiving and data masking map to the use, share and archive steps.]

The entire data lifecycle benefits from good governance, but management capabilities that focus on the use, share and archive steps have wide-ranging benefits for cost reduction and efficiency gains.

Archiving
Retention policies are designed to keep
important data elements for reference and
for future use while deleting data that is no
longer necessary to support the legal needs
of an organization. Effective data lifecycle
management includes the intelligence not
only to archive data in its full context, which
may include information across dozens of
databases, but also to archive it based on
specific parameters or business rules, such
as the age of the data. It can also help
storage administrators develop a tiered and
automated storage strategy to archive
dormant data in a data warehouse, thereby
improving overall warehouse performance.
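
As an illustration of archiving by business rule, here is a minimal Python/SQLite sketch that moves rows out of an active table once they pass a retention threshold based on the age of the data. The table names, columns and two-year threshold are hypothetical, and a production solution such as InfoSphere Optim archives complete business objects spanning many tables and databases rather than a single table.

import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 365 * 2  # hypothetical rule: archive orders older than two years

def archive_dormant_orders(prod: sqlite3.Connection,
                           archive: sqlite3.Connection) -> int:
    """Move orders older than the retention threshold into an archive database."""
    # Assumes order_date is stored as an ISO yyyy-mm-dd string.
    cutoff = (datetime.now() - timedelta(days=RETENTION_DAYS)).strftime("%Y-%m-%d")
    rows = prod.execute(
        "SELECT cust_id, item_no, order_date FROM orders WHERE order_date < ?",
        (cutoff,),
    ).fetchall()
    archive.executemany(
        "INSERT INTO archived_orders (cust_id, item_no, order_date) VALUES (?, ?, ?)",
        rows,
    )
    prod.execute("DELETE FROM orders WHERE order_date < ?", (cutoff,))
    prod.commit()
    archive.commit()
    return len(rows)  # number of rows moved out of the active warehouse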


Enterprise information

[Figure: A breakdown of enterprise information. Roughly 1 percent is subject to legal hold, 5 percent is needed for regulatory record keeping and 25 percent has business utility, about 31 percent in total. The remaining 69 percent is everything else.]

Many organizations hope that big data will provide a large, centralized lake of data, but in many cases, it becomes a data swamp full of unreliable information.

Many organizations envision big data as a large, pristine, centralized data lake. But a data lake can quickly turn into a data swamp when data is poorly managed and controlled. By setting up an intelligent data lifecycle management strategy and archiving to inexpensive storage, you can avoid turning your big data environment into a dumping ground.

Test data management
In development, testers must automate the creation of realistic, rightsized data sources that mirror the behaviors of existing production databases. To ensure that queries can be run easily and accurately, they must create a subset of actual production data and reproduce actual conditions to help identify defects or problems as early as possible in the testing cycle.

The tremendous size of big data systems creates challenges for testers. There is a greater need to speed delivery of big data applications, requiring organizations to create realistic, rightsized, masked test data for testing those applications for performance and functionality. Testers also need ways to generate test data sets that facilitate realistic functional and performance testing. Because production data contains information that may identify customers, organizations must mask that information in test environments to maintain compliance and privacy.
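
As a minimal sketch of the subsetting step described above, the following Python/SQLite fragment copies a random sample of customers plus all of their orders, so every foreign key in the test subset still resolves. The two-table schema mirrors the simple example shown later in this e-book and is purely illustrative; a real production schema, and a tool such as InfoSphere Optim, would handle far more complex relationships.

import sqlite3

def build_test_subset(prod: sqlite3.Connection,
                      test: sqlite3.Connection,
                      sample_size: int = 100) -> None:
    """Extract a referentially intact subset of production data."""
    # Sample the parent rows first...
    customers = prod.execute(
        "SELECT cust_id, name, street FROM customers ORDER BY RANDOM() LIMIT ?",
        (sample_size,),
    ).fetchall()
    if not customers:
        return
    test.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)

    # ...then pull every child row that references a sampled parent,
    # so the subset reproduces realistic join conditions.
    ids = [row[0] for row in customers]
    placeholders = ",".join("?" for _ in ids)
    orders = prod.execute(
        f"SELECT cust_id, item_no, order_date FROM orders "
        f"WHERE cust_id IN ({placeholders})",
        ids,
    ).fetchall()
    test.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    test.commit()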


Applying data masking techniques to the test data means testers use realistic-looking but fictional data, so no actual sensitive data is revealed. Application developers can also use test data management technologies to easily access and refresh test data, which speeds the testing and delivery of the new data source.

Organizations also need ways to mask certain sensitive data, such as credit card and phone numbers. While testing their big data environments, they must mask sensitive data from unauthorized users, even though those users might be authorized to see the data in aggregate. For example, a pharmaceutical company that is testing its data warehouse environment might mask Social Security numbers and dates of birth but not patients' ages and other demographic information. Masking certain data this way satisfies corporate and industry regulations by removing identifiable information, while still maintaining business context and referential integrity for testing in nonproduction environments.

Original data

Customers table
Cust ID   Name            Street
08054     Alice Bennett   2 Park Blvd
19101     Carl Davis      258 Main
27645     Elliot Flynn    96 Avenue

Orders table
Cust ID   Item #    Order date
27645     80-2382   20 June 2004
27645     86-4538   10 October 2005

De-identified data

Customers table
Cust ID   Name             Street
10000     Auguste Renoir   23 Mars
10001     Claude Monet     24 Venus
10002     Pablo Picasso    25 Saturn

Orders table
Cust ID   Item #    Order date
10002     80-2382   20 June 2004
10002     86-4538   10 October 2005

Data masking techniques protect the confidentiality of private information.
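
The tables above can be reproduced by a consistent masking pass: customer IDs are remapped through a single lookup table so the Customers and Orders tables stay joinable, identifying fields are replaced with fictional values, and non-identifying fields (item number, order date) are left intact. Here is a minimal Python sketch of that idea; the replacement values are hypothetical, and a real masking tool would draw on much richer rule sets.

import itertools

# Fictional replacement values (illustrative only).
FAKE_NAMES = itertools.cycle(["Auguste Renoir", "Claude Monet", "Pablo Picasso"])
FAKE_STREETS = (f"{23 + i} {planet}" for i, planet
                in enumerate(itertools.cycle(["Mars", "Venus", "Saturn"])))

def mask(customers: list[dict], orders: list[dict]) -> tuple[list, list]:
    # One ID mapping shared by both tables preserves referential integrity.
    id_map = {c["cust_id"]: 10000 + i for i, c in enumerate(customers)}
    masked_customers = [
        {"cust_id": id_map[c["cust_id"]],
         "name": next(FAKE_NAMES),        # identifying field: replace
         "street": next(FAKE_STREETS)}    # identifying field: replace
        for c in customers
    ]
    masked_orders = [
        {**o, "cust_id": id_map[o["cust_id"]]}  # keep item and date as-is
        for o in orders
    ]
    return masked_customers, masked_orders

Applied to the original tables, this yields the de-identified result shown above while keeping each order joinable to its masked customer.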


Complex IT landscapes make setting up test labs extremely costly

As volume, variety and velocity impact the complexity of data infrastructures, scaling test environments becomes a significant problem. It isn't unusual for Fortune 500 companies to spend up to USD30 million building a single test lab, and many of these organizations have dozens of labs. Add in rising wages, and testing costs begin to spiral out of control.

[Figure: A heterogeneous enterprise environment that test labs must reproduce, spanning private and public clouds, EJB and third-party services, business partners, messaging and routing services, collaboration, web/Internet, content providers, shared services, archives, portals, the data warehouse, directory identity, the mainframe, the enterprise service bus and file systems.]


The power of enterprise-scale data lifecycle management

Effective data lifecycle management benefits both IT and business stakeholders.

Increasing margin: Lower infrastructure and capital costs, improved productivity and reduced application defects during the development lifecycle.

Reducing risks: Reduced application downtime, minimized service and performance disruptions, and adherence to data retention requirements.

Promoting business agility: Improved time-to-market, increased application performance and improved quality of applications through realistic test data.

With InfoSphere Optim, organizations gain a single data lifecycle management solution that can scale to meet enterprise needs. Whether they implement InfoSphere Optim for a single application, data warehouse or big data environment, organizations can streamline data lifecycle management with a consistent strategy. The unique relationship engine in InfoSphere Optim provides a single point of control to guide data processing activities such as archiving, subsetting and retrieving data.


Enhance data warehouse agility with IBM InfoSphere

InfoSphere Optim solutions help organizations meet requirements for information integration and governance and address challenges exacerbated by the increasing volume, variety and velocity of data. By archiving old data from huge data warehouse environments, businesses can improve response times and reduce costs by reclaiming valuable storage capacity. By creating realistic, rightsized data sources for testing, they can enhance the accuracy of testing and identify problems early in the testing cycle. And by implementing data masking capabilities, they can protect sensitive data and help ensure compliance with privacy regulations.

As a result, organizations gain more control of their IT budget while simultaneously helping their big data and data warehouse environments run more efficiently and reducing the risk of exposure of sensitive data.

InfoSphere Optim supports major big data and data warehouse environments, including IBM PureData for Analytics, IBM PureData for Transactions, IBM InfoSphere BigInsights, Teradata, Oracle and popular Hadoop distributions. It also supports enterprise databases and operating systems, including IBM DB2, Oracle Database, Sybase, Microsoft SQL Server, IBM Informix, IBM IMS, IBM Virtual Storage Access Method (VSAM), Microsoft Windows, UNIX, Linux and IBM z/OS. In addition, InfoSphere Optim supports key enterprise resource planning (ERP) and customer relationship management (CRM) applications such as Oracle E-Business Suite, PeopleSoft Enterprise, JD Edwards EnterpriseOne, Siebel, Amdocs CRM and the SAP ERP and CRM applications, as well as many custom applications.

The value of test data management at a US insurance company

With 42 high-volume back-end systems needed to generate a full end-to-end system test, a US insurance company could not confidently launch new features. Testing in production was becoming the norm. In fact, claims could not be processed in certain states because of application defects that the teams skipped over during the testing process. IT was consuming an increasing number of resources, yet application quality was declining rapidly.

After implementing a process to govern test data management, the insurance company reduced the costs of testing by USD400,000 per year. Today, the company can easily refresh 42 test systems from across the organization in record time while finding defects in advance.

The business value from implementing test data management included:

Cost savings of approximately USD500,000 per year
44 percent fewer untested scenarios
41 percent less labor required over 12 months


Why InfoSphere?
As the foundation of the IBM big data platform,
InfoSphere provides market-leading
functionality across all the capabilities of
information integration and governance.
It is designed to handle the challenges of
big data by providing optimal scale and
performance for massive data volumes,
agile and rightsized integration and
governance for the increasing velocity of
data, and support for a wide variety of data
types and big data systems. InfoSphere
helps make big data and analytics projects
successful by delivering the confidence to
act on insight.

InfoSphere capabilities include:

Metadata, business glossary and policy management: Define metadata, business terminology and governance policies with IBM InfoSphere Business Information Exchange.

Data integration: Handle all integration requirements, including batch data transformation and movement (InfoSphere Information Server), real-time replication (InfoSphere Data Replication) and data federation (InfoSphere Federation Server).

Data quality: Parse, standardize, validate and match enterprise data with InfoSphere Information Server for Data Quality.

Master data management: Act on a trusted view of your customers, products, suppliers, locations and accounts with InfoSphere MDM.

Data lifecycle management: Manage data throughout its lifecycle, from requirements through retirement, with InfoSphere Optim test data automation and database archiving capabilities.

Data security and privacy: Continuously monitor data access and protect repositories from data breaches, and support compliance with IBM InfoSphere Guardium. Ensure sensitive data is masked and protected with InfoSphere Optim.


Additional resources
Ready to get started? Take a self-service
InfoSphere Optim Business Value
Assessment and show the ROI results
to your big data project owner.

To learn more about InfoSphere Optim, check out these resources:

Manage the Data Lifecycle of Big Data Environments

IBM InfoSphere Optim solutions for data warehouses

Demo: IBM InfoSphere Optim Data Growth Solution

Demo: IBM InfoSphere Optim Test Data Management Solution

To learn more about the IBM approach to information integration and governance
for big data, please contact your IBM representative or IBM Business Partner,
or visit: ibm.com/software/data/information-integration-governance

© Copyright IBM Corporation 2013


IBM Corporation
Software Group
Route 100
Somers, NY 10589
Produced in the United States of America
August 2013
IBM, the IBM logo, ibm.com, BigInsights, DB2, Guardium, IMS,
Informix, InfoSphere, Optim, PureData, and z/OS are trademarks
of International Business Machines Corp., registered in many
jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml
Linux is a registered trademark of Linus Torvalds in the United
States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are
trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United
States and other countries.
This document is current as of the initial date of publication and
may be changed by IBM at any time. Not all offerings are available
in every country in which IBM operates.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED
AS IS WITHOUT ANY WARRANTY, EXPRESS OR
IMPLIED, INCLUDING WITHOUT ANY WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND ANY WARRANTY OR CONDITION OF
NON-INFRINGEMENT. IBM products are warranted according
to the terms and conditions of the agreements under which they
are provided.

1 IBM 2012 Global Reputational Risk and IT Study. ibm.com/services/us/gbs/bus/html/risk_study-2012-infographic.html

2 Yuhanna, Noel. Your Enterprise Data Archiving Strategy. Forrester, February 2011. ftp://ftp.boulder.ibm.com/software/data/sw-library/data-management/optim/papers/your-enterprise-data-archiving-strategy.pdf

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.
Please Recycle

IMM14126-USEN-00
