
Hadoop

User Guide

2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means
(electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and
product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such
owners.

Abstract
The Hadoop Connector user guide provides a brief introduction to Informatica Cloud connectors
and their features. The guide provides detailed information on setting up the connector and
running data synchronization (DSS) tasks, along with an overview of the supported features and
task operations that can be performed using the Hadoop connector.

Table of Contents
Overview
Hadoop
Hadoop Plugin
Supported Objects and Task Operations
Enabling Hadoop Connector
    Instructions while installing the Secure Agent
Creating a Hadoop Connection as a Source
    JDBC URL
    JDBC Driver class
    Installation Paths
    Setting Hadoop Classpath for various Hadoop Distributions
Creating Hadoop Data Synchronization Task (Source)
Enabling a Hadoop Connection as a Target
Creating Hadoop Data Synchronization Task (Target)
Data Filters
Troubleshooting
    Increasing Secure Agent Memory
    Additional Troubleshooting Tips
Known Issues

Overview
Informatica Cloud connector SDKs are off-cycle, off-release add-ins that provide data integration
with SaaS and on-premise applications that are not supported natively by Informatica Cloud. The
cloud connectors are designed to address the most common use cases, such as moving data into
the cloud and retrieving data from the cloud, for each individual application.

Figure 1: Informatica Cloud Architecture

Once the Hadoop cloud connector is enabled for your Org ID, you need to create a connection in
Informatica Cloud to access the connector.

Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed
computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service on top
of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects include:

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, including support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.
Cloudera Impala: The industry's leading massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop. The Apache-licensed, open source Impala project combines modern, scalable parallel database technology with the power of Hadoop, enabling users to directly query data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is designed from the ground up as part of the Hadoop ecosystem and shares the same flexible file and data formats, metadata, security, and resource management frameworks used by MapReduce, Apache Hive, Apache Pig, and other components of the Hadoop stack.

Hadoop Plugin
The Informatica Hadoop connector allows you to perform Query and Insert operations on
Hadoop.
The plug-in supports Cloudera 5.0, MapR 3.1, Pivotal HD 2.0, Amazon EMR, and Hortonworks
2.1, and has been certified to work on CDH 4.2 and HDP 1.1.
The Informatica Cloud Secure Agent must be installed on one of the nodes of the Hadoop cluster
where HiveServer or HiveServer2 is running.
The plug-in is used as a target to insert data into Hadoop.
The plug-in connects to Hive and Cloudera Impala to perform the relevant data operations.
The plug-in can easily be integrated with Informatica Cloud.
The plug-in supports all operators supported in HiveQL.
The plug-in supports the AND conjunction between filters. It supports both AND and OR
conjunctions in advanced filters.
The plug-in supports filtering on all filterable columns in Hive/Impala tables.

Supported Objects and Task Operations


The table below lists the objects and task operations supported by the Hadoop connector.

All tables in Hive:
  Supported: DSS Source, DSS Target, Query, Insert, Data Preview
  Not applicable: Upsert, Delete, Look Up, Update

All tables in Impala:
  Supported: DSS Source, Query, Data Preview
  Not applicable: DSS Target, Insert, Upsert, Delete, Look Up, Update

Enabling Hadoop Connector


To enable the Hadoop connector, get in touch with Informatica Support or your Informatica
representative. After the connector is enabled, it usually takes about 15 minutes for it to download
to the Secure Agent.

Instructions while installing the Secure Agent


Follow these instructions while installing the Secure Agent:
You must install the Secure Agent on the Hadoop cluster. If you install it outside the
Hadoop cluster, you can only read from Hadoop; you cannot write into Hadoop.
You must install the Secure Agent on the node where HiveServer2 is running.

Creating a Hadoop Connection as a Source


To use the Hadoop connector in a data synchronization task, you must create a connection in
Informatica Cloud.
The following steps help you to create a Hadoop connection in Informatica Cloud.
1. In the Informatica Cloud home page, click Configure.
2. From the drop-down menu, select Connections.
3. The Connections page appears.
4. Click New to create a connection.
5. The New Connection page appears.


Figure 2: Connection Parameter

6. Specify the values for the connection parameters.

Connection Name: Enter a unique name for the connection.
Description: Provide a relevant description for the connection.
Type: Select Hadoop from the list.
Secure Agent: Select the appropriate Secure Agent from the list.
Username: Enter the username of the schema of the Hadoop component.
Password: Enter the password of the schema of the Hadoop component.
JDBC Connection URL: Enter the JDBC URL used to connect to the Hadoop component. Refer to JDBC URL.
Driver: Enter the JDBC driver class used to connect to the Hadoop component. Refer to JDBC Driver class.
Commit Interval: Enter the commit interval, that is, the batch size (in rows) of data loaded into Hive.
Hadoop Installation Path: Enter the installation path of the Hadoop component used to connect to Hadoop.
Hive Installation Path: Enter the Hive installation path.
HDFS Installation Path: Enter the HDFS installation path.
HBase Installation Path: Enter the HBase installation path.
Impala Installation Path: Enter the Impala installation path.
Miscellaneous Library Path: Enter the path to any additional libraries used to communicate with Hadoop.
Enable Logging: Select the Enable Logging check box to enable verbose log messages.

Note: The installation paths are the paths where the Hadoop jars are located. The connector
loads the libraries from one or more of these paths before sending any instructions to Hadoop.
If you do not want to specify the installation paths, you can instead generate the
setHadoopConnectorClasspath.sh file for Amazon EMR, Hortonworks, and MapR.
Refer to Setting Hadoop Classpath for various Hadoop Distributions.
7. Click Test to evaluate the connection.
8. Click OK to save the connection.

JDBC URL
The connector connects to different components of Hadoop using JDBC. The URL format and
parameters vary among components. Hive uses the JDBC URL format shown below:

jdbc:<hive/hive2>://<server>:<port>/<schema>

The URL parameters are as follows:
hive/hive2: the protocol, which depends on the version of the Thrift server used; hive for HiveServer and hive2 for HiveServer2.
server, port: the server and port where the Thrift server is running.
schema: the Hive schema that the connector needs to access.
For example, jdbc:hive2://invrlx63iso7:10000/default connects to the default schema of Hive
through the HiveServer2 Thrift server that runs on the server invrlx63iso7 on port 10000.

The Hive Thrift server must be running for the connector to communicate with Hive.
The command to start the Thrift server is: hive --service hiveserver2
Cloudera Impala uses the JDBC URL format shown below:

jdbc:hive2://<server>:<port>/;auth=<auth mechanism>

In this case, the auth parameter must be set to the security mechanism used by the Impala
server, for example Kerberos. For example, jdbc:hive2://invrlx63iso7:21050/;auth=noSasl
connects to the default schema of Impala on a cluster with no authentication configured.
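
Before you enter the URL in the connection, it can help to verify it from the node that runs the
Secure Agent. The following is a minimal sketch using beeline, the JDBC command-line client that
ships with Hive 0.11 and later; the host, port, and credentials are taken from the examples above
and are placeholders for your own values.

    # Verify the HiveServer2 URL used in the connection (host and port from the example above):
    beeline -u "jdbc:hive2://invrlx63iso7:10000/default" -n hive -p hive -e "SHOW TABLES;"

    # Verify the Impala URL on an unsecured cluster (auth=noSasl, as in the example above):
    beeline -u "jdbc:hive2://invrlx63iso7:21050/;auth=noSasl" -e "SHOW TABLES;"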

JDBC Driver class


The JDBC driver class tends to vary among Hadoop components. For example,
org.apache.hive.jdbc.HiveDriver is the driver class for both Hive and Impala.

Installation Paths
The following table displays sample installation paths for different Hadoop distributions:

Cloudera 5 VM
  Default Hadoop installation path: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop
  Default Hive installation path: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hive

Hortonworks 2.1 Sandbox
  Default Hadoop installation path: /usr/lib/hadoop
  Default Hive installation path: /usr/lib/hive

Amazon EMR
  Default Hadoop installation path: /home/hadoop
  Default Hive installation path: /home/hadoop/hive/hive-0.11.0

MapR 3.1 demo
  Default Hadoop installation path: /opt/mapr/hadoop/hadoop-0.20.2
  Default Hive installation path: /opt/mapr/hive/hive-0.12

Pivotal HD 2.0
  Default Hadoop installation path: /usr/lib/gphd/hadoop
  Default Hive installation path: /usr/lib/gphd/hive

Note: If you do not specify the installation paths, you can set the Hadoop classpath instead and
proceed with the connection configuration and DSS task creation.
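
As a quick sanity check before you enter these values, you can confirm that the paths actually
contain the component jar files. A minimal sketch for the Hortonworks 2.1 Sandbox paths listed
above; adjust the paths for your own distribution.

    # Hadoop jars should be present under the Hadoop installation path:
    ls /usr/lib/hadoop/*.jar

    # The Hive JDBC driver class (org.apache.hive.jdbc.HiveDriver) ships in a hive-jdbc*.jar:
    ls /usr/lib/hive/lib/hive-jdbc*.jar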

Setting Hadoop Classpath for various Hadoop Distributions


If you do not specify the installation paths in the connection parameters, you can perform the
connection operations by generating the setHadoopConnectorClasspath.sh file.
This section describes how to set the classpath for the various Hadoop distributions.
Follow this procedure to generate setHadoopConnectorClasspath.sh for Amazon EMR,
Hortonworks, and Pivotal.
1. Make the changes in the Infa_Agent_DIR/main/tomcat/saas-infaagentapp.sh file as shown in the figure.

2. Start the Agent as shown in the command prompt below.

3. Create the Hadoop connection using the connector.

4. Test the connection. This generates the setHadoopConnectorClasspath.sh file in the
Infa_Agent_DIR/main/tomcat path.

5. Stop the Agent by pressing Ctrl+C.

6. From Infa_Agent_DIR, execute the script using the command:
. ./main/tomcat/setHadoopConnectorClasspath.sh

7. Restart the Agent and execute the DSS tasks.

Note: If you want to generate the setHadoopConnectorClasspath.sh file again, delete the
existing one and regenerate it.
After completing the above steps, if the Hadoop classpath does not point to the correct
classpath, execute the following steps to undo the changes made above:
1. Enter vi saas-infaagentapp.sh
2. Press i to enter insert mode.
3. Press Delete or Backspace to delete the following entries:

4. Press the Escape key.

5. Type :wq to save and quit.
Once you complete this procedure, the entries are deleted. You can then follow the section below
to direct Hadoop to the correct classpath.

Directing the Hadoop classpath to the correct classpath


In certain cases, Hadoop may point to an incorrect classpath. Follow the procedure given
below to direct it to the correct classpath.
1. Enter the command hadoop classpath from the terminal. This displays the stream of jars.

2. Copy and paste the above stream into a text file.

3. Delete the following entries from the file:
a. :/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar
b. :/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar (retain the latest version and delete the earlier one)
4. Copy the remaining content and export it to a variable called HADOOP_CLASSPATH (see the
sketch at the end of this section).

5. In the command prompt window, mention the path where this file resides, that is,
InfaAgentDir/main/tomcat/saas-infaagentapp.sh.

6. Now follow the steps for generating setHadoopConnectorClasspath.sh mentioned above.
Refer to Setting Hadoop Classpath for various Hadoop Distributions.
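
The following is a minimal sketch of steps 1 through 5, assuming a MapR layout like the one in
the entries above; the temporary file name is arbitrary, and the entries you remove depend on
what hadoop classpath reports on your node.

    # Step 1: capture the classpath reported by Hadoop into a temporary file.
    hadoop classpath > /tmp/hadoop_classpath.txt

    # Step 3: edit /tmp/hadoop_classpath.txt and delete the problem entries, for example
    #   :/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar
    #   and duplicate commons-logging-api jars (keep only the latest version).

    # Steps 4 and 5: export the cleaned value; adding this line to
    # InfaAgentDir/main/tomcat/saas-infaagentapp.sh keeps it across Agent restarts.
    export HADOOP_CLASSPATH="$(cat /tmp/hadoop_classpath.txt)"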

Creating Hadoop Data Synchronization Task (Source)


Note: You need to create a connection before getting started with a data synchronization task.
The following steps help you to set up a data synchronization task in Informatica Cloud.
The following example uses the task operation Insert, with Hadoop as the source (fetch/read).
1. In the Informatica Cloud home page, click Applications.
2. From the drop-down menu, select Data Synchronization.
3. The Data Synchronization page appears.
4. Click New to create a data synchronization task.
5. The Definition tab appears.

Figure 3: Definition Tab

6. Specify the Task Name, provide a Description and select the Task Operation Insert.


7. Click Next.
8. The Source tab appears.

Figure 4: Source Tab

9. Select the source Connection, Source Type and Source Object to be used for the task.
10. Click Next.
11. The Target tab appears. Select the target Connection and Target Object required for the
task.

Figure 5: Target Tab

12. Click Next.


13. In the Data Filters tab, Process all rows is selected by default.
14. Click Next.
15. In the Field Mapping tab, map the source fields to the target fields accordingly.


Figure 6: Field Mapping

16. Click Next.


17. The Schedule tab appears.
18. In the Schedule tab, you can schedule the task as required and save it.
19. If you do not want to schedule the task, click Save and Run to run it immediately.

Figure 7: Save and Run the Task

After you save and run the task, you are redirected to the monitor log page, where you can
monitor the status of the data synchronization tasks.

Enabling a Hadoop Connection as a Target


To use the Hadoop connector as a target in a data synchronization task, you must create a
connection in Informatica Cloud.
The following steps help you to create a Hadoop connection in Informatica Cloud.


1. In the Informatica Cloud home page, click Configure.
2. From the drop-down menu, select Connections.
3. The Connections page appears.
4. Click New to create a connection.
5. The New Connection page appears.

Figure 8: Connection Parameter

6. Specify the values for the connection parameters. Refer to Creating a Hadoop Connection as a
Source.
7. Click Test to evaluate the connection.
8. Click OK to save the connection.

Creating Hadoop Data Synchronization Task (Target)


Note: You need to create a connection before getting started with a data synchronization task.
The following steps help you to set up a data synchronization task in Informatica Cloud.
The following example uses the task operation Insert, with Hadoop as the target.
1. In the Informatica Cloud home page, click Applications. From the drop-down menu, select
Data Synchronization.
2. The Data Synchronization page appears.
3. Click New to create a data synchronization task.


4. The Definition tab appears.

Figure 9: Definition Tab

5. Specify the Task Name, provide a Description and select the Task Operation Insert.
6. Click Next.
7. The Source tab appears.

Figure 10: Source Tab

8. Select the source Connection, Source Type and Source Object to be used for the task.
9. Click Next.
10. The Target tab appears. Select the target Connection and Target Object required for the
task.


Figure 11: Target Tab

11. Click Next.


12. In the Data Filters tab, Process all rows is selected by default. See also Data Filters.
13. Click Next.
14. In the Field Mapping tab, map the source fields to the target fields accordingly.

Figure 12: Field Mapping

15. Click Next.


16. The Schedule tab appears.
17. In the Schedule tab, you can schedule the task as required and save it.
18. If you do not want to schedule the task, click Save and Run to run it immediately.


Figure 13: Save and Run the Task

After you save and run the task, you are redirected to the monitor log page, where you can
monitor the status of the data synchronization tasks.

Data Filters
Data filters help you to fetch specific data from the source based on the filter fields you assign.
The data synchronization task processes the data based on the assigned filter field.
Note: Advanced data filters are not supported by the Hadoop connector.

The following steps help you to use data filters.


1. In the data synchronization task, select the Data Filters tab.
2. The Data Filters tab appears.
3. Click New as shown in the figure below.

Figure 14: Data Filters

4. The Data Filter dialog box appears.


Figure 15: Data Filters-2

5. Specify the following details.

Object: Select the object for which you want to assign filter fields.
Filter By: Select the filter field.
Operator: Select the Equals operator. Only the Equals operator is supported with this release (see the example following these steps).
Filter Value: Enter the filter value.

6. Click OK.
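
The connector supports the operators available in HiveQL, so an Equals filter can be thought of
as an equality predicate in the underlying query. The sketch below previews the rows such a filter
would select, using beeline; the table name customers and the column country are hypothetical,
chosen only for illustration.

    # Hypothetical table and column; an Equals filter on country with value US
    # corresponds to the WHERE clause shown here.
    beeline -u "jdbc:hive2://invrlx63iso7:10000/default" \
        -e "SELECT * FROM customers WHERE country = 'US' LIMIT 10;"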

Troubleshooting
Increasing Secure Agent Memory
To overcome memory issues faced by the Secure Agent, follow the steps given below.
1. In the Informatica Cloud home page, click Configure.
2. Select Secure Agents.
3. The Secure Agents page appears.
4. From the list of available Secure Agents, select the Secure Agent for which you want to
increase memory.
5. Click the pencil icon corresponding to the Secure Agent. The pencil icon is used to edit the
Secure Agent.
6. The Edit Agent page appears.
7. In the System Configuration section, select the Type as DTM.
8. Set JVMOption1 to -Xmx512m as shown in the figure below.


Figure 16: Increasing Secure Agent Memory-1

9. Again in the System Configuration section, select the Type as TomCatJRE.

10. Set INFA_memory to -Xms256m -Xmx512m as shown in the figure below.

Figure 17: Increasing Secure Agent Memory-2

11. Restart the Secure Agent.

The Secure Agent memory has been increased successfully.

Additional Troubleshooting Tips


When the connection is used as a target, the last batch of the insert load is not reflected in the
record count. Refer to the session logs for the record count of the last batch inserted.
For example, if the commit interval is set to 1 million and 1.1 million rows are actually inserted,
the record count in the UI shows 1 million and the session logs show the row count of the
remaining 100,000 records.
Set the commit interval to the highest value possible before a java.lang.OutOfMemoryError is
encountered.
When the connection is used as a target to load data into Hadoop, ensure that all the fields
are mapped.
After a data load in Hive, Impala must be refreshed manually for the latest changes to the
table to be reflected in Impala. In the current version, the connector does not automatically
refresh Impala after a Hive insert.

Known Issues
The connector is currently certified to work with Cloudera CDH 4.2 and Hortonworks HDP
1.1.
The connector may encounter a java.lang.OutOfMemoryError while fetching large data
sets from tables with a large number of columns (for example, 5 million rows for a 15-column
table). In such scenarios, restrict the result set by adding appropriate filters or by decreasing
the number of field mappings.
The Enable Logging connection parameter is a placeholder for a future release, and its state
has no impact on connector functionality.
The connector has been certified and tested in Hadoop's pseudo-distributed mode.
Performance depends on the Hadoop cluster setup.
Ignore log4j initialization warnings in the session logs.
