Hadoop Connector User Guide
© 2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording, or otherwise) without the prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.
Abstract
The Hadoop Connector user guide provides a brief introduction to the connector and its features. The guide provides detailed information on setting up the connector and running data synchronization (DSS) tasks, and gives a brief overview of the supported features and task operations that can be performed using the Hadoop connector.
Table of Contents
Overview
Hadoop
Hadoop Plugin
Supported Objects and Task Operations
Enabling Hadoop Connector
    Instructions while installing the Secure Agent
Creating a Hadoop Connection as a Source
    JDBC URL
    JDBC Driver Class
    Installation Paths
    Setting Hadoop Classpath for various Hadoop Distributions
Creating Hadoop Data Synchronization Task (Source)
Enabling a Hadoop Connection as a Target
Creating Hadoop Data Synchronization Task (Target)
Data Filters
Troubleshooting
    Increasing Secure Agent Memory
    Additional Troubleshooting Tips
Known Issues
Overview
Informatica Cloud connector SDKs are off-cycle, off-release add-ins that provide data integration with SaaS and on-premises applications that are not natively supported by Informatica Cloud. The cloud connectors are specifically designed to address the most common use cases for each individual application, such as moving data into the cloud and retrieving data from the cloud.
Once the Hadoop cloud connector is enabled for your ORG ID, you need to create a connection in Informatica Cloud to access the connector.
Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed
computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service on top
of a cluster of computers, each of which may be prone to failures.
Hadoop Plugin
The Informatica Hadoop connector allows you to perform Query and Insert operations on
Hadoop.
- The plug-in supports Cloudera 5.0, MapR 3.1, Pivotal HD 2.0, Amazon EMR, and Hortonworks 2.1, and has been certified to work on CDH 4.2 and HDP 1.1.
- The Informatica Cloud Secure Agent must be installed on one of the nodes of the Hadoop cluster where HiveServer or HiveServer2 is running.
- The plug-in is used as a target to insert data into Hadoop.
- The plug-in connects to Hive and Cloudera Impala to perform the relevant data operations.
- The plug-in can easily be integrated with Informatica Cloud.
- The plug-in supports all operators supported in HiveQL.
- The plug-in supports the AND conjunction between filters. It supports both AND and OR conjunctions in advanced filters.
- The plug-in supports filtering on all filterable columns in Hive/Impala tables.
Supported Objects and Task Operations

Task Operation    DSS Source    DSS Target
Query             ✓             NA
Insert            NA            ✓
Update            NA            NA
Upsert            NA            NA
Delete            NA            NA
Data Preview      NA            NA
Look Up           NA            NA

✓ : Supported
NA : Not Applicable
Connection Property: Description

- Connection Name
- Description
- Type
- Secure Agent
- Username
- Password
- JDBC Connection URL: Mention the JDBC URL used to connect to the Hadoop component. Refer to JDBC URL.
- Driver: Mention the JDBC driver class used to connect to the Hadoop component. Refer to JDBC Driver Class.
- Commit Interval: Mention the commit interval. It is the batch size (in rows) of data loaded into Hive.
- Hadoop Installation Path: Mention the installation path of the Hadoop component* used to connect to Hadoop. Only one of these installation paths is mandatory.
- HDFS Installation Path
- HBase Installation Path
- Impala Installation Path
- Miscellaneous Library Path
- Enable Logging

Note: Installation paths are the paths where the Hadoop jars are located. The connector loads the libraries from one or more of these paths before sending any instructions to Hadoop.
If you do not want to mention the installation paths, you can generate the setHadoopclasspath.sh file for Amazon, Hortonworks, and MapR.
Refer to Setting Hadoop Classpath for various Hadoop Distributions.
7. Click Test to evaluate the connection.
8. Click OK to save the connection.
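The Commit Interval property above is the batch size (in rows) used when loading data into Hive. Purely as an illustration of the batching idea (this is not the connector's actual code), the chunking can be sketched in Python:

```python
def batches(rows, commit_interval):
    """Yield rows in chunks of `commit_interval` rows, mirroring how a
    commit interval groups inserted rows into batches (illustrative)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == commit_interval:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch
```

With a commit interval of 2, five rows would be loaded as three batches of sizes 2, 2, and 1.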
JDBC URL
The connector connects to different components of Hadoop using JDBC. The URL format and
parameters vary among components. Hive uses the JDBC URL format mentioned below:
jdbc:<hive/hive2>://<server>:<port>/<schema>

The significance of the URL parameters is discussed below:
- hive/hive2: protocol information, depending on the version of the Thrift server used; hive for HiveServer and hive2 for HiveServer2.
- server, port: the server and port where the Thrift server is running.
- schema: the Hive schema that the connector needs to access.

For example, jdbc:hive2://invrlx63iso7:10000/default connects to the default schema of Hive, using the HiveServer2 Thrift server that starts on the server invrlx63iso7 on port 10000.
The Hive Thrift server must be running for the connector to communicate with Hive. The command to start the Thrift server is:

hive --service hiveserver2
Cloudera Impala uses the JDBC URL format given below:
jdbc:hive2://<server>:<port>/;auth=<auth mechanism>
In this case, the parameter auth must be set to the authentication mechanism used by the Impala server, such as noSasl or Kerberos. For example, jdbc:hive2://invrlx63iso7:21050/;auth=noSasl connects to the default schema of Impala.
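The two URL formats can be summed up in a pair of helper functions. This is an illustrative Python sketch (the connector itself takes the URL as a connection property; the function names are ours, not part of any product API):

```python
def hive_jdbc_url(server, port, schema, protocol="hive2"):
    """Build a Hive JDBC URL: protocol 'hive' targets HiveServer,
    'hive2' targets HiveServer2."""
    return f"jdbc:{protocol}://{server}:{port}/{schema}"

def impala_jdbc_url(server, port, auth="noSasl"):
    """Build a Cloudera Impala JDBC URL; `auth` names the
    authentication mechanism expected by the Impala server."""
    return f"jdbc:hive2://{server}:{port}/;auth={auth}"
```

For instance, hive_jdbc_url("invrlx63iso7", 10000, "default") reproduces the HiveServer2 example above.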
Installation Paths
The following table displays sample installation paths for different Hadoop distributions:
Distribution      Installation Paths
Cloudera 5 VM     /opt/cloudera/parcels/CDH-5.0.01.cdh5.0.0.p0.47/lib/hadoop
                  /opt/cloudera/parcels/CDH-5.0.01.cdh5.0.0.p0.47/lib/hive
                  /usr/lib/hadoop
                  /usr/lib/hive
Amazon EMR        /home/hadoop
                  /home/hadoop/hive/hive-0.11.0
MapR              /opt/mapr/hadoop/hadoop-0.20.2
                  /opt/mapr/hive/hive-0.12
Pivotal HD 2.0    /usr/lib/gphd/hadoop
                  /usr/lib/gphd/hive
Note: If you do not mention the installation paths, you can instead set the Hadoop classpath and then proceed with the connection configuration and the creation of DSS tasks.
Note: If you want to generate the setHadoopConnectorClasspath.sh file again, delete the existing one and regenerate it.
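A generated classpath script essentially enumerates the jars found under the installation paths. The following Python sketch shows the idea; this is assumed behavior for illustration, not the actual contents of setHadoopConnectorClasspath.sh:

```python
import glob
import os

def build_hadoop_classpath(install_paths):
    """Collect every .jar found under the given installation paths and
    join them into a single classpath string (illustrative sketch only)."""
    jars = []
    for path in install_paths:
        # '**' with recursive=True also matches jars in subdirectories
        pattern = os.path.join(path, "**", "*.jar")
        jars.extend(sorted(glob.glob(pattern, recursive=True)))
    return os.pathsep.join(jars)
```

Pointing it at, say, /usr/lib/hadoop and /usr/lib/hive would yield a classpath covering both installation paths.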
After completing the above steps, if the Hadoop classpath does not point to the correct classpath, you must execute the following steps to undo the commands executed above:
1. Enter vi saas-infaagentapp.sh
2. Press i to enter insert mode.
3. Press Delete or Backspace to delete the following entries:
5. In the command prompt window, mention the path where this file resides, that is, InfaAgentDir/main/tomcat/saas-infaagentapp.sh.
6. Specify the Task Name, provide a Description and select the Task Operation Insert.
7. Click Next.
8. The Source tab appears.
9. Select the source Connection, Source Type and Source Object to be used for the task.
10. Click Next.
11. The Target tab appears. Select the target Connection and Target Object required for the
task.
After you Save and Run the task, you are redirected to the monitor log page, where you can monitor the status of the data synchronization tasks.
6. Specify the values to the connection parameters. Refer Creating a Hadoop Connection as a
Source.
7. Click Test to evaluate the connection.
8. Click OK to save the connection.
5. Specify the Task Name, provide a Description and select the Task Operation Insert.
6. Click Next.
7. The Source tab appears.
8. Select the source Connection, Source Type and Source Object to be used for the task.
9. Click Next.
10. The Target tab appears. Select the target Connection and Target Object required for the
task.
After you Save and Run the task, you are redirected to the monitor log page, where you can monitor the status of the data synchronization tasks.
Data Filters
Data filters help you fetch specific data based on the APIs configured in the Config.csv file. The data synchronization task processes the data based on the filter field assigned.
Note: Advanced data filters are not supported by the Hadoop Connector.
Field: Description

- Object
- Filter By
- Operator: Select the Equals operator. Only the Equals operator is supported with this release.
- Filter Value

6. Click OK.
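Conceptually, the filter fields above translate into a HiveQL WHERE clause: simple filters use the Equals operator joined by AND, while advanced filters may also use OR. The following Python sketch illustrates that translation; it is our own simplification, not the connector's actual query generation, and does no value escaping:

```python
def where_clause(filters, conjunction="AND"):
    """Turn (column, value) filter pairs into a HiveQL WHERE clause
    using the Equals operator (illustrative; no quoting/escaping)."""
    terms = [f"{column} = '{value}'" for column, value in filters]
    return "WHERE " + f" {conjunction} ".join(terms)
```

For example, two filters on country and state would combine with AND into a single WHERE clause.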
Troubleshooting
Increasing Secure Agent Memory
To overcome memory issues faced by the Secure Agent, follow the steps given below:
1. In the Informatica Cloud home page, click Configuration.
2. Select Secure Agents.
3. The Secure Agents page appears.
4. From the list of available secure agents, select the secure agent for which you want to increase memory.
5. Click the pencil icon corresponding to the secure agent. The pencil icon is used to edit the secure agent.
6. The Edit Agent page appears.
7. In the System Configuration section, select the Type as DTM.
8. Edit JVMOption1 to -Xmx512m.
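Step 8 raises the DTM's maximum Java heap via the -Xmx flag. As a small illustration of what that edit does to the option string (the real change is made on the Edit Agent page, not in code), a sketch of the substitution:

```python
import re

def set_max_heap(jvm_options, new_size_mb):
    """Replace an existing -Xmx<size> setting (e.g. -Xmx256m) with
    -Xmx<new_size_mb>m, as done when editing JVMOption1 (illustrative)."""
    return re.sub(r"-Xmx\d+[mMgG]", f"-Xmx{new_size_mb}m", jvm_options)
```

For instance, an agent configured with -Xmx256m would end up with -Xmx512m after the edit.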
Known Issues
- The connector is currently certified to work with Cloudera CDH 4.2 and Hortonworks HDP 1.1.
- The connector may encounter a java.lang.OutOfMemoryError exception while fetching large data sets for tables with a large number of columns (for example, 5 million rows for a 15-column table). In such scenarios, restrict the result set by adding appropriate filters or by decreasing the number of field mappings.
- The Enable Logging connection parameter is a placeholder for a future release, and its state has no impact on connector functionality.
- The connector has been certified and tested on Hadoop's pseudo-distributed mode. Performance is a factor of the Hadoop cluster setup.
- Ignore log4j initialization warnings in the session logs.