
What is the relation between EME, GDE and Co>Operating System?

EME stands for Enterprise Meta>Environment, GDE for Graphical Development Environment, and the Co>Operating System is the Ab Initio server. The relation between them is as follows: the Co>Operating System is installed on a particular OS platform, which is called the native OS. The EME is, much like the repository in Informatica, the place where metadata, transformations, DB config files and source/target information are held. The GDE is the end-user environment where graphs (the equivalent of mappings in Informatica) are developed; the designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The GDE sits on the user side, while the EME sits on the server side.

What is the use of Aggregate when we have Rollup? As we know, the Rollup component in Ab Initio is used to summarize groups of data records; where would we use Aggregate?

Aggregate and Rollup can both summarize data, but Rollup is more convenient to use and makes it much clearer how a particular summarization is carried out. Rollup also offers additional functionality such as input and output filtering of records. Both perform the same basic action; Rollup exposes intermediate results in main memory, whereas Aggregate does not support intermediate results.

What kinds of layouts does Ab Initio support?

Ab Initio supports serial and parallel layouts, and a graph can have both at the same time. The parallel layout depends on the degree of data parallelism: if the multifile system is 4-way parallel, a component in the graph can run 4-way parallel provided its layout matches that degree of parallelism.

How can you run a graph infinitely?

To run a graph infinitely, the end script in the graph should call the .ksh file of the graph. So if the graph is named abc.mp, the end script should contain a call to abc.ksh; this way the graph runs again each time it finishes.

How do you add default rules in transformer?

Double-click the transform parameter on the Parameters tab of the component properties to open the Transform Editor. In the Transform Editor, choose Edit > Add Default Rules from the menu. It shows two options: 1) Match Names 2) Wildcard.

Do you know what a local lookup is?

If the lookup file is a multifile partitioned/sorted on a particular key, the lookup_local function can be used ahead of the lookup function call. It is local to a particular partition, depending on the key. A lookup file consists of data records that can be held in main memory, which lets the transform function retrieve records much faster than retrieving them from disk and allows the transform component to process records from multiple files quickly.

What is the difference between look-up file and look-up, with a relevant example?

Generally a lookup file represents one or more serial files (flat files) whose volume is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could from disk. A lookup is a component of an Ab Initio graph where we can store data and retrieve it using a key parameter; a lookup file is the physical file where the data for the lookup is stored.

How many components were in your most complicated graph?

It depends on the type of components you use; it is usually best to avoid putting too many complicated transform functions in a single graph.

Explain what is lookup?

Lookup is basically a specific dataset which is keyed. It can be used to map values according to the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup if one of the inputs to the join has a small number of records with a slim record length. Ab Initio has built-in functions to retrieve values from the lookup using its key.

What is a ramp limit?

The limit parameter contains an integer representing an absolute number of allowed reject events, and the ramp parameter contains a real number (from 0 to 1) representing a rate of reject events relative to the number of records processed. Together they define the threshold of allowed bad records: number of bad records allowed = limit + ramp * number of records processed so far. For example, with limit = 10 and ramp = 0.01, after 5,000 records the component tolerates up to 10 + 0.01 * 5000 = 60 rejects before aborting.

Have you worked with packages?

Multistage transform components use packages by default. A user can also create his own set of functions in a transform file and include that file in other transform files.
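For illustration, a minimal sketch of a small user-defined function kept in a shared transform file and reused elsewhere; the file name, path form and function name are assumptions, and only built-in functions already shown in this document are used:

/* --- common_functions.xfr (hypothetical shared transform file) --- */
out :: first_n_chars(s, n) =
begin
  /* trim leading/trailing blanks, then keep the first n characters */
  out :: string_substring(string_lrtrim(s), 1, n);
end;

/* --- a reformat xfr that reuses it (the include path is an assumption) --- */
include "$AI_XFR/common_functions.xfr";

out :: reformat(in) =
begin
  out.short_name :: first_n_chars(in.full_name, 10);
  out.*          :: in.*;
end;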

Have you used the Rollup component? Describe how.

If the user wants to group records on particular field values, Rollup is the best way to do it. Rollup is a multi-stage transform function and contains the following mandatory functions: 1. initialize 2. rollup 3. finalize. You also need to declare a temporary variable if you want to get counts for a particular group. For each group, Rollup first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call.
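For illustration, a minimal sketch of an expanded-mode Rollup transform that counts the records in each group; the key and field names are assumptions:

type temporary_type =
  record
    integer(8) rec_count;
  end;

/* called once per group */
temp :: initialize(in) =
begin
  temp.rec_count :: 0;
end;

/* called once per record in the group */
temp :: rollup(temp, in) =
begin
  temp.rec_count :: temp.rec_count + 1;
end;

/* called once after the last record of the group */
out :: finalize(temp, in) =
begin
  out.cust_id   :: in.cust_id;      /* the rollup key (assumed field) */
  out.rec_count :: temp.rec_count;
end;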

How do you add default rules in transformer?

Add Default Rules opens the Add Default Rules dialog, where you select one of the following: Match Names, which generates a set of rules that copies input fields to output fields with the same name, or Use Wildcard (.*) Rule, which generates one rule that copies input fields to output fields with the same name. The steps are: 1) If it is not already displayed, display the Transform Editor grid. 2) Click the Business Rules tab if it is not already displayed. 3) Select Edit > Add Default Rules. In the case of a Reformat, if the destination field names are the same as, or a subset of, the source fields, there is no need to write anything in the reformat xfr, unless you want a real transform beyond reducing the set of fields or splitting the flow into a number of flows.
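For illustration, a minimal sketch of what the generated wildcard rule amounts to, assuming the output record format is a subset of the input:

out :: reformat(in) =
begin
  /* copy every input field that has a matching name in the output record;
     fields absent from the output DML are simply dropped */
  out.* :: in.*;
end;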

What is the difference between partitioning with key and round robin?

Partition by Key (hash partition) is a partitioning technique used when the keys are diverse. If a particular key value occurs in large volume there can be large data skew, but this method is the one used most often for parallel data processing. Round-robin partition is a partitioning technique that distributes the data uniformly across the destination partitions; the skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt to 4 players in a round-robin manner.

How do you improve the performance of a graph?

There are many ways the performance of a graph can be improved:
1) Use a limited number of components in a particular phase.
2) Use optimum max-core values for Sort and Join components.
3) Minimise the number of Sort components.
4) Minimise sorted Join components and, if possible, replace them with in-memory/hash joins.
5) Use only the required fields in Sort, Reformat and Join components.
6) Use phasing/flow buffers in the case of merges and sorted joins.
7) If both inputs are huge, use a sorted join; otherwise use a hash join with the proper driving port.
8) For a large dataset, don't use Broadcast as a partitioner.
9) Minimise the use of regular-expression functions like re_index in transform functions.
10) Avoid repartitioning data unnecessarily.
Try to run the graph as long as possible in MFS. For this, the input files should be partitioned and, if possible, the output file should also be partitioned.

How do you truncate a table?

From Ab Initio, use the Run SQL component with the DDL "truncate table <table_name>", or use the Truncate Table component in Ab Initio.

Have you ever encountered an error called "depth not equal"?

When two components are linked together and their layouts do not match, this problem can occur during the compilation of the graph. A solution is to place a partitioning component in between wherever the layout changes.

What is the function you would use to transfer a string into a decimal?

In this case no specific function is required if the sizes of the string and the decimal are the same; a decimal cast with the size in the transform function will suffice. For example, if the source field is defined as string(8) and the destination as decimal(8), then (say the field name is field1):

out.field1 :: (decimal(8)) in.field1;

If the destination field is smaller than the input, string_substring can be used, like the following (say the destination field is decimal(5)):

out.field1 :: (decimal(5))string_lrtrim(string_substring(in.field1, 1, 5)); /* string_lrtrim trims leading and trailing spaces */

What is a Graph in Ab Initio?

The ETL process in Ab Initio is represented by Ab Initio graphs. Graphs are formed by components (from the standard component library or custom), flows (data streams) and parameters.

What is Co>Operating System in Ab Initio?

Co>Operating System is a program provided by Ab Initio which operates on top of the operating system and is the base for all Ab Initio processes. It provides additional features known as air commands and can be installed on a variety of system environments such as Unix, HP-UX, Linux, IBM AIX and Windows systems. The Ab Initio Co>Operating System provides the following features:
- Manage and run Ab Initio graphs and control the ETL processes
- Provide Ab Initio extensions to the operating system
- ETL process monitoring and debugging
- Metadata management and interaction with the EME

What is the Ab Initio GDE (Graphical Development Environment)?

GDE is a graphical application for developers which is used for designing and running AbInitio graphs. It also provides:
- A user-friendly front end for designing Ab Initio ETL graphs
- The ability to run and debug Ab Initio jobs and trace execution logs
- Graph compilation: the GDE compilation process results in the generation of a UNIX shell script which may be executed on a machine without the GDE installed

What is Ab Initio EME?

Enterprise Meta>Environment (EME) is an Ab Initio repository and environment for storing and managing metadata. It can store both business and technical metadata. EME metadata can be accessed from the Ab Initio GDE, from a web browser, or from the Co>Operating System command line (air commands).

What is Conduct>It in Ab Initio?

Conduct>It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to create Ab Initio plans, which are a special type of graph constructed from other graphs and scripts. Ab Initio provides both a graphical and a command-line interface to Conduct>It.

What is the Data Profiler in Ab Initio?

The Data Profiler is a graphical data analysis tool which runs on top of the Co>Operating System. It can be used to characterize data range, scope, distribution, variance, and quality.

What kinds of parallelism are supported by Ab Initio?

Ab Initio implements parallelism in mainly three ways:
- Data parallelism: data is divided among many partitions known as multifiles, and during processing each partition is processed in parallel.
- Component parallelism: multiple components run in parallel, executing simultaneously on different branches of a graph.
- Pipeline parallelism: a record is processed in one component while a previous record is being processed in another component. Operations like sorting and aggregation break pipeline parallelism.

Explain what is lookup?


Lookup is basically a specific dataset which is keyed. It can be used to map values according to the data present in a particular file (serial or multifile). The dataset can be static or dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup if one of the inputs to the join has a small number of records with a slim record length. Ab Initio has built-in functions to retrieve values from the lookup using its key.
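For illustration, a minimal sketch of how such a built-in call looks inside a Reformat transform; the lookup label and the field names are assumptions:

out :: reformat(in) =
begin
  /* fetch the matching record from the keyed Lookup File labelled "Customer_Lookup"
     and take its cust_name field; all names here are assumed for illustration */
  out.cust_name :: lookup("Customer_Lookup", in.cust_id).cust_name;
  out.*         :: in.*;
end;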

What is the difference between lookup file and lookup?


- A lookup is a component of an Ab Initio graph where we can store data and retrieve it by using a key parameter.
- A lookup file is the physical file where the data for the lookup is stored.

What is a local lookup?

If the lookup file is a multifile partitioned/sorted on a particular key, the lookup_local function can be used ahead of the lookup function call. It is local to a particular partition, depending on the key. A lookup file consists of data records that can be held in main memory, which lets the transform function retrieve records much faster than retrieving them from disk and allows the transform component to process records from multiple files quickly.
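A minimal sketch of the difference in the call itself, assuming the incoming flow and the lookup multifile are partitioned on the same key (all names are assumptions):

out :: reformat(in) =
begin
  /* lookup_local searches only the lookup partition local to this data partition,
     so the flow must already be partitioned on acct_id the same way as the lookup */
  out.acct_status :: lookup_local("Account_Lookup", in.acct_id).status;
  out.*           :: in.*;
end;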

Describe the foreign key columns in a fact table and a dimension table.

Foreign keys of dimension tables are primary keys of entity tables. Foreign keys of fact tables are primary keys of dimension tables.

What is Data Mining?

Data Mining is the process of analyzing data from different perspectives and summarizing it into useful information.

What is the difference between a view and a materialized view?

A view takes the output of a query and makes it appear like a virtual table, and it can be used in place of tables. A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.

What is an ER Diagram?

Entity Relationship Diagrams are a major data modelling tool and help organize the data in your project into entities and define the relationships between them. This process has proved to enable the analyst to produce a good database structure so that the data can be stored and retrieved in the most efficient manner. An entity-relationship (ER) diagram is a specialized graphic that illustrates the interrelationships between entities in a database; this type of diagram is used in data modelling for relational databases and shows the structure of each table and the links between tables.

What is ODS?

ODS is the abbreviation of Operational Data Store: a database structure that is a repository for near real-time operational data rather than long-term trend data. The ODS may further become the enterprise-shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.

What is ETL?

ETL is the abbreviation of extract, transform, and load. ETL software enables businesses to consolidate their disparate data while moving it from place to place; it does not really matter that the data is in different forms or formats, and it can come from any source. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data, using rules or lookup tables or creating combinations with other data, to convert it to the desired state. Finally, the load function writes the resulting data to a target database.

What is VLDB?

VLDB is the abbreviation of Very Large Database. A one-terabyte database would normally be considered a VLDB. Typically, these are decision support systems or transaction processing applications serving large numbers of users.

Is an OLTP database design optimal for a data warehouse?

No. OLTP database tables are normalized, which adds time to queries returning results. Additionally, an OLTP database is smaller and does not contain data over a long period (many years), which is what needs to be analyzed. An OLTP system is basically an ER model, not a dimensional model. If a complex query is executed on an OLTP system, it may cause a heavy overhead on the OLTP server and affect normal business processes.

If de-normalization improves data warehouse processes, why is the fact table in normal form?

Foreign keys of fact tables are primary keys of dimension tables. Since the fact table mostly contains columns that are primary keys of other tables, it is by nature in normal form.

What are lookup tables?

A lookup table is a table placed against the target table based upon the primary key of the target; it updates the target by allowing only modified (new or updated) records through, based on the lookup condition.

What are aggregate tables?

An aggregate table contains a summary of existing warehouse data grouped to certain levels of dimensions. It is always easier to retrieve data from an aggregated table than to visit the original table with millions of records. Aggregate tables reduce the load on the database server, increase query performance and return results quickly.

What is real-time data warehousing?

Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.

What are conformed dimensions?

Conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. They are common across the cubes.

What is a conformed fact?

A conformed fact is a fact (measure) defined and calculated the same way across multiple data marts and fact tables, so that it can be compared and combined consistently.

How do you load the time dimension?

Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. A hundred years may be represented in a time dimension, with one row per day.

What is the level of granularity of a fact table?

Level of granularity means the level of detail that you put into the fact table in a data warehouse; it is the amount of detail you are willing to capture for each transactional fact.

What are non-additive facts?

Non-additive facts are facts that cannot be summed up over any of the dimensions present in the fact table. They are not useless, however; if the dimensions change, the same facts can still be useful.

What is a factless fact table?

A fact table that does not contain numeric fact columns is called a factless fact table.
How do you check objects in and out of the EME from the command line?

By using the command-line prompt: there are air commands for the check-in and check-out process. The general form of a check-in is:

air project import <sandbox path in EME> -basedir <sandbox path on our Unix machine> -files <folder of the file, e.g. mp, dml or pset>/<name of the file we want to check in>

For example:

air project import /Projects/ABC_COMPANY_NAME/APPLICATION_NAME/SANDBOX_NAME -basedir /ai/src/./././APPLICATION_NAME/SANDBOX_NAME -files mp/abc.mp

Check-in : air project import /project/... -basedir /ai/src/.../users/dev/.../sand/... -files <<'EOF'
Check-out : air project export /project/... -basedir /ai/src/.../users/dev/.../sand/... -files <<'EOF'
What is meant by fencing in Ab Initio?

In the software world, fencing means controlling jobs on a priority basis. In Ab Initio it actually refers to customized phase breaking: a well-fenced graph means that, no matter what the source data volume is, the process will not choke in deadlocks. It effectively limits the number of simultaneous processes.

In the Join component, which records go to the unused port and which go to the reject port?

In an inner join, all the records that do not match on the specified key go to the respective unused ports; in a full outer join, none of the records go to the unused ports. Records for which the join transform fails (for example, evaluates to NULL) go to the reject port, and the component aborts once the number of rejects exceeds limit + ramp * number_of_input_records_so_far.

How do we handle it if the DML changes dynamically?

It can be handled in the start-up script with dynamic SQL creation, building a dynamic DML so that there is no need to change the component afterwards. Two types of DML can be used: conditional DML and dynamic DML. Conditional DML is used whenever the output record flow stays the same but depends on conditional parameters.

What is the ABLOCAL expression and where do you use it in Ab Initio?

If you use an SQL SELECT statement to specify the source for Input Table, and the statement involves a complex query or a join of two or more tables in an unload, Input Table may be unable to determine the best way to run the query in parallel. In such cases, the GDE may return an error message suggesting you use ABLOCAL(tablename) in the SELECT statement to tell Input Table which table to use as the basis for the parallel unload. To do this, you put ABLOCAL(tablename) in the appropriate place in the WHERE clause of the SELECT statement, specifying the name of the "driving table" (often the largest table) as the single argument. When you run the graph, Input Table replaces the expression ABLOCAL(tablename) with the appropriate parallel query condition for that table. For example, suppose you want to join two tables, customer_info and acct_type, and customer_info is the driving table. You would code the SELECT statement as follows:

select * from acct_type, customer_info where ABLOCAL(customer_info) and customer_info.acctid = acct_type.id

Note that when using an alias for a table, you must give ABLOCAL(tablename) the alias name as well:

select * from acct_type, customer_info custinfo where ABLOCAL(customer_info custinfo) and custinfo.acctid = acct_type.id

What is the difference between the Generate Records component and the Create Data component?

There is no transform function in Generate Records, so it creates default data of its own according to the DML defined; to change that data you have to connect another component after Generate Records and modify it there. In Create Data, on the other hand, you can write a transform function so that the data is generated as per your transform function. An index is also defined by default in Create Data.
How do you get a DML using utilities in Unix?

m_db gendml will generate the DML from a database. cobol-to-dml and xml-to-dml are other command-line utilities that can be used to get DMLs.

What are the Co>Operating System, EME and GDE?

The Co>Operating System is the core system; it is the Ab Initio server. All the graphs made in the GDE are deployed and run on the Co>Operating System, which is installed on Unix. EME stands for Enterprise Meta>Environment; it is a repository which holds all the projects, metadata and transformations, and it performs operations like version control, statistical analysis, dependency analysis and metadata management. GDE stands for Graphical Development Environment and is like a canvas on which we create our graphs with the help of various components; it provides the graphical interface for editing and executing Ab Initio programs.

How do you improve the performance of previously built graphs?


The same techniques described above for improving graph performance apply to previously built graphs: review the number of components per phase, the max-core values for Sort and Join components, unnecessary Sort components and repartitioning, the fields carried through Sort/Reformat/Join components, the choice between sorted joins and in-memory/hash joins (with the proper driving port), the use of Broadcast on large datasets, and the use of regular-expression functions such as re_index in transforms. Also try to run the graph as long as possible in MFS, with partitioned input files and, if possible, a partitioned output file.



Ab Initio FAQs
1) What is the function you would use to transfer a string into a decimal?
2) How many kinds of parallelism are there in Ab Initio? Please give a definition of each.
3) What is the difference between a DB config and a CFG file?
4) Have you ever encountered an error called "Pipeline Broken"? (This occurs when you extensively create graphs; it is a trick question.)
5) How do you truncate a table? (Each candidate would give only one of the several ways to do this.)
6) How do you improve the performance of a graph?

7) What is the difference between partitioning with key and round robin?
8) Have you worked with packages?
9) How do you add default rules in transformer?
10) What is a ramp limit, and what are the max-core values for Scan, Rollup, Sort and Replicate?
11) Have you used the Rollup component? Describe how.
12) How many components were in your most complicated graph?
13) Do you know what a local lookup is?
14) What is an ad hoc multifile? How is it used?

Here is a description of ad hoc multifiles. Ad hoc multifiles treat several serial files having the same record format as a single graph component. Frequently, the input of a graph consists of a set of serial files, all of which have to be processed as a unit. An ad hoc multifile is a multifile created "on the fly" out of a set of serial files, without needing to define a multifile system to contain it. This enables you to represent the needed set of serial files with a single input file component in the graph. Moreover, the set of files used by the component can be determined at runtime, which lets the user customize which set of files the graph uses as input without having to change the graph itself, even after it goes into production. Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files. The simplest way to define an ad hoc multifile is to list the files explicitly, as follows:
1. Insert an Input File component in your graph.
2. Open the Properties dialog and select the Description tab.
3. Select Partitions in the Data Location section of the Description tab.
4. Click Edit to open the Define Multifile Partitions dialog box.
5. Click New and enter the first file name. Click New again and enter the second file name, and so on.
6. Click OK.
If you have added n files, the input file now acts something like a file in an n-way multifile system whose data partitions are the n files you listed. It is possible for components to run in the layout of the Input File component; however, there is no way to run commands such as m_ls or m_dump on the files, because they do not comprise a real multifile system. There are other ways than listing the input files explicitly in an ad hoc multifile:
1. Listing files using wildcards - if the input file names have a common pattern then you can use a wildcard for all the files, e.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files found at runtime matching the wildcard pattern will be taken for the ad hoc multifile.

2. Listing files in a variable - you can create a runtime parameter for the graph and list all the files in it, separated by spaces.
3. Listing files using a command - e.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the list of files to be used for the ad hoc multifile. This method gives maximum flexibility in choosing the input files, since you can also use complex commands involving the owner of the file or a date/time stamp.

15) How can I tune a graph so it does not excessively consume my CPU?

ANSWER: Options:
1. Reduce the degree of parallelism (DOP) for components, e.g. change from 4-way parallel to 2-way parallel.
2. Examine each transformation for inefficiencies. For example: if a transformation uses many local variables, make these variables global; if the same function call is performed more than once, call it once and store its value in a global variable (see the sketch below).
3. When reading data, reduce the amount of data that needs to be carried forward to the next component.
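For point 2, a minimal sketch of holding a value in a transform-level global so it is evaluated once rather than per record; the $MAX_NAME_LEN graph parameter and the field names are assumptions:

/* evaluated once when the transform is loaded, not once per record */
let decimal(8) g_max_len = (decimal(8))$MAX_NAME_LEN;

out :: reformat(in) =
begin
  /* every record reuses the cached global value */
  out.short_name :: string_substring(in.full_name, 1, g_max_len);
  out.*          :: in.*;
end;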

16) I'm having trouble finding information about the AB_JOB variable. Where and how can I set this variable? ANSWER: You can change the value of the AB_JOB variable in the start script of a given graph. This enables you to run the same graph multiple times at the same time (in parallel). However, make sure you append some unique identifier, such as a timestamp or a sequential number, to the end of each AB_JOB value you assign. You will also need to vary the file names of any outputs to keep the graphs from stepping on each other's outputs. I have used this technique to create a "utility" graph as a container for a start script that runs another graph multiple times depending on the local variable input to the "utility" graph. Be careful you don't max out the capacity of the server you are running on.

17) I have a job that does the following: FTPs files from a remote server, reformats the data in those files and updates the database, then deletes the temporary files. How do we trap errors generated by Ab Initio when an FTP fails? If I have to re-run / re-start a graph, what are the points to be considered? Does the *.rec file have anything to do with it?

ANSWER: Ab Initio has very good restartability and recovery features built into it. In your situation you can do the tasks you mentioned in one graph with phase breaks.

FTP in one phase, your transformation in the next phase, and then the DB update in another phase. (This is just an example; it may not be the best way of doing it, as the best design depends on various other factors.)

If the graph fails during the FTP, it fails in phase 0 and you can simply restart the graph. If it fails in phase 1, the AB_JOB.rec file exists, and when you restart the graph you will see a message saying a recovery file exists and asking whether you want to start from the last successful checkpoint or restart from the beginning. The same applies if it fails in phase 2.

Phases are expensive from a disk I/O perspective, so you have to be careful not to do too much phasing.

Coming back to error trapping: each component has reject, error and log ports. The reject port captures rejected records, the error port captures the corresponding errors, and the log port captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to "Never abort" or "Abort on first reject", or by setting ramp/limit.

Recovery files keep track of crucial information needed to recover the graph from a failed status, such as which node each component is executing on. It is a bad idea to simply remove the *.rec files; you always want to roll back the recovery files cleanly so that temporary files created during graph execution don't hang around, occupy disk space and create issues.

Always use m_rollback -d.

18) What is parallelism in Ab Initio?

ANSWER: 1) Component parallelism: A graph with multiple processes running simultaneously on separate data uses component parallelism.

2) Data parallelism: A graph that deals with data divided into segments and operates on each segment simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data parallelism. To support this form of parallelism, Ab Initio software provides Partition Components to segment data, and Departition Components to merge segmented data back together.

3) Pipeline parallelism: A graph with multiple components running simultaneously on the same data uses pipeline parallelism.

Each component in the pipeline continuously reads from upstream components, processes data, and writes to downstream components. Since a downstream component can process records previously written by an upstream component, both components can operate in parallel.

NOTE: To limit the number of components running simultaneously, set phases in the graph.

19) How can I determine what requirements for a new ETL and/or OLAP tool for our organization?

ANSWER: Here are some things to evaluate when considering a new ETL tool:
# How many sources are users going to extract information from?
# What are the types of sources? i.e. flat files, RDBMS, ERPs, mainframe, legacy systems, Excel files, etc.
# What are the expected sizes of these input files?
# Is there a need to join information from different sources?
# What is the growth expected per month in terms of records, size, etc.?
# What is the current environment for staging areas, loading the data warehouse, and the OS?
# What is the budget for the purchase of the ETL tool?

Here are some things to evaluate when considering a new OLAP tool:
# How many developers need to access the tool?
# Is there any security to be implemented for access?
# What is the level of end users' knowledge about OLAP, analysis and data warehousing?
# What is the budget for the purchase of the OLAP tool?
# What is the report requirement - static or ad hoc?
# What is the functional domain for analysis? This may help in the specific selection of analytic applications.

20) What is a sandbox?

ANSWER: A sandbox is a directory structure in which each directory level is assigned a variable name; it is used to manage check-in and check-out of repository-based objects such as graphs.

fin  -------> top-level directory ( $AI_PROJECT )
 |
 |---- dml -------> second-level directory ( $AI_DML )
 |
 |---- xfr -------> second-level directory ( $AI_XFR )
 |
 |---- run -------> second-level directory ( $AI_RUN )

You'll require a sandbox when you use EME (repository s/w) to maintain release control.

Within EME for the same project an identical structure will exist.

The above-mentioned structure will exist under the OS (e.g. Unix) for the project called fin, and fin is usually the name of the top-level directory.

In EME, a similar structure will exist for the project: fin.

When you checkout or check-in a whole project or an object belonging to a project, the information is exchanged between these two structures.

For instance, if you checkout a dml called fin.dml for the project called fin, you need a sandbox with the same structure as the EME project called fin. Once you've created that, as shown above, fin.dml or a copy of it will come out from EME and be placed in the dml directory of your sandbox.

21) How can I read data which contains variable length records with different record structures and no delimiters?

ANSWER: a) Try using the Read Raw component; it should do exactly what you are looking for.

b) Use the DML format:

record
  string(integer(4)) my_field_name;
end
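Along the same lines, a minimal sketch of a record with a fixed-size type code followed by a length-prefixed body; the field names and sizes are assumptions:

record
  integer(2) rec_type;            /* fixed-size record-type discriminator */
  string(integer(4)) rec_body;    /* 4-byte length prefix, then that many bytes of payload */
end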

22) How do I create subgraphs in Ab Initio?

ANSWER: First, highlight all of the components you would like to have in the subgraph, click on Edit, then click on Subgraph, and finally click on Create.

23) Suppose that you are changing fin.dml and you said "checkout": exactly how do you do it? Also, can you quote an example of where you use sandbox parameters and how exactly you create them? Do you keep two copies of the sandbox parameters as well, as we keep for our graphs and other files?

ANSWER: Check-in and check-out work between the sandbox and the EME:
Check-in:  (sandbox) ----------------> EME
Check-out: (sandbox) <---------------- EME
1. Ab Initio provides command-line interfaces via the air command to perform check-in and check-out.
2. Check-in and check-out must be performed via a sandbox.
3. The GDE gives the option to perform check-in and check-out via Project ----> Check In / Check Out.
4. You create a sandbox from the GDE via Project ----> Create Sandbox.
5. When creating a sandbox you specify a directory name (try it out; don't be afraid).
6. The EME contains one or many projects (a project is a collection of graphs and related files plus a parameter file called the project parameter file).
7. The project parameter file, when it resides within the EME, is called the project parameters.
8. The project parameter file, when it resides within one's sandbox, is called the sandbox parameters. The sandbox parameters are therefore a copy of the project parameters and are local to the sandbox owner.
9. When project parameters change, the change will be reflected in your sandbox parameters if you check out a graph (and therefore a copy of the latest project parameters) after that change has taken place.
10. You edit sandbox parameters via Project ----> Edit Sandbox.
11. You edit project parameters via Project ----> Administrative ----> Edit Project.
12. When checking out an object, use Project ----> Check Out. Navigate down to the project of your choice, navigate down to the required directory (e.g. mp, dml or xfr), select the object required, then specify a sandbox name (i.e. the top-level directory of the directory structure called the sandbox). You will be prompted to confirm the checkout.
13. Sometimes, when you check out an object, you get a number of other objects checked out for you automatically; this happens due to dependency. For example, when you check out a graph (.mp file), you might additionally get a .dml and .xfr file, and you will certainly also get a .ksh file for the graph.

24) I have a small problem understanding an issue with Reformat: I could not figure out why this Reformat component runs forever. I believe it is in an endless loop somehow.

The Reformat component has the following input and output DML:

record begin
  string(",") code, code2;
  intger(2) count;
end("\n")

Note: here the variable "code" is never null nor blank. Sample data is:

string_1,name,location,firstname,lastname,middlename,0
string_2,job,location,firstjob,lastjob,0
string_3,design,color,paint,architect,0

out::reformat(in) =
begin
  let string(integer(2)) temp_code2 = in.code2;
  let string(integer(2)) temp_code22 = " ";
  let integer(2) i = 0;
  while (string_index(temp_code2, ",") != 0 || temp_code2 != "")
  begin
    temp_code22 = string_concat(in.code, ",", string_substring(temp_code2, 1, string_index(temp_code2, ",")));
    temp_code2 = string_substring(temp_code2, string_index(temp_code2, ","), string_length(temp_code2));
    i = i + 1;
  end
  out.code :: in.code;
  out.code2 :: string_lrtrim(temp_code22);
  out.count :: i;
end;

My expected output is:

string_1,string_1,name,string_1,location,string_1,firstname,string_1,lastname,string_1,middlename,5
string_2,string_2,job,string_2,location,string_2,firstjob,string_2,lastjob,4
string_3,string_3,design,string_3,color,string_3,paint,string_3,architect,4

ANSWER:

record begin
  string(",") code, code2;
  integer(2) count;
end("\n")

In my Ab Initio the DML as posted does not validate ("intger" should be "integer").

25) In my graph I am creating a file with account data. For a given account there can be multiple rows of data. I have to split this file into exactly 4 files which are nearly equal in size. The trick is to keep each account confined to one file; in other words, an account's data should not span across these files. How do I do it? Also, if there are fewer than 4 different accounts I should still be able to create empty files, because I need at least 4 files. FYI: the requirement for 4 files is because I need to start 4 parallel processes to load-balance the subsequent processing.

ANSWER: a) I could not get your requirement very clearly, as you want to split the file into 4 equal parts and also keep the same account numbers in the same file. Can you explain what you would do in the case of 5 account numbers having 20 records each? As far as plain splitting is concerned, a very crude solution would be the following, done in the end script:

1. Find the size of the file and store it in a variable (say v_size).
2. v_qtr_size=`expr $v_size / 4`
3. split -b $v_qtr_size
4. Rename the split files as per your requirement. Note that the split files have a specific pattern in their names.

b) Your requirement essentially depends on the skewness of your data across accounts. If you want to keep the same accounts in the same partition, then partition the data by key (account) with the out port connected to a 4-way parallel layout. But this does not guarantee an equal load in all partitions unless the data has little skew.

But I can suggest you an alternative approach, though cumbersome, still might give you a result, close to your requirement.

You replicate your original dataset into two, and take one of them and rollup on account no to find the record count per account_no. Now sort this result on record count so that you have the account_no with min count at top and the one with max count at bottom. Now apply a partition by round robin and separate out the four partitions (partition 0, 1, 2 & 3).

Now take the first partition and join it with your main dataset (the one you replicated earlier) on account_no, and write the matching records (out port) into the first file. Take the unused records of the main flow from that join and join them with the second partition (partition 1) on account_no, writing the matching records (out port) to the second file. Similarly, take the unused records of the previous join and join them with the third partition (partition 2) on account_no, writing the matching records (out port) to the third file and the remaining unused records of the main flow to the fourth file.

This way you get four files, nearly equal in size, with no account spread across files.


26) I was trying to use a user-defined function (int_to_date) inside a Rollup to cast date and time values originally stored as integers back to date form and then concatenate them.

The code I wrote is as below.

record
  datetime("YYYY-MM-DD HH24:MI:SS")("\001") output_date_format;
end

out :: int_to_date(record big endian integer(4) input_date_part; end in0,
                   record big endian integer(4) input_time_part; end in1) =
begin
  let datetime("YYYY-MM-DD HH24:MI:SS")("\001") v_output_format =
      (datetime("YYYY-MM-DD HH24:MI:SS"))string_concat(
          (string("|"))(date("YYYY-MM-DD"))in0.input_date_part,
          (string("|"))(datetime("HH24:MI:SS"))decimal_lpad(((string("|"))(decimal("|"))in1.input_time_part), 6));

  out.output_date_format :: v_output_format;
end;

out :: rollup(in) =
begin
  let datetime("YYYY-MM-DD HH24:MI:SS")("\001") rfmt_dt;

  rfmt_dt = int_to_date(in.reg_date, in.reg_time);

  out.datetime_output :: rfmt_dt;
  out.* :: in.*;
end;

However I got an error during run time.

The Error Message looked like:

While compiling finalize: While compiling the statement: rfmt_dt = int_to_date(in.reg_date, in.reg_time);

Error: While compiling transform int_to_date: Output object "out.output_date_format" unknown.



27) I have a graph parameter state_cd whose value is set based on an if statement. I would like to use this variable in the SQL statements in the AI_SQL directory. I have 20 SQL statements for 20 table codes, and I will use the corresponding SQL statement based on the table code passed as a parameter to the graph.

e.g. SQLs in the AI_SQL directory:
1. select a, b from abc where abc.state_cd in ${STATE_CD}
2. select x, y from xyz where xyz.state_cd in ${STATE_CD}

${STATE_CD} is a graph parameter whose input value is "(IL,CO,MI)".

The problem is that ${STATE_CD} is not getting interpreted when I echo the select statement, hence the problem.

ANSWER: Anand, use eval or export for the Input Table components.

Or define ${STATE_CD} in your start script; that's better.
