IBM InfoSphere DataStage is an ETL (extract, transform and load) tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in several editions, including the Server Edition, the Enterprise Edition, the MVS Edition and DataStage for PeopleSoft.

The Business Intelligence (BI) market depends heavily on ETL architecture, and ETL products have become far more important in the data-driven age. DataStage is one of the leading ETL tools for integrating data across disparate systems: DataStage jobs manage the collection, transformation, validation and loading of data from source systems into data warehouses, and its user-friendly interface and emphasis on data quality make it a practical foundation for business intelligence. After IBM acquired DataStage in 2005 (through its purchase of Ascential Software), the product was renamed IBM WebSphere DataStage and later IBM InfoSphere DataStage. DataStage has four client components: Administrator, Manager, Designer and Director.
Sometimes DataStage is sold to and installed in an organization whose IT support staff are then expected to maintain it and to solve DataStage users’ problems. In some cases IT support is outsourced and may not become aware of DataStage until it has been installed. Two questions immediately arise: “what is DataStage?” and “how do we support DataStage?”. This white paper addresses the first of those questions from the point of view of the IT support provider; manuals, web-based resources and instructor-led training are available to help answer the second. DataStage is actually two separate things.
In production (and, of course, in development and test environments) DataStage is just another application on the server: an application that connects to data sources and targets and processes (“transforms”) the data as they move through it. DataStage is therefore classed as an “ETL tool”, the initials standing for extract, transform and load. DataStage “jobs”, as they are known, can execute on a single server or on multiple machines in a cluster or grid environment. Like all applications, DataStage jobs consume resources: CPU, memory, disk space, I/O bandwidth and network bandwidth.
DataStage also has a set of Windows-based graphical tools for designing ETL processes, managing the metadata associated with them, and monitoring the ETL processes as they run. These client tools connect to the DataStage server because all of the design information and metadata are stored on the server. On the DataStage server, work is organized into one or more “projects”.
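For instance, the projects on a server and the jobs within them can be listed with the dsjob command-line utility (described in more detail later); the project name below is only a hypothetical placeholder:

    # List the DataStage projects defined on this server
    dsjob -lprojects

    # List the jobs defined in one of those projects
    dsjob -ljobs MyProject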
There are also two DataStage engines, the “server engine” and the “parallel engine”.
The server engine is located in a directory called DSEngine whose location is recorded in a hidden file called /.dshome (that is, a hidden file called .dshome in the root directory) and/or as the value of the environment variable DSHOME. (On Windows-based DataStage servers the folder name is Engine, not DSEngine, and its location is recorded in the Windows registry rather than in /.dshome.)
The parallel engine is located in a sibling directory called PXEngine whose location is recorded in the environment variable APT_ORCHHOME and/or in the environment variable PXHOME.
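As a quick illustration, the following commands on a Unix DataStage server show where the two engines live; the path shown in the comment is only a typical default and will vary by installation:

    # Server engine: location recorded in /.dshome and/or $DSHOME
    cat /.dshome
    echo $DSHOME        # e.g. /opt/IBM/InformationServer/Server/DSEngine

    # Parallel engine: a sibling PXEngine directory, recorded in
    # $APT_ORCHHOME and/or $PXHOME
    echo $APT_ORCHHOME
    echo $PXHOME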
Types of DataStage Job
Setting up DataStage Environment
DataStage Administrator Properties
Defining Environment Variables
Importing Table Definitions
Creating Parallel Jobs
Design a simple Parallel job in Designer
Compile your job
Run your job in Director
View the job log
Command Line Interface (dsjob)
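To give a flavour of the dsjob command line interface named above, a compiled job can be run and its log inspected from the shell; the project and job names here are hypothetical:

    # Run the job and wait for it to complete, returning its exit status
    dsjob -run -jobstatus MyProject MyParallelJob

    # Summarize the log entries written by that run
    dsjob -logsum MyProject MyParallelJob

    # Show the job's status and run statistics
    dsjob -jobinfo MyProject MyParallelJob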
Accessing Sequential Data
Sequential File stage
Data Set stage
Complex Flat File stage
Create jobs that read from and write to sequential files
Read from multiple files using file patterns
Use multiple readers
Null handling in Sequential File Stage
Describe parallel processing architecture
Describe pipeline & partition parallelism
List and describe partitioning and collecting algorithms
Describe configuration files (a sample configuration appears after this list)
Explain OSH & Score
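As a sketch of such a configuration file: the parallel engine reads the file named by APT_CONFIG_FILE, which describes the processing nodes and their disk resources. A minimal single-node example, with hypothetical host and directory names, might look like this:

    # Create a minimal one-node parallel configuration file (hypothetical paths)
    cat > /tmp/one_node.apt <<'EOF'
    {
      node "node1"
      {
        fastname "dshost"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }
    EOF
    # Tell the parallel engine to use it for subsequent job runs
    export APT_CONFIG_FILE=/tmp/one_node.apt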
Combine data using the Lookup stage
Combine data using merge stage
Combine data using the Join stage
Combine data using the Funnel stage
Sorting and Aggregating Data
Sort data using in-stage sorts and Sort stage
Combine data using Aggregator stage
Remove Duplicates stage
Understand ways DataStage allows you to transform data
Create column derivations using user-defined code and system functions
Filter records based on business criteria
Control data flow based on data conditions
Perform a simple Find
Perform an Advanced Find
Perform an impact analysis
Compare two Table Definitions or two Jobs to identify their differences.
Working with Relational Data
Import Table Definitions for relational tables.
Create Data Connections.
Use Connector stages in a job.
Use SQL Builder to define SQL Select statements.
Use SQL Builder to define SQL Insert and Update statements.
Use the DB2 Enterprise stage.
Metadata in Parallel Framework:
Explain Runtime Column Propagation (RCP).
Build a job that reads data from a sequential file using a schema.
Build a shared container.
Use the DataStage Job Sequencer to build a job that controls a sequence of jobs.
Use Sequencer links and stages to control the order in which a set of jobs runs.
Use Sequencer triggers and stages to control the conditions under which jobs run.
Pass information in job parameters from the master controlling job to the controlled jobs (see the command-line sketch after this list).
Define user variables.
Handle errors and exceptions.
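As a small command-line sketch of the job-parameter mechanism referred to above: within a Job Sequence, parameters are passed graphically from the controlling job to the controlled jobs, but the same parameters can also be supplied when a job or job sequence is started from the shell. The project, job and parameter names below are hypothetical:

    # Start a job sequence, supplying values for its job parameters
    dsjob -run -jobstatus \
          -param SRC_DIR=/data/incoming \
          -param TARGET_ENV=DEV \
          MyProject MyJobSequence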