How DataStage parallel job processing is done?

A DataStage parallel job is a program created in the DataStage Designer using a GUI; it is executed and monitored through the DataStage Director. A parallel job is made up of individual stages, where each stage performs a different part of the processing. DataStage also allows users to store reusable components in the DataStage repository. At compile time the job design is turned into OSH (Orchestrate Shell) and generated C++ object code, which makes it highly scalable and operational.

Overall, DataStage is a comprehensive ETL tool that offers end-to-end data integration solutions to its users. Moreover, DataStage supports strong business analysis by providing the quality data that underpins business intelligence. Many large business entities use this tool as an interface between their different systems and devices.

Generally, the job development process within DataStage takes a few steps from start to end. Here is a brief overview of the process.

First, we need to import the technical metadata that defines all sources and destinations. The import tool within the DataStage Designer can be used for this purpose.

Next, add the stages that define data extraction and data loading (sequential file stages, data sets, file sets, database connection stages, etc.). Rename the stages so they match the development naming standards.

Later, add the data modification stages (transformers, lookups, aggregators, sorts, joins, etc.).

Specify the data flow from various sources to destinations by adding links.

Once the job design is finished, save and compile it.

Finally, run the job from within the Designer or the Director.
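
To make the flow concrete, here is a minimal Python sketch (illustrative only, not DataStage code; the stage names and sample rows are made up) of the same source-to-target structure that a Designer job expresses graphically:

    # Illustrative stand-ins for a source stage, a transformer, and a target stage.
    def sequential_file_source():
        # Rough analogue of a Sequential File stage reading rows from a flat file.
        yield {"id": 1, "name": "alice", "amount": 120}
        yield {"id": 2, "name": "bob", "amount": 80}

    def transformer(rows):
        # Rough analogue of a Transformer stage applying a derivation to each row.
        for row in rows:
            row["amount_with_tax"] = round(row["amount"] * 1.1, 2)
            yield row

    def db_target(rows):
        # Rough analogue of a target stage loading the transformed rows.
        for row in rows:
            print("LOAD:", row)

    # The links are simply the hand-off of rows between stages.
    db_target(transformer(sequential_file_source()))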

Parallel jobs run in parallel on different nodes. Further, we will see the creation of a parallel job and its process in detail.

How to create a job in Datastage?

DataStage runs through several steps when it creates and starts a parallel job. At the start, the parallel engine launches the conductor process along with other supporting processes, including the monitor process. Next, the engine builds the execution plan for the job: it verifies the input and output schemas of every stage, checks that the stage settings are valid, and takes a copy of the job design. The conductor then connects to the remote servers (the processing nodes) and starts a Section Leader (SL) process on each of them. Each Section Leader receives its part of the execution plan and creates the Player processes that actually run the job, and communication channels are opened between all of these processes to coordinate and report on the run.

In this way, once all of these steps are complete, DataStage starts executing the job.
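
As a rough analogy (a simplified Python sketch, not the actual engine; the node and row counts are made up), the conductor/section leader/player hierarchy can be pictured with multiprocessing: one parent process plans the work, starts one "section leader" per node, and each leader starts the "player" processes that do the real row processing:

    import multiprocessing as mp

    def player(node, player_id, rows):
        # Player process: runs one operator on its share of the data.
        print(f"node {node}, player {player_id}: processed {len(rows)} rows")

    def section_leader(node, rows, players_per_node=2):
        # Section Leader: receives the plan for its node and starts the players.
        chunk = max(1, len(rows) // players_per_node)
        players = [
            mp.Process(target=player, args=(node, i, rows[i * chunk:(i + 1) * chunk]))
            for i in range(players_per_node)
        ]
        for p in players:
            p.start()
        for p in players:
            p.join()

    if __name__ == "__main__":
        # Conductor process: builds the plan and starts one section leader per node.
        data, nodes = list(range(100)), 2
        per_node = len(data) // nodes
        leaders = [
            mp.Process(target=section_leader, args=(n, data[n * per_node:(n + 1) * per_node]))
            for n in range(nodes)
        ]
        for sl in leaders:
            sl.start()
        for sl in leaders:
            sl.join()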

Moreover, DataStage organizes the stages used to build a job into several groups. These include the general, development/debug, processing, file, database, restructure, data quality, real-time, and sequence stages.

Learn practically about the various stages of DataStage and their activities through the DataStage Online Course.

General Stage

This group includes links, containers, and annotations. A link shows the data flow, and there are three types of links: stream, lookup, and reference. A container can be shared or kept private, and it helps make a complex job design easier to work with. Annotations are useful for adding floating descriptions to different parts of a job; they give a way to understand the job and double as ETL process documentation.

Development stage

The development/debug group includes the Row Generator, Peek, Column Generator, Sample, Head, Tail, and Write Range Map stages. Each of these stages is useful for developing or debugging a job and its data.

Here, the Row Generator produces a mock data set that conforms to the specified metadata, which is mostly useful in testing and early development. Using the Column Generator, the user can add one or more columns to the data flow and produce test data for those columns. With Peek, the column values are recorded so the user can view them in the Director log; it takes a single input link and can have multiple output links.

The Sample stage operates on an input data set and has two modes of operation: percent mode and period mode. The Head stage holds the first "N" rows from every partition of the data and copies them from the input data set to an output data set. The Tail stage is similar to the Head stage, except that the last "N" rows are selected from each partition. The Write Range Map stage writes a range map that a data set can later use with the range partitioning method.
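
The following short Python sketch (illustrative only; the schema and counts are made up) mirrors what these debug stages do: generate mock rows that conform to a schema, then take the first or last N rows, a periodic sample, and a "peek" at a few column values:

    import random

    # Row Generator: mock rows that conform to a simple schema.
    rows = [{"id": i, "amount": random.randint(1, 100)} for i in range(20)]

    head = rows[:5]            # Head: first N rows of the partition
    tail = rows[-5:]           # Tail: last N rows of the partition
    sample = rows[::4]         # Sample (period mode): every 4th row

    # Peek: record a few column values so they can be inspected in a log.
    for row in rows[:3]:
        print("peek:", row["id"], row["amount"])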

||{"title":"Master in Datastage", "subTitle":"Datastage Certification Training by ITGURU's", "btnTitle":"View Details","url":"https://onlineitguru.com/datastage-online-training-placement.html","boxType":"demo","videoId":"CPv-ssSAbpo"}||

Processing stage

In this group, the data is processed using various options, including the following (a brief conceptual sketch of a few of these operations follows the list):

  • Aggregator: groups the incoming data stream and aggregates it; the grouping can be done using two different methods, a hash table or pre-sorted input.
  • FTP: uses the File Transfer Protocol to transfer data to another remote system.
  • Copy: copies a single input data set to one or more output data flows.
  • Filter: passes on only the records that meet the specified requirement and drops the rest.
  • Join: relates the inputs according to the key column values.
  • Funnel: converts several input streams into a single one.
  • Lookup: relates the input to lookup tables based on key columns; it can have many lookup tables but only one source input.
  • Modify: changes the record schema of a data set.
  • Remove Duplicates: removes all duplicate records and gives the relevant output as a single sorted data set.
  • Sort: sorts the input data on one or more columns.
  • Transformer: applies derivations and validations to the extracted data.
  • Change Capture: compares the "before" and "after" versions of a data set and produces an output recording the differences between the two.
  • Compare: performs a comparison between pre-sorted records.
  • Encode: encodes the data using an encode command.
  • Decode: decodes data that was encoded earlier.
  • Compress: compresses the data set using GZIP.
  • Moreover, there are many other stages, such as Checksum, Difference, External Filter, Generic, Switch, Expand, Pivot Enterprise, etc.
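
As promised above, here is a brief conceptual sketch in plain Python (not DataStage syntax; the records, keys, and lookup table are invented) of a few of these operations: filter, remove duplicates, lookup/join on a key column, sort, and a grouped aggregation:

    from collections import defaultdict

    orders = [
        {"order_id": 1, "cust_id": "A", "amount": 50},
        {"order_id": 2, "cust_id": "B", "amount": 75},
        {"order_id": 2, "cust_id": "B", "amount": 75},   # duplicate record
        {"order_id": 3, "cust_id": "A", "amount": 20},
    ]
    customers = {"A": "Alice", "B": "Bob"}               # lookup table keyed on cust_id

    # Filter: keep only the records that meet the requirement.
    filtered = [o for o in orders if o["amount"] >= 30]

    # Remove Duplicates: keep one record per key.
    unique = list({o["order_id"]: o for o in filtered}.values())

    # Lookup / Join: relate each record to the lookup table via the key column.
    joined = [{**o, "cust_name": customers[o["cust_id"]]} for o in unique]

    # Sort: order the records on a column.
    joined.sort(key=lambda o: o["amount"])

    # Aggregator: group the records and summarise each group.
    totals = defaultdict(int)
    for o in joined:
        totals[o["cust_id"]] += o["amount"]
    print(joined, dict(totals))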


File stage

This group of DataStage stages includes the Sequential File, Data Set, File Set, Lookup File Set, and External Source stages. The Sequential File stage is used to read data from, and write data to, one or more flat files. Similarly, the Data Set stage allows the user to read and write data sets, while the File Set stage reads and writes file sets. These formats keep the data in a form that can also be read by other applications. Moreover, the External Source stage allows reading data that comes from the output of different source programs, and the Lookup File Set stage creates file sets for use as lookup tables.
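
As a loose analogy (a sketch using Python's csv module, not the stages themselves; the file name and columns are made up), a flat-file stage boils down to writing rows to, and reading rows back from, a file that other applications can also consume:

    import csv

    rows = [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]

    # Writing: roughly what a Sequential File / File Set stage does on output.
    with open("customers.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(rows)

    # Reading: roughly what the same stages do on input.
    with open("customers.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row)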

Database stage

The database group includes ODBC Enterprise, Oracle Enterprise, Teradata, Sybase, SQL Server Enterprise, Informix, DB2 UDB, and many more. These stages read data from and write data to the corresponding databases.

Here, the Oracle Enterprise stage permits reading data from and writing data to an Oracle database. The ODBC Enterprise stage likewise permits reading and writing data through an ODBC source; this is mainly useful for processing data held in MS Access and MS Excel spreadsheets. Moreover, the DB2/UDB Enterprise stage enables us to read and write data in a DB2 database. Similarly, Teradata also allows users to read and write data in a Teradata database, and it comes in three different forms: the connector, the enterprise stage, and multi-load. All the other database stages work in the same way for their respective databases.
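
To illustrate the read/write idea (a stand-in sketch that uses Python's built-in sqlite3 module; the real stages connect through native Oracle/DB2/Teradata drivers or ODBC, and the table here is invented), a database stage essentially loads rows into a table and extracts rows from it:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # stand-in for an ODBC/native connection

    conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

    # Write mode: load rows into the target table.
    conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120.0), (2, 80.5)])
    conn.commit()

    # Read mode: extract rows from the source table.
    for row in conn.execute("SELECT id, amount FROM sales"):
        print(row)

    conn.close()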

Restructure Stage

The restructure group in a DataStage parallel job includes Column Import and Column Export, Combine Records, Make Vector, Promote Subrecord, Make Subrecord, Split Vector, and so on. Each stage in this group serves a different purpose: they are used to reshape rows, columns, vectors, and records.

Here, using the Column Export stage, we can export data from several columns of different data types into a single string column; the Column Import stage does just the opposite. The Combine Records stage groups rows that have the same key. The Make Vector stage combines specified columns into a single vector of columns. Moreover, the Promote Subrecord stage promotes the columns of an input subrecord to top-level columns, and the Split Vector stage promotes the elements of a fixed-length vector to top-level columns.
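
A rough Python picture of Column Export and Column Import (illustrative only; the delimiter and column names are invented): several typed columns are packed into one string column, and the import step unpacks that string back into columns:

    row = {"id": 7, "name": "alice", "amount": 120}

    # Column Export: combine several columns into a single string column.
    exported = {"id": row["id"], "packed": f'{row["name"]}|{row["amount"]}'}

    # Column Import: the opposite - split the string column back into columns.
    name, amount = exported["packed"].split("|")
    imported = {"id": exported["id"], "name": name, "amount": int(amount)}

    print(exported, imported)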

Real-time stage

This group also includes several stages, such as:

  • XML Input converts structured XML data into flat relational data (see the sketch after this list).
  • XML Output writes data out to external XML structures.
  • Further, the XML Transformer converts XML documents using a stylesheet.
  • The Java Client stage is useful as a target and a lookup and involves three different public classes.
  • Similarly, the Java Transformer stage supports input, output, and reject links.
  • Moreover, there are WISD Input and WISD Output stages, which are the Information Services input and output stages respectively.
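
Conceptually (a minimal sketch using Python's standard library, not the stage itself; the XML document is invented), the XML Input idea is to flatten hierarchical XML into relational-style rows:

    import xml.etree.ElementTree as ET

    doc = """
    <orders>
      <order id="1"><customer>Alice</customer><amount>120</amount></order>
      <order id="2"><customer>Bob</customer><amount>80</amount></order>
    </orders>
    """

    # Flatten each <order> element into one relational-style row.
    rows = [
        {
            "id": order.get("id"),
            "customer": order.findtext("customer"),
            "amount": float(order.findtext("amount")),
        }
        for order in ET.fromstring(doc).iter("order")
    ]
    print(rows)
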
Data Quality Stage

This group includes different stages, such as the following.

Investigate is the stage that analyzes the data content of the selected columns of each record in the source file. It offers different investigation methods as well.

The Match Frequency stage takes input from various sources, such as a file or a database, and generates a data (frequency) distribution report.

Moreover, MNS and WAVES stand for Multinational Address Standardization and Worldwide Address Verification and Enhancement System respectively.
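
The frequency-distribution idea behind the Match Frequency stage can be pictured with a few lines of Python (a sketch only; the column values are invented): count how often each value occurs in a column and report the distribution:

    from collections import Counter

    surnames = ["smith", "jones", "smith", "patel", "smith", "jones"]

    # Frequency distribution of the column values, as a simple report.
    for value, count in Counter(surnames).most_common():
        print(f"{value}: {count} ({count / len(surnames):.0%})")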

Sequence Stage

This group consists of the Job Activity, Terminator Activity, Sequencer, Notification, and Wait For File activities. Here, the Job Activity stage instructs the DataStage server to execute a job. The Notification stage is used by DataStage to send emails to the recipients specified by the client. The Sequencer synchronizes the control flow of the different activities while a job sequence is in progress. Similarly, the Terminator Activity helps to shut down the whole sequence, whereas the Wait For File activity waits for a specific file to appear before it triggers the following processing or raises an exception.
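
As a loose analogy (a hypothetical Python sketch; run_job, send_notification, and the file path are made-up helpers, not a DataStage API), a sequence strings these activities together: wait for a trigger file, run a job, notify on success, and terminate everything on failure:

    import os
    import time

    def run_job(name):
        # Hypothetical stand-in for a Job Activity: pretend the job succeeded.
        print(f"running job {name}")
        return True

    def send_notification(message):
        # Hypothetical stand-in for a Notification activity (e.g. an email).
        print("notify:", message)

    def wait_for_file(path, timeout=10):
        # Wait For File activity: poll until the trigger file appears.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if os.path.exists(path):
                return True
            time.sleep(1)
        return False

    if wait_for_file("trigger.done"):
        if run_job("load_sales"):
            send_notification("load_sales finished")
        else:
            raise SystemExit("sequence terminated")   # Terminator activity
    else:
        send_notification("trigger file never arrived")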

Furthermore, parallelism in DataStage is achieved using two methods: pipeline parallelism and partition parallelism.

Pipeline parallelism in DataStage runs the extract, transform/clean, and load processes in parallel. This kind of parallelism works like a conveyor belt moving rows from one end to the other: a downstream process begins while the upstream process is still working. This minimizes the need for staging areas on disk and also minimizes the idle time of the processors.
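
Python generators give a rough feel for pipeline parallelism (a conceptual sketch; in the real engine the stages are separate processes): each downstream step starts consuming rows before the upstream step has finished producing them, so no full intermediate data set has to be staged:

    def extract():
        for i in range(5):
            print("extract", i)
            yield i

    def transform(rows):
        for r in rows:
            print("  transform", r)
            yield r * 10

    def load(rows):
        for r in rows:
            print("    load", r)

    # Rows flow one at a time through the whole pipeline: the load step handles
    # row 0 before the extract step has produced row 4.
    load(transform(extract()))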

||{"title":"Master in Datastage", "subTitle":"Datastage Certification Training by ITGURU's", "btnTitle":"View Details","url":"https://onlineitguru.com/datastage-online-training-placement.html","boxType":"reg"}||

In partition parallelism, the incoming data stream is divided into several subsets. These subsets, called partitions, are then processed by separate processors, with every partition going through the same operation.

Further, DataStage offers several partitioning techniques to divide the data: Auto, DB2, Entire, Hash, Modulus, Random, Range, Same, and so on.
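
A small Python sketch of the partitioning idea (illustrative only; the key column, partition count, and worker operation are invented): a hash or modulus function assigns each record to a partition, and each partition is then handed to its own worker running the same operation:

    from multiprocessing import Pool

    def partition(rows, n_partitions, key):
        # Hash/modulus-style partitioning: the key value decides the partition.
        parts = [[] for _ in range(n_partitions)]
        for row in rows:
            parts[hash(row[key]) % n_partitions].append(row)
        return parts

    def work(part):
        # Every partition runs the same operation (here, a simple sum).
        return sum(r["amount"] for r in part)

    if __name__ == "__main__":
        rows = [{"cust": c, "amount": a} for c, a in zip("ABCABCAB", range(8))]
        parts = partition(rows, n_partitions=3, key="cust")
        with Pool(3) as pool:
            print(pool.map(work, parts))   # one result per partition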

Moreover, beyond the above, DataStage's features also include any-to-any connectivity, platform independence, and node configuration. These features help DataStage stand out as one of the most useful and powerful tools in the ETL market.

The above stages help in the processing of the Datastage parallel job.

Summing Up

Finally, this concludes the details of how DataStage parallel job processing is done through the various stages. DataStage is an ETL platform that helps with data processing and provides various data integration solutions that present data in the required form. Since it is an ETL tool, a parallel job passes through a number of stages during processing. To gain practical knowledge of the various stages and their relevance, DataStage Online Training will be useful; this learning will enhance your skills and help you apply them in real work.