Print this page
Monday, 27 October 2014 10:19

Mimicking analytic functions in data flows in Talend Data Integration jobs

Written by
Rate this item
(0 votes)

Analytic functions compute an aggregate value based on a group of rows. Two common examples are Lead and Lag functions, which allow you to access the NEXT and PREVIOUS row values in a dataset (essentially, finding a value in a row a specified number of rows from a current row).

 

Oracle Docs define Lead and Lag as follows: The LAG and LEAD functions are useful for comparing values when the relative positions of rows can be known reliably. They work by specifying the count of rows which separate the target row from the current row. Because the functions provide access to more than one row of a table at the same time without a self-join, they can enhance processing speed. The LAG function provides access to a row at a given offset prior to the current position, and the LEAD function provides access to a row at a given offset after the current position.

Suppose you wanted to create a cumulative column of data of say the salary of a number of employees. And your data looked like the following:

John;1000
John;1500
John;1782
John;2104
John;1583
John;2001
John;8954
John;8451
Suzan;8944
Suzan;4541
Suzan;9854
Suzan;1244
Suzan;6587
Suzan;5687
Suzan;3258

Alice;4025
Alice;1000
Alice;2000
Alice;1584

 

But you wanted an output like this

 

John;1000;1000
John;1500;2500

The solution would entail accessing the value of the previous row on the current row and summing them together and repeating that all the way down. Unfortunately, we don't have analytic functions (just simple aggregation functions) in Talend DI. Because data manipulation is typically done in a row-by-row fashion in data integration jobs, it's a bit tricky to implement a solution that gives you access to the value on the previous row or any number of rows as an offset (essentially the logic embedded in the LAG function). The following is a simple solution for doing just that. It's implemented in a tJavaRow, but the same logic can be implemented in a tMap. If the number of offsets is other than 1 (previous row only), we could leverage the Numeric.Sequence function to create a sequence that we can use to determine how many records to offset by. We sort the fields by Fname to make sure they're consecutive and the logic works right.

demo
demo
demo
Read 13771 times Last modified on Monday, 27 April 2015 20:31
Will Munji

Will Munji is a seasoned data integration, data warehousing and business intelligence (BI) architect & developer who has been working in the DW/BI space for a while. He got his start in BI working on Brio SQR (later Hyperion SQR) and the Crystal Decisions stack (Reports, Analysis & Enterprise) and SAP BusinessObjects / Microsoft BI stacks. He currently focuses on Talend Data Management Suite, Hadoop, SAP BusinessObjects BI stack as well as Jaspersoft and Tableau. He has consulted for many organizations across a variety of industries including healthcare, manufacturing, retail, insurance and banking. At Kindle Consulting, Will delivers DW/BI/Data Integration solutions that range from front-end BI development (dashboards, reports, cube development, T-SQL/ PL/SQL ...) to data services (ETL/ DI development), data warehouse architecture and development, data integration to BI Architecture design and deployment.

Latest from Will Munji