Dynamic Select SQL statement execution for moving data from DB2 to Hadoop (Design Pattern # 1)

Wednesday, 14 October 2015 19:40

Written by Will Munji

Data warehousing and ETL processes usually repeat common patterns across different data domains (databases, tables, subject areas, etc.). One such pattern is copying data from a transactional system to Hadoop or some other data platform (Teradata, Oracle databases) to create 'images' of those systems for downstream processing. Because these processes are repeated many times over in the design and construction of data warehouses, it is best to create repeatable patterns that reduce future technical debt in terms of support, maintenance and update costs.

In this post, we are going to look at a job that moves data from DB2, a transactional system, to Hadoop. This data copy, or OI, serves as the bedrock for a slew of analytics within and beyond the Hadoop data platform, supporting not only data warehousing tasks but also (possibly) business intelligence in the rare cases where access to raw data is necessary to answer business questions.

So here's a piece of the job:

[Screenshot: db22 – DB2 input job with the query and matching schema for CustomerSales]

As shown above, the query is specified and a schema is also defined that matches the set of fields in the table CustomerSales. (Note: it's not good practice to use 'select * from table' in data integration jobs; if columns shift, the job will die.)
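For illustration, an explicit query against CustomerSales might look like the following (the column names here are hypothetical, since the actual table definition isn't shown):

    -- Hypothetical explicit column list for the CustomerSales extract.
    -- Naming the columns (instead of SELECT *) keeps the job stable if
    -- the table gains or reorders columns later.
    SELECT
        CUSTOMER_ID,
        SALE_DATE,
        PRODUCT_ID,
        QUANTITY,
        SALE_AMOUNT
    FROM CUSTOMERSALES;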

So this design works well. The problem, however, is that when we have, say, 100 tables to load from the source to the target, we have to repeat this process with unique queries for each table, and corresponding schemas for each table. Our job could end up being very large to accommodate the number of tables, or we could end up with lots of jobs, one for each table or set of tables. While these approaches all work, there's a better way to do this.

A well-documented feature in Talend is Dynamic Schema. The dynamic schema feature allows us to skip specifying the schema for each query and simply let Talend 'discover' the schema based on the query or table name. Combining this feature with Select SQL statements allows us to implement a pattern where a set of Select SQL statements is run against the source using one or two simple Talend jobs.

 

To convert this simple job into a reusable pattern, we redesign the job to read or generate a set of Select SQL statements, one for each table in the source, in this case DB2. We loop through the list of Select statements and execute them using the DB2Input component. Because we don't know the schema for each Select statement in advance, we change the type of just one column in the Schema window to Dynamic.
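As a sketch of one way the list of Select statements could be generated (this approach is an assumption, not part of the original job), the DB2 system catalog can build an explicit-column statement per table:

    -- One possible way to generate a Select statement per table from the
    -- DB2 (LUW) system catalog; the 'SALES' schema name is just an example.
    SELECT 'SELECT '
           || LISTAGG(TRIM(C.COLNAME), ', ') WITHIN GROUP (ORDER BY C.COLNO)
           || ' FROM ' || TRIM(C.TABSCHEMA) || '.' || TRIM(C.TABNAME)
           AS SELECT_STMT
    FROM SYSCAT.COLUMNS C
    JOIN SYSCAT.TABLES  T
      ON T.TABSCHEMA = C.TABSCHEMA
     AND T.TABNAME   = C.TABNAME
    WHERE T.TYPE = 'T'              -- base tables only
      AND T.TABSCHEMA = 'SALES'     -- example schema
    GROUP BY C.TABSCHEMA, C.TABNAME;

Each row of the result is one Select statement that the loop can hand to the DB2 input component.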

[Screenshot: db23a – redesigned job that loops through the generated Select statements]

[Screenshot: db24 – Schema window with a single column set to type Dynamic]

When the query is run, Talend determines the schema from the Select statement.

[Screenshot: db25 – run output with the schema resolved from the Select statement]

 When a different Select statement is used that retrieves only a few columns...

[Screenshot: db27 – a Select statement that retrieves only a few columns]

 it works as expected. 

[Screenshot: db26 – output of the narrower Select statement]
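A narrower statement of that kind might look like the following (again with hypothetical column names); the single Dynamic column picks up the two-column schema at run time without any change to the job:

    -- A narrower, hypothetical query; no schema edits are needed because
    -- the Dynamic column resolves the columns at run time.
    SELECT CUSTOMER_ID, SALE_AMOUNT
    FROM CUSTOMERSALES;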

While this design works for reading and printing to the console using tLogRow as shown here, it fails when writing/streaming the data to HDFS (the Hadoop Distributed File System) because the tHDFSOutput component does not support dynamic schema functionality. It needs to be passed a data flow that includes the schema of the data to write to Hadoop.

One standard way to deal with components that don't support dynamic schema functionality is to use the tSetDynamicSchema component. In this particular case, an easier and more efficient method of both moving the data to Hadoop AND handling the lack of support for dynamic schema is to switch from tHDFSOutput (which streams to Hadoop) to tHDFSPut, which moves the file from local to HDFS in one fell swoop. Because we don't write the column or header list at the start of the file, the output file that's placed in HDFS is compliant and queryable (through Hive, etc.).

[Screenshot: final2 – the final job using tHDFSPut to move the file into HDFS]
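As a rough sketch of that last point (the table name, columns, delimiter and path below are assumptions, not taken from the original job), the file landed by tHDFSPut could be exposed to Hive with an external table:

    -- Hypothetical Hive external table over the HDFS directory where the
    -- extracted file is placed; adjust columns, delimiter and path to match.
    CREATE EXTERNAL TABLE customersales_raw (
        customer_id  STRING,
        sale_date    STRING,
        product_id   STRING,
        quantity     STRING,
        sale_amount  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION '/data/raw/customersales';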

This pattern can be repeated for dozens of tables, loading them into Hadoop with a minimal number of jobs and minimal complexity.

 

 

Will Munji

Will Munji is a seasoned data integration, data warehousing and business intelligence (BI) architect & developer who has been working in the DW/BI space for a while. He got his start in BI working on Brio SQR (later Hyperion SQR) and the Crystal Decisions stack (Reports, Analysis & Enterprise) and SAP BusinessObjects / Microsoft BI stacks. He currently focuses on Talend Data Management Suite, Hadoop, SAP BusinessObjects BI stack as well as Jaspersoft and Tableau. He has consulted for many organizations across a variety of industries including healthcare, manufacturing, retail, insurance and banking. At Kindle Consulting, Will delivers DW/BI/Data Integration solutions that range from front-end BI development (dashboards, reports, cube development, T-SQL/ PL/SQL ...) to data services (ETL/ DI development), data warehouse architecture and development, data integration to BI Architecture design and deployment.
