Talend connection component retry logic & design for databases, web services etc...
When a Talend job intermittently fails to connect to a backend service (database, web service, FTP, Salesforce, Dropbox) on the first try, the job can be modified to retry the connection before failing. This solution shows a simple design to handle intermittent connection failures.
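The retry idea can be sketched in plain Java, for instance as a helper a job could call. This is only an illustration of the pattern, not the component-based design the post describes; the class name `ConnectionRetry`, the method `withRetry`, and its parameters are hypothetical.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries an action a fixed number of times with a pause in between.
final class ConnectionRetry {
    private ConnectionRetry() {}

    /**
     * Calls the action until it succeeds or maxAttempts is reached,
     * sleeping delayMillis between attempts. Rethrows the last failure.
     */
    static <T> T withRetry(Callable<T> action, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Only pause if another attempt is still to come.
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the connection attempt in the `Callable`, for example `ConnectionRetry.withRetry(() -> openConnection(), 3, 2000L)`, where `openConnection` stands in for whatever actually opens the connection.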
Testing Hive or MySQL database connections directly using JDBC driver
Every once in a while, one runs into a situation where one cannot connect to a database from a tool or application. When this happens, the best way to isolate the issue is to try connecting to the same database from a quick Java application using standard JDBC.
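A minimal sketch of such a check is shown here, using only the standard `java.sql` API. The class and method names are hypothetical, and the matching JDBC driver JAR (MySQL or Hive) is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper: tries to open a JDBC connection and reports whether it worked.
final class JdbcConnectionCheck {
    private JdbcConnectionCheck() {}

    static boolean canConnect(String url, String user, String password) {
        // try-with-resources closes the connection again once the check is done.
        try (Connection conn = DriverManager.getConnection(url, user, password)) {
            return true;
        } catch (SQLException e) {
            // The driver's message usually points at the cause (host, port, credentials, driver).
            System.err.println("Connection failed: " + e.getMessage());
            return false;
        }
    }
}
```

Typical URLs take the form `jdbc:mysql://host:3306/dbname` for MySQL and `jdbc:hive2://host:10000/default` for HiveServer2; the host, port and database names here are placeholders.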
Best practice for creating Talend metadata connections to NoSQL DBs and Hadoop
Starting with Talend 5.6.2, it is now possible to create metadata connections for NoSQL databases and Hadoop platforms using the metadata feature in the Studio. Even better, the Studio now allows automatic discovery of these properties using the Hadoop *-site.xml properties files.
Connecting Talend to an MS SQL Server DB with 'applicationIntent=ReadOnly'
Out of the box, Talend uses the open source jTDS driver to connect to MS SQL Server databases. This driver, however, does not support connecting to an AlwaysOn-enabled database. A generic JDBC driver has to be used as a workaround.
Enabling Hive High Availability in Talend Studio
Starting with Talend 5.6.1, Talend released a patch that updates the Hive components so they can connect to Hadoop clusters configured for HA (High Availability). In HA, instead of configuring the Hive host and a standard port in the component, we specify the ZooKeeper quorum, which in turn 'discovers' the active Hive host.
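For orientation, outside of Talend a HiveServer2 JDBC URL that uses ZooKeeper service discovery generally lists the quorum in place of a single host. The sketch below only assembles such a URL; the helper name, host names and namespace value are assumptions, and how the Talend components expose these settings is not shown here.

```java
import java.util.List;

// Hypothetical helper: builds a HiveServer2 JDBC URL that relies on ZooKeeper discovery.
final class HiveHaUrl {
    private HiveHaUrl() {}

    static String build(List<String> zkHosts, String database, String namespace) {
        // The ZooKeeper quorum (host:port pairs, comma-separated) replaces the single Hive host.
        return "jdbc:hive2://" + String.join(",", zkHosts) + "/" + database
                + ";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=" + namespace;
    }
}
```

The namespace must match the one the cluster's HiveServer2 instances register under in ZooKeeper; `hiveserver2` is a common default but should be confirmed against the cluster configuration.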
'Not implemented by the DistributedFileSystem FileSystem implementation' error in Talend Big Data
The 'Not implemented by the DistributedFileSystem FileSystem implementation' error occasionally rears its head when debugging Talend Big Data jobs. This cryptic message actually means that your job includes JARs from different versions of Hadoop in its classpath.
Dynamic Select SQL statement execution for moving data from DB2 to Hadoop (Design Pattern # 1)
Data warehousing and ETL processes usually repeat common patterns across different data domains (databases, tables, subject areas etc...). One such pattern is copying data from a transactional system to Hadoop or some other data platform (Teradata, Oracle DBs) to create 'images' of those systems for downstream processing. Because these processes are repeated many times over in the design & construction of data warehouses, it is best to create repeatable patterns that reduce future technical debt in terms of support, maintenance and update costs.
Introducing Talend 6.0
Talend 6 was released in September 2015 and with it come a number of new and important features and updates, including product name changes.
Configuring Talend Hive Components: HiveServer1 vs. HiveServer2, Embedded vs. Standalone?
Talend Hive components have a number of somewhat confusing options that could be tricky to understand when making connections to a Hadoop cluster. Options include selecting between HiveServer1 and HiveServer2, Embedded vs. Standalone modes, and what ports to connect to. We explore the options in this post, pulling in information from the Hive Wiki, Talend Support and other sources.
Connecting to HiveServer2 in a Hadoop 2.x HA Cluster using DBVisualizer
In a previous post, the steps for downloading and configuring DBVisualizer to connect to Hive were presented. The connection was made using a Hive host name in a Hadoop cluster with a single namenode. In this post, we look at connecting to Hive in a Hadoop cluster that's configured for HA (High Availability), meaning it has multiple Hive hosts (and namenodes & resource managers etc...) where one Hive host is active while the others are passive.
Talend Studio: Use (Java) JRE or JDK?
Talend Studio (essentially a customized Eclipse IDE) requires that Java be installed on the client in order for the Studio to function - run jobs etc... Most often when installing Talend, the decision of installing Java is a no-brainer - basically click-through on java.com and you're done. But that doesn't always work, depending on what you're doing in your Talend job.
Talend 5.x and Cassandra CQL 2 & 3
For quite some time, Talend has included a family of components for the Apache Cassandra NoSQL database. However, recent versions of Talend 5.x have not yet started generating code to connect to Cassandra that leverages the newer Cassandra Query Language (CQL) version 3. CQL v3 is not backward compatible with CQL v2 and differs from it in numerous ways.
Downloading and Processing LivePerson Chat Data - Part 2
This is part 2 in a series about downloading and processing LivePerson chat data. LivePerson is a leading business chat, online messaging, marketing, and analytics platform that's integrated into many online sales channels / websites. In part 1 of the series, we looked at how to test connectivity to the LivePerson API. In this part, we move on to developing the solution using Talend Data Integration Studio.
Downloading and Processing LivePerson Chat Data - Part 1
LivePerson is a leading business chat, online messaging, marketing, and analytics platform that's integrated into many online sales channels / websites. It enables companies to proactively engage online customers who, based on their site navigation patterns, are most likely to be converted into customers. LivePerson captures rich information about site visits (IP addresses, navigation paths, location information...) including detailed chat transcripts...
Hive Primers for SQL Users
Hive continues to gain prominence within the Hadoop ecosystem. This is despite the introduction of new tools in the ever-expanding Hadoop universe in the form of new Apache projects and incubators. Hive is an Apache Hadoop platform that uses a SQL-like language (called HQL or Hive Query Language)...
Accessing secured REST web services from a Talend job
Accessing secured web services from a Talend job requires that the JVM authenticate with a trust store file (.jks). Failing to do so results in a java.lang.Exception: nulljavax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed exception. The solution is to configure the Talend job to present a .jks file when accessing the service.
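One common way to point a JVM at a trust store is through the standard `javax.net.ssl.trustStore` system properties, sketched here. The helper name and its arguments are hypothetical, and this is not necessarily the exact configuration the post uses.

```java
// Hypothetical helper: sets the standard JSSE trust store system properties.
final class TrustStoreConfig {
    private TrustStoreConfig() {}

    static void useTrustStore(String jksPath, String password) {
        // Must run before the JVM opens its first SSL connection,
        // because the default SSL context is initialised only once.
        System.setProperty("javax.net.ssl.trustStore", jksPath);
        System.setProperty("javax.net.ssl.trustStorePassword", password);
    }
}
```

The same properties can instead be passed as JVM arguments, in the form `-Djavax.net.ssl.trustStore=...` and `-Djavax.net.ssl.trustStorePassword=...`.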
Removing line breaks in column data using Talend DI (Part 2)
As pointed out in the previous post on this issue, there are ways of dealing with invalid line breaks in text columns. The solution presented here depends on tweaking the structure of the data being landed into the file, then applying some simple logic to remove invalid breaks.
Removing line breaks in column data using Talend DI (Part 1)
The most common line break character or row separator across most O/S platforms is "\n". Data (especially in files) are coded with line break characters at the end of the line to indicate to the application parsing or reading it that the line is complete. There are situations where data within a row can contain a line break character, resulting in incorrect parsing and presentation of data.
Mimicking analytic functions in data flows in Talend Data Integration jobs
Analytic functions compute an aggregate value based on a group of rows. Two common examples are Lead and Lag functions, which allow you to access the NEXT and PREVIOUS row values in a dataset (essentially, finding a value in a row a specified number of rows from a current row).
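As a rough illustration of what Lead and Lag return, the following sketch looks up a value a given number of rows before or after the current row in an in-memory list. The class and method names are hypothetical, and a streaming Talend flow would need a different mechanism than indexing into a list.

```java
import java.util.List;

// Hypothetical helper: LAG/LEAD lookups over rows held in memory.
final class LeadLag {
    private LeadLag() {}

    /** Value 'offset' rows before rows[index], or defaultValue if that row does not exist. */
    static <T> T lag(List<T> rows, int index, int offset, T defaultValue) {
        int target = index - offset;
        return (target >= 0 && target < rows.size()) ? rows.get(target) : defaultValue;
    }

    /** Value 'offset' rows after rows[index], or defaultValue if that row does not exist. */
    static <T> T lead(List<T> rows, int index, int offset, T defaultValue) {
        return lag(rows, index, -offset, defaultValue);
    }
}
```

For the rows 10, 20, 30 and the current row at index 1, `lag(rows, 1, 1, null)` gives 10 and `lead(rows, 1, 1, null)` gives 30; at either end of the list the default value is returned instead.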
Connecting to Hive & other Hadoop 1.x DBs using DBVisualizer
While it's perfectly ok to interact with Hive databases using the command line (Hive shell), it's easier to display and visualize a large number of columns of data using a GUI. Of the many GUI options that exist, one tool that does the job pretty well is DBVisualizer. It's great not only for Hive databases, but for most popular RDBMSs as well. And it doesn't hurt that DBVisualizer is free!
Reading multi-schema XML files with nested loops using Talend Open Studio (Part 3 of 3)
In Parts 1 and 2 of this series, we looked at parsing multi-schema XML files using the tFileInputXML and tXMLMap components within Talend Open Studio. In Part 1, we concluded that the tFileInputXML component by itself could not successfully parse multi-schema XML files. In Part 2, we concluded that the tXMLMap (along with the tFileInputXML) could indeed parse multi-schema XML files.
Reading multi-schema XML files with nested loops using Talend Open Studio (Part 2 of 3)
In Part 1 of this post, we looked at parsing XML files using the tFileInputXML component and determined that it is suited for single or uniform schema XML files with a single XPath Root.
Reading multi-schema XML files with nested loops using Talend Open Studio (Part 1 of 3)
Parsing XML files within Talend is pretty straightforward, as long as the files have a single root node and a single repeating schema. When you have multi-schema, nested XML files, you quickly run into trouble using the same approach as you do with single-schema files.
Managing JVM heap size in Talend Open Studio
Talend Open Studio (DI / Big Data and other versions as well) allows you to manage your machine's JVM heap size (allocated memory) in a number of places. Not having the right amount of memory can result in a number of errors including the following:
'Manual' Specified Order Grouping in Crystal Reports
Specified Order grouping in Crystal Reports has been around for a long time. But there are instances where using the out-of-the-box specified order is difficult at best or not possible due to the complexity of business rules that drive grouping. For example, grouping could be based on the result of evaluating say five independent or nested business rules. In these cases, manual specified order grouping is the quick and easy solution.
Adding a default logo or image to Web Intelligence Documents
Adding a logo to Web Intelligence reports/documents is pretty straightforward. But what if you wanted to skip the repetitive step of adding a logo to every new report - or what if you were looking for a way to quickly update the logo in all your reports with a few keystrokes? Luckily, there's a way to specify a default logo for Webi documents.
Drillable reports using the drilldowngrouplevel function
Unlike Web Intelligence, which natively supports drill-down functionality thanks to its powerful microcube and rich & intuitive design options, Crystal Reports requires a bit of setup to get drill-downs to work. And when drill-downs are set up correctly, they work beautifully in Crystal Reports. Because the Crystal Reports engine does not feed off of microcubes like Web Intelligence, care must be taken to ensure that aggregates (sums, counts, distinct counts etc...) are computed correctly at different drill levels in the report. The safest and simplest aggregates to use are Pass 1 group aggregates (see multi-pass reporting topic in CR Help). These aggregates are dynamically recomputed at each drill level. However, manual aggregates (using formulas) or automatic running totals (all Pass 2 objects) might need to be tweaked to make sure totals are correct regardless of what level of the report is presented.
Organizing Content in a Business Intelligence Platform Using a Numbering System
Today, most organizations actively use some sort of Business Intelligence platform to manage the access and delivery of intelligence to executives, knowledge workers and front-line staff to enable them to perform their jobs. BI platforms have been around for a long time. Some organizations have gone through major conversions from previous legacy systems to adopt modern platforms while others had the luxury of starting from scratch with brand new 'empty' platforms. Whatever the scenario, BI is key to intelligence delivery and productivity.
When the propercase function just doesn't cut it
The propercase function is Crystal's equivalent of the TitleCase function in MS Word. Traditionally used for the titles of everything (books, plays, movies, etc), title case returns a capital/uppercase letter for the start of every significant word – where words like and, of, the and a are not counted as significant.
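The title-case rule described here can be sketched in Java for illustration; a Crystal Reports formula would be written differently. The class name, the list of minor words, and the choice to always capitalize the first word are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: title case that leaves a few minor words in lower case.
final class TitleCase {
    private TitleCase() {}

    // Assumed list of non-significant words; extend as the business rules require.
    private static final Set<String> MINOR_WORDS =
            new HashSet<>(Arrays.asList("and", "of", "the", "a"));

    static String toTitleCase(String text) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            if (w.isEmpty()) {
                continue; // happens only for blank input
            }
            // Assumption: the first word is capitalized even if it is a minor word.
            boolean capitalize = (i == 0) || !MINOR_WORDS.contains(w);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(capitalize ? Character.toUpperCase(w.charAt(0)) + w.substring(1) : w);
        }
        return out.toString();
    }
}
```

With this rule, `toTitleCase("the lord of the rings")` yields `The Lord of the Rings`, whereas capitalizing every word would give `The Lord Of The Rings`.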