Monday, 30 November 2015 18:00

Enabling Hive High Availability in Talend Studio


Starting with Talend 5.6.1, Talend released a patch that updates the Hive components so they can connect to Hadoop clusters configured for High Availability (HA). In HA mode, instead of configuring a specific Hive host and a standard port in the component, we specify the Zookeeper quorum, which in turn 'discovers' the active Hive host.

In a Hadoop 2.x High Availability cluster, there are multiple HiveServers working in an active/passive fashion, so we can no longer hard-code our connection to any one HiveServer: at some point, it could be unavailable or in passive mode. To always connect to the active host, we rely on the Zookeeper service running on multiple Zookeeper servers. These servers keep track of the available HiveServers across the cluster and direct calls to the active HiveServer. In an HA environment, we therefore connect using the Zookeeper connection string instead. This connection string is configured and stored in the yarn-site.xml file (in the hadoop.registry.zk.quorum property).
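For reference, the relevant entry in yarn-site.xml looks something like the following (the hostnames here are placeholders; substitute your own Zookeeper servers):

```xml
<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>zkserver01.company.com:2181,zkserver02.company.com:2181,zkserver03.company.com:2181</value>
</property>
```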

Ultimately, when configured properly, a Talend job should generate a JDBC URL like the following to connect to Hive on an HA cluster:

jdbc:hive2://<zookeeperServer1>:port,<zookeeperServer2>:port,<zookeeperServer3>:port/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2

To do that, in place of specifying the Hive host in the Host text box (see below), we specify the Zookeeper quorum. Note that we leave out the port on the last Zookeeper server because the Talend-generated code appends the port from the Port field. To clarify, here is the text entered in the Host field below:

"zkserver01.company.com:2181,zkserver02.company.com:2181,zkserver03.company.com"

Then the Additional JDBC URL is populated with: ";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
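Putting the pieces together, the generated code effectively concatenates the Host field, the Port field, and the additional JDBC settings into the final URL. The following is a minimal sketch of that assembly, not Talend's actual generated code; the hostnames are the same placeholders used above:

```java
public class HiveHaUrlBuilder {

    /**
     * Assembles a HiveServer2 HA JDBC URL the way the Studio does:
     * the Host field holds the Zookeeper quorum with the last entry's
     * port omitted, the Port field value is appended to that last entry,
     * and the additional JDBC settings are tacked on at the end.
     */
    static String buildUrl(String hostField, String port, String additionalSettings) {
        return "jdbc:hive2://" + hostField + ":" + port + "/" + additionalSettings;
    }

    public static void main(String[] args) {
        String hostField = "zkserver01.company.com:2181,zkserver02.company.com:2181,zkserver03.company.com";
        String url = buildUrl(hostField, "2181",
                ";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2");
        System.out.println(url);
    }
}
```

Running this prints a URL matching the pattern shown earlier, with the Port field value filling in the missing port on the final Zookeeper server.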

[Screenshot: Hive component connection settings in Talend Studio, showing the Host and Port fields]
Having tested this and confirmed that it works, the best practice is to create a metadata connection using these settings for the various lifecycle environments.

Last modified on Thursday, 03 December 2015 19:45
Will Munji

Will Munji is a seasoned data integration, data warehousing and business intelligence (BI) architect & developer who has been working in the DW/BI space for a while. He got his start in BI working on Brio SQR (later Hyperion SQR) and the Crystal Decisions stack (Reports, Analysis & Enterprise) and SAP BusinessObjects / Microsoft BI stacks. He currently focuses on Talend Data Management Suite, Hadoop, SAP BusinessObjects BI stack as well as Jaspersoft and Tableau. He has consulted for many organizations across a variety of industries including healthcare, manufacturing, retail, insurance and banking. At Kindle Consulting, Will delivers DW/BI/Data Integration solutions that range from front-end BI development (dashboards, reports, cube development, T-SQL/ PL/SQL ...) to data services (ETL/ DI development), data warehouse architecture and development, data integration to BI Architecture design and deployment.
