Displaying items by tag: Hive
Out of the box, Talend uses the open source jTDS driver to connect to MS SQL Server databases. This driver, however, does not support connecting to an AlwaysOn-enabled database. As a workaround, a generic JDBC driver has to be used.
Accessing secured web services from a Talend job requires that the JVM authenticate with a trust store file (.jks). Failing to do so results in a java.lang.Exception: nulljavax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed exception. The solution is to configure the Talend job to present a .jks file when accessing the service.
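As a rough illustration of the idea above, the sketch below points the JVM at a trust store through the standard JSSE system properties javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword. The class name, file path and password are placeholders chosen for the example, not values from the post.

```java
// Minimal sketch: point the JVM's default SSL context at a custom trust store.
// The path and password used in main are placeholders (assumptions).
class TrustStoreConfig {

    // Sets the standard JSSE system properties. Call this before the first
    // HTTPS connection is opened, since the default SSL context is normally
    // initialized once and then reused.
    static void configure(String trustStorePath, String password) {
        System.setProperty("javax.net.ssl.trustStore", trustStorePath);
        System.setProperty("javax.net.ssl.trustStorePassword", password);
    }

    public static void main(String[] args) {
        configure("/path/to/truststore.jks", "changeit");
        System.out.println(System.getProperty("javax.net.ssl.trustStore"));
    }
}
```

The same two properties can instead be passed to the JVM as -Djavax.net.ssl.trustStore=... and -Djavax.net.ssl.trustStorePassword=... arguments, which avoids touching the job code.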
Starting with Talend 5.6.2, it is possible to create metadata connections for NoSQL databases and Hadoop platforms using the metadata feature in the Studio. Even better, the Studio now allows automatic discovery of these properties from the Hadoop *-site.xml properties files.
Starting with Talend 5.6.1, a patch released by Talend updates the Hive components so that they can connect to Hadoop clusters configured for HA (High Availability). In HA, instead of configuring the Hive host and a standard port in the component, we specify the ZooKeeper quorum, which in turn 'discovers' the active Hive host.
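To make the quorum idea concrete, here is a minimal sketch of how a HiveServer2 JDBC URL is assembled when ZooKeeper service discovery is used. The host names, database and namespace are hypothetical; the serviceDiscoveryMode and zooKeeperNamespace parameters are the ones the Hive JDBC driver uses, and the namespace value must match the cluster's own configuration.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch: build a HiveServer2 JDBC URL that lists the ZooKeeper
// quorum instead of a single Hive host. All host names are hypothetical.
class HiveHaUrl {

    static String build(List<String> zkHosts, String database, String namespace) {
        // The quorum is a comma-separated list of host:port pairs.
        String quorum = String.join(",", zkHosts);
        return "jdbc:hive2://" + quorum + "/" + database
                + ";serviceDiscoveryMode=zooKeeper"
                + ";zooKeeperNamespace=" + namespace;
    }

    public static void main(String[] args) {
        List<String> quorum = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
        System.out.println(build(quorum, "default", "hiveserver2"));
    }
}
```

With such a URL the driver asks ZooKeeper which HiveServer2 instance is currently registered, so no individual Hive host has to be hard-coded in the connection.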
Talend Hive components have a number of options that can be confusing when making connections to a Hadoop cluster. These include choosing between HiveServer1 and HiveServer2, between Embedded and Standalone modes, and which ports to connect to. We explore the options in this post, pulling in information from the Hive Wiki, Talend Support and other sources.
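At the plain JDBC level, these choices mostly come down to which driver class and URL scheme are used. The sketch below is an illustration under that assumption, not a description of how the Talend components work internally; the host name is hypothetical, and port 10000 is the usual Hive default.

```java
// Minimal sketch: how HiveServer1 vs HiveServer2 and embedded vs standalone
// show up in a JDBC connection. Host and database names are hypothetical.
class HiveConnectionOptions {

    // HiveServer1 and HiveServer2 ship different JDBC driver classes.
    static String driverClass(int serverVersion) {
        return serverVersion == 2
                ? "org.apache.hive.jdbc.HiveDriver"
                : "org.apache.hadoop.hive.jdbc.HiveDriver";
    }

    // Standalone mode names a host and port; embedded mode omits both
    // (pass a null host here to get the embedded form).
    static String url(int serverVersion, String host, int port, String database) {
        String scheme = serverVersion == 2 ? "jdbc:hive2://" : "jdbc:hive://";
        if (host == null) {
            return scheme;
        }
        return scheme + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        System.out.println(driverClass(2));
        System.out.println(url(2, "hivehost.example.com", 10000, "default"));
        System.out.println(url(2, null, 0, "default"));
    }
}
```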
Talend 6 was released in September 2015 and with it come a number of new and important features and updates, including product name changes.
In a previous post, the steps for downloading and configuring DBVisualizer to connect to Hive were presented. The connection was made using a Hive host name in a Hadoop cluster with a single namenode. In this post, we look at connecting to Hive in a Hadoop cluster that's configured for HA (High Availability), meaning it has multiple Hive hosts (as well as multiple namenodes, resource managers, etc.), where one Hive host is active while the others are passive.
While it's perfectly OK to interact with Hive databases using the command line (Hive shell), it's easier to display and visualize a large number of columns of data using a GUI. Of the many GUI options that exist, one tool that does the job pretty well is DBVisualizer. It's great not only for Hive databases but for most popular RDBMSs as well. And it doesn't hurt that DBVisualizer is free!
Hive continues to gain prominence within the Hadoop ecosystem. This is despite the introduction of new tools in the ever-expanding Hadoop universe in the form of new Apache projects and incubators. Hive is an Apache Hadoop platform that uses a SQL-like language (called HQL or Hive Query Language)...