Monday, 17 August 2015 17:32

Configuring Talend Hive Components: HiveServer1 vs. HiveServer2, Embedded vs. Standalone?


Talend Hive components have a number of options that can be confusing when making connections to a Hadoop cluster. These include choosing between HiveServer1 and HiveServer2, Embedded vs. Standalone modes, and which ports to connect to. We explore these options in this post, pulling in information from the Hive wiki, Talend Support and other sources.

Hive is a system for querying and managing structured data built on top of Hadoop. The first iteration of Hive used MapReduce (M/R) for execution and HDFS for storage (or any system that implements the Hadoop FS API). Later iterations of Hive leverage other execution engines, including Tez (Hortonworks) and Spark. The Hive metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive itself) access to this information via the metastore service API. The sections that follow discuss these deployment options and how they map to the Talend component settings.

As a Big Data integration platform, Talend includes components for submitting Hive queries. Once submitted, Hive queries are parsed, optimized and converted into M/R jobs. Because Talend offers a number of options for submitting Hive queries, it can sometimes be confusing to follow what's happening under the covers, especially when it comes to troubleshooting. This post will discuss the following:

  • An overview of how Hive works
  • An explanation of how the Talend components work and what each selection means

An overview of how Hive works

This post will start by looking at how Hive works (the information below is sourced from https://cwiki.apache.org/confluence/display/Hive/Design). The following flowchart describes the steps that occur when a Hive query is submitted to HiveServer.

[Figure: Hive system architecture]

 

Hive Architecture

  • UI – The user interface for users to submit queries and other operations to the system. As of 2011 the system had a command-line interface, and a web-based GUI was being developed.
  • Driver – The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
  • Compiler – The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
  • Metastore – The component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
  • Execution Engine – The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.


The figure above shows how a typical query flows through the system.

  1. The UI calls the execute interface to the Driver (step 1 in figure above).
  2. The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan (step 2).
  3. The compiler gets the necessary metadata from the metastore (steps 3 and 4).
  4. This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates.
  5. The plan generated by the compiler (step 5) is a DAG of stages with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).
  6. The execution engine submits these stages to appropriate components (steps 6, 6.1, 6.2 and 6.3). In each task (mapper/reducer) the deserializer associated with the table or intermediate outputs is used to read the rows from HDFS files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper in case the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations the final temporary file is moved to the table's location. This scheme is used to ensure that dirty data is not read (file rename being an atomic operation in HDFS).
  7. For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver (steps 7, 8 and 9).
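
From the client's point of view, only the submit and fetch steps are visible; compilation, metadata lookup and execution happen inside Hive. As a rough illustration (not code from the original post; the connection, table and column names are placeholder assumptions), the flow maps onto plain JDBC calls like this:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveQueryFlow {

    // Step 1: the query is handed to the Driver via the execute call.
    // Steps 2-6 (compile, fetch metadata, build the DAG, run M/R stages)
    // happen server-side. Steps 7-9: the results written to the temporary
    // HDFS file are streamed back to the client through fetch calls.
    static void runQuery(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT order_id, amount FROM orders LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("order_id") + "\t" + rs.getLong("amount"));
            }
        }
    }
}
```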

 

Explanation of how Talend components work and what each selection means

Talend jobs are essentially JDBC clients to the Hive framework, and the Hive components in a job are configured to align with the Hive configuration of the cluster. This can be a bit complicated, because the Hive documentation overloads some terminology. In the Hive architecture, HQL queries are translated into MapReduce (or Tez or Spark) jobs and executed on the cluster. The "tables" that are queried might just be files on HDFS, and the metadata store determines where a named table will be found on disk, in what format, and what columns it has. The metadata store is already a source of confusion, because it can run in "embedded", "local" or "remote" mode. These modes do not correspond to the Embedded/Standalone configuration of the Hive components in the Talend job.

When the metadata store is running in "remote" (Standalone) mode, it runs as its own, separate process providing a Thrift service (typically on port 9083). The cluster may also be configured with HiveServer2 processes running (we deliberately ignore HiveServer1 here, since it is deprecated). HiveServer2 also provides a Thrift service (typically on port 10000), where a client can submit HQL queries and receive the results.

 

[Figure: JDBC Standalone configuration]

 

When the cluster has a HiveServer2 process, a Talend job should use a "Standalone" Hive configuration. When the job generates HQL to submit, it is sent via Thrift to HiveServer2, which spawns a new JVM (a session) in which the compiled M/R job is launched.
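
Conceptually, a Standalone configuration boils down to a plain HiveServer2 JDBC connection. The sketch below is an illustration, not the code Talend actually generates; the host name, port, database and credentials are placeholder assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class StandaloneConnection {

    // Minimal sketch of a Standalone (HiveServer2) connection.
    // "hiveserver2.example.com", port 10000, the "default" database and the
    // user name are placeholders; substitute the values for your cluster.
    static Connection open() throws ClassNotFoundException, SQLException {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver2.example.com:10000/default";
        // Only the HiveServer2 endpoint is needed; the metastore, HDFS and
        // YARN details stay hidden behind the server.
        return DriverManager.getConnection(url, "hiveuser", "");
    }
}
```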

Note that from the perspective of the compiled MR job, the local filesystem is on the machine where the session has been launched.

In the following diagram, either five or six processes are active. Whether or not the metadata store is running in its own process is entirely transparent to the Talend job and depends on how the cluster is configured.

 

[Figure: JDBC Embedded configuration]

 

However, if there isn't a HiveServer2 running in the cluster, then there must be a metadata store running for the Talend job to obtain metadata. This is when an "Embedded" Hive configuration is appropriate. In this case, the HQL is compiled into MapReduce jobs by the JDBC client on the same machine as the Talend job, and the jobs are submitted directly to the resource manager for execution; the client also interprets the results. In both cases, Talend only uses the JDBC API to access Hive, including query construction and receiving results.

Although the resulting behaviour is different, the differences in code generation are only differences in configuring the Hive JDBC driver:

  • Standalone: the job only needs to know how to access the HiveServer2 on the cluster. That component must be correctly configured and administered, but all of the details of the metastore / HDFS / YARN are completely transparent to the Talend job.
  • Embedded: the job needs to be configured with enough detail that the JDBC driver can access the metastore, compile the jobs, submit them to the resource manager of the cluster and recover the results (see the sketch after this list).
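
As a rough illustration of what "enough detail" means in Embedded mode, the sketch below lists the kind of cluster settings the JDBC client has to know about. The property names are standard Hadoop/Hive keys, but the host names, ports and the exact set of properties are placeholder assumptions and depend on the cluster and the Talend version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EmbeddedModeSettings {

    // Illustrative only: in Embedded mode the JDBC client itself talks to the
    // metastore, HDFS and YARN, so it needs their addresses. Host names and
    // ports below are placeholders.
    static Map<String, String> clusterSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hive.metastore.uris", "thrift://metastore.example.com:9083"); // remote metastore
        conf.put("fs.defaultFS", "hdfs://namenode.example.com:8020");           // HDFS NameNode
        conf.put("yarn.resourcemanager.address", "rm.example.com:8032");        // YARN resource manager
        conf.put("mapreduce.framework.name", "yarn");                           // run M/R stages on YARN
        return conf;
    }
}
```

By contrast, the Standalone sketch shown earlier only needed the HiveServer2 URL.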

Note that Hive can also run in a mode where neither HiveServer2 nor the metadata store is present (everything is performed locally), but the Talend components don't support this mode.

Recommendation: it's best to set Hive components to use Standalone / HiveServer2. This configuration assumes that JDBC calls are made to remote servers/services and requires no configuration on the machine where the Talend job executes, including the Studio. This means that Talend jobs that read and write Hive data can run on a Windows machine even when the Hadoop cluster runs on Linux; the only warning that may be observed is about the missing WinUtils package.

Other reasons that favor Standalone over Embedded are:

  • Standalone is easier to set up for Hadoop clusters running in HA (High Availability) mode; these are clusters that have multiple Hive hosts working in active/passive mode.
  • Standalone is easier to set up for Kerberos authentication than Embedded mode (see the sketch below).
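
For example, with a Kerberos-secured HiveServer2, a Standalone connection typically only needs the service principal appended to the JDBC URL. The sketch below uses placeholder host, port and realm values, and assumes the client already holds a valid Kerberos ticket (e.g. obtained via kinit).

```java
public class KerberosStandaloneUrl {

    // Sketch only: host, port, database and realm are placeholders.
    // The HiveServer2 service principal is passed as a session variable
    // in the JDBC URL; no job-side metastore or HDFS settings are needed.
    static final String URL =
            "jdbc:hive2://hiveserver2.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM";
}
```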

 

Will Munji

Will Munji is a seasoned data integration, data warehousing and business intelligence (BI) architect and developer who has been working in the DW/BI space for a while. He got his start in BI working on Brio SQR (later Hyperion SQR), the Crystal Decisions stack (Reports, Analysis & Enterprise) and the SAP BusinessObjects / Microsoft BI stacks. He currently focuses on the Talend Data Management Suite, Hadoop and the SAP BusinessObjects BI stack, as well as Jaspersoft and Tableau. He has consulted for many organizations across a variety of industries including healthcare, manufacturing, retail, insurance and banking. At Kindle Consulting, Will delivers DW/BI/data integration solutions that range from front-end BI development (dashboards, reports, cube development, T-SQL/PL/SQL ...) to data services (ETL/DI development), data warehouse architecture and development, and BI architecture design and deployment.
