Hive is a system for querying and managing structured data built on top of Hadoop. The first iteration of Hive used MapReduce (M/R) for execution and HDFS (or any system that implements the Hadoop FS API) for storage. Later iterations of Hive can use other execution engines, including Tez (Hortonworks) and Spark. The Hive metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive) access to this information via the metastore service API. The subsections that follow discuss the deployment options and provide instructions for setting up a database in a recommended configuration.
As a Big Data data integration platform, Talend includes components for submitting Hive queries. Once submitted, Hive queries are parsed, optimized and converted into M/R jobs. Because Talend includes a number of options for submitting Hive queries, it can sometimes be confusing to follow what's happening under the covers, especially when it comes to troubleshooting. This post will discuss the following:
An overview of how Hive works
This post will start by looking at how Hive works (information below sourced from https://cwiki.apache.org/confluence/display/Hive/Design). The following flowchart describes the steps that occur when a Hive query is submitted to HiveServer.
Hive Architecture
- UI – The user interface for users to submit queries and other operations to the system. As of 2011 the system had a command line interface and a web based GUI was being developed.
- Driver – The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
- Compiler – The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
- Metastore – The component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
- Execution Engine – The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.
The figure above shows how a typical query flows through the system.
- The UI calls the execute interface to the Driver (step 1 in figure above).
- The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan (step 2).
- The compiler gets the necessary metadata from the metastore (steps 3 and 4).
- This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates.
- The plan generated by the compiler (step 5) is a DAG of stages with each stage being either a map/reduce job, a metadata operation or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).
- The execution engine submits these stages to appropriate components (steps 6, 6.1, 6.2 and 6.3). In each task (mapper/reducer) the deserializer associated with the table or intermediate outputs is used to read the rows from HDFS files and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer (this happens in the mapper in case the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations the final temporary file is moved to the table's location. This scheme is used to ensure that dirty data is not read (file rename being an atomic operation in HDFS).
- For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver (steps 7, 8 and 9).
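The idea that the plan is a DAG of stages, executed in dependency order, can be illustrated with a minimal sketch. The stage names below are hypothetical and are not taken from real Hive output; the sketch only shows how an engine could derive a valid execution order from stage dependencies, using Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
# Two independent map/reduce scans feed a join stage, whose temporary
# output is then moved to the table's final location.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "move_to_table": {"join"},
}


def execution_order(stage_dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(stage_dependencies).static_order())


order = execution_order(plan)
print(order)
```

The two scan stages have no dependency on each other, so either may come first; the only guarantees are that both precede the join and that the move stage runs last.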
Explanation of how Talend components work and what each selection means
Talend jobs are simple JDBC UI/clients to the Hive framework, and Hive components in a job are configured to align to the Hive configuration in a cluster. It can be a bit complicated, because the Hive documentation overloads some terminology. In the Hive architecture, HQL queries are translated into MapReduce (or Tez or Spark) jobs and executed on the cluster. The "tables" that are queried might just be files on HDFS, and the metadata store determines where a named table will be found on disk, in what format, and what columns it has. The metadata store is already a source of confusion, because it can run in "embedded", "local" or "remote" mode.[1] These don't correspond to the configuration of Hive in the Talend job.
When the metadata store is running in "remote" (Standalone) mode, it runs as its own, separate process, providing a Thrift service (often on port 9083). The cluster may also be configured to have HiveServer2 processes running (deliberately ignoring HiveServer1 since it's being deprecated). HiveServer2 also provides a Thrift service (often on port 10000), where a client can submit HQL queries and receive the results.
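When troubleshooting, a quick first check is whether these Thrift ports are reachable from the machine running the job. The following is a minimal sketch using only Python's standard `socket` module. The host names are hypothetical, and 9083 and 10000 are only the commonly used ports mentioned above; an actual cluster may be configured differently.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Hypothetical host names; replace with the hosts of the cluster in question.
SERVICES = {
    "metastore": ("metastore.example.com", 9083),
    "hiveserver2": ("hiveserver2.example.com", 10000),
}


def check_services(services, timeout=3.0):
    """Map each service name to whether its TCP port accepted a connection."""
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in services.items()
    }
```

Calling `check_services(SERVICES)` returns a dictionary of service name to reachability. An open port only shows that something is listening; it does not prove that the listener is a healthy Hive service.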
When the cluster has a HiveServer2 process, a Talend job should use a "Standalone" Hive configuration. When the job generates HQL to submit, it's sent by thrift to the HiveServer2, which spawns a new JVM (a session) where the compiled MR job is launched.
Note that from the perspective of the compiled MR job, the local filesystem is on the machine where the session has been launched.
In the following diagram, either five or six processes are active. Whether or not the metadata store is running in its own process is entirely transparent to the Talend job and depends on how the cluster is configured.
However, if there isn't a HiveServer2 running in the cluster, then there must be a metadata store running for the Talend job to obtain metadata. This is when an "Embedded" Hive configuration is appropriate. In this case, the HQL is compiled into MapReduce jobs by the JDBC client, on the same machine as the Talend job, and submitted directly to the resource manager for execution and interpretation of the results. In both cases, Talend only uses the JDBC API to access Hive, including query construction and receiving results.
Although the resulting behaviour is different, the differences in code generation are only differences in configuring the Hive JDBC driver:
- Standalone: the job only needs to know how to access the HiveServer2 on the cluster. That component must be correctly configured and administered, but all of the details of the metastore / HDFS / YARN are completely transparent to the Talend job.
- Embedded: the job needs to be configured with enough detail that the JDBC driver can access the metastore, compile the jobs, submit them to the resource manager of the cluster and recover the results.
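The difference in how much each mode needs to know can be sketched as follows. This is an illustration only, not the code Talend generates. The function name and host names are hypothetical; `hive.metastore.uris`, `fs.defaultFS` and `yarn.resourcemanager.address` are standard Hive/Hadoop property names, and the host-less `jdbc:hive2://` URL is the form the Hive JDBC driver uses for embedded mode.

```python
def hive_connection_settings(mode, *, hs2_host=None, hs2_port=10000,
                             metastore_uri=None, namenode_uri=None,
                             resource_manager=None, database="default"):
    """Return the JDBC URL and extra properties a client would need per mode."""
    if mode == "standalone":
        # Only the HiveServer2 endpoint is needed; the cluster handles the rest.
        if not hs2_host:
            raise ValueError("standalone mode needs the HiveServer2 host")
        return {
            "url": f"jdbc:hive2://{hs2_host}:{hs2_port}/{database}",
            "properties": {},
        }
    if mode == "embedded":
        # The client compiles and submits jobs itself, so it must know where
        # the metastore, the file system and the resource manager live.
        required = {
            "hive.metastore.uris": metastore_uri,
            "fs.defaultFS": namenode_uri,
            "yarn.resourcemanager.address": resource_manager,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"embedded mode is missing: {', '.join(missing)}")
        return {"url": "jdbc:hive2://", "properties": required}
    raise ValueError(f"unknown mode: {mode!r}")


standalone = hive_connection_settings("standalone", hs2_host="hs2.example.com")
embedded = hive_connection_settings(
    "embedded",
    metastore_uri="thrift://metastore.example.com:9083",
    namenode_uri="hdfs://namenode.example.com:8020",
    resource_manager="rm.example.com:8032",
)
print(standalone["url"])
print(embedded["properties"])
```

The standalone result carries no cluster properties at all, which is the point of the comparison: every extra entry in the embedded result is something that must be kept in sync with the cluster by whoever maintains the job.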
Note that Hive can also run in a mode where neither HiveServer2 nor the metadata store is present (everything is performed locally), but Talend components don't support this mode.
Recommendation: It's best to set Hive components to use Standalone / Hive2. This configuration assumes that JDBC calls are made to remote servers/services and requires no configuration on the machine that runs the Talend job, including the Studio. This means that Talend jobs that read or write Hive data on a Linux-based Hadoop cluster can run on a Windows machine. The only warning that may be observed concerns the missing WinUtils package.
Other reasons that favor Standalone over Embedded are:
- Standalone is easier to set up for Hadoop Clusters running in HA (High Availability) mode; these are clusters that have multiple Hive Hosts working in active/passive mode.
- Standalone is easier to set up for Kerberos Authentication, compared to Embedded mode.
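One reason both points hold is that, in Standalone mode, HA and Kerberos settings can be expressed as parameters on the HiveServer2 JDBC URL. The sketch below builds such URLs. The function name, host names and realm are hypothetical. ZooKeeper service discovery is shown as one common way to reach a set of HiveServer2 hosts, which is an assumption here and not something this post prescribes; `principal`, `serviceDiscoveryMode` and `zooKeeperNamespace` are Hive JDBC URL parameters.

```python
def hiveserver2_url(endpoints, database="default", principal=None,
                    zookeeper_namespace=None):
    """Build a HiveServer2 JDBC URL from a list of "host:port" endpoints."""
    if not endpoints:
        raise ValueError("at least one host:port endpoint is required")
    url = f"jdbc:hive2://{','.join(endpoints)}/{database}"
    if zookeeper_namespace:
        # With service discovery, the endpoints are ZooKeeper servers,
        # not HiveServer2 hosts.
        url += (";serviceDiscoveryMode=zooKeeper"
                f";zooKeeperNamespace={zookeeper_namespace}")
    if principal:
        # Kerberos service principal of HiveServer2.
        url += f";principal={principal}"
    return url


kerberos_url = hiveserver2_url(
    ["hs2.example.com:10000"],
    principal="hive/_HOST@EXAMPLE.COM",
)
ha_url = hiveserver2_url(
    ["zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"],
    zookeeper_namespace="hiveserver2",
)
print(kerberos_url)
print(ha_url)
```

In Embedded mode the equivalent settings would have to be supplied on the client side for the metastore, HDFS and YARN individually, which is why the single-URL approach is usually less work.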