Components of the Impala Server

The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts within your cluster.

The Impala Daemon

The core Impala component is the Impala daemon, physically represented by the impalad process. A few of the key functions that an Impala daemon performs are:
  • Reads and writes to data files.
  • Accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC.
  • Parallelizes the queries and distributes work across the cluster.
  • Transmits intermediate query results back to the central coordinator.
Impala daemons can be deployed in one of the following ways:
  • HDFS and Impala are co-located, and each Impala daemon runs on the same host as a DataNode.
  • Impala is deployed separately in a compute cluster and reads remotely from HDFS, S3, ADLS, etc.

The Impala daemons are in constant communication with StateStore, to confirm which daemons are healthy and can accept new work.

They also receive broadcast messages from the catalogd daemon (introduced in Impala 1.2) whenever any Impala daemon in the cluster creates, alters, or drops any type of object, or when an INSERT or LOAD DATA statement is processed through Impala. This background communication minimizes the need for REFRESH or INVALIDATE METADATA statements that were needed to coordinate metadata across Impala daemons prior to Impala 1.2.

In Impala 2.9 and higher, you can control which hosts act as query coordinators and which act as query executors, to improve scalability for highly concurrent workloads on large clusters. See How to Configure Impala with Dedicated Coordinators for details.

Note: Impala daemons should be deployed on nodes using the same Glibc version since different Glibc version supports different Unicode standard version and also ensure that the en_US.UTF-8 locale is installed in the nodes. Not using the same Glibc version might result in inconsistent UTF-8 behavior when UTF8_MODE is set to true.

Related information: Modifying Impala Startup Options, Starting Impala, Setting the Idle Query and Idle Session Timeouts for impalad, Ports Used by Impala, Using Impala through a Proxy for High Availability

The Impala Statestore

The Impala component known as the StateStore checks on the health of all Impala daemons in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored. You only need such a process on one host in a cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the StateStore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable Impala daemon.

Because the StateStore's purpose is to help when things go wrong and to broadcast metadata to coordinators, it is not always critical to the normal operation of an Impala cluster. If the StateStore is not running or becomes unreachable, the Impala daemons continue running and distributing work among themselves as usual when working with the data known to Impala. The cluster just becomes less robust if other Impala daemons fail, and metadata becomes less consistent as it changes while the StateStore is offline. When the StateStore comes back online, it re-establishes communication with the Impala daemons and resumes its monitoring and broadcasting functions.

If you issue a DDL statement while the StateStore is down, the queries that access the new object the DDL created will fail.

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Related information:

Scalability Considerations for the Impala Statestore, Modifying Impala Startup Options, Starting Impala, Increasing the Statestore Timeout, Ports Used by Impala

The Impala Catalog Service

The Impala component known as the Catalog Service relays the metadata changes from Impala SQL statements to all the Impala coordinators in a cluster. It is physically represented by a daemon process named catalogd. You only need such a process on one host in a cluster. Because the requests are passed through the StateStore daemon, it makes sense to run the statestored and catalogd services on the same host.

The catalog service avoids the need to issue REFRESH and INVALIDATE METADATA statements when the metadata changes are performed by statements issued through Impala.

When you create a table, load data, and so on through Hive, you do need to issue REFRESH or INVALIDATE METADATA on an Impala daemon before executing a query. Performing REFRESH or INVALIDATE METADATA is not required when Automatic Invalidation/Refresh of Metadata is enabled. See Automatic Invalidation/Refresh of Metadata also known as the Hive Metastore (HMS) event processor.
Note: From Impala 4.1, Automatic Invalidation/Refresh of Metadata is enabled by default.

This feature touches a number of aspects of Impala:

  • See Installing Impala, Upgrading Impala and Starting Impala, for usage information for the catalogd daemon.

  • The REFRESH and INVALIDATE METADATA statements are not needed when the CREATE TABLE, INSERT, or other table-changing or data-changing operation is performed through Impala. These statements are still needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the statements only need to be issued on one Impala daemon rather than on all daemons. See REFRESH Statement and INVALIDATE METADATA Statement for the latest usage information for those statements.

Use ‑‑load_catalog_in_background option to control when the metadata of a table is loaded.
  • If set to false, the metadata of a table is loaded when it is referenced for the first time. This means that the first run of a particular query can be slower than subsequent runs. Starting in Impala 2.2, the default for ‑‑load_catalog_in_background is false.
  • If set to true, the catalog service attempts to load metadata for a table even if no query needed that metadata. So metadata will possibly be already loaded when the first query that would need it is run. However, for the following reasons, we recommend not to set the option to true.
    • Background load can interfere with query-specific metadata loading. This can happen on startup or after invalidating metadata, with a duration depending on the amount of metadata, and can lead to a seemingly random long running queries that are difficult to diagnose.
    • Impala may load metadata for tables that are possibly never used, potentially increasing catalog size and consequently memory usage for both catalog service and Impala Daemon.

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Note:

In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive, especially during Impala startup. See New Features in Impala 1.2.4 for details.

Related information: Modifying Impala Startup Options, Starting Impala, Ports Used by Impala