Scalability Considerations for Impala
This section explains how the size of your cluster and the volume of data influences SQL performance and schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of scalability issues, such as a single slow node that causes performance problems for queries.
A good source of tips related to scalability and performance tuning is the Impala Cookbook presentation. These slides are updated periodically as new features come out and new benchmarks are performed.
Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage
Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized
for tables containing relatively few, large data files. Schemas containing thousands of
tables, or tables containing thousands of partitions, can encounter performance issues
during startup or during DDL operations such as ALTER TABLE
statements.
Because of a change in the default heap size for the catalogd daemon in Impala 2.5 and higher, the following procedure to increase the catalogd memory limit might be required following an upgrade to Impala 2.5 even if not needed previously.
-
Check current memory usage for the catalogd daemon by running the following commands on the host where that daemon runs on your cluster:
jcmd catalogd_pid VM.flags jmap -heap catalogd_pid
-
Decide on a large enough value for the catalogd heap. You use the
JAVA_TOOL_OPTIONS
environment variable to set the maximum heap size. For example, the following environment variable setting specifies the maximum heap size of 8 GB.JAVA_TOOL_OPTIONS="-Xmx8g"
-
On systems not using cluster management software, put this environment variable setting into the startup script for the catalogd daemon, then restart the catalogd daemon.
-
Use the same jcmd and jmap commands as earlier to verify that the new settings are in effect.
Scalability Considerations for the Impala Statestore
Before Impala 2.1, the statestore sent only one kind of message to its subscribers. This message contained all updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a subscriber to decide whether or not the subscriber had failed.
Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or restarted.
As of Impala 2.1, the statestore now sends topic updates and heartbeats in separate messages. This allows the statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to send topic updates according to a fixed schedule, reducing statestore network overhead.
The statestore now has the following relevant configuration flags for the statestored daemon:
-
-statestore_num_update_threads
-
The number of threads inside the statestore dedicated to sending topic updates. You
should not typically need to change this value.
Default: 10
-
-statestore_update_frequency_ms
-
The frequency, in milliseconds, with which the statestore tries to send topic
updates to each subscriber. This is a best-effort value; if the statestore is unable
to meet this frequency, it sends topic updates as fast as it can. You should not
typically need to change this value.
Default: 2000
-
-statestore_num_heartbeat_threads
-
The number of threads inside the statestore dedicated to sending heartbeats. You
should not typically need to change this value.
Default: 10
-
-statestore_heartbeat_frequency_ms
-
The frequency, in milliseconds, with which the statestore tries to send heartbeats
to each subscriber. This value should be good for large catalogs and clusters up to
approximately 150 nodes. Beyond that, you might need to increase this value to make
the interval longer between heartbeat messages.
Default: 1000 (one heartbeat message every second)
-
-statestore_heartbeat_tcp_timeout_seconds
-
The time after which a heartbeat RPC to a subscriber will timeout. This setting
protects against badly hung machines that are not able to respond to the heartbeat
RPC in short order. Increase this if there are intermittent heartbeat RPC timeouts
shown in statestore's log. You can reference the max value of
"statestore.priority-topic-update-durations" metric on statestore to get a
reasonable value. Note that priority topic updates are assumed to be small amounts
of data that take a small amount of time to process (similar to the heartbeat
complexity).
Default: 3
-
-statestore_max_missed_heartbeats
-
Maximum number of consecutive heartbeat messages an impalad can miss before being
declared failed by the statestore. You should not typically need to change this
value.
Default: 10
-
-statestore_subscriber_timeout_secs
-
The amount of time (in seconds) that may elapse before the connection with the
statestore is considered lost by subscribers (impalad/catalogd). Impalad will
reregister itself to statestore, which may cause its absence in the next round of
cluster membership update. This will cause query failures like "Cancelled due to
unreachable impalad(s)". The value of this flag should be comparable to
(statestore_heartbeat_frequency_ms / 1000 + statestore_heartbeat_tcp_timeout_seconds) * statestore_max_missed_heartbeats
, so subscribers won't reregister themselves too early and allow statestore to resend heartbeats. You can also reference the max value of "statestore-subscriber.heartbeat-interval-time" metrics on impalads to get a reasonable value.Default: 30
If it takes a very long time for a cluster to start up, and
impala-shell consistently displays This Impala daemon is not
ready to accept user requests
, the statestore might be taking too long to send
the entire catalog topic to the cluster. In this case, consider adding
--load_catalog_in_background=false
to your catalog service
configuration. This setting stops the statestore from loading the entire catalog into
memory at cluster startup. Instead, metadata for each table is loaded when the table is
accessed for the first time.
Effect of Buffer Pool on Memory Usage (Impala 2.10 and higher)
The buffer pool feature, available in Impala 2.10 and higher, changes the way Impala allocates memory during a query. Most of the memory needed is reserved at the beginning of the query, avoiding cases where a query might run for a long time before failing with an out-of-memory error. The actual memory estimates and memory buffers are typically smaller than before, so that more queries can run concurrently or process larger volumes of data than previously.
The buffer pool feature includes some query options that you can fine-tune: BUFFER_POOL_LIMIT Query Option, DEFAULT_SPILLABLE_BUFFER_SIZE Query Option, MAX_ROW_SIZE Query Option, and MIN_SPILLABLE_BUFFER_SIZE Query Option.
Most of the effects of the buffer pool are transparent to you as an Impala user. Memory
use during spilling is now steadier and more predictable, instead of increasing rapidly
as more data is spilled to disk. The main change from a user perspective is the need to
increase the MAX_ROW_SIZE
query option setting when querying tables
with columns containing long strings, many columns, or other combinations of factors
that produce very large rows. If Impala encounters rows that are too large to process
with the default query option settings, the query fails with an error message suggesting
to increase the MAX_ROW_SIZE
setting.
SQL Operations that Spill to Disk
Certain memory-intensive operations write temporary data to disk (known as spilling to disk) when Impala is close to exceeding its memory limit on a particular host.
The result is a query that completes successfully, rather than failing with an out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back in. The slowdown could be potentially be significant. Thus, while this feature improves reliability, you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence.
In Impala 2.10 and higher, also see Scalability Considerations for Impala for changes to Impala memory allocation that might change the details of which queries spill to disk, and how much memory and disk space is involved in the spilling operation.
What kinds of queries might spill to disk:
Several SQL clauses and constructs require memory allocations that could activat the spilling mechanism:
-
when a query uses a
GROUP BY
clause for columns with millions or billions of distinct values, Impala keeps a similar number of temporary results in memory, to accumulate the aggregate results for each value in the group. -
When large tables are joined together, Impala keeps the values of the join columns from one table in memory, to compare them to incoming values from the other table.
-
When a large result set is sorted by the
ORDER BY
clause, each node sorts its portion of the result set in memory. -
The
DISTINCT
andUNION
operators build in-memory data structures to represent all values found so far, to eliminate duplicates as the query progresses.
When the spill-to-disk feature is activated for a join node within a query, Impala does not produce any runtime filters for that join operation on that host. Other join nodes within the query are not affected.
How Impala handles scratch disk space for spilling:
By default, intermediate files used during
large sort, join, aggregation, or analytic function operations are
stored in the directory /tmp/impala-scratch, and
these intermediate files are removed when the operation finishes. You
can specify a different location by starting the
impalad daemon with the
‑‑scratch_dirs="path_to_directory"
configuration option.
Memory usage for SQL operators:
In Impala 2.10 and higher, the way SQL operators such as
GROUP BY
, DISTINCT
, and joins, transition between
using additional memory or activating the spill-to-disk feature is changed. The memory
required to spill to disk is reserved up front, and you can examine it in the
EXPLAIN
plan when the EXPLAIN_LEVEL
query option is
set to 2 or higher.
The infrastructure of the spilling feature affects the way the affected SQL operators,
such as GROUP BY
, DISTINCT
, and joins, use memory. On
each host that participates in the query, each such operator in a query requires memory
to store rows of data and other data structures. Impala reserves a certain amount of
memory up front for each operator that supports spill-to-disk that is sufficient to
execute the operator. If an operator accumulates more data than can fit in the reserved
memory, it can either reserve more memory to continue processing data in memory or start
spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk
support can adapt to different memory constraints by using however much memory is
available to speed up execution, yet tolerate low memory conditions by spilling data to
disk.
The amount data depends on the portion of the data being handled by that host, and thus the operator may end up consuming different amounts of memory on different hosts.
Added in: This feature was added to the ORDER BY
clause in
Impala 1.4. This feature was extended to cover join queries, aggregation functions, and
analytic functions in Impala 2.0. The size of the memory work area required by each
operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
The spilling mechanism was reworked to take
advantage of the Impala buffer pool feature and be more predictable and stable in
Impala 2.10.
Avoiding queries that spill to disk:
Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid this situation by using the following steps:
-
Detect how often queries spill to disk, and how much temporary data is written. Refer
to the following sources:
-
The output of the
PROFILE
command in the impala-shell interpreter. This data shows the memory usage for each host and in total across the cluster. TheWriteIoBytes
counter reports how much data was written to disk for each operator during the query. (In Impala 2.9, the counter was namedScratchBytesWritten
; in Impala 2.8 and earlier, it was namedBytesWritten
.) - The Queries tab in the Impala debug web user interface. Select the query to examine and click the corresponding Profile link. This data breaks down the memory usage for a single host within the cluster, the host whose web interface you are connected to.
-
The output of the
-
Use one or more techniques to reduce the possibility of the queries spilling to disk:
-
Increase the Impala memory limit if practical, for example, if you can increase
the available memory by more than the amount of temporary data written to disk on
a particular node. Remember that in Impala 2.0 and later, you can issue
SET MEM_LIMIT
as a SQL statement, which lets you fine-tune the memory usage for queries from JDBC and ODBC applications. - Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and reduce the amount of memory required on each node.
- Add more memory to the hosts running Impala daemons.
- On a cluster with resources shared between Impala and other Hadoop components, use resource management features to allocate more memory for Impala. See Resource Management for details.
- If the memory pressure is due to running many concurrent queries rather than a few memory-intensive ones, consider using the Impala admission control feature to lower the limit on the number of concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in memory usage and improve overall response times. See Admission Control and Query Queuing for details.
-
Tune the queries with the highest memory requirements, using one or more of the
following techniques:
-
Run the
COMPUTE STATS
statement for all tables involved in large-scale joins and aggregation queries. -
Minimize your use of
STRING
columns in join columns. Prefer numeric values instead. -
Examine the
EXPLAIN
plan to understand the execution strategy being used for the most resource-intensive queries. See Using the EXPLAIN Plan for Performance Tuning for details. - If Impala still chooses a suboptimal execution strategy even with statistics available, or if it is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints to the most resource-intensive queries to select the right execution strategy. See Optimizer Hints for details.
-
Run the
-
If your queries experience substantial performance overhead due to spilling,
enable the
DISABLE_UNSAFE_SPILLS
query option. This option prevents queries whose memory usage is likely to be exorbitant from spilling to disk. See DISABLE_UNSAFE_SPILLS Query Option (Impala 2.0 or higher only) for details. As you tune problematic queries using the preceding steps, fewer and fewer will be cancelled by this option setting.
-
Increase the Impala memory limit if practical, for example, if you can increase
the available memory by more than the amount of temporary data written to disk on
a particular node. Remember that in Impala 2.0 and later, you can issue
Testing performance implications of spilling to disk:
To artificially provoke spilling, to test this feature and understand the performance
implications, use a test environment with a memory limit of at least 2 GB. Issue the
SET
command with no arguments to check the current setting for the
MEM_LIMIT
query option. Set the query option
DISABLE_UNSAFE_SPILLS=true
. This option limits the spill-to-disk
feature to prevent runaway disk usage from queries that are known in advance to be
suboptimal. Within impala-shell, run a query that you expect to be
memory-intensive, based on the criteria explained earlier. A self-join of a large table
is a good candidate:
select count(*) from big_table a join big_table b using (column_with_many_values);
Issue the PROFILE
command to get a detailed breakdown of the memory
usage on each node during the query.
Set the MEM_LIMIT
query option to a value that is smaller than the peak
memory usage reported in the profile output. Now try the memory-intensive query again.
Check if the query fails with a message like the following:
WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted
to prevent potentially bad plans from using too many cluster resources. Compute stats on
these tables, hint the plan or disable this behavior via query options to enable spilling.
If so, the query could have consumed substantial temporary disk space, slowing down so
much that it would not complete in any reasonable time. Rather than rely on the
spill-to-disk feature in this case, issue the COMPUTE STATS
statement
for the table or tables in your sample query. Then run the query again, check the peak
memory usage again in the PROFILE
output, and adjust the memory limit
again if necessary to be lower than the peak memory usage.
At this point, you have a query that is memory-intensive, but Impala can optimize it
efficiently so that the memory usage is not exorbitant. You have set an artificial
constraint through the MEM_LIMIT
option so that the query would
normally fail with an out-of-memory error. But the automatic spill-to-disk feature means
that the query should actually succeed, at the expense of some extra disk I/O to read
and write temporary work data.
Try the query again, and confirm that it succeeds. Examine the PROFILE
output again. This time, look for lines of this form:
- SpilledPartitions: N
If you see any such lines with N greater than 0, that indicates the
query would have failed in Impala releases prior to 2.0, but now it succeeded because of
the spill-to-disk feature. Examine the total time taken by the
AGGREGATION_NODE
or other query fragments containing non-zero
SpilledPartitions
values. Compare the times to similar fragments that
did not spill, for example in the PROFILE
output when the same query is
run with a higher memory limit. This gives you an idea of the performance penalty of the
spill operation for a particular query with a particular memory limit. If you make the
memory limit just a little lower than the peak memory usage, the query only needs to
write a small amount of temporary data to disk. The lower you set the memory limit, the
more temporary data is written and the slower the query becomes.
Now repeat this procedure for actual queries used in your environment. Use the
DISABLE_UNSAFE_SPILLS
setting to identify cases where queries used more
memory than necessary due to lack of statistics on the relevant tables and columns, and
issue COMPUTE STATS
where necessary.
When to use DISABLE_UNSAFE_SPILLS:
You might wonder, why not leave DISABLE_UNSAFE_SPILLS
turned on all the
time. Whether and how frequently to use this option depends on your system environment
and workload.
DISABLE_UNSAFE_SPILLS
is suitable for an environment with ad hoc
queries whose performance characteristics and memory usage are not known in advance. It
prevents "worst-case scenario" queries that use large amounts of memory
unnecessarily. Thus, you might turn this option on within a session while developing new
SQL code, even though it is turned off for existing applications.
Organizations where table and column statistics are generally up-to-date might leave
this option turned on all the time, again to avoid worst-case scenarios for untested
queries or if a problem in the ETL pipeline results in a table with no statistics.
Turning on DISABLE_UNSAFE_SPILLS
lets you "fail fast" in this case
and immediately gather statistics or tune the problematic queries.
Some organizations might leave this option turned off. For example, you might have
tables large enough that the COMPUTE STATS
takes substantial time to
run, making it impractical to re-run after loading new data. If you have examined the
EXPLAIN
plans of your queries and know that they are operating
efficiently, you might leave DISABLE_UNSAFE_SPILLS
turned off. In that
case, you know that any queries that spill will not go overboard with their memory
consumption.
Limits on Query Size and Complexity
There are hardcoded limits on the maximum size and complexity of queries. Currently, the maximum number of expressions in a query is 2000. You might exceed the limits with large or deeply nested queries produced by business intelligence tools or other query generators.
If you have the ability to customize such queries or the query generation logic that
produces them, replace sequences of repetitive expressions with single operators such as
IN
or BETWEEN
that can represent multiple values or
ranges. For example, instead of a large number of OR
clauses:
WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
use a single IN
clause:
WHERE val IN (1,2,6,100,...)
Scalability Considerations for Impala I/O
Impala parallelizes its I/O operations aggressively, therefore the more disks you can attach to each host, the better. Impala retrieves data from disk so quickly using bulk read operations on large blocks, that most queries are CPU-bound rather than I/O-bound.
Because the kind of sequential scanning typically done by Impala queries does not benefit much from the random-access capabilities of SSDs, spinning disks typically provide the most cost-effective kind of storage for Impala data, with little or no performance penalty as compared to SSDs.
Resource management features such as YARN, Llama, and admission control typically constrain the amount of memory, CPU, or overall number of queries in a high-concurrency environment. Currently, there is no throttling mechanism for Impala I/O.
Scalability Considerations for Table Layout
Due to the overhead of retrieving and updating table metadata in the metastore database, try to limit the number of columns in a table to a maximum of approximately 2000. Although Impala can handle wider tables than this, the metastore overhead can become significant, leading to query performance that is slower than expected based on the actual data volume.
To minimize overhead related to the metastore database and Impala query planning, try to limit the number of partitions for any partitioned table to a few tens of thousands.
If the volume of data within a table makes it impractical to run exploratory queries,
consider using the TABLESAMPLE
clause to limit query processing to only
a percentage of data within the table. This technique reduces the overhead for query
startup, I/O to read the data, and the amount of network, CPU, and memory needed to
process intermediate results during the query. See TABLESAMPLE Clause for
details.
Kerberos-Related Network Overhead for Large Clusters
When Impala starts up, or after each kinit
refresh, Impala sends a
number of simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might
be able to process all the requests within roughly 5 seconds. For a cluster with 1000
hosts, the time to process the requests would be roughly 500 seconds. Impala also makes
a number of DNS requests at the same time as these Kerberos-related requests.
While these authentication requests are being processed, any submitted Impala queries will fail. During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, so other secure services might be affected temporarily.
In Impala 2.12 or earlier, to reduce the frequency of the
kinit
renewal that initiates a new set of authentication requests,
increase the kerberos_reinit_interval
configuration setting for the
impalad
daemons. Currently, the default is 60 minutes. Consider using a
higher value such as 360 (6 hours).
The kerberos_reinit_interval
configuration setting is removed in
Impala 3.0, and the above step is no longer needed.
Avoiding CPU Hotspots for HDFS Cached Data
You can use the HDFS caching feature, described in Using HDFS Caching with Impala (Impala 2.1 or higher only), with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
In the early days of this feature, you might have found that enabling HDFS caching resulted in little or no performance improvement, because it could result in "hotspots": instead of the I/O to read the table data being parallelized across the cluster, the I/O was reduced but the CPU load to process the data blocks might be concentrated on a single host.
To avoid hotspots, include the WITH REPLICATION
clause with the
CREATE TABLE
or ALTER TABLE
statements for tables that
use HDFS caching. This clause allows more than one host to cache the relevant data
blocks, so the CPU load can be shared, reducing the load on any one host. See
CREATE TABLE Statement and
ALTER TABLE Statement for details.
Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to
the way that Impala schedules the work of processing data blocks on different hosts. In
Impala 2.5 and higher, scheduling improvements mean that the work
for HDFS cached data is divided better among all the hosts that have cached replicas for
a particular data block. When more than one host has a cached replica for a data block,
Impala assigns the work of processing that block to whichever host has done the least
work (in terms of number of bytes read) for the current query. If hotspots persist even
with this load-based scheduling algorithm, you can enable the query option
SCHEDULE_RANDOM_REPLICA=TRUE
to further distribute the CPU load. This
setting causes Impala to randomly pick a host to process a cached data block if the
scheduling algorithm encounters a tie when deciding which host has done the least work.
Scalability Considerations for File Handle Caching
One scalability aspect that affects heavily loaded clusters is the load on the metadata layer from looking up the details as each file is opened. On HDFS, that can lead to increased load on the NameNode, and on S3, this can lead to an excessive number of S3 metadata requests. For example, a query that does a full table scan on a partitioned table may need to read thousands of partitions, each partition containing multiple data files. Accessing each column of a Parquet file also involves a separate "open" call, further increasing the load on the NameNode. High NameNode overhead can add startup time (that is, increase latency) to Impala queries, and reduce overall throughput for non-Impala workloads that also require accessing HDFS files.
You can reduce the number of calls made to your file system's metadata layer by enabling the file handle caching feature. Data files that are accessed by different queries, or even multiple times within the same query, can be accessed without a new "open" call and without fetching the file details multiple times.
-
HDFS in Impala 2.10 and higher
In Impala 3.2 and higher, file handle caching also applies to remote HDFS file handles. This is controlled by the
cache_remote_file_handles
flag for animpalad
. It is recommended that you use the default value oftrue
as this caching prevents your NameNode from overloading when your cluster has many remote HDFS reads. -
S3 in Impala 3.3 and higher
The
cache_s3_file_handles
impalad
flag controls the S3 file handle caching. The feature is enabled by default with the flag set totrue
.
The feature is enabled by default with 20,000 file handles to be cached. To change the
value, set the configuration option max_cached_file_handles
to a
non-zero value for each impalad daemon. From the initial default
value of 20000, adjust upward if NameNode request load is still significant, or downward
if it is more important to reduce the extra memory usage on each host. Each cache entry
consumes 6 KB, meaning that caching 20,000 file handles requires up to 120 MB on each
Impala executor. The exact memory usage varies depending on how many file handles have
actually been cached; memory is freed as file handles are evicted from the cache.
If a manual operation moves a file to the trashcan while the file handle is cached,
Impala still accesses the contents of that file. This is a change from prior behavior.
Previously, accessing a file that was in the trashcan would cause an error. This
behavior only applies to non-Impala methods of removing files, not the Impala mechanisms
such as TRUNCATE TABLE
or DROP TABLE
.
If files are removed, replaced, or appended by operations outside of Impala, the way to
bring the file information up to date is to run the REFRESH
statement
on the table.
File handle cache entries are evicted as the cache fills up, or based on a timeout period when they have not been accessed for some time.
To evaluate the effectiveness of file handle caching for a particular workload, issue
the PROFILE
statement in impala-shell or examine
query profiles in the Impala Web UI. Look for the ratio of
CachedFileHandlesHitCount
(ideally, should be high) to
CachedFileHandlesMissCount
(ideally, should be low). Before starting
any evaluation, run several representative queries to "warm up" the cache because
the first time each data file is accessed is always recorded as a cache miss.
To see metrics about file handle caching for each impalad instance, examine the following fields on the /metrics page in the Impala Web UI:
- impala-server.io.mgr.cached-file-handles-miss-count
- impala-server.io.mgr.num-cached-file-handles