HDFS caching provides performance and scalability benefits in production environments where Impala queries and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when using data from the Linux OS cache.
On a small or lightly loaded cluster, HDFS caching might not produce any speedup. It might even lead to slower queries, if I/O read operations that were performed in parallel across the entire cluster are replaced by in-memory operations operating on a smaller number of hosts. The hosts where the HDFS blocks are cached can become bottlenecks because they experience high CPU load while processing the cached data blocks, while other hosts remain idle. Therefore, always compare performance with and without this feature enabled, using a realistic workload.
In Impala 2.2 and higher, you can spread the CPU load more evenly by specifying the WITH REPLICATION clause of the CREATE TABLE and ALTER TABLE statements. This clause lets you control the replication factor for HDFS caching for a specific table or partition. By default, each cached block is only present on a single host, which can lead to CPU contention if the same host processes each cached block. Increasing the replication factor lets Impala choose different hosts to process different cached blocks, to better distribute the CPU load. Always use a WITH REPLICATION setting of at least 3, and adjust upward if necessary to match the replication factor for the underlying HDFS data files.
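The load-spreading effect of a cache replication factor greater than 1 can be illustrated with a small sketch (a simplified model, not Impala's actual scheduler; the host names are made up):

```python
import random
from collections import Counter

# With cache replication 1, every scan of a cached block lands on the one
# host holding the cached copy; with replication > 1, the scheduler can pick
# any host that has a cached copy, spreading the CPU load.
def pick_host(hosts_with_cached_copy, rng):
    return rng.choice(hosts_with_cached_copy)

rng = random.Random(42)
single = Counter(pick_host(["host1"], rng) for _ in range(1000))
replicated = Counter(pick_host(["host1", "host2", "host3", "host4"], rng)
                     for _ in range(1000))
print(single)           # all 1000 scans concentrate on host1
print(len(replicated))  # with replication 4, the work spreads across hosts
```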
In Impala 2.5 and higher, Impala automatically randomizes which host processes a cached HDFS block, to avoid CPU hotspots. For tables where HDFS caching is not applied, Impala designates which host processes a data block using an algorithm that estimates the load on each host. If CPU hotspots still arise during queries, you can enable additional randomization in the scheduling algorithm for non-HDFS cached data by setting the SCHEDULE_RANDOM_REPLICA query option.
For background information about how to set up and manage HDFS caching for a cluster, see the documentation for your Apache Hadoop distribution.
In Impala 1.4 and higher, Impala can use the HDFS caching feature to make more effective use of RAM, so that repeated queries can take advantage of data "pinned" in memory regardless of how much data is processed overall. The HDFS caching feature lets you designate a subset of frequently accessed data to be pinned permanently in memory, remaining in the cache across multiple queries and never being evicted. This technique is suitable for tables or partitions that are frequently accessed and are small enough to fit entirely within the HDFS memory cache. For example, you might designate several dimension tables to be pinned in the cache, to speed up many different join queries that reference them. Or in a partitioned table, you might pin a partition holding data from the most recent time period because that data will be queried intensively; then when the next set of data arrives, you could unpin the previous partition and pin the partition holding the new data.
Because this Impala performance feature relies on HDFS infrastructure, it only applies to Impala tables that use HDFS data files. HDFS caching for Impala does not apply to HBase tables, S3 tables, Kudu tables, or Isilon tables.
To use HDFS caching with Impala, first set up that feature for your cluster:
Decide how much memory to devote to the HDFS cache on each host. Remember that the total memory available for cached data is the sum of the cache sizes on all the hosts. By default, any data block is only cached on one host, although you can cache a block across multiple hosts by increasing the replication factor.
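As a sizing sketch, the arithmetic above can be written out as follows. The host count, per-host cache size, and replication factor are made-up example values; the per-host cache size is typically governed by the HDFS dfs.datanode.max.locked.memory setting:

```python
# Total HDFS cache capacity is the sum of the per-host cache sizes, but
# caching each block on multiple hosts (WITH REPLICATION) divides the amount
# of *distinct* data that fits in the cache.
hosts = 20
cache_per_host_gb = 4    # example per-DataNode cache size
cache_replication = 4    # example WITH REPLICATION setting

total_gb = hosts * cache_per_host_gb
distinct_gb = total_gb / cache_replication
print(total_gb)     # 80 GB of raw cache memory across the cluster
print(distinct_gb)  # only 20 GB of distinct data fits at replication 4
```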
Create one or more HDFS cache pools with the hdfs cacheadmin command, owned by the same user as the impalad daemon (typically impala). For example:
hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000
For details about the hdfs cacheadmin command, see
the documentation for your Apache Hadoop distribution.
Once HDFS caching is enabled and one or more pools are available, see Enabling HDFS Caching for Impala Tables and Partitions for how to choose which Impala data to load into the HDFS cache. On the Impala side, you specify the cache pool name defined by the hdfs cacheadmin command in the Impala DDL statements that enable HDFS caching for a table or partition, such as CREATE TABLE ... CACHED IN pool or ALTER TABLE ... SET CACHED IN pool.
Begin by choosing which tables or partitions to cache. For example, these might be lookup tables that are accessed by many different join queries, or partitions corresponding to the most recent time period that are analyzed by different reports or ad hoc queries.
In your SQL statements, you specify logical divisions such as tables and partitions to be cached. Impala translates these requests into HDFS-level directives that apply to particular directories and files. For example, given a partitioned table CENSUS with a partition key column YEAR, you could choose to cache all or part of the data as follows:
In Impala 2.2 and higher, the optional WITH REPLICATION clause for CREATE TABLE and ALTER TABLE lets you specify a replication factor, the number of hosts on which to cache the same data blocks. When the cache replication factor is greater than 1, Impala randomly selects a host that has a cached copy of the data block to process it. This optimization avoids excessive CPU usage on a single host when the same cached data block is processed multiple times. Where practical, specify a value greater than or equal to the HDFS block replication factor.
-- Cache the entire table (all partitions).
alter table census set cached in 'pool_name';
-- Remove the entire table from the cache.
alter table census set uncached;
-- Cache a portion of the table (a single partition).
-- If the table is partitioned by multiple columns (such as year, month, day),
-- the ALTER TABLE command must specify values for all those columns.
alter table census partition (year=1960) set cached in 'pool_name';
-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any
-- single host when the same data block is processed multiple times.
alter table census partition (year=1970)
set cached in 'pool_name' with replication = 4;
-- At each stage, check the volume of cached data.
-- For large tables or partitions, the background loading might take some time,
-- so you might have to wait and reissue the statement until all the data
-- has finished being loaded into the cache.
show table stats census;
+-------+-------+--------+------+--------------+--------+
| year  | #Rows | #Files | Size | Bytes Cached | Format |
+-------+-------+--------+------+--------------+--------+
| 1900  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1940  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1960  | -1    | 1      | 11B  | 11B          | TEXT   |
| 1970  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| Total | -1    | 4      | 44B  | 11B          |        |
+-------+-------+--------+------+--------------+--------+
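To watch the background loading make progress, you can compare the Size and Bytes Cached columns of successive SHOW TABLE STATS results. A minimal sketch of such a check, assuming the ASCII table layout shown above (the cached_partitions helper is hypothetical, not part of Impala):

```python
# Parse the ASCII table printed by SHOW TABLE STATS and report which
# partitions are fully cached. A partition is fully cached once the
# "Bytes Cached" value matches the "Size" value (instead of "NOT CACHED").
def cached_partitions(show_stats_output):
    rows = []
    for line in show_stats_output.strip().splitlines():
        if not line.startswith("|"):
            continue  # skip the +---+ border lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells[0] in ("year", "Total"):
            continue  # skip the header and the summary row
        year, size, bytes_cached = cells[0], cells[3], cells[4]
        rows.append((year, bytes_cached != "NOT CACHED" and bytes_cached == size))
    return rows

sample = """
+-------+-------+--------+------+--------------+--------+
| year  | #Rows | #Files | Size | Bytes Cached | Format |
+-------+-------+--------+------+--------------+--------+
| 1900  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1960  | -1    | 1      | 11B  | 11B          | TEXT   |
+-------+-------+--------+------+--------------+--------+
"""
print(cached_partitions(sample))  # [('1900', False), ('1960', True)]
```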
CREATE TABLE considerations:
The HDFS caching feature affects the Impala CREATE TABLE statement as follows:
You can put a CACHED IN 'pool_name' clause and optionally a WITH REPLICATION = number_of_hosts clause at the end of a CREATE TABLE statement to automatically cache the entire contents of the table, including any partitions added later. The pool_name is a pool that you previously set up with the hdfs cacheadmin command.
Once a table is designated for HDFS caching through the CREATE TABLE
statement, if new
partitions are added later through ALTER TABLE ... ADD PARTITION
statements, the data in
those new partitions is automatically cached in the same pool.
If you want to perform repetitive queries on a subset of data from a large table, and it is not practical to designate the entire table or specific partitions for HDFS caching, you can create a new cached table with just a subset of the data by using CREATE TABLE ... CACHED IN 'pool_name' AS SELECT ... WHERE .... When you are finished generating reports from this subset of data, drop the table, and both the data files and the data cached in RAM are automatically deleted.
See CREATE TABLE Statement for the full syntax.
Other memory considerations:
Certain DDL operations, such as ALTER TABLE ... SET LOCATION, are blocked while the underlying HDFS directories contain cached files. You must uncache the files first, before changing the location, dropping the table, and so on.
When data is requested to be pinned in memory, that process happens in the background without blocking access to the data while the caching is in progress. Loading the data from disk could take some time. Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not been pinned yet.
The amount of data that you can pin on each node through the HDFS caching mechanism is subject to a quota that is enforced by the underlying HDFS service. Before requesting to pin an Impala table or partition in memory, check that its size does not exceed this quota.
When HDFS caching is enabled, extra processing happens in the background when you add or remove data through statements such as INSERT and DROP TABLE.
Inserting or loading data:
When Impala performs an INSERT or LOAD DATA statement for a table or partition that is cached, the new data files are automatically cached and Impala recognizes that fact automatically.
If you perform an INSERT or LOAD DATA through Hive, as always, Impala only recognizes the new data files after a REFRESH table_name statement in Impala. Impala automatically performs a REFRESH once the new data is loaded into the HDFS cache.
Dropping tables, partitions, or cache pools:
The HDFS caching feature interacts with the Impala DROP TABLE and ALTER TABLE ... DROP PARTITION statements as follows:
When you issue a DROP TABLE for a table that is entirely cached, or has some partitions cached, the DROP TABLE succeeds and all the cache directives Impala submitted for that table are removed from the HDFS cache system.
The same applies to ALTER TABLE ... DROP PARTITION. The operation succeeds and any cache directives are removed.
Impala only removes the cache directives that it submitted through CREATE TABLE or ALTER TABLE statements. It is OK to have multiple redundant cache directives pertaining to the same files; the directives all have unique IDs and owners so that the system can tell them apart.
If you drop an HDFS cache pool through the hdfs cacheadmin command, the Impala data files are preserved, just no longer cached. After a subsequent REFRESH, SHOW TABLE STATS reports 0 bytes cached for each associated Impala table or partition.
Relocating a table or partition:
The HDFS caching feature interacts with the Impala ALTER TABLE ... SET LOCATION statement as follows:
If you have designated a table or partition as cached through CREATE TABLE or ALTER TABLE statements, subsequent attempts to relocate the table or partition through an ALTER TABLE ... SET LOCATION statement will fail. You must issue an ALTER TABLE ... SET UNCACHED statement for the table or partition first. Otherwise, Impala would lose track of some cached data files and have no way to uncache them later.
Here are the guidelines and steps to check or change the status of HDFS caching for Impala data:
hdfs cacheadmin command:
If you drop a cache pool with the hdfs cacheadmin command, Impala queries against the associated data files still work, by reading the data files from disk. After you perform a REFRESH on the table, Impala reports the number of bytes cached as 0 for all associated tables and partitions.
hdfs cacheadmin -listDirectives # Basic info
Found 122 entries
ID POOL REPL EXPIRY PATH
123 testPool 1 never /user/hive/warehouse/tpcds.store_sales
124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15
125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01
...
hdfs cacheadmin -listDirectives -stats # More details
Found 122 entries
ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED
123 testPool 1 never /user/hive/warehouse/tpcds.store_sales 0 0 0 0
124 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15 143169 143169 1 1
125 testPool 1 never /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01 112447 112447 1 1
...
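A quick way to confirm that the background loading has finished is to compare BYTES_NEEDED against BYTES_CACHED for each directive. A sketch, assuming the -stats output columns shown above (the fully_cached helper and the sample paths are made up for illustration):

```python
# Check each cache directive line from `hdfs cacheadmin -listDirectives -stats`:
# a directive is fully loaded once BYTES_CACHED equals BYTES_NEEDED.
def fully_cached(stats_lines):
    results = {}
    for line in stats_lines:
        # Columns: ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED
        #          FILES_NEEDED FILES_CACHED
        fields = line.split()
        path = fields[4]
        bytes_needed, bytes_cached = int(fields[5]), int(fields[6])
        results[path] = bytes_needed == bytes_cached
    return results

sample = [
    "124 testPool 1 never /warehouse/t/ss_date=1998-01-15 143169 143169 1 1",
    "125 testPool 1 never /warehouse/t/ss_date=1998-02-01 112447 100000 1 1",
]
print(fully_cached(sample))  # first directive loaded, second still loading
```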
Impala SHOW statement:
For each table or partition, the SHOW TABLE STATS or SHOW PARTITIONS statement displays the number of bytes currently cached by the HDFS caching feature. If there are no cache directives in place for that table or partition, the result set displays NOT CACHED. A value of 0, or a smaller number than the overall size of the table or partition, indicates that the cache request has been submitted but the data has not been entirely loaded into memory yet. See SHOW Statement for details.
Impala memory limits:
The Impala HDFS caching feature interacts with the Impala memory limits as follows: the memory used to hold cached data is managed by the HDFS service, outside of the Impala process, so it is not subject to the --mem_limit startup option, the MEM_LIMIT query option, or further limits imposed through YARN resource management or the Linux cgroups mechanism.
In Impala 1.4.0 and higher, Impala supports efficient reads from data that is pinned in memory through HDFS caching. Impala takes advantage of the HDFS API and reads the data from memory rather than from disk whether the data files are pinned using Impala DDL statements, or using the command-line mechanism where you specify HDFS paths.
When you examine the output of the impala-shell SUMMARY command, or look in the metrics report for the impalad daemon, you see how many bytes are read from the HDFS cache. For example, this excerpt from a query profile illustrates that all the data read during a particular phase of the query came from the HDFS cache, because the BytesRead and BytesReadDataNodeCache values are identical.
HDFS_SCAN_NODE (id=0):(Total: 11s114ms, non-child: 11s114ms, % non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 32.75
- BytesRead: 10.47 GB (11240756479)
- BytesReadDataNodeCache: 10.47 GB (11240756479)
- BytesReadLocal: 10.47 GB (11240756479)
- BytesReadShortCircuit: 10.47 GB (11240756479)
- DecompressionTime: 27s572ms
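Using the counters from the excerpt above, the fraction of scan bytes served from the HDFS cache can be computed directly (a sketch; a ratio below 100% would mean some reads fell back to disk or the OS cache):

```python
# Counter values taken from the query profile excerpt above.
bytes_read = 11240756479                # BytesRead
bytes_read_datanode_cache = 11240756479  # BytesReadDataNodeCache

cache_hit_ratio = bytes_read_datanode_cache / bytes_read
print(f"{cache_hit_ratio:.0%}")  # 100%
```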
For queries involving smaller amounts of data, or in single-user workloads, you might not notice a significant difference in query response time with or without HDFS caching. Even with HDFS caching turned off, the data for the query might still be in the Linux OS buffer cache. The benefits become clearer as data volume increases, and especially as the system processes more concurrent queries. HDFS caching improves the scalability of the overall system. That is, it prevents query performance from declining when the workload outstrips the capacity of the Linux OS cache.
Due to a limitation of HDFS, zero-copy reads are not supported with encryption. Where practical, avoid HDFS caching for Impala data files in encryption zones. The queries fall back to the normal read path during query execution, which might cause some performance overhead.
SELECT considerations:
The Impala HDFS caching feature interacts with the SELECT statement and query performance as follows:
When comparing performance with and without HDFS caching, flush the Linux OS buffer cache on each DataNode between test runs, so that the uncached runs really read from disk:
$ sync
$ echo 1 > /proc/sys/vm/drop_caches
If a test query returns a large result set, the time to print or transmit the results can hide any speedup from caching. Instead, query the COUNT() of the big result set, which does all the same processing but only prints a single line to the screen.