The following sections explain the factors affecting the performance of Impala features, and procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations.
This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. For example, reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more than performance. For example, reducing memory usage for a query might not change the query performance much, but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time without running out of memory.
Before starting any performance tuning or benchmarking, make sure your system meets the recommended minimum hardware requirements in Hardware Requirements and uses the software settings in Post-Installation Configuration for Impala.
Gathering statistics with the COMPUTE STATS statement helps Impala automatically optimize the performance of join queries, without requiring changes to SQL query statements. (This process is greatly simplified in Impala 1.2.2 and higher, because the COMPUTE STATS statement gathers both table and column statistics in one operation, and does not require the setup and configuration that was previously necessary for the ANALYZE TABLE statement in Hive.)
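As a minimal sketch of that workflow, the statements below gather statistics for two tables, check that the statistics are present, and then display the plan for a join between them. The table and column names (sales, customers, customer_id, id, region, amount) are hypothetical and used only for illustration; substitute your own schema.

```sql
-- Hypothetical tables, for illustration only.
-- Gather table and column statistics for each table involved in the join.
COMPUTE STATS sales;
COMPUTE STATS customers;

-- Check that statistics are present. A value of -1 in the row count or
-- distinct-value columns typically means the statistic has not been gathered.
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;

-- Display the query plan to see how Impala intends to run the join.
EXPLAIN
SELECT c.region, SUM(s.amount) AS total_amount
FROM sales s
JOIN customers c ON s.customer_id = c.id
GROUP BY c.region;
```

Running EXPLAIN before and after COMPUTE STATS is one way to see whether the statistics change the join order or join strategy that Impala picks.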
A good source of tips related to scalability and performance tuning is the Impala Cookbook presentation. These slides are updated periodically as new features come out and new benchmarks are performed.