Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with other Hadoop components, as both a consumer and a producer, so it fits flexibly into your ETL and ELT pipelines.
A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new categories of users and open up Hadoop to new types of use cases. Where practical, it makes use of existing Apache Hive infrastructure that many Hadoop users already have in place to perform long-running, batch-oriented SQL queries.
In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as the metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs.
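As a minimal sketch of that interoperability (the table name logs and its columns are hypothetical; the table-level form of INVALIDATE METADATA is available in Impala 1.2.4 and higher), a table created through the Hive shell becomes queryable from impala-shell once Impala loads its metadata:

  -- In the Hive shell: define and populate a table.
  CREATE TABLE logs (event_time TIMESTAMP, level STRING, message STRING)
    STORED AS PARQUET;

  -- In impala-shell: load the new table definition, then query it.
  INVALIDATE METADATA logs;
  SELECT level, COUNT(*) FROM logs GROUP BY level;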
The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. To query data using the Avro, RCFile, or SequenceFile file formats, you load the data using Hive.
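As an illustrative sketch of that division of labor, assuming a Hive version that supports STORED AS AVRO and hypothetical tables avro_events and staging_events, Hive creates and loads the Avro table while Impala queries it read-only:

  -- In the Hive shell: create the Avro table and load it.
  CREATE TABLE avro_events (id BIGINT, payload STRING) STORED AS AVRO;
  INSERT INTO TABLE avro_events SELECT id, payload FROM staging_events;

  -- In impala-shell: query the Avro table (writing to it with INSERT
  -- is not supported).
  INVALIDATE METADATA avro_events;
  SELECT COUNT(*) FROM avro_events;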
The Impala query optimizer can also make use of table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE statement in Hive; in Impala 1.2.2 and higher, use the Impala COMPUTE STATS statement instead. COMPUTE STATS requires less setup, is more reliable, and does not require switching back and forth between impala-shell and the Hive shell.
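The workflow is a single statement per table in impala-shell; sales is a hypothetical table name:

  -- Gather table and column statistics in one step.
  COMPUTE STATS sales;

  -- Confirm the statistics that the optimizer will use.
  SHOW TABLE STATS sales;
  SHOW COLUMN STATS sales;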
As discussed in How Impala Works with Hive, Impala maintains information about table definitions in a central database known as the metastore. Impala also tracks other metadata about the low-level characteristics of data files, such as the physical locations of blocks within HDFS.
For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to reuse for future queries against the same table.
If the table definition or the data in the table is updated, all other Impala daemons in the cluster must receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that table. In Impala 1.2 and higher, the metadata update is automatic, coordinated through the catalogd daemon, for all DDL and DML statements issued through Impala. See The Impala Catalog Service for details.
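As a sketch of this behavior (new_table is hypothetical), a table created through one impalad is immediately usable through any other, with no manual metadata refresh:

  -- In impala-shell connected to one node:
  CREATE TABLE new_table (id INT, name STRING);

  -- In impala-shell connected to a different node; no REFRESH or
  -- INVALIDATE METADATA needed, because catalogd broadcasts the change.
  DESCRIBE new_table;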
For DDL and DML issued through Hive, or changes made manually to files in HDFS, you still use the REFRESH statement (when new data files are added to existing tables) or the INVALIDATE METADATA statement (for entirely new tables, or after dropping a table, performing an HDFS rebalance operation, or deleting data files). Issuing INVALIDATE METADATA by itself retrieves metadata for all the tables tracked by the metastore. If you know that only specific tables have been changed outside of Impala, you can issue REFRESH table_name for each affected table to retrieve only the latest metadata for those tables.
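A minimal sketch of both statements in impala-shell, with hypothetical table names:

  -- Hive appended data files to an existing table:
  REFRESH sales;

  -- Hive created an entirely new table, so reload its metadata:
  INVALIDATE METADATA new_hive_table;

  -- Or, if you are unsure what changed, reload metadata for all tables:
  INVALIDATE METADATA;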
Impala uses the distributed filesystem HDFS as its primary data storage medium. Impala relies on the redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table data is physically represented as data files in HDFS, using familiar HDFS file formats and compression codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of file name. New data is added in files with names controlled by Impala.
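For example, because Impala reads whatever files appear in a table's data directory, you can attach already-uploaded HDFS files to a table with the LOAD DATA statement; the path and table name here are hypothetical:

  -- Move staged HDFS files into the table's data directory; Impala
  -- reads them regardless of their file names.
  LOAD DATA INPATH '/user/etl/staging/batch1' INTO TABLE sales;

  -- If files are instead copied into the directory outside of Impala
  -- (for example, with hdfs dfs -put), refresh the cached file list:
  REFRESH sales;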
HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built on top of HDFS, without built-in SQL support. Many Hadoop users already have it configured and store large (often sparse) data sets in it. By defining tables in Impala and mapping them to equivalent tables in HBase, you can query the contents of the HBase tables through Impala, and even perform join queries including both Impala and HBase tables. See Using Impala to Query HBase Tables for details.
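A sketch of that mapping, with hypothetical table and column names (including the HDFS-backed orders table), uses the Hive HBase storage handler to define the mapping and then queries it from Impala:

  -- In the Hive shell: map an external table onto an existing HBase
  -- table named 'users', binding the row key and two columns.
  CREATE EXTERNAL TABLE hbase_users (id STRING, name STRING, city STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:city')
  TBLPROPERTIES ('hbase.table.name' = 'users');

  -- In impala-shell: query it, including joins with HDFS-backed tables.
  INVALIDATE METADATA hbase_users;
  SELECT o.order_id, u.name
  FROM orders o JOIN hbase_users u ON o.user_id = u.id;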