To perform as expected, Impala depends on the availability of the software, hardware, and configurations described in the following sections.
Apache Impala runs on Linux systems only. See the README.md file for more information.
Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking metadata about schema objects such as tables and columns. The following components are prerequisites for Impala:
MySQL or PostgreSQL, to act as the metastore database for both Impala and Hive.
Always configure a Hive metastore service rather than connecting directly to the metastore database. The Hive metastore service is required to interoperate between different levels of metastore APIs, if that is necessary for your environment, and using it avoids known issues with connecting directly to the metastore database.
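The metastore service location is specified in hive-site.xml. As an illustration, a minimal entry pointing clients at a metastore service might look like the following; the host name metastorehost is a placeholder for your environment:

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastorehost:9083</value>
    </property>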
See below for a summary of the metastore installation process.
To install the metastore:
1. Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
2. Download the MySQL connector or the PostgreSQL connector and place it in the /usr/share/java/ directory.
3. Use the appropriate command line tool for your database to create the metastore database.
4. Use the appropriate command line tool for your database to grant privileges for the metastore database to the hive user.
5. Modify hive-site.xml to include information matching your particular database: its URL, username, and password. You will copy the hive-site.xml file to the Impala Configuration Directory later in the Impala installation process.
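As an illustration of step 5, a minimal hive-site.xml fragment for a MySQL metastore database might look like the following; the host name metastorehost, database name metastore, and credentials are placeholders for your environment:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastorehost:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>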
Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop components:
Internally, the impalad daemon relies on the JAVA_HOME environment variable to locate the system Java libraries. Make sure the impalad service is not run from an environment with an incorrect setting for this variable.
All Java dependencies are packaged in the impala-dependencies.jar file, which is located at /usr/lib/impala/lib/. These map to everything that is built under fe/target/dependency.
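For example, a service environment can set the variable explicitly before restarting the daemon. The JVM path below is an assumed example, so substitute the location of your installed JVM; in packaged installations the daemon typically runs as the impala-server service:

    # Point impalad at the system Java libraries (example path).
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    sudo service impala-server restart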
As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using network connections to work with remote data. To support this goal, Impala matches the hostname provided to each Impala daemon with the IP address of each DataNode by resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface for the DataNode and the Impala daemon on each machine, and ensure that the Impala daemon's hostname flag resolves to the IP address of the DataNode. On single-homed machines this is usually automatic, but on multi-homed machines, verify that the hostname resolves to the correct interface. Impala tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log in a message of the form:
Using hostname: impala-daemon-1.example.com
In the majority of cases, this automatic detection works correctly. If you need to
explicitly set the hostname, do so by setting the --hostname
flag.
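For example, you can check what the configured hostname resolves to on a given machine, and pin the daemon to a specific name if necessary; the host name below is a placeholder:

    # Check which IP address the hostname resolves to on this machine.
    getent hosts impala-daemon-1.example.com

    # Start the daemon with an explicit hostname (placeholder value).
    impalad --hostname=impala-daemon-1.example.com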
Configure a consistent memory limit across all Impala executor nodes. A single Impala executor with a lower memory limit than the rest can easily become a bottleneck and lead to suboptimal performance.
This guideline does not apply to coordinator-only nodes.
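For example, executors are typically started with the same memory limit flag on every node; the value below is illustrative:

    # Use the same memory limit on every executor node.
    impalad --mem_limit=80%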
During join operations, portions of data from each joined table are loaded into memory. Data sets can be very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate completing.
While requirements vary according to data set size, the following is generally recommended:
CPU: Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors. To confirm support, see the check after this list.
Memory: 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query writes temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.
JVM heap size for the catalog server: 4 GB or more recommended, ideally 8 GB or more, to accommodate the maximum numbers of tables, partitions, and data files you are planning to use with Impala.
Storage: DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be querying.
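As a quick check on Linux, you can confirm that a node's processor advertises SSSE3:

    # Prints a nonzero count if the CPU supports the SSSE3 instruction set.
    grep -c ssse3 /proc/cpuinfo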
Impala creates and uses a user and group named impala. Do not delete this account or group and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in a white-list, add these accounts to the list of permitted accounts.
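To confirm that the account and group are present and intact on a node:

    # Show the impala user's UID, GID, and group memberships.
    id impala

    # Confirm the impala group entry.
    getent group impala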
For correct file deletion during DROP TABLE operations, Impala must be able to move files to the HDFS trashcan. You might need to create an HDFS directory /user/impala, writeable by the impala user, so that the trashcan can be created. Otherwise, data files might remain behind after a DROP TABLE statement.
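For example, an HDFS administrator (a user with HDFS superuser privileges) can create the directory and assign it to the impala user:

    # Create the directory that will hold the impala user's trashcan.
    hdfs dfs -mkdir -p /user/impala

    # Make the impala user the owner so the trashcan can be created under it.
    hdfs dfs -chown impala /user/impala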
Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not permitted to use direct reads. Therefore, running Impala as root negatively affects performance.
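To verify which account a running daemon is using:

    # Print the user that owns the running impalad process.
    ps -o user= -C impalad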
By default, any user can connect to Impala and access all the associated databases and tables. You can enable authorization and authentication based on the Linux OS user who connects to the Impala server, and the associated groups for that user. See Impala Security for details. These security features do not change the underlying file permission requirements; the impala user still needs to be able to access the data files.
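For example, you can inspect the ownership and permissions on a table's data directory; the path below is a placeholder:

    # List ownership and permissions on a table's data directory.
    hdfs dfs -ls /user/hive/warehouse/mytable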