Impala includes features that balance and maximize resources in your Apache Hadoop cluster. This topic describes how you can improve the efficiency of your Apache Hadoop cluster using those features.
The configuration options for admission control range from the simple (a single resource pool with a single set of options) to the complex (multiple resource pools with different options, each pool handling queries for a different set of users and groups).
To configure admission control, combine startup options for the Impala daemon with settings in the configuration files fair-scheduler.xml and llama-site.xml, which you edit or create as needed.
For a straightforward configuration using a single resource pool named default, you can specify configuration options on the command line and skip the fair-scheduler.xml and llama-site.xml configuration files.
For a configuration that does use these files, provide the paths to fair-scheduler.xml and llama-site.xml with the impalad startup options --fair_scheduler_allocation_path and --llama_site_path, respectively.
The Impala admission control feature uses the Fair Scheduler configuration settings to determine how to map users and groups to different resource pools. For example, you might set up different resource pools with separate memory limits, and maximum number of concurrent and queued queries, for different categories of users within your organization. For details about all the Fair Scheduler configuration settings, see the Apache wiki.
The Impala admission control feature uses a small subset of possible settings from the llama-site.xml configuration file:
llama.am.throttling.maximum.placed.reservations.queue_name
llama.am.throttling.maximum.queued.reservations.queue_name
impala.admission-control.pool-default-query-options.queue_name
impala.admission-control.pool-queue-timeout-ms.queue_name
The impala.admission-control.pool-queue-timeout-ms setting specifies the timeout value for this pool in milliseconds.
The impala.admission-control.pool-default-query-options setting designates the default query options for all queries that run in this pool. Its argument value is a comma-delimited string of 'key=value' pairs, such as 'key1=val1,key2=val2,...'. For example, this is where you might set a default memory limit for all queries in the pool, using an argument such as MEM_LIMIT=5G.
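As an illustration of the MEM_LIMIT example, a llama-site.xml property along the following lines would give every query in a pool a default memory limit. This is a minimal sketch: the pool name root.production is borrowed from the sample pools used in this topic, and the 5G value is only illustrative, not a recommendation.

```xml
<!-- Sketch only: default query options for the root.production pool. -->
<!-- The pool name and the 5G value are illustrative assumptions. -->
<property>
  <name>impala.admission-control.pool-default-query-options.root.production</name>
  <value>MEM_LIMIT=5G</value>
</property>
```

The property goes inside the configuration element of llama-site.xml, alongside any other settings for the same pool.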
The impala.admission-control.* configuration settings are available in Impala 2.5 and higher.
Here are sample fair-scheduler.xml and llama-site.xml files that define resource pools root.default, root.development, and root.production. These files define resource pools for Impala admission control and are separate from the similar fair-scheduler.xml that defines resource pools for YARN.
fair-scheduler.xml:
Although Impala does not use the vcores value, you must still specify it to satisfy YARN requirements for the file contents.
Each <aclSubmitApps> tag (other than the one for root) contains a comma-separated list of users, then a space, then a comma-separated list of groups; these are the users and groups allowed to submit Impala statements to the corresponding resource pool.
If you leave the <aclSubmitApps> element empty for a pool, nobody can submit directly to that pool; child pools can specify their own <aclSubmitApps> values to authorize users and groups to submit to those pools.
<allocations>
  <queue name="root">
    <aclSubmitApps> </aclSubmitApps>
    <queue name="default">
      <maxResources>50000 mb, 0 vcores</maxResources>
      <aclSubmitApps>*</aclSubmitApps>
    </queue>
    <queue name="development">
      <maxResources>200000 mb, 0 vcores</maxResources>
      <aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
    </queue>
    <queue name="production">
      <maxResources>1000000 mb, 0 vcores</maxResources>
      <aclSubmitApps> ops,admin</aclSubmitApps>
    </queue>
  </queue>
  <queuePlacementPolicy>
    <rule name="specified" create="false"/>
    <rule name="default" />
  </queuePlacementPolicy>
</allocations>
llama-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>llama.am.throttling.maximum.placed.reservations.root.default</name>
    <value>10</value>
  </property>
  <property>
    <name>llama.am.throttling.maximum.queued.reservations.root.default</name>
    <value>50</value>
  </property>
  <property>
    <name>impala.admission-control.pool-default-query-options.root.default</name>
    <value>mem_limit=128m,query_timeout_s=20,max_io_buffers=10</value>
  </property>
  <property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
    <value>30000</value>
  </property>
  <property>
    <name>impala.admission-control.max-query-mem-limit.root.default.regularPool</name>
    <value>1610612736</value><!--1.5GB-->
  </property>
  <property>
    <name>impala.admission-control.min-query-mem-limit.root.default.regularPool</name>
    <value>52428800</value><!--50MB-->
  </property>
  <property>
    <name>impala.admission-control.clamp-mem-limit-query-option.root.default.regularPool</name>
    <value>true</value>
  </property>
  <property>
    <name>impala.admission-control.max-query-cpu-core-per-node-limit.root.default.regularPool</name>
    <value>8</value>
  </property>
  <property>
    <name>impala.admission-control.max-query-cpu-core-coordinator-limit.root.default.regularPool</name>
    <value>8</value>
  </property>
</configuration>
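The sample llama-site.xml above sets properties only for root.default. Settings for another pool would follow the same naming pattern, with the pool name appended to each property name. The fragment below is a hedged sketch for root.development; every value in it is an illustrative assumption rather than a recommendation.

```xml
<!-- Sketch only: per-pool settings for root.development. -->
<!-- These properties would sit inside the <configuration> element. -->
<property>
  <!-- Limit on placed reservations for this pool (value assumed) -->
  <name>llama.am.throttling.maximum.placed.reservations.root.development</name>
  <value>20</value>
</property>
<property>
  <!-- Limit on queued reservations for this pool (value assumed) -->
  <name>llama.am.throttling.maximum.queued.reservations.root.development</name>
  <value>100</value>
</property>
<property>
  <!-- Queue timeout for this pool, in milliseconds (value assumed) -->
  <name>impala.admission-control.pool-queue-timeout-ms.root.development</name>
  <value>60000</value>
</property>
```

Keeping each pool's properties grouped together makes it easier to compare limits across pools when the file grows.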
The following Impala configuration options let you adjust the settings of the admission control feature. When supplying the options on the impalad command line, prepend the option name with --.
queue_wait_timeout_ms
Type: int64
Default: 60000
default_pool_max_requests
Ignored if fair_scheduler_allocation_path and llama_site_path are set.
Type: int64
Default: -1, meaning unlimited (prior to Impala 2.5 the default was 200)
default_pool_max_queued
Ignored if fair_scheduler_allocation_path and llama_site_path are set.
Type: int64
Default: unlimited
default_pool_mem_limit
Specify the limit as a number followed by the suffix b (optional), m, or g, either uppercase or lowercase. You can specify floating-point values for megabytes and gigabytes, to represent fractional numbers such as 1.5. You can also specify it as a percentage of the physical memory by specifying the suffix %. 0 or no setting indicates no limit. Defaults to bytes if no unit is given. Because this limit applies cluster-wide, but each Impala node makes independent decisions to run queries immediately or queue them, it is a soft limit; the overall memory used by concurrent queries might be slightly higher during times of heavy load. Ignored if fair_scheduler_allocation_path and llama_site_path are set.
Impala relies on the statistics produced by the COMPUTE STATS statement to estimate memory usage for each query. See COMPUTE STATS Statement for guidelines about how and when to use this statement.
Type: string
Default: "" (empty string, meaning unlimited)
disable_pool_max_requests
Type: Boolean
Default: false
disable_pool_mem_limits
Type: Boolean
Default: false
fair_scheduler_allocation_path
Path to the fair scheduler allocation file (fair-scheduler.xml).
Type: string
Default: "" (empty string)
Usage notes: Admission control only uses a small subset of the settings that can go in this file, as described below. For details about all the Fair Scheduler configuration settings, see the Apache wiki.
llama_site_path
Path to the configuration file used by admission control (llama-site.xml). If set, fair_scheduler_allocation_path must also be set.
Type: string
Default: "" (empty string)
Usage notes: Admission control only uses a few of the settings that can go in this file, as described below.