Using Impala through a Proxy for High Availability

For most clusters that have multiple users and production availability requirements, you might want to set up a load-balancing proxy server to relay requests to and from Impala.

Set up a software package of your choice to perform these functions.

Note:

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Overview of Proxy Usage and Load Balancing for Impala

Using a load-balancing proxy server for Impala has the following advantages:

  • Applications connect to a single well-known host and port, rather than keeping track of the hosts where the impalad daemon is running.
  • If any host running the impalad daemon becomes unavailable, application connection requests still succeed because you always connect to the proxy server rather than a specific host running the impalad daemon.
  • The coordinator node for each Impala query potentially requires more memory and CPU cycles than the other nodes that process the query. The proxy server can issue queries so that each connection uses a different coordinator node. This load-balancing technique lets the impalad nodes share this additional work, rather than concentrating it on a single machine.

The following setup steps are a general outline that apply to any load-balancing proxy software:

  1. Select and download the load-balancing proxy software or other load-balancing hardware appliance. It should only need to be installed and configured on a single host, typically on an edge node.
  2. Configure the load balancer (typically by editing a configuration file). In particular:
  3. If you are using Hue or JDBC-based applications, you typically set up load balancing for both ports 21000 and 21050 because these client applications connect through port 21050 while the impala-shell command connects through port 21000. See Ports Used by Impala for when to use port 21000, 21050, or another value depending on what type of connections you are load balancing.
  4. Run the load-balancing proxy server, pointing it at the configuration file that you set up.
  5. For any scripts, jobs, or configuration settings for applications that formerly connected to a specific impalad to run Impala SQL statements, change the connection information (such as the -i option in impala-shell) to point to the load balancer instead.
Note: The following sections use the HAProxy software as a representative example of a load balancer that you can use with Impala.

Choosing the Load-Balancing Algorithm

Load-balancing software offers a number of algorithms to distribute requests. Each algorithm has its own characteristics that make it suitable in some situations but not others.

Leastconn
Connects sessions to the coordinator with the fewest connections, to balance the load evenly. Typically used for workloads consisting of many independent, short-running queries. In configurations with only a few client machines, this setting can avoid having all requests go to only a small set of coordinators.
Recommended for Impala with F5.
Source IP Persistence

Sessions from the same IP address always go to the same coordinator. A good choice for Impala workloads containing a mix of queries and DDL statements, such as CREATE TABLE and ALTER TABLE. Because the metadata changes from a DDL statement take time to propagate across the cluster, prefer to use the Source IP Persistence in this case. If you are unable to choose Source IP Persistence, run the DDL and subsequent queries that depend on the results of the DDL through the same session, for example by running impala-shell -f script_file to submit several statements through a single session.

Required for setting up high availability with Hue.

Round-robin
Distributes connections to all coordinator nodes. Typically not recommended for Impala.

You might need to perform benchmarks and load testing to determine which setting is optimal for your use case. Always set up using two load-balancing algorithms: Source IP Persistence for Hue and Leastconn for others.

Special Proxy Considerations for Clusters Using Kerberos

In a cluster using Kerberos, applications check host credentials to verify that the host they are connecting to is the same one that is actually processing the request.

In Impala 2.11 and lower versions, once you enable a proxy server in a Kerberized cluster, users will not be able to connect to individual impala daemons directly from impala-shell.

In Impala 2.12 and higher versions, when you enable a proxy server in a Kerberized cluster, users have an option to connect to Impala daemons directly from impala-shell using the -b / --kerberos_host_fqdn impala-shell flag. This option can be used for testing or troubleshooting purposes, but not recommended for live production environments as it defeats the purpose of a load balancer/proxy.

Example:

impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com
Alternatively, with the fully qualified configurations:
impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com

See impala-shell Configuration Options for information about the option.

To validate the load-balancing proxy server, perform these extra Kerberos setup steps:

  1. This section assumes you are starting with a Kerberos-enabled cluster. See Enabling Kerberos Authentication for Impala for instructions for setting up Impala with Kerberos. See the documentation for your Apache Hadoop distribution for general steps to set up Kerberos.
  2. Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should already have an entry impala/proxy_host@realm in its keytab. If not, go back over the initial Kerberos configuration steps for the keytab on each host running the impalad daemon.
  3. Copy the keytab file from the proxy host to all other hosts in the cluster that run the impalad daemon. Put the keytab file in a secure location on each of these other hosts.
  4. Add an entry impala/actual_hostname@realm to the keytab on each host running the impalad daemon.
  5. For each impalad node, merge the existing keytab with the proxy’s keytab using ktutil, producing a new keytab file. For example:
    $ ktutil
      ktutil: read_kt proxy.keytab
      ktutil: read_kt impala.keytab
      ktutil: write_kt proxy_impala.keytab
      ktutil: quit
  6. To verify that the keytabs are merged, run the command:
    
    klist -k keytabfile
    
    The command lists the credentials for both principal and be_principal on all nodes.
  7. Make sure that the impala user has the permission to read this merged keytab file.
  8. For each coordinator impalad host in the cluster that participates in the load balancing, add the following configuration options to receive client connections coming through the load balancer proxy server:
    
    --principal=impala/proxy_host@realm
      --be_principal=impala/actual_host@realm
      --keytab_file=path_to_merged_keytab
    

    The --principal setting prevents a client from connecting to a coordinator impalad using a principal other than one specified.

    Note: Every host has different --be_principal because the actual host name is different on each host. Specify the fully qualified domain name (FQDN) for the proxy host, not the IP address. Use the exact FQDN as returned by a reverse DNS lookup for the associated IP address.
  9. Restart Impala to make the changes take effect. Restart the impalad daemons on all hosts in the cluster, as well as the statestored and catalogd daemons.

Client Connection to Proxy Server in Kerberized Clusters

When a client connect to Impala, the service principal specified by the client must match the -principal setting of the Impala proxy server. And the client should connect to the proxy server port.

In hue.ini, set the following to configure Hue to automatically connect to the proxy server:

[impala]
server_host=proxy_host
impala_principal=impala/proxy_host

The following are the JDBC connection string formats when connecting through the load balancer with the load balancer's host name in the principal:

jdbc:hive2://proxy_host:load_balancer_port/;principal=impala/_HOST@realm
jdbc:hive2://proxy_host:load_balancer_port/;principal=impala/proxy_host@realm

When starting impala-shell, specify the service principal via the -b or --kerberos_host_fqdn flag.

Special Proxy Considerations for TLS/SSL Enabled Clusters

When TLS/SSL is enabled for Impala, the client application, whether impala-shell, Hue, or something else, expects the certificate common name (CN) to match the hostname that it is connected to. With no load balancing proxy server, the hostname and certificate CN are both that of the impalad instance. However, with a proxy server, the certificate presented by the impalad instance does not match the load balancing proxy server hostname. If you try to load-balance a TLS/SSL-enabled Impala installation without additional configuration, you see a certificate mismatch error when a client attempts to connect to the load balancing proxy host.

You can configure a proxy server in several ways to load balance TLS/SSL enabled Impala:

TLS/SSL Bridging
In this configuration, the proxy server presents a TLS/SSL certificate to the client, decrypts the client request, then re-encrypts the request before sending it to the backend impalad. The client and server certificates can be managed separately. The request or resulting payload is encrypted in transit at all times.
TLS/SSL Passthrough
In this configuration, traffic passes through to the backend impalad instance with no interaction from the load balancing proxy server. Traffic is still encrypted end-to-end.
The same server certificate, utilizing either wildcard or Subject Alternate Name (SAN), must be installed on each impalad instance.
TLS/SSL Offload
In this configuration, all traffic is decrypted on the load balancing proxy server, and traffic between the backend impalad instances is unencrypted. This configuration presumes that cluster hosts reside on a trusted network and only external client-facing communication need to be encrypted in-transit.

Refer to your load balancer documentation for the steps to set up Impala and the load balancer using one of the options above.

Example of Configuring HAProxy Load Balancer for Impala

If you are not already using a load-balancing proxy, you can experiment with HAProxy a free, open source load balancer. This example shows how you might install and configure that load balancer on a Red Hat Enterprise Linux system.

  • Install the load balancer:

    yum install haproxy
  • Set up the configuration file: /etc/haproxy/haproxy.cfg. See the following section for a sample configuration file.

  • Run the load balancer (on a single host, preferably one not running impalad):

    /usr/sbin/haproxy –f /etc/haproxy/haproxy.cfg
  • In impala-shell, JDBC applications, or ODBC applications, connect to the listener port of the proxy host, rather than port 21000 or 21050 on a host actually running impalad. The sample configuration file sets haproxy to listen on port 25003, therefore you would send all requests to haproxy_host:25003.

This is the sample haproxy.cfg used in this example:

global
    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local0
    log         127.0.0.1 local1 notice
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#
# The timeout values should be dependant on how you use the cluster
# and how long your queries run.
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    maxconn                 3000
    timeout connect 5000
    timeout client 3600s
    timeout server 3600s

#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
    balance
    mode http
    stats enable
    stats auth username:password

# Setup for Impala.
# Impala client connect to load_balancer_host:25003.
# HAProxy will balance connections among the list of servers listed below.
# The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
# For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn

    server symbolic_name_1 impala-host-1.example.com:21000 check
    server symbolic_name_2 impala-host-2.example.com:21000 check
    server symbolic_name_3 impala-host-3.example.com:21000 check
    server symbolic_name_4 impala-host-4.example.com:21000 check

# Setup for Hue or other JDBC-enabled applications.
# In particular, Hue requires sticky sessions.
# The application connects to load_balancer_host:21051, and HAProxy balances
# connections to the associated hosts, where Impala listens for
# JDBC requests at port 21050.
listen impalajdbc :21051
    mode tcp
    option tcplog
    balance source

    server symbolic_name_5 impala-host-1.example.com:21050 check
    server symbolic_name_6 impala-host-2.example.com:21050 check
    server symbolic_name_7 impala-host-3.example.com:21050 check
    server symbolic_name_8 impala-host-4.example.com:21050 check
Important: Hue requires the check option at end of each line in the above file to ensure HAProxy can detect any unreachable Impalad server, and failover can be successful. Without the TCP check, you may hit an error when the impalad daemon to which Hue tries to connect is down.
Note: If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy, be cautious about reusing the connections. If the load balancer has set up connection timeout values, either check the connection frequently so that it never sits idle longer than the load balancer timeout value, or check the connection validity before using it and create a new one if the connection has been closed.