Developing Impala Applications
The core development language with Impala is SQL. You can also use Java or other languages to interact with Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For specialized kinds of analysis, you can supplement the SQL built-in functions by writing user-defined functions (UDFs) in C++ or Java.
Overview of the Impala SQL Dialect
The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL). As such, it is familiar to users who are already familiar with running SQL queries on the Hadoop infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in functions. Impala also includes additional built-in functions for common industry features, to simplify porting SQL from non-Hadoop systems.
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect might seem familiar:
-
The SELECT statement includes familiar clauses such as
WHERE,GROUP BY,ORDER BY, andWITH. You will find familiar notions such as joins, built-in functions for processing strings, numbers, and dates, aggregate functions, subqueries, and comparison operators such asIN()andBETWEEN. TheSELECTstatement is the place where SQL standards compliance is most important. -
From the data warehousing world, you will recognize the notion of partitioned tables. One or more columns serve as partition keys, and the data is physically arranged so that queries that refer to the partition key columns in the
WHEREclause can skip partitions that do not match the filter conditions. For example, if you have 10 years worth of data and use a clause such asWHERE year = 2015,WHERE year > 2010, orWHERE year IN (2014, 2015), Impala skips all the data for non-matching years, greatly reducing the amount of I/O for the query. -
In Impala 1.2 and higher, UDFs let you perform custom comparisons and transformation logic during
SELECTandINSERT...SELECTstatements.
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect might require some learning and practice for you to become proficient in the Hadoop environment:
-
Impala SQL is focused on queries and includes relatively little DML. There is no
UPDATEorDELETEstatement. Stale data is typically discarded (byDROP TABLEorALTER TABLE ... DROP PARTITIONstatements) or replaced (byINSERT OVERWRITEstatements). -
All data creation is done by
INSERTstatements, which typically insert data in bulk by querying from other tables. There are two variations,INSERT INTOwhich appends to the existing data, andINSERT OVERWRITEwhich replaces the entire contents of a table or partition (similar toTRUNCATE TABLEfollowed by a newINSERT). Although there is anINSERT ... VALUESsyntax to create a small number of values in a single statement, it is far more efficient to use theINSERT ... SELECTto copy and transform large amounts of data from one table to another in a single operation. -
You often construct Impala table definitions and data files in some other environment, and then attach Impala so that it can run real-time queries. The same data files and table metadata are shared with other components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data inserted by Hive, and Hive can access tables and data produced by Impala. Many other Hadoop components can write files in formats such as Parquet and Avro, that can then be queried by Impala.
-
Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL includes some idioms that you might find in the import utilities for traditional database systems. For example, you can create a table that reads comma-separated or tab-separated text files, specifying the separator in the
CREATE TABLEstatement. You can create external tables that read existing data files but do not move or transform them. -
Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does not require length constraints on string data types. For example, you can define a database column as
STRINGwith unlimited length, rather thanCHAR(1)orVARCHAR(64). (Although in Impala 2.0 and later, you can also use length-constrainedCHARandVARCHARtypes.)
Related information: Impala SQL Language Reference, especially Impala SQL Statements and Impala Built-In Functions
Overview of Impala Programming Interfaces
You can connect and submit requests to the Impala daemons through:
-
The
impala-shellinteractive command interpreter. - The Hue web-based user interface.
- JDBC.
- ODBC.
With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications running on non-Linux platforms. You can also use Impala on combination with various Business Intelligence tools that use the JDBC and ODBC interfaces.
Each impalad daemon process, running on separate nodes in a cluster, listens to
several ports for incoming requests. Requests from
impala-shell and Hue are routed to the impalad daemons through the same
port. The impalad daemons listen on separate ports for JDBC and ODBC requests.