Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications to Impala, due to variations in data types, built-in functions, vendor language extensions, and Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for best performance.
When adapting SQL code from a traditional database system to Impala, expect to find a number of differences in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to account for the Impala partitioning scheme and Hadoop file formats.
Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor extensions and features not yet supported in Impala, you might be able to run identical or almost-identical query text on both systems.
Therefore, consider separating the DDL into an Impala-specific setup script. Focus your reuse and ongoing tuning efforts on the code for SQL queries.
Change any VARCHAR, VARCHAR2, and CHAR columns to STRING. Remove any length constraints from the column declarations; for example, change VARCHAR(32) or CHAR(1) to STRING. Impala is very flexible about the length of string values; it does not impose any length constraints or do any special processing (such as blank-padding) for STRING columns.
(In Impala 2.0 and higher, there are data types VARCHAR and CHAR, with length constraints for both types and blank-padding for CHAR. However, for performance reasons, it is still preferable to use STRING columns where practical.)
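For example, a hypothetical Oracle-style table (the table and column names are illustrative, not from any particular application) might be adapted like this:
-- Original: CREATE TABLE customers (id NUMBER, name VARCHAR2(64), status CHAR(1));
CREATE TABLE customers (id BIGINT, name STRING, status STRING);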
For national language character types such as NCHAR, NVARCHAR, or NCLOB, be aware that while Impala can store and query UTF-8 character data, currently some string manipulation operations only work correctly with ASCII data. See STRING Data Type for details.
Change any DATE, DATETIME, or TIME columns to TIMESTAMP. Remove any precision constraints. Remove any timezone clauses, and make sure your application logic or ETL process accounts for the fact that Impala expects all TIMESTAMP values to be in Coordinated Universal Time (UTC). See TIMESTAMP Data Type for information about the TIMESTAMP data type, and Impala Date and Time Functions for conversion functions for different date and time formats.
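For example, a minimal sketch of normalizing local times to UTC during the copy step, assuming a hypothetical staging table and using the to_utc_timestamp() built-in function:
-- Convert local-time values to the UTC values Impala expects:
INSERT INTO events
  SELECT id, to_utc_timestamp(event_time, 'America/Los_Angeles') FROM events_staging;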
You might also need to adapt date- and time-related literal values and format strings to use the supported Impala date and time formats. If you have date and time literals with different separators or different numbers of YY, MM, and so on placeholders than Impala expects, consider using calls to regexp_replace() to transform those values to the Impala-compatible format. See TIMESTAMP Data Type for information about the allowed formats for date and time literals, and Impala String Functions for string conversion functions such as regexp_replace().
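For example, a sketch of one such transformation, assuming source literals in MM/DD/YYYY form (the pattern and values are illustrative; the backreference syntax follows the RE2 library that Impala uses):
SELECT regexp_replace('12/31/2024', '(\\d{2})/(\\d{2})/(\\d{4})', '\\3-\\1-\\2');
-- Returns '2024-12-31', a form that CAST(... AS TIMESTAMP) can parse.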
Instead of SYSDATE, call the function NOW().
Instead of adding or subtracting directly from a date value to produce a value N days in the past or future, use an INTERVAL expression, for example NOW() + INTERVAL 30 DAYS.
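For example, mapping an Oracle-style expression to Impala:
-- Oracle: SELECT SYSDATE + 30 FROM dual;
SELECT NOW() + INTERVAL 30 DAYS;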
Although Impala supports INTERVAL expressions for datetime arithmetic, as shown in TIMESTAMP Data Type, INTERVAL is not available as a column data type in Impala. For any INTERVAL values stored in tables, convert them to numeric values that you can add or subtract using the functions in Impala Date and Time Functions. For example, if you had a table DEADLINES with an INT column TIME_PERIOD, you could construct dates N days in the future like so:
SELECT NOW() + INTERVAL time_period DAYS from deadlines;
For YEAR columns, change to the smallest Impala integer type that has sufficient range. See Data Types for details about ranges, casting, and so on for the various numeric data types.
Change any DECIMAL and NUMBER types. If fixed-point precision is not required, you can use FLOAT or DOUBLE on the Impala side depending on the range of values. For applications that require precise decimal values, such as financial data, you might need to make more extensive changes to table structure and application logic, such as using separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs to manipulate them. See Data Types for details about ranges, casting, and so on for the various numeric data types.
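For example, one possible integer-based approach (the table and column names are hypothetical):
-- Store currency as whole cents to avoid floating-point rounding:
CREATE TABLE payments (id BIGINT, amount_cents BIGINT);
-- Reassemble dollars and cents for display:
SELECT concat(cast(amount_cents DIV 100 AS STRING), '.',
       lpad(cast(amount_cents % 100 AS STRING), 2, '0')) FROM payments;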
FLOAT, DOUBLE, and REAL types are supported in Impala. Remove any precision and scale specifications. (In Impala, REAL is just an alias for DOUBLE; columns declared as REAL are turned into DOUBLE behind the scenes.) See Data Types for details about ranges, casting, and so on for the various numeric data types.
Most integer types from other systems have equivalents in Impala, perhaps under different names such as BIGINT instead of INT8. For any that are unavailable, for example MEDIUMINT, switch to the smallest Impala integer type that has sufficient range. Remove any precision specifications. See Data Types for details about ranges, casting, and so on for the various numeric data types.
Remove any UNSIGNED constraints. All Impala numeric types are signed. See Data Types for details about ranges, casting, and so on for the various numeric data types.
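For example, a hypothetical MySQL-style table might be adapted like this, widening each type so the original unsigned ranges still fit:
-- Original: CREATE TABLE readings (sensor MEDIUMINT UNSIGNED, reading_year YEAR, value INT UNSIGNED);
CREATE TABLE readings (sensor INT, reading_year SMALLINT, value BIGINT);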
For any types holding bitwise values, use an integer type with enough range to hold all the relevant bits within a positive integer. See Data Types for details about ranges, casting, and so on for the various numeric data types.
For example, TINYINT has a maximum positive value of 127, not 255, so to manipulate 8-bit bitfields as positive numbers switch to the next largest type, SMALLINT.
[localhost:21000] > select cast(127*2 as tinyint);
+--------------------------+
| cast(127 * 2 as tinyint) |
+--------------------------+
| -2 |
+--------------------------+
[localhost:21000] > select cast(128 as tinyint);
+----------------------+
| cast(128 as tinyint) |
+----------------------+
| -128 |
+----------------------+
[localhost:21000] > select cast(127*2 as smallint);
+---------------------------+
| cast(127 * 2 as smallint) |
+---------------------------+
| 254 |
+---------------------------+
Impala does not support notation such as b'0101' for bit literals.
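As a sketch of the alternative, assuming the bit manipulation built-in functions available in Impala 2.3 and higher:
-- b'0101' becomes the integer 5; test whether the bit worth 4 is set:
SELECT bitand(5, 4) != 0 AS bit_is_set;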
For large object types, use STRING to represent CLOB or TEXT columns (character-based large objects) up to 32 KB in size. Binary large objects such as BLOB, RAW, BINARY, and VARBINARY do not currently have an equivalent in Impala.
For Boolean-like types such as BOOL, use the Impala BOOLEAN type.
Because Impala currently does not support composite or nested types, any spatial data types in other database systems do not have direct equivalents in Impala. You could represent spatial values in string format and write UDFs to process them. See User-Defined Functions (UDFs) for details. Where practical, separate spatial types into separate tables so that Impala can still work with the non-spatial data.
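For example, a minimal sketch, assuming points stored as 'latitude,longitude' strings in a hypothetical places table and using the split_part() function (Impala 2.3 and higher):
SELECT name,
       CAST(split_part(location, ',', 1) AS DOUBLE) AS latitude,
       CAST(split_part(location, ',', 2) AS DOUBLE) AS longitude
FROM places;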
Take out any DEFAULT clauses. Impala can use data files produced from many different sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of LOAD DATA and external tables mean that Impala is flexible about the format of data files, and Impala does not necessarily validate or cleanse data before querying it. When copying data through Impala INSERT statements, you can use conditional functions such as CASE or NVL to substitute some other value for NULL fields; see Impala Conditional Functions for details.
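For example (the table and column names are illustrative):
-- Substitute a default during the copy, since Impala has no DEFAULT clause:
INSERT INTO customers_clean
  SELECT id, NVL(status, 'unknown') FROM customers_raw;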
Take out any constraints from your CREATE TABLE and ALTER TABLE statements, for example PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, UNSIGNED, or CHECK constraints. Impala can use data files produced from many different sources, such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries to test for NULL values. When copying data through Impala INSERT statements, you can use conditional functions such as CASE or NVL to substitute some other value for NULL fields; see Impala Conditional Functions for details.
Do as much verification as practical before loading data into Impala. After data is loaded into Impala, you can do further verification using SQL queries to check if values have expected ranges, if values are NULL or not, and so on. If there is a problem with the data, you will need to re-run earlier stages of the ETL process, or do an INSERT ... SELECT statement in Impala to copy the faulty data to a new table and transform or filter out the bad values.
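For example, typical post-load checks might look like this (the tables and ranges are illustrative):
-- Count rows with missing or out-of-range values:
SELECT COUNT(*) FROM t1 WHERE c1 IS NULL;
SELECT COUNT(*) FROM t1 WHERE c2 NOT BETWEEN 0 AND 100;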
Take out any CREATE INDEX, DROP INDEX, and ALTER INDEX statements, and equivalent ALTER TABLE statements. Remove any INDEX, KEY, or PRIMARY KEY clauses from CREATE TABLE and ALTER TABLE statements. Impala is optimized for bulk read operations for data warehouse-style queries, and therefore does not support indexes for its tables.
Calls to built-in functions with out-of-range or otherwise incorrect arguments return NULL in Impala, as opposed to raising exceptions. (This rule applies even when the ABORT_ON_ERROR=true query option is in effect.) Run small-scale queries using representative data to double-check that calls to built-in functions are returning expected values rather than NULL. For example, unsupported CAST operations do not raise an error in Impala:
select cast('foo' as int);
+--------------------+
| cast('foo' as int) |
+--------------------+
| NULL |
+--------------------+
For any other types not supported in Impala, you can represent their values in string format and write UDFs to process them. See User-Defined Functions (UDFs) for details.
To detect the presence of unsupported or unconvertible data types in data files, do initial testing with the ABORT_ON_ERROR=true query option in effect. This option causes queries to fail immediately if they encounter disallowed type conversions. See ABORT_ON_ERROR Query Option for details. For example:
set abort_on_error=true;
select count(*) from (select * from t1);
-- The above query will fail if the data files for T1 contain any
-- values that can't be converted to the expected Impala data types.
-- For example, if T1.C1 is defined as INT but the column contains
-- floating-point values like 1.1, the query will return an error.
The following SQL statements or clauses are not currently supported in Impala, or are supported only with limitations:
Impala supports the DELETE statement only for Kudu tables. Impala is intended for data warehouse-style operations where you do bulk moves and transforms of large quantities of data. When not using Kudu tables, instead of DELETE, use INSERT OVERWRITE to entirely replace the contents of a table or partition, or use INSERT ... SELECT to copy a subset of data (everything but the rows you intended to delete) from one table to another. See DML Statements for an overview of Impala DML statements.
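For example, a sketch of the delete-by-copying idiom (the predicate and table names are illustrative):
-- Keep everything except the rows you would have deleted:
INSERT OVERWRITE TABLE t1_copy SELECT * FROM t1 WHERE status != 'obsolete';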
Impala supports the UPDATE statement only for Kudu tables. When not using Kudu tables, instead of UPDATE, do all necessary transformations early in the ETL process, such as in the job that generates the original data, or when copying from one table to another to convert to a particular file format or partitioning scheme. See DML Statements for an overview of Impala DML statements.
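For example, a transformation applied while copying stands in for an UPDATE (the names are illustrative):
-- Instead of UPDATE t1 SET name = upper(name):
INSERT INTO t1_normalized SELECT id, upper(name) FROM t1;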
Impala has no transactional statements, such as COMMIT or ROLLBACK. Impala effectively works like the AUTOCOMMIT mode in some database systems, where changes take effect as soon as they are made.
If your database, table, column, or other names conflict with Impala reserved words, use different names or quote the names with backticks.
See Impala Reserved Words for the current list of Impala reserved words.
Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or column alias.
For example, in SELECT * FROM t1 NATURAL JOIN t2, Impala does not recognize the NATURAL keyword and interprets it as an alias for the table t1. If you experience any unexpected behavior with queries, check the list of reserved words to make sure all keywords in join and WHERE clauses are supported keywords in Impala.
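For example, backtick quoting lets you keep a hypothetical table whose name collides with a reserved word:
SELECT * FROM `table`;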
Impala has some restrictions on subquery support. See Subqueries in Impala SELECT Statements for the current details.
Impala supports the UNION and UNION ALL set operators, but not INTERSECT. Prefer UNION ALL over UNION when you know the data sets are disjoint or duplicate values are not a problem; UNION ALL is more efficient because it avoids materializing and sorting the entire result set to eliminate duplicate values.
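For example, when the two source tables are known to be disjoint:
-- UNION would sort and de-duplicate; UNION ALL skips that work:
SELECT id FROM t1 UNION ALL SELECT id FROM t2;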
Impala requires query aliases for the subqueries used as inline views in the FROM clause. For example, the following query gives a syntax error unless the alias contents_of_t1 at the end is included:
SELECT COUNT(*) FROM (SELECT * FROM t1) contents_of_t1;
Aliases are not required for subqueries used in other parts of queries. For example:
SELECT * FROM functional.alltypes WHERE id = (SELECT MIN(id) FROM functional.alltypes);
When an alias is declared for an expression in a query, that alias cannot be referenced again within the same SELECT list. For example, the average alias cannot be referenced a second time in the SELECT list, as below. You will receive an error:
SELECT AVG(x) AS average, average+1 FROM t1 GROUP BY x;
An alias can be referenced again in the same query if it is outside the SELECT list. For example, the average alias can be referenced in the HAVING clause, as shown below:
SELECT AVG(x) AS average FROM t1 GROUP BY x HAVING average > 3;
Impala does not support NATURAL JOIN, and it does not support the USING clause in joins. See Joins in Impala SELECT Statements for details on the syntax for Impala join clauses.
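For example, a join written with USING (or as NATURAL JOIN) can be rewritten with an explicit ON clause:
-- Instead of: SELECT * FROM t1 JOIN t2 USING (id)
SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;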
Impala supports a limited choice of partitioning types. Partitions are defined based on each distinct combination of values for one or more partition key columns. Impala does not redistribute or check data to create evenly distributed partitions. You must choose partition key columns based on your knowledge of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning to use the Impala partition syntax for CREATE TABLE and ALTER TABLE statements.
Impala partitioning is similar to range partitioning where every range has exactly one value, or key partitioning where the hash function produces a separate bucket for every combination of key values. See Partitioning for Impala Tables for usage details, and CREATE TABLE Statement and ALTER TABLE Statement for syntax.
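For example, a sketch of the Impala partition syntax (the column choices are illustrative):
CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (sale_year SMALLINT, sale_month TINYINT);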
To distribute work for a query across a cluster, you need at least one HDFS block per node. HDFS blocks are typically multiple megabytes, especially for Parquet files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much parallelism in the query because such a small amount of data is typically processed by a single node.
For the "top-N" queries, Impala uses the
LIMIT
clause rather than comparing against a
pseudo column named ROWNUM
or
ROW_NUM
.
See LIMIT Clause for details.
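For example, an Oracle-style ROWNUM filter becomes ORDER BY ... LIMIT (the table and columns are illustrative):
-- Instead of: SELECT * FROM (SELECT ... ORDER BY total DESC) WHERE ROWNUM <= 10
SELECT customer, total FROM sales ORDER BY total DESC LIMIT 10;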
Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, examine and possibly update the following where appropriate:
A CREATE TABLE statement with no STORED AS clause creates data files in plain text format, which is convenient for data interchange but not a good choice for high-volume data with high-performance queries. See How Impala Works with Hadoop File Formats for why and how to use specific file formats for compact data and high-performance queries. Especially see Using the Parquet File Format with Impala Tables for details about the file format most heavily optimized for large-scale data warehouse queries.
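For example, a CREATE TABLE AS SELECT statement can convert existing text data to Parquet in one step (the table names are illustrative):
CREATE TABLE sales_parquet STORED AS PARQUET AS SELECT * FROM sales_text;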
Adapting tables that were already partitioned in a different database system could produce an Impala table with a high number of partitions and not enough data in each one, leading to underutilization of Impala's parallel query features.
See Partitioning for Impala Tables for details about setting up partitioning and tuning the performance of queries on partitioned tables.
The INSERT ... VALUES syntax is suitable for setting up toy tables with a few rows for functional testing when used with HDFS. Each such statement creates a separate tiny file in HDFS, and it is not a scalable technique for loading megabytes or gigabytes (let alone petabytes) of data.
Consider revising your data load process to produce raw data files outside of Impala, then setting up Impala external tables or using the LOAD DATA statement to use those data files instantly in Impala tables, with no conversion or indexing stage. See External Tables and LOAD DATA Statement for details about the Impala techniques for working with data files produced outside of Impala; see Data Loading and Querying Examples for examples of ETL workflow for Impala.
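For example, a minimal sketch, assuming the data files already reside in HDFS (the paths are illustrative):
-- Point an external table at existing data files:
CREATE EXTERNAL TABLE logs (line STRING) LOCATION '/user/etl/logs';
-- Or move staged files into an existing table's directory:
LOAD DATA INPATH '/user/etl/staging/batch1' INTO TABLE logs;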
INSERT works fine for Kudu tables, even though it is not particularly fast.
If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data files, or a single giant data file that cannot take advantage of distributed parallel queries or partitioning. In this case, use an INSERT ... SELECT statement to copy the data into a new table and reorganize it into a more efficient layout in the same operation. See INSERT Statement for details about the INSERT statement.
You can do INSERT ... SELECT into a table with a more efficient file format (see How Impala Works with Hadoop File Formats) or from an unpartitioned table into a partitioned one. See Partitioning for Impala Tables.
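For example, a dynamic partition insert that copies from an unpartitioned staging table (the names are illustrative; the partition key column goes last in the SELECT list):
INSERT INTO sales_partitioned PARTITION (sale_year)
  SELECT id, amount, sale_year FROM sales_raw;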
Complex queries may have high codegen time. If queries fail for this reason, set the query option DISABLE_CODEGEN=true as a workaround. See DISABLE_CODEGEN Query Option for details.
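For example, in impala-shell:
SET DISABLE_CODEGEN=true;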
If practical, rewrite UNION queries to use the UNION ALL operator instead. Prefer UNION ALL over UNION when you know the data sets are disjoint or duplicate values are not a problem; UNION ALL is more efficient because it avoids materializing and sorting the entire result set to eliminate duplicate values.
Some of the decisions you make during the porting process can have an impact on performance. After your SQL code is ported and working correctly, examine the performance-related aspects of your schema design, physical layout, and queries to make sure that the ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and integration with Hadoop components. The following are a few of the areas you should examine:
Run COMPUTE STATS on all tables. Partition your data on the columns that are frequently used for filtering in WHERE clauses. See Tuning Impala for Performance for details about the performance tuning process.