You can use Impala to query data files that reside on Apache Ozone distributed storage, rather than in HDFS. The combination of the Impala query engine and Apache Ozone storage is certified on Impala 4.2 or higher.
For more information on Ozone, see the Apache Ozone site.
The typical use case for Impala and Ozone together is to use Ozone as the default
filesystem, replacing HDFS entirely. In this configuration, when you create a database,
table, or partition, the data always resides on Ozone storage and you do not need to
specify any special LOCATION attribute. If you do specify a LOCATION attribute, its
value refers to a path within the Ozone filesystem. For example:
-- If the default filesystem is Ozone, all Impala data resides there
-- and all Impala databases and tables are located there.
CREATE TABLE t1 (x INT, s STRING);
-- You can specify LOCATION for database, table, or partition,
-- using values from the Ozone filesystem.
CREATE DATABASE d1 LOCATION '/some/path/on/ozone/server/d1.db';
CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
Impala can write to, delete, and rename data files and database, table, and partition
directories on Ozone storage. Therefore, Impala statements such as CREATE TABLE,
DROP TABLE, CREATE DATABASE, DROP DATABASE, ALTER TABLE, and INSERT work the same
with Ozone storage as with HDFS.
Ozone supports multiple protocols: ofs, o3fs, and s3a. Impala supports reading ofs
and o3fs. Impala can also read s3a (see Using Impala with Amazon S3 Object Store).
However, ofs is the newer Ozone protocol, and the only one Impala supports as a default
filesystem. We recommend using ofs for DDL statements to avoid access limitations, and
for DML statements and SELECT statements for better performance.
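As an illustration of the recommendation above, the following sketch uses an explicit ofs path in a LOCATION clause. The names ozone1 (an Ozone Manager host or service ID), vol1, bucket1, and the table t3 are hypothetical placeholders, not values from any real cluster; substitute the ones for your deployment.

```sql
-- Hypothetical names: ozone1 (Ozone Manager host or service ID), vol1, bucket1.
-- The path follows the ofs://host/volume/bucket/key layout.
CREATE EXTERNAL TABLE t3 (id BIGINT, name STRING)
  STORED AS PARQUET
  LOCATION 'ofs://ozone1/vol1/bucket1/warehouse/t3';

-- DML statements and queries then go through the same ofs path.
INSERT INTO t3 VALUES (1, 'one');
SELECT COUNT(*) FROM t3;
```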
Because Apache Ozone storage buckets use a global value for the block size rather than
a configurable value for each file, the PARQUET_FILE_SIZE query option has no effect
when Impala inserts data into a table or partition residing on Ozone storage.
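A minimal sketch of what this means in practice, assuming the default filesystem is Ozone and reusing the t1 table from the earlier example. The table name t1_parquet is a hypothetical placeholder.

```sql
-- Hypothetical table stored as Parquet on the Ozone default filesystem.
CREATE TABLE t1_parquet (x INT, s STRING) STORED AS PARQUET;

-- The option is accepted, but per the note above it does not change
-- how the inserted files are laid out on Ozone storage.
SET PARQUET_FILE_SIZE=128m;
INSERT INTO t1_parquet SELECT x, s FROM t1;
```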
Impala's spill-to-disk feature can be configured to use Ozone storage by specifying a
full URI (for example, ofs://host:port/volume/bucket/key) for the spill location. See
Managing Disk Space for Impala Data for details on configuring remote spill-to-disk.