Creates a user-defined function (UDF), which you can use to implement
custom logic during SELECT
or INSERT
operations.
Syntax:
The syntax is different depending on whether you create a scalar UDF, which is called once for each row and implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by multiple functions that compute intermediate results across sets of rows.
In Impala 2.5 and higher, the syntax is also different for creating or dropping scalar Java-based UDFs. The statements for Java UDFs use a new syntax, without any argument types or return type specified. Java-based UDFs created using the new syntax persist across restarts of the Impala catalog server, and can be shared transparently between Impala and Hive.
To create a persistent scalar C++ UDF with CREATE FUNCTION
:
CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...])
RETURNS return_type
LOCATION 'hdfs_path_to_dot_so'
SYMBOL='symbol_name'
CREATE FUNCTION
:
CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name
LOCATION 'hdfs_path_to_jar'
SYMBOL='class_name'
To create a persistent UDA, which must be written in C++, issue a CREATE AGGREGATE FUNCTION
statement:
CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...])
RETURNS return_type
[INTERMEDIATE type_spec]
LOCATION 'hdfs_path'
[INIT_FN='function]
UPDATE_FN='function
MERGE_FN='function
[PREPARE_FN='function]
[CLOSEFN='function]
[SERIALIZE_FN='function]
[FINALIZE_FN='function]
Statement type: DDL
Varargs notation:
Variable-length argument lists are supported for C++ UDFs, but currently not for Java UDFs.
If the underlying implementation of your function accepts a variable number of arguments:
CREATE FUNCTION
statement
by including ...
immediately after the type name of the first variable argument. For
example, to create a function that accepts an INT
argument, followed by a
BOOLEAN
, followed by one or more STRING
arguments, your CREATE
FUNCTION
statement would look like:
CREATE FUNCTION func_name (INT, BOOLEAN, STRING ...)
RETURNS type LOCATION 'path' SYMBOL='entry_point';
See Variable-Length Argument Lists for how to code a C++ UDF to accept variable-length argument lists.
Scalar and aggregate functions:
The simplest kind of user-defined function returns a single scalar value each time it is called, typically
once for each row in the result set. This general kind of function is what is usually meant by UDF.
User-defined aggregate functions (UDAs) are a specialized kind of UDF that produce a single value based on
the contents of multiple rows. You usually use UDAs in combination with a GROUP BY
clause to
condense a large result set into a smaller one, or even a single row summarizing column values across an
entire table.
You create UDAs by using the CREATE AGGREGATE FUNCTION
syntax. The clauses
INIT_FN
, UPDATE_FN
, MERGE_FN
,
SERIALIZE_FN
, FINALIZE_FN
, and
INTERMEDIATE
only apply when you create a UDA rather than a scalar UDF.
The *_FN
clauses specify functions to call at different phases of function processing.
INIT_FN
clause does any initial
setup, such as initializing member variables in internal data structures. This function is often a stub for
simple UDAs. You can omit this clause and a default (no-op) function will be used.
UPDATE_FN
clause is called once for each
row in the original result set, that is, before any GROUP BY
clause is applied. A separate
instance of the function is called for each different value returned by the GROUP BY
clause. The final argument passed to this function is a pointer, to which you write an updated value based
on its original value and the value of the first argument.
MERGE_FN
clause is called an arbitrary
number of times, to combine intermediate values produced by different nodes or different threads as Impala
reads and processes data files in parallel. The final argument passed to this function is a pointer, to
which you write an updated value based on its original value and the value of the first argument.
SERIALIZE_FN
clause frees memory
allocated to intermediate results. It is required if any memory was allocated by the Allocate function in
the Init, Update, or Merge functions, or if the intermediate type contains any pointers. See
the UDA code samples for details.
FINALIZE_FN
clause does any required
teardown for resources acquired by your UDF, such as freeing memory, closing file handles if you explicitly
opened any files, and so on. This function is often a stub for simple UDAs. You can omit this clause and a
default (no-op) function will be used. It is required in UDAs where the final return type is different than
the intermediate type. or if any memory was allocated by the Allocate function in the Init, Update, or
Merge functions. See the UDA code samples for details.
If you use a consistent naming convention for each of the underlying functions, Impala can automatically determine the names based on the first such clause, so the others are optional.
For end-to-end examples of UDAs, see User-Defined Functions (UDFs).
Complex type considerations:
Currently, Impala UDFs cannot accept arguments or return values of the Impala complex
types (STRUCT
, ARRAY
, or MAP
).
Usage notes:
CREATE FUNCTION
statement requires:CREATE
privilege on the database.ALL
privilege on URI where URI is the value
you specified for the LOCATION
in the
CREATE FUNCTION
statement. .so
or .jar
file, which you store
in HDFS and the CREATE FUNCTION
statement distributes to each Impala node.
struct
. This
struct
contains 2 fields. The first field is a boolean
representing
whether the value is NULL
or not. (When this field is true
, the return
value is interpreted as NULL
.) The second field is the same type as the specified function
return type, and holds the return value when the function returns something other than
NULL
.
struct
s, corresponding to each of the arguments.
Each struct
has the same 2 fields as with the return value, a boolean
field representing whether the argument is NULL
, and a field of the appropriate type
holding any non-NULL
argument value.
ALTER FUNCTION
statement, if you need to rename
a function, move it to a different database, or change its signature or other properties, issue a
DROP FUNCTION
statement for the original function followed by a CREATE
FUNCTION
with the desired properties.
USE
statement
before doing any CREATE FUNCTION
statements, or specify the name of the function as
db_name.function_name
.
If you connect to different Impala nodes within an impala-shell
session for load-balancing purposes, you can enable the SYNC_DDL
query
option to make each DDL statement wait before returning, until the new or changed
metadata has been received by all the Impala nodes. See
SYNC_DDL Query Option for details.
Compatibility:
Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types (not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but not Impala UDFs written in C++.
The Hive current_user()
function cannot be called from a Java UDF
through Impala.
Persistence:
In Impala 2.5 and higher, Impala UDFs and UDAs written in C++ are
persisted in the metastore database. Java UDFs are also persisted, if they were created
with the new CREATE FUNCTION
syntax for Java UDFs, where the Java
function argument and return types are omitted. Java-based UDFs created with the old
CREATE FUNCTION
syntax do not persist across restarts because they are
held in the memory of the catalogd daemon. Until you re-create such
Java UDFs using the new CREATE FUNCTION
syntax, you must reload those
Java-based UDFs by running the original CREATE FUNCTION
statements
again each time you restart the catalogd daemon. Prior to
Impala 2.5 the requirement to reload functions after a restart
applied to both C++ and Java functions.
Cancellation: Cannot be cancelled.
HDFS permissions: This statement does not touch any HDFS files or directories, therefore no HDFS permissions are required.
Examples:
For additional examples of all kinds of user-defined functions, see User-Defined Functions (UDFs).
The following example shows how to take a Java jar file and make all the functions inside one of its classes
into UDFs under a single (overloaded) function name in Impala. Each CREATE FUNCTION
or
DROP FUNCTION
statement applies to all the overloaded Java functions with the same name.
This example uses the signatureless syntax for CREATE FUNCTION
and DROP FUNCTION
,
which is available in Impala 2.5 and higher.
At the start, the jar file is in the local filesystem. Then it is copied into HDFS, so that it is
available for Impala to reference through the CREATE FUNCTION
statement and
queries that refer to the Impala function name.
$ jar -tvf udf-examples.jar
0 Mon Feb 22 04:06:50 PST 2016 META-INF/
122 Mon Feb 22 04:06:48 PST 2016 META-INF/MANIFEST.MF
0 Mon Feb 22 04:06:46 PST 2016 org/
0 Mon Feb 22 04:06:46 PST 2016 org/apache/
0 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/
2460 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/IncompatibleUdfTest.class
541 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdfException.class
3438 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/JavaUdfTest.class
5872 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdf.class
...
$ hdfs dfs -put udf-examples.jar /user/impala/udfs
$ hdfs dfs -ls /user/impala/udfs
Found 2 items
-rw-r--r-- 3 jrussell supergroup 853 2015-10-09 14:05 /user/impala/udfs/hello_world.jar
-rw-r--r-- 3 jrussell supergroup 7366 2016-06-08 14:25 /user/impala/udfs/udf-examples.jar
In impala-shell, the CREATE FUNCTION
refers to the HDFS path of the jar file
and the fully qualified class name inside the jar. Each of the functions inside the class becomes an
Impala function, each one overloaded under the specified Impala function name.
[localhost:21000] > create function testudf location '/user/impala/udfs/udf-examples.jar' symbol='org.apache.impala.TestUdf';
[localhost:21000] > show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN, BOOLEAN) | JAVA | true |
| DOUBLE | testudf(DOUBLE) | JAVA | true |
| DOUBLE | testudf(DOUBLE, DOUBLE) | JAVA | true |
| DOUBLE | testudf(DOUBLE, DOUBLE, DOUBLE) | JAVA | true |
| FLOAT | testudf(FLOAT) | JAVA | true |
| FLOAT | testudf(FLOAT, FLOAT) | JAVA | true |
| FLOAT | testudf(FLOAT, FLOAT, FLOAT) | JAVA | true |
| INT | testudf(INT) | JAVA | true |
| DOUBLE | testudf(INT, DOUBLE) | JAVA | true |
| INT | testudf(INT, INT) | JAVA | true |
| INT | testudf(INT, INT, INT) | JAVA | true |
| SMALLINT | testudf(SMALLINT) | JAVA | true |
| SMALLINT | testudf(SMALLINT, SMALLINT) | JAVA | true |
| SMALLINT | testudf(SMALLINT, SMALLINT, SMALLINT) | JAVA | true |
| STRING | testudf(STRING) | JAVA | true |
| STRING | testudf(STRING, STRING) | JAVA | true |
| STRING | testudf(STRING, STRING, STRING) | JAVA | true |
| TINYINT | testudf(TINYINT) | JAVA | true |
+-------------+---------------------------------------+-------------+---------------+
These are all simple functions that return their single arguments, or sum, concatenate, and so on their multiple arguments. Impala determines which overloaded function to use based on the number and types of the arguments.
insert into bigint_x values (1), (2), (4), (3);
select testudf(x) from bigint_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| 1 |
| 2 |
| 4 |
| 3 |
+-----------------+
insert into int_x values (1), (2), (4), (3);
select testudf(x, x+1, x*x) from int_x;
+-------------------------------+
| udfs.testudf(x, x + 1, x * x) |
+-------------------------------+
| 4 |
| 9 |
| 25 |
| 16 |
+-------------------------------+
select testudf(x) from string_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| one |
| two |
| four |
| three |
+-----------------+
select testudf(x,x) from string_x;
+--------------------+
| udfs.testudf(x, x) |
+--------------------+
| oneone |
| twotwo |
| fourfour |
| threethree |
+--------------------+
The previous example used the same Impala function name as the name of the class.
This example shows how the Impala function name is independent of the underlying
Java class or function names. A second CREATE FUNCTION
statement
results in a set of overloaded functions all named my_func
,
to go along with the overloaded functions all named testudf
.
create function my_func location '/user/impala/udfs/udf-examples.jar'
symbol='org.apache.impala.TestUdf';
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | my_func(BIGINT) | JAVA | true |
| BOOLEAN | my_func(BOOLEAN) | JAVA | true |
| BOOLEAN | my_func(BOOLEAN, BOOLEAN) | JAVA | true |
...
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
...
The corresponding DROP FUNCTION
statement with no signature
drops all the overloaded functions with that name.
drop function my_func;
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT | testudf(BIGINT) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN) | JAVA | true |
| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true |
...
The signatureless CREATE FUNCTION
syntax for Java UDFs ensures that
the functions shown in this example remain available after the Impala service
(specifically, the Catalog Server) are restarted.
Related information:
User-Defined Functions (UDFs) for more background information, usage instructions, and examples for Impala UDFs; DROP FUNCTION Statement