Impala supports using RCFile data files.
| File Type | Format | Compression Codecs | Impala Can CREATE? | Impala Can INSERT? |
|---|---|---|---|---|
| RCFile | Structured | Snappy, gzip, deflate, bzip2 | Yes. | No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala. |
If you do not have an existing data file to use, begin by creating one in the appropriate format.
To create an RCFile table, issue a command similar to the following in the impala-shell interpreter:
create table rcfile_table (column_specs) stored as rcfile;
Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain file formats, you might use the Hive shell to load the data. See How Impala Works with Hadoop File Formats for details. After loading data into a table through Hive or another mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, so that Impala recognizes the new data.
For example, here is how you might create some RCFile tables in Impala (by specifying the columns explicitly, or cloning the structure of another table), load data through Hive, and query them through Impala:
$ impala-shell -i localhost
[localhost:21000] > create table rcfile_table (x int) stored as rcfile;
[localhost:21000] > create table rcfile_clone like some_other_table stored as rcfile;
[localhost:21000] > quit;
$ hive
hive> insert into table rcfile_table select x from some_other_table;
3 Rows loaded to rcfile_table
Time taken: 19.015 seconds
hive> quit;
$ impala-shell -i localhost
[localhost:21000] > select * from rcfile_table;
Returned 0 row(s) in 0.23s
[localhost:21000] > -- Make Impala recognize the data loaded through Hive;
[localhost:21000] > refresh rcfile_table;
[localhost:21000] > select * from rcfile_table;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.23s
Complex type considerations: Although you can create tables in this file format using the complex types (ARRAY, STRUCT, and MAP) available in Impala 2.3 and higher, currently, Impala can query these types only in Parquet tables. The one exception to the preceding rule is COUNT(*) queries on RCFile tables that include complex types. Such queries are allowed in Impala 2.6 and higher.
You may want to enable compression on existing tables. Enabling compression provides performance gains in most cases and is supported for RCFile tables. For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell:
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> INSERT OVERWRITE TABLE new_table SELECT * FROM old_table;
If you are converting partitioned tables, you must complete additional steps. In such a case, specify additional settings similar to the following:
hive> CREATE TABLE new_table (your_cols) PARTITIONED BY (partition_cols) STORED AS new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE new_table PARTITION(comma_separated_partition_cols) SELECT * FROM old_table;
Remember that Hive does not require you to specify the source format. Consider the case of converting a table with two partition columns called year and month to a Snappy-compressed RCFile table. Combining the components outlined previously to complete this table conversion, you would specify settings similar to the following:
hive> CREATE TABLE tbl_rc (int_col INT, string_col STRING) STORED AS RCFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_rc SELECT * FROM tbl;
To complete a similar process for a table that includes partitions, you would specify settings similar to the following:
hive> CREATE TABLE tbl_rc (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS RCFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_rc PARTITION(year) SELECT * FROM tbl;
The compression type is specified in the following command:
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
          You could elect to specify alternative codecs such as GzipCodec here.
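As a rough illustration of swapping codecs, the following Python sketch generates the Hive SET statements used in the examples above for a chosen codec. The helper name `hive_compression_settings` and the short codec keys are hypothetical and not part of Impala or Hive; the class names are the standard Hadoop codec classes, with deflate assumed to map to `DefaultCodec`.

```python
from typing import List

# Short names (hypothetical keys) mapped to standard Hadoop codec classes.
# Assumption: "deflate" corresponds to Hadoop's DefaultCodec.
HADOOP_CODECS = {
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
}


def hive_compression_settings(codec: str) -> List[str]:
    """Return the Hive SET statements from the examples above for one codec."""
    try:
        codec_class = HADOOP_CODECS[codec.lower()]
    except KeyError:
        raise ValueError(f"unsupported codec: {codec!r}") from None
    return [
        "SET hive.exec.compress.output=true;",
        "SET mapred.max.split.size=256000000;",
        "SET mapred.output.compression.type=BLOCK;",
        f"SET mapred.output.compression.codec={codec_class};",
    ]


if __name__ == "__main__":
    # Print the statements to paste into the Hive shell before the INSERT.
    for statement in hive_compression_settings("gzip"):
        print(statement)
```

Only the last statement changes between codecs; the other settings are the same ones shown in the Snappy examples.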
        
In general, expect query performance with RCFile tables to be faster than with tables using text data, but slower than with Parquet tables. See Using the Parquet File Format with Impala Tables for information about using the Parquet file format for high-performance analytic queries.
In Impala 2.6 and higher, Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the file formats Parquet, ORC, RCFile, SequenceFile, Avro, and uncompressed text, the setting fs.s3a.block.size in the core-site.xml configuration file determines how Impala divides the I/O work of reading the data files. This configuration setting is specified in bytes. By default, this value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access Parquet files written by MapReduce or Hive, increase fs.s3a.block.size to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve Parquet files written by Impala, increase fs.s3a.block.size to 268435456 (256 MB) to match the row group size produced by Impala.
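To make the block-size arithmetic concrete, here is a minimal Python sketch that estimates how many block-sized ranges a single S3 file is treated as for a given fs.s3a.block.size. It is a simplified model that only applies the division described above; the function name `s3_read_ranges` is hypothetical, and the sketch does not reflect how Impala actually schedules reads.

```python
# Default fs.s3a.block.size stated in the text: 33554432 bytes (32 MB).
DEFAULT_S3A_BLOCK_SIZE = 33554432


def s3_read_ranges(file_size_bytes: int,
                   block_size_bytes: int = DEFAULT_S3A_BLOCK_SIZE) -> int:
    """Estimate how many block-sized ranges a file is divided into.

    Simplified model: ceiling of file size divided by block size.
    """
    if file_size_bytes < 0:
        raise ValueError("file_size_bytes must be non-negative")
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    for block_size in (33554432, 134217728, 268435456):
        ranges = s3_read_ranges(one_gib, block_size)
        print(f"block size {block_size}: {ranges} ranges")
```

Under this model, a 1 GiB file is treated as 32 ranges at the 32 MB default, 8 ranges at 128 MB, and 4 ranges at 256 MB.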