Impala 2.5 performance overview

Impala has been a high-performance analytic query engine from the start. Even its initial production release in 2013 demonstrated performance 2x faster than a traditional DBMS, and each subsequent release has continued to demonstrate the wide performance gap between Impala’s analytic-database architecture and SQL-on-Apache Hadoop alternatives. Impala 2.5 continues that track record with some important performance gains (with more to come on the roadmap), summarized below.

Overall, compared to Impala 2.3, in Impala 2.5:

  • TPC-DS queries run on average 4.3x faster.
  • TPC-H queries run 2.2x faster on flat tables, and 1.71x faster on nested tables.
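
As a reading aid, the short Python sketch below converts the speedup factors quoted above into the fraction of the Impala 2.3 runtime that remains. It assumes "Nx faster" means the old runtime divided by the new runtime equals N; the function name and the dictionary are illustrative and are not taken from the benchmark itself.

```python
def relative_runtime(speedup: float) -> float:
    """Fraction of the baseline runtime left after a given speedup.

    Assumes speedup = old_runtime / new_runtime.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Speedup factors quoted in the summary (Impala 2.5 vs. Impala 2.3).
SPEEDUPS = {
    "TPC-DS": 4.3,
    "TPC-H (flat tables)": 2.2,
    "TPC-H (nested tables)": 1.71,
}

if __name__ == "__main__":
    for name, factor in SPEEDUPS.items():
        remaining = relative_runtime(factor)
        print(f"{name}: {remaining:.1%} of the 2.3 runtime "
              f"({1 - remaining:.1%} less time)")
```

Under that assumption, a 4.3x speedup means a query takes a little under a quarter of the time it took before.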

Nested Types in Impala

This document discusses nested data types in Impala, including structs, maps, and arrays. It provides an example schema using these types, describes Impala's SQL syntax extensions for querying nested data, and discusses techniques for advanced querying capabilities like correlated subqueries. The execution model materializes minimal nested structures in memory and uses new execution nodes to handle nested data types.
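
To make the shape of nested data concrete, here is a minimal Python sketch. It is not Impala code and does not reproduce the talk's schema or syntax. It models a hypothetical customers table whose rows hold an array of structs (`orders`) and a map (`attrs`), then flattens the array the way a join between a parent row and its nested collection would. All names, including `unnest_orders` and `customers_with_min_orders`, are assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Tuple, TypedDict


class Order(TypedDict):
    # A struct: a fixed set of named, typed fields.
    order_id: int
    price: float


class Customer(TypedDict):
    name: str
    orders: List[Order]    # array of structs
    attrs: Dict[str, str]  # map from string keys to string values


def unnest_orders(customers: List[Customer]) -> Iterator[Tuple[str, int, float]]:
    """Yield one flat (name, order_id, price) row per nested order.

    A customer with no orders produces no rows, like an inner join
    between the parent row and its nested collection.
    """
    for customer in customers:
        for order in customer["orders"]:
            yield (customer["name"], order["order_id"], order["price"])


def customers_with_min_orders(customers: List[Customer], n: int) -> List[str]:
    """Names of customers holding at least n orders.

    The condition is evaluated per parent row over that row's own
    nested collection, which is the kind of question a correlated
    subquery expresses.
    """
    return [c["name"] for c in customers if len(c["orders"]) >= n]


# Hypothetical sample data.
CUSTOMERS: List[Customer] = [
    {
        "name": "alice",
        "orders": [
            {"order_id": 1, "price": 10.0},
            {"order_id": 2, "price": 5.5},
        ],
        "attrs": {"tier": "gold"},
    },
    {"name": "bob", "orders": [], "attrs": {}},
]

if __name__ == "__main__":
    for row in unnest_orders(CUSTOMERS):
        print(row)
    print(customers_with_min_orders(CUSTOMERS, 1))
```

The sketch only touches the nested values it needs for each row, which loosely mirrors the idea of materializing minimal nested structures in memory; the actual execution nodes are described in the presentation.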

Presented at the Impala Meetup, PA, March 24th, 2015


Impala: A Modern, Open-Source SQL Engine for Hadoop

Presented at The Conference on Innovative Data Systems Research (CIDR) 2015.

ABSTRACT

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and briefly demonstrates its superior performance compared against other popular SQL-on-Hadoop systems.
