Apache Kudu is Open Source software. A Kudu cluster stores tables that look just like tables you’re used to from relational (SQL) databases. A table can be as simple as a binary
value, or as complex as a few hundred different strongly-typed attributes.
Just like SQL, every table has a PRIMARY KEY made up of one or more columns. This might be a single column like a unique user identifier, or a compound key such as a (host, metric, timestamp) tuple for a machine time series database. Rows can be efficiently read, updated, or deleted by their primary key.
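The point-lookup behavior described above can be sketched with a toy model (this is plain Python, not the Kudu API): a table keyed by a (host, metric, timestamp) compound primary key, where reads, updates, and deletes address a row directly by its key.

```python
# Toy in-memory model of a table with a compound primary key.
# Not the Kudu client API -- just an illustration of key-addressed rows.
table = {}

def upsert(host, metric, timestamp, value):
    """Insert or update the row identified by the compound key."""
    table[(host, metric, timestamp)] = value

def read(host, metric, timestamp):
    """Point read by primary key; returns None if the row is absent."""
    return table.get((host, metric, timestamp))

def delete(host, metric, timestamp):
    """Delete the row identified by the compound key, if present."""
    table.pop((host, metric, timestamp), None)

upsert("web01", "cpu_load", 1546300800, 0.75)
print(read("web01", "cpu_load", 1546300800))   # 0.75
delete("web01", "cpu_load", 1546300800)
print(read("web01", "cpu_load", 1546300800))   # None
```

In Kudu itself the primary key also determines the sort order and partitioning of the data, which is what makes these key-addressed operations efficient.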
Kudu’s simple data model makes it a breeze to port legacy applications or build new ones: no need to worry about how to encode your data into binary blobs or make sense of a huge database full of hard-to-interpret JSON. Tables are self-describing, so you can use standard tools like SQL engines or Spark to analyze your data.
Apache Kudu 1.9.0 released
- Support for Java 7 has been deprecated since Kudu 1.5.0 and may be removed in the next major release.
- Kudu now supports location awareness. When configured, Kudu will make a best effort to avoid placing a majority of replicas for a given tablet at the same location. The kudu cluster rebalance tool has been updated to act in accordance with the placement policy of a location-aware Kudu deployment. The administrative documentation has been updated to detail the usage of this feature.
- Docker scripts have been introduced to build and run Kudu on various operating systems. See the /docker subdirectory of the source repository for more details. An official repository has been created for Apache Kudu Docker artifacts.
- Developers integrating with Kudu can now write Java tests that start a Kudu mini cluster without having to first locally build and install Kudu. This is made possible by platform-specific binary artifacts that the Kudu team publishes for Gradle or Maven to download and install at test time. More information on this feature can be found here. This binary test artifact is currently considered experimental.
- When creating a table, the master now enforces a restriction on the total number of replicas rather than the total number of partitions. If manually overriding
--max_create_tablets_per_ts, the maximum size of a new table has effectively been cut by a factor of its replication factor. Note that partitions can still be added after table creation.
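The effect of limiting replicas instead of partitions can be shown with a small worked example (the numbers below are hypothetical, chosen only to illustrate the arithmetic):

```python
# Hypothetical cluster sizing to illustrate the change in what the
# master limits at table-creation time. These numbers are examples,
# not Kudu defaults.
max_create_tablets_per_ts = 60   # per-tablet-server cap (manually overridden)
num_tablet_servers = 10
replication_factor = 3

cluster_cap = max_create_tablets_per_ts * num_tablet_servers  # 600

# Before this release: the cap bounded the number of partitions (tablets).
max_partitions_before = cluster_cap

# After this release: the cap bounds the total number of replicas, so the
# partition budget for a new table shrinks by the replication factor.
max_partitions_after = cluster_cap // replication_factor

print(max_partitions_before)  # 600
print(max_partitions_after)   # 200
```

As the release note says, this only constrains the table at creation time; more partitions can still be added afterwards.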
- The compaction policy has been updated to favor reducing the number of rowsets. This can lead to faster scans and lower bootup times, particularly in the face of a “trickling inserts” workload, where rows are inserted slowly in primary key order (see KUDU-1400).
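The idea of compaction reducing the rowset count can be sketched as follows (a toy merge, not Kudu's actual compaction policy): many small sorted rowsets left behind by trickling inserts are merged into one, so a scan in primary-key order consults fewer rowsets.

```python
import heapq

# Toy sketch (not Kudu's implementation): compaction merges several
# small sorted rowsets into a single sorted rowset.
def compact(rowsets):
    """Merge sorted rowsets (here, lists of keys) into one sorted rowset."""
    return list(heapq.merge(*rowsets))

# A "trickling inserts" workload in key order tends to leave many tiny
# rowsets behind, one per flush:
rowsets = [[1, 2], [3, 4], [5, 6], [7, 8]]
merged = compact(rowsets)
print(f"{len(rowsets)} rowsets -> 1 rowset")
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8]
```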
- A tablet-level metric average_diskrowset_height has been added to indicate how much a replica needs to be compacted, as indicated by the average number of rowsets per unit of keyspace.
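The intuition behind this metric can be sketched with a rough calculation (this is an illustration of the idea, not Kudu's exact implementation): model each rowset as a key-range interval, and divide the total width of all rowsets by the width of the keyspace they cover, giving the average number of rowsets overlapping each point of the keyspace.

```python
def average_rowset_height(rowsets):
    """Average number of rowsets overlapping each point of the keyspace.

    A rough sketch of the metric's intuition, not Kudu's implementation.
    `rowsets` is a list of (min_key, max_key) intervals.
    """
    lo = min(start for start, _ in rowsets)
    hi = max(end for _, end in rowsets)
    total_width = sum(end - start for start, end in rowsets)
    return total_width / (hi - lo)

# Non-overlapping rowsets: height 1.0 -- already well compacted.
print(average_rowset_height([(0, 10), (10, 20)]))  # 1.0
# Fully overlapping rowsets: height 2.0 -- compaction would help.
print(average_rowset_height([(0, 10), (0, 10)]))   # 2.0
```

A higher value means a point lookup or short scan must consult more rowsets, which is exactly the situation compaction is meant to reduce.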
- Scans which read multiple columns of tables undergoing a heavy UPDATE workload are now more CPU efficient. In some cases, scan performance of such tables may be several times faster upon upgrading to this release.
- Kudu-Spark users can now provide the short “kudu” format alias to Spark. This enables using .format(“kudu”) in places where before you would have needed to provide the fully qualified name org.apache.kudu.spark.kudu or import org.apache.kudu.spark.kudu._ and use the implicit .kudu functions. The Spark integration documentation has been updated to reflect this improvement.
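In PySpark, using the new alias looks roughly like the following. This is a usage sketch, not runnable standalone: it assumes a running Kudu cluster and the kudu-spark package on the Spark classpath, and the master address and table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-alias-example").getOrCreate()

# "kudu" is the short format alias introduced in this release;
# "kudu-master:7051" and "my_table" are placeholder values.
df = (spark.read
      .format("kudu")
      .option("kudu.master", "kudu-master:7051")
      .option("kudu.table", "my_table")
      .load())
df.show()
```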
- A KuduSink class has been added to the Spark integration as a StreamSinkProvider, allowing structured streaming writes into Kudu (see KUDU-2640).
- The amount of server-side logging has been greatly reduced for Kudu’s consensus implementation and background processes. This logging was determined to be unhelpful and unnecessarily verbose.