Apache Kudu 1.9.0 release, a data storage system for the Hadoop ecosystem

Apache Kudu

Apache Kudu is Open Source software. A Kudu cluster stores tables that look just like tables you’re used to from relational (SQL) databases. A table can be as simple as a binary key and value, or as complex as a few hundred different strongly-typed attributes.

Just like SQL, every table has a PRIMARY KEY made up of one or more columns. This might be a single column like a unique user identifier, or a compound key such as a (host, metric, timestamp) tuple for a machine time series database. Rows can be efficiently read, updated, or deleted by their primary key.
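
As an illustration, here is a minimal sketch of creating such a time-series table with the Kudu Java client; the master address, table name, column names, and partitioning choices are all hypothetical:

    import java.util.Arrays;

    import org.apache.kudu.ColumnSchema;
    import org.apache.kudu.Schema;
    import org.apache.kudu.Type;
    import org.apache.kudu.client.CreateTableOptions;
    import org.apache.kudu.client.KuduClient;

    public class CreateMetricsTable {
      public static void main(String[] args) throws Exception {
        // Hypothetical master address; point this at your own cluster.
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
          // Compound primary key: (host, metric, timestamp).
          Schema schema = new Schema(Arrays.asList(
              new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
              new ColumnSchema.ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
              new ColumnSchema.ColumnSchemaBuilder("timestamp", Type.UNIXTIME_MICROS).key(true).build(),
              new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));

          CreateTableOptions options = new CreateTableOptions()
              .addHashPartitions(Arrays.asList("host", "metric"), 4) // illustrative partitioning
              .setNumReplicas(3);

          client.createTable("metrics", schema, options);
        } finally {
          client.close();
        }
      }
    }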

Kudu’s simple data model makes it a breeze to port legacy applications or build new ones: no need to worry about how to encode your data into binary blobs or make sense of a huge database full of hard-to-interpret JSON. Tables are self-describing, so you can use standard tools like SQL engines or Spark to analyze your data.

Apache Kudu 1.9.0 released

Changelog:

Deprecations

  • Support for Java 7 has been deprecated since Kudu 1.5.0 and may be removed in the next major release.

New features

  • Kudu now supports location awareness. When configured, Kudu will make a best effort to avoid placing a majority of replicas for a given tablet at the same location. The kudu cluster rebalance tool has been updated to act in accordance with the placement policy of a location-aware Kudu. The administrative documentation has been updated to detail the usage of this feature.
  • Docker scripts have been introduced to build and run Kudu on various operating systems. See the /docker subdirectory of the source repository for more details. An official repository has been created for Apache Kudu Docker artifacts.
  • Developers integrating with Kudu can now write Java tests that start a Kudu mini cluster without having to first build and install Kudu locally. This is made possible by platform-specific Kudu binary artifacts that Gradle or Maven can download and use at test time. More information on this feature can be found in the Kudu documentation. This binary test artifact is currently considered experimental; a minimal test sketch appears below.
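
As a sketch of what such a test can look like, assuming the kudu-test-utils and kudu-binary artifacts are on the test classpath (note that kudu-binary is platform-specific, e.g. published with a linux-x86_64 classifier):

    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.test.KuduTestHarness;
    import org.junit.Rule;
    import org.junit.Test;

    public class MiniClusterSmokeTest {
      // KuduTestHarness (from kudu-test-utils) starts a fresh mini cluster
      // before each test and tears it down afterwards, using the binaries
      // shipped in the kudu-binary artifact.
      @Rule
      public final KuduTestHarness harness = new KuduTestHarness();

      @Test
      public void clusterAcceptsClientCalls() throws Exception {
        KuduClient client = harness.getClient();
        // A trivial round-trip against the mini cluster.
        client.getTablesList();
      }
    }

With the Gradle or Maven dependencies in place, such a test runs without any locally built Kudu binaries.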

Optimizations and improvements

  • When creating a table, the master now enforces a restriction on the total number of replicas rather than the total number of partitions. For deployments that manually override --max_create_tablets_per_ts, this effectively cuts the maximum size of a new table by a factor equal to its replication factor. Note that partitions can still be added after table creation.
  • The compaction policy has been updated to favor reducing the number of rowsets. This can lead to faster scans and lower bootup times, particularly in the face of a “trickling inserts” workload, where rows are inserted slowly in primary key order (see KUDU-1400).
  • A tablet-level metric, average_diskrowset_height, has been added to indicate how much a replica needs to be compacted, measured as the average number of rowsets per unit of keyspace.
  • Scans which read multiple columns of tables undergoing a heavy UPDATE workload are now more CPU efficient. In some cases, scan performance of such tables may be several times faster upon upgrading to this release.
  • Kudu-Spark users can now provide the short "kudu" format alias to Spark. This enables using .format("kudu") in places where you previously needed the fully qualified name .format("org.apache.kudu.spark.kudu") or had to import org.apache.kudu.spark.kudu._ and use the implicit .kudu functions. The Spark integration documentation has been updated to reflect this improvement. A minimal read sketch follows this list.
  • The KuduSink class has been added to the Spark integration as a StreamSinkProvider, allowing structured streaming writes into Kudu (see KUDU-2640). A streaming sketch also follows this list.
  • The amount of server-side logging has been greatly reduced for Kudu’s consensus implementation and background processes. This logging was determined to be not useful and unnecessarily verbose.
  • More improvements and fixes are described in the full Apache Kudu 1.9.0 release notes.
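
To illustrate the short format alias noted above, here is a minimal Java sketch that reads a Kudu table through Spark; the master address and table name are hypothetical, and the kudu-spark dependency is assumed to be on the classpath:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KuduFormatAliasExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("kudu-format-alias")
            .master("local[*]")
            .getOrCreate();

        // The short "kudu" alias stands in for the fully qualified
        // "org.apache.kudu.spark.kudu" format name.
        Dataset<Row> df = spark.read()
            .format("kudu")
            .option("kudu.master", "kudu-master:7051") // hypothetical address
            .option("kudu.table", "metrics")           // hypothetical table
            .load();

        df.show();
        spark.stop();
      }
    }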
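
Similarly, a hedged sketch of a structured streaming write via the new KuduSink, assuming the short "kudu" format alias also resolves for streaming writes and that the target Kudu table already exists with a schema matching the stream (here, the timestamp and value columns of Spark's built-in rate source); all names and paths are hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class KuduStreamingSinkExample {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("kudu-streaming-sink")
            .master("local[*]")
            .getOrCreate();

        // Toy streaming source; in practice this would be Kafka, sockets, etc.
        Dataset<Row> stream = spark.readStream()
            .format("rate")
            .option("rowsPerSecond", "5")
            .load();

        // Stream into an existing Kudu table with a matching schema.
        StreamingQuery query = stream.writeStream()
            .format("kudu")
            .option("kudu.master", "kudu-master:7051")
            .option("kudu.table", "rate_events")
            .option("checkpointLocation", "/tmp/kudu-rate-checkpoint")
            .outputMode("append")
            .start();

        query.awaitTermination();
      }
    }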

Download