September 27, 2020

Apache Kylin v4.0.0-alpha released, Open source distributed analytics engine

Apache Kylin is an open-source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets. It was originally contributed by eBay Inc.

Apache Kylin lets you query massive datasets at sub-second latency in 3 steps.

  1. Identify a Star Schema on Hadoop.
  2. Build Cube from the identified tables.
  3. Query with ANSI SQL and get results in under a second, via ODBC, JDBC, or the RESTful API.
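
As an illustration of step 3, the sketch below prepares a SQL query for Kylin's RESTful query endpoint using only the Python standard library. The host, port, credentials, project name, and table are assumptions chosen for the example (they mirror a default local sample setup) and should be replaced with your own values. The request is only built here, not sent, since sending requires a running Kylin instance.

```python
import base64
import json
import urllib.request


def build_kylin_query_request(base_url, user, password, project, sql, limit=50000):
    """Build (but do not send) a POST request for Kylin's query endpoint."""
    payload = {"sql": sql, "project": project, "offset": 0, "limit": limit}
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_kylin_query_request(
    base_url="http://localhost:7070",  # assumed host and port
    user="ADMIN",                      # assumed credentials
    password="KYLIN",
    project="learn_kylin",             # assumed sample project
    sql="SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
)

# Sending is left commented out because it needs a reachable Kylin server:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Building the request separately from sending it keeps the example easy to inspect and test without a cluster.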

WHAT IS KYLIN?

– Extremely Fast OLAP Engine at Scale: 

Kylin is designed to reduce query latency on Hadoop for datasets of more than 10 billion rows

– ANSI SQL Interface on Hadoop: 

Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions

– Interactive Query Capability: 

Users can interact with Hadoop data via Kylin at sub-second latency, faster than Hive queries on the same dataset

– MOLAP Cube:

Users can define a data model and pre-build it in Kylin over more than 10 billion raw data records
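
To make the idea of pre-building concrete, here is a toy sketch of MOLAP-style pre-aggregation in plain Python. It is not Kylin's build engine: the function name `build_cuboids` and the sample rows are hypothetical. It only shows the principle that aggregates are computed once for every combination of dimensions, so that a later GROUP BY becomes a lookup.

```python
from collections import defaultdict
from itertools import combinations


def build_cuboids(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


# Hypothetical fact rows, for illustration only.
rows = [
    {"region": "EU", "category": "toys", "price": 10},
    {"region": "EU", "category": "books", "price": 5},
    {"region": "US", "category": "toys", "price": 7},
]

cube = build_cuboids(rows, ("region", "category"), "price")

# "SELECT region, SUM(price) ... GROUP BY region" is now a dictionary lookup.
by_region = cube[("region",)]
```

With two dimensions the sketch materializes four cuboids. The number of cuboids doubles with each added dimension, which is why a real engine needs ways to limit which combinations are built.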

– Seamless Integration with BI Tools:

Kylin currently integrates with BI tools such as Tableau, Power BI, and Excel. Integration with MicroStrategy is coming soon

– Other Highlights: 

– Job Management and Monitoring
– Compression and Encoding Support
– Incremental Refresh of Cubes
– Leverages HBase coprocessors to reduce query latency
– Both approximate and precise Query Capabilities for Distinct Count
– Approximate Top-N Query Capability
– Easy Web interface to manage, build, monitor and query cubes
– Security capability to set ACL at Cube/Project Level
– Support LDAP and SAML Integration
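
The approximate Top-N highlight above can be illustrated with the Space-Saving algorithm, a well-known technique for approximate heavy hitters. This is a minimal sketch under the assumption that a bounded set of counters is acceptable. The function name `space_saving_top_n` and the sample stream are hypothetical, and nothing here is taken from Kylin's source code.

```python
def space_saving_top_n(stream, capacity):
    """Track approximate item frequencies using at most `capacity` counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Evict the item with the smallest count; the newcomer inherits
            # that count plus one, so estimates never undercount.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical stream of seller ids, for illustration only.
stream = ["a"] * 5 + ["b"] * 3 + ["c", "d", "a", "b"]
top = space_saving_top_n(stream, capacity=3)
```

Frequent items keep accurate counts, while rare items may be overestimated (here `d` appears once but is reported with a count of 2). That trade-off is what makes the result approximate.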

Apache Kylin v4.0.0-alpha was released.

Changelog

This is a major release after 3.1.0, with 35 new features/improvements and 22 bug fixes.

New Feature

  • [KYLIN-4188] – Parquet as Cube storage V2
  • [KYLIN-4213] – The new build engine with Spark-SQL
  • [KYLIN-4452] – Kylin on Parquet with Docker
  • [KYLIN-4462] – Support Count Distinct, TopN and Percentile by Kylin on Parquet
  • [KYLIN-4659] – Prepare a technical preview version for Parquet Storage

Improvement

  • [KYLIN-4449] – A running build job keeps running after being cancelled from the front end
  • [KYLIN-4450] – Add the feature that adjusting spark driver memory adaptively
  • [KYLIN-4456] – Temporary files generated by UT or Integration Tests need to be deleted
  • [KYLIN-4458] – FilePruner prune shards
  • [KYLIN-4459] – Continuous print warning log-DFSInputStream has been closed already
  • [KYLIN-4467] – Support TopN by kylin on Parquet
  • [KYLIN-4468] – Support Percentile by kylin on Parquet
  • [KYLIN-4474] – Support window function for Kylin on Parquet
  • [KYLIN-4475] – Support intersect count for Kylin on Parquet
  • [KYLIN-4541] – Kylin.log output error information during build job
  • [KYLIN-4542] – After downloading Spark with bin/download-spark.sh, SPARK_HOME still needs to be set manually
  • [KYLIN-4621] – Avoid annoying log message when build cube and query
  • [KYLIN-4625] – Debug the code of Kylin on Parquet without hadoop environment
  • [KYLIN-4631] – Set the default build engine type to spark for Kylin on Parquet
  • [KYLIN-4644] – New tool to clean up intermediate files for Kylin 4.0
  • [KYLIN-4680] – Avoid annoying log messages of unit test and integration test
  • [KYLIN-4695] – Automatically start sparder (for query) application when start kylin instance.
  • [KYLIN-4699] – Delete job_tmp path after build/merge successfully
  • [KYLIN-4713] – Support using different Spark scheduler pools for different queries
  • [KYLIN-4722] – Add more statistics to the query results
  • [KYLIN-4723] – Set the configurations about shard by to cube level
  • [KYLIN-4744] – Add tracking URL for build spark job on yarn
  • [KYLIN-4746] – Improve build performance by reducing the count of calling ‘count()’ function
  • [KYLIN-4747] – Use the first dimension column as sort column within a partition

Bug Fix

  • [KYLIN-4444] – Error when refresh segment
  • [KYLIN-4451] – ClassCastException when querying on cluster with binary package
  • [KYLIN-4453] – Query on refreshed cube failed with FileNotFoundException
  • [KYLIN-4454] – Query snapshot table failed
  • [KYLIN-4455] – Query will fail when set calcite.debug=true
  • [KYLIN-4457] – Query cube result doesn’t match with Spark SQL
  • [KYLIN-4461] – When querying with measure whose return type is decimal, it will throw type cast exception
  • [KYLIN-4465] – Will get direct parent and ancestor cuboids with method findDirectParentCandidates
  • [KYLIN-4466] – Cannot unload table which is loaded from CSV source
  • [KYLIN-4469] – Cannot clone model
  • [KYLIN-4471] – Cannot query sql about left join
  • [KYLIN-4482] – Too many logging segment info with CubeBuildJob step
  • [KYLIN-4483] – Avoid to build global dictionaries with empty ColumnDesc collection
  • [KYLIN-4632] – No such element exception:spark.driver.cores
  • [KYLIN-4681] – Use KylinSession instead of SparkSession for some test cases
  • [KYLIN-4694] – Fix ‘NoClassDefFoundError: Lcom/esotericsoftware/kryo/io/Output’ when query with sparder on yarn
  • [KYLIN-4698] – Delete segment storage path after merging segment, deleting segment and dropping cube
  • [KYLIN-4721] – The default source type should be CSV, not Hive, in local debug mode
  • [KYLIN-4732] – The cube size is wrong after disabling the cube
  • [KYLIN-4733] – The cube size is inconsistent with the size of all segments
  • [KYLIN-4734] – The duration is still increasing after discarding the job
  • [KYLIN-4742] – NullPointerException when auto-merging segments if discarded jobs exist

Download