Uber open source deep learning distribution training library Petastorm

Uber recently announced the open source Petastorm, a data access library developed by Uber ATG that enables stand-alone or distributed training and deep learning model evaluation directly based on the terabytes of  Apache Parquet format datasets. Petastorm supports popular Python-based machine learning (ML) frameworks such as  TensorflowPytorch,  and  PySpark, and can be used directly in Python code.

Typically, we generate data sets by connecting records from multiple data sources. This data set is generated by Apache Spark’s Python interface PySpark and will be used later in machine learning training. Petastorm provides a simple feature that extends the standard Parquet with Petastorm-specific metadata to make it compatible with Petastorm.

With Petastorm, consuming data is as simple as creating and iterating over objects in HDFS or file system paths. Petastorm uses the  PyArrow library to read Parquet files. The process overview is as follows:

Petastorm combines features to support the training of automated driving algorithms, including line filtering, data sharding, shuffle, access to subsets of fields, and support for time-series data (n-gram).

For other contexts, the structure of a typical data set includes:

  • Multiple columns of sensor data collected during autonomous vehicle test runs, including cameras, laser locators, and radar.
  • Manually generated tags are stored as fields in the row.

The row data is arranged in chronological order of row grouping, and the row group size is usually in the range of 30-100.

Petastorm’s design goals include:

  • Single data schema definition driving both encoding and decoding of data.
  • High data loading bandwidth available to ML frameworks and pure Python code.
  • Leveraging Apache Spark as a distributed cluster-compute framework for generating datasets.
  • Pure Python, ML platform-agnostic implementation of core Petastorm components.
  • Native look and feel of the interface presented to Tensorflow and PyTorch frameworks.