September 27, 2020

Uber open-sources M3, its large-scale metrics platform

3 min read
Uber recently open-sourced M3, a metrics platform it has run internally for years. Built on the distributed time series database M3DB, M3 aggregates 500 million metrics per second and persists the results at 20 million writes per second.

Uber says that to support its global operations, it must be able to quickly store and access billions of metrics on its back-end systems at any given time. Until the end of 2014, all of Uber's services, infrastructure, and servers emitted metrics to a Graphite-based system that stored the data in the Whisper file format on a sharded Carbon cluster. Grafana was used for dashboards, Nagios for alerting, and Graphite threshold checks were issued via source-controlled scripts. However, expanding the Carbon cluster required manual resharding, and because there was no replication, a single node's disk failure meant permanent loss of the metrics it held. In short, as the company continued to grow, this solution could no longer meet its needs.

After evaluating existing solutions, Uber found no open source alternative that met its resource-efficiency and scale goals while operating as a self-service platform, so in 2015 it built M3. Initially, M3 was assembled almost entirely from open source components for its essential roles: statsite for aggregation, Cassandra with the Date Tiered Compaction Strategy for time-series storage, and Elasticsearch for indexing. Driven by operational burden, cost efficiency, and a growing feature set, M3 gradually replaced these with its own components, whose capabilities now go beyond the originals.

M3 currently holds more than 6.6 billion time series, aggregates 500 million metrics per second, and continuously persists 20 million metrics per second to M3DB, batching the writes of each metric to three replicas in a region. It also lets engineers store data for different lengths of time and at different granularities, so that engineers and data scientists can apply fine-grained, intelligent retention rules to time-series data with differing retention requirements.
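In M3's open source coordinator, such retention and granularity policies are expressed as storage namespaces. A minimal sketch of what such a configuration can look like (namespace names, retention values, and resolutions here are illustrative assumptions, not Uber's actual settings):

```yaml
# Illustrative m3coordinator namespace configuration: raw samples are
# kept for 48 hours, and a downsampled 10-minute-resolution copy is
# kept for a year for long-range queries.
clusters:
  - namespaces:
      - namespace: default          # raw, unaggregated data
        type: unaggregated
        retention: 48h
      - namespace: metrics_10m_1y   # downsampled long-term data
        type: aggregated
        retention: 8760h            # one year
        resolution: 10m
```

Queries spanning both namespaces can then be served from whichever resolution covers the requested time range.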

Because Prometheus is used extensively at Uber, M3 integrates with it through a sidecar component, the M3 Coordinator. This component writes incoming data to the M3DB instance in the local region and fans queries out to coordinators in other regions.
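This kind of integration is wired up through Prometheus's remote read/write API. A minimal sketch, assuming the M3 Coordinator sidecar listens on its default port 7201 (the host, port, and endpoint paths follow M3's public documentation and are not taken from this article):

```yaml
# prometheus.yml fragment: send samples to, and read them back from,
# a local m3coordinator sidecar.
remote_write:
  - url: "http://localhost:7201/api/v1/prom/remote/write"
remote_read:
  - url: "http://localhost:7201/api/v1/prom/remote/read"
```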

Based on Uber's accumulated experience with metrics storage workloads, M3 was designed with the following features:

  • Optimize each part of the metrics pipeline to give engineers as much storage as possible at the lowest hardware cost.
  • Compress data as highly as possible with the custom compression algorithm M3TSZ to reduce the hardware footprint.
  • Since most data is “write once, never read”, keep a lean memory footprint to prevent memory from becoming a bottleneck.
  • Avoid compactions as much as possible, including through downsampling, to increase host resource utilisation, enable more concurrent writes, and provide stable write/read latency.
  • Use a native time-series storage design so the database does not depend on constant heavy rewrite operations.
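The article does not describe M3TSZ's internals, but compression schemes in this family (it extends the TSZ encoding popularized by Facebook's Gorilla) achieve their high ratios by encoding timestamps as delta-of-deltas, which collapse to near zero for metrics emitted on a regular schedule. A minimal sketch of that idea in Python (illustrative only, not M3's implementation):

```python
def delta_of_deltas(timestamps):
    """Encode timestamps as [first, first_delta, delta-of-deltas...].

    For metrics scraped on a fixed interval, every delta-of-delta is
    zero, which a bit-level encoder can then store in a single bit each.
    """
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods

# A metric scraped every 10 seconds, with one sample arriving late:
print(delta_of_deltas([0, 10, 20, 31, 40]))  # [0, 10, 0, 1, -2]
```

The encoding is trivially reversible (cumulative sums restore the deltas, then the timestamps), so nothing is lost while the common regular-interval case compresses to almost nothing.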