Zipkin 2.14.2 release: a distributed tracing system


Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data. Zipkin’s design is based on the Google Dapper paper.

This project includes a dependency-free library and a spring-boot server. Storage options include in-memory, JDBC (mysql), Cassandra, and Elasticsearch.
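If you just want to try the server with the default in-memory storage, the published Docker image is the quickest path. This is a minimal sketch using Zipkin's standard image name and default port:

```shell
# Run the Zipkin server with default in-memory storage.
# The UI and JSON API are both served on port 9411.
docker run -d -p 9411:9411 openzipkin/zipkin
```

Once it is up, browse to http://localhost:9411 to reach the UI.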


Applications are instrumented to report timing data to Zipkin. The Zipkin UI also presents a Dependency diagram showing how many traced requests went through each application. If you are troubleshooting latency problems or errors, you can filter or sort all traces based on the application, length of trace, annotation, or timestamp. Once you select a trace, you can see the percentage of the total trace time each span takes which allows you to identify the problem application.

There are 4 components that make up Zipkin:

  • collector
  • storage
  • search
  • web UI

Zipkin Collector

Once trace data arrives at the Zipkin collector daemon, it is validated, stored, and indexed for lookups.

Storage

Zipkin was initially built to store data in Cassandra since Cassandra is scalable, has a flexible schema, and is heavily used within Twitter. However, we made this component pluggable. In addition to Cassandra, we natively support Elasticsearch and MySQL. Other back-ends might be offered as third-party extensions.

Zipkin Query Service

Once the data is stored and indexed, we need a way to extract it. The query daemon provides a simple JSON API for finding and retrieving traces. The primary consumer of this API is the Web UI.
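As a sketch of what the JSON API looks like, the v2 endpoints can be queried with plain HTTP. The host and port below assume a default local server, and the service name `frontend` is just an illustrative placeholder:

```shell
# List the service names Zipkin knows about (default local server assumed)
curl -s http://localhost:9411/api/v2/services

# Find up to 10 recent traces for a hypothetical service named "frontend"
curl -s 'http://localhost:9411/api/v2/traces?serviceName=frontend&limit=10'
```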

Web UI

We created a GUI that presents a nice interface for viewing traces. The web UI provides a method for viewing traces based on service, time, and annotations. Note: there is no built-in authentication in the UI!

Zipkin 2.14.0 has been released. This version focuses on operational improvements:

Zipkin 2.14 adds storage throttling and Elasticsearch 7 support. We’ve also improved efficiency around span collection and enhanced the UI. As mentioned last time, this release drops support for Elasticsearch v2.x and Kafka v0.8.x. Here’s a run-down of what’s new.

Storage Throttling (Experimental)

Managing surge problems in a collector architecture is non-trivial. While we've collected resources on this for years, only recently did we have a champion to take on some of the mechanics in practical ways. @Logic-32 fleshed out concerns in collector surge handling and did an excellent job evaluating options for those running pure http sites.

Towards that end, @Logic-32 created an experimental storage throttling feature (bundled for your convenience). When STORAGE_THROTTLE_ENABLED=true, calls to store spans pay attention to storage errors and adjust the backlog accordingly. Under the hood, this uses Netflix concurrency limits.
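A minimal sketch of enabling the feature, assuming the self-contained executable jar. The tuning variables shown come from the server configuration notes; since the feature is experimental, treat the names and values as subject to change:

```shell
# Enable experimental storage throttling with illustrative limits
STORAGE_THROTTLE_ENABLED=true \
STORAGE_THROTTLE_MIN_CONCURRENCY=10 \
STORAGE_THROTTLE_MAX_CONCURRENCY=200 \
STORAGE_THROTTLE_MAX_QUEUE_SIZE=1000 \
java -jar zipkin.jar
```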

Craig tested this at his Elasticsearch site, and it resulted in far fewer dropped spans than before. If you are interested in helping test this feature, please see the configuration notes and join Gitter to let us know how it works for you.

Elasticsearch 7.x

Our server now formally supports Elasticsearch 6.x and 7.x (and 5.x on a best-efforts basis). Most notably, you'll no longer see colons in your index patterns if using Elasticsearch 7.x. Thanks to @making and @chefky for the early testing of this feature, as quite a lot changed under the hood!
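Pointing the server at an Elasticsearch 7 cluster uses the same storage variables as earlier versions. A sketch, where the cluster URL is an assumption you would replace with your own:

```shell
# Use Elasticsearch storage; ES_HOSTS points at your cluster
STORAGE_TYPE=elasticsearch \
ES_HOSTS=http://elasticsearch.example.com:9200 \
java -jar zipkin.jar
```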

Lens UI improvements

@tacigar continues to improve Lens so that it can become the default user interface. He’s helped tune the trace detail screen, notably displaying the minimap more intuitively based on how many spans are in the trace. You’ll also notice the minimap has a slider now, which can help stabilize the area of the trace you are investigating.

Significant efficiency improvements

Our Armeria collectors (http and grpc) now work natively with pooled buffers instead of byte arrays, alongside renovated protobuf parsers. The sum of this work is more efficient trace collection when using protobuf encoding. Thanks very much to @anuraaga for leading and closely reviewing the most important parts of this work.

No more support for Elasticsearch 2.x and Kafka 0.8.x

We no longer support Elasticsearch 2.x or Kafka 0.8.x. Please see advice mentioned in our last release if you are still on these products.

Scribe is now bundled (again)

We used to bundle Scribe (a Thrift RPC span collector), but eventually moved it to a separate module because it is archived technology with library conflicts. Our server is now powered by Armeria, which natively supports Thrift. Thanks to help from @anuraaga, the server has built-in Scribe support for those running legacy applications. Set SCRIBE_ENABLED=true to use this.
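A sketch of enabling the bundled collector, again assuming the executable jar; the Scribe listener uses its own Thrift port rather than the main 9411 HTTP port:

```shell
# Enable the bundled Scribe (Thrift RPC) collector alongside the usual ones
SCRIBE_ENABLED=true \
java -jar zipkin.jar
```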

Other notable updates

  • Elasticsearch span documents are written with ID ${traceID}-${MD5(json)} to allow for server-side deduplication
  • Zipkin Server is now using the latest Spring Boot 2.1.5 and Armeria 0.85.0

Downloads