Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data. Zipkin’s design is based on the Google Dapper paper.
Applications are instrumented to report timing data to Zipkin. The Zipkin UI also presents a Dependency diagram showing how many traced requests went through each application. If you are troubleshooting latency problems or errors, you can filter or sort all traces based on the application, length of trace, annotation, or timestamp. Once you select a trace, you can see the percentage of the total trace time each span takes which allows you to identify the problem application.
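The per-span percentages shown in the UI can be reproduced directly from raw span data. A minimal sketch, assuming Zipkin v2-style spans where `timestamp` and `duration` are in microseconds (the span names and values here are illustrative):

```python
# Compute each span's share of the total trace duration, similar to what
# the Zipkin UI displays when you open a trace.
# Assumes Zipkin v2 span dicts with `timestamp` and `duration` in microseconds.

def span_percentages(spans):
    """Return {span name: percent of total trace time} for one trace."""
    start = min(s["timestamp"] for s in spans)
    end = max(s["timestamp"] + s["duration"] for s in spans)
    total = end - start
    return {s["name"]: 100.0 * s["duration"] / total for s in spans}

# Illustrative two-span trace: a root request and a nested DB call.
trace = [
    {"name": "get /api", "timestamp": 0, "duration": 1000},
    {"name": "query db", "timestamp": 200, "duration": 500},
]
print(span_percentages(trace))  # → {'get /api': 100.0, 'query db': 50.0}
```

A span that covers half of the trace's wall-clock time stands out immediately as the place to look for the latency problem.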
There are 4 components that make up Zipkin:
- collector
- storage
- query service
- web UI
Once the trace data arrives at the Zipkin collector daemon, it is validated, stored, and indexed for lookups.
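Instrumented applications typically report spans to the collector as JSON over HTTP. A minimal sketch of such a payload, assuming the v2 span format (the IDs, timestamps, and service name below are illustrative):

```python
import json

# A minimal Zipkin v2 span, as an instrumented application might report it.
# The collector's HTTP endpoint (POST /api/v2/spans) accepts a JSON array
# of spans. IDs and timestamps here are illustrative; timestamps and
# durations are in microseconds.
span = {
    "traceId": "463ac35c9f6413ad48485a3953bb6124",
    "id": "48485a3953bb6124",
    "name": "get /users",
    "timestamp": 1556604172355737,
    "duration": 1431,
    "kind": "SERVER",
    "localEndpoint": {"serviceName": "frontend"},
    "tags": {"http.method": "GET"},
}
payload = json.dumps([span])  # the body is a JSON list of spans
print(payload)
```

In practice, instrumentation libraries build and batch these payloads for you; the sketch just shows the shape of the data the collector validates and indexes.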
Zipkin was initially built to store data in Cassandra, since Cassandra is scalable, has a flexible schema, and was heavily used within Twitter. However, we made this component pluggable. In addition to Cassandra, we natively support Elasticsearch and MySQL. Other back-ends may be available as third-party extensions.
Zipkin Query Service
Once the data is stored and indexed, we need a way to extract it. The query daemon provides a simple JSON API for finding and retrieving traces. The primary consumer of this API is the Web UI.
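A sketch of what a query against that JSON API looks like, assuming the v2 endpoints (`/api/v2/traces`); the host, helper name, and parameter values are illustrative:

```python
from urllib.parse import urlencode

# Build a query URL for the Zipkin v2 JSON API, as the web UI does when
# you search for traces. `trace_query_url` is a hypothetical helper;
# the endpoint and parameter names (serviceName, limit, minDuration)
# come from the v2 API.
def trace_query_url(base, service, limit=10, **params):
    """Return a /api/v2/traces URL filtering by service name."""
    query = {"serviceName": service, "limit": limit, **params}
    return f"{base}/api/v2/traces?{urlencode(query)}"

# Find up to 10 traces through "frontend" that took at least 100ms
# (minDuration is in microseconds).
url = trace_query_url("http://localhost:9411", "frontend", minDuration=100000)
print(url)
```

The response is a JSON array of traces, each itself an array of spans, which is exactly what the web UI renders.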
We created a web UI that presents a clean interface for viewing traces, with filtering based on service, time, and annotations. Note: there is no built-in authentication in the UI!
This is a patch release that fixes a regression in handling of certain span names and adds a feature to control server timeout of queries.
- JSON API responses that include whitespace in names (such as service names) now correctly escape strings #2915
- `QUERY_TIMEOUT` environment variable can be set to a duration string to control how long the server will wait while serving a query (e.g., a search for traces) before failing with a timeout. For example, when increasing the client timeout of Elasticsearch with `ES_TIMEOUT`, it is often helpful to increase `QUERY_TIMEOUT` to a larger value so the server waits longer than the Elasticsearch client. #2809