LongTale: Toward Automatic Performance Anomaly Explanation in Microservices
Proceedings of the 13th ACM/SPEC International Conference on Performance Engineering (ICPE) 2022.
Performance troubleshooting is notoriously difficult for distributed microservices-based applications. A typical root-cause diagnosis for performance anomaly by an analyst starts by narrowing down the scope of slow services, investigates into high-level performance metrics or available logs in the slow components, and finally drills down to an actual cause. This process can be long, tedious, and sometimes aimless due to the lack of domain knowledge and the sheer number of possible culprits. This paper introduces a new machine-learning-driven performance analysis system called LongTale that automates the troubleshooting process for latency-related performance anomalies to facilitate the root cause diagnosis and explanation. LongTale builds on existing application-layer tracing in two significant aspects. First, it stitches application-layer traces with corresponding system stack traces, which enables more informative root-cause analysis. Second, it utilizes a novel machine-learning-driven analysis that feeds on the combined data to automatically uncover the most likely contributing factor(s) for given performance slowdown. We demonstrate how LongTale can be utilized in different scenarios, including abnormal long-tail latency explanation and performance interference analysis.