Measuring, predicting, and optimizing software performance in Java using JMH
Performance is one of the most central non-functional properties of modern software. And yet we all experience the applications we use on a daily basis becoming steadily slower, less reliable, and more bloated.
One reason for this is that actually testing performance is much harder than testing functional correctness, and is hence done much more rarely.
For the last 10 years, ICET-lab has studied how Java developers can use the Java Microbenchmark Harness (JMH) to continuously benchmark their systems, for example as part of their CI pipelines.
Concrete research results include detecting anti-patterns in JMH benchmarks that can lead to misleading measurement results (Costa et al., 2021), demonstrating that statistical methods can be used to significantly reduce the number of required benchmark repetitions (Laaber et al., 2020), and experimenting with coverage-based benchmark selection (Laaber et al., 2021).
In this line of research, we have also developed multiple open-source tools that support benchmarking research and practice, including Junit-to-JMH, a tool to generate performance benchmark suites from unit tests (Jangali et al., 2023), and Bencher, a tool to analyse the static and dynamic coverage of JMH benchmarks.
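To illustrate the kind of anti-pattern studied in that work, consider the following minimal sketch (our own illustration, not code from the paper): a benchmark whose result is never consumed invites the JIT compiler to dead-code-eliminate the measured computation, whereas returning the value or sinking it into JMH's Blackhole keeps the measurement meaningful.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ChecksumBenchmark { // hypothetical benchmark class

    private int[] data = new int[1024]; // stand-in workload

    // Anti-pattern: the result is never consumed, so the JIT may
    // remove the loop entirely and the measurement becomes meaningless.
    @Benchmark
    public void sumIgnored() {
        int sum = 0;
        for (int value : data) {
            sum += value;
        }
        // 'sum' is silently discarded here
    }

    // Fix: sink the result into a Blackhole (or return it) so the
    // computation cannot be optimized away.
    @Benchmark
    public void sumConsumed(Blackhole bh) {
        int sum = 0;
        for (int value : data) {
            sum += value;
        }
        bh.consume(sum);
    }
}
```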
Figure: The impact of bad JMH practices on benchmark results.
Figure: Dynamically reconfiguring JMH to reduce benchmark execution time.
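For context on the reconfiguration idea, the execution-time budget of a JMH benchmark is governed by its fork, warmup, and measurement settings, which are declared as annotation defaults and can be overridden when the benchmark is launched. These are exactly the knobs a dynamic reconfiguration approach can turn down once measurements have stabilized. The sketch below uses assumed class and package names and is not the actual tool.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Static defaults declared on the (hypothetical) benchmark class ...
@State(Scope.Benchmark)
@Fork(3)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
public class ParserBenchmark {

    private String input = "42"; // stand-in workload

    @Benchmark
    public int parseInput() {
        return Integer.parseInt(input);
    }

    // ... which can be overridden at launch time, e.g. to run fewer
    // iterations once the measurements are considered stable enough.
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(ParserBenchmark.class.getSimpleName())
                .warmupIterations(2)
                .measurementIterations(3)
                .forks(1)
                .build();
        new Runner(opts).run();
    }
}
```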
In our ongoing work in this research theme, we are particularly interested in:
How to bootstrap performance testing in a project by generating (initial) performance test suites. Junit-to-JMH (Jangali et al., 2023) is a first step in this direction (see the sketch after this list).
How to predict the execution time of benchmarks (and, hence, performance) prior to execution. We have already achieved initial success predicting the execution time of small pieces of code using graph-based neural networks (Samoaa et al., 2022). The ultimate vision, of course, is to be able to warn developers before committing slow code, without the need for expensive performance testing.
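To make the benchmark-generation idea concrete, the sketch below shows roughly what a generated microbenchmark wrapping an existing JUnit test could look like; the test and class names are hypothetical, and the actual output of Junit-to-JMH differs in its details.

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Hypothetical existing unit test.
public class CsvParserTest {

    @Test
    public void parsesSingleRow() {
        assertEquals(3, "a,b,c".split(",").length);
    }

    // Sketch of a generated wrapper: the benchmark repeatedly invokes the
    // unit test body so that JMH can measure its steady-state execution time.
    @State(Scope.Thread)
    public static class Bench {

        private final CsvParserTest test = new CsvParserTest();

        @Benchmark
        public void benchmarkParsesSingleRow() {
            test.parsesSingleRow();
        }
    }
}
```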
Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner
In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, Feb 2020
Performance is a crucial non-functional requirement of many software systems. Despite the widespread use of performance testing, developers still struggle to construct and evaluate the quality of performance tests. To address these two major challenges, we implement a framework, dubbed ju2jmh, to automatically generate performance microbenchmarks from JUnit tests, and use mutation testing to study the quality of the generated microbenchmarks. Specifically, we compare our ju2jmh-generated benchmarks to manually written JMH benchmarks and to JMH benchmarks automatically generated with the AutoJMH framework, as well as to directly measuring system performance with JUnit tests. For this purpose, we have conducted a study on three subjects (RxJava, Eclipse Collections, and Zipkin) with ∼454K source lines of code (SLOC), 2,417 JMH benchmarks (including manually written and generated AutoJMH benchmarks), and 35,084 JUnit tests. Our results show that the ju2jmh-generated benchmarks consistently outperform both the use of JUnit test execution time and throughput as a proxy for performance and the JMH benchmarks automatically generated with the AutoJMH framework, while being comparable to JMH benchmarks manually written by developers in terms of stability and the ability to detect performance bugs. Moreover, ju2jmh benchmarks cover more of the software applications during microbenchmark execution than manually written JMH benchmarks. Furthermore, ju2jmh benchmarks are generated automatically, whereas manually written JMH benchmarks require many hours of hard work and attention; our study can therefore reduce developers' effort to construct microbenchmarks. In addition, we identify three factors (too low test workload, unstable tests, and limited mutant coverage) that affect a benchmark's ability to detect performance bugs. To the best of our knowledge, this is the first study aimed at assisting developers in fully automated microbenchmark creation and in assessing microbenchmark quality for performance testing.
Predicting the performance of production code prior to actual execution is known to be highly challenging. In this paper, we propose a predictive model, dubbed TEP-GNN, which demonstrates that high-accuracy performance prediction is possible for the special case of predicting unit test execution times. TEP-GNN uses FA-ASTs, or flow-augmented ASTs, as a graph-based code representation, and predicts test execution times using a powerful graph neural network (GNN) deep learning model. We evaluate TEP-GNN on four real-life Java open source programs, based on 922 test files mined from the projects' public repositories. We find that our approach achieves a high Pearson correlation of 0.789, considerably outperforming a baseline deep learning model. Our work demonstrates that FA-ASTs and GNNs are a feasible approach for predicting absolute performance values, and it serves as an important intermediate step towards being able to predict the performance of arbitrary code prior to execution.