Two identical cloud instances from AWS exhibit wildly different IO performance
With the cloud, it's hard to know precisely what you are paying for in terms of performance. Even computing resources that are configured identically (and which cost the same) may differ dramatically in actual performance when used.
Not only does your performance inherently depend on what other tenants are doing on the same servers; the hardware itself may also vary from instance to instance or from service to service.
ICET-lab has conducted extensive benchmarking studies of real-life cloud providers, particularly Infrastructure-as-a-Service clouds such as AWS EC2 and Function-as-a-Service (serverless) environments such as AWS Lambda, pioneering tools and methods that allow customers to better predict what to expect from the services they are using.
For Infrastructure-as-a-Service clouds, we have compared the performance of three industry-leading services (Amazon EC2, Microsoft Azure, and IBM Softlayer), with a particular focus on how predictable performance is. We have executed over 50,000 individual benchmark runs, making this the largest study of its kind to our knowledge (Leitner & Cito, 2016).
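To make "predictability" concrete: a common way to quantify it is the coefficient of variation (standard deviation over mean) of repeated runs of the same benchmark on nominally identical instances. The sketch below illustrates this calculation on made-up throughput numbers; the instance names and values are hypothetical and are not results from the study.

```python
import statistics

# Hypothetical disk-throughput results (MB/s) from repeated runs of the same
# benchmark on nominally identical instances -- illustrative numbers only.
results = {
    "provider-a.medium": [212.4, 208.9, 151.3, 210.7, 149.8, 209.5],
    "provider-b.medium": [188.1, 186.9, 187.4, 189.0, 186.2, 188.6],
}

for instance_type, runs in results.items():
    mean = statistics.mean(runs)
    cv = statistics.stdev(runs) / mean * 100  # coefficient of variation, in percent
    print(f"{instance_type}: mean={mean:.1f} MB/s, CV={cv:.1f}%")
```

A low coefficient of variation means repeated runs (and repeated instance acquisitions) deliver roughly the same performance; a high one means you cannot rely on getting what you measured last time.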
To enable this work, we have built a custom open-source cloud benchmarking toolkit dubbed Cloud Workbench (CWB) (Scheuner et al., 2014). CWB is flexible enough to support a wide range of benchmarking studies; for example, we used the same toolkit to compare software performance in different cloud environments (Laaber et al., 2019).
Comparison of the performance variation in different cloud providers
The web-based user interface of Cloud Workbench
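Conceptually, a CWB study boils down to repeatedly provisioning cloud instances, executing a benchmark on them, and collecting the results. The sketch below shows that loop with hypothetical hosts and a simple dd-based IO workload; it is not CWB's actual interface, only an illustration of the kind of work the toolkit automates.

```python
# Hypothetical sketch of the provision -> benchmark -> collect loop that a
# cloud benchmarking toolkit automates. None of these helpers are part of CWB.
import subprocess


def run_remote(host: str, command: str) -> str:
    """Run a shell command on an already-provisioned benchmark machine via ssh."""
    return subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, check=True
    ).stdout


def benchmark_disk(host: str) -> str:
    # Sequential-write throughput via dd; a stand-in for a real IO benchmark.
    return run_remote(
        host,
        "dd if=/dev/zero of=/tmp/bench_testfile bs=1M count=256 conv=fdatasync 2>&1 "
        "&& rm /tmp/bench_testfile",
    )


if __name__ == "__main__":
    hosts = ["instance-1.example.com", "instance-2.example.com"]  # hypothetical hosts
    for host in hosts:
        print(host, benchmark_disk(host))
```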
For benchmarking serverless systems, we are a member of, and collaborate extensively with, the SPEC Research Group, a worldwide research collaboration under the umbrella of the Standard Performance Evaluation Corporation (SPEC), a non-profit consortium responsible for most of the state-of-the-art benchmarks against which computer systems are compared.
We are currently working on a toolkit for benchmarking serverless applications, along with an empirical study of AWS Lambda performance. More details about this work will be added soon!
Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (by a coefficient of variation from 0.03% to >100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns in cloud environments.
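As a rough sketch of the detection approach described above, the snippet below compares a control and a test sample of microbenchmark timings using both the Wilcoxon rank-sum test (SciPy's `ranksums`) and overlapping bootstrapped confidence intervals of the mean. The timing data is synthetic (a simulated 10% slowdown) and merely stands in for real microbenchmark results; it does not reproduce the study's setup.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic microbenchmark timings (ms): control vs. a version with a ~10% slowdown.
control = rng.normal(loc=100.0, scale=8.0, size=200)
test = rng.normal(loc=110.0, scale=8.0, size=200)

# Wilcoxon rank-sum test: non-parametric check whether the two samples
# come from the same distribution.
statistic, p_value = stats.ranksums(test, control)
print(f"Wilcoxon rank-sum p-value: {p_value:.4g}")


def bootstrap_ci(sample, n_resamples=10_000, confidence=0.95):
    """Bootstrapped confidence interval of the sample mean."""
    means = [
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(n_resamples)
    ]
    lower = np.percentile(means, (1 - confidence) / 2 * 100)
    upper = np.percentile(means, (1 + confidence) / 2 * 100)
    return lower, upper


# Flag a slowdown when the two confidence intervals do not overlap.
ci_control = bootstrap_ci(control)
ci_test = bootstrap_ci(test)
overlap = ci_test[0] <= ci_control[1] and ci_control[0] <= ci_test[1]
print(f"control 95% CI: {ci_control}, test 95% CI: {ci_test}, overlap: {overlap}")
```

Running test and control on the same instance, in randomized order, is what makes this comparison meaningful in a cloud setting: both samples are exposed to the same (unknown) hardware and co-tenant noise.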