Prioritizing Tests with the Feedback Pyramid
A fair bit has been written about the idea of a “Testing Pyramid”.
I tweeted an idle thought about reshaping this concept into a “Feedback Pyramid”.
I wonder if "Feedback Pyramid" is a better way to think about the testing pyramid? pic.twitter.com/4GUr2an1Ez (Brian St. Pierre, @bstpierre, March 2, 2022)
In this post I will:
- provide a more complete picture of the feedback pyramid, including some things that aren’t typically included in the testing pyramid,
- show how to prioritize tests and other checks based on how they provide feedback, and
- show how to set up your tests and other checks so that they provide you the right kind of feedback, at the right time.
Levels of the Pyramid
There are seven levels in this pyramid, including “production” at the top. At each level there are multiple “blocks”. Each block corresponds to a category of test or, more generally, a “check”. A check is any part of the development process that provides feedback to developers (or to the development organization in general). This includes things like unit tests, which show up in the Testing Pyramid, but the Feedback Pyramid also includes checks like lint, benchmarks, and manual app audits or inspections.
At the bottom of the pyramid are tests or other feedback mechanisms that tend to be:
- broad (spanning most of the code and feature set)
- cheap (in terms of time, CPU cycles, etc)
- shallow (e.g. surface scans like linters, go vet, etc)
As you move up the pyramid, the feedback mechanisms become more:
- targeted (checking single features or aspects, like vulnerability scanning)
- expensive (in terms of time, staffing, CPU cycles, service fees, etc)
- deep (more thorough checks, like fuzzing or pentesting)
Keep in mind throughout this discussion that your feedback pyramid may look quite a bit different from mine. It all depends on what kind of feedback you need to collect and how you need to manage it.
At the bottom (level 1) of the pyramid, the feedback mechanisms are near-instantaneous. These show up in your editor/IDE. Things like syntax squiggles/annotations, auto completion, snippets, doc popups, etc. are all examples of this.
The next level up the pyramid (level 2) holds checks that can take a few seconds to show up for a developer. You might have these show up in your editor, but they are not as tightly integrated or as rapid as the bottom layer. This is also the lowest layer at which CI operates.
These quick jobs should run first in CI pipelines so that any failures are reflected quickly before the longer (and more expensive) jobs run.
Level 3 is the highest level that individual developers are expected to have run on their local machine before opening a PR. Devs might not run these in their tight loops, because these checks take longer to provide feedback. But they are still fast enough that they can be run frequently, and provide valuable feedback to devs before they are ready to share their work.
At level 4 we put checks that are either slower, require extra setup or tooling, or need extra resources that we don’t need to provide full-time for each dev. For example, running builds and tests for multiple platforms.
This is the highest level of the pyramid that is run by the CI pipeline. Since these checks run in CI, some care should be taken that they do not take excessively long to run, and that they do not incur large costs. For example, scale tests that spin up a fleet of cloud resources should be moved outside the standard CI pipeline, into the next level up the pyramid.
Level 5 is the highest level that is still primarily automation-driven. These checks may be very slow and/or consume large amounts of compute or storage resources. Browser-based end-to-end testing, black-box “live” vulnerability scans, and load or scale testing fit here. These tests run against a staging or similar test deployment, prior to releasing to production.
Near the top of the pyramid (level 6) are the most expensive and/or slowest-cycle feedback options. Here is where we would put fuzzing, which can be long-running and compute-intensive and is also non-deterministic. Manual checks live here too: internal manual system testing, external audits or pentests, or a release to beta testers.
The top of the pyramid (level 7) is production deployment. You will get feedback here (as long as you have users), but hopefully it’s the kind of high-level feedback you want to get from users and the marketplace, instead of things that should have been brought to light by lower levels of the pyramid. Feedback on defects at this stage of the pyramid is very expensive to acquire, and is generally much more expensive to fix than when they are found at lower levels.
Moving Blocks Between Levels
In the diagram and the description above it may seem like each block has to live at exactly one level. In practice some blocks can be implemented at different levels. Or they could span multiple levels. Or – and this is the cool part – we can move them around on demand to suit our needs. (If we set things up with a little care.)
As a concrete implementation-choice example, accessibility (“a11y”) scanning is shown in the diagram at level 4. But we could choose to implement a11y checks – at least a subset – as unit tests or perhaps even as semgrep rules at level 2. If our app is sensitive to usability, this might be a choice we make to provide rapid and early developer feedback.
And then we might still have a subset of a11y checks that only run at level 4, so that this category spans levels.
In terms of moving blocks around on the fly, here are two examples.
First, we could “move” a single benchmark from level 3 to level 2. Assume that we’re trying to optimize a bit of code. We want to rapidly iterate and rerun just our benchmark. We can isolate that benchmark by passing flags through our Makefile so that we skip everything else and get new results in a few seconds rather than waiting for a whole suite to run.
Here’s the relevant part of the rules that run unit tests from the Makefile I shared previously:
./.coverage/$(PROJECT).out: $(ALLGO) $(ALLHTML) Makefile
	go test $(TESTFLAGS) -coverprofile=./.coverage/$(PROJECT).out ./...
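When you run make check it will run the go test command above. Note the TESTFLAGS variable: it’s not actually set anywhere in the Makefile, so by default it’s empty and has no effect on the tests. But you can set variables on the make command line, so if you want to restrict tests to a subset you can run something like make check TESTFLAGS="-v -run BenchmarkFoo" to verbosely run any tests that match the pattern. One caveat: go test only runs benchmark functions when -bench is given, so isolating a benchmark needs something like make check TESTFLAGS="-run NONE -bench BenchmarkFoo", where NONE is simply a pattern that matches no test names.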
Make will also grab variables from the environment, so if you don’t want to set it on the command line, you can just set an environment variable.
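For reference, a benchmark with a name like BenchmarkFoo might look like the sketch below. The joinWords function and its input are hypothetical stand-ins for whatever code is being optimized. In a real project BenchmarkFoo would live in a _test.go file and be selected with go test's -bench flag; the main function here uses testing.Benchmark only so that the sketch can run on its own.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the call.
var sink string

// joinWords is a hypothetical function under optimization.
func joinWords(words []string) string {
	var sb strings.Builder
	for i, w := range words {
		if i > 0 {
			sb.WriteByte(' ')
		}
		sb.WriteString(w)
	}
	return sb.String()
}

// BenchmarkFoo has the signature go test expects for benchmarks:
// a name starting with "Benchmark" and a single *testing.B parameter.
func BenchmarkFoo(b *testing.B) {
	words := []string{"feedback", "pyramid", "benchmark"}
	for i := 0; i < b.N; i++ {
		sink = joinWords(words)
	}
}

func main() {
	// testing.Benchmark runs a single benchmark outside of "go test".
	res := testing.Benchmark(BenchmarkFoo)
	fmt.Println(res)
}
```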
As a second example, a dev team could decide to move a subset of unit tests from level 2 to 3. If a certain set of tests is known to run somewhat slowly then these can be taken out of the “every rebuild” category using a build flag or similar mechanism. That way they will only be run when developers run the “full test” category. This makes it so that devs can get fast feedback from the normal test suite that they rerun all the time, but can still get more comprehensive feedback when they are ready to wait a bit longer.
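As a rough sketch of a “similar mechanism”, the slow subset can be gated on an environment variable. The RUN_SLOW_TESTS variable, the slowTestsEnabled helper, and the test name below are all hypothetical. Go also has built-in alternatives: go test -short together with testing.Short(), or a //go:build constraint on the test file that is enabled with -tags. The test function appears next to main here only to keep the sketch self-contained; in a real project it would live in a _test.go file.

```go
package main

import (
	"fmt"
	"os"
	"testing"
)

// slowTestsEnabled reports whether the slow subset should run.
// RUN_SLOW_TESTS is a hypothetical variable name chosen for this sketch.
// The lookup function is passed in so the helper is easy to exercise.
func slowTestsEnabled(getenv func(string) string) bool {
	return getenv("RUN_SLOW_TESTS") == "1"
}

// TestFullCatalogImport stands in for a slow test. It skips itself
// unless the slow subset has been explicitly requested.
func TestFullCatalogImport(t *testing.T) {
	if !slowTestsEnabled(os.Getenv) {
		t.Skip("slow test; set RUN_SLOW_TESTS=1 to run it")
	}
	// ... slow assertions would go here ...
}

func main() {
	fmt.Println("slow tests enabled:", slowTestsEnabled(os.Getenv))
}
```

Because make passes its environment on to the commands it runs, something like RUN_SLOW_TESTS=1 make check would turn the slow subset back on under this scheme.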
Next Week: Semgrep
I’ve spent some time recently learning how to write rules for semgrep. I’m far from an expert but I will share some rules that seem to be useful, and how I integrate this into my workflow.
On Friday we’ll look at handling POST form data in Gin so that users can add books to their database, plus a little something extra for subscribers.