Pros and Cons of Quarantined Tests

Flaky tests, i.e., those that only fail sometimes, are the bane of any end-to-end automated test suite.

Another type of problem test is one that fails every time but tests something deemed not important enough to fix right now. If you have to ignore some failing tests, sooner or later you’re going to ignore one that you should have paid attention to. Or worse, you might decide to ignore them all, because clearly no-one is fixing the bugs.

If a test is broken, fixing it should always be the first course of action, if possible. But what if some other task has a higher priority? If you’re confident that the problem is the test and not the software being tested, it might be reasonable to allow the test to keep failing, at least temporarily.

When you frequently ignore some failing tests, the whole suite is at risk of being seen as unreliable. A common way to prevent that is to quarantine the flaky/failing tests. Quarantine in this context refers to isolating the troublesome tests from the rest of the test suite. Not for fear of contagion, except in the sense of the negative impact they can have on the perception of the rest of the tests.

I think I first came across the concept in an article by Martin Fowler. It’s a great read on the topic of flaky tests and how to identify and resolve the causes of their flakiness. This post isn’t about how to fix them so check out that article if you’re after that kind of info.

More recently, an article on the Google Testing Blog mentioned the same technique for dealing with the same types of troublesome tests.

Even though quarantining tests can be a good temporary solution, if you don’t fix the tests (or the bugs) you can end up in the situation I mentioned before: a few failing tests create the impression that the entire suite is unreliable, which can amount to a death sentence for the whole suite.

My team and I try to avoid that death sentence in a few ways:

  1. Report quarantined test results separately from the rest of the test suite.

    That way everyone can see the results of the reliable tests and know that a failure there is something that should be looked at immediately. We don’t have to try to identify the “true” failures amongst the flaky ones.

  2. Tag quarantined tests with a reason they’re quarantined.

    So flaky tests get tagged as such. Failing tests that aren’t going to get fixed for a while get reported and tagged with the issue number. Comments can be added if the tag isn’t sufficient. This isn’t enough to rescue a quarantined test from oblivion, but it can help avoid the potential problem of losing track of why a test was quarantined.

  3. Schedule a regular review of quarantined tests.

    If it’s not scheduled it’s not likely to happen. Failing tests can be assigned to someone to fix if priorities change, and time can be invested in fixing a flaky test if we decide it’s more important than we first thought.

  4. Delete the test.

    If a test stays in quarantine for a long time, it’s worth rethinking the value it provides. Maybe it turns out that unit tests, or even exploratory tests, provide enough coverage. Or the test might cover a part of the software that rarely changes, or which doesn’t get much use, so a regression there isn’t a big deal. If it seems likely someone might decide to write the test again, we might @Ignore the test and leave a comment explaining why, instead of deleting it.

How do you deal with flaky or failing tests that don’t get fixed quickly?

Updated:

Inattentional Blindness and Scripted Tests

My previous workplace was a large organisation in which many testers were employed to evaluate the quality of the software we developed and maintained. We had a collection of scripted test cases that testers would follow step-by-step. That worked reasonably well, though I was employed to help automate our processes, including testing, and that automation contributed significantly to the quality of our software.

When I started working at my current workplace I found similarly detailed scripted test cases. Part of my responsibilities included manual testing, so I thought I could test the way I was familiar with, and how the rest of my colleagues tested—just follow the test cases. It hasn’t worked well. We find bugs, for sure, but as I’ve grown in experience I’ve found more and more problems with the software that had been there for a long time, through many versions of the software and through many executions of test cases that should have revealed them.

There are at least a few things that I think explain why we, including my past self, failed to identify problems:

  1. Out-of-date test cases. Change happens constantly and we have too many tests with too much detail for our small QA team to keep up with.
  2. Treating test cases like an instruction manual. It’s relatively easy for an experienced tester to follow the steps of a scripted test case to the letter, assuming the steps are accurate. That was our standard practice. But it’s even easier to miss out on opportunities to reveal bugs if you do that.
  3. Overwhelming detail. Many test cases are so long, verbose, and complicated that it’s very easy to miss important details in the steps and expected results, especially when you’re under pressure to get the job done quickly.
  4. Unnecessarily specific detail. Often a test case instructs the tester to use a particular element of the UI in a particular way. E.g., “enter a value in the Account text field and click the Validate button at the bottom of the panel.” That sort of specificity means the other fields are likely to be ignored, as well as ignoring all the other ways that validation could be triggered. And that problem is in addition to making it hard to keep the test case up to date (because sooner or later, that Validate button is going to move, or be removed entirely).

The last three points have something in particular in common. They all trigger a cognitive phenomenon called inattentional blindness. It’s something that we all experience, whether we’re aware of it or not¹. A well-known demonstration of the phenomenon comes from a psychological study, and you can perform the experiment from the study yourself by watching a video and following the instructions at the start:


If you haven’t done so already, I strongly recommend doing the experiment before you read on: it’s something you really only get to experience once, although there are variations of it. In real life, though, it’s likely something you’ll experience again and again.

The study and others like it find that half of the time on average, people fail to notice the unexpected element. They’re asked to perform a task and they’re focused so intently on it that they fail to perceive something they’re looking right at. It’s one of the reasons using your phone while driving is so dangerous, even hands-free; if your attention is on the text/app/call it’s not on the road.

That kind of inattentional blindness is exactly what can happen when you follow a scripted test. You focus on the steps you have to follow and the results you have to check for and you fail to notice anything else. The software can behave in unexpected ways, but you might miss it if you’re only paying attention to what the test case says should happen. Even if you’re looking right at the problem. Missing the unexpected becomes more likely the longer you’re doing the task and the more anxious you are about completing it quickly.

I’m certainly not the first to draw the link between inattentional blindness and scripted tests. Michael Bolton, for one, has mentioned it a few times on his blog. But it’s particularly relevant to me now as I try to improve our testing practices. Whatever changes we make, we need to be aware of the potential for inattentional blindness.

For now it’s clear to me that the scripted, highly detailed test cases we’re used to at my workplace are getting in the way of us improving the quality of our software. Part of the solution is a more exploratory approach to testing. One suggestion is that instead of following a test case step-by-step, you could:

“glance over the test case; try to discern the task that’s being modeled or the information that’s being sought; then try to fulfill the task without referring to the test case. That puts the tester in command to try the task and stumble over problems in getting to the goal. Since the end user is not going to be following the test case either, those problems are likely to be bugs.”

For those interested, there is more information about inattentional blindness on Scholarpedia, including references to the original research publications.

How do you avoid inattentional blindness while you’re testing? I haven’t set up a commenting system here yet but this article is cross-posted to dev.to and you’re welcome to comment there.

1. Technically, we're never aware of it. That's the inattentional part.


From GitHub to GitLab pages

I started this site on GitHub Pages because:

  1. I was already using GitHub
  2. I didn’t want to have to deal with hosting
  3. It’s free

But I’d been meaning to set up my own domain for a while¹, and although GitHub Pages supports custom domains, it didn’t seem to support HTTPS on them. And there’s no way I’m setting up a website in 2018 without HTTPS.

It turns out that support for HTTPS with custom domains is gradually being rolled out now, and it’s possible to set it up yourself. But there’s no mention of this in the documentation, and word from GitHub Support doesn’t suggest it’ll be official any time soon. And it hasn’t been rolled out to my account yet.

So I looked into GitLab as an alternative. I liked what I found. Not only is there extensive documentation, but there’s also an official tutorial.

I won’t say it was easy, but unless something has gone terribly wrong you’re reading this over a secure connection certified by Let’s Encrypt.

I followed the instructions to add my domain to my GitLab Pages settings, and to configure the DNS records for my domain. There are a few helpful links to how-to pages for specific hosts, but none for my registrar. Fortunately, the instructions provided were sufficient.
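
For anyone doing the same, the records involved look roughly like the following zone-file entries. This is a sketch, not a copy-paste recipe: the A-record IP is the GitLab.com Pages address its docs listed around this time, and the verification TXT record uses a code GitLab generates for you; check the current GitLab Pages documentation for both.

```
; Illustrative zone-file entries for pointing an apex domain at GitLab Pages.
; Verify both values against GitLab's current docs before using them.
marklapierre.net.  IN  A    35.185.44.232
marklapierre.net.  IN  TXT  "gitlab-pages-verification-code=<code from GitLab>"
```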

Before I delved into HTTPS I wanted to make sure the domain setup was behaving as expected. It was not. When I tried to open a link to my site I got a 404 back. The DNS record was directing the request to the correct server, but the server wasn’t associating the request with my GitLab Pages repository. It was then that I realised the project name still ended in github.io, not gitlab.io. When I imported my GitHub repository, the import function on GitLab copied the original repo’s name, ignoring the new name it allowed me to enter. No matter, I thought, there’s an option to change the name. So I did that. But still my site threw up a 404. It seems that I’d only changed the name of my project, i.e., the display name. I had to hunt down the setting to change the name of the repository itself (Settings > Advanced Settings > Rename repository). But once I’d updated that I could finally access my site via marklapierre.net.

And, finally, the HTTPS configuration. I followed the tutorial and ran the letsencrypt-auto CLI to set up a challenge response that would confirm for Let’s Encrypt that I control marklapierre.net. Unfortunately, at some point after the instructions were written, GitLab Pages and Jekyll stopped accepting permalinks with dots in them. Fortunately, someone pointed this out in the comments on the tutorial, along with the solution: end the permalink with / rather than .html.
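
Concretely, the workaround is to serve the challenge response from a Jekyll page whose permalink ends in a slash instead of .html. A sketch of such a page, with placeholders where letsencrypt-auto supplies the real values:

```
---
layout: null
permalink: /.well-known/acme-challenge/<token from letsencrypt-auto>/
---
<challenge response from letsencrypt-auto>
```

With a trailing slash, Jekyll generates the page as an index.html inside that directory, so the dotted-filename restriction never comes into play.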

So now I have a secure site on my own domain. Huzzah!

[Update: on the same day I published this, GitHub announced support for HTTPS on custom domains. Nice.]

1. Someone else got marklapierre.com. Last update in 2014? C'mon Mark, you're killing me!


Science and software testing

Software testing, particularly manual software testing, is sometimes thought of as nothing more than following a script to confirm that the software does what it was designed to do. From that perspective, testing might seem like a boring and relatively mindless task. And to be honest, that is the traditional view of testing as part of the Waterfall method of software development in large organisations. Division of labour meant that there were some people who did nothing but follow scripts someone else had written, and report bugs that someone else would fix.

Science, on the other hand, is undeniably interesting and challenging. So if you share the impression that software testing is boring, you might be surprised to know that I find both engaging and worth spending my time and effort on¹. Having worked as a software tester, and having studied a scientific field (cognitive science), I’ve noticed some similarities that help explain why I’m drawn to both pursuits despite their apparent lack of similarity.

Science can be defined as:

“The intellectual and practical activity encompassing the systematic study of the structure and behaviour of the physical and natural world through observation and experiment.”

That doesn’t seem to describe following testing scripts at all. Even if you swap “the physical and natural world” for “the software under test”, and even if you include the task of writing scripts. But if you consider the entire process of software testing you’ll see similarities emerge. For one thing, test scripts have to be written based on something, and in today’s world of agile software development, that something is usually not requirements handed down from designers, but rather requirements explored, developed, and refined iteratively. Observation and experiment are a big part of that iterative process. This is especially the case when working on existing software that doesn’t have good documentation—how else could you figure out how the software works except through observation and experiment? Even if you have access to the code, it’s unlikely you could read the code and know exactly how the software will behave. And there isn’t always someone else around to ask.

The reality of software testing is a lot more than following a script. A more complete definition of testing is that:

“Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.”

When defined that way, it’s much clearer how testing and science are similar. Questioning, study, modeling, observation, and inference are all core aspects of science and testing.

In testing, we question whether the software does what we expect it to do. We question whether it does what customers want it to do and in the way they want. We question whether a code change has unintended effects. We study how the software behaves under various conditions. We construct models of how we believe the software performs, even if they’re only mental models. We observe how the software responds to input. And ultimately we make inferences about the quality of the software.

Another similarity between science and software testing is that neither process truly has an end. There is always more to discover through science, even at the end of a project that has produced significant insights. And there is always more to learn about any but the simplest software. In both science and testing, it’s not meaningful to think of the entire process as having a goal, but it is necessary to define a reasonable milestone as the completion of a project. We don’t finish testing when there are no bugs, because that will never happen, but we can consider testing complete when the software behaves well under a reasonable range of scenarios.

Science is often described as trying to prove things². That is not the aim of science, nor is it how science works. Science is, in part, a way of trying to better understand the world. And science is the knowledge produced by that process. The scientific method involves making a hypothesis and then gathering evidence and analysing data to draw conclusions about whether the hypothesis is supported. It’s possible to find evidence that rules out a hypothesis, but it’s not possible to find evidence that a particular hypothesis is the only explanation for the data. This is because other hypotheses might explain that evidence just as well, including hypotheses that no-one has come up with yet. But after carefully analysing the results of many experiments a clearer understanding can begin to emerge (in the form of a theory). In that way you can think of science as showing what doesn’t work until there’s a reasonably solid explanation left. It’s not about being right; it’s about being less wrong.

Similarly, testing isn’t about proving that the software is bug free; it’s about providing evidence that you can use the software without any significant issues, so that what’s left is reasonably solid. It’s also not about proving that the software does exactly what the customer wants, but it is about helping to iteratively improve the customer’s satisfaction with the software. This is an important part of software testing that’s sometimes forgotten—the aim isn’t solely to find bugs, but also to find unexpected, unusual, or confusing behaviour.

On the other hand, there are plenty of ways in which science and testing are different. But I’ll leave that for another post.

1. Not so much manual testing specifically, but a comprehensive approach to testing that includes exploratory testing and automation.

2. Do a search for "science proves". It's enough to make a scientist or philosopher or mathematician cry.


We're neither rock stars nor impostors

Recently, Rach Smith raised some important points about how we tend to talk about impostor syndrome:

  • it minimizes the impact that this experience has on people that really do suffer from it.
  • we’re labelling what should be considered positive personality traits - humility, an acceptance that we can’t be right all the time, a desire to know more, as a “syndrome” that we need to “deal with”, “get over” or “get past”.

If you haven’t read her post yet I highly recommend you do. The issue came up again during Rach’s chat with Dave on Developer on Fire.

I can’t truly say I’ve experienced impostor syndrome, although I suspect that’s mostly because I’ve often been in small teams where everyone was similarly skilled. For example, I was once one of two novice web developers in a product development team. We really didn’t know what we were doing. I did feel unqualified, but since there was no one more experienced to compare myself against I didn’t feel like an impostor. But I did suffer from low self-confidence and a huge pile of self-doubt. Fortunately, experience and education have helped me come to grips with the limits of my knowledge and ability. I’m sure that self-awareness has contributed to better performance independently of any increase in my skills.

It all got me thinking about my experience with how jobs are advertised and how interviews are conducted, about the pressure to elevate one’s technical skills, about the growing awareness of the importance of “soft” skills, and about the rock star culture that’s promoted in some parts of the industry.

Rach noted that even highly successful senior developers sometimes experience self-doubt and the awareness of gaps in their knowledge. This is something that is all too often missing from discussions about preparing for interviews, especially for highly sought-after positions. We’re always told to prepare extensively (good advice), and to project confidence (sure, projecting a lack of confidence is understandably unhelpful), but the highest-quality advice also points out the importance of awareness of the limits of one’s skills and knowledge so that they can be appropriately managed. Much of the advice I remember from my early days suggested I should do my best to cover up my weaknesses. I don’t believe that did anything but lead to feelings of insecurity and inevitably falling apart when the limits of my knowledge were revealed. Later, I received much better advice: to be able to say “I don’t know,” and then to work through the problem aloud, asking questions to fill in the gaps until I do have enough understanding to give a reasonable answer. And isn’t that more or less how we work each day? If anyone actually had the supreme skills and confidence we’re naively advised to portray during interviews, I’m pretty sure they wouldn’t find the job challenging or interesting enough (and would likely inflict their arrogance and the consequences of their boredom on the rest of us).

Another topic missing from good career advice, fortunately less common these days, is the importance of soft skills. As Rach noted, “the most accomplished developers [have] constant awareness of the ‘gap’ in their knowledge and willingness to work towards closing it.” That sort of awareness is as important a soft skill as general social and communication skills. It’s a key part of metacognition. The people I’ve experienced most joy in working with are those who freely admit their limitations and strive daily towards eliminating them. That effort shows in their contributions at work that go above and beyond the explicit requirements of their role. Among the worst people to work with are those who do the minimum work required, without any awareness of the opportunities for improvement that pass them by every day. Even worse are those who perform at a similar level while believing that they are in fact contributing much more and at a much greater degree of competence¹. The latter type of person is unlikely to experience anything that might be called “impostor syndrome”, although if anyone were truly an impostor, it would be them.

Beyond a growing understanding of the importance of interpersonal soft skills, there are many other non-technical skills that make a solid team member. For example, the O*NET database shows active learning towards the top of a list of skills seen as important for a programmer². And yet typical hiring practices overwhelmingly reflect the prioritisation of immediate technical skills. I’m confident that’s a big part of the reason “rock star” developers are those seen as having the greatest skills rather than being most able to learn or improve. And yet the former doesn’t imply the latter, especially if those great skills lie in one highly specific domain; you can learn to do one thing really well without being able to generalise that skill, and it doesn’t mean you possess other distinct but important skills. Other downsides of specialisation are a topic for another post.

Similarly, the poor attitudes and bad behaviours of some workers are accepted because of their technical skills, despite the negative impact they have on the people around them. I suspect this might be a subtle influence on feeling like an impostor; we provide a perverse incentive for people to behave in ways that no reasonable person wants to. Our industry favours those who promote themselves as the best coder, the most knowledgeable developer, the ideal technical candidate, and we (at least implicitly) discourage people from embracing their range of skills and their ability to improve.


1. The Dunning-Kruger effect in effect, so to speak.

2. Although communication skills are apparently the #1 requirement in computing-related job ads, other soft skills and transferable technical skills are far less frequently mentioned.
