Making a Testing Plan

When contributing to a project as large and open-ended as WPT, it’s easy to get lost in the details. It can be helpful to start by making a rough list of tests you intend to write. That plan will let you anticipate how much work will be involved, and it will help you stay focused once you begin.

Many people come to WPT with a general testing goal in mind:

  • specification authors often want to test for new spec text
  • browser maintainers often want to test new features or fixes to existing features
  • web developers often want to test discrepancies between browsers on their web applications

(If you don’t have any particular goal, we can help you get started. Check out the issues labeled with type:missing-coverage on GitHub. Leave a comment if you’d like to get started with one, and don’t hesitate to ask clarifying questions!)

This guide will help you write a testing plan by:

  1. showing you how to use the specifications to learn what kinds of tests will be most helpful
  2. developing your sense for what doesn’t need to be tested
  3. demonstrating methods for figuring out which tests (if any) have already been written for WPT

The level of detail in a useful testing plan can vary widely: it may be a list of specific cases, an outline of important coverage areas, or an annotated version of the specification under test. The appropriate fidelity depends on your needs, so be as precise as you find helpful.

Understanding the “testing surface”

Web platform specifications are instructions about how a feature should work. They’re critical for implementers to “build the right thing,” but they are also important for anyone writing tests. We can use the same instructions to infer what kinds of tests would be likely to detect mistakes. Here are a few common patterns in specification text and the kind of tests they suggest.

Input sources

Algorithms may accept input from many sources. Modifying the input is the most direct way we can influence the browser’s behavior and verify that it matches the specifications. That’s why it’s helpful to be able to recognize different sources of input.

Type of feature    Potential input sources
JavaScript         parameters, context object
HTML               element content, attributes, attribute values
CSS                selector strings, property values, markup

Determine which input sources are relevant for your chosen feature, and build a list of values which seem worthwhile to test (keep reading for advice on identifying worthwhile values). For features that accept multiple sources of input, remember that the interaction between values can often produce interesting results. Every value you identify should go into your testing plan.

Example: This is the first step of the Notification constructor from the Notifications standard:

The Notification(title, options) constructor, when invoked, must run these steps:

  1. If the current global object is a ServiceWorkerGlobalScope object, then throw a TypeError exception.
  2. Let notification be the result of creating a notification given title and options. Rethrow any exceptions.


A thorough test suite for this constructor will include tests for the behavior of many different values of the title parameter and the options parameter. Choosing those values can be a challenge unto itself; see “Avoid excessive breadth” for advice.
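One lightweight way to record the values you identify is as plain data. The following is only a sketch: the candidate values and the helper name combineInputs are assumptions made for illustration, not something prescribed by the Notifications standard or by WPT.

```javascript
// Hypothetical candidate values for each input source of
// `new Notification(title, options)`. Choose real values from the spec text.
const titleValues = ["", "plain title", "a".repeat(1024)];
const optionsValues = [undefined, {}, { body: "text" }, { dir: "rtl" }];

// Pair every title with every options value, because the interaction
// between input sources can produce interesting results.
function combineInputs(titles, options) {
  const cases = [];
  for (const title of titles) {
    for (const opts of options) {
      cases.push({ title, options: opts });
    }
  }
  return cases;
}

const plan = combineInputs(titleValues, optionsValues);
console.log(`${plan.length} planned cases`); // prints "12 planned cases"
```

Keeping each list short matters here, since the number of combinations grows quickly as you add values.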

Browser state

The state of the browser may also influence algorithm behavior. Examples include the current document, the dimensions of the viewport, and the entries in the browsing history. Just like with direct input, a thorough set of tests will likely need to control these values. Browser state is often more expensive to manipulate (whether in terms of code, execution time, or system resources), and you may want to design your tests to mitigate these costs (e.g. by writing many subtests from the same state).

You may not be able to control all relevant aspects of the browser’s state. The type:untestable label includes issues for web platform features which cannot be controlled in a cross-browser way. You should include tests like these in your plan both to communicate your intention and to remind you when/if testing solutions become available.

Example: In the Notification constructor referenced above, the type of “the current global object” is also a form of input. The test suite should include tests which execute with different types of global objects.


Branches

When an algorithm branches based on some condition, that’s an indication of an interesting behavior that might be missed. Your testing plan should have at least one test that verifies the behavior when the branch is taken and at least one more test that verifies the behavior when the branch is not taken.

Example: The following algorithm from the HTML standard describes how the localStorage.getItem method works:

The getItem(key) method must return the current value associated with the given key. If the given key does not exist in the list associated with the object then this method must return null.

This algorithm exhibits different behavior depending on whether or not an item exists at the provided key. To test this thoroughly, we would write two tests: one test would verify that null is returned when there is no item at the provided key, and the other test would verify that an item we previously stored was correctly retrieved when we called the method with its name.
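The two tests described above might be sketched as follows. To keep the sketch runnable outside a browser, it uses a hypothetical in-memory stand-in named FakeStorage; a real WPT test would exercise the browser’s own localStorage and would be written with testharness.js.

```javascript
// Hypothetical in-memory stand-in for localStorage (an assumption for this
// sketch only). It mimics just the two methods the checks below need.
class FakeStorage {
  constructor() {
    this.items = new Map();
  }
  setItem(key, value) {
    this.items.set(String(key), String(value));
  }
  getItem(key) {
    const name = String(key);
    return this.items.has(name) ? this.items.get(name) : null;
  }
}

// One check per branch of the getItem algorithm.
function checkGetItemBranches(storage) {
  const results = [];
  // The key does not exist: null is expected.
  results.push({
    name: "missing key returns null",
    pass: storage.getItem("absent") === null,
  });
  // The key exists: the previously stored value is expected.
  storage.setItem("present", "value");
  results.push({
    name: "stored key returns its value",
    pass: storage.getItem("present") === "value",
  });
  return results;
}

console.log(checkGetItemBranches(new FakeStorage()));
```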


Sequence

Even without branching, the interplay between sequential algorithm steps can suggest interesting test cases. If two steps have observable side-effects, then it can be useful to verify they happen in the correct order.

Most of the time, step sequence is implicit in the nature of the algorithm–each step operates on the result of the step that precedes it, so verifying the end result implicitly verifies the sequence of the steps. But sometimes, the order of two steps isn’t particularly relevant to the result of the overall algorithm. This makes it easier for implementations to diverge.

There are many common patterns where step sequence is observable but not necessarily inherent to the correctness of the algorithm:

  • input validation (when an algorithm verifies that two or more input values satisfy some criteria)
  • event dispatch (when an algorithm fires two or more events)
  • object property access (when an algorithm retrieves two or more property values from an object provided as input)

Example: The following text is an abbreviated excerpt of the algorithm that runs during drag operations (from the HTML specification):

[…] 4. Otherwise, if the user ended the drag-and-drop operation (e.g. by releasing the mouse button in a mouse-driven drag-and-drop interface), or if the drag event was canceled, then this will be the last iteration. Run the following steps, then stop the drag-and-drop operation:

  1. If the current drag operation is “none” (no drag operation) […] Otherwise, the drag operation might be a success; run these substeps:
    1. Let dropped be true.
    2. If the current target element is a DOM element, fire a DND event named drop at it; otherwise, use platform-specific conventions for indicating a drop.
    3. […]
  2. Fire a DND event named dragend at the source node.
  3. […]

A thorough test suite will verify that the drop event is fired as specified, and it will also verify that the dragend event is fired as specified. An even better test suite will also verify that the drop event is fired before the dragend event.

In September of 2019, Chromium accidentally changed the ordering of the drop and dragend events, and as a result, real web applications stopped functioning. If there had been a test for the sequence of these events, then this confusion would have been avoided.

When making your testing plan, be sure to look carefully for event dispatch and the other patterns listed above. They won’t always be as clear as the “drag” example!
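A test for event order usually records events in a shared log and then compares that log against the expected sequence. The sketch below assumes plain EventTarget objects standing in for the source node and the drop target, and it dispatches the events by hand; a real test would have to trigger an actual drag operation. The helper name recordEventOrder is hypothetical.

```javascript
// Append the type of each observed event to one shared log, regardless of
// which target receives it.
function recordEventOrder(listeners) {
  const seen = [];
  for (const [target, name] of listeners) {
    target.addEventListener(name, (event) => seen.push(event.type));
  }
  return seen;
}

// Stand-ins for the nodes involved in the drag (assumptions for this sketch).
const sourceNode = new EventTarget();
const dropTarget = new EventTarget();

// Register listeners in the "wrong" order on purpose, so the log reflects
// dispatch order rather than registration order.
const seen = recordEventOrder([
  [sourceNode, "dragend"],
  [dropTarget, "drop"],
]);

// Simulate the sequence the specification requires.
dropTarget.dispatchEvent(new Event("drop"));
sourceNode.dispatchEvent(new Event("dragend"));

console.log(seen.join(" -> ")); // prints "drop -> dragend"
```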

Optional behavior

Specifications occasionally allow browsers discretion in how they implement certain features. These are described using RFC 2119 terms like “MAY” and “OPTIONAL”. Although browsers should not be penalized for deciding not to implement such behavior, WPT offers tests that verify the correctness of the browsers which do. Be sure to label the test as optional according to WPT’s conventions so that people reviewing test results know how to interpret failures.

Example: The algorithm underpinning document.getElementsByTagName includes the following paragraph:

When invoked with the same argument, and as long as root’s node document’s type has not changed, the same HTMLCollection object may be returned as returned by an earlier call.

That statement uses the word “may,” so even though it modifies the behavior of the preceding algorithm, it is strictly optional. The test we write for this should be designated accordingly.

It’s important to read these sections carefully because the distinction between “mandatory” behavior and “optional” behavior can be nuanced. In this case, the optional behavior is never allowed if the document’s type has changed. That makes for a mandatory test, one that verifies browsers don’t return the same result when the document’s type changes.

Exercising Restraint

When writing conformance tests, choosing what not to test is sometimes just as hard as finding what needs testing.

Don’t dive too deep

Algorithms are composed of many other algorithms which themselves are defined in terms of still more algorithms. It can be intimidating to consider exhaustively testing one of those “nested” algorithms, especially when they are shared by many different APIs.

In general, you should plan to write “surface tests” for the nested algorithms. That means only verifying that they exhibit the basic behavior you are expecting.

It’s definitely important to test exhaustively, but it’s just as important to do so in a structured way. Reach out to the test suite’s maintainers to learn if and how they have already tested those algorithms. In many cases, it’s acceptable to test them in just one place (and maybe through a different API entirely), and rely only on surface-level testing everywhere else. While it’s always possible for more tests to uncover new bugs, the chances may be slim. The time we spend writing tests is highly valuable, so we have to be efficient!

Example: The following algorithm from the DOM standard powers document.querySelector:

To scope-match a selectors string selectors against a node, run these steps:

  1. Let s be the result of parse a selector selectors.
  2. If s is failure, then throw a “SyntaxError” DOMException.
  3. Return the result of match a selector against a tree with s and node’s root using scoping root node.

As described earlier in this guide, we’d certainly want to test the branch regarding the parsing failure. However, there are many ways a string might fail to parse–should we verify them all in the tests for document.querySelector? What about document.querySelectorAll? Should we test them all there, too?

The answers depend on the current state of the test suite: whether or not tests for selector parsing exist and where they are located. That’s why it’s best to confer with the people who are maintaining the tests.

Avoid excessive breadth

When the set of input values is finite, it can be tempting to test them all exhaustively. When the set is very large, test authors can reduce repetition by defining tests programmatically in loops.

However, using advanced control flow techniques to dynamically generate tests can actually reduce test quality. It may obscure the intent of the tests since readers have to mentally “unwind” the iteration to determine what is actually being verified. The practice is also more susceptible to bugs, and these bugs may not be obvious: they may not cause failures, and they may cause the tests to exercise fewer cases than intended. Finally, tests authored using this approach often take a relatively long time to complete, and that puts a burden on people who collect test results in large numbers.

The severity of these drawbacks varies with the complexity of the generation logic. For example, it would be pronounced in a test which conditionally made different assertions within many nested loops. Conversely, the severity would be low in a test which only iterated over a list of values in order to make the same assertions about each. Recognizing when the benefits outweigh the risks requires discretion, so once you understand them, you should use your best judgement.

Example: We can see this consideration in the very first step of the Response constructor from the Fetch standard:

The Response(body, init) constructor, when invoked, must run these steps:

  1. If init[“status”] is not in the range 200 to 599, inclusive, then throw a RangeError.


This function accepts exactly 400 values for the “status.” With WPT’s testharness.js, it’s easy to dynamically create one test for each value. However, unless we have reason to believe that a browser may exhibit drastically different behavior for any of those values (e.g. correctly accepting 546 but incorrectly rejecting 547), the complexity of testing every one of those cases probably isn’t warranted.

Instead, focus on writing declarative tests for specific values which are novel in the context of the algorithm. For ranges like in this example, testing the boundaries is a good idea. 200 and 599 should not produce an error while 199 and 600 should produce an error. Feel free to use what you know about the feature to choose additional values. In this case, HTTP response status codes are classified by the “hundred” order of magnitude, so we might also want to test a “3xx” value and a “4xx” value.
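A short, declarative list of cases is one way to express this. The sketch below uses plain JavaScript rather than testharness.js so that it runs anywhere; the helper names and the specific “3xx” and “4xx” values (301 and 404) are assumptions chosen for illustration.

```javascript
// Boundary values plus one representative from the "3xx" and "4xx" classes.
const statusCases = [
  { status: 199, expected: "RangeError" },
  { status: 200, expected: "no error" },
  { status: 301, expected: "no error" },
  { status: 404, expected: "no error" },
  { status: 599, expected: "no error" },
  { status: 600, expected: "RangeError" },
];

// Describe what happens when a Response is constructed with `status`:
// either "no error" or the name of the exception that was thrown.
function constructionOutcome(ResponseCtor, status) {
  try {
    new ResponseCtor(null, { status });
    return "no error";
  } catch (error) {
    return error.name;
  }
}

// Collect the cases whose observed outcome differs from the expectation.
function findMismatches(ResponseCtor) {
  return statusCases.filter(
    ({ status, expected }) => constructionOutcome(ResponseCtor, status) !== expected
  );
}

// Assumes a global Response constructor is available in this environment.
if (typeof Response === "function") {
  console.log(findMismatches(Response).map(({ status }) => status));
}
```

Because the loop only iterates over a flat list and makes the same comparison for each entry, a reader can see every tested value at a glance.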

Assessing coverage

It’s very likely that WPT already has some tests for the feature (or at least the specification) that you’re interested in testing. In that case, you’ll have to learn what’s already been done before starting to write new tests. Understanding the design of existing tests will let you avoid duplicating effort, and it will also help you integrate your work more logically.

Even if the feature you’re testing does not have any tests, you should still keep these guidelines in mind. Sooner or later, someone else will want to extend your work, so you ought to give them a good starting point!

File names

The names of existing files and folders in the repository can help you find tests that are relevant to your work. This page on the design of WPT goes into detail about how files are generally laid out in the repository.

Generally speaking, every conformance test is stored in a subdirectory dedicated to the specification it verifies. The structure of these subdirectories varies. Some organize tests in directories related to algorithms or behaviors. Others have a more “flat” layout, where all tests are listed together.

Whatever the case, test authors try to choose names that communicate the behavior under test, so you can use them to make an educated guess about where your tests should go.

Example: Imagine you wanted to write a test to verify that headers were made immutable by the Response.error method defined in the Fetch standard. Here’s the algorithm:

The static error() method, when invoked, must run these steps:

  1. Let r be a new Response object, whose response is a new network error.
  2. Set r’s headers to a new Headers object whose guard is “immutable”.
  3. Return r.

In order to figure out where to write the test (and whether it’s needed at all), you can review the contents of the fetch/ directory in WPT. Here’s how that looks on a UNIX-like command line:

$ ls fetch
api/                           data-urls/   range/
content-encoding/              http-cache/
content-length/                images/      redirect-navigate/
content-type/                  metadata/    security/
corb/                          META.yml     stale-while-revalidate/
cors-rfc1918/                  nosniff/
cross-origin-resource-policy/  origin/

This test is for a behavior directly exposed through the API, so we should look in the api/ directory:

$ ls fetch/api
abort/  cors/         headers/           policies/  request/    response/
basic/  credentials/  idlharness.any.js  redirect/  resources/

And since this is a static method on the Response constructor, we would expect the test to belong in the response/ directory:

$ ls fetch/api/response
multi-globals/                   response-static-error.html
response-cancel-stream.html      response-static-redirect.html
response-clone.html              response-stream-disturbed-1.html
response-consume-empty.html      response-stream-disturbed-2.html
response-consume.html            response-stream-disturbed-3.html
response-consume-stream.html     response-stream-disturbed-4.html
response-error-from-stream.html  response-stream-disturbed-5.html
response-error.html              response-stream-disturbed-6.html
response-from-stream.any.js      response-stream-with-broken-then.any.js
response-init-001.html           response-trailer.html

There seems to be a test file for the error method: response-static-error.html. We can open that to decide if the behavior is already covered. If not, then we know where to write the test!

Failures on wpt.fyi

There are many behaviors that are difficult to describe in a succinct file name. That’s commonly the case with low-level rendering details of CSS specifications. Test authors may resort to generic number-based naming schemes for their files, e.g. feature-001.html, feature-002.html, etc. This makes it difficult to determine if a test case exists judging only by the names of files.

If the behavior you want to test is demonstrated by some browsers but not by others, you may be able to use the results of the tests to locate the relevant test. wpt.fyi is a website which publishes results of WPT in various browsers. Because most browsers pass most tests, the pass/fail characteristics of the behavior you’re testing can help you filter through a large number of highly similar tests.

Example: Imagine you’ve found a bug in the way Safari renders the top CSS border of HTML tables. By searching through directory names and file names, you’ve determined the probable location for the test: the css/CSS2/borders/ directory. However, there are three hundred files that begin with border-top-! None of the names mention the <table> element, so any one of the files may already be testing the case you found.

Luckily, you also know that Firefox and Chrome do not exhibit this bug. You could find such tests by visual inspection of the results overview, but the website’s “search” feature includes operators that let you query for this information directly. To find the tests which begin with border-top-, pass in Chrome, pass in Firefox, and fail in Safari, you could write the following query:

border-top- chrome:pass firefox:pass safari:fail

The results show that only three such tests exist:

  • border-top-applies-to-005.xht
  • border-top-color-applies-to-005.xht
  • border-top-width-applies-to-005.xht

These may not describe the behavior you’re interested in testing; the only way to know for sure is to review their contents. However, this is a much more manageable set to work with!

Querying file contents

Some web platform features are enabled with a predictable pattern. For example, HTML attributes follow a fairly consistent format. If you’re interested in testing a feature like this, you may be able to learn where your tests belong by querying the contents of the files in WPT.

You may be able to perform such a search on the web. WPT is hosted on GitHub.com, and GitHub offers some basic functionality for querying code. If your search criteria are short and distinctive (e.g. all files containing “querySelectorAll”), then this interface may be sufficient. However, more complicated criteria may require regular expressions. For that, you can download the WPT repository and use git to perform more powerful searches.

The following table lists some common search criteria and examples of how they can be expressed using regular expressions:

Criteria                          Example match     Example regular expression
JavaScript identifier references                    \bfoo\b
JavaScript string literals        x = "foo";        (["'])foo\1
HTML tag names                    <foo attr>        <foo(\s|>|$)
HTML attributes                   <div foo=3>       <[a-zA-Z][^>]*\sfoo(\s|>|=|$)
CSS property name                 style="foo: 4"    ([{;=\"']|\s|^)foo\s*:

Bear in mind that searches like this are not necessarily exhaustive. Depending on the feature, it may be difficult (or even impossible) to write a query that correctly identifies all relevant tests. This strategy can give a helpful guide, but the results may not be conclusive.
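Before running a pattern against the whole repository, it can help to try it on a sample you expect to match and on a near-miss you expect it to reject. The sketch below does that for the kinds of patterns listed in the table. The near-miss strings, the sample for the identifier row, and the helper name verifyPatterns are assumptions made for illustration. The CSS pattern here uses \s* before the colon so that it also matches declarations written without a space there, as in the sample.

```javascript
// Each entry pairs a pattern with a sample that should match and a
// hypothetical near-miss that should not.
const patterns = [
  { criteria: "JavaScript identifier reference", regex: /\bfoo\b/,
    match: "obj.foo()", miss: "obj.food()" },
  { criteria: "JavaScript string literal", regex: /(["'])foo\1/,
    match: 'x = "foo";', miss: 'x = "foo\';' },
  { criteria: "HTML tag name", regex: /<foo(\s|>|$)/,
    match: "<foo attr>", miss: "<foobar attr>" },
  { criteria: "HTML attribute", regex: /<[a-zA-Z][^>]*\sfoo(\s|>|=|$)/,
    match: "<div foo=3>", miss: "<div data-foo=3>" },
  { criteria: "CSS property name", regex: /([{;="']|\s|^)foo\s*:/,
    match: 'style="foo: 4"', miss: 'style="xfoo: 4"' },
];

// Return the criteria whose pattern misses its sample or accepts its near-miss.
function verifyPatterns(entries) {
  return entries
    .filter(({ regex, match, miss }) => !regex.test(match) || regex.test(miss))
    .map(({ criteria }) => criteria);
}

console.log(verifyPatterns(patterns)); // prints []
```

An empty result means every pattern behaved as expected on its two samples; it says nothing about how exhaustive the pattern is across the repository.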

Example: Imagine you’re interested in testing how the src attribute of the iframe element works with javascript: URLs. Judging only from the names of directories, you’ve found a lot of potential locations for such a test. You also know many tests use javascript: URLs without describing that in their name. How can you find where to contribute new tests?

You can design a regular expression that matches many cases where a javascript: URL is assigned to the src attribute in HTML. You can use the git grep command to query the contents of the html/ directory:

$ git grep -lE "src\s*=\s*[\"']?javascript:" html

You will still have to review the contents to know which are relevant for your purposes (if any), but compared to the 5,000 files in the html/ directory, this list is far more approachable!

Writing the Tests

With a complete testing plan in hand, you now have a good idea of the scope of your work. It’s finally time to write the tests! There’s a lot to say about how this is done technically. To learn more, check out the WPT “reftest” tutorial and the testharness.js tutorial.