How to estimate the number of false positives

Prelude: the topological gap protocol preprint

In arXiv:2207.02472 Microsoft Quantum reported validating and then applying the topological gap protocol. I summarized my overall opinion about it in a commentary. However, even after finishing the commentary, I was confused about how the manuscript estimates the chance that they found false positives. Because I am not trained in statistics, I've used this opportunity to learn how to make such an analysis properly. Here's what I found.

Background and the problem

The preprint states that the authors did the following:

  1. Designed a protocol for detecting whether a system is topological
  2. Designed a simulation that can tell the same with certainty
  3. Applied the protocol to the simulation $N=61$ times and saw only true positives or negatives (numbers not reported)
  4. Applied the protocol to experiments, where no validation is available
  5. Saw the protocol "click" on 3 experiments
  6. Concluded that with probability $1 - \log(2) / 61 \approx 98.9\%$ each of these experiments is topological

Because I was unable to find the source of the $\log(2) / N$ estimate, I decided to learn how one should do this. I like that the study aims to deal quantitatively with protocol errors and with having multiple samples, and I think this approach is transferable to other research in our field.

A straightforward consideration also shows that $\log(2) / N$ cannot be the right answer. Imagine that there were neither false nor true positives found in the simulations. Then we would have no reason to favor true positives over false positives, and an unbiased estimate should predict a 50% chance that each sample is a false positive.

Important disclaimers

  1. My adventures must be rather boring to a trained statistician, but nobody in my immediate circles knew this stuff, and therefore I expect it to be interesting at least to some people.
  2. This is how far I arrived after about a day of poking around. I did my best, but I have no idea if it's enough.
  3. I have multiple conceptual concerns about the preprint that I described in the commentary. This whole exploration disregards these concerns and focuses on the narrow question of how to deal with false positives.

Specific formulation

In a controlled experiment we find $x_{tp}$ true positives, $x_{fp}$ false positives, and $x_n$ negatives. A follow-up uncontrolled experiment yields $x'_p$ positives and $x'_n$ negatives. Both experiments have independent and identically distributed trials. Given this information we want to answer the following questions:

  • How likely is it that we found a certain number of true positives?
  • How likely is it that the controlled experiment is similar to the uncontrolled one?

For illustration purposes we will use $x_{tp} = 10$, $x_{fp} = 0$, $x_n = 28$, $x'_p = 3$, and varied $x'_n$. The numbers for $x$ were quoted in a non-public presentation by Microsoft Quantum, the one for $x'_p$ is reported in the manuscript, and $x'_n$ is unreported. The numbers for $x$ add up to 38 instead of 61, in apparent contradiction with the preprint; however, in a different place the manuscript also quotes 38 simulation runs instead of 61.

Digging in

Because all samples are independent, we immediately realize that the distribution of the outcomes must be a multinomial—a generalization of a binomial to multiple categories. Let's call its unknown outcome probabilities $p_{tp}$, $p_{fp}$, and $p_{n}$. Our goal is to use the controlled experiment to estimate the outcome probabilities and then interpret the observed data.
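As a minimal sketch of what this means in code (the probabilities below are made-up placeholders, since estimating them is exactly the task ahead):

import numpy as np
from scipy.stats import multinomial

# Placeholder outcome probabilities (p_tp, p_fp, p_n), purely for illustration;
# estimating them from the controlled experiment is the actual goal.
p = np.array([0.25, 0.125, 0.625])
counts = np.array([10, 0, 28])  # observed (x_tp, x_fp, x_n)

# Likelihood of observing these counts in 38 trials with probabilities p
print(multinomial(n=counts.sum(), p=p).pmf(counts))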

While in general, I find statistics confusing, I find its Bayesian formulation much more straightforward, so let us try to use that one. The approach is as follows:

  1. Define a prior distribution—our advance expectation about what the parameter values may be. This is a subtle step, especially if we have very specific expectations; however, there are well-known good choices for common cases.
  2. Use the Bayes formula (spelled out right after this list) to compute the posterior distribution—the joint probability distribution of model parameters given our beliefs (the prior distribution) and the observation.
  3. Compute the probabilities of the quantities of interest given the posterior distribution. We are once again interested in:
    • Finding a certain amount of true positives
    • Finding whether the two experiments—controlled and uncontrolled—are likely the same
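For reference, the Bayes formula used in step 2 reads $$ P(\theta | D) = \frac{P(D | \theta)\, P(\theta)}{P(D)}, $$ where $\theta$ stands for the model parameters (for us, the outcome probabilities), $D$ for the observed data, $P(\theta)$ for the prior, and $P(\theta | D)$ for the posterior.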

While the approach sounds straightforward, Bayesian statistics problems are frequently intractable and require numerics. I found Probabilistic Programming and Bayesian Methods for Hackers a nice introduction both to Bayesian statistics and pymc—a package for numerical Bayesian analysis. Here, however, we won't need numerics at all.

Prior choice

While I knew Bayesian statistics, I had no idea about the beautiful Dirichlet distribution $\textrm{Dir}(\mathbf{\alpha})$ with $\mathbf{\alpha} > 0$ the distribution parameters. It is a distribution of a vector quantity $\mathbf{p}$ such that $\sum_i p_i = 1$—in other words, it is a family of distributions of probabilities of mutually exclusive events. The Dirichlet distribution is, however, especially important in analyzing multinomial data because it is the so-called conjugate prior to the multinomial. Conjugate priors are families of distributions that stay within the same family when extra observations are added. In our case, if the prior is $\textrm{Dir}(\mathbf{\alpha})$ and the observations are $\mathbf{x}$, then the posterior is just $\textrm{Dir}(\mathbf{\alpha} + \mathbf{x})$. In other words, the Dirichlet distribution lets us express our prior expectations about the outcome probabilities as $\mathbf{\alpha}$, which simply add to the observed counts—for that reason $\mathbf{\alpha}$ is also called pseudocounts in this context. We also immediately have a closed form for the posterior distribution without doing any computations whatsoever!
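To see how little work the conjugate update takes in practice, here is a minimal sketch (the pseudocounts of $1/2$ are just an example; the actual choice is discussed below):

import numpy as np
from scipy.stats import dirichlet

alpha = np.array([0.5, 0.5, 0.5])  # example prior pseudocounts for (tp, fp, n)
x = np.array([10, 0, 28])          # counts observed in the controlled experiment

posterior = dirichlet(alpha + x)   # conjugacy: the posterior is again a Dirichlet
print(posterior.mean())            # posterior expectation of (p_tp, p_fp, p_n)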

Choosing $\mathbf{\alpha}$ appropriately is still nontrivial, and that's up to us. For example, to be extra cautious we could choose $\alpha_{fp}$ to be large, to reflect our skepticism. The more standard choices that I found, however, set all $\alpha$ equal to each other. This still leaves a few options, namely:

  • Maximal ignorance $\alpha = 0$ source
  • Non-informative / Jeffreys prior $\alpha = 1/2$ source
  • Homogeneous probability density distribution $\alpha = 1$

There is no clear winner: maximal ignorance is an improper (non-normalizable) distribution, and it assigns a probability of $0$ for the outcomes we never saw. The non-informative prior seems to be bad when many outcomes are missing, as explained in arXiv:1504.02689, and the homogeneous probability density likewise. The preprint also talks about hyperpriors, but that seemed too complicated, and I decided to go with Jeffreys prior—I've seen it being mentioned as reasonable, and we only have one missing type of outcome.
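To get a feeling for how much this choice matters when one category has zero counts, here is a quick comparison (my own illustration, not from the preprint) of the posterior mean of $p_{fp}$ for the two proper choices:

import numpy as np

x = np.array([10, 0, 28])  # (x_tp, x_fp, x_n) from the controlled experiment

# Posterior mean of p_fp is (x_fp + alpha) / sum(x + alpha) for equal pseudocounts.
for alpha in (0.5, 1.0):
    posterior_mean = (x + alpha) / (x + alpha).sum()
    print(f"alpha = {alpha}: posterior mean of p_fp = {posterior_mean[1]:.3f}")

With zero observed false positives, the homogeneous prior roughly doubles the estimated false positive probability compared to the Jeffreys prior.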

Let us visualize the posterior probability density of the outcome probabilities. To make the plot more interesting to look at, I've added a couple of false positive samples.

# Code copyright Anton Akhmerov, 2022.
# License: BSD simplified for everyone except Microsoft employees
# GPL 3.0 for Microsoft employees. LOL


import numpy as np
import ternary  # pip install python-ternary
from scipy.stats import dirichlet

x_tp = 10
x_fp = 0
x_p = x_tp + x_fp
x_n = 28
x_pp = 3

fig, ax = ternary.figure(scale=100)
fig.set_size_inches(8, 6)
ax.get_axes().axis('off')
counts = np.array([x_tp, x_fp + 2, x_n])
ax.heatmapf(dirichlet(counts + 1/2).pdf, boundary=False, style="triangular")
ax.bottom_axis_label("$p_{tp}$", fontsize=20)
ax.right_axis_label("$p_{fp}$", fontsize=20)
ax.left_axis_label("$p_n$", fontsize=20)
ax.boundary(linewidth=2.0)
ax.set_title("$x_{{tp}}={}, x_{{fp}}={}, x_n={}$".format(*counts), fontsize=20)

Posterior predictive and a further simplification

Now that we have decided on our prior and thereby immediately obtained the posterior, $\textrm{Dir}(\mathbf{x} + 1/2)$, we can ask how likely various outcomes of the uncontrolled experiment are. Specifically, we need the posterior predictive distribution: the probability of outcomes of the multinomial distribution whose probabilities are distributed according to the Dirichlet distribution with parameters $\mathbf{x} + 1/2$. Because the Dirichlet and multinomial distributions are basically BFFs, their joint distribution has closed-form expressions for anything one would want to know, and it's known as the Dirichlet-multinomial. It's easy enough to code (the outcome probabilities are products of a bunch of $\Gamma$ functions), so for any $x'_p$ and $x'_n$ we can compute the probabilities of the numbers of true and false positives.
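Here is a minimal sketch of that computation (my own helper written with gammaln, reusing the counts defined above; the example call assumes all 3 positives are true positives and a made-up 8 negatives, since $x'_n$ is not reported):

from scipy.special import gammaln

def dirichlet_multinomial_pmf(k, alpha):
    """Probability of counts k for a multinomial whose probabilities follow Dir(alpha)."""
    k, alpha = np.asarray(k), np.asarray(alpha)
    n, A = k.sum(), alpha.sum()
    log_p = (
        gammaln(n + 1) + gammaln(A) - gammaln(n + A)
        + np.sum(gammaln(k + alpha) - gammaln(k + 1) - gammaln(alpha))
    )
    return np.exp(log_p)

posterior_alpha = np.array([x_tp, x_fp, x_n]) + 1/2
print(dirichlet_multinomial_pmf([3, 0, 8], posterior_alpha))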

We do, however, have a problem: we don't know what $x'_n$ should be. Not having better ideas, I tried a bunch of possible values of $x'_n$ and figured out that the probabilities of the false positive outcomes don't change. It turns out that our distributions have aggregation and neutrality properties. The aggregation property means that combining two categories into one simply adds up the corresponding parameters $\alpha$. Neutrality means that discarding a category merely rescales the probabilities of the remaining ones without changing their relative distribution. Both of these properties are important for us:

  • ✔️ because of neutrality we realize that we can completely disregard negative rates when estimating the number of false positives
  • ✔️ because of aggregation we only need to keep track of positive and negative outcomes when comparing samples

This also makes intuitive sense: imagine that we are drawing red, blue, and green balls from a large bin. The number of green balls we drew shouldn't impact the ratio between red and blue as long as we know how many red and blue we drew combined (hence neutrality). Likewise, if we recolor red and blue into pink, the probability of drawing a pink ball is the sum of the red and blue ball probabilities (that's aggregation).
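As a quick numerical check of these two properties (reusing dirichlet_multinomial_pmf and posterior_alpha from the sketch above; the numbers of negatives here are arbitrary):

from scipy.stats import betabinom

def fp_probs_given_positives(x_pn):
    """P(k false positives | x_pp positives, x_pn negatives) under the posterior predictive."""
    joint = np.array([
        dirichlet_multinomial_pmf([x_pp - k, k, x_pn], posterior_alpha)
        for k in range(x_pp + 1)
    ])
    return joint / joint.sum()

print(fp_probs_given_positives(5))
print(fp_probs_given_positives(30))  # same numbers: the negatives drop out (neutrality)
print(betabinom(x_pp, x_fp + 1/2, x_tp + 1/2).pmf(np.arange(x_pp + 1)))  # two-outcome shortcut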

Results

We now realize that everything we need is described by the two-variable versions of the distributions. These are accordingly called binomial, beta, and beta-binomial distributions—all available in SciPy. Let's use those and see what they predict for the chances to have at least a certain number of true positives.

from scipy.stats.distributions import binom, beta, betabinom
from pandas import DataFrame


result = DataFrame(
    np.array([
        distro.cdf(np.arange(x_pp))
        for distro in (
            binom(x_pp, 1 - np.log(2)/61),
            binom(x_pp, 1 - np.log(2)/38),
            betabinom(x_pp, x_tp + .5, x_fp + .5),
        )
    ]).T
)
result.index.name = "≤N true positives"
result.columns = ("preprint estimate", "61 → 38 samples", "bayesian")
(result * 100).style.format('{:.2g}%')
≤N true positives   preprint estimate   61 → 38 samples   bayesian
0                   0.00015%            0.00061%          0.11%
1                   0.038%              0.099%            1.5%
2                   3.4%                5.4%              12%

Unsurprisingly, the estimates from the preprint are much more optimistic because they use both positive and negative samples to determine the false positive rate, while then applying that rate to samples that are already known to be positive.

The Bayesian 95% credibility interval of the false positive rate—the range within which $p_{fp}$ lies with 95% chance—is the Bayesian analog of the confidence interval. To obtain its upper end (the lower end being 0) we numerically find where the cumulative distribution function of the false positive rate equals 95%.

from IPython.display import Markdown

Markdown(
    "95% credibility interval for false positive rate: "
    f"**{beta(x_fp + .5, x_tp + .5).isf(0.05):.0%}**"
)

95% credibility interval for false positive rate: 17%

Sample similarity

Now we only need to answer the final question: how to decide whether the controlled and uncontrolled experiments are likely to have a similar positive rate?

The answer to that is the Bayes factor $K$: $$ K = \frac{\int P(D|M_1, \theta_1)P(\theta_1 | M_1)\mathrm{d}\theta_1}{\int P(D|M_2, \theta_2)P(\theta_2 | M_2)\mathrm{d}\theta_2} $$ This is the ratio of the probabilities that the observed data $D$ was produced by one of two models $M_1$ or $M_2$, each with its own parameters—$\theta_1$ or $\theta_2$. For us, $M_1$ is the hypothesis that the two samples were produced by the same distribution, while $M_2$ is the null hypothesis that the two samples were produced by two different binomial distributions. The Bayes factor needs to be larger than 1 for $M_1$ to be preferable (and larger than 3 to substantially favor $M_1$).

To be completely honest, I don't fully grok how the Bayes factor works, but it seems intuitive enough, and it takes model complexity into account by favoring simpler models over more complex ones.

The null hypothesis probability we can evaluate right away: it is the product of two beta-binomial distributions. The alternative is a bit more complicated, but carefully examining the expression we realize it is another beta-binomial with an extra combinatorial factor.
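Spelled out in my notation, with $n = x_p + x_n$, $n' = x'_p + x'_n$, and $N = n + n'$ the numbers of trials, the numerator is $$ P(D|M_1) = \int_0^1 \binom{n}{x_p} p^{x_p}(1-p)^{x_n} \binom{n'}{x'_p} p^{x'_p}(1-p)^{x'_n} \frac{p^{-1/2}(1-p)^{-1/2}}{B(1/2, 1/2)}\,\mathrm{d}p = \frac{\binom{n}{x_p}\binom{n'}{x'_p}}{\binom{N}{x_p + x'_p}} \mathrm{BetaBin}\!\left(x_p + x'_p;\, N, \tfrac{1}{2}, \tfrac{1}{2}\right), $$ which is exactly what the code below evaluates.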

from scipy.special import comb
from matplotlib import pyplot

x_pns = np.arange(0, 40)
Ks = [
    (  # P(D | M_1)
        betabinom(x_p + x_pp + x_n + x_pn, 1/2, 1/2).pmf(x_p + x_pp)
        * comb(x_p + x_n, x_p) * comb(x_pp + x_pn, x_pp)
        / comb(x_p + x_pp + x_n + x_pn, x_p + x_pp)
    )
    /
    (  # P(D | M_2)
        betabinom(x_p + x_n, 1/2, 1/2).pmf(x_p)
        * betabinom(x_pp + x_pn, 1/2, 1/2).pmf(x_pp)
    )
    for x_pn in x_pns
]

pyplot.plot(x_pns, Ks)
pyplot.ylabel("$K$")
pyplot.xlabel("Number of negative samples");
pyplot.hlines(1, 0, 40, alpha=0.5, linestyle="--");

Therefore, using our example values, if the second experiment had between 3 and 28 negative outcomes, then based on the data alone we would consider the two experiments likely to be similar.

Outlook

During this investigation we learned how one likely (see disclaimer 2) should treat false positive rates in controlled trials. We also learned how to determine whether two samples are likely to originate from the same distribution. Both of these we have done using Bayesian statistics, but we managed to avoid its main workhorse—numerical Monte-Carlo integration. And finally, we have figured out that in determining false positive rates one must exclude negative counts, thereby invalidating the $\log(2)/61$ expression.

Importantly, in our analysis we chose unbiased priors. By setting the threshold of $K = 1$ we effectively say that there is an advance 50% chance that the controlled and uncontrolled experiments agree. By choosing $\alpha = 1/2$ we effectively say "there is no good reason why the protocol would give false positives". Bayesian statistics makes these assumptions explicit. Given the high-stakes nature of the claim, it would be correct to bias our expectations in favor of the null hypothesis and false positives.