Fit PDF with conditional variable

In this example, we show an unusual fit where the total sample is not drawn form a single probability distribution, but each individual sample \(x\) is drawn from a different distribution, whose parameters are determined by a conditional variable \(y\).

In our example, we are drawing samples \(x\) from varying Gaussian distributions. The location of each Gaussian is a function of the conditional variable \(y\), but all share the same width parameter \(\sigma\). We fit the shared parameter \(\sigma\), but also the parameters \(a\) and \(b\) which determine how the location of each gaussian depends on \(y\), assuming a line function \(\mu = a + b y\).

This tutorial reproduces a corresponding one from RooFit.

import iminuit
from iminuit.cost import UnbinnedNLL
from iminuit import Minuit
import numpy as np
import numba as nb
import boost_histogram as bh
import matplotlib.pyplot as plt
from scipy.stats import norm
from numba_stats import norm as norm_nb
print("iminuit version", iminuit.__version__)
iminuit version 2.19.0
rng = np.random.default_rng(1)

# conditional variable: each sample is paired with a random y parameter
y = rng.normal(0, 10, size=10000)
y = y[np.abs(y) < 10]  # truncate at 10

# location of each gaussian is a function of y
def mu(y, a, b):
    return a + b * y

# draw samples from Gaussians whose locations depend on y
truth = {"a": 0, "b": 0.5, "sigma": 1.0}
x = rng.normal(mu(y, truth["a"], truth["b"]), truth["sigma"])

The distribution in \(x\) is more broad than the usual Gaussian because it is a convolution of many Gaussian distributions with varying means. We can visualise this by binning the data in \(x\) and \(y\).

ax_x = bh.axis.Regular(100, -10, 10)
ax_y = bh.axis.Regular(5, -10, 10)
h = bh.Histogram(ax_x, ax_y)
h.fill(x, y)
for i, (a, b) in enumerate(ax_y):
    plt.stairs(h.values()[:,i], ax_x.edges, label=f"[{a}, {b})",
               fill=True, alpha=0.2)
h1 = h[:, sum]
plt.stairs(h1.values(), ax_x.edges, color="k", label="total")
plt.legend(title="y interval", frameon=False, handlelength=1.2);

Fit with conditional variable

The random distribution of \(x\) depends on the value of \(y\). We can exploit that information in the likelihood function to obtain a more accurate estimate of the parameters.

def model(xy, a, b, sigma):
    x, y = xy
    mu = a + b * y
    # cannot use norm.pdf from numba_stats here, because it is not vectorized in mu
    return norm.pdf(x, mu, sigma)

nll = UnbinnedNLL((x, y), model)

m = Minuit(nll, 0.0, 0.0, 2.0)
m.limits["sigma"] = (0, None)
FCN = 1.93e+04 Nfcn = 130
EDM = 2.25e-06 (Goal: 0.0002)
Valid Minimum No Parameters at limit
Below EDM threshold (goal x 10) Below call limit
Covariance Hesse ok Accurate Pos. def. Not forced
Name Value Hesse Error Minos Error- Minos Error+ Limit- Limit+ Fixed
0 a -0.010 0.012
1 b 0.4993 0.0022
2 sigma 0.986 0.008 0
a b sigma
a 0.000142 1.31e-08 5.73e-08
b 1.31e-08 4.76e-06 1.02e-08
sigma 5.73e-08 1.02e-08 7.08e-05
# construct model representation for comparison with data histogram
a, b, sigma = m.values

# get expected content per bin from cdf, sum over the individual cdfs
v = np.diff(np.sum(norm.cdf(ax_x.edges[:,np.newaxis],
                            mu(y, a, b), sigma), axis=1))

plt.stairs(v, ax_x.edges, label="model", zorder=5, lw=2)
plt.errorbar(ax_x.centers, h1.values(), h1.variances() ** 0.5,
             fmt="ok", label="data")

Fit without conditional variable

We can also ignore the dependence of \(x\) and \(y\) and just fit the total \(x\) distribution with a model built from the distribution of \(y\) values. This also works in this case, but information is lost and therefore the parameter uncertainties become larger than in the previous case.

On top of that, the calculation is much slower, because building the pdf is more expensive. We parallelise the computation with numba.

nb.config.THREADING_LAYER = 'workqueue'

@nb.njit(parallel=True, fastmath=True)
def model(x, a, b, sigma):
    mu = a + b * y
    total = np.zeros_like(x)
    for i in nb.prange(len(mu)):
        total += norm_nb.pdf(x, mu[i], sigma)
    return total

nll = UnbinnedNLL(x, model)
m2 = Minuit(nll, 0.0, 0.0, 2.0)
m2.limits["sigma"] = (0, None)
FCN = -8.774e+04 Nfcn = 95
EDM = 1.33e-06 (Goal: 0.0002) time = 5.2 sec
Valid Minimum No Parameters at limit
Below EDM threshold (goal x 10) Below call limit
Covariance Hesse ok Accurate Pos. def. Not forced
Name Value Hesse Error Minos Error- Minos Error+ Limit- Limit+ Fixed
0 a 0.002 0.029
1 b 0.500 0.005
2 sigma 0.98 0.04 0
a b sigma
a 0.000839 4.35e-06 (0.030) -3.16e-05 (-0.027)
b 4.35e-06 (0.030) 2.43e-05 -0.000141 (-0.718)
sigma -3.16e-05 (-0.027) -0.000141 (-0.718) 0.0016
fig, ax = plt.subplots(1, 3, figsize=(8, 2), constrained_layout=True)
for par, axi in zip(m.parameters, ax):
    t = truth[par]
    axi.axhline(t, ls="--", color="0.5")
    axi.errorbar(["with\n conditional"], m.values[par],
                 m.errors[par], fmt="ok")
    axi.errorbar(["without\n conditional"], m2.values[par],
                 m2.errors[par], fmt="or")
    axi.set_xlim(-0.5, 1.5)
    dt = 2 * m2.errors[par]
    axi.set_ylim(t - dt, t + dt)