Cost functions

We give an in-depth guide on how to use the builtin cost functions.

The iminuit package comes with a couple of common cost functions that you can import from iminuit.cost for convenience. Of course, you can write your own cost functions to use with iminuit, but most of the cost function is always the same. What really varies is the statistical model, which predicts the probability density as a function of the parameter values. You still have to provide this model yourself; iminuit does not include machinery to build statistical models (that is out of scope).

The builtin cost functions are not only convenient, they also have some extra features:

  • Support for fitting weighted histograms.

  • Technical tricks improve numerical stability.

  • Optional numba acceleration (if numba is installed).

  • Cost functions can be added to fit data sets with shared parameters.

  • Temporarily mask data.

We demonstrate each cost function on a standard example from high-energy physics, the fit of a peak over some smooth background (here taken to be exponential).

[1]:
from iminuit import cost, Minuit
# faster than scipy.stats functions
from numba_stats import truncnorm, truncexpon, norm, expon
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import multivariate_normal as mvnorm

We generate our data. We sample from a Gaussian peak and from exponential background in the range 0 to 2. We then bin the original data. One can fit the original or the binned data.

[2]:
xr = (0, 2)  # xrange

rng = np.random.default_rng(1)

xdata = rng.normal(1, 0.1, size=1000)
ydata = rng.exponential(size=len(xdata))
xmix = np.append(xdata, ydata)
xmix = xmix[(xr[0] < xmix) & (xmix < xr[1])]

n, xe = np.histogram(xmix, bins=50, range=xr)
cx = 0.5 * (xe[1:] + xe[:-1])
dx = np.diff(xe)

plt.errorbar(cx, n, n ** 0.5, fmt="ok")
plt.plot(xmix, np.zeros_like(xmix), "|", alpha=0.1);

We also generate some 2D data to demonstrate multivariate fits: a Gaussian along axis 1 and, independently, an exponential along axis 2. Here the distributions are not restricted to some range in x and y.

[3]:
n2, _, ye = np.histogram2d(xdata, ydata, bins=(50, 20), range=(xr, (0, np.max(ydata))))

plt.pcolormesh(xe, ye, n2.T)
plt.scatter(xdata, ydata, marker=".", color="w", s=1);

Maximum-likelihood fits

Maximum-likelihood fits are the state of the art when it comes to fitting models to data. They can be applied to unbinned and binned data (histograms).

  • Unbinned fits are the easiest to use, because they can be applied directly to the raw sample. They become slow when the sample size is large.

  • Binned fits require you to appropriately bin the data. The binning has to be fine enough to retain all essential information. Binned fits are much faster when the sample size is large.

Unbinned fit

Unbinned fits are ideal when the data samples are not too large or when the data are very high dimensional. There is no need to worry about the appropriate binning of the data. Unbinned fits are inefficient when the samples are very large and can become numerically unstable, too. Binned fits are a better choice then.

The cost function for an unbinned maximum-likelihood fit is really simple: it is the sum of the logarithm of the pdf evaluated at each sample point (times -1 to turn maximization into minimization). You can easily write this yourself, but a naive implementation will suffer from instabilities when the pdf becomes locally zero. Our implementation mitigates the instabilities to some extent.
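To make the instability concrete, here is a minimal sketch in plain Python. It is not the implementation used by iminuit; the helper names (normal_pdf, naive_nll, guarded_nll) and the clamping strategy are assumptions chosen for illustration. A point far in the tail makes the pdf underflow to zero, so the naive sum fails, while a clamped version stays finite.

```python
import math
import sys


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def naive_nll(sample, pdf, *args):
    """Sum of -log(pdf); fails when the pdf underflows to exactly zero."""
    return -sum(math.log(pdf(x, *args)) for x in sample)


def guarded_nll(sample, pdf, *args, tiny=sys.float_info.min):
    """Same sum, but pdf values are clamped from below before taking the log.

    Illustrative mitigation only; iminuit's internal treatment may differ.
    """
    return -sum(math.log(max(pdf(x, *args), tiny)) for x in sample)


sample = [0.9, 1.0, 1.1, 50.0]  # the last point lies far out in the tail

try:
    naive = naive_nll(sample, normal_pdf, 1.0, 0.1)
except ValueError:
    naive = None  # math.log(0.0) raises a domain error

guarded = guarded_nll(sample, normal_pdf, 1.0, 0.1)
print(naive, guarded)
```

The clamped sum is large but finite, so a minimizer can still move away from the problematic region of parameter space.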

To perform the unbinned fit you need to provide the pdf of the model, which must be vectorized (it must accept a numpy array of sample values and return an array of the same length, like a numpy ufunc). The pdf must be normalized, which means that the integral over the sample value range must be a constant for any combination of model parameters.
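The normalization requirement can be checked numerically. The following sketch uses hand-written helpers (norm_pdf, norm_cdf, truncnorm_pdf and integrate are illustrative names, not part of iminuit or numba_stats) to show that a normal pdf renormalized to the range (0, 2) integrates to one for any parameter values, while the plain normal pdf does not.

```python
import math


def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def norm_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def truncnorm_pdf(x, lo, hi, mu, sigma):
    """Normal pdf renormalized to the interval [lo, hi]; zero outside."""
    if not lo <= x <= hi:
        return 0.0
    return norm_pdf(x, mu, sigma) / (norm_cdf(hi, mu, sigma) - norm_cdf(lo, mu, sigma))


def integrate(f, lo, hi, steps=20000):
    """Midpoint rule, accurate enough for this check."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


lo, hi = 0.0, 2.0

# the truncated pdf integrates to one over the fit range for any (mu, sigma)
integrals = [
    integrate(lambda x: truncnorm_pdf(x, lo, hi, mu, sigma), lo, hi)
    for mu, sigma in [(1.0, 0.1), (0.2, 0.5), (1.8, 1.0)]
]

# the plain normal pdf loses probability outside the range
plain = integrate(lambda x: norm_pdf(x, 0.2, 0.5), lo, hi)
print(integrals, plain)
```

If the integral depended on the parameters, the fit could lower the cost by changing the normalization instead of describing the shape of the data.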

The model pdf in this case is a linear combination of the normal and the exponential pdfs. The parameters are \(z\) (the weight), \(\mu\) and \(\sigma\) of the normal distribution and \(\tau\) of the exponential. The cost function detects the parameter names.

It is important to put appropriate limits on the parameters, so that the problem does not become mathematically undefined.

  • \(0 < z < 1\),

  • \(\sigma > 0\),

  • \(\tau > 0\).

In addition, it can be beneficial to restrict \(\mu\) to the data range, \(0 < \mu < 2\), but it is not required. We use truncnorm and truncexpon, which are normalised inside the data range (0, 2).

[4]:
def pdf(x, z, mu, sigma, tau):
    return (z * truncnorm.pdf(x, *xr, mu, sigma) +
            (1 - z) * truncexpon.pdf(x, *xr, 0.0, tau))

c = cost.UnbinnedNLL(xmix, pdf)

m = Minuit(c, z=0.4, mu=0.1, sigma=0.2, tau=2)
m.limits["z"] = (0, 1)
m.limits["sigma", "tau"] = (0, None)
m.migrad()

We visualize the fit.

[5]:
plt.errorbar(cx, n, n ** 0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(xm, pdf(xm, *[p.value for p in m.init_params]) * len(xmix) * dx[0],
         ls=":", label="init")
plt.plot(xm, pdf(xm, *m.values) * len(xmix) * dx[0], label="fit")
plt.legend();

We can also fit a multivariate model to multivariate data. We pass the model as a logpdf this time, which works well because the pdfs factorise.

[6]:
def logpdf(xy, mu, sigma, tau):
    x, y = xy
    return (norm.logpdf(x, mu, sigma) + expon.logpdf(y, 0, tau))

c = cost.UnbinnedNLL((xdata, ydata), logpdf, log=True)
m = Minuit(c, mu=1, sigma=2, tau=2)
m.limits["sigma", "tau"] = (0, None)
m.migrad()

Extended unbinned fit

An important variant of the unbinned ML fit is described by Roger Barlow, Nucl.Instrum.Meth.A 297 (1990) 496-506. Use this if both the shape and the integral of the density are of interest. In practice, this is often the case, for example, if you want to estimate a cross-section or yield.

The model in this case has to return the integral of the density and the density itself (which must be vectorized). The parameters in this case are those already discussed in the previous section and in addition \(s\) (integral of the signal density) and \(b\) (integral of the background density). The additional limits are:

  • \(s > 0\),

  • \(b > 0\).

Compared to the previous case, we have one more parameter to fit.
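As a rough illustration of why the model returns both the integral and the density, here is a sketch of the extended negative log-likelihood in plain Python: the predicted total count minus the sum of the log-density over the sample, up to a constant. The function names are hypothetical, and the scaling used inside iminuit may differ. For a fixed shape, the minimum lies where the predicted count equals the observed sample size.

```python
import math


def extended_nll(sample, density, *args):
    """Extended unbinned negative log-likelihood, up to an additive constant.

    `density` returns (expected total count, list of density values).
    Illustrative sketch; not the implementation used by iminuit.
    """
    total, values = density(sample, *args)
    return total - sum(math.log(v) for v in values)


def expo_density(xs, n, tau):
    """Exponential density on [0, inf), scaled to an expected count of n."""
    return n, [n * math.exp(-x / tau) / tau for x in xs]


sample = [0.1, 0.4, 0.7, 1.5, 2.2]  # five observed events

# scan the normalization at fixed shape
scan = {n: extended_nll(sample, expo_density, n, 1.0) for n in (3, 4, 5, 6, 7)}
best_n = min(scan, key=scan.get)
print(best_n)
```

The extra term for the total count is what lets the fit constrain the yield in addition to the shape.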

[7]:
def density(x, s, b, mu, sigma, tau):
    return s + b, (s * truncnorm.pdf(x, *xr, mu, sigma) +
        b * truncexpon.pdf(x, *xr, 0, tau))

c = cost.ExtendedUnbinnedNLL(xmix, density)

m = Minuit(c, s=300, b=1500, mu=0, sigma=0.2, tau=2)
m.limits["s", "b", "sigma", "tau"] = (0, None)
m.migrad()

The fitted values and the uncertainty estimates for the shape parameters are identical to the previous fit.

[8]:
plt.errorbar(cx, n, n ** 0.5, fmt="ok")
xm = np.linspace(*xr)
plt.plot(xm, density(xm, *[p.value for p in m.init_params])[1] * dx[0],
         ls=":", label="init")
plt.plot(xm, density(xm, *m.values)[1] * dx[0], label="fit")
plt.legend();

Once again, we fit 2D data, using the logdensity mode.

[9]:
def logdensity(xy, n, mu, sigma, tau):
    x, y = xy
    return n, np.log(n) + norm.logpdf(x, mu, sigma) + expon.logpdf(y, 0, tau)

c = cost.ExtendedUnbinnedNLL((xdata, ydata), logdensity, log=True)
m = Minuit(c, n=1, mu=1, sigma=2, tau=2)
m.limits["n", "sigma", "tau"] = (0, None)
m.migrad()

Binned fit

Binned fits are computationally more efficient and numerically more stable when samples are large. The caveat is that one has to choose an appropriate binning. The binning should be fine enough so that the essential information in the original data is retained. Using large bins does not introduce a bias, but the parameters have a larger-than-minimal variance.

In this case, 50 bins are fine enough to retain all information. Using a large number of bins is safe, since the maximum-likelihood method correctly takes Poisson statistics into account, which works even if bins have zero entries. Using more bins than necessary just increases the computational cost.

Instead of a pdf, you need to provide a cdf for a binned fit (which must be vectorized). If the cdf is expensive to calculate, you can approximate the expected bin content as “bin-width times pdf evaluated at the bin center”, but this is an approximation and will lead to a bias. Using the cdf avoids this bias.
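The size of this bias can be seen in a small comparison. The sketch below uses a hand-written exponential distribution and deliberately coarse bins (all names and numbers are chosen for illustration) and compares the exact bin probabilities from cdf differences with the “bin-width times pdf at center” approximation.

```python
import math


def expon_pdf(x, tau):
    return math.exp(-x / tau) / tau


def expon_cdf(x, tau):
    return 1.0 - math.exp(-x / tau)


tau = 0.5
edges = [0.0, 0.5, 1.0, 1.5, 2.0]  # deliberately coarse bins
bins = list(zip(edges, edges[1:]))

# exact probability per bin: difference of the cdf at the bin edges
exact = [expon_cdf(b, tau) - expon_cdf(a, tau) for a, b in bins]

# approximation: bin width times the pdf at the bin center
approx = [(b - a) * expon_pdf(0.5 * (a + b), tau) for a, b in bins]

ratios = [a / e for e, a in zip(exact, approx)]
for e, a, r in zip(exact, approx, ratios):
    print(f"exact={e:.4f}  approx={a:.4f}  ratio={r:.3f}")
```

For this convex pdf the approximation underestimates every bin by the same factor. The discrepancy shrinks as the bins become narrower relative to the scale on which the pdf changes.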

[10]:
def cdf(xe, z, mu, sigma, tau):
    return (z * truncnorm.cdf(xe, *xr, mu, sigma) +
            (1-z) * truncexpon.cdf(xe, *xr, 0, tau))

c = cost.BinnedNLL(n, xe, cdf)
m = Minuit(c, z=0.4, mu=0, sigma=0.2, tau=2)
m.limits["z"] = (0, 1)
m.limits["sigma", "tau"] = (0, None)
m.migrad()

The fitted values and the uncertainty estimates for \(\mu\) and \(\sigma\) are not identical to the unbinned fit, but very close. For practical purposes, the results are equivalent. This shows that the binning is fine enough to retain the essential information in the original data.

Note that iminuit also shows the chi2/ndof goodness-of-fit estimator when the data are binned. It can be calculated for free in the binned case.
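For orientation, a common way to obtain a chi-square-like goodness-of-fit value from a binned likelihood is the Poisson deviance of Baker and Cousins. The sketch below assumes this form; whether iminuit uses exactly this expression for each cost function is not confirmed here, and the function name and numbers are illustrative.

```python
import math


def poisson_deviance(counts, expected):
    """2 * sum(mu - n + n * log(n / mu)); an empty bin contributes 2 * mu.

    Assumed form (Baker-Cousins); iminuit's exact formula may differ.
    """
    total = 0.0
    for n, mu in zip(counts, expected):
        total += mu - n
        if n > 0:  # n * log(n / mu) tends to 0 for n -> 0
            total += n * math.log(n / mu)
    return 2.0 * total


counts = [0, 3, 12, 7, 1]
expected = [0.5, 4.0, 10.0, 6.0, 2.0]  # hypothetical model prediction

perfect = poisson_deviance(counts, [float(n) for n in counts])
value = poisson_deviance(counts, expected)
print(perfect, value)
```

The value is zero when the prediction matches the counts exactly and grows as they disagree. Dividing it by the number of degrees of freedom (number of bins minus number of fitted parameters) gives a chi2/ndof-style estimator.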

[11]:
plt.errorbar(cx, n, n ** 0.5, fmt="ok")
plt.stairs(np.diff(cdf(xe, *[p.value for p in m.init_params])) * len(xmix), xe,
           ls=":", label="init")
plt.stairs(np.diff(cdf(xe, *m.values)) * len(xmix), xe, label="fit")
plt.legend();

Fitting a multidimensional histogram is equally easy. Since the pdfs in this example factorise, the cdf of the 2D model is the product of the cdfs along each axis.
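The factorization can be verified with a short sketch in plain Python. The probability content of a rectangular bin follows from the 2D cdf by inclusion-exclusion over its four corners. For a factorizing model this equals the product of the 1D bin probabilities. All helper names are illustrative, and how iminuit turns the 2D cdf into bin contents internally is not claimed here.

```python
import math


def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def expon_cdf(y, tau):
    return 1.0 - math.exp(-y / tau) if y > 0 else 0.0


def cdf2d(x, y, mu, sigma, tau):
    """cdf of the factorizing 2D model."""
    return norm_cdf(x, mu, sigma) * expon_cdf(y, tau)


def bin_probability(x0, x1, y0, y1, *par):
    """Probability inside the rectangle via inclusion-exclusion on the cdf."""
    return (cdf2d(x1, y1, *par) - cdf2d(x0, y1, *par)
            - cdf2d(x1, y0, *par) + cdf2d(x0, y0, *par))


par = (1.0, 0.1, 1.0)  # mu, sigma, tau
xe = [0.8, 0.9, 1.0, 1.1, 1.2]
ye = [0.0, 0.5, 1.0, 2.0]

grid = [[bin_probability(xa, xb, ya, yb, *par) for ya, yb in zip(ye, ye[1:])]
        for xa, xb in zip(xe, xe[1:])]

# compare with the product of 1D bin probabilities
px = [norm_cdf(b, 1.0, 0.1) - norm_cdf(a, 1.0, 0.1) for a, b in zip(xe, xe[1:])]
py = [expon_cdf(b, 1.0) - expon_cdf(a, 1.0) for a, b in zip(ye, ye[1:])]
max_diff = max(abs(grid[i][j] - px[i] * py[j])
               for i in range(len(px)) for j in range(len(py)))
print(max_diff)
```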

[12]:
def cdf(xe_ye, mu, sigma, tau):
    xe, ye = xe_ye
    return norm.cdf(xe, mu, sigma) * expon.cdf(ye, 0, tau)

c = cost.BinnedNLL(n2, (xe, ye), cdf)
m = Minuit(c, mu=0.1, sigma=0.2, tau=2)
m.limits["sigma", "tau"] = (0, None)
m.migrad()

Extended binned maximum-likelihood fit

As in the unbinned case, the binned extended maximum-likelihood fit should be used when the amplitudes of the pdfs are also of interest.

Instead of a density, you need to provide the integrated density in this case (which must be vectorized). There is no need to separately return the total integral of the density, like in the unbinned case. The parameters are the same as in the unbinned extended fit.

[13]:
def integral(xe, s, b, mu, sigma, tau):
    return (s * truncnorm.cdf(xe, *xr, mu, sigma) +
            b * truncexpon.cdf(xe, *xr, 0, tau))

c = cost.ExtendedBinnedNLL(n, xe, integral)
m = Minuit(c, s=300, b=1500, mu=0, sigma=0.2, tau=2)
m.limits["s", "b", "sigma", "tau"] = (0, None)
m.migrad()
[14]:
plt.errorbar(cx, n, n ** 0.5, fmt="ok")
plt.stairs(np.diff(integral(xe, *[p.value for p in m.init_params])), xe,
           ls=":", label="init")
plt.stairs(np.diff(integral(xe, *m.values)), xe, label="fit")
plt.legend();

Again, we can also fit multivariate data.

[15]:
def integral(xe_ye, n, mu, sigma, tau):
    xe, ye = xe_ye
    return n * norm.cdf(xe, mu, sigma) * expon.cdf(ye, 0, tau)

c = cost.ExtendedBinnedNLL(n2, (xe, ye), integral)
m = Minuit(c, n=1500, mu=0.1, sigma=0.2, tau=2)
m.limits["n", "sigma", "tau"] = (0, None)
m.migrad()

Temporary masking

In complicated binned fits with peak and background, it is sometimes useful to fit in several stages. One typically starts by masking the signal region, to fit only the background region.

The cost functions have a mask attribute to that end. We demonstrate the use of the mask with an extended binned fit.

[16]:
def integral(xe, s, b, mu, sigma, tau):
    return (s * truncnorm.cdf(xe, *xr, mu, sigma) +
            b * truncexpon.cdf(xe, *xr, 0, tau))

c = cost.ExtendedBinnedNLL(n, xe, integral)

# we set the signal amplitude to zero and fix all signal parameters
m = Minuit(c, s=0, b=1500, mu=1, sigma=0.2, tau=2)

m.limits["s", "b", "sigma", "tau"] = (0, None)
m.fixed["s", "mu", "sigma"] = True

# we temporarily mask out the signal
c.mask = (cx < 0.5) | (1.5 < cx)

m.migrad()

We plot the intermediate result. Points which have been masked out are shown with open markers.

[17]:
for ma, co in ((c.mask, "k"), (~c.mask, "w")):
    plt.errorbar(cx[ma], n[ma], n[ma] ** 0.5, fmt="o", color=co, mec="k", ecolor="k")
plt.stairs(np.diff(integral(xe, *[p.value for p in m.init_params])), xe,
           ls=":", label="init")
plt.stairs(np.diff(integral(xe, *m.values)), xe, label="fit")
plt.legend();

Now we fix the background and fit only the signal parameters.

[18]:
c.mask = None # remove mask
m.fixed = False # release all parameters
m.fixed["b"] = True # fix background amplitude
m.values["s"] = 100 # do not start at the limit
m.migrad()

Finally, we release all parameters and fit again to get the correct uncertainty estimates.

[19]:
m.fixed = False # release all parameters
m.migrad()
[20]:
plt.errorbar(cx, n, n ** 0.5, fmt="ok")
plt.stairs(np.diff(integral(xe, *m.values)), xe, label="fit")
plt.legend();

We get the same result as before. Since this was an easy problem, we did not need these extra steps, but doing this can be helpful to fit lots of histograms without adjusting each fit manually.

Weighted histograms

The cost functions for binned data also support weighted histograms. Just pass an array with the shape (n, 2) instead of (n,) as the first argument, where the first number of each pair is the sum of weights and the second is the sum of weights squared (an estimate of the variance of that bin value).
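As an illustration of the expected input, the following sketch builds such an array of pairs from a weighted sample in plain Python. The helper weighted_histogram and the data are hypothetical; only the layout (sum of weights, sum of squared weights per bin) comes from the description above.

```python
def weighted_histogram(values, weights, edges):
    """Return one [sum of weights, sum of squared weights] pair per bin."""
    nbins = len(edges) - 1
    pairs = [[0.0, 0.0] for _ in range(nbins)]
    for v, w in zip(values, weights):
        for i in range(nbins):
            # bins are half-open, except the last one, which includes its upper edge
            in_last_edge = i == nbins - 1 and v == edges[-1]
            if edges[i] <= v < edges[i + 1] or in_last_edge:
                pairs[i][0] += w
                pairs[i][1] += w * w
                break
    return pairs


values = [0.1, 0.2, 0.7, 1.2, 1.9, 1.95]
weights = [1.0, 2.0, 0.5, 1.0, 3.0, 0.5]
edges = [0.0, 0.5, 1.0, 1.5, 2.0]

hist = weighted_histogram(values, weights, edges)
print(hist)
```

With numpy, the same two columns can be obtained from two calls to np.histogram, one with weights=w and one with weights=w**2, stacked along the last axis. With unit weights both columns reduce to the ordinary bin counts.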

Least-squares fits

A cost function for a general weighted least-squares fit (aka chi-square fit) is also included. In statistics this is called non-linear regression.

In this case you need to provide a model that predicts the y-values as a function of the x-values and the parameters. The fit needs estimates of the y-errors. If those are wrong, the fit may be biased. If your data has errors on the x-values as well, check out the tutorial about automatic differentiation, which includes an application of that to such fits.
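The quantity that is minimized can be written in a few lines. This is a sketch of the ordinary weighted least-squares sum in plain Python, with illustrative names and made-up data; it is not the implementation inside iminuit.cost.LeastSquares.

```python
def chi_square(x, y, yerror, model, *par):
    """Sum of squared residuals, each divided by its y-error."""
    return sum(((yi - model(xi, *par)) / ei) ** 2
               for xi, yi, ei in zip(x, y, yerror))


def parabola(x, a, b):
    return a + b * x ** 2


# made-up data: each point is exactly one standard deviation off the truth (a=1, b=2)
x = [0.0, 0.5, 1.0]
y = [1.1, 1.4, 3.2]
yerror = [0.1, 0.1, 0.2]

at_truth = chi_square(x, y, yerror, parabola, 1.0, 2.0)
far_off = chi_square(x, y, yerror, parabola, 0.0, 0.0)
print(at_truth, far_off)
```

Because every residual is divided by its y-error, points with small errors dominate the sum, which is why wrong error estimates can bias the fit.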

[21]:
def model(x, a, b):
    return a + b * x ** 2

rng = np.random.default_rng(4)

truth = 1, 2
x = np.linspace(0, 1, 20)
yt = model(x, *truth)
ye = 0.4 * x**5 + 0.1
y = rng.normal(yt, ye)

plt.plot(x, yt, ls="--", label="truth")
plt.errorbar(x, y, ye, fmt="ok", label="data")
plt.legend();
[22]:
c = cost.LeastSquares(x, y, ye, model)
m1 = Minuit(c, a=0, b=0)
m1.migrad()
[23]:
plt.errorbar(c.x, c.y, c.yerror, fmt="ok", label="data")
plt.plot(c.x, model(c.x, *truth), ls="--", label="truth")
plt.plot(c.x, model(c.x, *m1.values), label="fit")
plt.legend();

We can also fit a multivariate model. In this case, we fit a plane in 2D.

[24]:
def model2(x_y, a, bx, by):
    x, y = x_y
    return a + bx * x + by * y

# generate a regular grid in x and y
x = np.linspace(-1, 1, 10)
y = np.linspace(-1, 1, 10)
X, Y = np.meshgrid(x, y)
x = X.flatten()
y = Y.flatten()

# model truth
Z = model2((x, y), 1, 2, 3)

# add some noise
rng = np.random.default_rng(1)
Zerr = 1
Z = rng.normal(Z, Zerr)

plt.scatter(x, y, c=Z)
plt.colorbar();
[25]:
c2 = cost.LeastSquares((x, y), Z, Zerr, model2)
m2 = Minuit(c2, 0, 0, 0)
m2.migrad()

Multivariate fits are difficult to check by eye. Here we use color to indicate the function value.

To guarantee that the plot of the function and the plot of the data use the same color scale, we use the same normalising function for pyplot.pcolormesh and pyplot.scatter.

[26]:
xm = np.linspace(-1, 1, 100)
ym = np.linspace(-1, 1, 100)
Xm, Ym = np.meshgrid(xm, ym)
xm = Xm.flatten()
ym = Ym.flatten()

qm = plt.pcolormesh(Xm, Ym, model2((xm, ym), *m2.values).reshape(Xm.shape))
plt.scatter(c2.x[0], c2.x[1], c=c2.y, edgecolors="w", norm=qm.norm)
plt.colorbar()

Robust least-squares

The builtin least-squares function also supports robust fitting with alternative loss functions. See the documentation of iminuit.cost.LeastSquares for details. Users can pass their own loss functions. Builtin loss functions are:

  • linear (default): gives ordinary weighted least-squares

  • soft_l1: quadratic ordinary loss for small deviations (\(\ll 1\sigma\)), linear loss for large deviations (\(\gg 1\sigma\)), and smooth interpolation in between
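The behaviour of the two losses can be compared directly. The sketch below assumes that soft_l1 has the same functional form as the loss of that name in scipy.optimize.least_squares, \(2(\sqrt{1 + z^2} - 1)\) with \(z\) the residual in units of \(\sigma\). The function names are illustrative.

```python
import math


def linear_loss(z_sqr):
    """Ordinary least-squares: the loss is the squared standardized residual."""
    return z_sqr


def soft_l1_loss(z_sqr):
    """2 * (sqrt(1 + z^2) - 1): about z^2 for small z, about 2|z| for large z.

    Assumed form, following scipy.optimize.least_squares.
    """
    return 2.0 * (math.sqrt(1.0 + z_sqr) - 1.0)


for z in (0.1, 1.0, 10.0, 100.0):
    print(f"z={z:6.1f}  linear={linear_loss(z * z):10.2f}  "
          f"soft_l1={soft_l1_loss(z * z):8.2f}")
```

An outlier at 100 standard deviations contributes 10000 to the ordinary sum but only about 200 with soft_l1, so it pulls the fit far less.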

Let’s create one outlier and see what happens with ordinary loss.

[27]:
c.y[3] = 3 # generate an outlier

m3 = Minuit(c, a=0, b=0)
m3.migrad()
[28]:
plt.errorbar(c.x, c.y, c.yerror, fmt="ok", label="data")
plt.plot(c.x, model(c.x, 1, 2), ls="--", label="truth")
plt.plot(c.x, model(c.x, *m3.values), label="fit")
plt.legend();

The result is distorted by the outlier. Note that the error did not increase! The size of the error computed by Minuit does not include mismodelling.

We can repair this by switching to “soft_l1” loss.

[29]:
c.loss = "soft_l1"
m3.migrad()
[30]:
plt.errorbar(c.x, c.y, c.yerror, fmt="ok", label="data")
plt.plot(c.x, model(c.x, *truth), ls="--", label="truth")
plt.plot(c.x, model(c.x, *m3.values), label="fit")
plt.legend();

The result is almost identical to the previous case without an outlier.

Robust fitting is very useful if the data are contaminated with small amounts of outliers. It comes with a price, however: the uncertainties are in general larger and the errors computed by Minuit are not correct anymore.

Calculating the parameter uncertainty properly for this case requires a so-called sandwich estimator, which is currently not implemented. As an alternative, one can use the bootstrap to compute parameter uncertainties. We use the resample library to do this.

[31]:
from resample.bootstrap import variance as bvar

def fit(x, y, ye):
    c = cost.LeastSquares(x, y, ye, model, loss="soft_l1")
    m = Minuit(c, a=0, b=0)
    m.migrad()
    return m.values

berr = bvar(fit, c.x, c.y, c.yerror, size=1000, random_state=1) ** 0.5

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
for i, axi in enumerate(ax):
    axi.errorbar(0, m1.values[i], m1.errors[i], fmt="o")
    axi.errorbar(1, m3.values[i], m3.errors[i], fmt="o")
    axi.errorbar(2, m3.values[i], berr[i], fmt="o")
    axi.set_xticks(np.arange(3), ("no outlier", "Minuit, soft_l1", "bootstrap"));

In this case, Minuit’s estimate is similar to the bootstrap estimate, but that is not generally true when the “soft_l1” loss is used.
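For readers who want to see the idea behind the bootstrap estimate without extra dependencies, here is a minimal standard-library sketch. It makes several assumptions that are not taken from the notebook: the model is a straight line y = a + b x, a closed-form weighted least-squares fit stands in for the Minuit fit, and the data points (x, y, yerror) are resampled jointly with replacement. The names fit_line and bootstrap_errors and the toy data are hypothetical.

```python
import random
import statistics


def fit_line(x, y, ye):
    """Closed-form weighted least-squares fit of y = a + b * x."""
    w = [1.0 / e ** 2 for e in ye]
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx ** 2
    b = (sw * sxy - sx * sy) / det
    a = (sy - b * sx) / sw
    return a, b


def bootstrap_errors(x, y, ye, size=1000, seed=1):
    """Standard deviation of the fitted (a, b) over bootstrap replicas."""
    rng = random.Random(seed)
    n = len(x)
    replicas = []
    while len(replicas) < size:
        # Draw n points with replacement, keeping (x, y, ye) together.
        idx = [rng.randrange(n) for _ in range(n)]
        if len({x[i] for i in idx}) < 2:
            continue  # slope is undefined if all drawn x values coincide
        replicas.append(
            fit_line([x[i] for i in idx], [y[i] for i in idx], [ye[i] for i in idx])
        )
    return tuple(statistics.stdev([r[k] for r in replicas]) for k in range(2))


# Toy data: a straight line with Gaussian noise (illustrative values).
rng = random.Random(0)
x = [0.1 * i for i in range(20)]
ye = [0.1] * len(x)
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.1) for xi in x]

a_hat, b_hat = fit_line(x, y, ye)
err_a, err_b = bootstrap_errors(x, y, ye)
print(f"a = {a_hat:.3f} +/- {err_a:.3f}, b = {b_hat:.3f} +/- {err_b:.3f}")
```

The spread of the refitted parameters over the replicas estimates the uncertainty directly from the data, so it does not rely on the cost function having the shape of a proper chi-square.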

Robust fits are very powerful when outliers cannot be removed by other means. If outliers can be identified independently, it is better to remove them. We manually remove the point (using the mask attribute) and switch back to the ordinary “linear” loss.

[32]:
c.mask = np.arange(len(c.x)) != 3
c.loss = "linear"
m4 = Minuit(c, a=0, b=0)
m4.migrad()

Now the uncertainties are essentially the same as before adding the outlier.

[33]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
for i, axi in enumerate(ax):
    axi.errorbar(0, m1.values[i], m1.errors[i], fmt="o")
    axi.errorbar(1, m3.values[i], m3.errors[i], fmt="o")
    axi.errorbar(2, m3.values[i], berr[i], fmt="o")
    axi.errorbar(3, m4.values[i], m4.errors[i], fmt="o")
    axi.set_xticks(np.arange(4), ("no outlier", "Minuit, soft_l1", "bootstrap", "outlier removed"));
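Conceptually, the mask set in cell [32] excludes the flagged points from the cost sum. The plain-Python sketch below illustrates this under the assumption of a straight-line model y = a + b x; the function chi2 and the toy numbers are illustrative and do not reproduce iminuit's implementation.

```python
def chi2(x, y, ye, a, b, mask=None):
    """Sum of squared pulls for y = a + b * x over the points where mask is True."""
    if mask is None:
        mask = [True] * len(x)
    return sum(
        ((yi - (a + b * xi)) / ei) ** 2
        for xi, yi, ei, keep in zip(x, y, ye, mask)
        if keep
    )


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 17.0, 9.0]  # the point at index 3 is an outlier (line value: 7.0)
ye = [1.0] * len(x)

# Same idea as np.arange(len(x)) != 3: keep everything except index 3.
mask = [i != 3 for i in range(len(x))]

print(chi2(x, y, ye, 1.0, 2.0))        # 100.0, dominated by the outlier
print(chi2(x, y, ye, 1.0, 2.0, mask))  # 0.0, the outlier no longer contributes
```

Because the masked point never enters the sum, the fit behaves as if the point had not been measured, which is why the ordinary loss can be used again.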