Performance impact of memory layout
Modern CPUs are so fast that they are often wait for memory to be transferred. In case of memory access with a regular pattern, the CPU will prefetch memory that is likely to be used next. We can optimise this a little bit by arranging the numbers in memory that they are easy to fetch. For 1D data, there is not much that we can do, but for ND data, we have the choice between two layouts
x0, y0, … x1, y1, …
x0, x1, …, y0, y1, …
Which one is more efficient is not obvious, so we try both options here. It turns out that the second option is better and that is the one used internally in the builtin cost functions as well.
[1]:
from iminuit import Minuit
from iminuit.cost import UnbinnedNLL
import numpy as np
from scipy.stats import multivariate_normal
rng = np.random.default_rng(1)
xy1 = rng.normal(size=(1_000_000, 2))
xy2 = rng.normal(size=(2, 1_000_000))
def cost1(x, y):
return -np.sum(multivariate_normal.logpdf(xy1, (x, y)))
cost1.errordef = Minuit.LIKELIHOOD
def cost2(x, y):
return -np.sum(multivariate_normal.logpdf(xy2.T, (x, y)))
cost2.errordef = Minuit.LIKELIHOOD
def logpdf(xy, x, y):
return multivariate_normal.logpdf(xy.T, (x, y))
cost3 = UnbinnedNLL(xy2, logpdf, log=True)
[2]:
%%timeit -n 1 -r 1
m = Minuit(cost1, x=0, y=0)
m.migrad()
987 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[3]:
%%timeit -n 1 -r 1
m = Minuit(cost2, x=0, y=0)
m.migrad()
778 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[4]:
%%timeit -n 1 -r 1
m = Minuit(cost3, x=0, y=0)
m.migrad()
782 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
cost2
and cost3
are using the “first all x
then all y
” memory layout. cost3
measures the small overhead incurred by using the built-in cost function UnbinnedNLL
compared to a hand-tailored one.