# Introduction

Measurements in High Energy Physics (HEP) rely on determining the compatibility of observed collision events with theoretical predictions. The relationship between them is often formalised in a statistical model $$f(\bm{x}|\fullset)$$ describing the probability of data $$\bm{x}$$ given model parameters $$\fullset$$. Given observed data, the likelihood $$\mathcal{L}(\fullset)$$ then serves as the basis to test hypotheses on the parameters $$\fullset$$. For measurements based on binned data (histograms), the $$\HiFa{}$$ family of statistical models has been widely used in both Standard Model measurements [intro-4] and searches for new physics [intro-5]. This package presents a declarative, plain-text format for describing $$\HiFa{}$$-based likelihoods, targeted at reinterpretation and long-term preservation in analysis data repositories such as HEPData [intro-3].

## HistFactory

Statistical models described using $$\HiFa{}$$ [intro-2] center around the simultaneous measurement of disjoint binned distributions (channels) observed as event counts $$\channelcounts$$. For each channel, the overall expected event rate (footnote 1) is the sum over a number of physics processes (samples). The sample rates may be subject to parametrised variations, both to express the effect of free parameters $$\freeset$$ (footnote 2) and to account for systematic uncertainties as a function of constrained parameters $$\constrset$$. The degree to which the latter can cause a deviation of the expected event rates from the nominal rates is limited by constraint terms. In a frequentist framework these constraint terms can be viewed as auxiliary measurements with additional global observable data $$\auxdata$$, which, paired with the channel data $$\channelcounts$$, complete the observation $$\bm{x} = (\channelcounts,\auxdata)$$. In addition to the partition of the full parameter set into free and constrained parameters $$\fullset = (\freeset,\constrset)$$, a separate partition $$\fullset = (\poiset,\nuisset)$$ is useful in the context of hypothesis testing, where a subset of the parameters is declared as parameters of interest $$\poiset$$ and the remaining ones as nuisance parameters $$\nuisset$$.

$$f(\bm{x}|\fullset) = f(\bm{x}|\overbrace{\freeset}^{\llap{\text{free}}},\underbrace{\constrset}_{\llap{\text{constrained}}}) = f(\bm{x}|\overbrace{\poiset}^{\rlap{\text{parameters of interest}}},\underbrace{\nuisset}_{\rlap{\text{nuisance parameters}}}) \tag{1}$$

Thus, the overall structure of a $$\HiFa{}$$ probability model is a product of the analysis-specific model term describing the measurements of the channels and the analysis-independent set of constraint terms:

$$f(\channelcounts, \auxdata \,|\,\freeset,\constrset) = \underbrace{\color{blue}{\prod_{c\in\mathrm{\,channels}} \prod_{b \in \mathrm{\,bins}_c}\textrm{Pois}\left(n_{cb} \,\middle|\, \nu_{cb}\left(\freeset,\constrset\right)\right)}}_{\substack{\text{Simultaneous measurement}\\ \text{of multiple channels}}} \underbrace{\color{red}{\prod_{\singleconstr \in \constrset} c_{\singleconstr}(a_{\singleconstr} |\, \singleconstr)}}_{\substack{\text{constraint terms}\\ \text{for }\unicode{x201C}\text{auxiliary measurements}\unicode{x201D}}}, \tag{2}$$

where, within a certain integrated luminosity, we observe $$n_{cb}$$ events given the expected rate of events $$\nu_{cb}(\freeset,\constrset)$$ as a function of unconstrained parameters $$\freeset$$ and constrained parameters $$\constrset$$. Each of the latter has a corresponding one-dimensional constraint term $$c_\singleconstr(a_\singleconstr|\,\singleconstr)$$ with auxiliary data $$a_\singleconstr$$ constraining the parameter $$\singleconstr$$. The event rates $$\nu_{cb}$$ are defined as
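As a rough illustration of the product structure in (2), the sketch below evaluates the logarithm of the model for already-computed rates. The function names, the dictionary layout, and the restriction to Gaussian constraint terms are assumptions made for this example only; they are not part of the package.

```python
import math


def log_poisson(n, rate):
    """Log of Pois(n | rate); lgamma also admits non-integer n."""
    return n * math.log(rate) - rate - math.lgamma(n + 1.0)


def log_gaussian(a, mean, sigma):
    """Log of Gaus(a | mean, sigma)."""
    z = (a - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def log_model(counts, rates, constraints):
    """Log of equation (2): Poisson terms for every bin plus constraint terms.

    counts, rates: dict mapping channel name -> per-bin n_cb and nu_cb.
    constraints: list of (aux_datum, parameter_value, sigma) tuples;
                 only Gaussian constraints are handled in this sketch.
    """
    total = 0.0
    for channel, observed in counts.items():
        for n, nu in zip(observed, rates[channel]):
            total += log_poisson(n, nu)
    for aux, value, sigma in constraints:
        total += log_gaussian(aux, value, sigma)
    return total


# Hypothetical numbers: two channels, one constrained parameter at alpha = 0.3.
counts = {"signal_region": [12, 7], "control_region": [55]}
rates = {"signal_region": [10.5, 8.2], "control_region": [52.0]}
constraints = [(0.0, 0.3, 1.0)]
print(log_model(counts, rates, constraints))
```

Working with the logarithm turns the double product over channels and bins into a plain sum, which is numerically more stable than multiplying many small probabilities.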

$$\nu_{cb}\left(\fullset\right) = \sum_{s\in\mathrm{\,samples}} \nu_{scb}\left(\freeset,\constrset\right) = \sum_{s\in\mathrm{\,samples}}\underbrace{\left(\prod_{\kappa\in\,\bm{\kappa}} \kappa_{scb}\left(\freeset,\constrset\right)\right)}_{\text{multiplicative modifiers}}\, \Bigg(\nu_{scb}^0\left(\freeset, \constrset\right) + \underbrace{\sum_{\Delta\in\bm{\Delta}} \Delta_{scb}\left(\freeset,\constrset\right)}_{\text{additive modifiers}}\Bigg)\,. \tag{3}$$

The total rates are the sum over the sample rates $$\nu_{scb}$$, each determined from a nominal rate $$\nu_{scb}^0$$ and a set of multiplicative and additive rate modifiers, denoted $$\bm{\kappa}(\fullset)$$ and $$\bm{\Delta}(\fullset)$$ respectively. These modifiers are functions of the model parameters, usually of a single one. Starting from constant nominal rates, one can derive the per-bin event rate modification by iterating over all sample rate modifications as shown in (3).
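The order of operations in (3), additive modifiers first and multiplicative modifiers afterwards, can be sketched for a single channel as follows. The sample layout (`nominal`, `additive`, `multiplicative`) is a hypothetical structure chosen for this example, and the modifier values are assumed to be already evaluated at the current parameter point.

```python
def expected_rates(samples):
    """Evaluate equation (3) for one channel.

    Each sample is a dict with
      "nominal":        per-bin nominal rates nu^0_scb,
      "additive":       list of per-bin Delta_scb values,
      "multiplicative": list of per-bin kappa_scb values.
    """
    n_bins = len(samples[0]["nominal"])
    total = [0.0] * n_bins
    for sample in samples:
        for b in range(n_bins):
            # Nominal rate plus the sum of all additive modifiers ...
            rate = sample["nominal"][b] + sum(d[b] for d in sample["additive"])
            # ... scaled by the product of all multiplicative modifiers.
            for kappa in sample["multiplicative"]:
                rate *= kappa[b]
            total[b] += rate
    return total


# Hypothetical two-bin channel: a signal scaled by mu = 2 and a background
# with one additive shift and a 10% normalisation change.
signal = {"nominal": [5.0, 10.0], "additive": [], "multiplicative": [[2.0, 2.0]]}
background = {
    "nominal": [50.0, 60.0],
    "additive": [[1.0, -2.0]],
    "multiplicative": [[1.1, 1.1]],
}
print(expected_rates([signal, background]))  # approximately [66.1, 83.8]
```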

As summarised in the Modifiers and Constraints table, rate modifications are defined in $$\HiFa{}$$ for bin $$b$$, sample $$s$$, and channel $$c$$. Each modifier is represented by a parameter $$\phi \in \{\gamma, \alpha, \lambda, \mu\}$$. By convention, bin-wise parameters are denoted with $$\gamma$$ and interpolation parameters with $$\alpha$$. The luminosity $$\lambda$$ and scale factors $$\mu$$ affect all bins equally. For constrained modifiers, the table gives the implied constraint term as well as the input data required to construct it. Here $$\sigma_b$$ corresponds to the relative uncertainty of the event rate, whereas $$\delta_b$$ is the event rate uncertainty of the sample relative to the total event rate $$\nu_b = \sum_s \nu^0_{sb}$$.

Modifiers implementing uncertainties are paired with a corresponding default constraint term on the parameter limiting the rate modification. The available modifiers may affect only the total number of expected events of a sample within a given channel, i.e. only change its normalisation, while holding the distribution of events across the bins of a channel, i.e. its “shape”, invariant. Alternatively, modifiers may change the sample shapes. Here $$\HiFa{}$$ supports correlated and uncorrelated bin-by-bin shape modifications. In the former, a single nuisance parameter affects the expected sample rates within the bins of a given channel, while the latter introduces one nuisance parameter for each bin, each with its own constraint term. For the correlated shape and normalisation uncertainties, $$\HiFa{}$$ makes use of interpolating functions, $$f_p$$ and $$g_p$$, constructed from a small number of evaluations of the expected rate at fixed values of the parameter $$\alpha$$ (footnote 3). For the remaining modifiers, the parameter directly affects the rate.
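The text does not fix the functional form of $$f_p$$ and $$g_p$$. As one simple possibility, the sketch below uses a piecewise-linear form for the additive case and a piecewise-exponential form for the multiplicative case; both reproduce the provided values at $$\alpha=\pm1$$ and reduce to the identity operation at $$\alpha=0$$. The argument names are assumptions of this example.

```python
def f_p(alpha, delta_down, delta_up):
    """Additive interpolation (piecewise linear), with f_p(0) = 0.

    delta_down and delta_up are the additive shifts at alpha = -1 and +1.
    """
    if alpha >= 0:
        return alpha * delta_up
    return -alpha * delta_down


def g_p(alpha, kappa_down, kappa_up):
    """Multiplicative interpolation (piecewise exponential), with g_p(0) = 1.

    kappa_down and kappa_up are the scale factors at alpha = -1 and +1.
    """
    if alpha >= 0:
        return kappa_up ** alpha
    return kappa_down ** (-alpha)


# Hypothetical inputs: a bin shifted by -3 / +4 events and a 5% / 10% normalisation change.
for alpha in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(alpha, f_p(alpha, -3.0, 4.0), g_p(alpha, 0.95, 1.10))
```

The exponential form keeps the multiplicative modifier positive for any value of $$\alpha$$, which a linear form would not guarantee.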

Modifiers and Constraints

| Description | Modification | Constraint Term $$c_\singleconstr$$ | Input |
| --- | --- | --- | --- |
| Uncorrelated Shape | $$\kappa_{scb}(\gamma_b) = \gamma_b$$ | $$\prod_b \mathrm{Pois}\left(r_b = \sigma_b^{-2}\middle\vert\,\rho_b = \sigma_b^{-2}\gamma_b\right)$$ | $$\sigma_{b}$$ |
| Correlated Shape | $$\Delta_{scb}(\alpha) = f_p\left(\alpha\middle\vert\,\Delta_{scb,\alpha=-1},\Delta_{scb,\alpha = 1}\right)$$ | $$\displaystyle\mathrm{Gaus}\left(a = 0\middle\vert\,\alpha,\sigma = 1\right)$$ | $$\Delta_{scb,\alpha=\pm1}$$ |
| Normalisation Unc. | $$\kappa_{scb}(\alpha) = g_p\left(\alpha\middle\vert\,\kappa_{scb,\alpha=-1},\kappa_{scb,\alpha=1}\right)$$ | $$\displaystyle\mathrm{Gaus}\left(a = 0\middle\vert\,\alpha,\sigma = 1\right)$$ | $$\kappa_{scb,\alpha=\pm1}$$ |
| MC Stat. Uncertainty | $$\kappa_{scb}(\gamma_b) = \gamma_b$$ | $$\prod_b \mathrm{Gaus}\left(a_{\gamma_b} = 1\middle\vert\,\gamma_b,\delta_b\right)$$ | $$\delta_b^2 = \sum_s\delta^2_{sb}$$ |
| Luminosity | $$\kappa_{scb}(\lambda) = \lambda$$ | $$\displaystyle\mathrm{Gaus}\left(l = \lambda_0\middle\vert\,\lambda,\sigma_\lambda\right)$$ | $$\lambda_0,\sigma_\lambda$$ |
| Normalisation | $$\kappa_{scb}(\mu_b) = \mu_b$$ | | |
| Data-driven Shape | $$\kappa_{scb}(\gamma_b) = \gamma_b$$ | | |
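Two of the constraint terms listed above can be evaluated directly as a check of how they limit the parameters. The sketch below computes the log of the Poisson constraint for the uncorrelated shape modifier and of the Gaussian constraint for the luminosity; since $$r_b = \sigma_b^{-2}$$ is generally not an integer, the factorial is generalised with the gamma function, which is an assumption of this example rather than something stated in the text. Function names are illustrative.

```python
import math


def log_poisson(r, rho):
    """Log of Pois(r | rho), with r! generalised to Gamma(r + 1)."""
    return r * math.log(rho) - rho - math.lgamma(r + 1.0)


def log_uncorrelated_shape_constraint(gammas, sigmas):
    """Log of prod_b Pois(r_b = sigma_b^-2 | rho_b = sigma_b^-2 * gamma_b)."""
    total = 0.0
    for gamma, sigma in zip(gammas, sigmas):
        r = sigma ** -2
        total += log_poisson(r, r * gamma)
    return total


def log_luminosity_constraint(lam, lam0, sigma_lam):
    """Log of Gaus(l = lam0 | lam, sigma_lam)."""
    z = (lam0 - lam) / sigma_lam
    return -0.5 * z * z - math.log(sigma_lam * math.sqrt(2.0 * math.pi))


# Hypothetical relative uncertainties of 10% and 20% in two bins.
sigmas = [0.10, 0.20]
for gamma in (0.9, 1.0, 1.1):
    print(gamma, log_uncorrelated_shape_constraint([gamma, gamma], sigmas))
print(log_luminosity_constraint(1.02, 1.0, 0.02))
```

The Poisson constraint is maximal at $$\gamma_b = 1$$ and falls off more steeply for bins with smaller $$\sigma_b$$, so a smaller relative uncertainty constrains the parameter more tightly.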

Given the likelihood $$\mathcal{L}(\fullset)$$, constructed from observed data in all channels and the implied auxiliary data, measurements in the form of point and interval estimates can be defined. The majority of the parameters are nuisance parameters, i.e. parameters that are not the main target of the measurement but are necessary to correctly model the data. A small subset of the unconstrained parameters may be declared as parameters of interest, for which measurements and hypothesis tests are performed, e.g. using profile likelihood methods [intro-1]. The Symbol Notation table provides a summary of all the notation introduced in this documentation.
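To make the role of the parameter of interest and the nuisance parameters concrete, the following sketch computes a profile likelihood ratio for a deliberately small model that is not taken from the text: one bin with $$n \sim \mathrm{Pois}(\mu s + b)$$ and one auxiliary count $$m \sim \mathrm{Pois}(\tau b)$$ constraining the background $$b$$. Here $$\mu$$ is the parameter of interest and $$b$$ the nuisance parameter. All names and numbers are assumptions of this example, and $$\hat{\mu}$$ is left unbounded.

```python
import math


def log_poisson(n, rate):
    """Log of Pois(n | rate)."""
    return n * math.log(rate) - rate - math.lgamma(n + 1.0)


def log_likelihood(mu, b, n, m, s, tau):
    """Main measurement n ~ Pois(mu*s + b) times auxiliary m ~ Pois(tau*b)."""
    return log_poisson(n, mu * s + b) + log_poisson(m, tau * b)


def profiled_background(mu, n, m, s, tau):
    """Value of b maximising the likelihood at fixed mu.

    Setting d(ln L)/db = 0 gives the quadratic
    (1 + tau) b^2 + ((1 + tau) mu s - n - m) b - m mu s = 0,
    of which the positive root is returned.
    """
    a = 1.0 + tau
    lin = a * mu * s - n - m
    return (-lin + math.sqrt(lin * lin + 4.0 * a * m * mu * s)) / (2.0 * a)


def profile_likelihood_ratio(mu, n, m, s, tau):
    """-2 ln [ L(mu, b profiled at mu) / L(mu_hat, b_hat) ]."""
    b_hat = m / tau                # global estimate of the nuisance parameter
    mu_hat = (n - b_hat) / s       # global estimate of the parameter of interest
    b_cond = profiled_background(mu, n, m, s, tau)
    return -2.0 * (
        log_likelihood(mu, b_cond, n, m, s, tau)
        - log_likelihood(mu_hat, b_hat, n, m, s, tau)
    )


# Hypothetical counts: n = 25 observed, m = 100 auxiliary, s = 10, tau = 5.
for mu in (0.0, 0.5, 1.0):
    print(mu, profile_likelihood_ratio(mu, 25, 100, 10.0, 5.0))
```

The ratio vanishes at $$\mu = \hat{\mu}$$ and grows as the tested value moves away from it; test statistics of this kind are the starting point of the methods in [intro-1].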

Symbol Notation

| Symbol | Name |
| --- | --- |
| $$f(\bm{x} \vert \fullset)$$ | model |
| $$\mathcal{L}(\fullset)$$ | likelihood |
| $$\bm{x} = \{\channelcounts, \auxdata\}$$ | full dataset (including auxiliary data) |
| $$\channelcounts$$ | channel data (or event counts) |
| $$\auxdata$$ | auxiliary data |
| $$\nu(\fullset)$$ | calculated event rates |
| $$\fullset = \{\freeset, \constrset\} = \{\poiset, \nuisset\}$$ | all parameters |
| $$\freeset$$ | free parameters |
| $$\constrset$$ | constrained parameters |
| $$\poiset$$ | parameters of interest |
| $$\nuisset$$ | nuisance parameters |
| $$\bm{\kappa}(\fullset)$$ | multiplicative rate modifier |
| $$\bm{\Delta}(\fullset)$$ | additive rate modifier |
| $$c_\singleconstr(a_\singleconstr \vert \singleconstr)$$ | constraint term for constrained parameter $$\singleconstr$$ |
| $$\sigma_\singleconstr$$ | relative uncertainty in the constrained parameter |

## Declarative Formats

While flexible enough to describe a wide range of LHC measurements, the design of the $$\HiFa{}$$ specification is sufficiently simple to admit a declarative format that fully encodes the statistical model of the analysis. This format defines the channels, all associated samples, their parameterised rate modifiers and implied constraint terms as well as the measurements. Additionally, the format represents the mathematical model, leaving the implementation of the likelihood minimisation to be analysis-dependent and/or language-dependent. Originally XML was chosen as a specification language to define the structure of the model while introducing a dependence on $$\Root{}$$ to encode the nominal rates and required input data of the constraint terms [intro-2]. Using this specification, a model can be constructed and evaluated within the $$\RooFit{}$$ framework.

This package introduces an updated form of the specification based on the ubiquitous plain-text JSON format and its schema language, JSON Schema. Described in more detail in Likelihood Specification, this schema fully specifies both the structure and the necessary constrained data in a single document and is thus implementation-independent.
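To give a feel for what such a single-document description can look like, the sketch below builds a small one-channel model as a Python dictionary and round-trips it through JSON. The key names and modifier type strings (`channels`, `samples`, `modifiers`, `normfactor`, `shapesys`) reflect one reading of the format and should be treated as assumptions here; Likelihood Specification remains the authoritative reference, and the sample names and numbers are invented.

```python
import json

# One channel with two samples. The signal carries a free normalisation
# parameter; the background carries a per-bin uncorrelated shape uncertainty.
spec = {
    "channels": [
        {
            "name": "singlechannel",
            "samples": [
                {
                    "name": "signal",
                    "data": [5.0, 10.0],
                    "modifiers": [
                        {"name": "mu", "type": "normfactor", "data": None}
                    ],
                },
                {
                    "name": "background",
                    "data": [50.0, 60.0],
                    "modifiers": [
                        {
                            "name": "uncorr_bkguncrt",
                            "type": "shapesys",
                            "data": [5.0, 12.0],
                        }
                    ],
                },
            ],
        }
    ]
}

# The document is plain text: serialise it and read it back unchanged.
text = json.dumps(spec, indent=2)
restored = json.loads(text)

# Nominal total rate per bin, i.e. the sum over samples with no modifier applied.
samples = restored["channels"][0]["samples"]
nominal = [sum(bins) for bins in zip(*(s["data"] for s in samples))]
print(nominal)  # [55.0, 70.0]
```

Because nominal rates and modifier inputs sit next to the structure that uses them, no external binary file is needed to rebuild the model.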

### Footnotes

1. Here rate refers to the number of events expected to be observed within a given data-taking interval defined through its integrated luminosity. It often appears as the input parameter to the Poisson distribution, hence the name “rate”.

2. These free parameters frequently include the signal strength of a given process, i.e. its cross-section normalised to a particular reference cross-section such as that expected from the Standard Model or a given BSM scenario.

3. This is usually constructed from the nominal rate and measurements of the event rate at $$\alpha=\pm1$$, where the value of the modifier at $$\alpha=\pm1$$ must be provided and the value at $$\alpha=0$$ corresponds to the identity operation of the modifier, i.e. $$f_{p}(\alpha=0) = 0$$ and $$g_{p}(\alpha = 0)=1$$ for additive and multiplicative modifiers respectively. See Section 4.1 in [intro-2].

### Bibliography

intro-1

Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells. Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C, 71:1554, 2011. arXiv:1007.1727, doi:10.1140/epjc/s10052-011-1554-0.

intro-2

Kyle Cranmer, George Lewis, Lorenzo Moneta, Akira Shibata, and Wouter Verkerke. HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report CERN-OPEN-2012-016, New York U., New York, Jan 2012. URL: https://cds.cern.ch/record/1456844.

intro-3

Eamonn Maguire, Lukas Heinrich, and Graeme Watt. HEPData: a repository for high energy physics data. J. Phys. Conf. Ser., 898(10):102006, 2017. arXiv:1704.05473, doi:10.1088/1742-6596/898/10/102006.

intro-4

ATLAS Collaboration. Measurements of Higgs boson production and couplings in diboson final states with the ATLAS detector at the LHC. Phys. Lett. B, 726:88, 2013. arXiv:1307.1427, doi:10.1016/j.physletb.2014.05.011.

intro-5

ATLAS Collaboration. Search for supersymmetry in final states with missing transverse momentum and multiple $$b$$-jets in proton–proton collisions at $$\sqrt s = 13$$ $$\TeV$$ with the ATLAS detector. ATLAS-CONF-2018-041, 2018. URL: https://cds.cern.ch/record/2632347.