# Introduction

Measurements in High Energy Physics (HEP) rely on determining the
compatibility of observed collision events with theoretical predictions.
The relationship between them is often formalised in a statistical *model*
\(f(\bm{x}|\fullset)\) describing the probability of data
\(\bm{x}\) given model parameters \(\fullset\). Given observed
data, the *likelihood* \(\mathcal{L}(\fullset)\) then serves as the basis to test
hypotheses on the parameters \(\fullset\). For measurements based
on binned data (*histograms*), the \(\HiFa{}\) family of statistical models has been widely used
in both Standard Model measurements [intro-4] and
searches for new
physics [intro-5]. This package presents a
declarative, plain-text format for describing \(\HiFa{}\)-based likelihoods,
targeted at reinterpretation and long-term
preservation in analysis data repositories such as
HEPData [intro-3].

## HistFactory

Statistical models described using \(\HiFa{}\) [intro-2]
center around the simultaneous measurement of disjoint binned
distributions (*channels*) observed as event counts \(\channelcounts\). For
each channel, the overall expected event rate [1] is the sum over a
number of physics processes (*samples*). The sample rates may be subject to
parametrised variations, both to express the effect of *free parameters*
\(\freeset\) [2] and to account for systematic uncertainties as a
function of *constrained parameters* \(\constrset\). The degree to which the latter can cause
a deviation of the expected event rates from the nominal rates is
limited by *constraint terms*. In a frequentist framework these constraint terms can be
viewed as *auxiliary measurements* with additional global observable data \(\auxdata\), which
paired with the channel data \(\channelcounts\) completes the
observation \(\bm{x} =
(\channelcounts,\auxdata)\). In addition to the partition of the full
parameter set into free and constrained parameters \(\fullset =
(\freeset,\constrset)\), a separate partition \(\fullset =
(\poiset,\nuisset)\) will be useful in the context of hypothesis testing,
where a subset of the parameters are declared *parameters of interest* \(\poiset\) and the
remaining ones as *nuisance parameters* \(\nuisset\).

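Thus, the overall structure of a \(\HiFa{}\) probability model is a product of the analysis-specific model term describing the measurements of the channels and the analysis-independent set of constraint terms:

\[
f(\channelcounts, \auxdata \,|\, \freeset, \constrset) = \prod_{c \in \mathrm{channels}} \prod_{b \in \mathrm{bins}_c} \mathrm{Pois}\left(n_{cb} \,\middle|\, \nu_{cb}(\freeset, \constrset)\right) \prod_{\singleconstr \in \constrset} c_\singleconstr(a_\singleconstr |\, \singleconstr),
\]

where within a certain integrated luminosity we observe \(n_{cb}\) events given the expected rate of events \(\nu_{cb}(\freeset,\constrset)\) as a function of unconstrained parameters \(\freeset\) and constrained parameters \(\constrset\). The latter has corresponding one-dimensional constraint terms \(c_\singleconstr(a_\singleconstr|\,\singleconstr)\) with auxiliary data \(a_\singleconstr\) constraining the parameter \(\singleconstr\). The event rates \(\nu_{cb}\) are defined as

\[
\nu_{cb}(\fullset) = \sum_{s \in \mathrm{samples}} \nu_{scb}(\freeset, \constrset) = \sum_{s \in \mathrm{samples}} \left( \prod_{\kappa \in \bm{\kappa}} \kappa_{scb}(\freeset, \constrset) \right) \left( \nu_{scb}^0 + \sum_{\Delta \in \bm{\Delta}} \Delta_{scb}(\freeset, \constrset) \right). \tag{3}
\]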

The total rates are the sum over sample rates \(\nu_{scb}\), each
determined from a *nominal rate* \(\nu_{scb}^0\) and a set of multiplicative and
additive *rate modifiers*, denoted \(\bm{\kappa}(\fullset)\) and
\(\bm{\Delta}(\fullset)\) respectively. These modifiers are functions of (usually
single) model parameters. Starting from constant nominal rates, one
can derive the per-bin event rate modification by iterating over all
sample rate modifications as shown in (3).
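The rate computation described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the dictionary layout and all numbers are hypothetical and are not part of any specification. Each multiplicative modifier is given as one factor per bin and each additive modifier as one shift per bin, already evaluated at fixed parameter values.

```python
from math import prod

def sample_rates(nominal, kappas=(), deltas=()):
    """Per-bin rates of one sample: (product of kappas) * (nominal + sum of deltas).

    nominal: nominal rates, one per bin.
    kappas:  multiplicative modifiers, each a list with one factor per bin.
    deltas:  additive modifiers, each a list with one shift per bin.
    """
    rates = []
    for b, nu0 in enumerate(nominal):
        kappa_b = prod(k[b] for k in kappas)  # empty product is 1
        delta_b = sum(d[b] for d in deltas)   # empty sum is 0
        rates.append(kappa_b * (nu0 + delta_b))
    return rates

def channel_rates(samples):
    """Total per-bin rates of a channel: the sum over its samples."""
    per_sample = [sample_rates(**s) for s in samples]
    return [sum(bins) for bins in zip(*per_sample)]

# Hypothetical two-bin channel with a signal and a background sample.
mu = 2.0  # free normalisation parameter scaling the signal in every bin
samples = [
    {"nominal": [5.0, 10.0], "kappas": [[mu, mu]]},
    {"nominal": [50.0, 60.0], "deltas": [[1.0, -2.0]], "kappas": [[1.1, 1.1]]},
]
print(channel_rates(samples))
```

Additive shifts are applied to the nominal rate before the multiplicative factors, following the ordering described above.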

As summarised in Modifiers and Constraints, rate modifications are defined in \(\HiFa{}\) for bin \(b\), sample \(s\), channel \(c\). Each modifier is represented by a parameter \(\phi \in \{\gamma, \alpha, \lambda, \mu\}\). By convention bin-wise parameters are denoted with \(\gamma\) and interpolation parameters with \(\alpha\). The luminosity \(\lambda\) and scale factors \(\mu\) affect all bins equally. For constrained modifiers, the implied constraint term is given as well as the necessary input data required to construct it. \(\sigma_b\) corresponds to the relative uncertainty of the event rate, whereas \(\delta_b\) is the event rate uncertainty of the sample relative to the total event rate \(\nu_b = \sum_s \nu^0_{sb}\).

Modifiers implementing uncertainties are paired with a corresponding default constraint term on the parameter limiting the rate modification. The available modifiers may affect only the total number of expected events of a sample within a given channel, i.e. only change its normalisation, while holding the distribution of events across the bins of a channel, i.e. its “shape”, invariant. Alternatively, modifiers may change the sample shapes. Here \(\HiFa{}\) supports correlated and uncorrelated bin-by-bin shape modifications. In the former, a single nuisance parameter affects the expected sample rates within the bins of a given channel, while the latter introduces one nuisance parameter for each bin, each with its own constraint term. For the correlated shape and normalisation uncertainties, \(\HiFa{}\) makes use of interpolating functions, \(f_p\) and \(g_p\), constructed from a small number of evaluations of the expected rate at fixed values of the parameter \(\alpha\) [3]. For the remaining modifiers, the parameter directly affects the rate.
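As a minimal sketch of what such interpolating functions can look like, the snippet below assumes a piecewise-linear form for the additive \(f_p\) and a piecewise-exponential form for the multiplicative \(g_p\). These particular forms are an assumption made for illustration; other interpolation schemes exist, and the function names are hypothetical. Both reproduce the supplied values at \(\alpha=\pm1\) and reduce to the identity operation at \(\alpha=0\).

```python
def additive_interp(alpha, delta_down, delta_up):
    """Piecewise-linear f_p: 0 at alpha=0, delta_up at +1, delta_down at -1."""
    if alpha >= 0:
        return alpha * delta_up
    return -alpha * delta_down

def multiplicative_interp(alpha, kappa_down, kappa_up):
    """Piecewise-exponential g_p: 1 at alpha=0, kappa_up at +1, kappa_down at -1."""
    if alpha >= 0:
        return kappa_up ** alpha
    return kappa_down ** (-alpha)

# Hypothetical inputs: shifts of -3 / +4 events and factors of 0.9 / 1.2.
for alpha in (-1.0, 0.0, 1.0):
    print(alpha,
          additive_interp(alpha, -3.0, 4.0),
          multiplicative_interp(alpha, 0.9, 1.2))
```

The exponential form keeps the multiplicative factor positive for any value of \(\alpha\), which is one reason it is a common choice for normalisation uncertainties.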

| Description | Modification | Constraint Term \(c_\singleconstr\) | Input |
|---|---|---|---|
| Uncorrelated Shape | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | \(\prod_b \mathrm{Pois}\left(r_b = \sigma_b^{-2}\middle\vert\,\rho_b = \sigma_b^{-2}\gamma_b\right)\) | \(\sigma_{b}\) |
| Correlated Shape | \(\Delta_{scb}(\alpha) = f_p\left(\alpha\middle\vert\,\Delta_{scb,\alpha=-1},\Delta_{scb,\alpha = 1}\right)\) | \(\displaystyle\mathrm{Gaus}\left(a = 0\middle\vert\,\alpha,\sigma = 1\right)\) | \(\Delta_{scb,\alpha=\pm1}\) |
| Normalisation Unc. | \(\kappa_{scb}(\alpha) = g_p\left(\alpha\middle\vert\,\kappa_{scb,\alpha=-1},\kappa_{scb,\alpha=1}\right)\) | \(\displaystyle\mathrm{Gaus}\left(a = 0\middle\vert\,\alpha,\sigma = 1\right)\) | \(\kappa_{scb,\alpha=\pm1}\) |
| MC Stat. Uncertainty | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | \(\prod_b \mathrm{Gaus}\left(a_{\gamma_b} = 1\middle\vert\,\gamma_b,\delta_b\right)\) | \(\delta_b^2 = \sum_s\delta^2_{sb}\) |
| Luminosity | \(\kappa_{scb}(\lambda) = \lambda\) | \(\displaystyle\mathrm{Gaus}\left(l = \lambda_0\middle\vert\,\lambda,\sigma_\lambda\right)\) | \(\lambda_0,\sigma_\lambda\) |
| Normalisation | \(\kappa_{scb}(\mu_b) = \mu_b\) | | |
| Data-driven Shape | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | | |
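To make the interplay of the channel measurement and the constraint terms concrete, the following sketch evaluates the logarithm of the model for a hypothetical two-bin channel. It assumes a signal sample scaled by a free normalisation \(\mu\) and a background sample carrying a normalisation uncertainty (parameter \(\alpha\), Gaussian constraint) and an uncorrelated shape uncertainty (parameters \(\gamma_b\), Poisson constraints). All names and numbers are illustrative, and the piecewise-exponential interpolation is an assumption of this sketch.

```python
import math

def log_pois(n, rate):
    """log Pois(n | rate); lgamma extends the factorial to non-integer n."""
    return n * math.log(rate) - rate - math.lgamma(n + 1.0)

def log_gaus(x, mean, sigma):
    """log Gaus(x | mean, sigma)."""
    z = (x - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

# Hypothetical inputs for a single two-bin channel.
OBSERVED = [55.0, 70.0]        # channel data n_b
SIGNAL = [5.0, 10.0]           # nominal signal rates
BACKGROUND = [50.0, 60.0]      # nominal background rates
KAPPA_DOWN, KAPPA_UP = 0.9, 1.1  # normalisation factors at alpha = -1, +1
SIGMA_B = [0.1, 0.2]           # relative per-bin background uncertainties

def log_model(mu, alpha, gammas):
    """Sum of the Poisson channel terms and all constraint terms."""
    # Normalisation uncertainty, interpolated in alpha (assumed exponential form).
    kappa = KAPPA_UP ** alpha if alpha >= 0 else KAPPA_DOWN ** (-alpha)
    # Constraint Gaus(a = 0 | alpha, sigma = 1).
    total = log_gaus(0.0, alpha, 1.0)
    for n, s, b, gamma, sigma in zip(OBSERVED, SIGNAL, BACKGROUND, gammas, SIGMA_B):
        # Channel measurement Pois(n_b | nu_b).
        total += log_pois(n, mu * s + gamma * kappa * b)
        # Constraint Pois(r_b = sigma_b^-2 | rho_b = sigma_b^-2 * gamma_b).
        tau = sigma ** -2
        total += log_pois(tau, tau * gamma)
    return total

print(log_model(1.0, 0.0, [1.0, 1.0]))
```

With these inputs the observed counts equal the nominal prediction, so moving \(\alpha\) or any \(\gamma_b\) away from its nominal value lowers both the channel term and the corresponding constraint term.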

Given the likelihood \(\mathcal{L}(\fullset)\), constructed from
observed data in all channels and the implied auxiliary data, *measurements* in the
form of point and interval estimates can be defined. The majority of the
parameters are *nuisance parameters*, i.e. parameters that are not the main target of the
measurement but are necessary to correctly model the data. A small
subset of the unconstrained parameters may be declared as *parameters of interest*, for which
measurements and hypothesis tests are performed, e.g. using profile likelihood
methods [intro-1]. The Symbol Notation table provides a summary of all the
notation introduced in this documentation.

| Symbol | Name |
|---|---|
| \(f(\bm{x} \vert \fullset)\) | model |
| \(\mathcal{L}(\fullset)\) | likelihood |
| \(\bm{x} = \{\channelcounts, \auxdata\}\) | full dataset (including auxiliary data) |
| \(\channelcounts\) | channel data (or event counts) |
| \(\auxdata\) | auxiliary data |
| \(\nu(\fullset)\) | calculated event rates |
| \(\fullset = \{\freeset, \constrset\} = \{\poiset, \nuisset\}\) | all parameters |
| \(\freeset\) | free parameters |
| \(\constrset\) | constrained parameters |
| \(\poiset\) | parameters of interest |
| \(\nuisset\) | nuisance parameters |
| \(\bm{\kappa}(\fullset)\) | multiplicative rate modifier |
| \(\bm{\Delta}(\fullset)\) | additive rate modifier |
| \(c_\singleconstr(a_\singleconstr \vert \singleconstr)\) | constraint term for constrained parameter \(\singleconstr\) |
| \(\sigma_\singleconstr\) | relative uncertainty in the constrained parameter |

## Declarative Formats

While flexible enough to describe a wide range of LHC measurements, the
design of the \(\HiFa{}\) specification is sufficiently simple to admit a *declarative format* that fully
encodes the statistical model of the analysis. This format defines the
channels, all associated samples, their parameterised rate modifiers and
implied constraint terms as well as the measurements. Additionally, the
format represents the mathematical model, leaving the implementation of
the likelihood minimisation to be analysis-dependent and/or
language-dependent. Originally XML was chosen as a specification
language to define the structure of the model while introducing a
dependence on \(\Root{}\) to encode the nominal rates and required input data of the
constraint terms [intro-2]. Using this
specification, a model can be constructed and evaluated within the
\(\RooFit{}\) framework.

This package introduces an updated form of the specification based on
the ubiquitous plain-text JSON format and its schema-language *JSON Schema*.
Described in more detail in Likelihood Specification, this schema fully specifies both structure
and necessary constrained data in a single document and thus is
implementation independent.
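To give a feel for such a document, the sketch below builds a small single-channel specification as a Python dictionary and serialises it to JSON. The field names and modifier type labels reflect one reading of the layout described in Likelihood Specification and should be checked against the schema, which remains authoritative. The channel, sample and parameter names and all numbers are hypothetical.

```python
import json

# Illustrative specification: one channel, two bins, two samples.
# Field names are assumptions to be verified against the JSON schema.
spec = {
    "channels": [
        {
            "name": "singlechannel",
            "samples": [
                {
                    "name": "signal",
                    "data": [5.0, 10.0],  # nominal rates per bin
                    "modifiers": [
                        {"name": "mu", "type": "normfactor", "data": None}
                    ],
                },
                {
                    "name": "background",
                    "data": [50.0, 60.0],
                    "modifiers": [
                        {"name": "uncorr_bkguncrt", "type": "shapesys", "data": [5.0, 12.0]}
                    ],
                },
            ],
        }
    ],
    "observations": [{"name": "singlechannel", "data": [50.0, 60.0]}],
    "measurements": [
        {"name": "Measurement", "config": {"poi": "mu", "parameters": []}}
    ],
    "version": "1.0.0",
}

# Plain-text serialisation that can be archived alongside the analysis.
text = json.dumps(spec, indent=2)
print(text)
```

Because the document is plain JSON, it can be read back with any JSON parser and validated against the schema independently of the software that later evaluates the likelihood.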

## Additional Material

### Footnotes

- 1
Here rate refers to the number of events expected to be observed within a given data-taking interval defined through its integrated luminosity. It often appears as the input parameter to the Poisson distribution, hence the name “rate”.

- 2
These *free parameters* frequently include the *signal strength* of a given process, i.e. its cross-section normalised to a particular reference cross-section such as that expected from the Standard Model or a given BSM scenario.

- 3
This is usually constructed from the nominal rate and measurements of the event rate at \(\alpha=\pm1\), where the value of the modifier at \(\alpha=\pm1\) must be provided and the value at \(\alpha=0\) corresponds to the identity operation of the modifier, i.e. \(f_{p}(\alpha=0) = 0\) and \(g_{p}(\alpha = 0)=1\) for additive and multiplicative modifiers respectively. See Section 4.1 in [intro-2].

### Bibliography

- intro-1
Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells. Asymptotic formulae for likelihood-based tests of new physics. *Eur. Phys. J. C*, 71:1554, 2011. arXiv:1007.1727, doi:10.1140/epjc/s10052-011-1554-0.

- intro-2
Kyle Cranmer, George Lewis, Lorenzo Moneta, Akira Shibata, and Wouter Verkerke. HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report CERN-OPEN-2012-016, New York U., New York, Jan 2012. URL: https://cds.cern.ch/record/1456844.

- intro-3
Eamonn Maguire, Lukas Heinrich, and Graeme Watt. HEPData: a repository for high energy physics data. *J. Phys. Conf. Ser.*, 898(10):102006, 2017. arXiv:1704.05473, doi:10.1088/1742-6596/898/10/102006.

- intro-4
ATLAS Collaboration. Measurements of Higgs boson production and couplings in diboson final states with the ATLAS detector at the LHC. *Phys. Lett. B*, 726:88, 2013. arXiv:1307.1427, doi:10.1016/j.physletb.2014.05.011.

- intro-5
ATLAS Collaboration. Search for supersymmetry in final states with missing transverse momentum and multiple \(b\)-jets in proton–proton collisions at \(\sqrt s = 13\) \(\TeV \) with the ATLAS detector. ATLAS-CONF-2018-041, 2018. URL: https://cds.cern.ch/record/2632347.