Likelihood Specification#

The structure of the JSON specification of models follows closely the original XML-based specification [likelihood-2].

Workspace#

{
    "$schema": "http://json-schema.org/draft-06/schema#",
    "$id": "https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json",
    "$ref": "defs.json#/definitions/workspace"
}

The document in the snippet above describes a workspace, which includes:

channels: The channels in the model, which include a description of the samples within each channel and their possible parametrised modifiers.
measurements: A set of measurements, which define among others the parameters of interest for a given statistical analysis objective.
observations: The observed data, with which a likelihood can be constructed from the model.

A workspace consists of the channels and exactly one set of observed data, but can include multiple measurements. Given a JSON file, one can quickly check that it conforms to the workspace specification as follows:

import json

import jsonschema
import requests

with open("/path/to/analysis_workspace.json", encoding="utf-8") as ws_file:
    workspace = json.load(ws_file)

# Fetch the schema; if no exception is raised, it was found and parsed.
response = requests.get("https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json")
response.raise_for_status()
schema = response.json()

# If no exception is raised by validate(), the instance is valid.
jsonschema.validate(instance=workspace, schema=schema)

Channel#

A channel is defined by a channel name and a list of samples [likelihood-1].

{
    "channel": {
        "type": "object",
        "properties": {
            "name": { "type": "string" },
            "samples": { "type": "array", "items": {"$ref": "#/definitions/sample"}, "minItems": 1 }
        },
        "required": ["name", "samples"],
        "additionalProperties": false
    }
}

The channels field of a workspace holds a list of channel descriptions. Each channel, an analysis region encompassing one or more measurement bins, has a name field and a samples field (see Channel), which holds a list of sample definitions (see Sample). Each sample definition in turn has a name field, a data field holding the nominal event rates for all bins in the channel, and a modifiers field holding the list of modifiers for the sample.
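As a minimal sketch of the structure described above, a channel can be written as a plain Python dictionary and checked for a consistent number of bins. The helper `check_channel` and the channel and sample names are hypothetical and introduced only for illustration; the check assumes that every sample provides one nominal rate per bin of the channel.

```python
# Hypothetical channel with two samples in a 2-bin region.
channel = {
    "name": "signal_region",
    "samples": [
        {"name": "signal", "data": [5.0, 10.0], "modifiers": []},
        {"name": "background", "data": [50.0, 60.0], "modifiers": []},
    ],
}


def check_channel(channel):
    """Return the number of bins, raising ValueError on inconsistent input."""
    if not channel["samples"]:
        raise ValueError("a channel needs at least one sample")
    # Every sample must provide one nominal rate per bin.
    bin_counts = {len(sample["data"]) for sample in channel["samples"]}
    if len(bin_counts) != 1:
        raise ValueError(f"samples disagree on the number of bins: {bin_counts}")
    return bin_counts.pop()


n_bins = check_channel(channel)
print(n_bins)
```

This only checks the shape of the data; validation against the schema is still needed to catch missing or additional fields.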

Sample#

A sample is defined by a sample name, the sample event rate, and a list of modifiers [likelihood-1].

{
    "sample": {
        "type": "object",
        "properties": {
            "name": { "type": "string" },
            "data": { "type": "array", "items": {"type": "number"}, "minItems": 1 },
            "modifiers": {
                "type": "array",
                "items": {
                    "anyOf": [
                        { "$ref": "#/definitions/modifier/histosys" },
                        { "$ref": "#/definitions/modifier/lumi" },
                        { "$ref": "#/definitions/modifier/normfactor" },
                        { "$ref": "#/definitions/modifier/normsys" },
                        { "$ref": "#/definitions/modifier/shapefactor" },
                        { "$ref": "#/definitions/modifier/shapesys" },
                        { "$ref": "#/definitions/modifier/staterror" }
                    ]
                }
            }
        },
        "required": ["name", "data", "modifiers"],
        "additionalProperties": false
    }
}

Modifiers#

The modifiers that are applicable for a given sample are encoded as a list of JSON objects with three fields: a name field, a type field denoting the class of the modifier, and a data field which provides the necessary input data as denoted in Modifiers and Constraints.

Based on the declared modifiers, the set of parameters and their constraint terms are derived implicitly as each type of modifier unambiguously defines the constraint terms it requires. Correlated shape modifiers and normalisation uncertainties have compatible constraint terms and thus modifiers can be declared that share parameters by reusing a name [1] for multiple modifiers. That is, a variation of a single parameter causes a shift within sample rates due to both shape and normalisation variations.
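To illustrate parameter sharing by name, the following sketch collects, for each modifier name, the samples and modifier types that declare it. The sample contents, the modifier name `jet_energy_scale`, and the helper `modifiers_by_parameter` are hypothetical; the sketch only inspects the declarations and does not build any constraint terms.

```python
from collections import defaultdict

# Hypothetical samples; "jet_energy_scale" is declared once as a
# normalisation uncertainty and once as a correlated shape modifier.
samples = [
    {"name": "signal", "data": [5.0, 10.0], "modifiers": [
        {"name": "jet_energy_scale", "type": "normsys",
         "data": {"hi": 1.05, "lo": 0.95}},
    ]},
    {"name": "background", "data": [50.0, 60.0], "modifiers": [
        {"name": "jet_energy_scale", "type": "histosys",
         "data": {"hi_data": [52.0, 63.0], "lo_data": [48.0, 57.0]}},
    ]},
]


def modifiers_by_parameter(samples):
    """Map each modifier name to the (sample, type) pairs that declare it."""
    usage = defaultdict(list)
    for sample in samples:
        for modifier in sample["modifiers"]:
            usage[modifier["name"]].append((sample["name"], modifier["type"]))
    return dict(usage)


shared = modifiers_by_parameter(samples)
print(shared)
```

A name that appears with more than one entry corresponds to a single parameter whose variation shifts several sample rates at once.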

We review the structure of each modifier type below.

Uncorrelated Shape (shapesys)#

To construct the constraint term, the relative uncertainties \(\sigma_b\) are necessary for each bin. Therefore, we record the absolute uncertainty as an array of floats, which, combined with the nominal sample data, yields the desired \(\sigma_b\). An example of an uncorrelated shape modifier with three absolute uncertainty terms for a 3-bin channel is shown below:

{ "name": "mod_name", "type": "shapesys", "data": [1.0, 1.5, 2.0] }

Warning

For bins in the model where:

the sample's nominal expected rate is zero, or

the absolute uncertainty is zero,

nuisance parameters will be allocated, but will be fixed to 1 in the calculation (as shapesys is a multiplicative modifier, this results in multiplying by 1).

These values are, in the context of uncorrelated shape uncertainties, unphysical. If this situation occurs, one needs to go back and understand the inputs as this is undefined behavior in HistFactory.

The previous example will allocate three nuisance parameters for mod_name. The following example will also allocate three nuisance parameters for a 3-bin channel, with the second nuisance parameter fixed to 1:

{ "name": "mod_name", "type": "shapesys", "data": [1.0, 0.0, 2.0] }
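The relation between the recorded absolute uncertainties and the relative uncertainties \(\sigma_b\) can be sketched as follows. The nominal rates and the helper `shapesys_relative_uncertainties` are hypothetical; bins with a zero nominal rate or a zero absolute uncertainty are marked with `None` here, standing in for a nuisance parameter that is fixed to 1.

```python
def shapesys_relative_uncertainties(nominal, absolute):
    """Return per-bin relative uncertainties, or None where the bin is fixed."""
    if len(nominal) != len(absolute):
        raise ValueError("one absolute uncertainty is required per bin")
    relative = []
    for rate, uncertainty in zip(nominal, absolute):
        if rate == 0 or uncertainty == 0:
            # Unphysical input: the parameter is allocated but held at 1.
            relative.append(None)
        else:
            relative.append(uncertainty / rate)
    return relative


nominal = [10.0, 20.0, 40.0]  # hypothetical nominal rates for a 3-bin channel
all_free = shapesys_relative_uncertainties(nominal, [1.0, 1.5, 2.0])
one_fixed = shapesys_relative_uncertainties(nominal, [1.0, 0.0, 2.0])
print(all_free)
print(one_fixed)
```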

Correlated Shape (histosys)#

This modifier represents the same source of uncertainty which has a different effect on the various sample shapes, hence a correlated shape. To implement an interpolation between sample distribution shapes, the distributions with a “downward variation” (“lo”) associated with \(\alpha=-1\) and an “upward variation” (“hi”) associated with \(\alpha=+1\) are provided as arrays of floats. An example of a correlated shape modifier with absolute shape variations for a 2-bin channel is shown below:

{ "name": "mod_name", "type": "histosys", "data": {"hi_data": [20,15], "lo_data": [10, 10]} }

This example specifies the expected event rate for the high-variation of the histosys as [20, 15] (20 events in first bin, 15 events in second bin); for the low-variation as [10, 10] (10 events in first bin, 10 events in second bin). This variation is absolute (not relative!).
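One way to picture the interpolation is a piecewise-linear scheme, sketched below. This is an assumption made for illustration: the text above only fixes the rates at \(\alpha=-1\) and \(\alpha=+1\), and other interpolation schemes are possible. The nominal rates and the function name `histosys_piecewise_linear` are hypothetical.

```python
def histosys_piecewise_linear(alpha, nominal, hi_data, lo_data):
    """Interpolate bin rates linearly between lo (-1), nominal (0) and hi (+1)."""
    rates = []
    for nom, hi, lo in zip(nominal, hi_data, lo_data):
        if alpha >= 0:
            rates.append(nom + alpha * (hi - nom))
        else:
            rates.append(nom + alpha * (nom - lo))
    return rates


nominal = [15.0, 12.0]  # hypothetical nominal rates of the sample
hi_data = [20.0, 15.0]
lo_data = [10.0, 10.0]

up = histosys_piecewise_linear(1.0, nominal, hi_data, lo_data)
down = histosys_piecewise_linear(-1.0, nominal, hi_data, lo_data)
half_up = histosys_piecewise_linear(0.5, nominal, hi_data, lo_data)
print(up, down, half_up)
```

Because the variations are absolute, the interpolation acts on event rates directly rather than on scale factors.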

Normalisation Uncertainty (normsys)#

The normalisation uncertainty modifies the sample rate by an overall factor \(\kappa(\alpha)\) constructed as the interpolation between downward (“lo”) and upward (“hi”) as well as the nominal setting, i.e. \(\kappa(-1) = \kappa_{\alpha=-1}\), \(\kappa(0) = 1\) and \(\kappa(+1) = \kappa_{\alpha=+1}\). In the modifier definition we record \(\kappa_{\alpha=+1}\) and \(\kappa_{\alpha=-1}\) as floats. An example of a normalisation uncertainty modifier with scale factors recorded for the up/down variations of an \(n\)-bin channel is shown below:

{ "name": "mod_name", "type": "normsys", "data": {"hi": 1.1, "lo": 0.9} }
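A sketch of one possible \(\kappa(\alpha)\) follows, assuming a piecewise-exponential interpolation. That choice is an assumption for illustration; the specification above only requires \(\kappa(-1)\), \(\kappa(0)\) and \(\kappa(+1)\) to take the stated values. The function name `normsys_factor` is hypothetical.

```python
def normsys_factor(alpha, hi, lo):
    """Piecewise-exponential scale factor with kappa(0) = 1."""
    if alpha >= 0:
        return hi ** alpha
    return lo ** (-alpha)


def apply_normsys(alpha, rates, hi, lo):
    """Scale every bin of a sample by the same factor kappa(alpha)."""
    kappa = normsys_factor(alpha, hi, lo)
    return [kappa * rate for rate in rates]


kappa_up = normsys_factor(1.0, hi=1.1, lo=0.9)
kappa_nominal = normsys_factor(0.0, hi=1.1, lo=0.9)
kappa_down = normsys_factor(-1.0, hi=1.1, lo=0.9)
print(kappa_up, kappa_nominal, kappa_down)
```

Since the factor is the same for every bin, a single pair of floats suffices regardless of the number of bins in the channel.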

MC Statistical Uncertainty (staterror)#

As the sample counts are often derived from Monte Carlo (MC) datasets, they necessarily carry an uncertainty due to the finite sample size of the datasets. As explained in detail in [likelihood-2], adding uncertainties for each sample would yield a very large number of nuisance parameters with limited utility. Therefore a set of bin-wise scale factors \(\gamma_{cb}\) is introduced to model the overall uncertainty in the bin due to MC statistics. The constraint term is constructed as a set of constraints with a central value equal to unity, e.g. \(\mathrm{Gauss} (\mu = 1, \sigma_{cb})\), for each bin in the channel. The scales \(\sigma_{cb}\) of the constraints are computed from the individual uncertainties of the samples defined within the channel, added in quadrature, relative to the total event rate of all samples: \(\sigma_{cb} = \sqrt{\sum_s\delta^2_{csb}}/\sum_s \nu^0_{csb}\), where \(\delta_{csb}\) is the absolute yield uncertainty in each bin and \(\nu^0_{csb}\) is the corresponding nominal event rate.

As not all samples within a channel are estimated from MC simulations, only the samples with a declared statistical uncertainty modifier enter the sum. An example of a statistical uncertainty modifier for a single bin channel is shown below:

{ "name": "mod_name", "type": "staterror", "data": [0.1] }
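The computation of \(\sigma_{cb}\) can be sketched as below: the absolute uncertainties are added in quadrature and divided by the summed nominal rates. The sketch assumes that both sums run only over samples that declare a staterror modifier, and it does not handle bins whose summed rate is zero. The sample contents and the helper `staterror_sigmas` are hypothetical.

```python
import math

# Hypothetical single-bin channel: two MC samples and one data-driven sample.
samples = [
    {"name": "mc_a", "data": [30.0], "modifiers": [
        {"name": "staterror_sr", "type": "staterror", "data": [3.0]}]},
    {"name": "mc_b", "data": [20.0], "modifiers": [
        {"name": "staterror_sr", "type": "staterror", "data": [4.0]}]},
    {"name": "data_driven", "data": [10.0], "modifiers": []},
]


def staterror_sigmas(samples):
    """Return one relative uncertainty per bin from the staterror samples."""
    contributing = [
        (sample["data"], modifier["data"])
        for sample in samples
        for modifier in sample["modifiers"]
        if modifier["type"] == "staterror"
    ]
    if not contributing:
        return []
    n_bins = len(contributing[0][0])
    sigmas = []
    for b in range(n_bins):
        quadrature = math.sqrt(sum(delta[b] ** 2 for _, delta in contributing))
        total_rate = sum(rates[b] for rates, _ in contributing)
        sigmas.append(quadrature / total_rate)
    return sigmas


sigmas = staterror_sigmas(samples)
print(sigmas)
```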

Warning

For bins in the model where:

the sample's nominal expected rate is zero, or

the scale factor is zero,

nuisance parameters will be allocated, but will be fixed to 1 in the calculation (as staterror is a multiplicative modifier, this results in multiplying by 1).

Luminosity (lumi)#

Sample rates derived from theory calculations, as opposed to data-driven estimates, are scaled to the integrated luminosity corresponding to the observed data. As the luminosity measurement is itself subject to an uncertainty, it must be reflected in the rate estimates of such samples. As this modifier is of global nature, no additional per-sample information is required and thus the data field is nulled. This uncertainty is relevant, in particular, when the parameter of interest is a signal cross-section. The luminosity uncertainty \(\sigma_\lambda\) is provided as part of the parameter configuration included in the measurement specification discussed in Measurements. An example of a luminosity modifier is shown below:

{ "name": "mod_name", "type": "lumi", "data": null }

Unconstrained Normalisation (normfactor)#

The unconstrained normalisation modifier scales the event rates of a sample by a free parameter \(\mu\). Common use cases are the signal rate of a possible BSM signal or simultaneous in-situ measurements of background samples. Such parameters are frequently the parameters of interest of a given measurement. No additional per-sample data is required. An example of a normalisation modifier is shown below:

{ "name": "mod_name", "type": "normfactor", "data": null }
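The effect of an unconstrained normalisation can be sketched in a few lines: a single free parameter \(\mu\) multiplies every bin of the sample. The function name `apply_normfactor` and the rates are hypothetical.

```python
def apply_normfactor(mu, rates):
    """Scale all bins of a sample by the same free parameter mu."""
    return [mu * rate for rate in rates]


signal = [5.0, 10.0]  # hypothetical nominal signal rates
doubled = apply_normfactor(2.0, signal)
switched_off = apply_normfactor(0.0, signal)
print(doubled, switched_off)
```

Setting \(\mu = 0\) removes the sample entirely, which is how a background-only hypothesis can be expressed when \(\mu\) scales the signal.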

Data-driven Shape (shapefactor)#

In order to support data-driven estimation of sample rates (e.g. for multijet backgrounds), the data-driven shape modifier adds free, bin-wise multiplicative parameters. Similarly to the normalisation factors, no additional data is required as no constraint is defined. An example of a data-driven shape modifier is shown below:

{ "name": "mod_name", "type": "shapefactor", "data": null }
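In contrast to a single normalisation factor, a data-driven shape modifier carries one free parameter per bin. A minimal sketch, with the hypothetical helper `apply_shapefactor` and made-up rates:

```python
def apply_shapefactor(gammas, rates):
    """Scale each bin of a sample by its own free parameter."""
    if len(gammas) != len(rates):
        raise ValueError("one free parameter is required per bin")
    return [gamma * rate for gamma, rate in zip(gammas, rates)]


multijet = [8.0, 4.0]  # hypothetical nominal multijet rates
adjusted = apply_shapefactor([1.5, 0.5], multijet)
print(adjusted)
```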

Data#

The data provided by the analysis are the observed data for each channel (or region). This data is provided as a mapping from channel name to an array of floats, which provide the observed rates in each bin of the channel. The auxiliary data is not included as it is an input to the likelihood that does not need to be archived and can be determined automatically from the specification. An example of channel data is shown below:

{ "chan_name_one": [10, 20], "chan_name_two": [4, 0]}

Measurements#

Given the data and the model definitions, a measurement can be defined. In the current schema, a measurement defines the name of the parameter of interest as well as parameter set configurations. [2] Here, the remaining information not covered through the channel definition is provided, e.g. for the luminosity parameter. For all modifiers, the default settings can be overridden where possible:

inits: Initial value of the parameter.
bounds: Interval bounds of the parameter.
auxdata: Auxiliary data for the associated constraint term.
sigmas: Associated uncertainty of the parameter.

An example of a measurement is shown below:

{
    "name": "MyMeasurement",
    "config": {
        "poi": "SignalCrossSection", "parameters": [
            { "name":"lumi", "auxdata":[1.0],"sigmas":[0.017], "bounds":[[0.915,1.085]],"inits":[1.0] },
            { "name":"mu_ttbar", "bounds":[[0, 5]] },
            { "name":"rw_1CR", "fixed":true }
        ]
    }
}

This measurement, which scans over the parameter of interest SignalCrossSection, sets the configuration for the luminosity modifier, changes the default bounds for the normfactor modifier named mu_ttbar, and specifies that the modifier rw_1CR is held constant (fixed).
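Reading such a measurement in Python can be sketched as follows. The helper `parameter_overrides` is hypothetical and simply indexes the declared parameter configurations by name; any setting that is not listed is assumed to keep its default.

```python
measurement = {
    "name": "MyMeasurement",
    "config": {
        "poi": "SignalCrossSection",
        "parameters": [
            {"name": "lumi", "auxdata": [1.0], "sigmas": [0.017],
             "bounds": [[0.915, 1.085]], "inits": [1.0]},
            {"name": "mu_ttbar", "bounds": [[0, 5]]},
            {"name": "rw_1CR", "fixed": True},
        ],
    },
}


def parameter_overrides(measurement):
    """Index the overridden settings of a measurement by parameter name."""
    return {
        parameter["name"]: {
            key: value for key, value in parameter.items() if key != "name"
        }
        for parameter in measurement["config"]["parameters"]
    }


poi = measurement["config"]["poi"]
overrides = parameter_overrides(measurement)
print(poi)
print(overrides["lumi"]["sigmas"])
```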

Observations#

The observations are the data against which hypothesis tests are evaluated, for example to compare the compatibility of the signal+background hypothesis with that of the background-only hypothesis. They are specified as a list of objects, with each object structured as

name: the channel for which the observations are recorded
data: the bin-by-bin observations for the named channel

An example of an observation for a 2-bin channel channel1, with values 110.0 and 120.0, is shown below:

{
    "name": "channel1", "data": [110.0, 120.0]
}
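A list of such objects can be turned into a mapping from channel name to observed rates, which is often more convenient for lookups. This is a sketch; the helper `observations_by_channel` and the second channel are hypothetical.

```python
observations = [
    {"name": "channel1", "data": [110.0, 120.0]},
    {"name": "channel2", "data": [4.0, 0.0]},  # hypothetical second channel
]


def observations_by_channel(observations):
    """Map each channel name to its bin-by-bin observed rates."""
    mapping = {}
    for observation in observations:
        name = observation["name"]
        if name in mapping:
            raise ValueError(f"channel {name!r} is observed more than once")
        mapping[name] = list(observation["data"])
    return mapping


observed = observations_by_channel(observations)
print(observed["channel1"])
```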

Toy Example#

{
    "channels": [
        { "name": "singlechannel",
          "samples": [
            { "name": "signal",
              "data": [5.0, 10.0],
              "modifiers": [ { "name": "mu", "type": "normfactor", "data": null} ]
            },
            { "name": "background",
              "data": [50.0, 60.0],
              "modifiers": [ {"name": "uncorr_bkguncrt", "type": "shapesys", "data": [5.0, 12.0]} ]
            }
          ]
        }
    ],
    "observations": [
        { "name": "singlechannel", "data": [50.0, 60.0] }
    ],
    "measurements": [
        { "name": "Measurement", "config": {"poi": "mu", "parameters": []} }
    ],
    "version": "1.0.0"
}

In the above example, we demonstrate a simple measurement of a single two-bin channel with two samples: a signal sample and a background sample. The signal sample has an unconstrained normalisation factor \(\mu\), while the background sample carries an uncorrelated shape systematic controlled by parameters \(\gamma_1\) and \(\gamma_2\). The background uncertainty for the bins is 10% and 20% respectively.
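The numbers in the toy example can be checked with a short sketch that uses only the Python standard library. It evaluates the expected rates \(\mu s_b + \gamma_b b_b\) and the Poisson part of the likelihood; the constraint terms for \(\gamma_1\) and \(\gamma_2\) are deliberately omitted, so this is not the full likelihood. The function names are hypothetical.

```python
import math

signal = [5.0, 10.0]
background = [50.0, 60.0]
background_abs_unc = [5.0, 12.0]
observed = [50.0, 60.0]


def expected_rates(mu, gammas):
    """Expected events per bin: mu * signal + gamma_b * background."""
    return [mu * s + g * b for s, g, b in zip(signal, gammas, background)]


def poisson_log_likelihood(observed, expected):
    """Sum of per-bin Poisson log-probabilities (constraint terms omitted)."""
    return sum(
        n * math.log(nu) - nu - math.lgamma(n + 1)
        for n, nu in zip(observed, expected)
    )


# Relative background uncertainties per bin.
relative_unc = [u / b for u, b in zip(background_abs_unc, background)]

rates_sb = expected_rates(1.0, [1.0, 1.0])  # signal+background
rates_b = expected_rates(0.0, [1.0, 1.0])   # background-only
ll_sb = poisson_log_likelihood(observed, rates_sb)
ll_b = poisson_log_likelihood(observed, rates_b)
print(relative_unc, rates_sb, rates_b)
print(ll_b > ll_sb)
```

Because the observed data in the toy example equal the nominal background rates, the background-only expectation gives the larger Poisson log-likelihood in this sketch.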

Additional Material#

Footnotes#

Bibliography#

[likelihood-1] Histfactory definitions schema. Accessed: 2019-06-20. URL: https://scikit-hep.org/pyhf/schemas/1.0.0/defs.json.

[likelihood-2] Kyle Cranmer, George Lewis, Lorenzo Moneta, Akira Shibata, and Wouter Verkerke. HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report CERN-OPEN-2012-016, New York U., New York, Jan 2012. URL: https://cds.cern.ch/record/1456844.