.. _sec:likelihood:

Likelihood Specification
========================

The structure of the JSON specification of models follows closely the
original XML-based specification :cite:`likelihood-Cranmer:1456844`.

Workspace
---------

.. literalinclude:: ../src/pyhf/schemas/1.0.0/workspace.json
   :language: json

The overall document in the above code snippet describes a *workspace*, which includes

* **channels**: The channels in the model, which include a description of the samples
  within each channel and their possible parametrised modifiers.
* **measurements**: A set of measurements, which define among others the parameters of
  interest for a given statistical analysis objective.
* **observations**: The observed data, with which a likelihood can be constructed from the model.

A workspace consists of the channels, one set of observed data, but can
include multiple measurements. If provided a JSON file, one can quickly
check that it conforms to the provided workspace specification as follows:

.. code:: python

   import json, requests, jsonschema

   with open("/path/to/analysis_workspace.json", encoding="utf-8") as ws_file:
       workspace = json.load(ws_file)
   # if no exception is raised, it found and parsed the schema
   schema = requests.get("https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json").json()
   # If no exception is raised by validate(), the instance is valid.
   jsonschema.validate(instance=workspace, schema=schema)


.. _ssec:channel:

Channel
-------

A channel is defined by a channel name and a list of samples :cite:`likelihood-schema_defs`.

.. code:: json

    {
        "channel": {
            "type": "object",
            "properties": {
                "name": { "type": "string" },
                "samples": { "type": "array", "items": {"$ref": "#/definitions/sample"}, "minItems": 1 }
            },
            "required": ["name", "samples"],
            "additionalProperties": false
        },
    }

The Channel specification consists of a list of channel descriptions.
Each channel, an analysis region encompassing one or more measurement
bins, consists of a ``name`` field and a ``samples`` field (see :ref:`ssec:channel`), which
holds a list of sample definitions (see :ref:`ssec:sample`). Each sample definition in
turn has a ``name`` field, a ``data`` field for the nominal event rates
for all bins in the channel, and a ``modifiers`` field of the list of
modifiers for the sample.

.. _ssec:sample:

Sample
------

A sample is defined by a sample name, the sample event rate, and a list of modifiers :cite:`likelihood-schema_defs`.

.. _lst:schema:sample:

.. code:: json

    {
        "sample": {
            "type": "object",
            "properties": {
                "name": { "type": "string" },
                "data": { "type": "array", "items": {"type": "number"}, "minItems": 1 },
                "modifiers": {
                    "type": "array",
                    "items": {
                        "anyOf": [
                            { "$ref": "#/definitions/modifier/histosys" },
                            { "$ref": "#/definitions/modifier/lumi" },
                            { "$ref": "#/definitions/modifier/normfactor" },
                            { "$ref": "#/definitions/modifier/normsys" },
                            { "$ref": "#/definitions/modifier/shapefactor" },
                            { "$ref": "#/definitions/modifier/shapesys" },
                            { "$ref": "#/definitions/modifier/staterror" }
                        ]
                    }
                }
            },
            "required": ["name", "data", "modifiers"],
            "additionalProperties": false
        },
    }

Modifiers
---------

The modifiers that are applicable for a given sample are encoded as a
list of JSON objects with three fields. A name field, a type field
denoting the class of the modifier, and a data field which provides the
necessary input data as denoted in :ref:`tab:modifiers_and_constraints`.

Based on the declared modifiers, the set of parameters and their
constraint terms are derived implicitly as each type of modifier
unambiguously defines the constraint terms it requires. Correlated shape
modifiers and normalisation uncertainties have compatible constraint
terms and thus modifiers can be declared that *share* parameters by
re-using a name [1]_ for multiple modifiers. That is, a variation of a
single parameter causes a shift within sample rates due to both shape
and normalisation variations.

We review the structure of each modifier type below.

Uncorrelated Shape (shapesys)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To construct the constraint term, the relative uncertainties
:math:`\sigma_b` are necessary for each bin. Therefore, we record the
absolute uncertainty as an array of floats, which combined with the
nominal sample data yield the desired :math:`\sigma_b`.
An example of an uncorrelated shape modifier with three absolute uncertainty
terms for a 3-bin channel is shown below:

.. code:: json

   { "name": "mod_name", "type": "shapesys", "data": [1.0, 1.5, 2.0] }

.. warning::

   For bins in the model where:

     * the samples nominal expected rate is zero, or
     * the absolute uncertainty is zero.

   nuisance parameters will be allocated, but will be fixed to ``1`` in the
   calculation (as shapesys is a multiplicative modifier this results in
   multiplying by ``1``).

   These values are, in the context of uncorrelated shape uncertainties,
   unphysical. If this situation occurs, one needs to go back and understand
   the inputs as this is undefined behavior in HistFactory.

The previous example will allocate three nuisance parameters for ``mod_name``.
The following example will also allocate three nuisance parameters for a 3-bin
channel, with the second nuisance parameter fixed to ``1``:

.. code:: json

   { "name": "mod_name", "type": "shapesys", "data": [1.0, 0.0, 2.0] }

Correlated Shape (histosys)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

This modifier represents the same source of uncertainty which has a
different effect on the various sample shapes, hence a correlated shape.
To implement an interpolation between sample distribution shapes, the
distributions with a "downward variation" ("lo") associated with
:math:`\alpha=-1` and an "upward variation" ("hi") associated with
:math:`\alpha=+1` are provided as arrays of floats.
An example of a correlated shape modifier with absolute shape variations
for a 2-bin channel is shown below:

.. code:: json

   { "name": "mod_name", "type": "histosys", "data": {"hi_data": [20,15], "lo_data": [10, 10]} }

This example specifies the expected event rate for the high-variation of the ``histosys`` as ``[20, 15]`` (20 events in first bin, 15 events in second bin); for the low-variation as ``[10, 10]`` (10 events in first bin, 10 events in second bin).
This variation is absolute (not relative!).

Normalisation Uncertainty (normsys)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The normalisation uncertainty modifies the sample rate by a overall
factor :math:`\kappa(\alpha)` constructed as the interpolation between
downward ("lo") and upward ("hi") as well as the nominal setting, i.e.
:math:`\kappa(-1) = \kappa_{\alpha=-1}`, :math:`\kappa(0) = 1` and
:math:`\kappa(+1) = \kappa_{\alpha=+1}`. In the modifier definition we record
:math:`\kappa_{\alpha=+1}` and :math:`\kappa_{\alpha=-1}` as floats.
An example of a normalisation uncertainty modifier with scale factors recorded
for the up/down variations of an :math:`n`-bin channel is shown below:

.. code:: json

   { "name": "mod_name", "type": "normsys", "data": {"hi": 1.1, "lo": 0.9} }

MC Statistical Uncertainty (staterror)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As the sample counts are often derived from Monte Carlo (MC) datasets, they
necessarily carry an uncertainty due to the finite sample size of the datasets.
As explained in detail in :cite:`likelihood-Cranmer:1456844`, adding uncertainties for
each sample would yield a very large number of nuisance parameters with limited
utility.
Therefore a set of bin-wise scale factors :math:`\gamma_{cb}` is
introduced to model the overall uncertainty in the bin due to MC statistics.
The constraint term is constructed as a set of constraints with a
central value equal to unity, e.g. :math:`\mathrm{Gauss} (\mu = 1, \sigma_{cb})`, for
each bin in the channel.
The scales :math:`\sigma_{cb}` of the constraints are computed from the individual
uncertainties of samples defined within the channel relative to the total event rate
of all samples: :math:`\sigma_{cb} = \sqrt{\sum_s\delta_{csb}}/\sum_s \nu^0_{csb}`,
where :math:`\delta_{csb}` is the absolute yield uncertainty in each bin.

As not all samples within a channel are estimated from MC simulations, only
the samples with a declared statistical uncertainty modifier enter the sum.
An example of a statistical uncertainty modifier for a single bin channel is
shown below:

.. code:: json

   { "name": "mod_name", "type": "staterror", "data": [0.1] }

.. warning::

   For bins in the model where:

     * the samples nominal expected rate is zero, or
     * the scale factor is zero.

   nuisance parameters will be allocated, but will be fixed to ``1`` in the
   calculation (as staterror is a multiplicative modifier this results in
   multiplying by ``1``).

Luminosity (lumi)
~~~~~~~~~~~~~~~~~

Sample rates derived from theory calculations, as opposed to data-driven
estimates, are scaled to the integrated luminosity corresponding to the
observed data. As the luminosity measurement is itself subject to an
uncertainty, it must be reflected in the rate estimates of such samples.  As
this modifier is of global nature, no additional per-sample information is
required and thus the data field is nulled. This uncertainty is relevant, in
particular, when the parameter of interest is a signal cross-section. The
luminosity uncertainty :math:`\sigma_\lambda` is provided as part of the
parameter configuration included in the measurement specification discussed
in :ref:`ssec:measurements`.
An example of a luminosity modifier is shown below:

.. code:: json

   { "name": "mod_name", "type": "lumi", "data": null }

Unconstrained Normalisation (normfactor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The unconstrained normalisation modifier scales the event rates of a
sample by a free parameter :math:`\mu`. Common use cases are the signal
rate of a possible BSM signal or simultaneous in-situ measurements of
background samples. Such parameters are frequently the parameters of
interest of a given measurement. No additional per-sample data is
required.
An example of a normalisation modifier is shown below:

.. code:: json

   { "name": "mod_name", "type": "normfactor", "data": null }

Data-driven Shape (shapefactor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to support data-driven estimation of sample rates (e.g. for
multijet backgrounds), the data-driven shape modifier adds free,
bin-wise multiplicative parameters. Similarly to the normalisation
factors, no additional data is required as no constraint is defined.
An example of an uncorrelated shape modifier is shown below:

.. code:: json

   { "name": "mod_name", "type": "shapefactor", "data": null }

Data
----

The data provided by the analysis are the observed data for each channel
(or region). This data is provided as a mapping from channel name to an
array of floats, which provide the observed rates in each bin of the
channel. The auxiliary data is not included as it is an input to the
likelihood that does not need to be archived and can be determined
automatically from the specification.
An example of channel data is shown below:

.. _lst:example:data:

.. code:: json

   { "chan_name_one": [10, 20], "chan_name_two": [4, 0]}

.. _ssec:measurements:

Measurements
------------

Given the data and the model definitions, a measurement can be defined.
In the current schema, the measurements defines the name of the
parameter of interest as well as parameter set configurations.  [2]_
Here, the remaining information not covered through the channel
definition is provided, e.g. for the luminosity parameter. For all
modifiers, the default settings can be overridden where possible:

* **inits**: Initial value of the parameter.
* **bounds**: Interval bounds of the parameter.
* **auxdata**: Auxiliary data for the associated constraint term.
* **sigmas**: Associated uncertainty of the parameter.

An example of a measurement is shown below:

.. code:: json

   {
       "name": "MyMeasurement",
       "config": {
           "poi": "SignalCrossSection", "parameters": [
               { "name":"lumi", "auxdata":[1.0],"sigmas":[0.017], "bounds":[[0.915,1.085]],"inits":[1.0] },
               { "name":"mu_ttbar", "bounds":[[0, 5]] },
               { "name":"rw_1CR", "fixed":true }
           ]
       }
   }

This measurement, which scans over the parameter of interest ``SignalCrossSection``, is setting configurations for the luminosity modifier, changing the default bounds for the normfactor modifier named ``mu_ttbar``, and specifying that the modifier ``rw_1CR`` is held constant (``fixed``).

.. _ssec:observations:

Observations
------------

This is what we evaluate the hypothesis testing against, to determine the
compatibility of signal+background hypothesis to the background-only
hypothesis. This is specified as a list of objects, with each object structured
as

* **name**: the channel for which the observations are recorded
* **data**: the bin-by-bin observations for the named channel

An example of an observation for a 2-bin channel ``channel1``, with values
``110.0`` and ``120.0`` is shown below:

.. code:: json

   {
       "name": "channel1", "data": [110.0, 120.0]
   }

Toy Example
-----------

.. # N.B. If the following literalinclude is changed test_examples.py must be changed accordingly
.. literalinclude:: ./examples/json/2-bin_1-channel.json
   :language: json

In the above example, we demonstrate a simple measurement of a
single two-bin channel with two samples: a signal sample and a background
sample. The signal sample has an unconstrained normalisation factor
:math:`\mu`, while the background sample carries an uncorrelated shape
systematic controlled by parameters :math:`\gamma_1` and :math:`\gamma_2`. The
background uncertainty for the bins is 10% and 20% respectively.

Additional Material
-------------------

Footnotes
~~~~~~~~~

.. [1]
   The name of a modifier specifies the parameter set it is controlled
   by. Modifiers with the same name share parameter sets.

.. [2]
   In this context a parameter set corresponds to a named
   lower-dimensional subspace of the full parameters :math:`\fullset`.
   In many cases these are one-dimensional subspaces, e.g. a specific
   interpolation parameter :math:`\alpha` or the luminosity parameter
   :math:`\lambda`. For multi-bin channels, however, e.g. all bin-wise
   nuisance parameters of the uncorrelated shape modifiers are grouped
   under a single name. Therefore in general a parameter set definition
   provides arrays of initial values, bounds, etc.

Bibliography
~~~~~~~~~~~~

.. bibliography:: bib/docs.bib
   :filter: docname in docnames
   :style: plain
   :keyprefix: likelihood-
   :labelprefix: likelihood-