41 commits
af4c7c0
writing pseudodata in diagonal basis and saving eigvecs and eigvals a…
jacoterh Apr 14, 2026
36f9697
using table decorator
jacoterh Apr 21, 2026
5198e90
more cleaning
jacoterh Apr 21, 2026
1284cd0
more cleaning
jacoterh Apr 21, 2026
2ad0d47
setting use_t0 = True in vp-setupfit
jacoterh Apr 24, 2026
fd366da
wip on loading diag rot from table dir in vp-setupfit
jacoterh Apr 24, 2026
211c277
wip on loading diag rot from table dir in vp-setupfit
jacoterh Apr 27, 2026
d6adf4d
updating theory cov defaults
jacoterh Apr 27, 2026
3176790
swapping order rotation and theory covmat in vp-setupfit
jacoterh Apr 27, 2026
20d6f12
vp-setupfit now runs
jacoterh Apr 27, 2026
d52ee58
calling nnfit_theory_covmat instead of reading the csv
jacoterh Apr 27, 2026
c2f1094
vp-setupfit stores covmat in diagonal and non-diagonal case
jacoterh Apr 27, 2026
e7c874c
caching inverse covmat in non diagonal basis
jacoterh Apr 27, 2026
ceabf61
cleaning
jacoterh Apr 27, 2026
59f3e04
fixing covmat by reference issue
jacoterh Apr 29, 2026
0fe6718
n3fit runs again
jacoterh Apr 30, 2026
4ff68d5
extending number of headers
jacoterh Apr 30, 2026
138e1d3
passing covmat in data basis to make_replica for correct sampling
jacoterh May 1, 2026
4b95f6a
indexing by process group
jacoterh May 1, 2026
b1a0b1b
make rotation action condition on presence of theorycovmatconfig in r…
jacoterh May 1, 2026
1b0f79a
Update validphys2/src/validphys/config.py
jacoterh May 1, 2026
b93eddf
making csv file name theorycovmat dependent + no longer modify global…
jacoterh May 1, 2026
554b6d2
make_replica uses loaded theory covmat
jacoterh May 14, 2026
34dbe2b
adding loaded theory covmat variants
jacoterh May 14, 2026
61fdfc2
removing t0_considered from hashed covmat array
jacoterh May 14, 2026
b647ea0
Fixng errors: moving rotation_action out of closure block
achiefa May 15, 2026
347f331
Using single-column csv when reading diag-basis results
achiefa May 16, 2026
8932717
Enfore correct routing for constructing or loading the theory covmat
achiefa May 18, 2026
d5b7f57
Remove dead variable
achiefa May 18, 2026
22acf9f
Fix in diagonal basis + update tests
achiefa May 18, 2026
28d35cb
Regenerate tests
achiefa May 19, 2026
99a2339
Update test
achiefa May 19, 2026
85f806c
Enforce data-input ordering in nnfit_theory_covmat
achiefa May 20, 2026
e7071f0
removing redundant comment in docstring groups_index
jacoterh May 19, 2026
f68bafd
Align nnfit_theory_covat to data_input
jacoterh May 20, 2026
e802729
Change notation for diagonal basis in documentation - first try
achiefa May 20, 2026
ebcd842
updating docs
jacoterh May 21, 2026
b7053de
Update validphys2/src/validphys/n3fit_data.py
jacoterh May 21, 2026
57a6c49
Update validphys2/src/validphys/pseudodata.py
jacoterh May 21, 2026
009a929
Merge branch 'master' into diag_covmat_reproducibility
jacoterh May 21, 2026
b31822d
Run pre-commit
achiefa May 21, 2026
93 changes: 65 additions & 28 deletions doc/sphinx/source/n3fit/methodology.rst
@@ -348,53 +348,90 @@ The figure above provides a schematic representation of this feature scaling met
Diagonal basis
--------------

Performing the training and validation split without diagonalising the :math:`t_0` covmat :math:`C_{0}` neglects
any correlations that may be present between training and validation data. To remedy this,
we rotate to a basis in which the correlation matrix is diagonal before performing any training/validation split.
Starting from the definition of the :math:`\chi^2` function in the NNPDF methodology, we have
Training and validation data are obtained by performing a random split of the
original data set. However, data points in the two sets are not necessarily
statistically independent, as they may be correlated through the fitting
covariance matrix :math:`C_{\rm fit}`. Here the fitting covariance matrix is the
sum of the :math:`t_{0}` experimental covariance matrix :math:`C_{0}` and any
theory covariance matrix :math:`C_{\rm th}` used in the fit, i.e., :math:`C_{\rm
fit} = C_{0} + C_{\rm th}`. In order to disentangle the training and
validation data, we perform the training-validation split in a basis in which
the correlation matrix is diagonal.

We first compute the correlation matrix :math:`\rho`, which is defined as

.. math::

\chi^2 &= (D-T)^T C_0^{-1} (D-T) \\
&= (D-T)^T R^{-1} R C_0^{-1} R R^{-1} (D-T) \\
&= (D-T)^T R^{-1} \left( R^{-1} C_0 R^{-1} \right)^{-1} R^{-1} (D-T) \\
&\equiv \tilde{\epsilon}^T \rho^{-1} \tilde{\epsilon} \, ,
\rho = \Sigma^{-1} C_{\rm fit} \Sigma^{-1} \, ,

where we have defined :math:`\tilde{\epsilon} \equiv R^{-1}(D-T)` and :math:`\rho = R^{-1} C_0 R^{-1}`.
where we have defined :math:`\Sigma_{ij} = \sqrt{C_{\rm fit, ii}} \delta_{ij}` and
:math:`(\Sigma^{-1})_{ij} = \frac{1}{\sqrt{C_{\rm fit, ii}}} \delta_{ij}`. The
correlation matrix is symmetric and positive-definite, so it can be diagonalized
by an orthogonal transformation. We can therefore write

Choosing :math:`R_{ii} = \sqrt{C_{0, ii}}`, we have that :math:`R^{-1} C_0 R^{-1}` coincides with the usual definition of the correlation matrix.
.. math::

\rho = V \Lambda V^T \, ,

where :math:`V` is an orthogonal matrix and :math:`\Lambda` is a diagonal matrix
containing the eigenvalues of :math:`\rho`. The original fitting covariance
matrix can then be written as

.. math::

C_{\rm fit} &= \Sigma \rho \Sigma \\
&= (\Sigma V) \Lambda (V^T \Sigma) \\
&\equiv U \Lambda U^T \, ,

Next, we move to the basis in which :math:`\rho` is diagonal. Writing :math:`\rho = \tilde{U}^T \tilde{\Lambda} \tilde{U}`, we find
where we have defined the non-orthogonal matrix :math:`U = \Sigma V`. Its
inverse defines the rotation matrix that diagonalizes the
:math:`\chi^2`, and is given by :math:`R^T \equiv U^{-1} = V^T \Sigma^{-1}`.
Therefore, the inverse of the fitting covariance matrix can be written as

.. math::

\chi^2 &= \tilde{\epsilon}^T \rho^{-1} \tilde{\epsilon} \\
&= \tilde{\epsilon}^T (\tilde{U}^T \tilde{\Lambda} \tilde{U})^{-1} \tilde{\epsilon} \\
&= \tilde{\epsilon}^T \tilde{U}^T \tilde{\Lambda}^{-1} \tilde{U} \tilde{\epsilon} \\
&\equiv \tilde{\tilde{\epsilon}}^T \tilde{\Lambda}^{-1} \tilde{\tilde{\epsilon}} \, ,
C_{\rm fit}^{-1} &= (U \Lambda U^T)^{-1} \\
&= (U^T)^{-1} \Lambda^{-1} U^{-1} \\
&= R \Lambda^{-1} R^T \, .

where on the last line we have defined
Considering the definition of the :math:`\chi^2` function in the NNPDF
methodology, we finally have

.. math::

\tilde{\tilde{\epsilon}} \equiv \tilde{U}\tilde{\epsilon} = \tilde{U}R^{-1}(D-T).
\chi^2 &= (D-T)^T C_{\rm fit}^{-1} (D-T) \\
&= (D-T)^T R \Lambda^{-1} R^T (D-T) \\
&= \epsilon^T \Lambda^{-1} \epsilon \\
&= \lVert \epsilon \rVert^2_{\Lambda^{-1}} \, ,

where we have defined the residuals in the diagonal basis as :math:`\epsilon \equiv R^T(D-T)` or, writing it in index notation,

.. math::

\epsilon_i = (V^T)_{ij} \frac{(D-T)_j}{\sqrt{C_{\rm fit, jj}}} \,.

In this basis, the :math:`\chi^2` becomes a weighted norm of the residuals,
where the weights are given by the inverse of the eigenvalues of the correlation
matrix.

In index notation, this reads
The transformed data :math:`\epsilon` are uncorrelated in the
diagonal basis of the correlation matrix :math:`\rho`. As a cross-check, we
can compute the covariance of :math:`\epsilon`,

.. math::

\tilde{\tilde{\epsilon_i}} = \tilde{U}_{ij} \frac{(D-T)_j}{\sqrt{C_{0, jj}}}
\mathbb{E}[\epsilon \epsilon^T] &= \mathbb{E}[R^T(D-T)(D-T)^T R] \\
&= R^T \mathbb{E}[(D-T)(D-T)^T] R \\
&= R^T C_{\rm fit} R \\
&= R^T U \Lambda U^T R \\
&= \Lambda \,,

The transformed data :math:`\tilde{\tilde{\epsilon}}` is statistically independent in the diagonal basis of the correlation matrix :math:`\rho`.
Computing the covariance of :math:`\tilde{\tilde{\epsilon}}`,
where we have used the fact that :math:`R^T U = I` and the assumption that the
data are distributed according to the fitting covariance matrix :math:`C_{\rm fit}`,

.. math::

\mathbb{E}[\tilde{\tilde{\epsilon}}\tilde{\tilde{\epsilon}}^T]
&= \mathbb{E} \big[ (\tilde{U} R^{-1}(D-T)) (\tilde{U} R^{-1}(D-T))^T \big] \\
&= \tilde{U} R^{-1} \mathbb{E}[(D-T)(D-T)^T] R^{-1} \tilde{U}^T \\
&= \tilde{U} \rho \tilde{U}^T \\
&= \tilde{U}\tilde{U}^T \tilde{\Lambda} \tilde{U}\tilde{U}^T \\
&= \tilde{\Lambda} \, ,
\mathbb{E}[(D-T)(D-T)^T] = C_{\rm fit} \, .

we find that it is diagonal, which demonstrates that the training/validation data are statistically independent indeed.
This shows that the covariance of :math:`\epsilon` is indeed diagonal, and
demonstrates that the training/validation data are uncorrelated.
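
The rotation described in this section can be checked numerically. The following is a minimal NumPy sketch, not the validphys implementation: the 3x3 covariance matrix and the residual vector are made-up toy values, and the variable names (``c_fit``, ``rot_t``, ``eps``) are chosen here for illustration only.

```python
import numpy as np

# Toy fitting covariance matrix C_fit (made-up, symmetric positive-definite).
c_fit = np.array([[4.0, 1.2, 0.4],
                  [1.2, 1.0, 0.3],
                  [0.4, 0.3, 0.25]])
residual = np.array([0.5, -0.2, 0.1])  # stands in for D - T

# Sigma holds the square roots of the diagonal of C_fit.
sigma = np.sqrt(np.diag(c_fit))

# Correlation matrix rho = Sigma^{-1} C_fit Sigma^{-1}.
rho = c_fit / np.outer(sigma, sigma)

# rho = V Lambda V^T with V orthogonal (eigh is meant for symmetric matrices).
eigvals, v = np.linalg.eigh(rho)

# R^T = V^T Sigma^{-1}: broadcasting divides column j of V^T by sigma_j.
rot_t = v.T / sigma

# Residuals in the diagonal basis and the chi2 as a weighted norm.
eps = rot_t @ residual
chi2_diag = np.sum(eps**2 / eigvals)

# Reference chi2 in the data basis, and the covariance of eps.
chi2_data = residual @ np.linalg.solve(c_fit, residual)
cov_eps = rot_t @ c_fit @ rot_t.T
```

Under these assumptions ``chi2_diag`` agrees with ``chi2_data`` up to floating-point error, and ``cov_eps`` reproduces the diagonal matrix of eigenvalues.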
7 changes: 7 additions & 0 deletions doc/sphinx/source/n3fit/runcard_detailed.rst
@@ -429,6 +429,13 @@ according to their experiment. Additionally, the union of these two is saved in
``<fit_directory>/replica_<number>/datacuts_theory_fitting_pseudodata_table.csv``
if one is not interested in the exact nature of the splitting.

When ``diagonal_basis: true`` is set (the default), the saved pseudodata indices are labeled as
``eigenmode <i>``, corresponding to the diagonal basis used in the fit. In the presence of a theory covariance matrix,
``vp-setupfit`` writes a single file containing the eigenvalues of the total correlation matrix and the rotation matrix that diagonalises
the :math:`\chi^2` to
``<fit_directory>/tables/datacuts_theory_theorycovmatconfig_fitting_covmat_table.csv``.
Without a theory covariance matrix, ``vp-setupfit`` writes this file to
``<fit_directory>/tables/datacuts_theory_fitting_covmat_table.csv`` instead.
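
As a rough illustration of the two file locations described above, the sketch below picks the table path from a single flag. The helper name ``covmat_table_path`` and its ``has_theory_covmat`` argument are hypothetical and not part of validphys; only the two file names are taken from the text.

```python
from pathlib import Path


def covmat_table_path(fit_directory, has_theory_covmat):
    """Hypothetical helper: locate the table written by vp-setupfit."""
    if has_theory_covmat:
        name = "datacuts_theory_theorycovmatconfig_fitting_covmat_table.csv"
    else:
        name = "datacuts_theory_fitting_covmat_table.csv"
    # Both variants live in the ``tables`` folder of the fit directory.
    return Path(fit_directory) / "tables" / name


with_thcov = covmat_table_path("my_fit", has_theory_covmat=True)
without_thcov = covmat_table_path("my_fit", has_theory_covmat=False)
```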

Imposing sum rules
^^^^^^^^^^^^^^^^^^
3 changes: 3 additions & 0 deletions n3fit/src/n3fit/scripts/n3fit_exec.py
@@ -198,6 +198,9 @@ def from_yaml(cls, o, *args, **kwargs):
)
N3FIT_FIXED_CONFIG['point_prescriptions'] = thconfig.get('point_prescriptions')
N3FIT_FIXED_CONFIG['user_covmat_path'] = thconfig.get('user_covmat_path')
# vp-setupfit has already written the theory-covmat CSV. n3fit
# should load it instead of rebuilding from scratch.
N3FIT_FIXED_CONFIG['load_thcovmat_from_file'] = True

file_content.update(N3FIT_FIXED_CONFIG)
return cls(file_content, *args, **kwargs)
33 changes: 22 additions & 11 deletions n3fit/src/n3fit/scripts/vp_setupfit.py
@@ -25,6 +25,7 @@
# top.


import copy
import hashlib
import logging
import pathlib
@@ -54,9 +55,10 @@
'validphys.results',
'validphys.theorycovariance.construction',
'validphys.photon.compute',
'validphys.n3fit_data',
]

SETUPFIT_DEFAULTS = dict(use_cuts='internal')
SETUPFIT_DEFAULTS = dict(use_cuts='internal', use_t0=True)


log = logging.getLogger(__name__)
@@ -141,6 +143,9 @@ class SetupFitConfig(Config):

@classmethod
def from_yaml(cls, o, *args, **kwargs):
# Create a fresh copy of the fixed config to avoid in-place modifications
fixed_config = copy.deepcopy(SETUPFIT_FIXED_CONFIG)

try:
file_content = yaml_safe.load(o)
except error.YAMLError as e:
@@ -156,10 +161,10 @@ def from_yaml(cls, o, *args, **kwargs):
# Use faketheoryid to create the L0 data to be stored into the filter folder
# (L1 data is stored if fakedata is True)
if 'faketheoryid' in closuredict:
# make sure theory key exists in SETUPFIT_FIXED_CONFIG
SETUPFIT_FIXED_CONFIG.setdefault('theory', {})
# make sure theory key exists in fixed_config
fixed_config.setdefault('theory', {})
# overwrite theoryid with the faketheoryid
SETUPFIT_FIXED_CONFIG['theory']['theoryid'] = closuredict['faketheoryid']
fixed_config['theory']['theoryid'] = closuredict['faketheoryid']
# download theoryid since it will be used in the fit
try:
loader.check_theoryID(file_content['theory']['theoryid'])
@@ -171,8 +176,14 @@ def from_yaml(cls, o, *args, **kwargs):
filter_action = 'datacuts::theory::fitting filter'
check_n3fit_action = 'datacuts::theory::fitting n3fit_checks_action'

# Add rotation action for the total covariance matrix
if file_content.get('theorycovmatconfig') is not None:
rotation_action = 'datacuts::theory::theorycovmatconfig fitting_covmat_table'
else:
rotation_action = 'datacuts::theory fitting_covmat_table'

# The settings for these actions depend on the presence of closuretest
SETUPFIT_FIXED_CONFIG['actions_'] += [check_n3fit_action, filter_action]
fixed_config['actions_'] += [check_n3fit_action, filter_action, rotation_action]

# Check theory covariance matrix configuration
thconfig = file_content.get('theorycovmatconfig', {})
@@ -184,7 +195,7 @@ def from_yaml(cls, o, *args, **kwargs):
"`point_prescriptions: ['9 point', '3 point']`"
)
if thconfig:
SETUPFIT_FIXED_CONFIG['actions_'].append(
fixed_config['actions_'].append(
'datacuts::theory::theorycovmatconfig nnfit_theory_covmat'
)

@@ -214,14 +225,14 @@ def from_yaml(cls, o, *args, **kwargs):
if compute_in_setupfit:
log.info("Forcing photon computation with FiatLux during setupfit.")
# Since the photon will be computed, check that the luxset and additional_errors exist
SETUPFIT_FIXED_CONFIG['actions_'].append('fiatlux check_luxset_exists')
fixed_config['actions_'].append('fiatlux check_luxset_exists')
if fiatlux.get("additional_errors"):
SETUPFIT_FIXED_CONFIG['actions_'].append('fiatlux check_additional_errors')
SETUPFIT_FIXED_CONFIG['actions_'].append('fiatlux::theory compute_photon_to_disk')
fixed_config['actions_'].append('fiatlux check_additional_errors')
fixed_config['actions_'].append('fiatlux::theory compute_photon_to_disk')

# Check positivity bound
if file_content.get('positivity_bound') is not None:
SETUPFIT_FIXED_CONFIG['actions_'].append('positivity_bound check_unpolarized_bc')
fixed_config['actions_'].append('positivity_bound check_unpolarized_bc')

# Check hyperscan
trials_config = file_content.get('trial_specs', {})
@@ -233,7 +244,7 @@ def from_yaml(cls, o, *args, **kwargs):
file_content.setdefault(k, v)

# Update file content with fixed configuration
file_content.update(SETUPFIT_FIXED_CONFIG)
file_content.update(fixed_config)

return cls(file_content, *args, **kwargs)
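
The switch from mutating ``SETUPFIT_FIXED_CONFIG`` to working on a deep copy can be illustrated with a toy example. This is a sketch under simplified assumptions: ``TEMPLATE``, ``from_yaml_mutating`` and ``from_yaml_copying`` are stand-ins invented here, not the actual ``SetupFitConfig`` code. It shows that ``+=`` on a list stored in a shared dictionary changes that list in place, so repeated calls accumulate actions unless each call starts from a fresh copy.

```python
import copy

# Stand-in for a module-level fixed configuration.
TEMPLATE = {"actions_": ["base_action"]}


def from_yaml_mutating(shared, extra):
    # In-place list extension: the change persists in ``shared``.
    shared["actions_"] += [extra]
    return shared


def from_yaml_copying(shared, extra):
    # A deep copy also duplicates the nested list, so ``shared`` is untouched.
    fixed = copy.deepcopy(shared)
    fixed["actions_"] += [extra]
    return fixed


shared_a = copy.deepcopy(TEMPLATE)
from_yaml_mutating(shared_a, "filter")
second_mutating = from_yaml_mutating(shared_a, "filter")

shared_b = copy.deepcopy(TEMPLATE)
from_yaml_copying(shared_b, "filter")
second_copying = from_yaml_copying(shared_b, "filter")
```

After two calls the mutating variant carries the ``filter`` action twice, while the copying variant carries it once and leaves the shared dictionary unchanged. A shallow copy would not be enough here, because the nested list would still be shared.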

10 changes: 7 additions & 3 deletions n3fit/src/n3fit/tests/test_fit.py
@@ -283,8 +283,11 @@ def test_parallel_against_sequential(tmp_path, rep_from=6, rep_to=8):
"ATLAS_TTBAR_8TEV_TOT_X-SEC",
"CMS_SINGLETOP_13TEV_TCHANNEL-XSEC",
]
dataset_inputs = [{"dataset": d, "frac": 0.6, "variant": "legacy"} for d in datasets]
dataset_inputs = [{"dataset": d, "variant": "legacy"} for d in datasets]
n3fit_input["dataset_inputs"] = dataset_inputs
# Use the diagonal basis
n3fit_input["diagonal_basis"] = True
n3fit_input["diagonal_frac"] = 0.5
# Exit immediately
n3fit_input["parameters"]["epochs"] = 1
# Save pseudodata
@@ -311,8 +314,9 @@ def test_parallel_against_sequential(tmp_path, rep_from=6, rep_to=8):
for csvfile_seq in folder_seq.glob("*/*.csv"):
csvfile_par = folder_par / csvfile_seq.relative_to(folder_seq)

result_seq = pd.read_csv(csvfile_seq, sep="\t", index_col=[0, 1, 2], header=0)
result_par = pd.read_csv(csvfile_par, sep="\t", index_col=[0, 1, 2], header=0)
# Diagonal basis writes single-index csv files
result_seq = pd.read_csv(csvfile_seq, sep="\t", index_col=[0], header=0)
result_par = pd.read_csv(csvfile_par, sep="\t", index_col=[0], header=0)
pd.testing.assert_frame_equal(result_seq, result_par)

# Check the rest of the fit, while numerical differences are expected between sequential
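
The change from ``index_col=[0, 1, 2]`` to ``index_col=[0]`` in the comparison above can be reproduced on in-memory data. This is a minimal sketch: the tab-separated content, the ``data`` column name and the numbers are invented for illustration, and only the ``eigenmode <i>`` labelling follows the documentation in this pull request.

```python
import io

import pandas as pd

# Made-up tab-separated tables with one index column labelled by eigenmode.
tsv_seq = "\tdata\neigenmode 0\t1.5\neigenmode 1\t-0.3\n"
tsv_par = "\tdata\neigenmode 0\t1.5\neigenmode 1\t-0.3\n"

# Same read_csv arguments as in the test, applied to in-memory buffers.
result_seq = pd.read_csv(io.StringIO(tsv_seq), sep="\t", index_col=[0], header=0)
result_par = pd.read_csv(io.StringIO(tsv_par), sep="\t", index_col=[0], header=0)

# Raises AssertionError if index, columns, dtypes or values differ.
pd.testing.assert_frame_equal(result_seq, result_par)
```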
66 changes: 53 additions & 13 deletions validphys2/src/validphys/config.py
@@ -805,16 +805,36 @@ def produce_experiment_from_input(self, experiment_input, theoryid, use_cuts, fi
}

@configparser.explicit_node
def produce_dataset_inputs_fitting_covmat(self, use_thcovmat_in_fitting=False):
def produce_dataset_inputs_fitting_covmat(
self, use_thcovmat_in_fitting=False, load_thcovmat_from_file=False
):
"""
Produces the correct covmat to be used in fitting_data_dict according
to some options: whether to include the theory covmat, whether to
separate the multiplicative errors and whether to compute the
experimental covmat using the t0 prescription.
Dispatcher node for the covmat used in ``fitting_data_dict``.

Returns the experimental t0 covmat (``dataset_inputs_t0_exp_covmat``) when
no theory covmat is requested. When ``use_thcovmat_in_fitting=True``,
returns the total (experimental + theory) covmat — either rebuilt from
scratch via ``nnfit_theory_covmat`` (``load_thcovmat_from_file=False``,
the default) or with the theory part loaded from a CSV previously
written by ``vp-setupfit`` (``load_thcovmat_from_file=True``).

The two contexts:

* **vp-setupfit** — leaves ``load_thcovmat_from_file`` at the default. It
*writes* the theory covmat CSV, so the load variant would raise
FileNotFoundError on a file that does not yet exist. The result feeds
``setupfit_fitting_covmat``, which serialises either the full fitting
covmat or its diagonal-basis rotation table.
* **n3fit** — sets ``load_thcovmat_from_file=True`` (see
``n3fit_exec.py``). The result feeds ``_inv_covmat_prepared``, which
prepares the inverse for the fit.
"""

from validphys import covmats

if use_thcovmat_in_fitting:
if load_thcovmat_from_file:
return covmats.dataset_load_inputs_t0_total_covmat
return covmats.dataset_inputs_t0_total_covmat
return covmats.dataset_inputs_t0_exp_covmat

@@ -830,8 +850,13 @@ def produce_dataset_inputs_sampling_covmat(
"""
Produces the correct MC replica method sampling covmat to be used in
make_replica according to some options: whether to sample using a t0
covariance matrix, include the theory covmat and whether to
separate the multiplicative errors.
covariance matrix, include the theory covmat and whether to separate the
multiplicative errors.

This node is never invoked by setupfit, but is used in n3fit when
sampling the MC replicas for the fit (``make_replica``). It routes to
the load variants under ``use_thcovmat_in_sampling=True``, which load
the theory covmat from the CSV file generated by setupfit.

Parameters
----------
@@ -851,9 +876,9 @@ def produce_dataset_inputs_sampling_covmat(
if use_t0_sampling:
if use_thcovmat_in_sampling:
if sep_mult:
return covmats.dataset_inputs_t0_total_covmat_separate
return covmats.dataset_load_inputs_t0_total_covmat_separate
else:
return covmats.dataset_inputs_t0_total_covmat
return covmats.dataset_load_inputs_t0_total_covmat
else:
if sep_mult:
return covmats.dataset_inputs_t0_exp_covmat_separate
@@ -863,15 +888,28 @@ def produce_dataset_inputs_sampling_covmat(
else:
if use_thcovmat_in_sampling:
if sep_mult:
return covmats.dataset_inputs_total_covmat_separate
return covmats.dataset_load_inputs_total_covmat_separate
else:
return covmats.dataset_inputs_total_covmat
return covmats.dataset_load_inputs_total_covmat
else:
if sep_mult:
return covmats.dataset_inputs_exp_covmat_separate
else:
return covmats.dataset_inputs_exp_covmat
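
The routing in ``produce_dataset_inputs_sampling_covmat`` follows a regular naming pattern, which the sketch below makes explicit. The helper ``sampling_covmat_name`` is hypothetical and returns only the name of the provider as a string; it does not import or call anything from ``validphys.covmats``, and it assumes the branches hidden by the collapsed diff context follow the same pattern as the visible ones.

```python
def sampling_covmat_name(use_t0_sampling, use_thcovmat_in_sampling, sep_mult):
    """Hypothetical helper: rebuild the provider name from the three flags."""
    parts = ["dataset"]
    # With a theory covmat, the ``load`` variants read it from the setupfit CSV.
    if use_thcovmat_in_sampling:
        parts.append("load")
    parts.append("inputs")
    if use_t0_sampling:
        parts.append("t0")
    parts.append("total" if use_thcovmat_in_sampling else "exp")
    parts.append("covmat")
    if sep_mult:
        parts.append("separate")
    return "_".join(parts)


t0_total_sep = sampling_covmat_name(True, True, True)
plain_total = sampling_covmat_name(False, True, False)
plain_exp_sep = sampling_covmat_name(False, False, True)
```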

def produce_fitting_covmat_name(self, fit):
"""Produce the name of the covmat to be used in fitting,
according to how it was generated by vp-setupfit.
"""
runcard = fit.as_input()
use_thcovmat = runcard.get("theorycovmatconfig", {}).get("use_thcovmat_in_fitting", False)
if use_thcovmat:
covmat_name = "datacuts_theory_theorycovmatconfig_fitting_covmat_table.csv"
else:
covmat_name = "datacuts_theory_fitting_covmat_table.csv"
path = fit.path / "tables" / covmat_name
return path

def produce_loaded_theory_covmat(
self,
output_path,
@@ -885,6 +923,7 @@ def produce_loaded_theory_covmat(
Loads the theory covmat from the correct file according to how it
was generated by vp-setupfit.
"""

if not use_thcovmat_in_sampling and not use_thcovmat_in_fitting:
return 0.0
# Load correct file according to how the thcovmat was generated by vp-setupfit
@@ -1328,6 +1367,7 @@ def produce_nnfit_theory_covmat(
This function is only used in vp-setupfit to store the necessary covmats as .csv files in
the tables directory.
"""

if point_prescriptions is not None:
if user_covmat_path is not None:
# Both scalevar and user uncertainties
@@ -1336,9 +1376,9 @@ def produce_nnfit_theory_covmat(
f = total_theory_covmat_fitting
else:
# Only scalevar uncertainties
from validphys.theorycovariance.construction import theory_covmat_custom
from validphys.theorycovariance.construction import theory_covmat_custom_fitting

f = theory_covmat_custom
f = theory_covmat_custom_fitting
elif user_covmat_path is not None:
# Only user uncertainties
from validphys.theorycovariance.construction import user_covmat_fitting