diff --git a/doc/sphinx/source/n3fit/methodology.rst b/doc/sphinx/source/n3fit/methodology.rst index a4543c8cc8..ddc36d3984 100644 --- a/doc/sphinx/source/n3fit/methodology.rst +++ b/doc/sphinx/source/n3fit/methodology.rst @@ -348,53 +348,90 @@ The figure above provides a schematic representation of this feature scaling met Diagonal basis -------------- -Performing the training and validation split without diagonalising the :math:`t_0` covmat :math:`C_{0}` neglects -any correlations that may be present between training and validation data. To remedy this, -we rotate to a basis in which the correlation matrix is diagonal before performing any training/validation split. -Starting from the definition of the :math:`\chi^2` function in the NNPDF methodology, we have +Training and validation data are obtained by performing a random split of the +original data set. However, data points in the two sets are not necessarily +statistically independent, as they may be correlated through the fitting +covariance matrix :math:`C_{\rm fit}`. Here the fitting covariance matrix is the +sum of the :math:`t_{0}` experimental covariance matrix :math:`C_{0}` and any +theory covariance matrix :math:`C_{\rm th}` used in the fit, i.e., :math:`C_{\rm +fit} = C_{0} + C_{\rm th}`. In order to disentangle the training and +validation data, we perform the training-validation split in a basis in which +the correlation matrix is diagonal. + +We first compute the correlation matrix :math:`\rho`, which is defined as .. math:: - \chi^2 &= (D-T)^T C_0^{-1} (D-T) \\ - &= (D-T)^T R^{-1} R C_0^{-1} R R^{-1} (D-T) \\ - &= (D-T)^T R^{-1} \left( R^{-1} C_0 R^{-1} \right)^{-1} R^{-1} (D-T) \\ - &\equiv \tilde{\epsilon}^T \rho^{-1} \tilde{\epsilon} \, , + \rho = \Sigma^{-1} C_{\rm fit} \Sigma^{-1} \, , -where we have defined :math:`\tilde{\epsilon} \equiv R^{-1}(D-T)` and :math:`\rho = R^{-1} C_0 R^{-1}`. +where we have defined :math:`\Sigma_{ij} = \sqrt{C_{\rm fit, ii}} \delta_{ij}` and +:math:`(\Sigma^{-1})_{ij} = \frac{1}{\sqrt{C_{\rm fit, ii}}} \delta_{ij}`. The +correlation matrix is a symmetric positive-definite matrix, and therefore it can +be diagonalized by an orthogonal transformation. Therefore we can write -Choosing :math:`R_{ii} = \sqrt{C_{0, ii}}`, we have that :math:`R^{-1} C_0 R^{-1}` coincides with the usual definition of the correlation matrix. +.. math:: + + \rho = V \Lambda V^T \, , + +where :math:`V` is an orthogonal matrix and :math:`\Lambda` is a diagonal matrix +containing the eigenvalues of :math:`\rho`. The original fitting covariance +matrix can then be written as + +.. math:: + + C_{\rm fit} &= \Sigma \rho \Sigma \\ + &= (\Sigma V) \Lambda (V^T \Sigma) \\ + &\equiv U \Lambda U^T \, , -Next, we move to the basis in which :math:`\rho` is diagonal. Writing :math:`\rho = \tilde{U}^T \tilde{\Lambda} \tilde{U}`, we find +where we have defined the non-orthogonal matrix :math:`U = \Sigma V`. Its +inverse defines the rotation matrix that diagonalizes the +:math:`\chi^2`, and is given by :math:`R^T \equiv U^{-1} = V^T \Sigma^{-1}`. +Therefore, the inverse of the fitting covariance matrix can be written as .. math:: - \chi^2 &= \tilde{\epsilon}^T \rho^{-1} \tilde{\epsilon} \\ - &= \tilde{\epsilon}^T (\tilde{U}^T \tilde{\Lambda} \tilde{U})^{-1} \tilde{\epsilon} \\ - &= \tilde{\epsilon}^T \tilde{U}^T \tilde{\Lambda}^{-1} \tilde{U} \tilde{\epsilon} \\ - &\equiv \tilde{\tilde{\epsilon}}^T \tilde{\Lambda}^{-1} \tilde{\tilde{\epsilon}} \, , + C_{\rm fit}^{-1} &= (U \Lambda U^T)^{-1} \\ + &= (U^T)^{-1} \Lambda^{-1} U^{-1} \\ + &= R \Lambda^{-1} R^T \, . -where on the last line we have defined +Considering the definition of the :math:`\chi^2` function in the NNPDF +methodology, we finally have .. math:: - \tilde{\tilde{\epsilon}} \equiv \tilde{U}\tilde{\epsilon} = \tilde{U}R^{-1}(D-T). + \chi^2 &= (D-T)^T C_{\rm fit}^{-1} (D-T) \\ + &= (D-T)^T R \Lambda^{-1} R^T (D-T) \\ + &= \epsilon^T \Lambda^{-1} \epsilon \\ + &= \lVert \epsilon \rVert^2_{\Lambda^{-1}} \, , + +where we have defined the residuals in the diagonal basis as :math:`\epsilon \equiv R^T(D-T)` or, writing it in index notation, + +.. math:: + + \epsilon_i = (V^T)_{ij} \frac{(D-T)_j}{\sqrt{C_{\rm fit, jj}}} \,. + +In this basis, the :math:`\chi^2` becomes a weighted norm of the residuals, +where the weights are given by the inverse of the eigenvalues of the correlation +matrix. -In index notation, this reads +The transformed data :math:`\epsilon` are statistically independent in the +diagonal basis of the correlation matrix :math:`\rho`. As a crosscheck, we +can compute the covariance of :math:`\epsilon`, .. math:: - \tilde{\tilde{\epsilon_i}} = \tilde{U}_{ij} \frac{(D-T)_j}{\sqrt{C_{0, jj}}} + \mathbb{E}[\epsilon \epsilon^T] &= \mathbb{E}[R^T(D-T)(D-T)^T R] \\ + &= R^T \mathbb{E}[(D-T)(D-T)^T] R \\ + &= R^T C_{\rm fit} R \\ + &= R^T U \Lambda U^T R \\ + &= \Lambda \,, -The transformed data :math:`\tilde{\tilde{\epsilon}}` is statistically independent in the diagonal basis of the correlation matrix :math:`\rho`. -Computing the covariance of :math:`\tilde{\tilde{\epsilon}}`, +where we have used the fact that :math:`R^T U = I` and the assumption that the +data are distributed according to the fitting covariance matrix :math:`C_{\rm fit}` .. math:: - \mathbb{E}[\tilde{\tilde{\epsilon}}\tilde{\tilde{\epsilon}}^T] - &= \mathbb{E} \big[ (\tilde{U} R^{-1}(D-T)) (\tilde{U} R^{-1}(D-T))^T \big] \\ - &= \tilde{U} R^{-1} \mathbb{E}[(D-T)(D-T)^T] R^{-1} \tilde{U}^T \\ - &= \tilde{U} \rho \tilde{U}^T \\ - &= \tilde{U}\tilde{U}^T \tilde{\Lambda} \tilde{U}\tilde{U}^T \\ - &= \tilde{\Lambda} \, , + \mathbb{E}[(D-T)(D-T)^T] = C_{\rm fit} \, . -we find that it is diagonal, which demonstrates that the training/validation data are statistically independent indeed. +This shows that the correlation is indeed diagonal, and demonstrates that the +training/validation data are uncorrelated. diff --git a/doc/sphinx/source/n3fit/runcard_detailed.rst b/doc/sphinx/source/n3fit/runcard_detailed.rst index f5ea1643eb..5fb2a73349 100644 --- a/doc/sphinx/source/n3fit/runcard_detailed.rst +++ b/doc/sphinx/source/n3fit/runcard_detailed.rst @@ -429,6 +429,13 @@ according to their experiment. Additionally, the union of these two is saved in ``/replica_/datacuts_theory_fitting_pseudodata_table.csv`` if one is not interested in the exact nature of the splitting. +When ``diagonal_basis: true`` is used (by default), the saved pseudodata indices are labeled as +``eigenmode `` corresponding to the diagonal basis used in the fit. In the presence of a theory covariance matrix, +``vp-setupfit`` writes one file with the eigenvalues of the total correlation matrix and the rotation matrix that diagonalises +the :math:`\chi2` under +``/tables/datacuts_theory_theorycovmatconfig_fitting_covmat_table.csv``. +Without a theory covariance matrix, ``vp-setupfit`` writes this file instead under +``/tables/datacuts_theory_fitting_covmat_table.csv``. Imposing sum rules ^^^^^^^^^^^^^^^^^^ diff --git a/n3fit/src/n3fit/scripts/n3fit_exec.py b/n3fit/src/n3fit/scripts/n3fit_exec.py index d8b6c408ac..b7071111d9 100755 --- a/n3fit/src/n3fit/scripts/n3fit_exec.py +++ b/n3fit/src/n3fit/scripts/n3fit_exec.py @@ -198,6 +198,9 @@ def from_yaml(cls, o, *args, **kwargs): ) N3FIT_FIXED_CONFIG['point_prescriptions'] = thconfig.get('point_prescriptions') N3FIT_FIXED_CONFIG['user_covmat_path'] = thconfig.get('user_covmat_path') + # vp-setupfit has already written the theory-covmat CSV. n3fit + # should load it instead of rebuilding from scratch. + N3FIT_FIXED_CONFIG['load_thcovmat_from_file'] = True file_content.update(N3FIT_FIXED_CONFIG) return cls(file_content, *args, **kwargs) diff --git a/n3fit/src/n3fit/scripts/vp_setupfit.py b/n3fit/src/n3fit/scripts/vp_setupfit.py index 8ed0d8766a..a5ba367d10 100644 --- a/n3fit/src/n3fit/scripts/vp_setupfit.py +++ b/n3fit/src/n3fit/scripts/vp_setupfit.py @@ -25,6 +25,7 @@ # top. +import copy import hashlib import logging import pathlib @@ -54,9 +55,10 @@ 'validphys.results', 'validphys.theorycovariance.construction', 'validphys.photon.compute', + 'validphys.n3fit_data', ] -SETUPFIT_DEFAULTS = dict(use_cuts='internal') +SETUPFIT_DEFAULTS = dict(use_cuts='internal', use_t0=True) log = logging.getLogger(__name__) @@ -141,6 +143,9 @@ class SetupFitConfig(Config): @classmethod def from_yaml(cls, o, *args, **kwargs): + # Create a fresh copy of the fixed config to avoid in-place modifications + fixed_config = copy.deepcopy(SETUPFIT_FIXED_CONFIG) + try: file_content = yaml_safe.load(o) except error.YAMLError as e: @@ -156,10 +161,10 @@ def from_yaml(cls, o, *args, **kwargs): # Use faketheoryid to create the L0 data to be stored into the filter folder # (L1 data is stored if fakedata is True) if 'faketheoryid' in closuredict: - # make sure theory key exists in SETUPFIT_FIXED_CONFIG - SETUPFIT_FIXED_CONFIG.setdefault('theory', {}) + # make sure theory key exists in fixed_config + fixed_config.setdefault('theory', {}) # overwrite theoryid with the faketheoryid - SETUPFIT_FIXED_CONFIG['theory']['theoryid'] = closuredict['faketheoryid'] + fixed_config['theory']['theoryid'] = closuredict['faketheoryid'] # download theoryid since it will be used in the fit try: loader.check_theoryID(file_content['theory']['theoryid']) @@ -171,8 +176,14 @@ def from_yaml(cls, o, *args, **kwargs): filter_action = 'datacuts::theory::fitting filter' check_n3fit_action = 'datacuts::theory::fitting n3fit_checks_action' + # Add rotation action for the total covariance matrix + if file_content.get('theorycovmatconfig') is not None: + rotation_action = 'datacuts::theory::theorycovmatconfig fitting_covmat_table' + else: + rotation_action = 'datacuts::theory fitting_covmat_table' + # The settings for these actions depend on the presence of closuretest - SETUPFIT_FIXED_CONFIG['actions_'] += [check_n3fit_action, filter_action] + fixed_config['actions_'] += [check_n3fit_action, filter_action, rotation_action] # Check theory covariance matrix configuration thconfig = file_content.get('theorycovmatconfig', {}) @@ -184,7 +195,7 @@ def from_yaml(cls, o, *args, **kwargs): "`point_prescriptions: ['9 point', '3 point']`" ) if thconfig: - SETUPFIT_FIXED_CONFIG['actions_'].append( + fixed_config['actions_'].append( 'datacuts::theory::theorycovmatconfig nnfit_theory_covmat' ) @@ -214,14 +225,14 @@ def from_yaml(cls, o, *args, **kwargs): if compute_in_setupfit: log.info("Forcing photon computation with FiatLux during setupfit.") # Since the photon will be computed, check that the luxset and additional_errors exist - SETUPFIT_FIXED_CONFIG['actions_'].append('fiatlux check_luxset_exists') + fixed_config['actions_'].append('fiatlux check_luxset_exists') if fiatlux.get("additional_errors"): - SETUPFIT_FIXED_CONFIG['actions_'].append('fiatlux check_additional_errors') - SETUPFIT_FIXED_CONFIG['actions_'].append('fiatlux::theory compute_photon_to_disk') + fixed_config['actions_'].append('fiatlux check_additional_errors') + fixed_config['actions_'].append('fiatlux::theory compute_photon_to_disk') # Check positivity bound if file_content.get('positivity_bound') is not None: - SETUPFIT_FIXED_CONFIG['actions_'].append('positivity_bound check_unpolarized_bc') + fixed_config['actions_'].append('positivity_bound check_unpolarized_bc') # Check hyperscan trials_config = file_content.get('trial_specs', {}) @@ -233,7 +244,7 @@ def from_yaml(cls, o, *args, **kwargs): file_content.setdefault(k, v) # Update file content with fixed configuration - file_content.update(SETUPFIT_FIXED_CONFIG) + file_content.update(fixed_config) return cls(file_content, *args, **kwargs) diff --git a/n3fit/src/n3fit/tests/test_fit.py b/n3fit/src/n3fit/tests/test_fit.py index c7f10e3faf..088c797000 100644 --- a/n3fit/src/n3fit/tests/test_fit.py +++ b/n3fit/src/n3fit/tests/test_fit.py @@ -283,8 +283,11 @@ def test_parallel_against_sequential(tmp_path, rep_from=6, rep_to=8): "ATLAS_TTBAR_8TEV_TOT_X-SEC", "CMS_SINGLETOP_13TEV_TCHANNEL-XSEC", ] - dataset_inputs = [{"dataset": d, "frac": 0.6, "variant": "legacy"} for d in datasets] + dataset_inputs = [{"dataset": d, "variant": "legacy"} for d in datasets] n3fit_input["dataset_inputs"] = dataset_inputs + # Using diaogonal basis + n3fit_input["diagonal_basis"] = True + n3fit_input["diagonal_frac"] = 0.5 # Exit inmediately n3fit_input["parameters"]["epochs"] = 1 # Save pseudodata @@ -311,8 +314,9 @@ def test_parallel_against_sequential(tmp_path, rep_from=6, rep_to=8): for csvfile_seq in folder_seq.glob("*/*.csv"): csvfile_par = folder_par / csvfile_seq.relative_to(folder_seq) - result_seq = pd.read_csv(csvfile_seq, sep="\t", index_col=[0, 1, 2], header=0) - result_par = pd.read_csv(csvfile_par, sep="\t", index_col=[0, 1, 2], header=0) + # Diagonal basis writes single-index csv files + result_seq = pd.read_csv(csvfile_seq, sep="\t", index_col=[0], header=0) + result_par = pd.read_csv(csvfile_par, sep="\t", index_col=[0], header=0) pd.testing.assert_frame_equal(result_seq, result_par) # Check the rest of the fit, while numerical differences are expected between sequential diff --git a/validphys2/src/validphys/config.py b/validphys2/src/validphys/config.py index 910fc1fd7b..3e27fd28d8 100644 --- a/validphys2/src/validphys/config.py +++ b/validphys2/src/validphys/config.py @@ -805,16 +805,36 @@ def produce_experiment_from_input(self, experiment_input, theoryid, use_cuts, fi } @configparser.explicit_node - def produce_dataset_inputs_fitting_covmat(self, use_thcovmat_in_fitting=False): + def produce_dataset_inputs_fitting_covmat( + self, use_thcovmat_in_fitting=False, load_thcovmat_from_file=False + ): """ - Produces the correct covmat to be used in fitting_data_dict according - to some options: whether to include the theory covmat, whether to - separate the multiplcative errors and whether to compute the - experimental covmat using the t0 prescription. + Dispatcher node for the covmat used in ``fitting_data_dict``. + + Returns the experimental t0 covmat (``dataset_inputs_t0_exp_covmat``) when + no theory covmat is requested. When ``use_thcovmat_in_fitting=True``, + returns the total (experimental + theory) covmat — either rebuilt from + scratch via ``nnfit_theory_covmat`` (``load_thcovmat_from_file=False``, + the default) or with the theory part loaded from a CSV previously + written by ``vp-setupfit`` (``load_thcovmat_from_file=True``). + + The two contexts: + + * **vp-setupfit** — leaves ``load_thcovmat_from_file`` at the default. It + *writes* the theory covmat CSV, so the load variant would raise + FileNotFoundError on a file that does not yet exist. The result feeds + ``setupfit_fitting_covmat``, which serialises either the full fitting + covmat or its diagonal-basis rotation table. + * **n3fit** — sets ``load_thcovmat_from_file=True`` (see + ``n3fit_exec.py``). The result feeds ``_inv_covmat_prepared``, which + prepares the inverse for the fit. """ + from validphys import covmats if use_thcovmat_in_fitting: + if load_thcovmat_from_file: + return covmats.dataset_load_inputs_t0_total_covmat return covmats.dataset_inputs_t0_total_covmat return covmats.dataset_inputs_t0_exp_covmat @@ -830,8 +850,13 @@ def produce_dataset_inputs_sampling_covmat( """ Produces the correct MC replica method sampling covmat to be used in make_replica according to some options: whether to sample using a t0 - covariance matrix, include the theory covmat and whether to - separate the multiplcative errors. + covariance matrix, include the theory covmat and whether to separate the + multiplcative errors. + + This node is never invoked by setupfit, but is used in n3fit when + sampling the MC replicas for the fit (``make_replica``). It routes to + the load variants under ``use_thcovmat_in_sampling=True``, which load + the theory covmat from the CSV file generated by setupfit. Parameters ---------- @@ -851,9 +876,9 @@ def produce_dataset_inputs_sampling_covmat( if use_t0_sampling: if use_thcovmat_in_sampling: if sep_mult: - return covmats.dataset_inputs_t0_total_covmat_separate + return covmats.dataset_load_inputs_t0_total_covmat_separate else: - return covmats.dataset_inputs_t0_total_covmat + return covmats.dataset_load_inputs_t0_total_covmat else: if sep_mult: return covmats.dataset_inputs_t0_exp_covmat_separate @@ -863,15 +888,28 @@ def produce_dataset_inputs_sampling_covmat( else: if use_thcovmat_in_sampling: if sep_mult: - return covmats.dataset_inputs_total_covmat_separate + return covmats.dataset_load_inputs_total_covmat_separate else: - return covmats.dataset_inputs_total_covmat + return covmats.dataset_load_inputs_total_covmat else: if sep_mult: return covmats.dataset_inputs_exp_covmat_separate else: return covmats.dataset_inputs_exp_covmat + def produce_fitting_covmat_name(self, fit): + """Produce the name of the covmat to be used in fitting, + according to how it was generated by vp-setupfit. + """ + runcard = fit.as_input() + use_thcovmat = runcard.get("theorycovmatconfig", {}).get("use_thcovmat_in_fitting", False) + if use_thcovmat: + covmat_name = "datacuts_theory_theorycovmatconfig_fitting_covmat_table.csv" + else: + covmat_name = "datacuts_theory_fitting_covmat_table.csv" + path = fit.path / "tables" / covmat_name + return path + def produce_loaded_theory_covmat( self, output_path, @@ -885,6 +923,7 @@ def produce_loaded_theory_covmat( Loads the theory covmat from the correct file according to how it was generated by vp-setupfit. """ + if not use_thcovmat_in_sampling and not use_thcovmat_in_fitting: return 0.0 # Load correct file according to how the thcovmat was generated by vp-setupfit @@ -1328,6 +1367,7 @@ def produce_nnfit_theory_covmat( This function is only used in vp-setupfit to store the necessary covmats as .csv files in the tables directory. """ + if point_prescriptions is not None: if user_covmat_path is not None: # Both scalevar and user uncertainties @@ -1336,9 +1376,9 @@ def produce_nnfit_theory_covmat( f = total_theory_covmat_fitting else: # Only scalevar uncertainties - from validphys.theorycovariance.construction import theory_covmat_custom + from validphys.theorycovariance.construction import theory_covmat_custom_fitting - f = theory_covmat_custom + f = theory_covmat_custom_fitting elif user_covmat_path is not None: # Only user uncertainties from validphys.theorycovariance.construction import user_covmat_fitting diff --git a/validphys2/src/validphys/covmats.py b/validphys2/src/validphys/covmats.py index 195371ad5d..7de450dcc0 100644 --- a/validphys2/src/validphys/covmats.py +++ b/validphys2/src/validphys/covmats.py @@ -225,6 +225,7 @@ def dataset_inputs_covmat_from_systematics( covmat = (covmat / sqrt_weights).T / sqrt_weights if norm_threshold is not None: covmat = regularize_covmat(covmat, norm_threshold=norm_threshold) + return covmat @@ -405,7 +406,7 @@ def dataset_inputs_t0_covmat_from_systematics( ) -def dataset_inputs_t0_total_covmat_separate( +def dataset_load_inputs_t0_total_covmat_separate( dataset_inputs_t0_exp_covmat_separate, loaded_theory_covmat ): """ @@ -416,6 +417,17 @@ def dataset_inputs_t0_total_covmat_separate( return dataset_inputs_t0_exp_covmat_separate + loaded_theory_covmat +def dataset_inputs_t0_total_covmat_separate( + dataset_inputs_t0_exp_covmat_separate, nnfit_theory_covmat +): + """ + Function to compute the covmat to be used for the sampling by make_replica. + In this case the t0 prescription is used for the experimental covmat and the multiplicative + errors are separated. Moreover, the theory covmat is added to experimental covmat. + """ + return dataset_inputs_t0_exp_covmat_separate + nnfit_theory_covmat + + def dataset_inputs_t0_exp_covmat_separate( dataset_inputs_loaded_cd_with_cuts, *, @@ -440,7 +452,9 @@ def dataset_inputs_t0_exp_covmat_separate( return covmat -def dataset_inputs_total_covmat_separate(dataset_inputs_exp_covmat_separate, loaded_theory_covmat): +def dataset_load_inputs_total_covmat_separate( + dataset_inputs_exp_covmat_separate, loaded_theory_covmat +): """ Function to compute the covmat to be used for the sampling by make_replica. In this case the t0 prescription is not used for the experimental covmat and the multiplicative @@ -449,6 +463,15 @@ def dataset_inputs_total_covmat_separate(dataset_inputs_exp_covmat_separate, loa return dataset_inputs_exp_covmat_separate + loaded_theory_covmat +def dataset_inputs_total_covmat_separate(dataset_inputs_exp_covmat_separate, nnfit_theory_covmat): + """ + Function to compute the covmat to be used for the sampling by make_replica. + In this case the t0 prescription is not used for the experimental covmat and the multiplicative + errors are separated. Moreover, the theory covmat is added to experimental covmat. + """ + return dataset_inputs_exp_covmat_separate + nnfit_theory_covmat + + def dataset_inputs_exp_covmat_separate( dataset_inputs_loaded_cd_with_cuts, *, @@ -472,7 +495,17 @@ def dataset_inputs_exp_covmat_separate( return covmat -def dataset_inputs_t0_total_covmat(dataset_inputs_t0_exp_covmat, loaded_theory_covmat): +def dataset_inputs_t0_total_covmat(dataset_inputs_t0_exp_covmat, nnfit_theory_covmat): + """ + Function to compute the covmat to be used for the sampling by make_replica and for the chi2 + by fitting_data_dict. In this case the t0 prescription is used for the experimental covmat + and the multiplicative errors are included in it. Moreover, the theory covmat is added to experimental covmat. + """ + + return dataset_inputs_t0_exp_covmat + nnfit_theory_covmat + + +def dataset_load_inputs_t0_total_covmat(dataset_inputs_t0_exp_covmat, loaded_theory_covmat): """ Function to compute the covmat to be used for the sampling by make_replica and for the chi2 by fitting_data_dict. In this case the t0 prescription is used for the experimental covmat @@ -502,10 +535,11 @@ def dataset_inputs_t0_exp_covmat( dataset_inputs_t0_predictions, False, ) + return covmat -def dataset_inputs_total_covmat(dataset_inputs_exp_covmat, loaded_theory_covmat): +def dataset_load_inputs_total_covmat(dataset_inputs_exp_covmat, loaded_theory_covmat): """ Function to compute the covmat to be used for the sampling by make_replica and for the chi2 by fitting_data_dict. In this case the t0 prescription is not used for the experimental covmat @@ -514,6 +548,15 @@ def dataset_inputs_total_covmat(dataset_inputs_exp_covmat, loaded_theory_covmat) return dataset_inputs_exp_covmat + loaded_theory_covmat +def dataset_inputs_total_covmat(dataset_inputs_exp_covmat, nnfit_theory_covmat): + """ + Function to compute the covmat to be used for the sampling by make_replica and for the chi2 + by fitting_data_dict. In this case the t0 prescription is not used for the experimental covmat + and the multiplicative errors are included in it. Moreover, the theory covmat is added to experimental covmat. + """ + return dataset_inputs_exp_covmat + nnfit_theory_covmat + + def dataset_inputs_exp_covmat( dataset_inputs_loaded_cd_with_cuts, *, diff --git a/validphys2/src/validphys/n3fit_data.py b/validphys2/src/validphys/n3fit_data.py index b5c5e6e239..8d94d4f80f 100644 --- a/validphys2/src/validphys/n3fit_data.py +++ b/validphys2/src/validphys/n3fit_data.py @@ -171,12 +171,12 @@ def __init__(self, group_name, seed, tr_masks, vl_masks): super().__init__(group_name, seed) -def diagonal_masks(data, replica_trvlseed, dataset_inputs_fitting_covmat, diagonal_frac=0.75): +def diagonal_masks(data, replica_trvlseed, dataset_inputs_covmat_t0_considered, diagonal_frac=0.75): nameseed = int(hashlib.sha256(str(data).encode()).hexdigest(), 16) % 10**8 nameseed += replica_trvlseed rng = np.random.Generator(np.random.PCG64(nameseed)) - ndata = len(dataset_inputs_fitting_covmat) + ndata = len(dataset_inputs_covmat_t0_considered) # construct training mask by selecting a fraction of the eigenvalues trmax = int(ndata * diagonal_frac) @@ -319,39 +319,60 @@ def _hashed_dataset_inputs_fitting_covmat(dataset_inputs_fitting_covmat) -> Hash @functools.lru_cache -def _inv_covmat_prepared(_hashed_dataset_inputs_fitting_covmat, diagonal_basis=True): - """Returns the inverse covmats for training, validation and total - attending to the right masks and whether it is diagonal or not. +def _inv_covmat_prepared( + _hashed_dataset_inputs_fitting_covmat, + output_path, + use_thcovmat_in_fitting=False, + diagonal_basis=True, +): + """Prepare the covariance matrix and its inverse, optionally transforming to diagonal basis. - Since the masks and number of datapoints need to be treated for 1-point datasets - it also returns the right ndata and masks for training and validation: + Parameters + ---------- + _hashed_dataset_inputs_fitting_covmat : Hashrray + Wrapped covariance matrix for caching purposes. + output_path : Path + Path to output directory containing diagonal basis data if needed. + diagonal_basis : bool, optional + If True, load pre-computed diagonal basis from tables. If False, use the + covariance matrix directly and compute the inverse via correlation matrix + inversion. Default is True. + Returns + ------- + covmat : np.ndarray + The covariance matrix (diagonal if diagonal_basis=True, full otherwise). + inv_total : np.ndarray + The inverse of the covariance matrix. + diag_rot : np.ndarray or None + The rotation matrix (eigenvectors transposed) for diagonal basis transformation. + Only returned when diagonal_basis=True, otherwise None. + eig_vals : np.ndarray or None + The eigenvalues of the correlation matrix. Only returned when + diagonal_basis=True, otherwise None. """ log.info( - f"_inv_covmat_prepared called with covmat hash={hash(_hashed_dataset_inputs_fitting_covmat)}, diagonal_basis={diagonal_basis}" + f"_inv_covmat_prepared called with covmat hash={hash(_hashed_dataset_inputs_fitting_covmat)}" ) - covmat = _hashed_dataset_inputs_fitting_covmat.array - diagonal_rotation = None + + diag_rot = None eig_vals = None if diagonal_basis: - log.info("working in diagonal basis.") - - # convert covmat to correlation - diag_inv_sqrt = 1 / np.sqrt(np.diag(covmat)) - cormat = np.einsum("i, ij, j -> ij", diag_inv_sqrt, covmat, diag_inv_sqrt) - - # diagonalise the correlation matrix - eig_vals, uT = np.linalg.eigh(cormat) - uT = np.einsum("i, ik -> ik", diag_inv_sqrt, uT) - diagonal_rotation = uT.T - - ndata = len(eig_vals) - + if use_thcovmat_in_fitting: + diagonal_basis_saved = "datacuts_theory_theorycovmatconfig_fitting_covmat_table.csv" + else: + diagonal_basis_saved = "datacuts_theory_fitting_covmat_table.csv" + path_diagonal_basis = output_path / "tables" / diagonal_basis_saved + eigensystem = pd.read_csv( + path_diagonal_basis, index_col=[0], header=[0], sep="\t|,", engine="python" + ) + diag_rot = eigensystem.iloc[:, 1:].values + eig_vals = eigensystem["eig_val"].values + covmat = np.diag(eig_vals) inv_total = np.diag(1 / eig_vals) - else: - + covmat = _hashed_dataset_inputs_fitting_covmat.array diag_inv_sqrt_total = 1 / np.sqrt(np.diag(covmat)) cormat_total = np.einsum("i, ij, j -> ij", diag_inv_sqrt_total, covmat, diag_inv_sqrt_total) inv_total = ( @@ -360,7 +381,60 @@ def _inv_covmat_prepared(_hashed_dataset_inputs_fitting_covmat, diagonal_basis=T @ np.diag(diag_inv_sqrt_total) ) - return (covmat, inv_total, diagonal_rotation, eig_vals) + return covmat, inv_total, diag_rot, eig_vals + + +def setupfit_fitting_covmat(dataset_inputs_fitting_covmat, diagonal_basis=True): + """Prepare the fitting covariance matrix in setupfit and store it for later use + in the fit. Theory contributions and diagonal basis transformation are specified + in the input configuration. + + Parameters + ---------- + dataset_inputs_fitting_covmat : np.ndarray + The experimental covariance matrix from the datasets. + diagonal_basis : bool, optional + If True, transform the covariance matrix to diagonal basis by extracting + eigenvalues and eigenvectors of the correlation matrix. Default is True. + + Returns + ------- + covmat : np.ndarray + The prepared covariance matrix (sum of experimental and theory covmats). + diagonal_rotation : np.ndarray or None + The rotation matrix (transposed eigenvectors) to transform data to diagonal basis. + Only returned if diagonal_basis=True, otherwise None. + eig_vals : np.ndarray or None + The eigenvalues of the correlation matrix in diagonal basis. + Only returned if diagonal_basis=True, otherwise None. + """ + covmat = dataset_inputs_fitting_covmat + diagonal_rotation = None + eig_vals = None + + if diagonal_basis: + log.info("working in diagonal basis.") + + # convert covmat to correlation + sigma = np.sqrt(np.diag(covmat)) + sigma_inv = 1 / sigma + cormat = np.einsum("i, ij, j -> ij", sigma_inv, covmat, sigma_inv) + + # diagonalise the correlation matrix + eig_vals, v = np.linalg.eigh(cormat) # cormat = V @ diag(eig_vals) @ V.T + u = np.einsum("i, ij -> ij", sigma, v) + rec_covmat = u @ np.diag(eig_vals) @ u.T + np.testing.assert_allclose( + covmat, + rec_covmat, + rtol=1e-4, + atol=1e-5, + err_msg="Diagonalisation failed to reproduce original covmat", + ) + + diagonal_rotation = np.einsum("ij, j -> ij", v.T, sigma_inv) + + return covmat, diagonal_rotation, eig_vals def fitting_data_dict( @@ -420,7 +494,7 @@ def fitting_data_dict( expdata = make_replica fittable_datasets = fittable_datasets_masked - # all covmat manipulation is shared across the replicas for memory purposes + # load covmat stored at the time of vp-setupfit covmat, inv_true, diag_rot, eig_vals = _inv_covmat_prepared # get the masks - different for each replica so fine to call here @@ -548,6 +622,27 @@ def fitting_data_dict( exps_fitting_data_dict = collect("fitting_data_dict", ("group_dataset_inputs_by_metadata",)) +@table +def fitting_covmat_table(output_path, setupfit_fitting_covmat, data_index, diagonal_basis=True): + """ + Stores the fitting covariance matrix if diagonal_basis is False, else store the rotation matrix and eigenvalues + """ + covmat, diagonal_rotation, eig_vals = setupfit_fitting_covmat + + if not diagonal_basis: + log.info("Saving fitting covmat") + if not hasattr(covmat, "index"): + return pd.DataFrame(covmat, index=data_index, columns=data_index) + else: + return covmat + + df_rotation = pd.DataFrame(diagonal_rotation) + df_rotation.insert(0, "eig_val", eig_vals) + df_rotation.index = pd.Index([f"eigenmode {i}" for i in range(len(eig_vals))]) + log.info("Saving diagonal-basis rotation table") + return df_rotation + + def replica_nnseed_fitting_data_dict(replica, exps_fitting_data_dict, replica_nnseed): """For a single replica return a tuple of the inputs to this function. Used with `collect` over replicas to avoid having to perform multiple @@ -575,7 +670,17 @@ def replica_nnseed_fitting_data_dict(replica, exps_fitting_data_dict, replica_nn ) -def replica_pseudodata(experiment_indexed_make_replica, replica): +def diagonal_indexed_make_replica(indexed_make_replica, fitting_data_dict): + """Rotate one group pseudodata block to the diagonal basis.""" + values = indexed_make_replica.iloc[:, 0].to_numpy() + diag_rot = fitting_data_dict.get("data_transformation") + rotated_values = values if diag_rot is None else diag_rot @ values + return pd.DataFrame(rotated_values, columns=["data"]) + + +def replica_pseudodata( + experiment_indexed_make_replica, diagonal_indexed_make_replica, replica, diagonal_basis=True +): """Creates a pandas DataFrame containing the generated pseudodata. The index is :py:func:`validphys.results.experiments_index` and the columns is the replica numbers. @@ -586,8 +691,16 @@ def replica_pseudodata(experiment_indexed_make_replica, replica): `fitting::savepseudodata` is `true` (as per the default setting) The table can be found in the replica folder i.e. /nnfit/replica_*/ """ - df = pd.concat(experiment_indexed_make_replica) - df.columns = [f"replica {replica}"] + replica_column = f"replica {replica}" + + if not diagonal_basis: + df = pd.concat(experiment_indexed_make_replica) + df.columns = [replica_column] + return df + + df = diagonal_indexed_make_replica.copy() + df.columns = [replica_column] + df.index = pd.Index([f"eigenmode {i}" for i in range(len(df))]) return df diff --git a/validphys2/src/validphys/pseudodata.py b/validphys2/src/validphys/pseudodata.py index de7dd36c69..9d811cfbd9 100644 --- a/validphys2/src/validphys/pseudodata.py +++ b/validphys2/src/validphys/pseudodata.py @@ -31,6 +31,46 @@ class ReplicaGenerationError(Exception): pass +def fit_diagonal_basis_rotation(fitting_covmat_name, fit): + """Rotation matrix taking pseudodata from the original to the diagonal basis, + or ``None`` if ``fit`` was not run in diagonal basis. + + Sources the matrix from the same eigensystem table that + ``_inv_covmat_prepared`` loads when running the fit, so the rotation + applied here is bit-identical to the one used at generation time. + """ + runcard = fit.as_input() + + if not runcard.get("diagonal_basis", True): + return None + + eigensystem = pd.read_csv( + fitting_covmat_name, index_col=[0], header=[0], sep="\t|,", engine="python" + ) + + return eigensystem.iloc[:, 1:].values + + +def diagonal_indexed_recreate_pseudodata(indexed_make_replica, fit_diagonal_basis_rotation): + """Recreation-time analogue of + :py:func:`validphys.n3fit_data.diagonal_indexed_make_replica`, but doesn't + need :py:func:`_inv_covmat_prepared` (through :py:func:`fitting_data_dict`) + which requires `output_path` to be available. + + Returns the pseudodata in the diagonal basis (eigenmode-indexed) when the + fit was run in diagonal basis, otherwise returns ``indexed_make_replica`` + untouched (preserving the original ``(group, dataset, id)`` MultiIndex). + """ + diag_rot = fit_diagonal_basis_rotation + if diag_rot is None: + return indexed_make_replica + values = indexed_make_replica.iloc[:, 0].to_numpy() + rotated = diag_rot @ values + return pd.DataFrame( + rotated, index=pd.Index([f"eigenmode {i}" for i in range(len(rotated))]), columns=["data"] + ) + + def read_replica_pseudodata(fit, context_index, replica): """Function to handle the reading of training and validation splits for a fit that has been produced with the ``savepseudodata`` flag set to ``True``. @@ -70,10 +110,9 @@ def read_replica_pseudodata(fit, context_index, replica): 5 3.117819 6 0.771079 """ - # List of length 1 due to the collect - context_index = context_index[0] - # The [0] is because of how pandas handles sorting a MultiIndex - sorted_index = context_index.sortlevel(level=range(1, 3))[0] + # Detect whether fit performed in diagonal basis + # TODO: change the fit object to return diagonal basis True or False depening on the NNPDF version + diagonal_basis = fit.as_input().get("diagonal_basis", True) log.debug(f"Reading pseudodata & training/validation splits from {fit.name}.") replica_path = fit.path / "nnfit" / f"replica_{replica}" @@ -87,9 +126,11 @@ def read_replica_pseudodata(fit, context_index, replica): tr_pseudodatafile = "datacuts_theory_fitting_training_pseudodata.csv" vl_pseudodatafile = "datacuts_theory_fitting_validation_pseudodata.csv" + index_col = [0] if diagonal_basis else [0, 1, 2] + try: - tr = pd.read_csv(replica_path / tr_pseudodatafile, index_col=[0, 1, 2], sep="\t", header=0) - val = pd.read_csv(replica_path / vl_pseudodatafile, index_col=[0, 1, 2], sep="\t", header=0) + tr = pd.read_csv(replica_path / tr_pseudodatafile, index_col=index_col, sep="\t", header=0) + val = pd.read_csv(replica_path / vl_pseudodatafile, index_col=index_col, sep="\t", header=0) except FileNotFoundError as e: raise FileNotFoundError( "Could not find saved training and validation data files. " @@ -97,22 +138,25 @@ def read_replica_pseudodata(fit, context_index, replica): ) from e tr["type"], val["type"] = "training", "validation" - pseudodata = pd.concat((tr, val)) - # In order for this function to work also with old fit, it is necessary to remap the names - # being read (since the names in the context have already been remapped) - # The following checks whether a given name is in both the context and the fit, and if not - # tries to get it from the old_to_new mapping. - mapping = {} - context_datasets = context_index.get_level_values("dataset").unique() - for dsname in pseudodata.index.get_level_values("dataset").unique(): - if dsname not in context_datasets: - new_name, _ = legacy_to_new_map(dsname) - mapping[dsname] = new_name + if not diagonal_basis: + # In order for this function to work also with old fit, it is necessary to remap the names + # being read (since the names in the context have already been remapped) + # The following checks whether a given name is in both the context and the fit, and if not + # tries to get it from the old_to_new mapping. + ctx = context_index[0] + sorted_index = ctx.sortlevel(level=range(1, 3))[0] - pseudodata = pseudodata.rename(mapping, level=1).sort_index(level=range(1, 3)) - pseudodata.index = sorted_index + mapping = {} + context_datasets = ctx.get_level_values("dataset").unique() + for dsname in pseudodata.index.get_level_values("dataset").unique(): + if dsname not in context_datasets: + new_name, _ = legacy_to_new_map(dsname) + mapping[dsname] = new_name + + pseudodata = pseudodata.rename(mapping, level=1).sort_index(level=range(1, 3)) + pseudodata.index = sorted_index tr = pseudodata[pseudodata["type"] == "training"] val = pseudodata[pseudodata["type"] == "validation"] @@ -195,8 +239,7 @@ def make_replica( # Set random seed rng = np.random.default_rng(seed=group_replica_mcseed) # construct covmat - covmat = dataset_inputs_sampling_covmat - covmat_sqrt = sqrt_covmat(covmat) + covmat_sqrt = sqrt_covmat(dataset_inputs_sampling_covmat) full_mask = ( group_positivity_mask @@ -224,7 +267,7 @@ def make_replica( mult_shifts.append(mult_shift) # If sep_mult is true then the multiplicative shifts were not included in the covmat - shifts = covmat_sqrt @ rng.normal(size=covmat.shape[1]) + shifts = covmat_sqrt @ rng.normal(size=dataset_inputs_sampling_covmat.shape[1]) mult_part = 1.0 if sep_mult: special_mult_errors = group_multiplicative_errors["special_mult"] @@ -432,11 +475,17 @@ def make_level1_data(data, level0_commondata_wc, filterseed, data_index, sep_mul ) # ================== generation of Level1 data ======================# - central_vals= central_values_array(level0_commondata_wc) + central_vals = central_values_array(level0_commondata_wc) group_mult_errs = group_multiplicative_errors(level0_commondata_wc, sep_mult=sep_mult) group_pos_mask = group_positivity_mask(level0_commondata_wc) level1_data = make_replica( - central_vals, filterseed, covmat, group_multiplicative_errors=group_mult_errs, group_positivity_mask=group_pos_mask, sep_mult=sep_mult, genrep=True, + central_vals, + filterseed, + covmat, + group_multiplicative_errors=group_mult_errs, + group_positivity_mask=group_pos_mask, + sep_mult=sep_mult, + genrep=True, ) indexed_level1_data = indexed_make_replica(data_index, level1_data) @@ -457,7 +506,9 @@ def make_level1_data(data, level0_commondata_wc, filterseed, data_index, sep_mul return level1_commondata_instances_wc -_group_recreate_pseudodata = collect('indexed_make_replica', ('group_dataset_inputs_by_metadata',)) +_group_recreate_pseudodata = collect( + 'diagonal_indexed_recreate_pseudodata', ('group_dataset_inputs_by_metadata',) +) _recreate_fit_pseudodata = collect('_group_recreate_pseudodata', ('fitreplicas', 'fitenvironment')) _recreate_pdf_pseudodata = collect('_group_recreate_pseudodata', ('pdfreplicas', 'fitenvironment')) diff --git a/validphys2/src/validphys/results.py b/validphys2/src/validphys/results.py index f760d35544..346cbc035c 100644 --- a/validphys2/src/validphys/results.py +++ b/validphys2/src/validphys/results.py @@ -242,13 +242,11 @@ def data_index(data): procs_data = collect("data", ("group_dataset_inputs_by_process",)) -def groups_index(groups_data, diagonal_basis=True): +def groups_index(groups_data): """Return a pandas.MultiIndex with levels for group, dataset and point respectively, the group is determined by a key in the dataset metadata, and controlled by `metadata_group` key in the runcard. - In case diagonal_basis is True, the dataset name is replaced by the eigenmode, because - individual datasets appear mixed in the diagonal basis Example ------- @@ -275,8 +273,8 @@ def groups_index(groups_data, diagonal_basis=True): return df.index -def experiments_index(experiments_data, diagonal_basis=True): - return groups_index(experiments_data, diagonal_basis) +def experiments_index(experiments_data): + return groups_index(experiments_data) def procs_index(procs_data): diff --git a/validphys2/src/validphys/tests/conftest.py b/validphys2/src/validphys/tests/conftest.py index 528786722b..ac5c3002c0 100644 --- a/validphys2/src/validphys/tests/conftest.py +++ b/validphys2/src/validphys/tests/conftest.py @@ -69,13 +69,13 @@ def tmp(tmpdir): HESSIAN_PDF = "NNPDF40_nnlo_as_01180_hessian" THEORYID = 40_000_000 THEORY_QED = 40_000_100 -FIT_3REPLICAS = "FIT_3REPLICAS_250616" -FIT_3REPLICAS_DIAG = "FIT_3REPLICAS_251114" -FIT_3REPLICAS_DCUTS = "FIT_3REPLICAS_250616_diffcuts" +FIT_3REPLICAS = "FIT_3REPLICAS_260519" +FIT_3REPLICAS_DIAG = "FIT_3REPLICAS_DIAG_260519" +FIT_3REPLICAS_DCUTS = "FIT_3REPLICAS_260519_diffcuts" FIT = "NNPDF40_nnlo_like_CI_testing_250616" FIT_ITERATED = "NNPDF40_nnlo_like_CI_testing_250616_iterated" -PSEUDODATA_FIT = "pseudodata_test_fit_n3fit_250616" -PSEUDODATA_FIT_DIAG = "pseudodata_test_fit_n3fit_251104" +PSEUDODATA_FIT = "pseudodata_test_fit_n3fit_260518" +PSEUDODATA_FIT_DIAG = "pseudodata_test_fit_diag_n3fit_260518" # These fits contain _only_ data MULTICLOSURE_FITS = ["250618-test-multiclosure-001", "250618-test-multiclosure-002"] diff --git a/validphys2/src/validphys/tests/test_fitdata.py b/validphys2/src/validphys/tests/test_fitdata.py index 552880175f..f1cf265fa1 100644 --- a/validphys2/src/validphys/tests/test_fitdata.py +++ b/validphys2/src/validphys/tests/test_fitdata.py @@ -18,10 +18,10 @@ def test_print_different_cuts(): res = print_different_cuts(fits, testi) # NMC assert "121 out of 260" in res - assert "59 out of 260" in res + assert "73 out of 260" in res # SLAC assert "34 out of 211" in res - assert "0 out of 211" in res + assert "5 out of 211" in res def test_print_systype_overlap(): diff --git a/validphys2/src/validphys/tests/test_pseudodata.py b/validphys2/src/validphys/tests/test_pseudodata.py index cbb1cc8b8d..e88e4c88df 100644 --- a/validphys2/src/validphys/tests/test_pseudodata.py +++ b/validphys2/src/validphys/tests/test_pseudodata.py @@ -81,18 +81,20 @@ def test_no_savepseudodata(): func(fit=FIT) -def test_read_matches_recreate(): - - for fit in [PSEUDODATA_FIT, PSEUDODATA_FIT_DIAG]: - diagonal_basis = True if fit == PSEUDODATA_FIT_DIAG else False - reads = API.read_fit_pseudodata(fit=fit, diagonal_basis=diagonal_basis) - recreates = API.recreate_fit_pseudodata(fit=fit, diagonal_basis=diagonal_basis) - for read, recreate in zip(reads, recreates): - # We ignore the absolute ordering of the dataframes and just check - # that they contain identical elements. - pd.testing.assert_frame_equal(read.pseudodata, recreate.pseudodata, check_like=True) - pd.testing.assert_index_equal(read.tr_idx, recreate.tr_idx, check_order=False) - pd.testing.assert_index_equal(read.val_idx, recreate.val_idx, check_order=False) +@pytest.mark.parametrize( + "fit, diagonal_basis", + [(PSEUDODATA_FIT, False), (PSEUDODATA_FIT_DIAG, True)], + ids=["standard", "diagonal"], +) +def test_read_matches_recreate(fit, diagonal_basis): + reads = API.read_fit_pseudodata(fit=fit, diagonal_basis=diagonal_basis) + recreates = API.recreate_fit_pseudodata(fit=fit, diagonal_basis=diagonal_basis) + for read, recreate in zip(reads, recreates): + # We ignore the absolute ordering of the dataframes and just check + # that they contain identical elements. + pd.testing.assert_frame_equal(read.pseudodata, recreate.pseudodata, check_like=True) + pd.testing.assert_index_equal(read.tr_idx, recreate.tr_idx, check_order=False) + pd.testing.assert_index_equal(read.val_idx, recreate.val_idx, check_order=False) def test_level0_commondata_wc(): diff --git a/validphys2/src/validphys/theorycovariance/construction.py b/validphys2/src/validphys/theorycovariance/construction.py index 8c916f89d8..fec49566b9 100644 --- a/validphys2/src/validphys/theorycovariance/construction.py +++ b/validphys2/src/validphys/theorycovariance/construction.py @@ -456,13 +456,10 @@ def total_theory_covmat(theory_covmat_custom, user_covmat): return theory_covmat_custom + user_covmat -def theory_covmat_custom_fitting(theory_covmat_custom_per_prescription, procs_index_matched): - """theory_covmat_custom_per_prescription but reindexed so the order of the datasets matches - those in the experiment covmat so they are aligned when fitting.""" - df = theory_covmat_custom_per_prescription.reindex(procs_index_matched).T.reindex( - procs_index_matched - ) - return df +def _reindex_covmat_to_fitting_order(covmat, index): + """Reindex a covmat DataFrame so the dataset ordering matches the experiment + covmat (the grouped/runcard order) used when fitting.""" + return covmat.reindex(index).T.reindex(index) theory_covmats_fitting = collect(theory_covmat_custom_per_prescription, ("point_prescriptions",)) @@ -474,16 +471,33 @@ def theory_covmat_custom(theory_covmats_fitting): return sum(theory_covmats_fitting) -def total_theory_covmat_fitting(total_theory_covmat, procs_index_matched): - """total_theory_covmat but reindexed so the order of the datasets matches - those in the experiment covmat so they are aligned when fitting.""" - return theory_covmat_custom_fitting(total_theory_covmat, procs_index_matched) +def data_input_matched_procs_index(data_input, procs_index): + """procs_index ordered according to data_input (runcard order), so that + nnfit_theory_covmat is stored in the same order as loaded_theory_covmat.""" + tups = [] + for ds in data_input: + for orig in procs_index: + if orig[1] == str(ds): + tups.append(orig) + return pd.MultiIndex.from_tuples(tups, names=("process", "dataset", "id")) + + +def theory_covmat_custom_fitting(theory_covmat_custom, data_input_matched_procs_index): + """theory_covmat_custom (summed over all point prescriptions) reindexed to + data_input (runcard) order so it aligns with loaded_theory_covmat in n3fit.""" + return _reindex_covmat_to_fitting_order(theory_covmat_custom, data_input_matched_procs_index) + + +def total_theory_covmat_fitting(total_theory_covmat, data_input_matched_procs_index): + """total_theory_covmat reindexed to data_input (runcard) order so it aligns + with loaded_theory_covmat in n3fit.""" + return _reindex_covmat_to_fitting_order(total_theory_covmat, data_input_matched_procs_index) -def user_covmat_fitting(user_covmat, procs_index_matched): - """user_covmat but reindexed so the order of the datasets matches - those in the experiment covmat so they are aligned when fitting.""" - return theory_covmat_custom_fitting(user_covmat, procs_index_matched) +def user_covmat_fitting(user_covmat, data_input_matched_procs_index): + """user_covmat reindexed to data_input (runcard) order so it aligns with + loaded_theory_covmat in n3fit.""" + return _reindex_covmat_to_fitting_order(user_covmat, data_input_matched_procs_index) def procs_index_matched(groups_index, procs_index):