cms-2016-simulated-datasets: updates done #207

Merged
tiborsimko merged 13 commits into master from cms-2016-sim-test on Sep 9, 2024


katilp
Member

@katilp katilp commented Oct 20, 2023

Addresses #182

Adds code for all steps.
The logic has been changed to find the provenance through the production chain.

Input files are for testing only.

Tested on 3 datasets only; for those it works fine and gives the full provenance, LHE included.
Ready for the final updates in #182 (comment)

cms-2016-simulated-datasets/code/conffiles_records.py (outdated)
cms-2016-simulated-datasets/inputs/recid_info.py (outdated)
cms-2016-simulated-datasets/code/mcm_store.py
cms-2016-simulated-datasets/code/mcm_store.py (outdated)
cms-2016-simulated-datasets/code/mcm_store.py (outdated)
cmd = 'dasgoclient -query "'
if query != "dataset":
    cmd += query + ' '
cmd += 'dataset=' + dataset + '" -json'
Member

BTW since we are using Python 3, expressions like this can be made more readable by using f-strings:

old:

    cmd += 'dataset=' + dataset + '" -json'

new:

    cmd += f'dataset={dataset}" -json'

The same technique of formatting strings with variable substitution could be used elsewhere in the code to simplify string concatenation.
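
As an illustration of the suggestion, the whole command could be assembled with f-strings in one small helper. This is only a sketch: the function name `build_das_command` is hypothetical and not taken from mcm_store.py; the command layout follows the snippet under review.

```python
def build_das_command(dataset, query="dataset"):
    """Return the dasgoclient command line for a dataset query.

    Hypothetical helper; it mirrors the concatenation in the reviewed snippet.
    """
    # A plain "dataset" query has no prefix; any other query keyword
    # is placed in front of the dataset= selector.
    prefix = "" if query == "dataset" else f"{query} "
    return f'dasgoclient -query "{prefix}dataset={dataset}" -json'


# The dataset name below is a made-up placeholder.
print(build_das_command("/Some/Example/NANOAODSIM", query="parent"))
```

Computing the optional prefix first keeps the command in a single f-string, so the quoting of the `-query` argument is visible in one place.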

cms-2016-simulated-datasets/code/mcm_store.py (outdated)
cms-2016-simulated-datasets/code/mcm_store.py (outdated)
cms-2016-simulated-datasets/code/mcm_store.py (outdated)
cms-2016-simulated-datasets/code/mcm_store.py (outdated)
@katilp katilp changed the title cms-2016-simulated-datasets: work in progress cms-2016-simulated-datasets: updates done Dec 14, 2023
Member Author

katilp commented Dec 14, 2023

@tiborsimko Missing file added and tested. Updates are done (apart from the print format)

Resulting record JSON of the six test datasets:
cms-simulated-datasets-2016.json

@tiborsimko tiborsimko force-pushed the cms-2016-sim-test branch 4 times, most recently from 1fb5083 to a36eec2 Compare January 16, 2024 13:26
@tiborsimko tiborsimko force-pushed the cms-2016-sim-test branch 7 times, most recently from 5ffc345 to 1496945 Compare February 5, 2024 10:41
Member Author

katilp commented Feb 5, 2024

For the pileup (see cernopendata/opendata.cern.ch#3569)

  • check a free RECID

  • in code/dataset_records.py: in pileup_dataset_recif =

    'Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX' : <RECID>

Member Author

katilp commented Apr 5, 2024

Take care of the cases where DAS finds two parent datasets. This makes the script fail when AODSIM is taken as parent (instead of MINIAODSIM) in https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/dataset_records.py#L456

Make sure that MINIAODSIM is picked.

There were 56 such AODSIM error messages.
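
One possible way to make the choice explicit is sketched below. It assumes the parents arrive as a list of dataset names of the form /primary/processed/TIER; the function name `pick_miniaodsim_parent` is hypothetical, and how the list is obtained from DAS in dataset_records.py is not shown here.

```python
def pick_miniaodsim_parent(parents):
    """Return the MINIAODSIM parent from a list of dataset names.

    Assumes each name has the form /primary/processed/TIER. Returns None
    when no MINIAODSIM parent is present, so the caller can decide what
    to do instead of silently falling back to AODSIM.
    """
    for name in parents:
        # The data tier is the last path component of the dataset name.
        tier = name.rstrip("/").split("/")[-1]
        if tier == "MINIAODSIM":
            return name
    return None
```

Comparing the last path component exactly avoids matching AODSIM by accident, which a plain substring test on "AODSIM" would do.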

Member Author

katilp commented Apr 11, 2024

Many datasets (also those with the gridpack available) are missing the LHE information (or only have the production script displayed).
A madgraph example has only the script (but it is 404): https://opendata.cern.ch/record/33703
A powheg example displays the correct link but it shows 404: https://opendata.cern.ch/record/35757

There's indeed no 2016-sim directory under lhe_generators:

$ eos ls /eos/opendata/cms/lhe_generators/
2015-sim

Member Author

katilp commented May 20, 2024

For nano variable display, follow up from cernopendata/opendata.cern.ch#3607 and make the corresponding changes in https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/external-scripts/inspectNanoFile.py#L316

The css file is in /eos/opendata/cms/upload/kati/patsize.css

This file is now https://opendata.cern.ch/eos/opendata/cms/dataset-semantics/patsize.css so inspectNanoFile should be updated to use it

(I'm updating the existing doc html files under /eos/opendata/cms/dataset-semantics/NanoAOD and /eos/opendata/cms/dataset-semantics/NanoAODSIM so no need to rerun them)

Member Author

katilp commented May 20, 2024

For the record, the list of 2016 MC datasets that do not have CODP records yet is in /eos/user/c/cmsdpoa/data-curation/cms-2016-simulated-datasets/missing-2024-03.txt (with / -> @ in the listing)
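
To turn a line of that listing back into a dataset name, something like the following would do. It is a minimal sketch that assumes one dataset per line and no literal "@" in the dataset names themselves; the function name is hypothetical.

```python
def decode_listing_name(entry):
    """Convert a listing entry (with / written as @) back to a dataset name."""
    # Strip the trailing newline and surrounding blanks, then undo the
    # / -> @ substitution used in the listing.
    return entry.strip().replace("@", "/")
```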

Member Author

katilp commented May 21, 2024

Add the mcdb handling to the lhe_generators.py as in https://github.com/cernopendata/data-curation/blob/master/cms-2015-simulated-datasets/code/lhe_generators.py#L51-L103

Also change the output directory name to lhe_generators/2016-sim/ and update the corresponding path in dataset_records.py https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/dataset_records.py#L605

An example record that should have mcdb info: https://opendata.cern.ch/record/72661

Member Author

katilp commented May 30, 2024

Many datasets (also those with the gridpack available) miss the LHE information (or only have the production script displayed)
A madgraph example has only the script (but it is 404): https://opendata.cern.ch/record/33703
A powheg example displays the correct link but it shows 404: https://opendata.cern.ch/record/35757

There's indeed no 2016-sim directory under lhe_generators:

$ eos ls /eos/opendata/cms/lhe_generators/
2015-sim

The directory has now been copied to /eos/opendata/cms/lhe_generators/2016-sim/

Notes:

  • lhe_generators/2016-sim/gridpacks also contains the 1705 datasets without an LHE step (in that case LOG.txt says There is no LHE directory)
  • it now also contains the 132 datasets that will have <recid>_lhe_header.txt in lhe_generators/2016-sim/mcdb (in that case LOG.txt says Skipping because of mcdb_id value)
  • in total, it contains 17536 <recid> subdirectories whereas there are 21707 input NANO datasets
  • most (but not all) datasets have the generator parameters in the <recid>/InputCards subdirectory whereas the code only checks for files directly under the <recid> subdirectory
    • the 1316 datasets with powheg.input show the generator parameters properly, e.g. 35759
    • the 173 datasets with readInput.DAT show the generator parameters properly, e.g. 40044
    • the 14019 datasets with InputCards do not show the generator parameters, as the code only looks for files directly under the <recid> subdirectory
    • 3 datasets have only a process directory (67225, 67227, 67229) with
      $ ls /eos/opendata/cms/lhe_generators/2016-sim/gridpacks/67225/process/madevent/Cards/
      param_card.dat  proc_card_mg5.dat  run_card.dat
    • there are 182 datasets with runcmsgrid.sh and no generator parameter files (the one I checked was JHUGen); check those!

If we keep the directory structure as it is now, the code should

  • take the files from the InputCards subdir, if it exists
  • take the files from /process/madevent/Cards/ if InputCards does not exist
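
The lookup order proposed above could be sketched as follows. This is not the code in the repository: the function name `find_card_dir` is hypothetical, and falling back to the <recid> directory itself is an assumption meant to preserve the current flat lookup for the powheg.input and readInput.DAT cases.

```python
from pathlib import Path


def find_card_dir(recid_dir):
    """Return the directory that should hold the generator parameter files.

    Order of preference: <recid>/InputCards, then
    <recid>/process/madevent/Cards, then <recid> itself (flat layout).
    """
    recid_dir = Path(recid_dir)
    candidates = [
        recid_dir / "InputCards",
        recid_dir / "process" / "madevent" / "Cards",
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    # Neither subdirectory exists: keep looking directly under <recid>.
    return recid_dir
```

Returning a directory rather than a file list leaves the existing per-file handling untouched; only the place where it starts looking changes.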

Member Author

katilp commented Jun 26, 2024

Some things are still missing in the provenance, unfortunately also among the ZZZ samples that get displayed first, e.g. https://opendata.cern.ch/record/75597

That's due to missing information in lhe_generators/2016-sim/gridpacks
(also in my local cache).
I know why it happened and it is fixed in the code (the log says No 'cms.vstring(/cvmfs' found in fragment; skipping, because the fragment has args=cms.vstring(["/cvmfs/cms.cern.ch/phys_generator/gridpacks/slc7_amd64_gcc700/madgraph/V5_2.6....5/VVV/ZZZ_Dim6_cW_cHd_cHWB_cHW_4F_slc7_amd64_gcc700_CMSSW_10_6_19_tarball.tar.xz"]))

The fixed code would find these. But the lhe_generators directory for those datasets was generated before the fix.
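
For illustration, a pattern that accepts both the plain and the bracketed form of the argument might look like the sketch below. It is not the actual fix in the repository; the names `GRIDPACK_RE` and `find_gridpack_path` are hypothetical, and the pattern assumes the gridpack path is a quoted string starting with /cvmfs/.

```python
import re

# Matches cms.vstring("/cvmfs/...") as well as cms.vstring(["/cvmfs/..."]),
# with either single or double quotes around the path.
GRIDPACK_RE = re.compile(r"""cms\.vstring\(\s*\[?\s*['"](/cvmfs/[^'"]+)['"]""")


def find_gridpack_path(fragment):
    """Return the first /cvmfs gridpack path in a fragment, or None."""
    match = GRIDPACK_RE.search(fragment)
    return match.group(1) if match else None
```

Making the opening bracket optional means the same lookup covers fragments written with and without a list inside cms.vstring.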

I will open a separate issue to fix the records that need to be completed, because I do not want to rerun the full record generation.
Most likely there are not that many, but some detective work is needed to figure out which ones.

As the code update is already done, it probably does not require changes in this PR.

There are only 15 of them; see cernopendata/opendata.cern.ch#3652

Member

@tiborsimko tiborsimko left a comment

Rebasing and merging the set of scripts used to publish CMS 2016 SIM records, with several post-release updates.

@tiborsimko tiborsimko merged commit 0878405 into master Sep 9, 2024
6 checks passed
@tiborsimko tiborsimko deleted the cms-2016-sim-test branch September 9, 2024 16:39
This was referenced Sep 9, 2024
Projects
Status: Build records
3 participants