
Simplify flsimulate metadata handling #188

Open
wants to merge 5 commits into base: develop
Conversation

drbenmorgan
Member

To work towards the requests in #176 and #178, this PR addresses some basic simplification of flsimulate's metadata organisation and output, of which random number seeds/state are a part.

The core change is to ensure that all metadata relating to an flsimulate run is stored in the output .brio file and only there. In particular, flsimulate's provision to output metadata to a separate file is removed, and a separate brio_file_dumper program (name to be argued over!) is supplied that takes a .brio file as input and can print as much or as little information about the file contents (including the metadata) as needed. At present it simply dumps the metadata to stdout in raw datatools::multi_properties format, so a production job can obtain the metadata via:

$ flsimulate <args> -o output.brio
...
$ brio_file_dumper -f output.brio > output.meta.conf

This ensures that:

  1. We can never generate a file without metadata stored
  2. The handling/treatment of the metadata in MC production or similar is dealt with outside of flsimulate (which is purely responsible for generating that data)
  3. There's an easy interface for examining any raw .brio file, or extracting info in any form suitable for database/cataloguing. This can easily be extended to raw/reconstructed data as well.

This is still a "Draft" PR because not everything is implemented yet, and I wanted early eyes/feedback from all of you (as it impacts production/database/reco/analysis). The main thing to work out is what your different areas need from the brio_file_dumper program, i.e. what interface/printed info is most useful to you?

Digitization will be handled in a separate module, so remove direct
(and not working!) digitization references.
Remove flsimulate's ability to write metadata to anything other than
the output data file. This avoids ambiguity and complexity in storing
and understanding this data later on.

Replace separate file output with new command line program
`brio_file_dumper`. This dumps the metadata in an input brio file
to stdout in multiproperties format, allowing redirection to file if
required.
@lemiere
Member

lemiere commented Apr 23, 2020

I need time to review that PR, as I'm not comfortable with putting meta in the data file... Will be back with that soon.

@cherylepatrick left a comment

I'm very much in favour of having metadata in the brio file, as it means there's no chance of mixing things up and working with a file that's been generated with one configuration when you think it's another (and we have seen that happen).

We should think about ensuring that the dumped metadata is in a format that's conducive to storing it in a database, and particularly to storing it in a way that's easily searchable. Particularly for reconstructed files, and to an extent for simulated ones, the number of configurable options is quite high, but almost all of them are likely to be set to default values; we shouldn't let that flexibility hamper our ability to very easily see the key bits of information, which are the vertex and event generators. It could also be good to have a comparator that can tell you the difference in config settings between two simulated/reconstructed files.

Most important for me are for a simulated file:

  • event generator
  • vertex generator
  • any non-standard detector configuration (but as the standards can change, in practice I guess we need the whole config)
  • being able to get at the seeds so we could generate another file with the same ones

For a reconstructed file

  • everything from the simulated file
  • also the hash of the simulated file so we can match them
  • pipeline settings: I guess again, as the standards change, it is probably going to have to be all of them, but as they are legion and most will be defaults, that's where a comparator tool would be useful
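The comparator idea could start out as nothing more than a diff of two metadata dumps. A minimal sketch, assuming the datatools::multi_properties text format (the two dump files are fabricated here; in practice each would come from running the proposed brio_file_dumper on a .brio file):

```shell
# Fabricated metadata dumps; in practice: brio_file_dumper -f X.brio > X.meta
cat > A.meta <<'EOF'
[name="flsimulate.simulation" type="flsimulate::section"]
rngEventGeneratorSeed : integer = 314159
EOF
cat > B.meta <<'EOF'
[name="flsimulate.simulation" type="flsimulate::section"]
rngEventGeneratorSeed : integer = 271828
EOF
# Show only the settings that differ between the two productions.
diff A.meta B.meta || true
```

Empty diff output would mean the two files were produced with an identical configuration.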

Hope that's useful feedback...

@emchauve
Member

I also fully agree on having metadata within data files, but would also like to hear more from Yves about his concern. If I understand well, he mentioned the case of complex MC production, like having an output similar to DATA with all components together (bb, int_bg, ext_bg, radon, etc.). But as long as the additional metadata required for such MC (the list of components and their activities) can be stored using datatools::multi_properties, I don't see any issues. @lemiere let us know your thinking, thanks!

@lemiere
Member

lemiere commented Apr 28, 2020

Hi everyone,

I tried to spend some time thinking about this proposal and:

  • Up to now, simulation production has been done using several files per production (e.g. N files and 10^7 events per file), so with this proposal we will duplicate the METADATA several times. That is disk consuming.
  • Having the METADATA to avoid the chance of mixing things up is a complex question... That's the reason why we started to talk about a DB. The event generator and vertex generator make sense to keep track of the production settings, BUT the full variant system should be stored to be able to track each detail. Does having METADATA in the data file avoid the DB development?
  • sn_flash has been developed (and has to be improved) to manage all this METADATA (setting + storing). As the collaboration agreed weeks ago, we have to focus on key points (new physics features, commissioning data, monitoring + training), so why should we lose time doing that instead of working on actual needs?
  • In case of a specific study, how could we merge data from many productions? (e.g. files containing the bkg model + simulated bb0nu) Surely there is a way, but having independent METADATA files would be good enough.
  • We will have exactly the same discussions about real data management, and then we will have the same arguments. But in that case no one will encourage storing the METADATA in the data files, as the metadata will be the full detector/electronics settings (like HV per channel, electronic thresholds...).

Shouldn't that question be discussed during the next SW group meeting? @drbenmorgan, can you organize it with your group?

@cherylepatrick

A few responses to @lemiere - my personal thoughts, and maybe others will come up with smarter alternatives...

with this proposal we will duplicate several times the METADATA. That is disk consuming.

I assume that the size of the metadata must be trivial compared to 10 million events' worth of simulation?

Event generator, Vertex generator make sense to keep track about the production setting BUT the full variant system should be store to be able to track each details.

Yeah. In almost every case (because we mostly want to study the actual detector we built), the rest of the variant system will be at default. But it's possible that the default changes, and there are definitely situations where we simulate with an unusual variant (e.g. Hamzah's magnetic field work). But maybe the full variant system doesn't need to be stored in plain text? Could we create a system where the variants are stored as some kind of bitmap, with an interpreter program to read it from the brio file header, or something?

Having METADATA in the data file avoid the DB dev ?

You may disagree, but I don't think so at all, and for a few reasons. The two systems serve similar, but complementary purposes. First, the database will largely be useful for searching for/locating files. A disordered trawl of the file system and manually reading the metadata headers of each file to see if it's what you want isn't a practical replacement for that. Conversely, what the metadata in the file WOULD help with is for small individual simulation productions, particularly for testing, that aren't intended for long-term collaboration use, and that haven't been entered into the official collaboration database of approved simulated/reconstructed files. If an analyser has made a quick batch of simulation to test something, and wants to re-use it for another test, it's good that they can test the file is what they think it is. We already have analyses that have been done with an unwitting mistake in some parameter (e.g. wrong time cut for an alpha particle). Another bonus would be that having metadata in the file would simplify the process of adding it to the metadata database - the script that adds it could read the metadata from the file, and put that info into the database tables accordingly.
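That last point, a script reading the in-file metadata and filling database tables, could begin from a sketch like this (the metadata dump is fabricated below and the flattening is purely illustrative; in practice the dump would come from running the proposed brio_file_dumper on the file):

```shell
# Fabricated dump; in practice: brio_file_dumper -f output.brio > output.meta
cat > output.meta <<'EOF'
[name="flsimulate.simulation" type="flsimulate::section"]
rngEventGeneratorSeed : integer = 314159
numberOfEvents : integer = 100
EOF
# Flatten "key : type = value" lines into "key=value" pairs ready for a
# DB loader; section headers start with '[' and are skipped.
awk -F'[:=]' '/^[A-Za-z].*=/ {gsub(/ /,"",$1); gsub(/ /,"",$3); print $1"="$3}' output.meta
# prints:
#   rngEventGeneratorSeed=314159
#   numberOfEvents=100
```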

As the collaboration agreed weeks ago, we have to focus on key points... so why should we lose time doing that instead of working on actual needs?

Having a reliable file identification and searching system IS an actual need. If we aren't confident that we can find and identify simulated and reconstructed files, we are going to waste time, and we are going to make mistakes that can lead to invalid results. So sorting out a good metadata management method ABSOLUTELY is a key point, and a good thing to do in the lockdown. Whether that is in the form of putting metadata into the files, or whether it's all managed by a database, is maybe a subject for debate, but developing a clear data management strategy is absolutely imperative before we move to the live data phase, and is possibly even overdue as we have real commissioning data.

In case of specific study, how could we merge data for many production?

If we want to do this (and do we? I'm not at all sure that we do; it doesn't allow for any floating of relative activities for different components, so is unlikely to be realistic, isn't it?), this actually seems like an argument FOR metadata in the file. How do we connect multiple metadata files to a single brio file otherwise? Without metadata in the files, I think there is a lot of scope for this to go wrong. A check on the metadata before merging would be necessary to ensure that the merged files have the same detector geometry, and that would be easier if the metadata were in the files. Without that check, we might end up with nonsensical merged files where half the file had events with an Se82 foil and the other half had an Nd150 foil...

We will have exactly the same discussions about the real data management then we will have the same argument. But in that case no one will encourage to store the METADATA in the data files...

Is that true? I suspect we will store a subset of metadata that is a shortcut to the full set - namely we will store things like a timestamp that can be used to look up the HV configurations at the time of the run. I would not advocate for us storing no metadata at all, would you? I don't think that is typical.

I agree with Yves that this topic is worth discussion - but it's fine to start that discussion on here. We are a relatively small group, and by discussing online initially, it gives us time to consider our ideas and give them proper thought - there is a danger that if it is discussed only in a meeting, we could make rushed and ill-considered decisions. So no harm in starting the chat here and then moving to in-person discussion later on if needed.

Sorry, that was a bit of a long one...

@drbenmorgan
Member Author

drbenmorgan commented Apr 29, 2020

with this proposal we will duplicate several times the METADATA. That is disk consuming.

I assume that the size of the metadata must be trivial compared to 10 million events' worth of simulation?

I did a quick check of this using the following config file (seeds identical to ensure event data is identical):

#@key_label  "name"
#@meta_label "type"
[name="flsimulate" type="flsimulate::section"]
numberOfEvents : integer = 100

[name="flsimulate.simulation" type="flsimulate::section"]
rngEventGeneratorSeed         : integer = 314159
rngVertexGeneratorSeed        : integer = 765432
rngGeant4GeneratorSeed        : integer = 123456
rngHitProcessingGeneratorSeed : integer = 987654

Running to generate files with/without metadata:

$ flsimulate -c test.conf -o full.brio -E1
$ flsimulate -c test.conf -o empty.brio -E0 -m test.meta

So this is only 100 events, but what we see is:

  • full.brio is 328005 bytes
  • empty.brio is 327111 bytes
  • test.meta is 2389 bytes

So in fact storing the metadata in the file only adds 894 bytes, roughly 2.7 times smaller than the 2389-byte separate metadata file.
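For concreteness, the arithmetic on those sizes (byte counts copied from the listing above; shell arithmetic truncates the ratio to an integer):

```shell
full=328005    # full.brio: metadata stored in the file
empty=327111   # empty.brio: metadata written separately
meta=2389      # test.meta: the separate metadata file
overhead=$((full - empty))
echo "in-file metadata overhead: ${overhead} bytes"   # 894 bytes
echo "separate file is $((meta / overhead))x bigger"  # 2x (2.67 before truncation)
```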

Event generator, Vertex generator make sense to keep track about the production setting BUT the full variant system should be stored to be able to track each detail.

Yeah. In almost every case (because we mostly want to study the actual detector we built), the rest of the variant system will be at default. But it's possible that the default changes, and there are definitely situations where we simulate with an unusual variant (e.g. Hamzah's magnetic field work). But maybe the full variant system doesn't need to be stored in plain text? Could we create a system where the variants are stored as some kind of bitmap, with an interpreter program to read it from the brio file header, or something?

I need to double check this, but the variant "profile", i.e. the list of settings (just a list of strings), should be stored in the metadata (and I'll add it if not). Reconstituting, e.g., the geometry manager is then a case of starting the variant service with the "default+settings".

Having METADATA in the data file avoid the DB dev ?

It's not either/or, but both. The files and the DB hold the metadata; the DB interface provides the convenience for searching for files matching criteria.

In case of specific study, how could we merge data for many production?

If we want to do this (and do we? I'm not at all sure that we do; it doesn't allow for any floating of relative activities for different components, so is unlikely to be realistic, isn't it?), this actually seems like an argument FOR metadata in the file. How do we connect multiple metadata files to a single brio file otherwise? Without metadata in the files, I think there is a lot of scope for this to go wrong. A check on the metadata before merging would be necessary to ensure that the merged files have the same detector geometry, and that would be easier if the metadata were in the files. Without that check, we might end up with nonsensical merged files where half the file had events with an Se82 foil and the other half had an Nd150 foil...

This is, I think, partially down to Falaise (or rather Bayeux.dpp) having the concept of "Run" locked at the file level. We'd need new input/output modules to handle that if we wanted to have/deal with more than one "Run" (i.e. configuration) in a single file.

@lemiere
Member

lemiere commented Apr 29, 2020

About data management, I have one thing in mind:
To store the production, we will use HPSS (High Performance Storage System), available at CC.
The advice is to store huge files/archives. So what we proposed months/years ago to the collaboration is to store a directory per RUN which contains 2 parts:

  • a METADATA tarball
  • a directory of .brio files

The METADATA tarball (light) will be reachable to get the production settings without fetching the full DATA (huge). Knowing the usage of HPSS, it will be more convenient to get only the METADATA.

@drbenmorgan
Member Author

That pattern would still be possible! The brio_file_dumper (or call it brio_metadata_dumper if preferred) could be run on the generated .brio file to dump its metadata to a file, which is subsequently added to the tarball.

Again, it's not an either/or proposition: don't permit metadata to be stripped from production files, and provide tools to extract the metadata for use in other tasks as needed.

@cherylepatrick

Also for the official files, we should have the HPSS location and metadata stored in a database, shouldn't we? We can have a field in the database to give a location in a tarball.

@robobre
Contributor

robobre commented Apr 30, 2020

@cherylepatrick Yes, we would like to store the metadata and location in the database. We have discussed with Manu that we will add a function to snflash that will store this data in the database. But it is not ready yet; I would like to work on it in the next days.

@cherylepatrick

Sounds good @robobre! A couple of things to bear in mind if we automate DB updates with sn-flash:

  • People might make mistakes when they generate something, and not realise that they have done it wrong until they see the results. Similarly, people might want to run a quick test with sn-flash. So we need a way to remove sn-flash results from the DB if we turn out not to want them for the long term.
  • If I understand correctly, stuff generated with sn-flash will go onto SPS, but for long-term storage, we are likely to move files to HPSS (is that right?). We need to make sure that the database gets updated correctly when the files are moved.

@robobre
Contributor

robobre commented Apr 30, 2020

@cherylepatrick Exactly, I want to add this function to the part where the file is moved to SPS or HPSS.

@drbenmorgan
Member Author

A couple of brain-dump thoughts based on the above and the discussion from the meeting this morning (and going back to DocDB-5183 and 4798). Those documents show that, for a given production run, the following layout is generated (comments inline as to what I think the files are; corrections welcome, as I don't see where the tar bundle of metadata for HPSS fits in / is created from):

Structure

proddir_RUNID/
+ - .sys/
|   ... files here related to jobs submission?...
+- config.d/ 
|  +- conf.d/
|  |  +- launch_file_0.conf # Actual script passed to `flsimulate -c <script>` for subrun 0?
|  |  +- ... 
|  |  +- launch_file_N.conf
|  +- seeds.d/
|  |  +- seeds_0.conf  # Set of seeds used for subrun 0
|  |  +- ...
|  |  +- seeds_N.conf  
|  +- variant.d/
|     +- variant.profile # Overall variant settings (global for all subruns)
+- output_files/
   +- file_0.brio # BRIO file only holding the "ER" tree for subrun 0
   +- file_0.meta # Metadata that would be in "GI" tree for subrun 0
   +- ...
   +- file_N.brio
   +- file_N.meta

This very much looks like using the filesystem to define a file format. Nothing wrong with that; all this PR proposes is to reduce the level of complexity whilst retaining the information. All of the data under config.d for each file can go into file_N.meta (most of it is already there), and that metadata should be stored in the .brio file itself. Given that the .meta files contain (or can contain) essentially the same info that's in the conf.d structure, a production run would end up with something like this:

proddir_RUNID/
+ - .sys/
|   ... files here related to jobs submission?...
+- meta/ (or .tar, .zip, or JSON, or...)
|  +- 0.meta
|  +- ...
|  +- N.meta
+- data/
  +- 0.brio
  +- ...
  +- N.brio

Here, N.meta is generated by running brio_file_dumper -f N.brio > N.meta after the generation of N.brio completes, then storing N.meta as needed in the "manifest".
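That per-subrun step could be a trivial loop plus a tar of the resulting manifest. A sketch with placeholder files (in practice the .brio files would come from flsimulate and each .meta from the proposed dumper tool):

```shell
mkdir -p data meta
for n in 0 1 2; do
  : > data/${n}.brio                 # placeholder; really written by flsimulate
  # in practice: brio_file_dumper -f data/${n}.brio > meta/${n}.meta
  echo "subrun=${n}" > meta/${n}.meta
done
tar czf meta.tar.gz meta/            # the light METADATA "manifest" for HPSS
tar tzf meta.tar.gz                  # lists meta/, meta/0.meta, ...
```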

Any of the .brio files is directly understandable by flreconstruct by passing just that file, without having to marshal two files (flreconstruct -M 0.meta -I 0.brio). The files/data under meta can be quickly inspected/dumped assuming a standard tool like tar/zip/json, or uploaded to the DB as needed. There's less risk of data desync: yes, we have "duplication", but only for the quickly accessible "manifest" and any of that data stored in the DB (pretty much as any photo storage site does).

Using sets of files with different Metadata

The issue here is that the concept of "Run" in Falaise (or rather Bayeux.dpp) is modelled by "brio file". So there is no mechanism (as yet), other than single file open/close, that models "Run change", i.e. metadata change. That's why at the moment we can't easily merge (meta)data from several files into one.

We've got the capability to do this, because brio files are just ROOT TFiles and the data "stores" are TTrees. We just need to define a structure that allows multiple "Runs", maybe one TTree per run, and reader/writer operations that can update things like metadata/services on a Run boundary.

@drbenmorgan
Member Author

To progress this, I'd suggest the following:

  1. Retain the flsimulate/flreconstruct option to write metadata out to a text file as part of the run (for compatibility with snflash et al).
  2. Include the brio_file_dumper utility to allow easy query/etc of brio files
  3. Always store metadata in file, whether or not it was also written to a separate text file (see 1).

This retains the required interface for production/ops, whilst ensuring files retain their metadata. Thoughts?

Successfully merging this pull request may close these issues:

  • General Improvements to FLSimulate
  • Re-simulating specific events
  • Simulated true particles from GEANT

5 participants