Simplify flsimulate metadata handling #188
base: develop
Conversation
Digitization will be handled in a separate module, so remove direct (and not working!) digitization references.
Remove flsimulate's ability to write metadata to anything other than the output data file. This avoids ambiguity and complexity in storing and understanding this data later on. Replace separate file output with a new command line program `brio_file_dumper`. This dumps the metadata in an input brio file to stdout in multiproperties format, allowing redirection to a file if required.
I need time to review that PR as I'm not comfortable with putting meta in the data file.... Will be back with that soon
I'm very much in favour of having metadata in the brio file, as it means there's no chance of mixing things up and working with a file that's been generated with one configuration when you think it's another (and we have seen that happen).
We should think about ensuring that the dumped metadata is in a format that's conducive to storing it in a database, and particularly to storing it in a way that's easily searchable. Particularly for reconstructed files, and to an extent for simulated ones, the number of configurable options is quite high, but almost all of them are likely to be set to default values - we shouldn't let that flexibility hamper our ability to very easily see the key bits of information, which are the vertex and event generators. It could also be good to have a comparator that can tell you the difference in config settings between two simulated/reconstructed files.
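A config comparator like the one suggested could be quite simple once each file's metadata is flattened to key/value settings. A minimal sketch in Python (the setting names are hypothetical, not actual Falaise metadata keys):

```python
def diff_settings(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for settings that differ.
    A key missing from one side is reported as None."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Hypothetical metadata from two simulated files:
file1 = {"event_generator": "Se82.0nubb", "vertex_generator": "source_pads_bulk", "seed": 12345}
file2 = {"event_generator": "Se82.2nubb", "vertex_generator": "source_pads_bulk", "seed": 67890}

for key, (v1, v2) in sorted(diff_settings(file1, file2).items()):
    print(f"{key}: {v1!r} != {v2!r}")
```

Because most settings sit at defaults, the diff would usually be a handful of lines even for configs with hundreds of options.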
Most important for me, for a simulated file, are:
- event generator
- vertex generator
- any non-standard detector configuration (but as the standards can change, in practice I guess we need the whole config)
- being able to get at the seed so we could generate another with the same
For a reconstructed file:
- everything from the simulated file
- also the hash of the simulated file so we can match them
- pipeline settings - I guess again as the standards change it is probably going to have to be all of them, but as they are legion and most will be defaults, that's where a comparator tool would be useful
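As an illustration only, the fields listed above could map onto a small record type that a catalogue or comparator tool might use; all names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SimFileRecord:
    # Key searchable fields for a simulated file (names illustrative)
    event_generator: str
    vertex_generator: str
    seed: int
    detector_config: dict  # the whole config, since "standard" can change

@dataclass
class RecoFileRecord:
    sim: SimFileRecord      # everything from the simulated file
    sim_file_hash: str      # hash of the input simulated file, to match them
    pipeline_settings: dict # full pipeline config; mostly defaults in practice

rec = RecoFileRecord(
    sim=SimFileRecord("Se82.0nubb", "source_pads_bulk", 12345, {}),
    sim_file_hash="deadbeef",
    pipeline_settings={},
)
print(rec.sim.event_generator, rec.sim_file_hash)
```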
Hope that's useful feedback...
I also fully agree on having metadata within data files, but would also like to hear more from Yves about his concern. If I understand well, he mentioned the case of complex MC production, like having an output similar to DATA with all components together (bb, int_bg, ext_bg, radon, etc.). But as long as the additional metadata required for such MC (list of components and their activity) can be stored using datatools::multi_properties, I don't see any issues. @lemiere let us know your thinking, thanks!
Hi everyone, I took some time to think about this proposal and:
A few responses to @lemiere - my personal thoughts, and maybe others will come up with smarter alternatives...
I assume that the size of the metadata must be trivial compared to 10 million events' worth of simulation?
Yeah. In almost every case (because we mostly want to study the actual detector we built), the rest of the variant system will be at default. But it's possible that the default changes, and there are definitely situations where we simulate with an unusual variant (e.g. Hamzah's magnetic field work). But maybe the full variant system doesn't need to be stored in plain text? Could we create a system where the variants are stored as some kind of bitmap, with an interpreter program to read it from the brio file header, or something?
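As a toy illustration of the bitmap idea (assuming, unrealistically, that all variants are boolean flags with a fixed ordering; the flag names are made up - the real variant system also has non-boolean parameters, which a pure bitmap could not capture):

```python
# Hypothetical: encode a fixed, ordered list of boolean variant flags
# as an integer bitmask that could live in a file header.
VARIANT_FLAGS = ["magnetic_field_off", "shielding_removed", "custom_source_foil"]

def encode(settings: dict) -> int:
    mask = 0
    for i, name in enumerate(VARIANT_FLAGS):
        if settings.get(name, False):
            mask |= 1 << i
    return mask

def decode(mask: int) -> dict:
    # The "interpreter": expand the mask back into named flags.
    return {name: bool(mask & (1 << i)) for i, name in enumerate(VARIANT_FLAGS)}

mask = encode({"magnetic_field_off": True})
print(mask)  # 1
print(decode(mask))
```

The fragility is obvious: the encoding is only meaningful alongside the exact flag list and ordering, which is one argument for just storing the settings as plain strings.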
You may disagree, but I don't think so at all, and for a few reasons. The two systems serve similar, but complementary purposes. First, the database will largely be useful for searching for/locating files. A disordered trawl of the file system and manually reading the metadata headers of each file to see if it's what you want isn't a practical replacement for that. Conversely, what the metadata in the file WOULD help with is for small individual simulation productions, particularly for testing, that aren't intended for long-term collaboration use, and that haven't been entered into the official collaboration database of approved simulated/reconstructed files. If an analyser has made a quick batch of simulation to test something, and wants to re-use it for another test, it's good that they can test the file is what they think it is. We already have analyses that have been done with an unwitting mistake in some parameter (e.g. wrong time cut for an alpha particle). Another bonus would be that having metadata in the file would simplify the process of adding it to the metadata database - the script that adds it could read the metadata from the file, and put that info into the database tables accordingly.
Having a reliable file identification and searching system IS an actual need. If we aren't confident that we can find and identify simulated and reconstructed files, we are going to waste time, and we are going to make mistakes that can lead to invalid results. So sorting out a good metadata management method ABSOLUTELY is a key point, and a good thing to do in the lockdown. Whether that is in the form of putting metadata into the files, or whether it's all managed by a database, is maybe a subject for debate, but developing a clear data management strategy is absolutely imperative before we move to the live data phase, and is possibly even overdue as we have real commissioning data.
If we want to do this (and do we? I'm not at all sure that we do; it doesn't allow for any floating of relative activities for different components, so is unlikely to be realistic, isn't it?), this actually seems like an argument FOR metadata in the file. How do we connect multiple metadata files to a single brio file otherwise? Without metadata in the files, I think there is a lot of scope for this to go wrong. A check on the metadata before merging would be necessary to ensure that the merged files have the same detector geometry, and that would be easier if the metadata were in the files. Without that check, we might end up with nonsensical merged files where half the file had events with an Se82 foil and the other half had an Nd150 foil...
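The pre-merge check described here could be as simple as comparing a few geometry-related metadata keys across the input files before merging; a hypothetical sketch (key names invented for illustration):

```python
# Hypothetical pre-merge guard: refuse to merge files whose
# geometry-related metadata disagrees. Keys are illustrative.
GEOMETRY_KEYS = ["detector_config", "source_foil"]

def check_mergeable(metadata_list):
    """Raise ValueError if any file's geometry metadata differs from the first."""
    reference = metadata_list[0]
    for meta in metadata_list[1:]:
        for key in GEOMETRY_KEYS:
            if meta.get(key) != reference.get(key):
                raise ValueError(f"geometry mismatch on '{key}': "
                                 f"{reference.get(key)!r} vs {meta.get(key)!r}")
    return True

check_mergeable([{"source_foil": "Se82"}, {"source_foil": "Se82"}])  # OK
try:
    check_mergeable([{"source_foil": "Se82"}, {"source_foil": "Nd150"}])
except ValueError as e:
    print("refused:", e)
```

With metadata embedded in the files, the guard needs nothing but the input files themselves; with external metadata, it additionally depends on the right sidecar files being found and paired correctly.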
Is that true? I suspect we will store a subset of metadata that is a shortcut to the full set - namely we will store things like a timestamp that can be used to look up the HV configurations at the time of the run. I would not advocate for us storing no metadata at all, would you? I don't think that is typical. I agree with Yves that this topic is worth discussion - but it's fine to start that discussion on here. We are a relatively small group, and by discussing online initially, it gives us time to consider our ideas and give them proper thought - there is a danger that if it is discussed only in a meeting, we could make rushed and ill-considered decisions. So no harm in starting the chat here and then moving to in-person discussion later on if needed. Sorry, that was a bit of a long one...
I did a quick check of this using the following config file (seeds identical to ensure event data is identical):
Running to generate files with/without metadata:

```console
$ flsimulate -c test.conf -o full.brio -E1
$ flsimulate -c test.conf -o empty.brio -E0 -m test.meta
```

So this is only 100 events, but what we see is:

So in fact storing the metadata in the file only adds 894 bytes, which is 2.5 times smaller than the data file plus a separate metadata file.
I need to double check this, but the variant "profile", i.e. the list of settings (just a list of strings), should be stored in the metadata (and I'll add it if not). Reconstituting, e.g., the geometry manager is then a case of starting the variant service with the "default+settings".
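The "default + settings" idea can be illustrated with a toy sketch: apply a stored profile (a list of "key=value" strings) over a table of defaults. This is only a conceptual illustration, not how Bayeux's variant service actually works; all names and values are made up:

```python
# Hypothetical defaults; in reality these come from the variant registry.
DEFAULTS = {"magnetic_field": "25 gauss", "source_foil": "Se82"}

def apply_profile(defaults: dict, profile: list) -> dict:
    """Reconstitute an effective configuration from defaults plus a
    stored profile - just a list of 'key=value' strings."""
    config = dict(defaults)
    for entry in profile:
        key, _, value = entry.partition("=")
        config[key] = value
    return config

effective = apply_profile(DEFAULTS, ["magnetic_field=0 gauss"])
print(effective)  # {'magnetic_field': '0 gauss', 'source_foil': 'Se82'}
```

The point being that only the (usually short) settings list needs to live in the metadata; everything else is recoverable from the defaults of the software version used.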
It's not either/or, but both. The files and DB hold the metadata, the DB interface provides the convenience for searching for files matching criteria.
This is, I think, partially down to Falaise (or rather Bayeux.dpp) having the concept of "Run" locked at the file level. We'd need new input/output modules to handle that if we wanted to have/deal with more than one "Run" (i.e. configuration) in a single file.
About data management, I have one thing in mind:
That pattern would still be possible! Again, it's not an either/or proposition - don't permit metadata to be stripped from production files, and provide tools to extract the metadata for use in other tasks as needed.
Also for the official files, we should have the HPSS location and metadata stored in a database, shouldn't we? We can have a field in the database to give a location in a tarball.
@cherylepatrick Yes, we would like to store metadata and location in the database. We have discussed with Manu that we will add a function to snflash that stores this data in the database. But it is not ready yet; I would like to work on it in the next days.
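For illustration, a catalogue table like the one discussed might look as follows (SQLite via Python's standard library; the schema and all column names are purely hypothetical, not the actual snflash design):

```python
import sqlite3

# Hypothetical file catalogue: one row per produced file, holding its
# storage location plus the key searchable metadata fields.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sim_files (
        path             TEXT PRIMARY KEY,  -- sps or HPSS location
        tarball_member   TEXT,              -- location inside a tarball, if any
        event_generator  TEXT,
        vertex_generator TEXT,
        seed             INTEGER
    )
""")
conn.execute(
    "INSERT INTO sim_files VALUES (?, ?, ?, ?, ?)",
    ("/hpss/sim/run1.brio", None, "Se82.0nubb", "source_pads_bulk", 12345),
)
rows = conn.execute(
    "SELECT path FROM sim_files WHERE event_generator = ?", ("Se82.0nubb",)
).fetchall()
print(rows)  # [('/hpss/sim/run1.brio',)]
```

With metadata embedded in the brio files, the registration step can read these fields straight out of the file being archived, rather than relying on a sidecar file travelling with it.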
Sounds good @robobre ! A couple of things to bear in mind if we automate DB update with sn-flash:
@cherylepatrick Exactly, I want to add this function to the part where the file is moved to sps or HPSS.
A couple of brain-dump thoughts based on the above and discussion from the meeting this morning (and going back to DocDB-5183 and 4798). Those show that for a given production run, the following layout is generated (comments inline as to what I think the files are; corrections welcome, as I don't see where the tar bundle of metadata for HPSS fits in/is created from):

Structure
This very much looks like using the filesystem to define a file format. Nothing wrong with that; all this PR is proposing is to reduce the level of complexity whilst retaining information. All of the data that's under

Given that the
Here,

Any of the

Using sets of files with different Metadata

The issue here is that the concept of "Run" in Falaise (or rather Bayeux.dpp) is modelled by "brio file". So there is no mechanism (as yet) other than single file open/close that models "Run change, i.e. metadata change". That's why at the moment we can't easily merge (meta)data from several files together into one. We've got the capability to do this because
To progress this, I'd suggest the following:
This retains the required interface for production/ops, whilst ensuring files retain their metadata. Thoughts?
To work towards the requests in #176, #178 this PR addresses some basic simplification of `flsimulate`'s metadata organisation and output, of which random number seeds/state are a part.

The core change being implemented is to ensure that all metadata relating to an `flsimulate` run can only be stored in the output `.brio` file. In particular, the provision for `flsimulate` to output metadata to a separate file is removed, with a separate `brio_file_dumper` (name to be argued over!) program supplied that takes a `.brio` file as input and can spit out as much/little info about the file contents (including metadata) as needed. At present it simply dumps the metadata to `stdout` in raw `datatools::multi_properties` format, so a production job can obtain the metadata via redirection.

This ensures that:

- all metadata is generated and stored by `flsimulate` (which is purely responsible for generating that data);
- production tasks remain free to dump the metadata from the `.brio` file, or to extract info in any form suitable for database/cataloguing. This can easily be extended to raw/reconstructed data as well.

This is still a "Draft" PR because not everything is implemented yet, and I wanted early eyes/feedback from all of you (as it impacts production/database/reco/analysis). The main thing to work out is what your different areas need from the `brio_file_dumper` program, i.e. what interface/printed info is most useful to you?