A helper to produce access (download) stats #39

Open
yarikoptic opened this issue Apr 29, 2021 · 0 comments
Raised by @satra in emails: we would need a way to get data access stats.

We have a bucket with S3 access logs, and IMHO it would be the ultimate source of stats, since we share not only dandi-api URLs but also direct S3 URLs. Also, we do not keep all versions of data in draft, so neither girder nor the dandi-api DB would have all that information.
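S3 server access logs have a documented space-separated format (timestamp in brackets, request line in quotes), so per-key download counting could be sketched roughly as below. The `OWN_IPS` set is a placeholder for our own infrastructure addresses (backup, mirroring), which we would want to filter out as discussed above:

```python
import re
from collections import Counter

# S3 server access logs are space-separated; this regex pulls out the
# fields we care about: remote IP, operation, object key, and HTTP status.
LOG_RE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] (?P<ip>\S+) \S+ \S+ '
    r'(?P<operation>\S+) (?P<key>\S+) "[^"]*" (?P<status>\S+)'
)

# Placeholder: IPs of our own services (backups etc.) to exclude.
OWN_IPS = {"203.0.113.5"}

def count_downloads(lines):
    """Count successful object GETs per S3 key, skipping our own traffic."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # malformed or unexpected log line
        if m["ip"] in OWN_IPS:
            continue  # our own access, not a user download
        if m["operation"] != "REST.GET.OBJECT" or m["status"] != "200":
            continue  # only count successful downloads
        counts[m["key"]] += 1
    return counts
```

Aggregating these per-key counts against a key-to-dandiset mapping would then give per-dandiset download stats.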

So I guess we should resort to using the datalad dandisets and the information (URLs) stored in git-annex history. Since we only update on 'cron', that information could also be incomplete, but AFAIK it is the best we can get ATM. In the longer run, we might want to establish access stats from those S3 access logs directly, ideally filtering out our own access (for backup etc.), probably based on IP(s). For that we would need to

  • sweep through all assets of all dandisets (well -- git annex whereis --json --all or alike)
  • use the URLs in the returned values to map S3 paths to dandisets (ATM likely to be unique, i.e. blob:dandiset, but might already be violated for some tiny files)
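The sweep above could be sketched as follows. The parsing step is separated out so it can be tested; each line of `git annex whereis --json --all` output is one JSON record whose `whereis` entries may carry `urls` (the S3-URL heuristic and bucket names here are assumptions, not fixed conventions):

```python
import json
import subprocess
from urllib.parse import urlparse

def extract_s3_keys(whereis_json_lines):
    """Collect S3 object keys from `git annex whereis --json --all` output.

    Each input line is one JSON record; remote entries under "whereis"
    may list "urls".  We keep the path of any S3-looking URL.
    """
    keys = set()
    for line in whereis_json_lines:
        record = json.loads(line)
        for remote in record.get("whereis", []):
            for url in remote.get("urls", []):
                parsed = urlparse(url)
                # Crude heuristic for S3 URLs; would need refining.
                if "amazonaws" in parsed.netloc or "s3" in parsed.netloc:
                    keys.add(parsed.path.lstrip("/"))
    return keys

def s3_keys_for_dandiset(clone_path):
    """Run the sweep in one datalad dandiset clone."""
    proc = subprocess.run(
        ["git", "annex", "whereis", "--json", "--all"],
        cwd=clone_path, capture_output=True, text=True, check=True,
    )
    return extract_s3_keys(proc.stdout.splitlines())
```

Running `s3_keys_for_dandiset` over every dandiset clone and inverting the result would give the S3-key-to-dandiset mapping needed to attribute log entries.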

A problem would eventually arise: we would not be able to uniquely map a blob UUID to a specific dandiset once we start creating meta-dandisets. Then S3 logs would not be sufficient, and ATM I do not see any way we could really disambiguate (e.g. if data is accessed by a direct S3 URL). We could, though, add logic to assume the 'earliest' dandiset (lowest dandiset id, or earliest commit date across dandisets) to be the origin of a file, or explicitly record in the DB for which dandiset every new blob was originally added.
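The 'earliest dandiset' fallback mentioned above could be as simple as the sketch below, here approximating "earliest" by lowest dandiset id (a commit-date tie-break, or an explicit origin column in the DB, would be more robust):

```python
def attribute_blob(owners):
    """Pick the origin dandiset for a blob shared by several dandisets.

    `owners` is the set of dandiset ids a blob maps to.  Zero-padded ids
    like "000003" sort lexicographically, so min() gives the lowest id.
    Returns None when the blob maps to no dandiset at all.
    """
    if not owners:
        return None
    return min(owners)
```

This keeps ambiguous attributions deterministic, but it silently misattributes downloads of a blob that genuinely belongs to a later dandiset, which is why explicit origin recording would be preferable.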
