Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IQSS/8889 - second try to limit file pids management #9721

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Jul 19, 2023

What this PR does / why we need it: This PR replaces #9716 which was accidentally closed via a bad push to develop (see dv-tech slack channel). This PR recreates the prior one and adds additional changes to use a second :AllowEnablingFilePIDsPerCollection setting that enables/disables the API to change whether FIle PIDs are allowed per collection. (The :FilePIDsEnabled setting is now just a way to set the global default.)

Which issue(s) this PR closes:

Closes #8889

Special notes for your reviewer: I hopefully have cleaned up the develop branch and gotten everything in here correctly. I'll check again tomorrow morning and will scan the docs/code once more to see if I missed anything we talked about in slack and the old PR. (the c6d4269 reverts the squash of all the commits that were reverted in develop).

Suggestions on how to test this: Setting :AllowEnablingFilePIDsPerCollection=true should enable the API call for superusers to set FilePIDs on/off per collection. Setting the global default via :FilePIDsEnabled should allow you to compare a collection using that global default and one where you've used the API to flip from true to false or vice versa. There is an automated test that verifies publishing a dataset after a collection has had its setting flipped does indeed have it's file PIDs handled correctly.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: modified from the ones for the original PR - can be combined with those (I think it should be cut/paste to replace them).

Additional documentation:

This reverts commit 36c9be5.

Revert "no need to delete since we aren't using the null/unset val now"

This reverts commit a0ac9e3.

Revert "add second setting"

This reverts commit 6944823.

Revert "check collection-level setting"

This reverts commit 4d6ea3e.

Revert "flip to default being false"

This reverts commit 14e59ca.

Revert "check PID setting in create dataverse command"

This reverts commit 33f994d.

Revert "also only sleep when reigstering in this call"

This reverts commit d9f8e8a.

Revert "update docs"

This reverts commit dc5d000.

Revert "require superuser and for :FilePIDsEnabled to be set to allow changes"

This reverts commit 3e9262f.

Revert "don't mint file PIDs in collections where they are not allowed"

This reverts commit aef24bb.
@qqmyers qqmyers added the Size: 10 A percentage of a sprint. 7 hours. label Jul 19, 2023
@qqmyers qqmyers added this to the 5.14 milestone Jul 19, 2023
@coveralls
Copy link

coveralls commented Jul 19, 2023

Coverage Status

coverage: 20.383% (-0.009%) from 20.392% when pulling bde7890 on GlobalDataverseCommunityConsortium:IQSS/8889-limit-file-pids-management into e67dad0 on IQSS:develop.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I spotted something to change in the docs.

@@ -2775,14 +2775,35 @@ timestamps.
:FilePIDsEnabled
++++++++++++++++

Toggles publishing of file-level PIDs for the entire installation. By default this setting is absent and Dataverse Software assumes it to be true. If enabled, the registration will be performed asynchronously (in the background) during publishing of a dataset.
Toggles publishing of file-level PIDs for the entire installation. By default this setting is absent and Dataverse Software assumes it to be false. If enabled, the registration will be performed asynchronously (in the background) during publishing of a dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default this setting is no longer absent, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - a new installation will not have it set, which we now interpret as false. Existing installs which had it not set will now have it set true.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't deploy. Please see the Flyway error below.

Comment on lines 1 to 3
INSERT INTO setting(content, lang, name) VALUES
('true', '', ':FilePIDsEnabled')
ON CONFLICT DO NOTHING;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting this error (I'm on 687c872):

dev_dataverse> Caused by: org.flywaydb.core.internal.sqlscript.FlywaySqlScriptException: Migration V5.13.0.3__8889-change-filepids-default.sql failed
dev_dataverse> ------------------------------------------------------------
dev_dataverse> SQL State  : 23514
dev_dataverse> Error Code : 0
dev_dataverse> Message    : ERROR: new row for relation "setting" violates check constraint "non_empty_lang"
dev_dataverse>   Detail: Failing row contains (1, true, , :FilePIDsEnabled).
dev_dataverse> Location   : db/migration/V5.13.0.3__8889-change-filepids-default.sql (/opt/payara/deployments/dataverse/WEB-INF/classes/db/migration/V5.13.0.3__8889-change-filepids-default.sql)
dev_dataverse> Line       : 1
dev_dataverse> Statement  : INSERT INTO setting(content, lang, name) VALUES
dev_dataverse>               ('true', '', ':FilePIDsEnabled')
dev_dataverse>        ON CONFLICT DO NOTHING

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error goes away if you change empty string to NULL: https://github.com/IQSS/dataverse/pull/9721/files#r1270667037

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a suggestion for the sql script.

@@ -0,0 +1,3 @@
INSERT INTO setting(content, lang, name) VALUES
('true', '', ':FilePIDsEnabled')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
('true', '', ':FilePIDsEnabled')
('true', NULL, ':FilePIDsEnabled')

Is the above what we want instead? NULL instead of empty string? With empty string I'm getting the error below.

@scolapasta scolapasta self-assigned this Jul 24, 2023
@scolapasta scolapasta removed their assignment Jul 26, 2023
@kcondon kcondon self-assigned this Jul 27, 2023
@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2023

Issues found:
Issues:

  1. Attempting to register a file pid when not enabled logs "starting..."" but not result, though result is shown on command line
    [2023-07-27T18:46:49.367+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.api.Admin] [tid: _ThreadID=77 _ThreadName=http-thread-pool::http-listener-1(5)] [timeMillis: 1690483609367] [levelValue: 800] [[
    Starting to register 333906 file id. Thu Jul 27 18:46:49 UTC 2023]]

curl "http://localhost:8080/api/admin/333906/registerDataFile"
{"status":"ERROR","message":"PIDs are not enabled for this file's collection."}

  1. curl http://localhost:8080/api/admin/registerDataFileAll
    Doesn't check :FilePIDsEnabled or :AllowEnablingFilePIDsPerCollection up front, rather it appears to go through every file, checking whether it is published first. Very inefficient.
    {"status":"OK","data":{"message":"Datafile registration complete.0 of 964 unregistered, published f

Final log entries provide several streams of info:
3:04
[2023-07-27T19:03:39.767+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.api.Admin] [tid: _ThreadID=77 _ThreadName=http-thread-pool::http-listener-1(5)] [timeMillis: 1690484619767] [levelValue: 800] [[
Final Results:]]
[2023-07-27T19:03:39.767+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.api.Admin] [tid: _ThreadID=77 _ThreadName=http-thread-pool::http-listener-1(5)] [timeMillis: 1690484619767] [levelValue: 800] [[
9431 of 19405 files were already registered. Thu Jul 27 19:03:39 UTC 2023]]
[2023-07-27T19:03:39.767+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.api.Admin] [tid: _ThreadID=77 _ThreadName=http-thread-pool::http-listener-1(5)] [timeMillis: 1690484619767] [levelValue: 800] [[
9492 of 19405 files are not yet published. Thu Jul 27 19:03:39 UTC 2023]]
[2023-07-27T19:03:39.768+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.api.Admin] [tid: _ThreadID=77 _ThreadName=http-thread-pool::http-listener-1(5)] [timeMillis: 1690484619768] [levelValue: 800] [[
964 of 19405 unregistered, published files to register. Thu Jul 27 19:03:39 UTC 2023]]
[2023-07-27T19:03:39.768+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.api.Admin] [tid: _ThreadID=77 _ThreadName=http-thread-pool::http-listener-1(5)] [timeMillis: 1690484619768] [levelValue: 800] [[
0 of 964 unregistered, published files registered successfully. Thu Jul 27 19:03:39 UTC 2023]]
[2023-07-27T19:03:39.768+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.api.Admin] [tid: _ThreadID=77 _ThreadName=http-thread-pool::http-listener-1(5)] [timeMillis: 1690484619768] [levelValue: 800] [[
9973 of 19405 files not in collections that allow file PIDs. Thu Jul 27 19:03:39 UTC 2023]]
3:05
3. Setting collection level file pids is allowed even when AllowEnablingFilePIDsPerCollection, enablefilepid isn't set.
curl -X PUT -H "X-Dataverse-key:xyz" "http://dataverse-internal.iq.harvard.edu:8080/api/dataverses/coll_pid_enabled/attribute/filePIDsEnabled?value=true"
{"status":"OK","message":{"message":"Update successful"},"data":{"id":333916,"alias":"coll_pid_enabled","name":"File PIDs Collection Enabled, Disabled on System","dataverseContacts":[{"displayOrder":0,"contactEmail":"[email protected]"}],"permissionRoot":true,"dataverseType":"DEPARTMENT","ownerId":1,"creationDate":"2023-07-27T19:31:11Z","filePIDsEnabled":true}}[root@dataverse-internal ~]#
[root@dataverse-internal ~]#
[root@dataverse-internal ~]#
[root@dataverse-internal ~]# curl http://localhost:8080/api/admin/settings/:AllowEnablingFilePIDsPerCollection
{"status":"ERROR","message":"Setting :AllowEnablingFilePIDsPerCollection not found"}

Also happens when AllowEnablingFilePIDsPerCollection is set to false and enablefilepid isn't set

  1. Sub collections appear to inherit AllowEnablingFilePIDsPerCollection= true value.

@qqmyers
Copy link
Member Author

qqmyers commented Jul 27, 2023

  1. FWIW: This is ~legacy behavior - that call writes Starting in the log and only writes again if there's been an exception/PID service failure, etc. (It also returns 'OK' whether there was a problem or not.)
  2. The current design/implementation retains the collection-level setting, regardless of the current state of the AllowEnabling setting, so there is no way to avoid going through the list. (It's a legacy issue that the call scans through all files instead of querying for only those without file PIDs. With the new code, one could try to cache the per-collection setting so it isn't looked up per file but I don't think that's significant compared to getting all files to start. FWIW: The register data files in a collection command does check the per-collection filePIDsEnabled setting once up front and will return immediately/before retrieving any files in the collection if it isn't enabled.)
    I'm not sure why you had 0 of 964 successes according to the log - were there log messages before the final tally? DataCite was having issues today.
    Also FYI: I think there was another logic flaw in the registerAll api call that would have registered files when filePIDsEnabled was false for a collection. That should also be fixed.
  3. The logic was backward (setting that attribute would have failed if you had AllowEnabling... = true), now fixed
  4. This is as designed/the same as other settings. I didn't change that in this PR.

@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2023

  1. OK, legacy issues, guess I should open a separate issue then, thanks.
  2. When I ran that test, filepids was disabled. I reran it with it enabled and saw 140 or so failure to register errors and this as final results:
    failed_reg_regall.txt
    final_results_filepidenabled.txt
    I'll need to rerun this, the output is too confusing for me to make sense of it. I've done 2 runs but only see start/final posts once today.
    Command lines from today:
    Run1 with file pids disabled:
    [root@dataverse-internal ~]# curl http://localhost:8080/api/admin/registerDataFileAll
    {"status":"OK","data":{"message":"Datafile registration complete.0 of 964 unregistered, published

Run2 with file pids enabled:
curl http://localhost:8080/api/admin/registerDataFileAll

{"status":"OK","data":{"message":"Datafile registration complete.0 of 1008 unregistered, published files registered successfully."}}

  1. Will retest, thanks. [Kevin] Works now, thanks.
  2. Ok, I wasn't sure about this since doc said it was for a specific collection. Thanks.

@qqmyers
Copy link
Member Author

qqmyers commented Jul 27, 2023

re: 2. It looks like the check to get the authenticated user was inside the loop and it's error message was getting caught/logged and not causing the call to fail with a 403. I've changed that now. (My guess is that you had no/a bad api key and the method just ran without succeeding, looping through all the files, rather than failing up front.)

@kcondon
Copy link
Contributor

kcondon commented Jul 28, 2023

@qqmyers re: 2, registerDataFileAll, that is an admin endpoint so it shouldn't need an api key, right?
Now it does. I'll test as it is but I thought the practice was admin endpoints require localhost so implies person has access to machine and therefore already superuser.

@qqmyers
Copy link
Member Author

qqmyers commented Jul 28, 2023

There are many admin calls that require an authenticated user, and a few that require superuser. A lot of them, but not all, are ones I created so I may have not gone with the convention. FWIW - I think requiring a key does mean that the action log records who ran the commands in the api call.

@kcondon kcondon merged commit f536749 into IQSS:develop Jul 28, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 10 A percentage of a sprint. 7 hours.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Enable file PIDs at Dataverse Collection Level
5 participants