arXiv dubs #3575

Open
ksachs opened this issue Aug 2, 2018 · 3 comments


ksachs commented Aug 2, 2018

Trying to trace why arXiv records are created twice.

Instead of an update, a new record is created:

arXiv:1807.07025, 1683196, 1683259
arXiv:1807.06513, 1682949, 1682955

E.g.

001682955 541__ $$aarXiv$$chepcrawl$$d2018-07-18T03:36:57.313380$$e1132375  
Record added 2018-07-18, last modified 2018-07-18

001682949 541__ $$aarXiv$$chepcrawl$$d2018-07-19T03:36:40.280688$$e1134788    
Record added 2018-07-18, last modified 2018-07-19 
which is not true, the record was created on 2018-07-19

The second workflow, 1134788, is correctly identified as an exact match.
But instead of a replace, a new record is created.
The creation date of this new record is inherited from the old record.

https://labs.inspirehep.net/api/holdingpen/1134788 contains:

"callback_result": {
  "marcxml": "<record>\n  <controlfield tag=\"001\">1682949</controlfield

For some reason a new recid is added to

/opt/cds-invenio/var/tmp-shared/batchupload_20180719050932_D9f4T8
<controlfield tag="001">1682949</controlfield>

Where is this new recid coming from??
Can it overwrite something?

An update comes in while the first record is halted:

For these I'm not sure I understand the info in the api:
arXiv:1807.10190, 1684265, 1684268
arXiv:1807.09872, 1684269, 1684274
arXiv:1807.10163, 1684266, 1684273

What I believe happens, e.g.:

001684265 541__ $$aarXiv$$chepcrawl$$d2018-07-27T03:35:47.430949$$e1147082  
001684268 541__ $$aarXiv$$chepcrawl$$d2018-07-28T03:35:35.671772$$e1148488   

the first is halted for match-approval.
While it is halted the second comes in.
Now they also somehow match themselves.
But both upload files contain no controlnumber, each creating a new record.

https://labs.inspirehep.net/api/holdingpen/1147082
"exact-matched": true, 
"fuzzy_match_approved_id": null, 
"holdingpen_matches": [
  1148488
], 
"is-update": true, 
"matches": {
  "approved": 1684265, 
  "exact": [
    1684265, 
    1684268
  ], 
  "fuzzy": [
      "control_number": 1665833, 
"marcxml": "<record>\n  <controlfield tag=\"001\">1684265</controlfield>


https://labs.inspirehep.net/api/holdingpen/1148488
"exact-matched": true, 
"fuzzy_match_approved_id": null, 
"holdingpen_matches": [
  1147082
], 
"is-update": true, 
"matches": {
  "approved": 1684265, 
  "exact": [
    1684265, 
    1684268
  ], 
  "fuzzy": [
      "control_number": 1665833, 
"marcxml": "<record>\n  <controlfield tag=\"001\">1684265</controlfield>\n
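
To spot this pattern in bulk, something like the following could be run over the holdingpen data. It is only a rough sketch: it assumes the relevant part of each workflow has already been loaded into a plain dict containing just the keys quoted above (`holdingpen_matches`, `matches`), and the function name `mutual_holdingpen_matches` is made up for illustration, not part of any existing code.

```python
def mutual_holdingpen_matches(workflows):
    """Return pairs of workflow ids that list each other in holdingpen_matches.

    `workflows` maps a workflow id to a dict holding (at least) the
    "holdingpen_matches" list as quoted in this issue. The flat layout
    is an assumption made for this sketch.
    """
    pairs = set()
    for wf_id, data in workflows.items():
        for other in data.get("holdingpen_matches", []):
            other_matches = workflows.get(other, {}).get("holdingpen_matches", [])
            if wf_id in other_matches:
                pairs.add(tuple(sorted((wf_id, other))))
    return sorted(pairs)


# Values copied from the two API excerpts above (only the fields shown there).
workflows = {
    1147082: {
        "holdingpen_matches": [1148488],
        "matches": {"approved": 1684265, "exact": [1684265, 1684268]},
    },
    1148488: {
        "holdingpen_matches": [1147082],
        "matches": {"approved": 1684265, "exact": [1684265, 1684268]},
    },
}

mutual = mutual_holdingpen_matches(workflows)
print(mutual)
```

For the two workflows quoted here this reports the single pair (1147082, 1148488), i.e. each one sees the other as already being in the holding pen.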

ksachs commented Aug 3, 2018

Correction: all upload files I checked do have a control number, i.e. a recid.


ksachs commented Aug 17, 2018

Each pair below was created by the same workflow, one right after the other (successive recids).
All workflows have an error message in extra_data. Maybe the double upload was triggered by a restart?

arXiv:1808.01257, 1685054, 1685055
001685055 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371
001685054 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371

arXiv:1808.01365, 1685234, 1685235
001685235 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286
001685234 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286

arXiv:1808.01473, 1685232, 1685233
001685233 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331
001685232 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331
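
A possible way to find more of these pairs: group the 541 lines by their `$$e` subfield (the workflow id) and report every workflow id that shows up on more than one recid. This is a minimal sketch that only assumes the text-MARC layout of the lines quoted in this issue; `parse_line` and `recids_by_workflow` are hypothetical helper names.

```python
import re
from collections import defaultdict

# "<9-digit recid> <3-char tag><2 indicators> $$<code><value>$$<code><value>..."
LINE_RE = re.compile(r"^(\d{9}) (\w{3})(..) (.*)$")


def parse_line(line):
    """Split one text-MARC line into (recid, tag, {subfield code: value})."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    recid, tag, _indicators, body = match.groups()
    subfields = {}
    for chunk in body.split("$$")[1:]:
        if chunk:
            subfields[chunk[0]] = chunk[1:].strip()
    return int(recid), tag, subfields


def recids_by_workflow(lines):
    """Map workflow id (541 $$e) to its recids, keeping only ids seen on 2+ records."""
    groups = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        recid, tag, subfields = parsed
        if tag == "541" and "e" in subfields:
            groups[subfields["e"]].add(recid)
    return {wf: sorted(recids) for wf, recids in groups.items() if len(recids) > 1}


# Lines copied from this issue; the first one is a non-duplicate for contrast.
lines = [
    "001682955 541__ $$aarXiv$$chepcrawl$$d2018-07-18T03:36:57.313380$$e1132375",
    "001685055 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371",
    "001685054 541__ $$aarXiv$$chepcrawl$$d2018-08-06T03:35:25.423672$$e1160371",
    "001685235 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286",
    "001685234 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:48.991131$$e1161286",
    "001685233 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331",
    "001685232 541__ $$aarXiv$$chepcrawl$$d2018-08-07T03:43:50.714420$$e1161331",
]

duplicates = recids_by_workflow(lines)
print(duplicates)
```

Workflow ids are kept as strings because that is how they appear in the subfield; the recids are converted to integers so the leading zeros drop out.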


ksachs commented Aug 28, 2018

Another update that came in while the first record was halted.
Somehow the order of actions might not be right.
The second workflow (claims to) stop the first only after being halted for match approval.
The first wf continues anyhow, is again stopped for matching, and in the end runs send_to_legacy.

001688926 037__ $$9arXiv$$aarXiv:1808.05450$$chep-ph
001688926 541__ $$aarXiv$$chepcrawl$$d2018-08-17T03:35:14.401921$$e1177634

001688751 541__ $$aarXiv$$chepcrawl$$d2018-08-18T03:35:02.440396$$e1179088


WorkFlow:1177634
  {
    "nicename": "\"Halted for matching approval.\"", 
    "time": "2018-08-17 03:53:26.347351"
  }, 
  {
    "nicename": "Mark the workflow object with stopped-by-wf:1179088.", 
    "time": "2018-08-20 15:00:45.534012"
  }, 
  ....
  {
    "nicename": "\"Halted for matching approval.\"", 
    "time": "2018-08-20 15:01:08.732753"
  }, 
  {
    "doc": "IF_ELSE: args(<function is_fuzzy_match_approved at 0x7f22106d0ed8>, ....
    "time": "2018-08-21 07:32:05.250600"
  }, 
  ....
  {
    "nicename": "send_to_legacy", 
    "time": "2018-08-21 07:32:56.497283"
  }, 
"holdingpen_matches": [
  1179088
], 


WorkFlow:1179088
  {
    "nicename": "Mark the workflow object with already-in-holding-pen:True.", 
    "time": "2018-08-18 03:42:35.836757"
  }, 
  ....
  {
    "nicename": "\"Halted for matching approval.\"", 
    "time": "2018-08-18 03:42:36.372337"
  }, 
  {
    "nicename": "Stop the matched workflow objects in the holdingpen.", 
    "time": "2018-08-20 15:00:45.712279"
  }, 
  ....
  {
    "nicename": "send_to_legacy", 
    "time": "2018-08-20 15:03:52.980712"
  }, 
  ....
  {
    "nicename": "Mark the workflow object with stopped-by-wf:1177634.", 
    "time": "2018-08-21 07:32:05.442135"
  }, 
"holdingpen_matches": [
  1177634
], 
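
To make the interleaving easier to see, the two histories can be merged into one timeline sorted by timestamp. The sketch below assumes each history entry is a dict with `nicename` and `time` keys in the format shown above; the entries elided with `....` and the truncated `IF_ELSE` entry are left out rather than guessed, and `merged_timeline` is just an illustrative name.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def merged_timeline(histories):
    """Merge per-workflow history entries into one list sorted by time."""
    events = []
    for wf_id, entries in histories.items():
        for entry in entries:
            when = datetime.strptime(entry["time"], TIME_FORMAT)
            events.append((when, wf_id, entry["nicename"]))
    return sorted(events)


# Entries copied from the two workflow excerpts above.
histories = {
    1177634: [
        {"nicename": '"Halted for matching approval."', "time": "2018-08-17 03:53:26.347351"},
        {"nicename": "Mark the workflow object with stopped-by-wf:1179088.", "time": "2018-08-20 15:00:45.534012"},
        {"nicename": '"Halted for matching approval."', "time": "2018-08-20 15:01:08.732753"},
        {"nicename": "send_to_legacy", "time": "2018-08-21 07:32:56.497283"},
    ],
    1179088: [
        {"nicename": "Mark the workflow object with already-in-holding-pen:True.", "time": "2018-08-18 03:42:35.836757"},
        {"nicename": '"Halted for matching approval."', "time": "2018-08-18 03:42:36.372337"},
        {"nicename": "Stop the matched workflow objects in the holdingpen.", "time": "2018-08-20 15:00:45.712279"},
        {"nicename": "send_to_legacy", "time": "2018-08-20 15:03:52.980712"},
        {"nicename": "Mark the workflow object with stopped-by-wf:1177634.", "time": "2018-08-21 07:32:05.442135"},
    ],
}

timeline = merged_timeline(histories)
for when, wf_id, step in timeline:
    print(when.isoformat(sep=" "), wf_id, step)
```

Read this way, 1177634 is marked as stopped by 1179088 on 2018-08-20 but is halted for matching again seconds later, and both workflows end up reaching send_to_legacy (1179088 on 2018-08-20, 1177634 on 2018-08-21).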
