Holding Pen: Merger Musings #3537

Open
ksachs opened this issue Jul 10, 2018 · 0 comments

ksachs commented Jul 10, 2018

Some musings about the merger or, more precisely, about when we need to update and merge.

Merging records is the most complex and error-prone action we have.
Don't do it unless necessary.

Process:

  1. harvest: OAI from arXiv, feed from publisher - original.V1
  2. conversion to inspire-json - leads to basic_record.V1
  3. enrichment - adds enrichment_record.V1 to basic_record.V1 (visible in HoldingPen)
  4. harvest of updated version - original.V2
  5. conversion to inspire-json - leads to basic_record.V2
    Here we could compare basic_record.V2 and basic_record.V1.
    No change -> end workflow
    If the pdf changed (can we see that?), replace only the fulltext and re-run refextract
  6. enrichment
  7. auto-merge incl. info from BAI tables - leads to (partially-)merged_record.V2 (visible in HoldingPen)
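
The decision in step 5 could be sketched roughly as follows. This is only an illustration, not existing workflow code: the function names and the returned step labels are hypothetical, and the choice of which fields count as per-harvest noise (`acquisition_source`, the `_files` bucket and version ids, the document `url`) is an assumption based on what differs between the two harvests listed in this issue.

```python
import copy


def metadata_only(record):
    """Copy a converted record, dropping per-harvest noise and file listings."""
    cleaned = copy.deepcopy(record)
    cleaned.pop("acquisition_source", None)  # harvest timestamp differs on every run
    cleaned.pop("_files", None)              # files are compared by checksum instead
    for document in cleaned.get("documents", []):
        document.pop("url", None)            # embeds the per-workflow bucket id
    return cleaned


def pdf_checksums(record):
    """Map each PDF key in `_files` to its checksum."""
    return {
        f["key"]: f["checksum"]
        for f in record.get("_files", [])
        if f["key"].endswith(".pdf")
    }


def next_step(basic_v1, basic_v2):
    """Decide what to do with an updated harvest (hypothetical step labels)."""
    if metadata_only(basic_v1) != metadata_only(basic_v2):
        return "enrich_and_merge"
    if pdf_checksums(basic_v1) != pdf_checksums(basic_v2):
        return "replace_fulltext_and_rerun_refextract"
    return "end_workflow"
```

Comparing the `checksum` values in `_files` is one possible answer to "can we see that?": the bucket and version ids change with every harvest, but the md5 of an unchanged PDF should not.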

It is difficult (for me, impossible) to tell which information really came from arXiv and what should be updated. I'm not sure this is the best procedure.

To determine whether an update and merge is necessary, I would base the comparison on the converted INSPIRE JSON record, not on the original harvest. Publisher metadata can be very rich and might change in a place we don't use. In addition, its structure might change. The conversion to JSON is a good filter to avoid such problems.

Example: arXiv.1807.02123

At arXiv:

Current metadata
[v1] Thu, 5 Jul 2018 18:00:12 GMT (65kb,D)

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2018-07-10T08:34:55Z</responseDate>
<request verb="GetRecord" identifier="oai:arXiv.org:1807.02123" metadataPrefix="arXiv">http://export.arxiv.org/oai2</request>
<GetRecord>
<record>
<header>
 <identifier>oai:arXiv.org:1807.02123</identifier>
 <datestamp>2018-07-09</datestamp>
 <setSpec>physics:astro-ph</setSpec>
 <setSpec>physics:gr-qc</setSpec>
 <setSpec>physics:hep-th</setSpec>
</header>
<metadata>
 <arXiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd">
  <id>1807.02123</id>
  <created>2018-07-05</created>
  <authors>
   <author>
    <keyname>Isi</keyname>
    <forenames>Maximiliano</forenames>
   </author>
   <author>
    <keyname>Stein</keyname>
    <forenames>Leo C.</forenames>
   </author>
  </authors>
  <title>Measuring stochastic gravitational-wave energy beyond general relativity</title>
  <categories>gr-qc astro-ph.CO astro-ph.HE hep-th</categories>
  <comments>18 pages (plus appendices), 1 figure</comments>
  <report-no>LIGO-P1700234</report-no>
  <license>http://arxiv.org/licenses/nonexclusive-distrib/1.0/</license>
  <abstract>  Gravity theories beyond general relativity (GR) can change the properties of ....</abstract>
 </arXiv>
</metadata>
</record>
</GetRecord>
</OAI-PMH>
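
To make a response like the one above easier to compare against the converted record, the relevant fields could be pulled out with a few lines of Python. This is a minimal sketch using only the standard library; it is not the conversion hepcrawl performs, and the function name and the shape of the returned dictionary are made up for illustration. The "keyname, forenames" author format simply mirrors the `full_name` values in the converted records in this issue.

```python
import xml.etree.ElementTree as ET

NS = {"arxiv": "http://arxiv.org/OAI/arXiv/"}


def summarize_arxiv_record(oai_xml):
    """Extract the main arXiv metadata fields from a GetRecord response."""
    # Strip surrounding whitespace: a pasted response may carry some,
    # and the XML declaration must be the very first thing in the input.
    root = ET.fromstring(oai_xml.strip())
    meta = root.find(".//arxiv:arXiv", NS)
    if meta is None:
        raise ValueError("no arXiv metadata element found")

    def text(tag):
        return (meta.findtext("arxiv:" + tag, default="", namespaces=NS) or "").strip()

    authors = []
    for author in meta.findall("arxiv:authors/arxiv:author", NS):
        keyname = author.findtext("arxiv:keyname", default="", namespaces=NS)
        forenames = author.findtext("arxiv:forenames", default="", namespaces=NS)
        authors.append(", ".join(part for part in (keyname, forenames) if part))

    return {
        "id": text("id"),
        "created": text("created"),
        "authors": authors,
        "title": text("title"),
        "categories": text("categories").split(),
        "comments": text("comments"),
        "report-no": text("report-no"),
        "license": text("license"),
    }
```

Storing such a summary next to each workflow would at least show which arXiv metadata a given harvest started from.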

INSPIRE 1st harvest

HP 1117913

Can anyone find out which arXiv metadata were harvested?

basic_record.V1

Record after conversion to JSON

"id": 1117913, 
"metadata": {
  "$schema": "https://labs.inspirehep.net/schemas/records/hep.json", 
  "_collections": [
    "Literature"
  ], 
  "_files": [
    {
      "bucket": "2544991d-f864-4e3c-84db-175a3d9d796b", 
      "checksum": "md5:a4c818b1694a6a502a0a2f21674ca92e", 
      "key": "1807.02123.tar.gz", 
      "size": 66761, 
      "version_id": "13acfe1f-bc07-4965-82d8-d814fa47e17f"
    }, 
    {
      "bucket": "2544991d-f864-4e3c-84db-175a3d9d796b", 
      "checksum": "md5:005bb51602500a9a0b66c925205e2afd", 
      "key": "1807.02123.pdf", 
      "size": 916611, 
      "version_id": "33c371b6-69ca-4aa2-974d-8413d01be527"
    }
  ], 
  "abstracts": [
    {
      "source": "arXiv", 
      "value": "Gravity theories beyond general relativity (GR) can change t....."
    }
  ], 
  "acquisition_source": {
    "datetime": "2018-07-09T03:34:57.462577", 
    "method": "hepcrawl", 
    "source": "arXiv", 
    "submission_number": "1117913"
  }, 
  "arxiv_eprints": [
    {
      "categories": [
        "gr-qc", 
        "astro-ph.CO", 
        "astro-ph.HE", 
        "hep-th"
      ], 
      "value": "1807.02123"
    }
  ], 
  "authors": [
    {
      "full_name": "Isi, Maximiliano"
    }, 
    {
      "full_name": "Stein, Leo C."
    }
  ], 
  "documents": [
    {
      "fulltext": true, 
      "hidden": true, 
      "key": "1807.02123.pdf", 
      "material": "preprint", 
      "original_url": "http://export.arxiv.org/pdf/1807.02123", 
      "source": "arxiv", 
      "url": "/api/files/2544991d-f864-4e3c-84db-175a3d9d796b/1807.02123.pdf"
    }
  ], 
  "license": [
    {
      "license": "arXiv nonexclusive-distrib 1.0", 
      "material": "preprint", 
      "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"
    }
  ], 
 "preprint_date": "2018-07-05", 
 "public_notes": [
    {
      "source": "arXiv", 
      "value": "18 pages (plus appendices), 1 figure"
    }
  ], 
  "report_numbers": [
    {
      "source": "arXiv", 
      "value": "LIGO-P1700234"
    }
  ], 
  "titles": [
    {
      "source": "arXiv", 
      "title": "Measuring stochastic gravitational-wave energy beyond general relativity"
    }
  ]

enrichment_record.V1

Information added during the workflow

"citeable": true, 
"control_number": 1681259, 
"core": true, 
"curated": false, 
"document_type": [
  "article"
], 
"inspire_categories": [
  {
    "source": "arxiv", 
    "term": "Gravitation and Cosmology"
  }, 
  {
    "source": "arxiv", 
    "term": "Astrophysics"
  }, 
  {
    "source": "arxiv", 
    "term": "Theory-HEP"
  }
], 
"number_of_pages": 18, 
"references": [
 ...    
], 

Update

HP 1119139

basic_record.V2

Looks very much the same as basic_record.V1.

"id": 1119139, 
"metadata": {
  "$schema": "https://labs.inspirehep.net/schemas/records/hep.json", 
  "_collections": [
    "Literature"
  ], 
  "_files": [
    {
      "bucket": "7a52c6cf-2889-4233-8fb6-4fdfccf87f53", 
      "checksum": "md5:a4c818b1694a6a502a0a2f21674ca92e", 
      "key": "1807.02123.tar.gz", 
      "size": 66761, 
      "version_id": "d8afbc29-0514-43a5-9263-2adef7b8d371"
    }, 
    {
      "bucket": "7a52c6cf-2889-4233-8fb6-4fdfccf87f53", 
      "checksum": "md5:005bb51602500a9a0b66c925205e2afd", 
      "key": "1807.02123.pdf", 
      "size": 916611, 
      "version_id": "ba0ccc5d-2c9e-42fd-9dd5-3ecb47bb412a"
    }
  ], 
  "abstracts": [
    {
      "source": "arXiv", 
      "value": "Gravity theories beyond general relativity (GR) can change the properties of gravitational waves: their polarizations, dispersion, speed, and, importantly, energy content are all heavily theory- dependent. All these corrections can potentially be probed by measuring the stochastic gravitational- wave background. However, most existing treatments of this background beyond GR overlook modifications to the energy carried by gravitational waves, or rely on GR assumptions that are invalid in other theories. This may lead to mistranslation between the observable cross-correlation of detector outputs and gravitational-wave energy density, and thus to errors when deriving observational constraints on theories. In this article, we lay out a generic formalism for stochastic gravitational- wave searches, applicable to a large family of theories beyond GR. We explicitly state the (often tacit) assumptions that go into these searches, evaluating their generic applicability, or lack thereof. Examples of problematic assumptions are: statistical independence of linear polarization amplitudes; which polarizations satisfy equipartition; and which polarizations have well-defined phase velocities. We also show how to correctly infer the value of the stochastic energy density in the context of any given theory. We demonstrate with specific theories in which some of the traditional assumptions break down: Chern-Simons gravity, scalar-tensor theory, and Fierz-Pauli massive gravity. In each theory, we show how to properly include the beyond-GR corrections, and how to interpret observational results."
    }
  ], 
  "acquisition_source": {
    "datetime": "2018-07-10T03:36:36.182790", 
    "method": "hepcrawl", 
    "source": "arXiv", 
    "submission_number": "1117913"
  }, 
  "arxiv_eprints": [
    {
      "categories": [
        "gr-qc", 
        "astro-ph.CO", 
        "astro-ph.HE", 
        "hep-th"
      ], 
      "value": "1807.02123"
    }
  ], 
  "authors": [
    {
      "full_name": "Isi, Maximiliano", 
    }, 
    {
      "full_name": "Stein, Leo C.", 
    }
  ], 
  "documents": [
    {
      "fulltext": true, 
      "hidden": true, 
      "key": "1807.02123.pdf", 
      "material": "preprint", 
      "original_url": "http://export.arxiv.org/pdf/1807.02123", 
      "source": "arxiv", 
      "url": "/api/files/7a52c6cf-2889-4233-8fb6-4fdfccf87f53/1807.02123.pdf"
    }
  ], 
  "license": [
    {
      "license": "arXiv nonexclusive-distrib 1.0", 
      "material": "preprint", 
      "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"
    }, 
    {
      "license": "arXiv nonexclusive-distrib 1.0", 
      "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"
    }
  ], 
  "preprint_date": "2018-07-05", 
  "public_notes": [
    {
      "source": "arXiv", 
      "value": "18 pages (plus appendices), 1 figure"
    }
  ], 
  "report_numbers": missing due to problem with merger
  "titles": [
    {
      "source": "arXiv", 
      "title": "Measuring stochastic gravitational-wave energy beyond general relativity"
    }
  ]
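
Spotting the differences between two records like these by eye is tedious. A field-by-field comparison of the top-level keys could look like the sketch below. The helper name is hypothetical, and the two dictionaries are short excerpts of the records listed above, reduced to three fields.

```python
def changed_fields(old, new):
    """Report top-level keys that were added, removed, or changed."""
    report = {}
    for key in sorted(set(old) | set(new)):
        if key not in new:
            report[key] = "removed"
        elif key not in old:
            report[key] = "added"
        elif old[key] != new[key]:
            report[key] = "changed"
    return report


# Excerpts of basic_record.V1 and basic_record.V2 as listed in this issue.
LICENSE = {
    "license": "arXiv nonexclusive-distrib 1.0",
    "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
}
basic_v1 = {
    "license": [dict(LICENSE, material="preprint")],
    "preprint_date": "2018-07-05",
    "report_numbers": [{"source": "arXiv", "value": "LIGO-P1700234"}],
}
basic_v2 = {
    "license": [dict(LICENSE, material="preprint"), dict(LICENSE)],
    "preprint_date": "2018-07-05",
}

print(changed_fields(basic_v1, basic_v2))
# prints {'license': 'changed', 'report_numbers': 'removed'}
```

On the full records, fields such as `acquisition_source` and `_files` would also show up as changed, so they would need to be filtered out first.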

merged_record.V2

I assume this is additional information from enrichment and auto-merge.
It is difficult to say, since the steps in between are not accessible to me.

    "ids": [
      {
        "schema": "INSPIRE BAI", 
        "value": "M.Isi.1"
      }
    ], 
    "record": {
      "$ref": "http://labs.inspirehep.net/api/authors/1275240"
    }, 
    "signature_block": "ISm", 
    "uuid": "3ec51e6f-56c3-4a36-82fe-bdf56e91afd0"

    "ids": [
      {
        "schema": "INSPIRE BAI", 
        "value": "L.C.Stein.2"
      }
    ], 
    "record": {
      "$ref": "http://labs.inspirehep.net/api/authors/1056947"
    }, 
    "signature_block": "STANl", 
    "uuid": "34cd70d9-c68b-4f64-b77e-29240ecb0120"

"citeable": true, 
"control_number": 1681259, 
"core": true, 
"curated": false, 
"document_type": [
  "article"
], 
"inspire_categories": [
  {
    "source": "arxiv", 
    "term": "Gravitation and Cosmology"
  }, 
  {
    "source": "arxiv", 
    "term": "Astrophysics"
  }, 
  {
    "source": "arxiv", 
    "term": "Theory-HEP"
  }
], 
"legacy_creation_date": "2018-07-09", 
"number_of_pages": 18, 
"self": {
  "$ref": "http://labs.inspirehep.net/api/literature/1681259"
}, 
"texkeys": [
  "Isi:2018miq"
], 

ksachs added this to the Ingestion tools in PROD milestone Jul 10, 2018