[NeurIPS] Join fails with reference if read from fileObject #700

Open
EMCarrami opened this issue Jun 13, 2024 · 3 comments

Comments

@EMCarrami

Hi,

This is about your proposed solution to #651 and the implementation of simple_join. The solution only works when the reference points to a field whose data is added manually (inline in the metadata). When the referenced data is loaded from a fileObject, the joined field returns nan.

Below is an example that fails (it returns nan for the sequence field of examples, even though the uid values match):

{
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "sequences",
      "name": "sequences",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "sequences/uid",
          "name": "uid",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "sequences.csv"
            },
            "extract": {
              "column": "uniprot_id"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "sequences/sequence",
          "name": "sequence",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "sequences.csv"
            },
            "extract": {
              "column": "sequence"
            }
          }
        }
      ]
    },
    {
      "@type": "cr:RecordSet",
      "@id": "examples",
      "name": "examples",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "examples/uid",
          "name": "uid",
          "dataType": "sc:Text",
          "references": {
            "field": {
              "@id": "sequences/uid"
            }
          },
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "uniprot_id"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/type",
          "name": "type",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "type"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/annotation",
          "name": "annotation",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "annotation"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/sequence",
          "name": "sequence",
          "dataType": "sc:Text",
          "source": {
            "field": {
              "@id": "sequences/sequence"
            }
          }
        }
      ]
    }
  ]
}

But the following works (here the data is added manually, similar to the provided example):

{
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "sequences",
      "name": "sequences",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "sequences/uid",
          "name": "uid",
          "dataType": "sc:Text"
        },
        {
          "@type": "cr:Field",
          "@id": "sequences/sequence",
          "name": "sequence",
          "dataType": "sc:Text"
        }
      ],
      "data": [
        {
          "uid": "XYZ",
          "sequence": "MLCTHGHGHLMKNMNV"
        }
      ]
    },
    {
      "@type": "cr:RecordSet",
      "@id": "examples",
      "name": "examples",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "examples/uid",
          "name": "uid",
          "dataType": "sc:Text",
          "references": {
            "field": {
              "@id": "sequences/uid"
            }
          },
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "uniprot_id"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/type",
          "name": "type",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "type"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/annotation",
          "name": "annotation",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "annotation"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/sequence",
          "name": "sequence",
          "dataType": "sc:Text",
          "source": {
            "field": {
              "@id": "sequences/sequence"
            }
          }
        }
      ]
    }
  ]
}
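
To make the difference between the two metadata files explicit, here is a minimal sketch that reports, for each record set, whether it carries inline `data` or reads its fields from a fileObject. The helper name `record_set_sources` and the trimmed `FAILING` / `WORKING` dictionaries are illustrative assumptions, not part of mlcroissant; the sketch only inspects the JSON structure and does not load any data.

```python
def record_set_sources(metadata: dict) -> dict:
    """Map each record set @id to a short description of where its data comes from.

    Hypothetical helper for inspection only; it is not an mlcroissant API.
    """
    summary = {}
    for record_set in metadata.get("recordSet", []):
        record_set_id = record_set.get("@id", "<no @id>")
        if "data" in record_set:
            # Data is embedded directly in the metadata file.
            summary[record_set_id] = "inline data"
            continue
        file_ids = set()
        for field in record_set.get("field", []):
            file_object = field.get("source", {}).get("fileObject")
            if file_object:
                file_ids.add(file_object["@id"])
        if file_ids:
            summary[record_set_id] = "fileObject: " + ", ".join(sorted(file_ids))
        else:
            summary[record_set_id] = "unknown"
    return summary


# Trimmed versions of the "sequences" record set from the two examples above.
FAILING = {
    "recordSet": [
        {
            "@id": "sequences",
            "field": [
                {
                    "@id": "sequences/uid",
                    "source": {
                        "fileObject": {"@id": "sequences.csv"},
                        "extract": {"column": "uniprot_id"},
                    },
                }
            ],
        }
    ]
}
WORKING = {
    "recordSet": [
        {
            "@id": "sequences",
            "field": [{"@id": "sequences/uid"}],
            "data": [{"uid": "XYZ", "sequence": "MLCTHGHGHLMKNMNV"}],
        }
    ]
}

failing_summary = record_set_sources(FAILING)
working_summary = record_set_sources(WORKING)
print(failing_summary)
print(working_summary)
```

The only structural difference in the referenced record set is the data source, which is consistent with the join behaving differently for file-backed and inline data.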
@msorkhpar

I faced the same issue. I am not sure this is the right solution, but either removing the name property from the recordSets or choosing a name that differs from the recordSet's @id value solved it for me. Here is an example of my files:
https://github.com/msorkhpar/wiki-entity-summarization/tree/main/croissant
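
A quick way to check whether a metadata file is in the situation described in this comment is to list the record sets whose name equals their @id. This is a minimal sketch; the function name `record_sets_with_name_equal_to_id` and the sample dictionary are hypothetical, and the check says nothing about whether renaming fixes the join for a given dataset.

```python
def record_sets_with_name_equal_to_id(metadata: dict) -> list:
    """Return the @id of every record set whose "name" is identical to its "@id".

    Hypothetical helper; it only inspects the JSON-LD structure.
    """
    matches = []
    for record_set in metadata.get("recordSet", []):
        record_set_id = record_set.get("@id")
        if record_set_id is not None and record_set.get("name") == record_set_id:
            matches.append(record_set_id)
    return matches


# Illustrative metadata: one record set reuses its @id as name, one does not,
# and one has no name at all.
sample = {
    "recordSet": [
        {"@id": "sequences", "name": "sequences"},
        {"@id": "examples", "name": "annotated_examples"},
        {"@id": "other"},
    ]
}

colliding = record_sets_with_name_equal_to_id(sample)
print(colliding)
```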

@EMCarrami
Author

Thanks for the suggestion. I removed all the "name"s from recordSets and made all other name pointers unique, but it unfortunately doesn't seem to work for me.

@EMCarrami EMCarrami changed the title Join fails with reference if read from fileObject [NeurIPS] Join fails with reference if read from fileObject Jun 13, 2024
@gsaluja9

@EMCarrami, I experienced the same problem you describe here.
After some debugging, I narrowed it down to how the field is parsed from a data frame.

EXPECTED_DATA_TYPES maps sc:Text to bytes, so a column read from a CSV holds byte arrays, whereas the FileSet filename is a string. A bytes key never compares equal to a str key, so the join finds no match.
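
The type mismatch can be reproduced without the library. The sketch below is a plain-Python stand-in for a left join, not mlcroissant's actual implementation: `left_join` and `as_text` are hypothetical helpers, the rows are made-up sample data, and which side ends up as bytes in a real dataset is an assumption here.

```python
import math


def left_join(left_rows, right_rows, key, value):
    """Left join via dict lookup; rows with no matching key get NaN for `value`."""
    lookup = {row[key]: row[value] for row in right_rows}
    return [{**row, value: lookup.get(row[key], math.nan)} for row in left_rows]


def as_text(cell):
    """Decode bytes to str and leave every other value untouched."""
    return cell.decode("utf-8") if isinstance(cell, bytes) else cell


def normalize(rows):
    """Apply as_text to every cell so that both sides of the join use str."""
    return [{name: as_text(cell) for name, cell in row.items()} for row in rows]


# One side holds str keys, the other bytes keys (assumed for illustration).
examples = [{"uid": "XYZ", "type": "domain"}]
sequences = [{"uid": b"XYZ", "sequence": b"MLCTHGHGHLMKNMNV"}]

# b"XYZ" != "XYZ" in Python 3, so the lookup misses and the field is NaN.
mismatched = left_join(examples, sequences, "uid", "sequence")

# Decoding both sides to str first lets the keys match.
fixed = left_join(normalize(examples), normalize(sequences), "uid", "sequence")

print(mismatched[0]["sequence"])
print(fixed[0]["sequence"])
```

Normalizing both key columns to one type before joining is one possible workaround; whether the library should do this itself, or map sc:Text to str, is the open question in this thread.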

I am not sure what the recommended way to specify a text field is, or why this mapping is set, but @marcenacp might have some context. In an initial version, text mapped to str, but that was later changed. The docs also suggest using the dataType sc:Text for a CSV column.

I am attaching an example JSON-LD file that exercises the join in this situation. Perhaps this needs to be addressed?
cookbook-dataset-metadata.json
