[NeurIPS] Join fails with reference if read from fileObject #700

EMCarrami · 2024-06-13T11:28:40Z

Hi,

Regarding your proposed solution to #651 and the implementation of simple_join: the solution only works when the reference points to a field whose data is added manually. When the referenced data is loaded from a fileObject, the join returns nan.

Below is an example that fails (it returns nan for the sequence field of examples, despite matching uids):

{
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "sequences",
      "name": "sequences",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "sequences/uid",
          "name": "uid",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "sequences.csv"
            },
            "extract": {
              "column": "uniprot_id"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "sequences/sequence",
          "name": "sequence",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "sequences.csv"
            },
            "extract": {
              "column": "sequence"
            }
          }
        }
      ]
    },
    {
      "@type": "cr:RecordSet",
      "@id": "examples",
      "name": "examples",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "examples/uid",
          "name": "uid",
          "dataType": "sc:Text",
          "references": {
            "field": {
              "@id": "sequences/uid"
            }
          },
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "uniprot_id"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/type",
          "name": "type",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "type"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/annotation",
          "name": "annotation",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "annotation"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/sequence",
          "name": "sequence",
          "dataType": "sc:Text",
          "source": {
            "field": {
              "@id": "sequences/sequence"
            }
          }
        }
      ]
    }
  ]
}

But the following works (the data is added manually, as in the provided example):

{
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "sequences",
      "name": "sequences",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "sequences/uid",
          "name": "uid",
          "dataType": "sc:Text"
        },
        {
          "@type": "cr:Field",
          "@id": "sequences/sequence",
          "name": "sequence",
          "dataType": "sc:Text"
        }
      ],
      "data": [
        {
          "uid": "XYZ",
          "sequence": "MLCTHGHGHLMKNMNV"
        }
      ]
    },
    {
      "@type": "cr:RecordSet",
      "@id": "examples",
      "name": "examples",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "examples/uid",
          "name": "uid",
          "dataType": "sc:Text",
          "references": {
            "field": {
              "@id": "sequences/uid"
            }
          },
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "uniprot_id"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/type",
          "name": "type",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "type"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/annotation",
          "name": "annotation",
          "dataType": "sc:Text",
          "source": {
            "fileObject": {
              "@id": "annotations.csv"
            },
            "extract": {
              "column": "annotation"
            }
          }
        },
        {
          "@type": "cr:Field",
          "@id": "examples/sequence",
          "name": "sequence",
          "dataType": "sc:Text",
          "source": {
            "field": {
              "@id": "sequences/sequence"
            }
          }
        }
      ]
    }
  ]
}

msorkhpar · 2024-06-13T17:24:37Z

I faced the same issue. I am not sure this is the right solution, but removing the name property from the recordSets, or choosing a name that differs from their @id values, solved the issue for me. Here is an example of my files:
https://github.com/msorkhpar/wiki-entity-summarization/tree/main/croissant

EMCarrami · 2024-06-13T22:40:13Z

Thanks for the suggestion. I removed all the "name"s from recordSets and made all other name pointers unique, but it unfortunately doesn't seem to work for me.

gsaluja9 · 2024-09-11T01:31:30Z

@EMCarrami, I experienced the same problem you describe here.
After debugging a bit, I narrowed it down to how the field is parsed from a data frame.

EXPECTED_DATA_TYPES maps sc:Text to bytes, so a column read from a CSV is a byte array, whereas the FileSet filename is a string. The join keys therefore have different types and never match, which makes the join fail.
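
To illustrate the type mismatch described above, here is a minimal pure-Python sketch. It is not the library's join code: left_join, to_str, and the sample rows (file names and labels) are hypothetical names and values made up for this illustration. It only shows that a bytes key never equals a str key, so a left join finds no match (None here, which a data frame would surface as nan), and that decoding the bytes key first makes the join succeed.

```python
def left_join(left_rows, right_rows, key, normalize=lambda value: value):
    """Left-join two lists of dicts on `key`; unmatched right-hand fields become None."""
    # Index the right-hand rows by their (optionally normalized) key.
    index = {normalize(row[key]): row for row in right_rows}
    right_fields = sorted({name for row in right_rows for name in row if name != key})
    joined = []
    for row in left_rows:
        match = index.get(normalize(row[key]))
        merged = dict(row)
        for name in right_fields:
            merged[name] = match[name] if match is not None else None
        joined.append(merged)
    return joined


def to_str(value):
    """Decode bytes to str so both sides of the join use the same key type."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


# Hypothetical data: the CSV-backed key is bytes, the FileSet filename is str.
labels = [{"filename": b"cat.png", "label": "cat"}]
images = [{"filename": "cat.png"}]

mismatched = left_join(images, labels, "filename")
matched = left_join(images, labels, "filename", normalize=to_str)

print(mismatched[0]["label"])  # None, because b"cat.png" != "cat.png"
print(matched[0]["label"])     # cat
```

Normalizing both keys to str is only one possible workaround; whether the mapping itself should change is a question for the maintainers.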

I am not sure what the recommended way to specify a text field is, or why this mapping is set, but @marcenacp might have some context on this. In an initial version, text did map to str, but it was changed. Also, the docs suggest using the dataType sc:Text for a CSV column.

I am attaching an example JSON-LD file that exemplifies the join in this situation; perhaps that needs to be addressed?
cookbook-dataset-metadata.json

EMCarrami changed the title ~~Join fails with reference if read from fileObject~~ [NeurIPS] Join fails with reference if read from fileObject Jun 13, 2024

EMCarrami mentioned this issue Jun 17, 2024

"images/filename" should have an attribute "@type": "https://schema.org/Text". Got http://mlcommons.org/croissant/Field instead. #651

Open
