Migrate to venomx metadata #81

justaddcoffee · 2024-09-11T15:45:38Z

Per convo with @caufieldjh @iQuxLE et al, change curategpt to use venomx for metadata

justaddcoffee · 2024-09-12T20:24:05Z

@cmungall - what are your thoughts?

It seems like it'd make sense to align with venomx, since most (all?) of the metadata is going to be dataset and embedding-related

cmungall · 2024-09-12T20:39:23Z

venomx assumes each indexed object has a unique id

curategpt doesn't make any assumptions about indexed objects, it can be any json obj / python dict.

some wrappers (e.g. ontology) have a primary key

but others like the maxoa wrapper return associations, which don't have a natural primary key

some options are:

1. relax the venomx model so that objects don't require a PK
2. force everything in curgpt to have an ID, autogenerating if it doesn't exist

But I don't think either of these is ideal

I think it's best if we say the mapping to vx is only supported if the collection declares an identifier field

https://github.com/monarch-initiative/curate-gpt/blob/main/src/curate_gpt/store/db_adapter.py#L342-L353
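
A minimal sketch of that rule, assuming indexed objects are plain dicts and the collection's identifier field is available as an optional string. The helper name `ids_for_venomx` and its signature are hypothetical; they are not taken from curategpt or venomx.

```python
from typing import Any, Dict, Iterable, List, Optional


def ids_for_venomx(
    objects: Iterable[Dict[str, Any]], identifier_field: Optional[str]
) -> List[str]:
    """Return one ID per object, refusing collections with no declared identifier field.

    Hypothetical helper for illustration only.
    """
    if identifier_field is None:
        raise ValueError("collection declares no identifier field; venomx mapping unsupported")
    ids = []
    for obj in objects:
        if identifier_field not in obj:
            raise ValueError(f"object lacks declared identifier field {identifier_field!r}")
        ids.append(str(obj[identifier_field]))
    # venomx assumes unique IDs, so reject duplicates instead of silently merging them
    if len(set(ids)) != len(ids):
        raise ValueError("identifier values are not unique")
    return ids
```

Collections of associations without a natural key would simply not be exported under this approach, rather than receiving invented IDs.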

iQuxLE · 2024-09-13T10:52:06Z

@cmungall
@justaddcoffee

> venomx assumes each indexed object has a unique id

Then it actually works well with DuckDB, which also wants a unique ID for each indexed object.
ChromaDB does not necessarily need this.

I kind of like idea 2:

> force everything in curgpt to have an ID, autogenerating if it doesn't exist

Just a thought:
Can we use a UUID feature for this problem? For DuckDB this would mean a separate column; in ChromaDB I think it is already implemented.

However, to begin with, we could also test it by not incorporating the whole venomx model/schema into the metadata but just adding a field for it. This way we can try it out and roll back easily if needed.
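
The UUID idea could look roughly like this, assuming objects are plain dicts. The helper name `with_uuid` and the default field name `id` are hypothetical, and nothing here touches DuckDB or ChromaDB directly.

```python
import uuid
from typing import Any, Dict


def with_uuid(obj: Dict[str, Any], id_field: str = "id") -> Dict[str, Any]:
    """Return a copy of obj that carries an ID, minting a random UUID if none exists."""
    if obj.get(id_field) is not None:
        # Keep IDs that came with the source data untouched
        return dict(obj)
    return {**obj, id_field: str(uuid.uuid4())}
```

Note that `uuid.uuid4()` is random, so indexing the same object twice yields two different IDs.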

justaddcoffee · 2024-09-13T13:34:50Z

I kind of like 2) also. For collections that have IDs it works fine, and for those that do not have IDs, it doesn't seem like it hurts anything. Maybe we can mint them using a hash function of all the fields so they are deterministic?

import hashlib


def make_md5_id(data):
    # Concatenate data fields into a single string
    concatenated_data = f"{data['field1']}|{data['field2']}|{data['field3']}"

    # Create an MD5 hash
    id_hash = hashlib.md5(concatenated_data.encode()).hexdigest()

    return id_hash

(or is that too slow?)
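
The snippet above hard-codes three field names. A variant that hashes every field, whatever the object's shape, might look like the sketch below. It assumes the object is JSON-serializable (falling back to `str` for other values); the name `make_content_id` is hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def make_content_id(obj: Dict[str, Any]) -> str:
    """Derive a deterministic ID from all fields of obj."""
    # sort_keys makes the serialization independent of key order
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Two objects with identical content get the same ID, so exact duplicates collapse into one. Whether this is fast enough for large collections would need to be measured, for example with `timeit`.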

caufieldjh · 2024-09-13T15:06:35Z

I'm hesitant to include autogenerated identifiers if the process is opaque to users. If an ID is made by CurateGPT just to fit the metadata model, it isn't clear whether it refers to some original source or to the newly created data (though in this case it will be the latter). It works in the KGs because most edges don't start with IDs, but in this setting there's likely to be a mishmash of different sources with and without IDs, plus the newly generated things.
Perhaps a user-defined toggle for ID generation would work.
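
One way such a toggle might work, sketched under the assumption that objects are plain dicts: generation is off by default, and every ID is returned together with a label saying where it came from. The function `resolve_id`, the `autogenerate_ids` flag, and the `"source"`/`"generated"` labels are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple


def resolve_id(
    obj: Dict[str, Any],
    identifier_field: Optional[str],
    autogenerate_ids: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (id, origin), where origin is 'source', 'generated', or None if there is no ID."""
    if identifier_field is not None and obj.get(identifier_field) is not None:
        return str(obj[identifier_field]), "source"
    if not autogenerate_ids:
        # The user has not opted in, so leave the object without an ID
        return None, None
    canonical = json.dumps(obj, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest(), "generated"
```

Storing the origin label next to the ID keeps generated identifiers distinguishable from ones that came with the source data.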
