Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for pushing embeddings/metadata to huggingface #79

Merged
merged 9 commits into from
Sep 11, 2024

Conversation

justaddcoffee
Copy link
Member

@justaddcoffee justaddcoffee commented Sep 6, 2024

@cmungall @caufieldjh is this a reasonable way to do this?

Still a WIP, just planning the classes, methods, etc

@justaddcoffee justaddcoffee changed the title First pass at possible class structure for this Add support for pushing embeddings/metadata to huggingface Sep 6, 2024
@justaddcoffee justaddcoffee marked this pull request as draft September 6, 2024 20:12
@caufieldjh
Copy link
Member

How does HF Hub handle auth? I think it (the HF API wrapper) has changed since I tried using it in OntoGPT and I don't know how it handles cases like private repos or personal vs. organization-managed repos.

@justaddcoffee
Copy link
Member Author

How does HF Hub handle auth? I think it (the HF API wrapper) has changed since I tried using it in OntoGPT and I don't know how it handles cases like private repos or personal vs. organization-managed repos.

I think basically you use a user access token - haven't done it but shouldn't be hard

@iQuxLE
Copy link
Collaborator

iQuxLE commented Sep 10, 2024

It seems like HuggingFaceAdapter is inheriting from DBAdapter, even though it does not need most of the methods. This makes the code feel heavier and harder to maintain.
I recommend creating a separate interface for Hugging Face-related functionality (like upload_collection) instead of tying it to the DBAdapter. This will keep the code cleaner and more focused on what each adapter actually needs to do.

Instead of forcing HuggingFaceAdapter to inherit from DBAdapter, we can keep the core logic for uploading in DBAdapter and create a separate HuggingFaceAgent (or wrapper) that handles the Hugging Face-specific operations. This way, DBAdapter stays clean, and we decouple the Hugging Face logic without mixing concerns.

I suggest:

class DBAdapter(ABC):
    @abstractmethod
    def fetch_all_objects_from_collection(self, collection: str = None, batch_size: int = 100, **kwargs) -> Iterator[OBJECT]:
        """
        Fetch all objects from a collection, optionally in batches to avoid memory overload.
        """
        collection = self._get_collection(collection)
        offset = 0
        while True:
            objects = list(self.peek(collection=collection, limit=batch_size, offset=offset, **kwargs))
            if not objects:
                break  
            yield from objects
            offset += batch_size

HuggingFaceAgent (uploading and storing data)

class HuggingFaceAgent:
    def upload_to_huggingface(self, objects, metadata, repo_id, private=False, **kwargs):
        venomx_metadata = self._transform_metadata_to_venomx(metadata)
        embedding_file = "embeddings.parquet"
        metadata_file = "metadata.yaml"
        pd.DataFrame(objects).to_parquet(embedding_file)
        with open(metadata_file, "w") as f:
            yaml.dump(venomx_metadata, f)

        self._create_repo(repo_id, private=private)
        self._upload_files(repo_id, {
            embedding_file: embedding_file,
            metadata_file: metadata_file
        })

    def _create_repo(self, repo_id, private):
        pass

    def _upload_files(self, repo_id, objects, metadata):
        pass

CLI

@embeddings.command(name="upload")
@path_option
@collection_option
@click.option("--repo-id", required=True, help="Repository ID on Hugging Face, e.g., 'biomedical-translator/[repo_name]'.")
@click.option("--private/--public", default=False, help="Whether the repository should be private.")
@database_type_option
def upload_embeddings(path, collection, repo_id, private, adapter, database_type):
    """
    Upload embeddings and their metadata from a specified collection to a repository, e.g., Hugging Face.
    """
    db = get_store(database_type, path)
    try:
        objects = list(db.fetch_all_objects_from_collection(collection=collection)) 
        metadata = db.collection_metadata(collection=collection)
    except Exception as e:
        raise ...

    huggingface_agent = HuggingFaceAgent()
    try:
        huggingface_agent.upload_to_huggingface(objects=objects, metadata=metadata, repo_id=repo_id, private=private)
    except Exception as e:
         raise ...

Let me know if you need help with the change!

EDIT: we might need to change the peek logic, as currently this does not peek into embeddings, maybe a seperate all metadata peek would be beneficial for this

@justaddcoffee
Copy link
Member Author

Thank you @iQuxLE! This makes good sense. The HF adapter really probably shouldn't inherit from DBAdapter

Could you make a PR on this PR with your suggested changes?

@justaddcoffee justaddcoffee marked this pull request as ready for review September 11, 2024 15:44
@justaddcoffee
Copy link
Member Author

@caufieldjh if this looks good, could we merge?

@caufieldjh
Copy link
Member

Should work for now. Thanks!

@caufieldjh caufieldjh merged commit e51bef9 into main Sep 11, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants