Hello everyone!

Thank you for this nice project and the features already developed. I am currently trying to create a large faiss index in a distributed way and found that the autofaiss library can help me achieve this.
I am working in a Glue notebook with pyspark and have my embeddings as a pyspark dataframe. Since I saw in other issues that passing the pyspark dataframe directly to the build_index function is not possible, I am storing the embeddings and the ids as parquet files in s3 (with no compression, because compression was changing the file extension, and I saw in a previously closed issue that the files have to have the .parquet extension).
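For concreteness, the write step looks roughly like this (a sketch; it assumes a pyspark dataframe `df` with the `author_name_id` and `author_name_embeddings` columns used in the call below):

```python
# Sketch: write the embeddings dataframe to S3 as uncompressed parquet,
# so every part file keeps a plain .parquet extension.
# Assumes `df` is a pyspark DataFrame holding the ids and embeddings.
(
    df.select("author_name_id", "author_name_embeddings")
    .write.mode("overwrite")
    .option("compression", "none")
    .parquet("s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed")
)
```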
Currently I am running:

```python
from autofaiss import build_index

build_index(
    embeddings="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed",
    index_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/knn.index",
    index_infos_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/index_infos.json",
    max_index_memory_usage="4G",
    file_format="parquet",
    distributed="pyspark",
    metric_type="l2",
    embedding_column_name="author_name_embeddings",
    id_columns=["author_name_id"],  # id_columns expects a list of column names
    ids_path="s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/",  # where to store the id/embedding mapping
    current_memory_available="4G",
    nb_indices_to_keep=10,
)
```
and I am getting the following error:

```
FileNotFoundError: gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed/part-00000-f2ea6b6c-f41c-4d0d-979f-66347536b1d6-c000.parquet
```

This file, as well as the other partition files with the embeddings, does exist (see screenshot from s3).
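In case it helps narrow things down, a check I would expect to succeed is listing the partition files through fsspec/s3fs, which is (as far as I understand) the layer autofaiss goes through to resolve S3 paths. A sketch:

```python
import fsspec

# Sketch: verify the embeddings directory is visible through fsspec/s3fs
# (requires s3fs to be installed in the notebook environment).
fs, root = fsspec.core.url_to_fs(
    "s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed"
)
print(fs.ls(root))  # should list the part-*.parquet files
```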
I am wondering what the issue could be here, since I have tried everything I can think of and still cannot get the index to build.
P.S. I have tried reading the data back from s3 in the Glue notebook, and the results look correct (correct columns and data types), so I have also ruled out s3 access issues from the notebook.
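That read-back check was roughly the following (a sketch; `spark` is the session the Glue notebook provides):

```python
# Sketch: read the partitions back from S3 and inspect schema and rows.
df_check = spark.read.parquet(
    "s3://gtm-core-eks-uat-euc1-s3-bucket-glo-hcp-linking/test/embedding_partitions/embeddings_pubmed"
)
df_check.printSchema()  # expect author_name_id plus an array<float> embedding column
df_check.show(5)
```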
Any help would be greatly appreciated.
Thank you
Athina