
🚀 Feature: Use S3 bucket as the vector store #724

Open

jolo-dev opened this issue Oct 29, 2023 · 17 comments

@jolo-dev

🔖 Feature description

The user should be able to add an S3 bucket for storing and accessing their documents.

🎤 Why is this feature needed?

Documents stored in the cloud are easier to share, and keeping them there reduces the storage usage on your hard drive.

✌️ How do you aim to achieve this?

In order to store documents in an S3 bucket, you can pass a variable S3_STORE=my-bucket-name via the .env file. However, if you are running the application on your local machine, you will need to provide AWS credentials. The good news is that you can choose how to provide these credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
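For illustration, the relevant .env entries might look like this (S3_STORE is the variable name proposed here; the AWS_* names are standard boto3 environment variables and are only one of the credential options covered in the link above):

S3_STORE=my-bucket-name
# Optional: only needed if credentials are not supplied via ~/.aws/credentials,
# an instance profile, or another source in the boto3 credential chain
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
AWS_DEFAULT_REGION=us-east-1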

When running scripts, the result should be uploaded to the given S3 bucket.

The store in the application should access the documents from the S3 bucket.

It could look like this:

import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import S3DirectoryLoader

embeddings = OpenAIEmbeddings()

if os.getenv("S3_STORE"):
    # Load all documents from the configured bucket
    loader = S3DirectoryLoader(os.getenv("S3_STORE"))
    documents = loader.load()

faiss_index = FAISS.from_documents(documents, embeddings)

# Save the index locally; syncing it to S3 would need an extra upload step
faiss_index.save_local("faiss-index")

# Load the index again (after downloading it from S3, if needed)
faiss_index = FAISS.load_local("faiss-index", embeddings)

🔄️ Additional Information

No response

👀 Have you spent some time to check if this feature request has been raised before?

  • I checked and didn't find a similar issue

Are you willing to submit a PR?

Yes I am willing to submit a PR!

@jaredbradley243
Collaborator

@jolo-dev I see you're assigned to this issue. Is that because you created it, or are you currently working on it? If you're not working on it, I'd love to take this issue.

Thanks! Happy coding! 😊

@jolo-dev
Author

jolo-dev commented Nov 2, 2023

@jaredbradley243 thanks for your interest.
I am currently working on it, but if you want, you can assign it to yourself 😉

@jaredbradley243
Collaborator

That's very kind, but if you're already working on it, keep going! 😃

@jolo-dev
Author

jolo-dev commented Nov 2, 2023

@jaredbradley243 No no. Really. Let me be your reviewer then ;)

@jaredbradley243
Collaborator

jaredbradley243 commented Nov 2, 2023

Hahah. It's gonna take me a bit of time to work on this. I'll need to re-familiarize myself with the codebase, and I won't have time to get started for a week or two, but if everyone is alright with waiting, I'll happily take it off your hands!

@dartpain
Contributor

dartpain commented Nov 3, 2023

@jaredbradley243 No worries!
I'll check back on this in a few weeks.

jaredbradley243 self-assigned this Nov 13, 2023
@Rajesh983

@jaredbradley243 Any update on this?

@jaredbradley243
Collaborator

> @jaredbradley243 Any update on this?

Hey @Rajesh983. Sorry for the delay, I just saw your comment!

I have updated the script to allow S3 to be used for document and vector storage. If AWS credentials are detected in the .env file, the script downloads documents from a given S3 bucket/folder and parses them. Once the documents are parsed, the resulting index.faiss and index.pkl files are saved back into the S3 bucket, in a folder of the user's choosing.
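Roughly, the download half of that flow could look like the sketch below. This is an illustration, not the actual script: download_documents is a hypothetical helper, and the bucket/folder values would come from the env variables described later in this thread.

import os

import boto3

def download_documents(bucket, folder, dest="s3_temp_storage"):
    """Download every object under the given prefix, skipping saved vector files."""
    s3 = boto3.client("s3")
    os.makedirs(dest, exist_ok=True)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=folder):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip folder placeholder keys and previously saved index files
            if key.endswith("/") or key.endswith((".faiss", ".pkl")):
                continue
            s3.download_file(bucket, key, os.path.join(dest, os.path.basename(key)))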

However, I still need to implement AWS role assumption.

I had to take a break as I have an exam coming up, but I should finish the script soon!

If you'd like to preview my code and test it out early, let me know.

@fundmatch-dev

Hey, quick question: I need this feature, and since it isn't out yet, I am planning on building it out myself just for my own use case.

Do I need to save the pickle file locally first before uploading to S3, or is there a way to write the langchain.vectorstores.faiss.FAISS object (the vector store) straight into S3 as a pickle file? Apologies for my naïveté.

@jaredbradley243
Collaborator

> Hey, quick question: I need this feature, and since it isn't out yet, I am planning on building it out myself just for my own use case.
>
> Do I need to save the pickle file locally first before uploading to S3, or is there a way to write the langchain.vectorstores.faiss.FAISS object (the vector store) straight into S3 as a pickle file? Apologies for my naïveté.

Hey @fundmatch-dev!

I finished this feature yesterday; I'm just writing a README for it! Would you like to test it out for me?
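To sketch the idea under discussion: langchain's FAISS wrapper persists itself via save_local, which writes index.faiss and index.pkl into a folder, so one common pattern is to save into a temporary directory and upload the two files from there. In the sketch below, upload_faiss_index is a hypothetical helper, and bucket and prefix are placeholders.

import os
import tempfile

import boto3

def upload_faiss_index(faiss_index, bucket, prefix):
    """Save a langchain FAISS store to a temp dir, then upload its files to S3."""
    s3 = boto3.client("s3")
    with tempfile.TemporaryDirectory() as tmp:
        # save_local writes index.faiss and index.pkl into the given folder
        faiss_index.save_local(tmp)
        for name in ("index.faiss", "index.pkl"):
            s3.upload_file(os.path.join(tmp, name), bucket, f"{prefix}/{name}")

Since the temporary directory is deleted as soon as the upload finishes, nothing persists locally, which is usually close enough to "writing straight into S3".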

@fundmatch-dev

Hi @jaredbradley243, I just finished implementing it manually for myself, and it seems to work. But hey, I don't mind helping if you can just explain what I need to do! Want to hop on a call?

@jaredbradley243
Collaborator

jaredbradley243 commented Dec 23, 2023

> Hi @jaredbradley243, I just finished implementing it manually for myself, and it seems to work. But hey, I don't mind helping if you can just explain what I need to do! Want to hop on a call?

I'm happy to hop on a call with you tomorrow, if you're free! (It's 8PM here in Los Angeles).

In the meantime, you can replace your ingest.py and open_ai_func.py files with these updated versions:

docsgpt.zip

And here are some instructions:

Script Functionality

  1. Local Mode (default): Processes documents from local directories specified by the user.
  2. S3 Mode (--s3):
    • Downloads documents from an S3 bucket to a temporary local folder (s3_temp_storage).
    • Processes these documents.
    • Uploads the processed documents back to the S3 bucket.

Enabling S3 Storage

To enable S3 storage, use the --s3 flag when running the script.

  1. Environment Variables: Set these variables in your .env file (see the example after this list):

    • S3_BUCKET: Name of your S3 bucket.
    • S3_DOCUMENTS_FOLDER: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except .faiss and .pkl files).
    • S3_SAVE_FOLDER: Folder within the S3 bucket in which you would like to save the vector files. Leave blank to use the root of the bucket.
  2. Running the Script:

    • python ingest.py ingest --s3
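For example, the corresponding .env entries might look like this (bucket and folder names are placeholders):

S3_BUCKET=my-docsgpt-bucket
S3_DOCUMENTS_FOLDER=documents
S3_SAVE_FOLDER=vectors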

Enabling Role Assumption

If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the --s3-assume flag and proper AWS configuration.

  1. Environment Variable:
  • Add AWS_ASSUME_ROLE_PROFILE to your .env file with the name of the AWS profile for role assumption. Ex: AWS_ASSUME_ROLE_PROFILE="dev"
  2. AWS Configuration:
  • Credentials File (~/.aws/credentials):
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY
    
    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
  • Config File (~/.aws/config):
    [default]
    region = us-west-2
    output = json
    
    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
  3. Running the Script with Role Assumption:

    • python ingest.py ingest --s3 --s3-assume

This configuration allows the script to assume YourRoleName using the credentials from the iamadmin profile.
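As a rough sketch of how a script might pick up that profile (assuming the AWS_ASSUME_ROLE_PROFILE variable described above):

import os

import boto3

# With the config files shown above, passing the profile name is enough:
# boto3 performs the AssumeRole call automatically, using the credentials
# from the profile's source_profile entry.
session = boto3.Session(profile_name=os.getenv("AWS_ASSUME_ROLE_PROFILE"))
s3 = session.client("s3")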

Note

  • Ensure that the IAM role (YourRoleName) has the necessary permissions to access the specified S3 bucket.
  • The script will create a temporary local folder (s3_temp_storage) for processing S3 documents, which will be cleaned up after processing.

@jaredbradley243
Collaborator

Let me know if you have any difficulty, or if you find the instructions difficult to follow! 😁

This seems to be a sought-after feature; I'm glad I got the chance to work on it!

@jolo-dev
Author

Hey @jaredbradley243,
Great work! But why is this marked as completed? The PR is still open :D

@jaredbradley243
Collaborator

> Hey @jaredbradley243, Great work! But why is this marked as completed? The PR is still open :D

Thank you! Over-excitement that I blame on the holiday season. 😂 Issue reopened.

@bazooka720

Folks: what is the ETA for completing this feature? This would allow stand-alone conversion of S3 documents into their vector representations, right? Will we have a separate index/ID for each document after the conversion? Trying to wrap my head around it.

@pandey0039

Hi, I am trying to store my FAISS vector store in Azure Blob Storage. Is there any functionality that can help me with that?
Thanks!
