
🚀 Feature: Use S3 bucket as the vector store #724

Open

jolo-dev opened this issue Oct 29, 2023 · 17 comments

@jolo-dev

🔖 Feature description

The user should be able to add an S3 bucket for storing and accessing their documents.

🎤 Why is this feature needed?

Documents stored in the cloud are easier to share, and keeping them there reduces the storage usage on your hard drive.

✌️ How do you aim to achieve this?

In order to store documents in an S3 bucket, you can pass a variable S3_STORE=my-bucket-name via the .env file. However, if you are running the application on your local machine, you will need to provide AWS credentials. The good news is that you can choose how to provide these credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
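For illustration, the relevant .env entries might look like this (S3_STORE is the variable name proposed here; the AWS_* names are standard boto3 environment variables and are only one of the credential options covered in the link above):

S3_STORE=my-bucket-name
# Optional: only needed if credentials are not supplied via ~/.aws/credentials,
# an instance profile, or another source in the boto3 credential chain
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
AWS_DEFAULT_REGION=us-east-1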

When running scripts, the result should be uploaded to the given S3 bucket.

The store in the application should access the documents from the S3 bucket.

It could look like this:

import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import S3DirectoryLoader

embeddings = OpenAIEmbeddings()

if os.getenv("S3_STORE"):
    # Load all documents from the configured bucket
    loader = S3DirectoryLoader(os.getenv("S3_STORE"))
    documents = loader.load()

faiss_index = FAISS.from_documents(documents, embeddings)

# Save the index locally; syncing it to S3 would need an extra upload step
faiss_index.save_local("faiss-index")

# Load the index again (after downloading it from S3, if needed)
faiss_index = FAISS.load_local("faiss-index", embeddings)

🔄️ Additional Information

No response

👀 Have you spent some time to check if this feature request has been raised before?

  • I checked and didn't find a similar issue

Are you willing to submit a PR?

Yes I am willing to submit a PR!

@jaredbradley243
Collaborator

@jolo-dev I see you're assigned to this issue. Is that because you created it, or are you currently working on it? If you're not working on it, I'd love to take this issue.

Thanks! Happy coding! 😊

@jolo-dev
Author

jolo-dev commented Nov 2, 2023

@jaredbradley243 thanks for your interest.
I am currently working on it, but if you want, you can assign it to yourself 😉

@jaredbradley243
Collaborator

That's very kind, but if you're already working on it, keep going! 😃

@jolo-dev
Author

jolo-dev commented Nov 2, 2023

@jaredbradley243 No no. Really. Let me be your reviewer then ;)

@jaredbradley243
Collaborator

jaredbradley243 commented Nov 2, 2023

Hahah. It's gonna take me a bit of time to work on this. I'll need to re-familiarize myself with the codebase, and I won't have time to get started for a week or two, but if everyone is alright with waiting, I'll happily take it off your hands!

@dartpain
Contributor

dartpain commented Nov 3, 2023

@jaredbradley243 No worries!
I'll check back on this in a few weeks.

jaredbradley243 self-assigned this Nov 13, 2023
@Rajesh983

@jaredbradley243 Any update on this?

@jaredbradley243
Collaborator

> @jaredbradley243 Any update on this?

Hey @Rajesh983. Sorry for the delay, I just saw your comment!

I have updated the script to allow S3 to be used for document and vector storage. If AWS credentials are detected in the .env file, the script downloads documents from a given S3 bucket/folder and parses them. Once the documents are parsed, the resulting index.faiss and index.pkl files are saved back into the S3 bucket, in a folder of the user's choosing.
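Roughly, the download half of that flow could look like the sketch below. This is an illustration, not the actual script: download_documents is a hypothetical helper, and the bucket/folder values would come from the env variables described later in this thread.

import os

import boto3

def download_documents(bucket, folder, dest="s3_temp_storage"):
    """Download every object under the given prefix, skipping saved vector files."""
    s3 = boto3.client("s3")
    os.makedirs(dest, exist_ok=True)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=folder):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip folder placeholder keys and previously saved index files
            if key.endswith("/") or key.endswith((".faiss", ".pkl")):
                continue
            s3.download_file(bucket, key, os.path.join(dest, os.path.basename(key)))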

However, I still need to implement AWS role assumption.

I had to take a break as I have an exam coming up, but I should finish the script soon!

If you'd like to preview my code and test it out early, let me know.

@fundmatch-dev

Hey, quick question: I need this feature, and since it isn't out yet, I am planning on building it out myself just for my own use case.

Do I need to save the pickle file locally first before uploading to S3, or is there a way to write the langchain.vectorstores.faiss.FAISS object (the vector store) straight into S3 as a pickle file? Apologies for my naïveté.

@jaredbradley243
Collaborator

> Hey, quick question: I need this feature, and since it isn't out yet, I am planning on building it out myself just for my own use case.
>
> Do I need to save the pickle file locally first before uploading to S3, or is there a way to write the langchain.vectorstores.faiss.FAISS object (the vector store) straight into S3 as a pickle file? Apologies for my naïveté.

Hey @fundmatch-dev!

I finished this feature yesterday; I'm just writing a README for it! Would you like to test it out for me?
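To sketch the idea under discussion: langchain's FAISS wrapper persists itself via save_local, which writes index.faiss and index.pkl into a folder, so one common pattern is to save into a temporary directory and upload the two files from there. In the sketch below, upload_faiss_index is a hypothetical helper, and bucket and prefix are placeholders.

import os
import tempfile

import boto3

def upload_faiss_index(faiss_index, bucket, prefix):
    """Save a langchain FAISS store to a temp dir, then upload its files to S3."""
    s3 = boto3.client("s3")
    with tempfile.TemporaryDirectory() as tmp:
        # save_local writes index.faiss and index.pkl into the given folder
        faiss_index.save_local(tmp)
        for name in ("index.faiss", "index.pkl"):
            s3.upload_file(os.path.join(tmp, name), bucket, f"{prefix}/{name}")

Since the temporary directory is deleted as soon as the upload finishes, nothing persists locally, which is usually close enough to "writing straight into S3".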

@fundmatch-dev

Hi @jaredbradley243, I just finished implementing it manually for myself, and it seems to work. But hey, I don't mind helping if you can just explain what I need to do! Want to hop on a call?

@jaredbradley243
Collaborator

jaredbradley243 commented Dec 23, 2023

> Hi @jaredbradley243, I just finished implementing it manually for myself, and it seems to work. But hey, I don't mind helping if you can just explain what I need to do! Want to hop on a call?

I'm happy to hop on a call with you tomorrow, if you're free! (It's 8PM here in Los Angeles).

In the meantime, you can replace your ingest.py and open_ai_func.py files with these updated versions:

docsgpt.zip

And here are some instructions:

Script Functionality

  1. Local Mode (default): Processes documents from local directories specified by the user.
  2. S3 Mode (--s3):
    • Downloads documents from an S3 bucket to a temporary local folder (s3_temp_storage).
    • Processes these documents.
    • Uploads the processed documents back to the S3 bucket.

Enabling S3 Storage

To enable S3 storage, use the --s3 flag when running the script.

  1. Environment Variables: Set these variables in your .env file (see the example after this list):

    • S3_BUCKET: Name of your S3 bucket.
    • S3_DOCUMENTS_FOLDER: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except .faiss and .pkl files).
    • S3_SAVE_FOLDER: Folder within the S3 bucket in which you would like to save the vector files. Leave blank to use the root of the bucket.
  2. Running the Script:

    • python ingest.py ingest --s3
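For example, the corresponding .env entries might look like this (bucket and folder names are placeholders):

S3_BUCKET=my-docsgpt-bucket
S3_DOCUMENTS_FOLDER=documents
S3_SAVE_FOLDER=vectors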

Enabling Role Assumption

If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the --s3-assume flag and proper AWS configuration.

  1. Environment Variable:
  • Add AWS_ASSUME_ROLE_PROFILE to your .env file with the name of the AWS profile for role assumption. Ex: AWS_ASSUME_ROLE_PROFILE="dev"
  2. AWS Configuration:
  • Credentials File (~/.aws/credentials):
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY
    
    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
  • Config File (~/.aws/config):
    [default]
    region = us-west-2
    output = json
    
    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
  3. Running the Script with Role Assumption:

    • python ingest.py ingest --s3 --s3-assume

This configuration allows the script to assume YourRoleName using the credentials from the iamadmin profile.
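As a rough sketch of how a script might pick up that profile (assuming the AWS_ASSUME_ROLE_PROFILE variable described above):

import os

import boto3

# With the config files shown above, passing the profile name is enough:
# boto3 performs the AssumeRole call automatically, using the credentials
# from the profile's source_profile entry.
session = boto3.Session(profile_name=os.getenv("AWS_ASSUME_ROLE_PROFILE"))
s3 = session.client("s3")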

Note

  • Ensure that the IAM role (YourRoleName) has the necessary permissions to access the specified S3 bucket.
  • The script will create a temporary local folder (s3_temp_storage) for processing S3 documents, which will be cleaned up after processing.

@jaredbradley243
Collaborator

Let me know if you have any difficulty, or if you find the instructions difficult to follow! 😁

This seems to be a sought-after feature; I'm glad I got the chance to work on it!

@jolo-dev
Author

Hey @jaredbradley243,
Great work! But why is this marked as completed? The PR is still open :D

@jaredbradley243
Collaborator

> Hey @jaredbradley243, Great work! But why is this marked as completed? The PR is still open :D

Thank you! Over-excitement that I blame on the holiday season. 😂 Issue reopened.

@bazooka720

Folks: what is the ETA for completing this feature? This would allow stand-alone conversion of S3 documents into their vector representations, right? Will we have a separate index/ID for each document after the conversion? Trying to wrap my head around it.

@pandey0039

Hi, I am trying to store my FAISS vector store in Azure Blob Storage. Is there any functionality that can help me with that?
Thanks!
