First pass at batched versions of insert / upsert #134

Merged: erichare merged 11 commits into master from feature/#133-insert-very-many on Jan 10, 2024

Conversation

@erichare erichare commented Dec 1, 2023

No description provided.

@erichare erichare self-assigned this Dec 4, 2023
@erichare erichare marked this pull request as ready for review December 11, 2023 19:26
@erichare erichare removed the do_not_merge Don't merge yet! label Dec 11, 2023
@erichare erichare requested a review from cbornet January 3, 2024 15:26
        executor.submit(insert_batch, batch, i)
        for i, batch in enumerate(batched_list)
    ]
    response_list = [future.result() for future in futures]

@cbornet cbornet Jan 3, 2024

If some futures raise, it would be nice to return the results of the successful futures together with the exceptions of the failed ones, especially when the partial_failures_allowed flag is true.
The returned value could be List[Union[API_RESPONSE, Exception]], and we would know that each Exception corresponds to one batch, i.e. to a batch-size group of documents.
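
A minimal sketch of this suggestion, using only the standard library. Here `run_batches`, `insert_batch` and `fake_insert` are hypothetical names, and `API_RESPONSE` is assumed to be a plain dict alias; none of this is taken from the astrapy code under review.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Union

API_RESPONSE = Dict[str, Any]  # assumed alias, for illustration only


def run_batches(
    batches: List[List[dict]],
    insert_batch: Callable[[List[dict]], API_RESPONSE],  # hypothetical callable
    concurrency: int = 2,
) -> List[Union[API_RESPONSE, Exception]]:
    """Run one call per batch; return a response or an exception, in batch order."""
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(insert_batch, batch) for batch in batches]
        results: List[Union[API_RESPONSE, Exception]] = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:  # keep the failure instead of aborting
                results.append(exc)
    return results


def fake_insert(batch: List[dict]) -> API_RESPONSE:
    # Stand-in for the API call: fails on any batch containing a "bad" document.
    if any(doc.get("bad") for doc in batch):
        raise ValueError("simulated API failure")
    return {"status": {"insertedIds": [doc["_id"] for doc in batch]}}


results = run_batches(
    [[{"_id": "a"}, {"_id": "b"}], [{"_id": "c", "bad": True}], [{"_id": "d"}]],
    fake_insert,
)
```

Because the futures are consumed in submission order, position `i` in `results` always refers to batch `i`, whether it holds a response or an exception.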

@hemidactylus hemidactylus Jan 9, 2024

The Cassandra Python driver enriches that with a boolean to easily tell them apart, so the return type is List[Tuple[bool, Union[API_RESPONSE,Exception]]] (see here). I wonder if that would be a good model to adopt here.
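
For comparison, a sketch of the `(success, result_or_exception)` shape described here. `tag_results` is a hypothetical helper and `API_RESPONSE` an assumed dict alias; the code only illustrates the return type and is not the Cassandra driver's implementation.

```python
from typing import Any, Dict, List, Tuple, Union

API_RESPONSE = Dict[str, Any]  # assumed alias, for illustration only


def tag_results(
    results: List[Union[API_RESPONSE, Exception]],
) -> List[Tuple[bool, Union[API_RESPONSE, Exception]]]:
    """Pair each item with True for a response and False for an exception."""
    return [(not isinstance(item, Exception), item) for item in results]


tagged = tag_results([{"status": {"insertedIds": ["a"]}}, RuntimeError("boom")])

# The boolean lets the caller filter without an isinstance check.
failed = [item for ok, item in tagged if not ok]
```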

Why not. I find that it's not too hard to test if the result is an instance of Exception. But I don't have a strong opinion on this.

Same here: it can be left as simple as possible, and having no boolean is fine with me.

astrapy/db.py Outdated
    # Perform the bulk upsert with concurrency
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(self.upsert, document) for document in documents]
        response_list = [future.result() for future in futures]

@cbornet cbornet Jan 3, 2024

Same here: the returned value could be List[Union[str, Exception]], with an optional partial_failures_allowed flag.
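
One possible reading of that flag, sketched with the standard library. `concurrent_upsert` and `fake_upsert` are hypothetical, and the semantics shown (re-raise the first failure unless `partial_failures_allowed` is true) are an assumption, not the behaviour of the PR's code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Union


def concurrent_upsert(
    documents: List[dict],
    upsert: Callable[[dict], str],  # hypothetical: returns the upserted id
    concurrency: int = 2,
    partial_failures_allowed: bool = False,
) -> List[Union[str, Exception]]:
    """Upsert documents concurrently, returning one id or exception per document."""
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(upsert, doc) for doc in documents]
        results: List[Union[str, Exception]] = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:
                if not partial_failures_allowed:
                    raise  # strict mode: surface the first failure
                results.append(exc)
    return results


def fake_upsert(doc: dict) -> str:
    # Stand-in for the API call: a document without "_id" raises KeyError.
    return doc["_id"]


results = concurrent_upsert(
    [{"_id": "x"}, {"name": "no id"}, {"_id": "z"}],
    fake_upsert,
    partial_failures_allowed=True,
)
```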

) -> None:
    _id0 = str(uuid.uuid4())
    _id2 = str(uuid.uuid4())
    documents: List[API_DOC] = [

@cbornet cbornet Jan 4, 2024

This long list could be replaced by a repeated element.
Something like

[{"_id": str(uuid.uuid4()), "name": "Abba"} for _ in range(30)]

or even

[{"name": "Abba"}] * 30

since we can let Cassandra set the id.
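
One difference between the two forms may be worth keeping in mind: the comprehension builds 30 separate dicts, while list multiplication repeats a reference to a single dict. That is harmless if the list is only serialized and sent, but any in-place change to one element would show up in all of them. A small sketch (variable names are illustrative):

```python
import uuid

# Comprehension: 30 distinct dicts, each with its own client-side id.
with_ids = [{"_id": str(uuid.uuid4()), "name": "Abba"} for _ in range(30)]

# Multiplication: 30 references to the *same* dict object.
shared = [{"name": "Abba"}] * 30

distinct_ids = len({doc["_id"] for doc in with_ids})
same_object = all(doc is shared[0] for doc in shared)
```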

I have rewritten the whole test code.

        },
    ]

    response = writable_vector_collection.batched_concurrent_insert_many(documents)

Set concurrency=2?

astrapy/db.py Outdated

        return self.insert_many(
            documents=batch,
            options=options,

If concurrency > 1, we could maybe set options["ordered"] = False as the insertions will be unordered anyway.
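
A sketch of how that could look, assuming `options` is a plain dict passed through to each batch call. `batch_options` is a hypothetical helper; it copies the dict so the caller's options are not mutated.

```python
from typing import Any, Dict, Optional


def batch_options(
    options: Optional[Dict[str, Any]], concurrency: int
) -> Dict[str, Any]:
    """Return a copy of options, forcing ordered=False when batches run concurrently."""
    merged = dict(options or {})
    if concurrency > 1:
        merged["ordered"] = False
    return merged


caller_options = {"ordered": True}
concurrent_opts = batch_options(caller_options, concurrency=2)
serial_opts = batch_options(caller_options, concurrency=1)
```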

@hemidactylus

I'd like to raise a point that goes back to the suggestion to collate the responses from the various API calls. I am not sure what the best design would be.

  1. The API specs make it clear that the only fields are status.insertedIds (a list) and errors (a list), so collating seems to be as future-proof as it gets.
  2. In support of collating, the caller hardly needs to know the details of the splitting into chunks. They pass a list of 45 documents and would be surprised to get back a list of 3 JSON responses from the API, covering 20, 20, and 5 insertions respectively.
  3. But then what about a single call failing with an exception? (I am not speaking of the "partial_failures_allowed" case, where the caller would not get an exception and would know how to deal with the non-inserted entries.)

To elaborate on the last point: the caller sends 45 documents and the second API call raises an exception. What is more useful to the caller?

  • Getting back [{...}, Exception, {...}], where they have to collate the partial insertedIds results and find out what went through (at the cost of being exposed to chunking, a breach of abstraction that was not originally planned).
  • Getting back a merged response (but then how would the exceptions be passed back?).
  • Seeing an exception raised (but then how would they find out which ids got inserted, which is very useful information?).

Probably any choice other than keeping the response/exception separate per chunk sacrifices some information, and I would avoid that, at the cost of exposing the "surprising" chunked structure in the response.
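
To make the 45-document scenario concrete, here is a sketch in which the per-chunk list is kept as the return value and a caller-side helper derives the merged view from it. `chunk` and `summarize` are hypothetical helpers, the responses are simulated, and only the status.insertedIds field mentioned above is assumed.

```python
from typing import Any, Dict, List, Tuple, Union

API_RESPONSE = Dict[str, Any]  # assumed alias, for illustration only


def chunk(documents: List[dict], size: int) -> List[List[dict]]:
    """Split documents into consecutive chunks of at most `size` items."""
    return [documents[i : i + size] for i in range(0, len(documents), size)]


def summarize(
    results: List[Union[API_RESPONSE, Exception]],
) -> Tuple[List[Any], List[int]]:
    """Collect inserted ids from successful chunks and indices of failed chunks."""
    inserted: List[Any] = []
    failed_chunks: List[int] = []
    for index, item in enumerate(results):
        if isinstance(item, Exception):
            failed_chunks.append(index)
        else:
            inserted.extend(item.get("status", {}).get("insertedIds", []))
    return inserted, failed_chunks


documents = [{"_id": n} for n in range(45)]
chunks = chunk(documents, 20)

# Simulated per-chunk outcome: the second API call raised.
results = [
    {"status": {"insertedIds": [doc["_id"] for doc in chunks[0]]}},
    RuntimeError("simulated failure of the second call"),
    {"status": {"insertedIds": [doc["_id"] for doc in chunks[2]]}},
]
inserted, failed_chunks = summarize(results)
```

With the per-chunk list nothing is lost: the caller can still recover which ids went through and which chunk to retry, which a single merged response or a raised exception would not tell them.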

@erichare erichare merged commit 4e63996 into master Jan 10, 2024
2 checks passed
@erichare erichare deleted the feature/#133-insert-very-many branch January 10, 2024 16:40
Successfully merging this pull request may close these issues.

AstraPy-provided insert_very_many and upsert_very_many methods with baked-in concurrency?
3 participants