Possibility to emit indices in bulk action with return_stream?: true #1367

Open
maennchen opened this issue Aug 7, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@maennchen
Contributor

Is your feature request related to a problem? Please describe.

I'm writing a Mix task that imports a large file via a bulk insert. To show progress, I would like to log a message every n entries. To do this, I currently have to enable return_records?, which keeps every record in the stream and uses memory unnecessarily.

Describe the solution you'd like

A new option emit_indices? which will return the current index instead of the whole record. Only available when return_stream?: true and return_records?: false.

Describe alternatives you've considered

None

Express the feature either with a change to resource syntax, or with a change to the resource interface

For example

rows
|> Ash.bulk_create(Resource, :action, return_stream?: true, emit_indices?: true)
|> Enum.each(fn
  {:omitted_record, i} when rem(i, 100) == 0 -> IO.puts("imported #{i} records")
  _ -> :discard
end)

Additional context

None

@maennchen
Contributor Author

We should consider whether we'd rather add an option emit_insert_counts?. That would allow us to omit RETURNING in Postgres and therefore save a lot of memory.

@zachdaniel
Contributor

We would synthesize the indices based on batch sizes, so we won't have to return records

@zachdaniel
Contributor

Like add the next 500 integers to the stream.
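
A rough sketch of that idea in plain Elixir, not Ash's actual implementation: the module `IndexSketch`, the function `stream_indices/3`, and the `handle_batch` callback are hypothetical names, with the callback standing in for whatever the data layer does with a batch. After each batch is handled, the next run of integers is added to the stream, so no records have to be returned.

```elixir
defmodule IndexSketch do
  # Hypothetical helper, not part of Ash: handles inputs in batches and emits
  # one zero-based index per handled input, without holding on to any records.
  def stream_indices(inputs, batch_size, handle_batch) do
    inputs
    |> Stream.chunk_every(batch_size)
    |> Stream.transform(0, fn batch, offset ->
      # Stand-in for the data layer call; its return value is ignored here.
      handle_batch.(batch)

      # Stream.chunk_every/2 never emits an empty chunk, so count >= 1.
      count = length(batch)
      indices = Enum.to_list(offset..(offset + count - 1))

      # Emit the indices for this batch and carry the new offset forward.
      {indices, offset + count}
    end)
  end
end

# Usage: log progress every 100 handled inputs.
1..1_200
|> IndexSketch.stream_indices(500, fn _batch -> :ok end)
|> Stream.filter(&(rem(&1, 100) == 0))
|> Enum.each(&IO.puts("handled #{&1} inputs"))
```

Note that these indices count inputs handled, not records created, which is exactly the distinction that matters for upserts.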

@maennchen
Contributor Author

I don't think we always can. If it is an upsert, we won't know which indices were created and which were not, right?

@zachdaniel
Contributor

🤔 I was thinking that emit_indices returns the number of inputs that were handled, not the number of records that were created.

@maennchen
Contributor Author

@zachdaniel I think that would be confusing given that it would also not show up in the result records...

@zachdaniel
Contributor

🤔 Potentially. It could be confusing the other way around as well, for example if you do a huge bulk upsert where everything is a match and you get back 0. But maybe not. Perhaps we should emit both for each batch? We would ask the data layer to return a count of inserts (or the records, if it can't do that) and then emit something like {500, 350} after each batch: the number of inputs handled and the number of records actually created.

Perhaps for future-proofing we should emit something like %Ash.BulkResult.BatchStatus{inputs: 500, created: 350}
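
To make the proposal concrete, here is a minimal sketch of how a caller might consume such per-batch statuses. It assumes a struct with `inputs` and `created` fields as suggested above; `BatchStatusSketch` and `totals/1` are hypothetical local stand-ins, since `Ash.BulkResult.BatchStatus` is only a proposal in this thread.

```elixir
defmodule BatchStatusSketch do
  # Local stand-in for the proposed %Ash.BulkResult.BatchStatus{} struct.
  defstruct inputs: 0, created: 0

  # Folds an enumerable of per-batch statuses into running totals.
  def totals(statuses) do
    Enum.reduce(statuses, %__MODULE__{}, fn %__MODULE__{} = status, acc ->
      %__MODULE__{
        inputs: acc.inputs + status.inputs,
        created: acc.created + status.created
      }
    end)
  end
end

# Usage: two batches of an upsert where some inputs matched existing rows.
statuses = [
  %BatchStatusSketch{inputs: 500, created: 350},
  %BatchStatusSketch{inputs: 200, created: 200}
]

summary = BatchStatusSketch.totals(statuses)
IO.puts("handled #{summary.inputs} inputs, created #{summary.created} records")
```

A struct leaves room to add further fields later without breaking callers that pattern match on `inputs` and `created`, which a bare tuple would not.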

@maennchen
Contributor Author

That sounds like a good compromise :)

@maennchen
Contributor Author

Started working on a PR.
