
Import transactions by quarter, in chunks of a few hundred #212

Merged (10 commits) Sep 23, 2024

Conversation

hancush (Member) commented Sep 18, 2024

Overview

The transaction import makes something like 10 database calls per iteration. That can really slow us down when our database connection and/or cloud compute environment isn't zippy (#210).

This PR reduces the number of iterations from n to roughly n/4 by slicing the import by quarter, and reduces the number of database calls made by transaction saves (other queries are unaffected) from n to n/500 by batching saves in chunks of 500. Both the quarter slicing and the batch size are configurable, with reasonable defaults.
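The chunking arithmetic above can be sketched in isolation. This is an illustrative stand-in, not the PR's actual implementation; `batched` and the counts are assumptions:

```python
from itertools import islice

BATCH_SIZE = 500  # the configurable default described above

def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# One save call per chunk instead of one per record.
saved_calls = 0
for chunk in batched(range(1200), BATCH_SIZE):
    saved_calls += 1  # stand-in for a bulk save of `chunk`

# 1200 records produce 3 save calls (500 + 500 + 200).
```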

It further reduces run time by importing only one transaction type per job, rather than both contributions and expenditures. Jobs are now created via a matrix strategy, so all transaction imports run concurrently.

Connects #210.

Notes

I initially filtered by month, but filing periods tend to correspond to quarters, so some months have no filings. Quarters seemed better: they yield jobs of roughly equal size, with no wasted jobs (i.e., jobs that don't import anything).

The quarter and year parameters are a little confusing.

When a filing period spans years, its transactions can occur across two data files. The year is the vintage of the data file, not the transaction date or filing period, and is used to remove only transactions from a given filing in the given year (i.e., ones that should be reimported in a given run).

The quarter is used to filter filings to only those with a filing period beginning in the given quarter of any year. In this way, it accounts for filings spanning years. For example, consider a filing period starting in December 2023 and ending in February 2024. Transactions would be split across the 2023 and 2024 files. To get them all, you would run the Q4 import for both 2023 and 2024.
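To make the quarter/year interaction concrete, here is a minimal sketch; `get_quarter` is assumed to map a date to its calendar quarter (the PR's helper may differ in detail):

```python
from datetime import date

def get_quarter(d):
    """Quarter (1-4) a date falls in. The import keys filings by the
    quarter their filing period *begins* in, regardless of year."""
    return (d.month - 1) // 3 + 1

# A filing period spanning December 2023 through February 2024:
period_start = date(2023, 12, 1)

# The filing is selected by the Q4 filter...
quarter = get_quarter(period_start)  # 4

# ...so to capture transactions split across the 2023 and 2024 data
# files, you run the Q4 import once with year=2023 and once with
# year=2024.
```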

Testing Instructions

@hancush changed the title from "Add month filter to transaction import" to "Import transactions by quarter, in chunks of a few hundred" on Sep 19, 2024
Comment on lines +19 to +20
.SECONDEXPANSION:
import/% : _data/sorted/$$(word 1, $$(subst _, , $$*))_$$(word 3, $$(subst _, , $$*)).csv
hancush (author):

Parse the transaction type and year out of a pattern like CON_1_2023. One transaction file covers the entire year, so we don't need to download it again for each quarterly import.
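The Make functions in the rule above can be mirrored in Python to show what they compute; the names here are illustrative:

```python
# Mirror of the Make calls: $(subst _, ,$*) splits the pattern stem on
# underscores, and $(word 1, ...) / $(word 3, ...) pick out fields.
stem = "CON_1_2023"  # transaction type, quarter, year

fields = stem.split("_")      # $(subst _, ,$*)
transaction_type = fields[0]  # $(word 1, ...) -> "CON"
year = fields[2]              # $(word 3, ...) -> "2023"

# The prerequisite becomes _data/sorted/CON_2023.csv, shared by all
# four quarterly imports for that type and year.
prerequisite = f"_data/sorted/{transaction_type}_{year}.csv"
```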

Comment on lines +5 to +7
define quarterly_target
$(foreach YEAR,$(1),$(patsubst %,import/$(2)_%_$(YEAR),1 2 3 4))
endef
hancush (author):

Create a target for each year ($1), transaction type ($2), and quarter.

@hancush hancush marked this pull request as ready for review September 23, 2024 13:28
with:
ref: "deploy"
- name: Import data for 2024
ref: "hcg/batch-it-up"
hancush (author):

TODO: Change back to deploy before merge.

fgregg (Member) left a comment:

looking good. one bug fix, and some suggestions.

"""
return filter(
lambda x: get_quarter(x[0][2]) in filing_quarters,
groupby(tqdm(records), key=filing_key),
fgregg:

this tqdm is a bit confusing, as it is over all the records, not just the filtered records.

would something like

return groupby(tqdm(filter(lambda x: ..., records)), key=filing_key)

work?

hancush (author):

Yup! Done.
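The agreed-on shape (filter first, then group, so the progress bar covers only the kept records) might look like this sketch. The tqdm wrapper is omitted to keep the example dependency-free, and `get_quarter` and the record layout are stand-ins for the PR's actual helpers:

```python
from itertools import groupby
from datetime import date

def get_quarter(d):
    """Illustrative stand-in for the PR's quarter helper."""
    return (d.month - 1) // 3 + 1

def filtered_filings(records, filing_quarters, filing_key):
    """Keep only records whose filing-period start falls in one of
    `filing_quarters`, then group them into filings. In the PR the kept
    iterator is wrapped in tqdm(...) so the progress bar counts only
    records that will actually be imported. Note that filter() takes
    the predicate first, then the iterable."""
    kept = filter(
        lambda r: get_quarter(filing_key(r)[2]) in filing_quarters,
        records,
    )
    return groupby(kept, key=filing_key)

# Records must already be sorted by filing key, as groupby requires.
records = [
    ("filing-1", "X", date(2023, 12, 1)),
    ("filing-1", "X", date(2023, 12, 1)),
    ("filing-2", "Y", date(2024, 2, 1)),
]
q4_filings = [key for key, _ in filtered_filings(records, {4}, lambda r: r)]
```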


if not len(batch) % batch_size:
self._save_batch(batch)
batch = []
fgregg:

you also need to handle the case where you have iterated through all the records, but the final batch isn't a multiple of the batch_size

hancush (author):

Great catch, thanks.
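The fix can be sketched as follows; `save_batch` stands in for the PR's `self._save_batch`, and the names are illustrative:

```python
def save_in_batches(records, save_batch, batch_size=500):
    """Accumulate records and save them `batch_size` at a time, then
    flush whatever is left after the loop: the final batch is usually
    smaller than batch_size and would otherwise be dropped."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) % batch_size == 0:
            save_batch(batch)
            batch = []
    if batch:  # the fix: flush the leftover partial batch
        save_batch(batch)

saved = []
save_in_batches(range(7), saved.append, batch_size=3)
# saved holds three batches: two full, one partial -> sizes [3, 3, 1]
```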

contribution = self.make_contribution(record, None, filing)
batch.append(contribution)

if not len(batch) % batch_size:
fgregg:

when doing modulo, i think it's better to have the form that

len(batch) % batch_size == 0

i think it's just a touch more explicit.

hancush (author):

Love it. Done.

@hancush hancush requested a review from fgregg September 23, 2024 18:27
fgregg (Member) left a comment:

nice work!

antidipyramid left a comment:

Very curious to see what the performance improvement looks like!

):
yield cls.objects.bulk_create(cls_records)

def import_contributions(self, f, quarters, year, batch_size):


Such a clean function.
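The `yield cls.objects.bulk_create(cls_records)` line suggests each batch is grouped by model class before saving. A Django-free sketch of that grouping, with hypothetical classes standing in for the real models:

```python
from itertools import groupby

def save_batch(batch):
    """Group a mixed batch of model instances by class so each class's
    records can be saved with one bulk call. groupby needs sorted
    input, so sort on the class name first. Yields (class, records)
    pairs in place of cls.objects.bulk_create(cls_records)."""
    batch = sorted(batch, key=lambda r: type(r).__name__)
    for cls, cls_records in groupby(batch, key=type):
        yield cls, list(cls_records)

class Contribution:  # hypothetical model classes
    pass

class Loan:
    pass

groups = dict(save_batch([Contribution(), Loan(), Contribution()]))
```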

@hancush hancush merged commit 25ba297 into main Sep 23, 2024
2 checks passed
@hancush hancush deleted the hcg/batch-it-up branch September 23, 2024 18:53