Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reparse Memory Management #3172

Merged
merged 48 commits into from
Sep 30, 2024
Merged

Reparse Memory Management #3172

merged 48 commits into from
Sep 30, 2024

Conversation

elipe17
Copy link

@elipe17 elipe17 commented Sep 5, 2024

Summary of Changes

How to Test

List the steps to test the PR
These steps are generic, please adjust as necessary.

cd tdrs-frontend && docker-compose up
cd tdrs-backend && docker-compose up
  1. Open http://localhost:3000/ and sign in.
  2. Submit ADS.E2J.NDM1.TS53_fake.txt four or five times to get one million or more records into the DB. (This goes faster if you add more celery workers to gunicorn_start.sh!)
  3. Exec into web and run python manage.py clean_and_reparse -y 2023. If you use the -a argument you actually won't see the issue occur because -a doesn't pass the queryset to Elastic DSL which is what causes it to come into memory.
  4. Monitor the backend memory consumption while deleting records. It should increase, but only around 50-100MB.

Deliverables

More details on how deliverables herein are assessed included here.

Deliverable 1: Accepted Features

Checklist of ACs:

  • Clean and reparse command does not get killed (in general) when large querysets are involved.

Deliverable 2: Tested Code

  • Are all areas of code introduced in this PR meaningfully tested?
    • If this PR introduces backend code changes, are they meaningfully tested?
    • If this PR introduces frontend code changes, are they meaningfully tested?
  • Are code coverage minimums met?
    • Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
    • Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

  • Are backend code style checks passing on CircleCI?
  • Are frontend code style checks passing on CircleCI?
  • Are code maintainability principles being followed?

Deliverable 4: Accessible

  • Does this PR complete the epic?
  • Are links included to any other gov-approved PRs associated with epic?
  • Does PR include documentation for Raft's a11y review?
  • Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

  • Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

  • Does this PR provide background for why coding decisions were made?
  • If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces dependencies, are their licenses documented?
  • Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

  • Does the OWASP Scan pass on CircleCI?
  • Do manual code review and manual testing detect any new security issues?
  • If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

  • the purpose of the research
  • methods used to conduct the research
  • who participated in the research
  • what was tested and how
  • impact of research on TDP
  • (if applicable) final design mockups produced for TDP development

@elipe17 elipe17 added bug backend dev raft review This issue is ready for raft review labels Sep 5, 2024
@elipe17 elipe17 self-assigned this Sep 5, 2024
Copy link

codecov bot commented Sep 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.66%. Comparing base (95fc24b) to head (a981311).
Report is 49 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##           develop    #3172   +/-   ##
========================================
  Coverage    92.66%   92.66%           
========================================
  Files           47       47           
  Lines         1009     1009           
  Branches       169      169           
========================================
  Hits           935      935           
  Misses          42       42           
  Partials        32       32           
Flag Coverage Δ
dev-frontend 92.66% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95fc24b...a981311. Read the comment docs.

Copy link
Collaborator

@andrew-jameson andrew-jameson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@raftmsohani raftmsohani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we start writing test for clean_and_reparse

@ADPennington ADPennington added the Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI label Sep 24, 2024
@ADPennington
Copy link
Collaborator

testing re-parsing command again for FY23Q1 files. started at 9:18a ET...

@ADPennington
Copy link
Collaborator

ADPennington commented Sep 24, 2024

temporarily blocked

testing re-parsing command again for FY23Q1 files. started at 9:18a ET...

temporarily blocked on testing. command froze after attempting to delete approx 1.5mil TANF_T3 records and now the rds is down.

@ADPennington ADPennington added Blocked Label for Pull Requests that are currently blocked by a dependency and removed Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI labels Sep 24, 2024
@ADPennington
Copy link
Collaborator

temporarily blocked

testing re-parsing command again for FY23Q1 files. started at 9:18a ET...

temporarily blocked on testing. command froze after attempting to delete approx 1.5mil TANF_T3 records and now the rds is down.

per @elipe17, and with the help of cloudgov support, the dev environment rds service instance was upgraded to medium-gp-psql with storage_type=gp3 to support qasp review. Will re-try the test today.

@ADPennington ADPennington added Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI and removed Blocked Label for Pull Requests that are currently blocked by a dependency labels Sep 26, 2024
@ADPennington ADPennington added Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI and removed Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI labels Sep 26, 2024
@ADPennington
Copy link
Collaborator

temporarily blocked

testing re-parsing command again for FY23Q1 files. started at 9:18a ET...

temporarily blocked on testing. command froze after attempting to delete approx 1.5mil TANF_T3 records and now the rds is down.

per @elipe17, and with the help of cloudgov support, the dev environment rds service instance was upgraded to medium-gp-psql with storage_type=gp3 to support qasp review. Will re-try the test today.

Screenshot 2024-09-26 160312

@ADPennington
Copy link
Collaborator

QASP review update:

this ticket is not blocked, but we are standing by for a response from cloudgov support team re: if there are cost implications for us if we implement the following rds service changes:

  • Dev: from micro-psql to medium-gp-psql
  • Staging: from micro-psql to medium-gp-psql
  • Prod: from medium-psql to medium-gp-psql or large-gp-psql

https://cloud.gov/docs/services/relational-database/

another consideration (if there are cost implications) is #3106 or #3108, which will give us more control over how many records are re-parsed at a time.

@ADPennington
Copy link
Collaborator

ADPennington commented Sep 30, 2024

QASP review update:

this ticket is not blocked, but we are standing by for a response from cloudgov support team re: if there are cost implications for us if we implement the following rds service changes:

  • Dev: from micro-psql to medium-gp-psql
  • Staging: from micro-psql to medium-gp-psql
  • Prod: from medium-psql to medium-gp-psql or large-gp-psql

https://cloud.gov/docs/services/relational-database/

another consideration (if there are cost implications) is #3106 or #3108, which will give us more control over how many records are re-parsed at a time.

we received confirmation that, at least for now, there are no extra costs for the upgrade. @elipe17 should we proceed with making the terraform changes to staging and prod in this ticket?

@elipe17
Copy link
Author

elipe17 commented Sep 30, 2024

QASP review update:
this ticket is not blocked, but we are standing by for a response from cloudgov support team re: if there are cost implications for us if we implement the following rds service changes:

  • Dev: from micro-psql to medium-gp-psql
  • Staging: from micro-psql to medium-gp-psql
  • Prod: from medium-psql to medium-gp-psql or large-gp-psql

https://cloud.gov/docs/services/relational-database/
another consideration (if there are cost implications) is #3106 or #3108, which will give us more control over how many records are re-parsed at a time.

we received confirmation that, at least for now, there are no extra costs for the upgrade. @elipe17 should we proceed with making the terraform changes to staging and prod in this ticket?

@ADPennington I think we should. If/when cost comes into the equation we can change the TF again to mitigate that. Since we don't have to take any manual steps to change the service plan this is easy to manipulate and change as our needs change. I will go ahead and make the changes. Note the tests are going to fail on this after a I make changes until this PR merges.

@ADPennington ADPennington removed the Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI label Sep 30, 2024
@ADPennington ADPennington added the Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI label Sep 30, 2024
Copy link
Collaborator

@ADPennington ADPennington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great @elipe17 🚀 testing notes below cc: @ttran-hub @klinkoberstar ⬇️


after upgrading rds to medium-gp-psql, tried the following fiscal periods:

  • python manage.py clean_and_reparse -y 2023 -q Q1 (N=90 files) 👍
  • python manage.py clean_and_reparse -y 2024 (N=300 files) 👍
  • python manage.py clean_and_reparse -y 2023 -q Q2 (N=20 files) 👍

evidence of logentries
logentries

evidence of deleting objects

Backup complete! Commencing clean and reparse.
2024-09-30 17:40:02,518 INFO clean_and_reparse.py::_delete_summaries:L88 :  Deleting 20 datafile summary objects.
Deleting 20 datafile summary objects.
2024-09-30 17:40:02,525 INFO clean_and_reparse.py::_delete_summaries:L90 :  Successfully deleted datafile summary objects.
Successfully deleted datafile summary objects.
2024-09-30 17:40:03,306 INFO clean_and_reparse.py::_delete_errors:L158 :  Deleting 2154232 parser errors.
Deleting 2154232 parser errors.
2024-09-30 17:40:08,748 INFO clean_and_reparse.py::_delete_errors:L160 :  Successfully deleted parser errors.
Successfully deleted parser errors.
2024-09-30 17:40:08,868 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T1'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T1'>.
2024-09-30 17:40:09,636 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T2'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T2'>.
2024-09-30 17:40:10,765 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T3'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T3'>.
2024-09-30 17:40:11,426 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T4'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T4'>.
2024-09-30 17:40:11,435 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T5'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T5'>.
2024-09-30 17:40:11,446 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T6'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T6'>.
2024-09-30 17:40:11,452 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T7'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tanf.TANF_T7'>.
2024-09-30 17:40:11,512 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 266541 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M1'>.
Deleting 266541 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M1'>.
2024-09-30 17:41:40,659 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 401730 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M2'>.
Deleting 401730 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M2'>.
2024-09-30 17:44:08,643 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 652485 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M3'>.
Deleting 652485 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M3'>.
2024-09-30 17:47:32,931 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 30 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M4'>.
Deleting 30 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M4'>.
2024-09-30 17:47:32,970 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 104 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M5'>.
Deleting 104 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M5'>.
2024-09-30 17:47:33,012 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M6'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M6'>.
2024-09-30 17:47:33,019 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M7'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.ssp.SSP_M7'>.
2024-09-30 17:47:33,028 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T1'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T1'>.
2024-09-30 17:47:33,039 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T2'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T2'>.
2024-09-30 17:47:33,050 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T3'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T3'>.
2024-09-30 17:47:33,059 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T4'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T4'>.
2024-09-30 17:47:33,065 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T5'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T5'>.
2024-09-30 17:47:33,073 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 6 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T6'>.
Deleting 6 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T6'>.
2024-09-30 17:47:33,152 INFO clean_and_reparse.py::_delete_records:L129 :  Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T7'>.
Deleting 0 records of type: <class 'tdpservice.search_indexes.models.tribal.Tribal_TANF_T7'>.
2024-09-30 17:47:33,156 INFO clean_and_reparse.py::_calculate_timeout:L255 :  Setting timeout for the reparse event to be 2:06:02.973328 seconds from meta model creation date.
Setting timeout for the reparse event to be 2:06:02.973328 seconds from meta model creation date.
2024-09-30 17:47:33,159 INFO clean_and_reparse.py::handle:L367 :  Deleted a total of 1320896 records accross 20 files.
Deleted a total of 1320896 records accross 20 files.
2024-09-30 17:47:33,159 INFO clean_and_reparse.py::handle:L370 :  Deleting and re-parsing 20 files
Deleting and re-parsing 20 files
2024-09-30 17:47:33,379 INFO clean_and_reparse.py::handle:L380 :  Done. All tasks have been queued to parse the selected datafiles.
Done. All tasks have been queued to parse the selected datafiles.

evidence of reparse model

model

@ADPennington ADPennington added Ready to Merge and removed QASP Review Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI labels Sep 30, 2024
@elipe17 elipe17 merged commit c5f87eb into develop Sep 30, 2024
23 checks passed
@elipe17 elipe17 deleted the 3170-reparse-memory-mgmt branch September 30, 2024 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reparse Command Fails when Queryset is Large
5 participants