Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Professional Education ETL Pipeline #1210

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Professional Education ETL Pipeline #1210

wants to merge 10 commits into from

Conversation

mbertrand
Copy link
Member

@mbertrand mbertrand commented Jul 1, 2024

Blocked until the new API has production-ready data.

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/4731

Related ol-infrastructure PR: mitodl/ol-infrastructure#2536

Description (What does it do?)

  • Adds a new ETL pipeline to ingest courses and programs directly from Professional Education's own API feed instead of Prolearn's.
  • Excludes PE courses from being ingested by the Prolearn ETL pipeline.

Screenshots (if appropriate):

Screenshot 2024-07-08 at 8 59 20 AM

How can this be tested?

  • Set the following values in your .env file:

     PROLEARN_CATALOG_API_URL=https://prolearn.mit.edu/graphql
     MITPE_BASE_API_URL=https://professional.mit.edu/
    
  • On the main branch, run ./manage.py backpopulate_prolearn_data

  • When it is done, there should be roughly >130 learning resources with etl_source=prolearn and offered_by=mitpe. They will all be courses and not programs. http://api.open.odl.local:8063/admin/learning_resources/learningresource/?o=2&offered_by__code__exact=mitpe

  • If you don't already have a populated featured learning path for Professional Education, run ./manage.py populate_featured_lists to create one

  • Switch to this branch. Run ./manage.py backpopulate_mitpe_data followed by ./manage.py backpopulate_prolearn_data) When it is done, most/all the same courses returned above should now have etl_source=mitpe and have new readable_id values. Some that were courses in prolearn should be unpublished, and created as new programs instead of courses (about 27 published programs). There should not be any published dupes from different sources (prolearn and mitpe). Some will be unpublished now due to changed criteria (past enrollment end date or past end date).

  • If you go to http://open.odl.local:8062/learningpaths and open the Professional Education featured learning path, it will be empty. Run the following to reassign the old featured courses from prolearn to. their new equivalents

./manage.py transfer_list_resources course course url prolearn mitpe
./manage.py transfer_list_resources course program url prolearn mitpe
  • Once complete, you should have courses/programs in the MITPE featured learning path again. These commands will need to be run on RC/production as well to retain the current featured lists.

Copy link
Contributor

@abeglova abeglova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works great! Back to you with a request for some variable name changes and a question

**{unique_field: unique_value},
platform=platform,
resource_type=resource_type,
published=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused by this logic. Why is the delete necessary? And does that mean that courses that are imported as unpublished, such as mitpe courses with no future runs in the logic below are deleted and recreated every run?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should be run only when the readable_id is not the unique_field value, and in that case if there may be multiple unpublished resources with the same value for unique_field that need to be removed to avoid database duplicate errors.

Good catch on the mitpe courses with no future runs though, those probably should just be skipped altogether instead of (re)created as unpublished. Not much point in importing unpublished resources.

img_metadata = get_related_object(img_url)
if img_metadata:
field_image = img_metadata["relationships"]["field_media_image"]
img_url_2 = field_image["links"]["related"]["href"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be called image_data_url or something like that?

Returns:
list of dict: list containing topic dicts with a name attribute
"""
topic_data = resource_data["relationships"]["field_course_topics"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be

topics_exist = resource_data["relationships"]["field_course_topics"]["data"] 

instead of reusing the variable name topic_data to mean different things

@mbertrand
Copy link
Member Author

@abeglova Peter and I are going to have a meeting at some point with the Professional Education API developer to discuss possible enhancements like clearly delineating programs vs courses. So I'm going to label this as "Blocked" until after that meeting.

Copy link

gitguardian bot commented Jul 15, 2024

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@mbertrand mbertrand removed the Needs Review An open Pull Request that is ready for review label Jul 17, 2024
@mbertrand mbertrand force-pushed the mb/profed_etl branch 3 times, most recently from d48bc1a to 9a5299a Compare August 7, 2024 14:51
@mbertrand mbertrand force-pushed the mb/profed_etl branch 2 times, most recently from 580fcc4 to f3495f4 Compare August 9, 2024 20:07
@mbertrand mbertrand changed the title Professional Education ETL PIpeline Professional Education ETL Pipeline Aug 9, 2024
@mbertrand mbertrand force-pushed the mb/profed_etl branch 3 times, most recently from 45537cf to ddcb055 Compare August 28, 2024 14:16
@mbertrand mbertrand force-pushed the mb/profed_etl branch 3 times, most recently from 27aec69 to 76542a7 Compare September 20, 2024 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants