Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation Extension Support #200

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

JVickery-TBS
Copy link
Contributor

Adds support for the ckanext-validation plugin. Behind the new config option ckanext.xloader.requires_validation, Resources will be required to be validated before being xloadered.

- Add config option for resources to require validation.
- Only submit to xloader if validation is successful.
- Added debug logging.
- Do not use 2.10+ helper.
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Check toolkit action instead of plugin name.
- Added method docs to explain new method.
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
# Conflicts:
#	ckanext/xloader/plugin.py
### RESOLVED.
- Fixed typo.
- Fixed try/catch.
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Prevent POST to upload to DS if validation is required.
- flake8 if wrap indents.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA I added the flash message to the resource_data POST request here too now. So will prevent users from using the "Upload to DataStore" button if the resource has not passed validation yet.

ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Syntax fixes from flake8.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA fixed flake8 check here now as well.

- Moved logic into util method.
- Renamed config options.
- Added new config option to enforce validation schema existance.
# Conflicts:
#	ckanext/xloader/plugin.py
#	ckanext/xloader/utils.py
### RESOLVED.
- Added automated tests.
- Changed the logic to be better and more clear.
- Fixed a logic case.
@duttonw
Copy link
Collaborator

duttonw commented Feb 3, 2024

Love the work you have done on this @JVickery-TBS . One thing that I think would make this perfect is to see if xloader can hook into validation to run the xloader load in sync mode right after we have a success, or maybe another priority queue so its not delayed again. Since validation success ensure you don't have blank lines or duplicate headers, the fast load should always run and not defer back to messytables slow python orm style commits.

Just a bit of background:
Reason being is that in data.qld.gov.au we defer validation to the batch queue so that the user does not have a stalled process on large datasets or slow url collections. We sometimes get alot of 1million row updated csv's loaded and if they fail the fast load take up to 1hour to be ingested. If we had a validation job occur and then had the update datastore put behind such a resource, then the author would get impatient and lodge a support ticket on why their resource is not updated.

Something to think about with this integration:
Another problem I forsee, mostly in the realm of the validation extension, is that when you save the dataset or re order said dataset, it may then re-validate ALL resources. It would be awesome if we could utilise part of xloader skip checks if the filehash is the same to save on duplicate runs.

Possible other extension enhancement for performance:
Unsure if your system uses the archiver extension. data.qld.gov.au does. This means we have downloaded from url or off disk/blob store the file 3 times. or more due to validation loops. Thankfully the archiver is quite fast while the validation and xloader is quite slow.

# Conflicts:
#	ckanext/xloader/config_declaration.yaml
#	ckanext/xloader/plugin.py
#	ckanext/xloader/tests/test_plugin.py
### RESOLVED.
- Better conditional syntax.
@JVickery-TBS
Copy link
Contributor Author

@duttonw Yeah we do the same thing with our setup as well in which we have to do async mode for validation and xloader as most of the CSVs that get upload are gigantic.

As for the UX for users waiting for these things, I am currently working on some UI things (open-data/ckanext-canada#1430) to show the status of Xloader for a Resource, e.g. "Pending" or "Running" etc. Still kind of a work in progress, but let me know if this would be a good thing to put back into the Xloader repo.

Regarding the archiver, we actually use the cloud storage plugin, so things have to be gathered from there every time. But with the loop that happens, we kind of actually found out that this was an issue with the Core CKAN code and how the Resource plugin hooks work. Things tend to always get run multiple times because the Resource actions always call the Package actions. Still discussions to go on how to fix that.

open-data/ckanext-validation#9 <- I put in a temporary fix on our Validation fork to prevent the looping. Along with a temporary fix in our Xloader one (open-data@df5a4c0). These two things prevent any job looping. This was for sure a major issue as the resource_patch from Xloader would cause another Validation run, which would then cause another Xloader run and stop because file hash did not change, moreover, it would do that to ALL the resources in a dataset. So yeah, updating a single Resource should have enqueued 1 or 2 jobs, but could end up enqueuing like 70.

YES the validation and xloader were slow, so because of this loop, some datasets with 200+ resources in them backed up the job queue to a couple thousand. It was terrible and that is when I looked to fix this job looping thing. I will try and get my job looping prevention into the ckanext-validation repo this week. Even if this gets fixed sometime in upstream, should not really have a major impact with the work-around in Validation.

But as for your request to have validation do resource_patch and have Xloader ALWAYS do sync instead of async in that specific case, I shall look into that this week as well. Because, yeah, I feel like it could be annoying for a user to have to wait for Validation, and then after that passes, wait again for their Xloader job (that is now at the end of the queue) to run. Especially on busier sites or at busier times when the queue might be larger. (Especially with the job loop stuff happening, this is basically always hahaha)

- Fixed inline comments to make more sense.
@ThrawnCA
Copy link
Collaborator

ThrawnCA commented Feb 5, 2024

Another issue with asynchronous processing is that the logic in plugin.py to check whether XLoader can proceed will always see the previous validation_status.

@duttonw
Copy link
Collaborator

duttonw commented Feb 6, 2024

Hi @JVickery-TBS

As for the UX for users waiting for these things, I am currently working on some UI things (open-data/ckanext-canada#1430) to show the status of Xloader for a Resource, e.g. "Pending" or "Running" etc. Still kind of a work in progress, but let me know if this would be a good thing to put back into the Xloader repo.

Please do create a new pr for the inclusion of the badge.

Place it behind a config option to disable badge (default is on)

When enabled:
Have these messages always visible:

 'pending': _('Data awaiting load to DataStore'),
 'running': _('Loading data into DataStore'),
 'error': _('Failed to load data into DataStore'),

Have another flag to enable all (debug level) messages:

 'complete': _('Data loaded into DataStore'),
 'active': _('Data available in DataStore'),
 'inactive': _('Resource not active in DataStore'),
 'unknown': _('DataStore status unknown'),

My reasoning for this is to reduce the clutter on resources. You only want to know when things are changing most of the time. Once its done its purpose is served.

I'd also only include the url for people who have author or above. Reason being is that we allow non-authors to register to provide 'data requests' as well as commenting and email notification on dataset/resources/groups etc (atm this is paused due to very bad sql performance on dashboard viewing which seem to be a recurring problem ckan/ckan#6028 and ckan/ckan#7878 )

pusher_url = t.h.url_for('xloader.resource_data',
                             id=resource.get('package_id'),
                             resource_id=resource.get('id'))

@JVickery-TBS
Copy link
Contributor Author

@duttonw okay sounds great I will make time in these coming weeks to do the above. (and yeaahhh the Following and the Dashboard stuff gets pretty slow, I think we have disabled following things and the dashboard as well).

@ThrawnCA Yeah, I was thinking of how to do the Validation Pass -> Sync Mode Xloader. I am just trying to figure out how to do it all in this plugin. As we have the button to "Upload to Datastore". So I don't know the logic to determine if the caller is from the View or from the Validation Job yet. (maybe I can see if we are currently in a job and see what the job callback is from the ckan.lib.jobs). We shall see!

- Started doing sync mode for xloader right after validation.
@duttonw
Copy link
Collaborator

duttonw commented Feb 6, 2024

Hi @JVickery-TBS ,

@ThrawnCA did something similar in our branch for the CLI click command for sys admins to force a datastore when under the crunch, you can see in in this pr qld-gov-au#82

Regards,

@duttonw

@ThrawnCA
Copy link
Collaborator

ThrawnCA commented Feb 6, 2024

So I don't know the logic to determine if the caller is from the View or from the Validation Job yet. (maybe I can see if we are currently in a job and see what the job callback is from the ckan.lib.jobs). We shall see!

That could be easily fixed by putting a hidden input flag in the view. It wouldn't even be a problem if someone spotted it and chose to insert it into an API call.

- Continued sync mode chaining from validation.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA @duttonw thanks for sending over that commit, I will check it out to see if I can use some of the code.

The issue that I am seeing here is now how RQ runs the jobs in workers. Firstly, you would have to have a worker running for that queue, and have it in the top priority. I would like to not have users need to run a separate queue. Have not read Thrawn's commit yet, but maybe it just executes the burst of that custom queue right away?

I would also like to support other devs in which they call the submit to xloader action, it should probably always enqueue by default.

I am currently having difficulties preventing loops when running the xloader directly from the validation job. So just trying to figure that one out.

- Continued sync mode chaining from validation.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA @duttonw its not exactly where I want it, but am at a bit of a block right now.

I got it to a point in which if it is chaining from the Validation job, then it will enqueue a job and push it to the start of the queue.

Bad things with this are:

  • the condition to check if we are inside of a validation job kinda sucks;
  • validation can do sync mode itself, so does not necessarily even have a job;
  • if a user has another queue other than default with a higher priority, then the push to top of default queue might not actually work great.

So I am not really sure at this point how to move forward, I feel like the best/easiest thing would be the add some feature to ckanext-validation. The extension DOES have a context variable called _performed_validation. This WOULD solve all the issues with the bad job check, and the sync validation style. But as it is inside of the context variable, the notify method from the IDomainObjectModification implementation does not have the context passed to it. So I am also trying to figure out a way to get that variable

ckanext/xloader/action.py Outdated Show resolved Hide resolved
@ThrawnCA
Copy link
Collaborator

ThrawnCA commented Feb 7, 2024

@JVickery-TBS Have you looked at the interaction between the archiver and QA plugins? Archiver defines an IPipe interface to receive callbacks, QA implements that interface, so when Archiver has finished uploading a resource, QA gets nudged to go and examine it.

A similar pattern could be used here, but would require changes to the Validation plugin to add callback support.

# Conflicts:
#	ckanext/xloader/action.py
### RESOLVED.
- Implement experimental `IPipeValidation` implement.
@JVickery-TBS
Copy link
Contributor Author

requires: ckan/ckanext-validation#97

- Cannot do tests without `IPipeValidation`.
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Clearer comments.
- Clearer log messages.
- Added `ignore_not_sysadmin` validator to the sync key.
if sync:
log.exception('Unable to xloader res_id=%s', res_id)
else:
log.exception('Unable to enqueued xloader res_id=%s', res_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Unable to enqueued" -> "Unable to enqueue"

Copy link
Collaborator

@duttonw duttonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than @ThrawnCA debug wording tweak. It looks good.

Is these commits used anywhere else (i.e. is it a candidate for squashing for clean history).

Can you also update the README.md and maybe the Change log of what you added.

Since I've recently become a mantainer of the pypi module for this so I'd like to push this out to that soon after. (Doing it correctly with a github action pipeline)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants