improve data.json record change detection #4414

Closed · FuhuXia opened this issue Aug 3, 2023 · 8 comments

Labels: bug (Software defect or bug), CKAN, Explore, O&M (Operations and maintenance tasks for the Data.gov platform)

FuhuXia commented Aug 3, 2023

We are using source_hash to detect changes in a data.json record, and based on the hash comparison result, harvesting decides to update or skip the record.

We have noticed that some dynamically generated data.json sources do not produce consistent output for their data.json records. Even when the content is the same, random ordering in certain fields (keyword, distribution, ...) gives the record a different hash. This creates a lot of harvesting overhead, updating the same record over and over.
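
A minimal illustration of the failure mode (hypothetical helper names, not the harvester's actual hashing code):

import hashlib
import json

def naive_hash(record):
    # Hypothetical stand-in for source_hash: serialize the record
    # as-is, then digest the bytes.
    return hashlib.sha1(json.dumps(record).encode("utf-8")).hexdigest()

# Identical content, but the source emitted the keywords in a different order.
a = {"identifier": "doi-1", "keyword": ["climate", "ocean"]}
b = {"identifier": "doi-1", "keyword": ["ocean", "climate"]}

print(naive_hash(a) == naive_hash(b))  # False -> record is needlessly updated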

How to reproduce

Check harvest jobs for DOI (with distribution random order) and NASA (with keyword random order)

Expected behavior

Dataset should not be updated if the content is the same but the listing order differs in certain fields.

Actual behavior

Dataset is updated by harvesting.

Sketch

So far we have only observed values within a list appearing in random order, but random key order, though not yet observed, could also give a record a different hash.

For keys, json.dumps has a sort_keys=True option. For values in a list, we may have to write our own code to sort nested lists; a sketch follows.
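
A rough sketch of that combination (hypothetical names; serializing each list element to a JSON string gives a total order even for nested values):

import hashlib
import json

def sort_nested(obj):
    # sort_keys=True handles dict keys at dump time; this handles lists,
    # recursing first so nested structures are already canonical.
    if isinstance(obj, dict):
        return {k: sort_nested(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return sorted((sort_nested(v) for v in obj),
                      key=lambda v: json.dumps(v, sort_keys=True))
    return obj

def canonical_hash(record):
    payload = json.dumps(sort_nested(record), sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()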

FuhuXia added the bug label on Aug 3, 2023

FuhuXia commented Aug 4, 2023

We can use the solution in this StackOverflow thread.

It recursively converts every dict to a list and sorts all lists. The conversion is one-way, but the hash of the resulting object is guaranteed to be identical for a JSON record regardless of its key/array ordering.
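
A sketch of that conversion (my paraphrase of the linked answer, not copied verbatim):

def ordered(obj):
    # Dicts become sorted lists of (key, value) pairs and every list
    # is sorted, so the result no longer depends on input ordering.
    if isinstance(obj, dict):
        return sorted((k, ordered(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return sorted(ordered(x) for x in obj)
    return obj

Hashing json.dumps(ordered(record)) then yields the same digest regardless of ordering. Note that the bare sorted() calls will raise on mixed-type lists; see the caveat below.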

hkdctol added the O&M label on Aug 4, 2023
rshewitt commented

dict comparison python library


nickumia-reisys commented Aug 11, 2023

I believe the reason it wouldn't be so simple to use deepdiff is that we compare hashes, not the actual datasets. We would need a full package lookup for each comparison, which might add more load to the DB. The current algorithm is:

pkg = model_dictize.package_dictize(hobj[1], self.context())
if dataset['identifier'] in existing_datasets:
    pkg = existing_datasets[dataset["identifier"]]
    pkg_id = pkg["id"]
    seen_datasets.add(dataset['identifier'])

    # We store a hash of the dict associated with this dataset
    # in the package so we can avoid updating datasets that
    # don't look like they've changed.
    source_hash = self.find_extra(pkg, "source_hash")
    if source_hash is None:
        try:
            source_hash = json.loads(self.find_extra(pkg, "extras_rollup")).get("source_hash")
        except TypeError:
            source_hash = None
    if pkg.get("state") == "active" \
            and dataset['identifier'] not in existing_parents_demoted \
            and dataset['identifier'] not in existing_datasets_promoted \
            and source_hash == self.make_upstream_content_hash(dataset,
                                                               source,
                                                               catalog_extras,
                                                               schema_version):
        log.info('SKIP: {}'.format(dataset['identifier']))
        continue
else:
    pkg_id = uuid.uuid4().hex

Without having to edit gather_stage or package_create and without adding more time to the pipeline, I think sorting the dataset before the hash is created requires the least change to the current code. The best solution (as it stands) seems to be sorting, not changing how we compare.
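
Concretely, something along these lines at the hash boundary (a sketch only: the real make_upstream_content_hash also mixes in source and catalog_extras, elided here, and sort_nested is the hypothetical helper sketched earlier in the thread):

import hashlib
import json

def make_sorted_content_hash(dataset):
    # Canonicalize before digesting so key order and list order in the
    # upstream data.json no longer change the hash; the skip check in
    # gather_stage stays exactly as it is.
    payload = json.dumps(sort_nested(dataset), sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()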

nickumia commented

Note about @FuhuXia's solution: keep in mind that if the lists are non-homogeneous, the sort will fail, since Python 3 does not define < between unrelated types. For this use case it may be okay to assume lists are homogeneous. (Just noting it as something to perform error checking on.)

>>> a = [True, None, 1, "asdf"]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'NoneType' and 'bool'
>>> a = [None, 1, "asdf"]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'int' and 'NoneType'

nickumia commented

@FuhuXia Here is my solution as promised, published on PyPI as sansjson:


FuhuXia commented Aug 18, 2023

@nickumia Nicely done.
For our hashing purposes we can assume all lists are homogeneous, or coerce bool, None, and numeric values to str so they become homogeneous before sorting. But your sansjson package handles non-homogeneous lists well. Let's pick whichever approach has less impact on system resources and requires fewer code changes; a sketch of the coercion idea follows.
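
For reference, a minimal sketch of that coercion as a sort key (type-prefixed so that e.g. 1 and "1" stay distinct):

>>> def mixed_key(v):
...     # Sort by (type name, string form) to get a stable total
...     # order over mixed scalar types.
...     return (type(v).__name__, str(v))
...
>>> sorted([True, None, 1, "asdf"], key=mixed_key)
[None, True, 1, 'asdf']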


FuhuXia commented Aug 31, 2023

sansjson was added to ckanext-datajson to sort JSON before hashing. Issue resolved.
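
If I recall the sansjson interface correctly (treat the function name as an assumption on my part), usage is along these lines:

>>> import sansjson
>>> a = {"keyword": ["ocean", "climate"], "identifier": "doi-1"}
>>> b = {"identifier": "doi-1", "keyword": ["climate", "ocean"]}
>>> sansjson.sort_pyobject(a) == sansjson.sort_pyobject(b)
True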

FuhuXia closed this as completed on Aug 31, 2023

FuhuXia commented Sep 12, 2023

To demonstrate how the new hashing works, take the DOI EDI source as an example. This source was reliably changing 50% of its datasets every week, i.e. ~17,000 updates on every re-harvest, which took a whole day. With the new hashing, each harvest cuts the update count roughly in half: 17,000, then ~8,000, ~4,000, ..., eventually reaching 0 once every old hash has been converted to a new one.

