improve data.json record change detection #4414

Closed · FuhuXia opened this issue Aug 3, 2023 · 8 comments

Labels: bug (Software defect or bug), CKAN, Explore, O&M (Operations and maintenance tasks for the Data.gov platform)

FuhuXia commented Aug 3, 2023

We are using source_hash to detect changes in a data.json record, and based on the hash comparison result, harvesting decides to update or skip the record.

We have noticed that some dynamically generated data.json sources do not produce consistent output for their data.json records. Even when the content is the same, random ordering in certain fields (keyword, distribution, ...) gives the record a different hash. This creates a lot of harvesting overhead, updating the same record over and over.
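
A minimal illustration of the failure mode (hypothetical helper names, not the harvester's actual hashing code):

import hashlib
import json

def naive_hash(record):
    # Hypothetical stand-in for source_hash: serialize the record
    # as-is, then digest the bytes.
    return hashlib.sha1(json.dumps(record).encode("utf-8")).hexdigest()

# Identical content, but the source emitted the keywords in a different order.
a = {"identifier": "doi-1", "keyword": ["climate", "ocean"]}
b = {"identifier": "doi-1", "keyword": ["ocean", "climate"]}

print(naive_hash(a) == naive_hash(b))  # False -> record is needlessly updated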

How to reproduce

Check harvest jobs for DOI (with distribution random order) and NASA (with keyword random order)

Expected behavior

Dataset should not be updated if the content is the same but the listing order differs in certain fields.

Actual behavior

Dataset is updated by harvesting.

Sketch

So far we have only observed values within a list appearing in random order, but random key order, though not yet observed, could also give a record a different hash.

For keys, json.dumps has a sort_keys=True option. For values in a list, we may have to write our own code to sort nested lists; a sketch follows.
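
A rough sketch of that combination (hypothetical names; serializing each list element to a JSON string gives a total order even for nested values):

import hashlib
import json

def sort_nested(obj):
    # sort_keys=True handles dict keys at dump time; this handles lists,
    # recursing first so nested structures are already canonical.
    if isinstance(obj, dict):
        return {k: sort_nested(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return sorted((sort_nested(v) for v in obj),
                      key=lambda v: json.dumps(v, sort_keys=True))
    return obj

def canonical_hash(record):
    payload = json.dumps(sort_nested(record), sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()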

FuhuXia added the bug label on Aug 3, 2023

FuhuXia commented Aug 4, 2023

We can use the solution in this StackOverflow thread.

It recursively converts every dict to a list and sorts all lists. The conversion is one-way, but the hash of the resulting object is guaranteed to be identical for a JSON record regardless of its key/array ordering.
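
A sketch of that conversion (my paraphrase of the linked answer, not copied verbatim):

def ordered(obj):
    # Dicts become sorted lists of (key, value) pairs and every list
    # is sorted, so the result no longer depends on input ordering.
    if isinstance(obj, dict):
        return sorted((k, ordered(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return sorted(ordered(x) for x in obj)
    return obj

Hashing json.dumps(ordered(record)) then yields the same digest regardless of ordering. Note that the bare sorted() calls will raise on mixed-type lists; see the caveat below.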

hkdctol added the O&M label on Aug 4, 2023
rshewitt commented

dict comparison python library


nickumia-reisys commented Aug 11, 2023

I believe the reason it wouldn't be so simple to use deepdiff is that we compare hashes, not the actual datasets. We would need a full package lookup for each comparison, which might add more load to the DB. The current algorithm is:

pkg = model_dictize.package_dictize(hobj[1], self.context())
if dataset['identifier'] in existing_datasets:
    pkg = existing_datasets[dataset["identifier"]]
    pkg_id = pkg["id"]
    seen_datasets.add(dataset['identifier'])

    # We store a hash of the dict associated with this dataset
    # in the package so we can avoid updating datasets that
    # don't look like they've changed.
    source_hash = self.find_extra(pkg, "source_hash")
    if source_hash is None:
        try:
            source_hash = json.loads(self.find_extra(pkg, "extras_rollup")).get("source_hash")
        except TypeError:
            source_hash = None
    if pkg.get("state") == "active" \
            and dataset['identifier'] not in existing_parents_demoted \
            and dataset['identifier'] not in existing_datasets_promoted \
            and source_hash == self.make_upstream_content_hash(dataset,
                                                               source,
                                                               catalog_extras,
                                                               schema_version):
        log.info('SKIP: {}'.format(dataset['identifier']))
        continue
else:
    pkg_id = uuid.uuid4().hex

Without having to edit gather_stage or package_create and without adding more time to the pipeline, I think sorting the dataset before the hash is created requires the least change to the current code. The best solution (as it stands) seems to be sorting, not changing how we compare.
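
Concretely, something along these lines at the hash boundary (a sketch only: the real make_upstream_content_hash also mixes in source and catalog_extras, elided here, and sort_nested is the hypothetical helper sketched earlier in the thread):

import hashlib
import json

def make_sorted_content_hash(dataset):
    # Canonicalize before digesting so key order and list order in the
    # upstream data.json no longer change the hash; the skip check in
    # gather_stage stays exactly as it is.
    payload = json.dumps(sort_nested(dataset), sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()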

nickumia commented

Note about @FuhuXia's solution: keep in mind that if the lists are non-homogeneous, the sort will fail, since Python 3 does not define < between unrelated types. For this use case it may be okay to assume lists are homogeneous. (Just noting it as something to perform error checking on.)

>>> a = [True, None, 1, "asdf"]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'NoneType' and 'bool'
>>> a = [None, 1, "asdf"]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'int' and 'NoneType'

nickumia commented

@FuhuXia Here is my solution as promised, published on PyPI as sansjson:


FuhuXia commented Aug 18, 2023

@nickumia Nicely done.
For our hashing purposes we can assume all lists are homogeneous, or coerce bool, None, and numeric values to str so they become homogeneous before sorting. But your sansjson package handles non-homogeneous lists well. Let's pick whichever approach has less impact on system resources and requires fewer code changes; a sketch of the coercion idea follows.
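
For reference, a minimal sketch of that coercion as a sort key (type-prefixed so that e.g. 1 and "1" stay distinct):

>>> def mixed_key(v):
...     # Sort by (type name, string form) to get a stable total
...     # order over mixed scalar types.
...     return (type(v).__name__, str(v))
...
>>> sorted([True, None, 1, "asdf"], key=mixed_key)
[None, True, 1, 'asdf']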


FuhuXia commented Aug 31, 2023

sansjson was added to ckanext-datajson to sort JSON before hashing. Issue resolved.
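
If I recall the sansjson interface correctly (treat the function name as an assumption on my part), usage is along these lines:

>>> import sansjson
>>> a = {"keyword": ["ocean", "climate"], "identifier": "doi-1"}
>>> b = {"identifier": "doi-1", "keyword": ["climate", "ocean"]}
>>> sansjson.sort_pyobject(a) == sansjson.sort_pyobject(b)
True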

FuhuXia closed this as completed on Aug 31, 2023

FuhuXia commented Sep 12, 2023

To demonstrate how the new hashing works, take the DOI EDI source as an example. This source was reliably changing 50% of its datasets every week, i.e. ~17,000 updates on every re-harvest, which took a whole day. With the new hashing, each harvest cuts the update count roughly in half: 17,000, then ~8,000, ~4,000, ..., eventually reaching 0 once every old hash has been converted to a new one.

