improve data.json record change detection #4414
Comments
We can use the solution in this StackOverflow thread. It recursively converts all dicts to lists and sorts every list. It is a one-way conversion, but the hash of the end object is guaranteed to be identical for a JSON record regardless of its key/array ordering.
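A minimal sketch of that approach (hypothetical helper names, not the project's actual code; it assumes every list holds mutually comparable, homogeneous elements):

```python
import hashlib
import json

def normalize(obj):
    # Dicts become sorted lists of (key, normalized value) pairs;
    # lists are normalized element-wise and then sorted. The conversion
    # is one-way, but the result is deterministic regardless of the
    # input's key/array ordering.
    if isinstance(obj, dict):
        return sorted((k, normalize(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return sorted(normalize(v) for v in obj)
    return obj

def content_hash(record):
    # Serialize the normalized structure and hash the text.
    return hashlib.sha256(json.dumps(normalize(record)).encode()).hexdigest()

a = {"keyword": ["ocean", "climate"], "title": "t"}
b = {"title": "t", "keyword": ["climate", "ocean"]}
print(content_hash(a) == content_hash(b))  # True
```

Because dict keys are unique, the pair-sorting only ever compares keys, so mixed value types inside a dict are safe; the homogeneity assumption only matters for lists.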
dict comparison python library
I believe the reason it wouldn't be so simple to use

```python
pkg = model_dictize.package_dictize(hobj[1], self.context())
if dataset['identifier'] in existing_datasets:
    pkg = existing_datasets[dataset["identifier"]]
    pkg_id = pkg["id"]
    seen_datasets.add(dataset['identifier'])
    # We store a hash of the dict associated with this dataset
    # in the package so we can avoid updating datasets that
    # don't look like they've changed.
    source_hash = self.find_extra(pkg, "source_hash")
    if source_hash is None:
        try:
            source_hash = json.loads(self.find_extra(pkg, "extras_rollup")).get("source_hash")
        except TypeError:
            source_hash = None
    if pkg.get("state") == "active" \
            and dataset['identifier'] not in existing_parents_demoted \
            and dataset['identifier'] not in existing_datasets_promoted \
            and source_hash == self.make_upstream_content_hash(dataset,
                                                               source,
                                                               catalog_extras,
                                                               schema_version):
        log.info('SKIP: {}'.format(dataset['identifier']))
        continue
else:
    pkg_id = uuid.uuid4().hex
```

Without having to edit
Note about @FuhuXia's solution: keep in mind that if the lists are non-homogeneous, the sorting algorithm will fail. But for this use case, it may be okay to assume lists will be homogeneous. (Just noting as something to perform error checking on.)

```python
>>> a = [True, None, 1, "asdf"]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'NoneType' and 'bool'
>>> a = [None, 1, "asdf"]
>>> sorted(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
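One way to guard against that (a sketch, not part of the posted solution) is to sort with a key that groups by type name first, so mixed-type elements are never compared to each other directly:

```python
import json

def mixed_sort_key(value):
    # Group by type name first so None, bools, numbers and strings
    # never hit a TypeError on comparison; dicts and lists fall back
    # to their canonical JSON text. The resulting order is arbitrary
    # but stable, which is all a content hash needs.
    if isinstance(value, (dict, list)):
        return (type(value).__name__, json.dumps(value, sort_keys=True))
    return (type(value).__name__, repr(value))

a = [True, None, 1, "asdf"]
print(sorted(a, key=mixed_sort_key))  # [None, True, 1, 'asdf']
```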
@FuhuXia Here is my solution as promised (published on PyPI):
@nickumia Nicely done.
sansjson was added to ckanext-datajson to sort JSON before hashing. Issue resolved.
To demo how the new hashing works, take the DOI EDI source as an example. This source reliably changes 50% of its datasets every week. That means 17,000 updates on every reharvest, which takes a whole day. With the new hashing, each harvest cuts the update count in half: 17,000, then 8,000, then 4,000, ... eventually reaching 0 once all old hashes have been converted to new ones.
We are using source_hash to detect changes on a data.json record; based on the hash comparison, harvesting decides to update or skip the record.
We have noticed that some dynamically generated data.json sources do not produce consistent output for their data.json records. Even when the content is the same, random ordering in certain fields (keyword, distribution, ...) gives the record a different hash. This creates a lot of harvesting overhead, updating the same record over and over.
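A minimal reproduction of the effect (illustrative field values, not taken from an actual source):

```python
import hashlib
import json

rec_a = {"identifier": "doi-123", "keyword": ["ocean", "climate"]}
rec_b = {"identifier": "doi-123", "keyword": ["climate", "ocean"]}

hash_a = hashlib.sha256(json.dumps(rec_a, sort_keys=True).encode()).hexdigest()
hash_b = hashlib.sha256(json.dumps(rec_b, sort_keys=True).encode()).hexdigest()

# Same content, different keyword order: the hashes differ, so the
# harvester treats the record as changed and updates it needlessly.
print(hash_a == hash_b)  # False
```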
How to reproduce
Check harvest jobs for DOI (with random distribution order) and NASA (with random keyword order).
Expected behavior
A dataset should not be updated if the content is the same but merely listed in a different order in certain fields.
Actual behavior
The dataset is updated by harvesting anyway.
Sketch
We have so far only observed values in a list appearing in random order, but key order, even if not observed yet, could also give a record a different hash.

For keys, json.dumps has a sort_keys=True option. For values in a list, we may have to come up with our own code to sort nested lists.
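For the key half of the problem, `json.dumps(..., sort_keys=True)` alone already canonicalizes key order; a quick check:

```python
import json

a = json.dumps({"title": "t", "keyword": ["x"]}, sort_keys=True)
b = json.dumps({"keyword": ["x"], "title": "t"}, sort_keys=True)
# Key order no longer affects the serialized text, so only the
# list-value ordering still needs a custom sorting pass.
print(a == b)  # True
```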