diff --git a/docs/source/getting-started.rst b/docs/source/getting-started.rst
index 1cccdf5d..ec51455d 100644
--- a/docs/source/getting-started.rst
+++ b/docs/source/getting-started.rst
@@ -315,12 +315,8 @@ Item validation
 ---------------
 
-Item validators allows you to match your returned items with predetermined structure
-ensuring that all fields contains data in the expected format. Spidermon allows
-you to choose between schematics_ or `JSON Schema`_ to define the structure
-of your item.
-
-In this tutorial, we will use a schematics_ model to make sure that all required
-fields are populated and they are all of the correct format.
+Item validators allow you to match your returned items against a predetermined
+structure, ensuring that all fields contain data in the expected format. Spidermon
+supports `JSON Schema`_ to define the structure of your item.
 
 First step is to change our actual spider code to use `Scrapy items`_. Create a
 new file called `items.py`:
@@ -367,25 +363,43 @@ And then modify the spider code to use the newly defined item:
         )
     )
 
-Now we need to create our schematics model in `validators.py` file that will contain
+Now we need to create our JSON Schema definition in the `schemas/quote_item.json` file that will contain
 all the validation rules:
 
 .. _quote-item-validation-schema:
 
-.. code-block:: python
-
-    # tutorial/validators.py
-    from schematics.models import Model
-    from schematics.types import URLType, StringType, ListType
-
-    class QuoteItem(Model):
-        quote = StringType(required=True)
-        author = StringType(required=True)
-        author_url = URLType(required=True)
-        tags = ListType(StringType)
+.. code-block:: json
+
+    {
+        "$schema": "http://json-schema.org/draft-07/schema",
+        "type": "object",
+        "properties": {
+            "quote": {
+                "type": "string"
+            },
+            "author": {
+                "type": "string"
+            },
+            "author_url": {
+                "type": "string",
+                "pattern": "^https?://"
+            },
+            "tags": {
+                "type": "array",
+                "items": {
+                    "type": "string"
+                }
+            }
+        },
+        "required": [
+            "quote",
+            "author",
+            "author_url"
+        ]
+    }
 
 To allow Spidermon to validate your items, you need to include an item pipeline and
-inform the name of the model class used for validation:
+inform the path of the JSON Schema file used for validation:
 
 .. code-block:: python
 
@@ -394,8 +408,8 @@ inform the name of the model class used for validation:
         'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
     }
 
-    SPIDERMON_VALIDATION_MODELS = (
-        'tutorial.validators.QuoteItem',
+    SPIDERMON_VALIDATION_SCHEMAS = (
+        './schemas/quote_item.json',
     )
 
 After that, every time you run your spider you will have a new set of stats in
@@ -408,7 +422,7 @@ your spider log providing information about the results of the validations:
     'spidermon/validation/fields': 400,
     'spidermon/validation/items': 100,
     'spidermon/validation/validators': 1,
-    'spidermon/validation/validators/item/schematics': True,
+    'spidermon/validation/validators/item/jsonschema': True,
     [scrapy.core.engine] INFO: Spider closed (finished)
 
 You can then create a new monitor that will check these new statistics and raise
@@ -473,7 +487,6 @@ The resulted item will look like this:
     }
 
 .. _`JSON Schema`: https://json-schema.org/
-.. _`schematics`: https://schematics.readthedocs.io/en/latest/
 .. _`Scrapy`: https://scrapy.org/
 .. _`Scrapy items`: https://docs.scrapy.org/en/latest/topics/items.html
 .. _`Scrapy Tutorial`: https://doc.scrapy.org/en/latest/intro/tutorial.html
diff --git a/docs/source/index.rst b/docs/source/index.rst
index aaa114e0..df115871 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,11 +11,8 @@ following features:
 
 * It can check the output data produced by Scrapy (or other sources) and
   verify it against a schema or model that defines the expected structure,
-  data types and value restrictions. It supports data validation based on two
-  external libraries:
-
-  * jsonschema: `<https://pypi.org/project/jsonschema/>`_
-  * Schematics: `<https://schematics.readthedocs.io/en/latest/>`_
+  data types and value restrictions. It supports data validation based on
+  the jsonschema library (`<https://pypi.org/project/jsonschema/>`_).
 * It allows you to define conditions that should trigger an alert based on
   Scrapy stats.
 * It supports notifications via email, Slack, Telegram and Discord.
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index 2c49966c..cce19c6f 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -9,15 +9,12 @@ build your monitors on top of it. The library depends on jsonschema_ and
 If you want to set up any notifications, additional `monitoring` dependencies
 will help with that.
 
-If you want to use schematics_ validation, you probably want `validation`.
-
-So the recommended way to install the library is by adding both:
+So the recommended way to install the library is:
 
 .. code-block:: bash
 
-    pip install "spidermon[monitoring,validation]"
+    pip install "spidermon[monitoring]"
 
 .. _`jsonschema`: https://pypi.org/project/jsonschema/
 .. _`python-slugify`: https://pypi.org/project/python-slugify/
-.. _`schematics`: https://pypi.org/project/schematics/
diff --git a/docs/source/item-validation.rst b/docs/source/item-validation.rst
index 540096d8..f86607a2 100644
--- a/docs/source/item-validation.rst
+++ b/docs/source/item-validation.rst
@@ -21,37 +21,8 @@ the first step is to enable the built-in item pipeline in your project settings:
 
    subsequent pipeline changes the content of the item, ignoring the validation
    already performed.
 
-After that, you need to choose which validation library will be used. Spidermon
-accepts schemas defined using schematics_ or `JSON Schema`_.
-
-With schematics
----------------
-
-Schematics_ is a validation library based on ORM-like models. These models include
-some common data types and validators, but they can also be extended to define
-custom validation rules.
-
-.. warning::
-
-   You need to install `schematics`_ to use this feature.
-
-.. code-block:: python
-
-    # Usually placed in validators.py file
-    from schematics.models import Model
-    from schematics.types import URLType, StringType, ListType
-
-    class QuoteItem(Model):
-        quote = StringType(required=True)
-        author = StringType(required=True)
-        author_url = URLType(required=True)
-        tags = ListType(StringType)
-
-Check `schematics documentation`_ to learn how to define a model and how to extend the
-built-in data types.
-
-With JSON Schema
-----------------
+Using JSON Schema
+-----------------
 
 `JSON Schema`_ is a powerful tool for validating the structure of JSON data. You
 can define which fields are required, the type assigned to each field, a regular expression
@@ -133,36 +104,6 @@ Default: ``_validation``
 
 The name of the field added to the item when a validation error happens and
 `SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS`_ is enabled.
 
-SPIDERMON_VALIDATION_MODELS
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Default: ``None``
-
-A `list` containing the `schematics models`_ that contain the definition of the items
-that need to be validated.
-
-.. code-block:: python
-
-    # settings.py
-
-    SPIDERMON_VALIDATION_MODELS = [
-        'tutorial.validators.DummyItemModel'
-    ]
-
-If you are working on a spider that produces multiple items types, you can define it
-as a `dict`:
-
-.. code-block:: python
-
-    # settings.py
-
-    from tutorial.items import DummyItem, OtherItem
-
-    SPIDERMON_VALIDATION_MODELS = {
-        DummyItem: 'tutorial.validators.DummyItemModel',
-        OtherItem: 'tutorial.validators.OtherItemModel',
-    }
-
 SPIDERMON_VALIDATION_SCHEMAS
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -235,9 +176,6 @@ Some examples:
 
     # checks that no errors is present in any fields
     self.check_field_errors_percent()
 
-.. _`schematics`: https://schematics.readthedocs.io/en/latest/
-.. _`schematics documentation`: https://schematics.readthedocs.io/en/latest/
 .. _`JSON Schema`: https://json-schema.org/
 .. _`guide`: http://json-schema.org/learn/getting-started-step-by-step.html
-.. _`schematics models`: https://schematics.readthedocs.io/en/latest/usage/models.html
 .. _`jsonschema`: https://pypi.org/project/jsonschema/
diff --git a/docs/source/settings.rst b/docs/source/settings.rst
index 9c723d88..c6b884bb 100644
--- a/docs/source/settings.rst
+++ b/docs/source/settings.rst
@@ -182,3 +182,85 @@ If this setting is not provided or set to ``False``, spider statistics will be:
     'spidermon_item_scraped_count/dict/field_2': 2,
     'spidermon_field_coverage/dict/field_1': 1,  # Did not ignore None value
     'spidermon_item_scraped_count/dict/field_2': 1,
+
+SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS
+-------------------------------------
+Default: ``0``
+
+If set to a value larger than ``0``, field coverage will also be computed for items
+inside fields that are lists. The number represents how deep in the object tree
+coverage is computed. Be aware that enabling this might have a significant impact
+on performance.
+
+Suppose your spider returns the following items:
+
+.. code-block:: python
+
+    [
+        {
+            "field_1": None,
+            "field_2": [{"nested_field1": "value", "nested_field2": "value"}],
+        },
+        {
+            "field_1": "value",
+            "field_2": [
+                {"nested_field2": "value", "nested_field3": {"deeper_field1": "value"}}
+            ],
+        },
+        {
+            "field_1": "value",
+            "field_2": [
+                {
+                    "nested_field2": "value",
+                    "nested_field4": [
+                        {"deeper_field41": "value"},
+                        {"deeper_field41": "value"},
+                    ],
+                }
+            ],
+        },
+    ]
+
+If this setting is not provided or set to ``0``, spider statistics will be:
+
+.. code-block:: python
+
+    'item_scraped_count': 3,
+    'spidermon_item_scraped_count': 3,
+    'spidermon_item_scraped_count/dict': 3,
+    'spidermon_item_scraped_count/dict/field_1': 3,
+    'spidermon_item_scraped_count/dict/field_2': 3
+
+If set to ``1``, spider statistics will be:
+
+.. code-block:: python
+
+    'item_scraped_count': 3,
+    'spidermon_item_scraped_count': 3,
+    'spidermon_item_scraped_count/dict': 3,
+    'spidermon_item_scraped_count/dict/field_1': 3,
+    'spidermon_item_scraped_count/dict/field_2': 3,
+    'spidermon_item_scraped_count/dict/field_2/_items': 3,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field1': 1,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field2': 3,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field3': 1,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field3/deeper_field1': 1,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field4': 1
+
+If set to ``2``, spider statistics will be:
+
+.. code-block:: python
+
+    'item_scraped_count': 3,
+    'spidermon_item_scraped_count': 3,
+    'spidermon_item_scraped_count/dict': 3,
+    'spidermon_item_scraped_count/dict/field_1': 3,
+    'spidermon_item_scraped_count/dict/field_2': 3,
+    'spidermon_item_scraped_count/dict/field_2/_items': 3,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field1': 1,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field2': 3,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field3': 1,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field3/deeper_field1': 1,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field4': 1,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field4/_items': 2,
+    'spidermon_item_scraped_count/dict/field_2/_items/nested_field4/_items/deeper_field41': 2
+
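+Note that these per-field counters are only collected when field coverage is enabled,
+so this setting is meant to be used together with ``SPIDERMON_ADD_FIELD_COVERAGE``
+(a minimal sketch; the depth value here is arbitrary):
+
+.. code-block:: python
+
+    # settings.py
+    SPIDERMON_ADD_FIELD_COVERAGE = True
+    # Compute coverage one level deep inside list fields
+    SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS = 1
+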
diff --git a/examples/tutorial/.scrapy/stats/quotes_stats_history b/examples/tutorial/.scrapy/stats/quotes_stats_history
new file mode 100644
index 00000000..ce738c40
Binary files /dev/null and b/examples/tutorial/.scrapy/stats/quotes_stats_history differ
diff --git a/examples/tutorial/tutorial/schemas/quote_item.json b/examples/tutorial/tutorial/schemas/quote_item.json
new file mode 100644
index 00000000..cc22bd48
--- /dev/null
+++ b/examples/tutorial/tutorial/schemas/quote_item.json
@@ -0,0 +1,27 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema",
+    "type": "object",
+    "properties": {
+        "quote": {
+            "type": "string"
+        },
+        "author": {
+            "type": "string"
+        },
+        "author_url": {
+            "type": "string",
+            "pattern": "^https?://"
+        },
+        "tags": {
+            "type": "array",
+            "items": {
+                "type": "string"
+            }
+        }
+    },
+    "required": [
+        "quote",
+        "author",
+        "author_url"
+    ]
+}
\ No newline at end of file
diff --git a/examples/tutorial/tutorial/settings.py b/examples/tutorial/tutorial/settings.py
index 61cc8e2f..13ac374f 100644
--- a/examples/tutorial/tutorial/settings.py
+++ b/examples/tutorial/tutorial/settings.py
@@ -15,7 +15,7 @@ SPIDERMON_SLACK_RECIPIENTS = ["@yourself", "#yourprojectchannel"]
 
 ITEM_PIPELINES = {"spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800}
 
-SPIDERMON_VALIDATION_MODELS = ("tutorial.validators.QuoteItem",)
+SPIDERMON_VALIDATION_SCHEMAS = ("../schemas/quote_item.json",)
 
 SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True
diff --git a/examples/tutorial/tutorial/validators.py b/examples/tutorial/tutorial/validators.py
deleted file mode 100644
index 4f8b96e1..00000000
--- a/examples/tutorial/tutorial/validators.py
+++ /dev/null
@@ -1,9 +0,0 @@
-from schematics.models import Model
-from schematics.types import URLType, StringType, ListType
-
-
-class QuoteItem(Model):
-    quote = StringType(required=True)
-    author = StringType(required=True)
-    author_url = URLType(required=True)
-    tags = ListType(StringType)
diff --git a/requirements.txt b/requirements.txt
index da47d033..5c194e91 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,7 +3,6 @@
 slack-sdk
 boto
 premailer
 jsonschema[format]
-schematics==2.1.0
 python-slugify
 scrapy
 pytest
diff --git a/setup.py b/setup.py
index 3d0fd59c..e072050a 100644
--- a/setup.py
+++ b/setup.py
@@ -43,8 +43,6 @@
         "premailer",
         "sentry-sdk",
     ],
-    # Data validation
-    "validation": ["schematics"],
     # Tools to run the tests
     "tests": test_requirements,
     # Tools to build and publish the documentation
diff --git a/spidermon/contrib/scrapy/extensions.py b/spidermon/contrib/scrapy/extensions.py
index 187a66a9..a84ddf46 100644
--- a/spidermon/contrib/scrapy/extensions.py
+++ b/spidermon/contrib/scrapy/extensions.py
@@ -108,6 +108,7 @@ def from_crawler(cls, crawler):
         crawler.signals.connect(ext.engine_stopped, signal=signals.engine_stopped)
 
         has_field_coverage = crawler.settings.getbool("SPIDERMON_ADD_FIELD_COVERAGE")
+
         if has_field_coverage:
             crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
 
@@ -132,7 +133,14 @@ def engine_stopped(self):
         spider = self.crawler.spider
         self._run_suites(spider, self.engine_stopped_suites)
 
-    def _count_item(self, item, skip_none_values, item_count_stat=None):
+    def _count_item(
+        self,
+        item,
+        skip_none_values,
+        item_count_stat=None,
+        max_list_nesting_level=0,
+        nesting_level=0,
+    ):
         if item_count_stat is None:
             item_type = type(item).__name__
             item_count_stat = f"spidermon_item_scraped_count/{item_type}"
@@ -149,6 +157,24 @@
                 self._count_item(value, skip_none_values, field_item_count_stat)
                 continue
 
+            if (
+                isinstance(value, list)
+                and max_list_nesting_level > 0
+                and nesting_level < max_list_nesting_level
+            ):
+                items_count_stat = f"{field_item_count_stat}/_items"
+                for list_item in value:
+                    self.crawler.stats.inc_value(items_count_stat)
+                    if isinstance(list_item, dict):
+                        self._count_item(
+                            list_item,
+                            skip_none_values,
+                            items_count_stat,
+                            max_list_nesting_level=max_list_nesting_level,
+                            nesting_level=nesting_level + 1,
+                        )
+                continue
+
     def _add_field_coverage_to_stats(self):
         stats = self.crawler.stats.get_stats()
         coverage_stats = calculate_field_coverage(stats)
@@ -158,8 +184,14 @@
     def item_scraped(self, item, response, spider):
         skip_none_values = spider.crawler.settings.getbool(
             "SPIDERMON_FIELD_COVERAGE_SKIP_NONE", False
         )
+        list_field_coverage_levels = spider.crawler.settings.getint(
+            "SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS", 0
+        )
+
         self.crawler.stats.inc_value("spidermon_item_scraped_count")
-        self._count_item(item, skip_none_values)
+        self._count_item(
+            item, skip_none_values, max_list_nesting_level=list_field_coverage_levels
+        )
 
     def _run_periodic_suites(self, spider, suites):
         suites = [self.load_suite(s) for s in suites]
diff --git a/spidermon/contrib/scrapy/monitors/monitors.py b/spidermon/contrib/scrapy/monitors/monitors.py
index 1be76e3c..27488873 100644
--- a/spidermon/contrib/scrapy/monitors/monitors.py
+++ b/spidermon/contrib/scrapy/monitors/monitors.py
@@ -380,6 +380,27 @@ class FieldCoverageMonitor(BaseScrapyMonitor):
     You are not obligated to set rules for every field, just for the ones in which
     you are interested. Also, you can monitor nested fields if available in your
     returned items.
 
+    If a field returned by your spider is a list of dicts (or objects) and you want to check the
+    coverage of their fields, that is also possible. You need to set the
+    ``SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS`` setting. This value represents how many levels deep
+    inside the list coverage will be computed (useful when the objects inside the list themselves
+    have fields that are objects or lists).
+    The coverage for list fields is computed in two ways: with respect to the total number of
+    scraped items (these values can be greater than 1) and with respect to the total number of
+    items in the list. The stats have the following form:
+
+    .. code-block:: python
+
+        {
+            "spidermon_field_coverage/dict/field2/_items/nested_field1": "some_value",
+            "spidermon_field_coverage/dict/field2/nested_field1": "other_value",
+        }
+
+    The stat containing ``_items`` is calculated from the total number of list items, while the
+    other is calculated from the total number of scraped items.
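+
+    For example, with the stats above you could require that at least 60% of all
+    ``field2`` list entries contain ``nested_field1``, or that, on average, every
+    scraped item yields at least two such entries (thresholds here are illustrative):
+
+    .. code-block:: python
+
+        SPIDERMON_FIELD_COVERAGE_RULES = {
+            # ratio relative to the number of list entries
+            "dict/field2/_items/nested_field1": 0.6,
+            # ratio relative to the number of scraped items (may exceed 1)
+            "dict/field2/nested_field1": 2.0,
+        }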
+
+    If the objects in the list contain another list field, its coverage is also computed in
+    both ways; in that case, the total used for the ``_items`` stat is the number of items in
+    the innermost list.
+
     In case you have a job without items scraped, and you want to skip this test, you have to enable
     the ``SPIDERMON_FIELD_COVERAGE_SKIP_IF_NO_ITEM`` setting to avoid the field coverage monitor
     error.
@@ -410,7 +431,9 @@ class MyCustomItem(scrapy.Item):
         SPIDERMON_FIELD_COVERAGE_RULES = {
             "MyCustomItem/field_1": 0.4,
             "MyCustomItem/field_2": 1.0,
-        }"""
+        }
+
+    """
 
     def run(self, result):
         add_field_coverage_set = self.crawler.settings.getbool(
diff --git a/spidermon/contrib/scrapy/pipelines.py b/spidermon/contrib/scrapy/pipelines.py
index f0cb299f..d44a979e 100644
--- a/spidermon/contrib/scrapy/pipelines.py
+++ b/spidermon/contrib/scrapy/pipelines.py
@@ -2,12 +2,10 @@
 from itemadapter import ItemAdapter
 from scrapy.exceptions import DropItem, NotConfigured
-from scrapy.utils.misc import load_object
-from scrapy import Field, Item
+from scrapy import Item
 
-from spidermon.contrib.validation import SchematicsValidator, JSONSchemaValidator
+from spidermon.contrib.validation import JSONSchemaValidator
 from spidermon.contrib.validation.jsonschema.tools import get_schema_from
-from schematics.models import Model
 
 from .stats import ValidationStatsManager
 
@@ -59,7 +57,6 @@ def set_validators(loader, schema):
 
         for loader, name in [
             (cls._load_jsonschema_validator, "SPIDERMON_VALIDATION_SCHEMAS"),
-            (cls._load_schematics_validator, "SPIDERMON_VALIDATION_MODELS"),
         ]:
             res = crawler.settings.get(name)
             if not res:
@@ -100,15 +97,6 @@ def _load_jsonschema_validator(cls, schema):
             )
         return JSONSchemaValidator(schema)
 
-    @classmethod
-    def _load_schematics_validator(cls, model_path):
-        model_class = load_object(model_path)
-        if not issubclass(model_class, Model):
-            raise NotConfigured(
-                "Invalid model, models must subclass schematics.models.Model"
-            )
-        return SchematicsValidator(model_class)
-
     def process_item(self, item, _):
         validators = self.find_validators(item)
         if not validators:
diff --git a/spidermon/contrib/validation/__init__.py b/spidermon/contrib/validation/__init__.py
index 5244dafd..ff8ac408 100644
--- a/spidermon/contrib/validation/__init__.py
+++ b/spidermon/contrib/validation/__init__.py
@@ -1,2 +1 @@
-from .schematics.validator import SchematicsValidator
 from .jsonschema.validator import JSONSchemaValidator
diff --git a/spidermon/contrib/validation/schematics/__init__.py b/spidermon/contrib/validation/schematics/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/spidermon/contrib/validation/schematics/monkeypatches.py b/spidermon/contrib/validation/schematics/monkeypatches.py
deleted file mode 100644
index e87da3fa..00000000
--- a/spidermon/contrib/validation/schematics/monkeypatches.py
+++ /dev/null
@@ -1,39 +0,0 @@
-import schematics
-
-
-def monkeypatch_urltype():
-    """
-    Replace schematics URL check regex with a better one (stolen from Django).
-
-    This patch cannot be applied to Schematics 2.* because the URL validation
-    is more complex.
- """ - from schematics.types import URLType - from spidermon.contrib.validation.utils import URL_REGEX - - URLType.URL_REGEX = URL_REGEX - - -def monkeypatch_listtype(): - """ - Replace ListType list conversion method to avoid errors - """ - from schematics.transforms import EMPTY_LIST - from schematics.types.compound import ListType - from schematics.exceptions import ConversionError - - def _force_list(self, value): - if value is None or value == EMPTY_LIST: - return [] - try: - return list(value) - except Exception as e: - raise ConversionError("Invalid list") - - ListType._force_list = _force_list - - -# Apply monkeypatches -if schematics.__version__.startswith("1."): - monkeypatch_urltype() - monkeypatch_listtype() diff --git a/spidermon/contrib/validation/schematics/translator.py b/spidermon/contrib/validation/schematics/translator.py deleted file mode 100644 index 0482d9ed..00000000 --- a/spidermon/contrib/validation/schematics/translator.py +++ /dev/null @@ -1,58 +0,0 @@ -from spidermon.contrib.validation.translator import MessageTranslator -from spidermon.contrib.validation import messages - - -class SchematicsMessageTranslator(MessageTranslator): - messages = { - r"^Rogue field$": messages.UNEXPECTED_FIELD, - # BaseType - r"^This field is required.$": messages.MISSING_REQUIRED_FIELD, - r"^Value \(.*?\) must be one of \[.*?\]\.$": messages.VALUE_NOT_IN_CHOICES, - # StringType - r"^Couldn't interpret '.*' as string\.$": messages.INVALID_STRING, - r"^String value is too long\.$": messages.FIELD_TOO_LONG, - r"^String value is too short\.$": messages.FIELD_TOO_SHORT, - r"^String value did not match validation regex\.$": messages.REGEX_NOT_MATCHED, - # DateTimeType - r"^Could not parse .+\. Should be ISO ?8601(?: or timestamp)?\.$": messages.INVALID_DATETIME, - r"^Could not parse .+\. Valid formats: .+$": messages.INVALID_DATETIME, - # DateType - r"^Could not parse .+\. 
Should be ISO ?8601 \(YYYY-MM-DD\)\.$": messages.INVALID_DATE, - # NumberType - r"^.+ value should be greater than .+$": messages.NUMBER_TOO_LOW, - r"^.+ value should be less than .+$": messages.NUMBER_TOO_HIGH, - # IntType - r"^Value '.*' is not int\.?$": messages.INVALID_INT, - # FloatType - r"^Value '.*' is not float\.?$": messages.INVALID_FLOAT, - # LongType - r"^Value '.*' is not long\.?$": messages.INVALID_LONG, - # Decimalype - r"^Number '.*' failed to convert to a decimal\.?$": messages.INVALID_DECIMAL, - r"^Value '.*' is not decimal\.?$": messages.INVALID_DECIMAL, - r"^Value should be greater than .+$": messages.NUMBER_TOO_LOW, - r"^Value should be less than .+$": messages.NUMBER_TOO_HIGH, - # BooleanType - r"^Must be either true or false\.$": messages.INVALID_BOOLEAN, - # EmailType - r"^Not a well[ -]formed email address\.$": messages.INVALID_EMAIL, - # URLType - r"^Not a well[ -]formed URL\.$": messages.INVALID_URL, - # UUIDType - r"^Couldn't interpret '.*' value as UUID\.$": messages.INVALID_UUID, - # IPv4Type - r"^Invalid IPv4 address$": messages.INVALID_IPV4, - # HashType - r"^Hash value is wrong length\.$": messages.INVALID_HASH_LENGTH, - r"^Hash value is not hexadecimal\.$": messages.INVALID_HASH, - # ListType - r"^Invalid list$": messages.INVALID_LIST, - r"^Could not interpret the value as a list$": messages.INVALID_LIST, - r"^Please provide at least \d+ items?\.$": messages.LIST_TOO_SHORT, - r"^Please provide no more than \d+ items?\.$": messages.LIST_TOO_LONG, - # DictType - r"^Only (?:dictionaries|mappings) may be used in a DictType$": messages.INVALID_DICT, - # DictType - r"^Please use a mapping for this field or .+ " - r"instance instead of .*\.$": messages.INVALID_CHILD_CONTENT, - } diff --git a/spidermon/contrib/validation/schematics/validator.py b/spidermon/contrib/validation/schematics/validator.py deleted file mode 100644 index 9e0aa59d..00000000 --- a/spidermon/contrib/validation/schematics/validator.py +++ /dev/null @@ -1,115 +0,0 @@ -import re - -import schematics -from schematics.exceptions import ModelValidationError, ModelConversionError - -from spidermon.contrib.validation.validator import Validator -from .translator import SchematicsMessageTranslator -from . 
import monkeypatches - - -class SchematicsValidator(Validator): - default_translator = SchematicsMessageTranslator() - name = "Schematics" - - def __init__(self, model, translator=None, use_default_translator=True): - super().__init__( - translator=translator, use_default_translator=use_default_translator - ) - self._model = model - self._fields_required = {} - self._save_required_fields() - self._data = {} - - def _validate(self, data, strict=False): - self._set_data(data) - model = self._get_model_instance(strict=strict) - try: - model.validate() - except ModelValidationError as e: - self._add_errors(e.messages) - self._restore_required_fields() - - def _reset(self): - super()._reset() - self._data = {} - - def _set_data(self, data): - self._data = dict(data) - - def _get_model_instance(self, strict): - try: - return self._model(raw_data=self._data, strict=strict) - except ModelConversionError as e: - self._add_errors(e.messages) - for field_name in e.messages.keys(): - self._set_field_as_not_required(field_name) - self._data.pop(field_name) - return self._get_model_instance(strict=strict) - - def _save_required_fields(self): - for field_name, field in self._model._fields.items(): - self._fields_required[field_name] = field.required - - def _restore_required_fields(self): - for field_name, required in self._fields_required.items(): - self._model._fields[field_name].required = required - - def _set_field_as_not_required(self, field_name): - if field_name in self._model._fields: - self._model._fields[field_name].required = False - - def _add_errors(self, errors): - if schematics.__version__.startswith("1."): - for field_name, messages in errors.items(): - if isinstance(messages, dict): - transformed_errors = self._get_transformed_child_errors( - field_name, messages - ) - self._add_errors(transformed_errors) - else: - self._errors[field_name] += ( - messages if isinstance(messages, list) else [messages] - ) - else: - from schematics.datastructures import FrozenDict - - for field_name, messages in errors.items(): - if isinstance(messages, (dict, FrozenDict)): - transformed_errors = self._get_transformed_child_errors( - field_name, messages - ) - self._add_errors(transformed_errors) - else: - messages = self._clean_messages(messages) - self._errors[field_name] += messages - - def _get_transformed_child_errors(self, field_name, errors): - return {f"{field_name}.{k}": v for k, v in errors.items()} - - def _clean_messages(self, messages): - """ - This is necessary when using Schematics 2.*, because it encapsulates - the validation error messages in a different way. 
- """ - from schematics.exceptions import BaseError, ErrorMessage - from schematics.datastructures import FrozenList - - if type(messages) not in (list, FrozenList): - messages = [messages] - - clean_messages = [] - for message in messages: - if isinstance(message, BaseError): - message = message.messages - - if isinstance(message, ErrorMessage): - clean_messages.append(message.summary) - elif isinstance(message, FrozenList): - for err in message: - # err is an ErrorMessage object - clean_messages.append(err.summary) - else: - clean_messages.append(message) - - return clean_messages diff --git a/spidermon/utils/field_coverage.py b/spidermon/utils/field_coverage.py index 54e620d4..7fdded8f 100644 --- a/spidermon/utils/field_coverage.py +++ b/spidermon/utils/field_coverage.py @@ -15,10 +15,33 @@ def calculate_field_coverage(stats): item_key = item_type_m.group(2) item_type_total = stats.get(f"spidermon_item_scraped_count/{item_type}") - field_coverage = value / item_type_total - coverage[ - f"spidermon_field_coverage/{item_type}/{item_key}" - ] = field_coverage + if "_items" in item_key: + if item_key.endswith("_items"): + continue + + levels = item_key.split("/_items/") + + root_field_type_total = stats.get( + f"spidermon_item_scraped_count/{item_type}/{'/_items/'.join(levels[:-1])}/_items" + ) + + item_field_coverage = value / root_field_type_total + global_field_coverage = value / item_type_total + + coverage[ + f"spidermon_field_coverage/{item_type}/{'/'.join(levels)}" + ] = global_field_coverage + + coverage[ + f"spidermon_field_coverage/{item_type}/{'/_items/'.join(levels)}" + ] = item_field_coverage + + else: + field_coverage = value / item_type_total + + coverage[ + f"spidermon_field_coverage/{item_type}/{item_key}" + ] = field_coverage return coverage diff --git a/tests/contrib/scrapy/test_pipelines.py b/tests/contrib/scrapy/test_pipelines.py index 7830e282..db6b658f 100644 --- a/tests/contrib/scrapy/test_pipelines.py +++ b/tests/contrib/scrapy/test_pipelines.py @@ -15,10 +15,6 @@ STATS_TYPES = "spidermon/validation/validators/{}/{}" SETTING_SCHEMAS = "SPIDERMON_VALIDATION_SCHEMAS" -SETTING_MODELS = "SPIDERMON_VALIDATION_MODELS" - -TREE_VALIDATOR_PATH = "tests.fixtures.validators.TreeValidator" -TEST_VALIDATOR_PATH = "tests.fixtures.validators.TestValidator" class PipelineTestCaseMetaclass(type): @@ -255,124 +251,3 @@ def test_add_errors_to_item_prefilled(self): "prefilled", "some_message", ] - - -class PipelineModelValidator(PipelineTest): - assert_type_in_stats = partial(assert_type_in_stats, "schematics") - - data_tests = [ - DataTest( - name="processing usual item without errors", - item=TestItem({"url": "http://example.com"}), - settings={SETTING_MODELS: [TEST_VALIDATOR_PATH]}, - cases=[ - f"'{STATS_ITEM_ERRORS}' not in {{stats}}", - f"{{stats}}['{STATS_AMOUNTS}'] is 1", - assert_type_in_stats(Item), - ], - ), - DataTest( - name="processing item with url problem", - item=TestItem({"url": "example.com"}), - settings={SETTING_MODELS: [TEST_VALIDATOR_PATH]}, - cases=f"'{STATS_ITEM_ERRORS}' in {{stats}}", - ), - DataTest( - name="processing nested items without errors", - item=TreeItem({"child": TreeItem()}), - settings={SETTING_MODELS: [TREE_VALIDATOR_PATH]}, - cases=[ - f"'{STATS_ITEM_ERRORS}' not in {{stats}}", - f"{{stats}}['{STATS_AMOUNTS}'] is 1", - assert_type_in_stats(Item), - ], - ), - DataTest( - name="missing required fields", - item=TestItem(), - settings={SETTING_MODELS: [TEST_VALIDATOR_PATH]}, - cases=f"'{STATS_MISSINGS}' in {{stats}}", - ), - DataTest( - 
name="validator is {} type, validators in list repr".format( - TestItem.__name__ - ), - item=TestItem(), - settings={SETTING_MODELS: {TestItem: [TEST_VALIDATOR_PATH]}}, - cases=[ - f"'{STATS_ITEM_ERRORS}' in {{stats}}", - assert_type_in_stats(TestItem), - ], - ), - DataTest( - name="support several schema validators per item", - item=TestItem(), - settings={ - SETTING_MODELS: {TestItem: [TEST_VALIDATOR_PATH, TREE_VALIDATOR_PATH]} - }, - cases=[ - f"{{stats}}['{STATS_AMOUNTS}'] is 2", - f"{{stats}}['{STATS_ITEM_ERRORS}'] is 2", - ], - ), - DataTest( - name="item of one type processed only by proper validator", - item=TestItem({"url": "http://example.com"}), - settings={ - SETTING_MODELS: { - TestItem: TEST_VALIDATOR_PATH, - TreeItem: TREE_VALIDATOR_PATH, - } - }, - cases=f"'{STATS_ITEM_ERRORS}' not in {{stats}}", - ), - DataTest( - name="each item processed by proper validator", - item=TreeItem(), - settings={ - SETTING_MODELS: { - TestItem: TEST_VALIDATOR_PATH, - TreeItem: TREE_VALIDATOR_PATH, - } - }, - cases=[ - f"{{stats}}['{STATS_MISSINGS}'] is 1", - assert_type_in_stats(TestItem), - assert_type_in_stats(TreeItem), - ], - ), - ] - - -class PipelineValidators(PipelineTest): - data_tests = [ - DataTest( - name=f"there are both validators per {Item.__name__} type", - item=TestItem(), - settings={ - SETTING_SCHEMAS: [test_schema], - SETTING_MODELS: [TEST_VALIDATOR_PATH], - }, - cases=[ - f"{{stats}}['{STATS_AMOUNTS}'] is 2", - f"{{stats}}['{STATS_ITEM_ERRORS}'] is 2", - assert_type_in_stats("jsonschema", Item), - assert_type_in_stats("schematics", Item), - ], - ), - DataTest( - name="proper validators handle only related items", - item=TestItem({"url": "http://example.com"}), - settings={ - SETTING_SCHEMAS: {TestItem: test_schema, TreeItem: tree_schema}, - SETTING_MODELS: {Item: TEST_VALIDATOR_PATH}, - }, - cases=[ - f"{{stats}}['{STATS_AMOUNTS}'] is 3", - f"'{STATS_ITEM_ERRORS}' not in {{stats}}", - assert_type_in_stats("jsonschema", TestItem), - assert_type_in_stats("jsonschema", TreeItem), - assert_type_in_stats("schematics", Item), - ], - ), - ] diff --git a/tests/fixtures/validators.py b/tests/fixtures/validators.py index 4b3b7dfb..b207f619 100644 --- a/tests/fixtures/validators.py +++ b/tests/fixtures/validators.py @@ -1,15 +1,4 @@ import json -from schematics.models import Model -from schematics.types import URLType, StringType, BaseType - - -class TestValidator(Model): - url = URLType(required=True) - title = StringType() - - -class TreeValidator(Model): - child = BaseType(required=True) tree_schema = { diff --git a/tests/test_item_scraped_signal.py b/tests/test_item_scraped_signal.py index a0fc1b47..784dcec8 100644 --- a/tests/test_item_scraped_signal.py +++ b/tests/test_item_scraped_signal.py @@ -306,3 +306,295 @@ def test_item_scraped_count_do_not_ignore_none_values_by_default(spider): assert stats.get("spidermon_item_scraped_count/dict/field1") == 2 assert stats.get("spidermon_item_scraped_count/dict/field2") == 2 + + +def test_item_scraped_count_list_of_dicts_disabled(spider): + settings = { + "SPIDERMON_ENABLED": True, + "EXTENSIONS": {"spidermon.contrib.scrapy.extensions.Spidermon": 100}, + "SPIDERMON_ADD_FIELD_COVERAGE": True, + "SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS": 0, + } + crawler = get_crawler(settings_dict=settings) + spider = Spider.from_crawler(crawler, "example.com") + returned_items = [ + { + "field1": 1, + "field2": [ + { + "nested_field1": 1, + "nested_field2": 1, + "nested_field3": [ + {"deep_field1": 1}, + {"deep_field1": 1}, + {"deep_field2": 1}, + ], + }, + 
{"nested_field2": 1}, + ], + }, + { + "field1": 1, + "field2": [ + {"nested_field1": 1}, + { + "nested_field1": 1, + "nested_field4": {"deep_field1": 1, "deep_field2": 1}, + }, + {"nested_field1": 1, "nested_field2": 1}, + ], + }, + ] + + for item in returned_items: + spider.crawler.signals.send_catch_log_deferred( + signal=signals.item_scraped, + item=item, + response="", + spider=spider, + ) + + stats = spider.crawler.stats.get_stats() + + assert stats.get("spidermon_item_scraped_count/dict/field1") == 2 + assert stats.get("spidermon_item_scraped_count/dict/field2") == 2 + + assert stats.get("spidermon_item_scraped_count/dict/field2/_items") == None + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field1") + == None + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field2") + == None + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field3") + == None + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items" + ) + == None + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field1" + ) + == None + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field2" + ) + == None + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field4") + == None + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field1" + ) + == None + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field2" + ) + == None + ) + + +def test_item_scraped_count_list_of_dicts_one_nesting_level(spider): + settings = { + "SPIDERMON_ENABLED": True, + "EXTENSIONS": {"spidermon.contrib.scrapy.extensions.Spidermon": 100}, + "SPIDERMON_ADD_FIELD_COVERAGE": True, + "SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS": 1, + } + crawler = get_crawler(settings_dict=settings) + spider = Spider.from_crawler(crawler, "example.com") + returned_items = [ + { + "field1": 1, + "field2": [ + { + "nested_field1": 1, + "nested_field2": 1, + "nested_field3": [ + {"deep_field1": 1}, + {"deep_field1": 1}, + {"deep_field2": 1}, + ], + }, + {"nested_field2": 1}, + ], + }, + { + "field1": 1, + "field2": [ + {"nested_field1": 1}, + { + "nested_field1": 1, + "nested_field4": {"deep_field1": 1, "deep_field2": 1}, + }, + {"nested_field1": 1, "nested_field2": 1}, + ], + }, + ] + + for item in returned_items: + spider.crawler.signals.send_catch_log_deferred( + signal=signals.item_scraped, + item=item, + response="", + spider=spider, + ) + + stats = spider.crawler.stats.get_stats() + + assert stats.get("spidermon_item_scraped_count/dict/field1") == 2 + assert stats.get("spidermon_item_scraped_count/dict/field2") == 2 + + assert stats.get("spidermon_item_scraped_count/dict/field2/_items") == 5 + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field1") == 4 + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field2") == 3 + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field3") == 1 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items" + ) + == None + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field1" + ) + == None + ) + assert ( + stats.get( + 
"spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field2" + ) + == None + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field4") == 1 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field1" + ) + == 1 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field2" + ) + == 1 + ) + + +def test_item_scraped_count_list_of_dicts_two_nesting_levels(spider): + settings = { + "SPIDERMON_ENABLED": True, + "EXTENSIONS": {"spidermon.contrib.scrapy.extensions.Spidermon": 100}, + "SPIDERMON_ADD_FIELD_COVERAGE": True, + "SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS": 2, + } + crawler = get_crawler(settings_dict=settings) + spider = Spider.from_crawler(crawler, "example.com") + returned_items = [ + { + "field1": 1, + "field2": [ + { + "nested_field1": 1, + "nested_field2": 1, + "nested_field3": [ + {"deep_field1": 1}, + {"deep_field1": 1}, + {"deep_field2": 1}, + ], + }, + {"nested_field2": 1}, + ], + }, + { + "field1": 1, + "field2": [ + {"nested_field1": 1}, + { + "nested_field1": 1, + "nested_field4": {"deep_field1": 1, "deep_field2": 1}, + }, + {"nested_field1": 1, "nested_field2": 1}, + ], + }, + ] + + for item in returned_items: + spider.crawler.signals.send_catch_log_deferred( + signal=signals.item_scraped, + item=item, + response="", + spider=spider, + ) + + stats = spider.crawler.stats.get_stats() + + assert stats.get("spidermon_item_scraped_count/dict/field1") == 2 + assert stats.get("spidermon_item_scraped_count/dict/field2") == 2 + + assert stats.get("spidermon_item_scraped_count/dict/field2/_items") == 5 + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field1") == 4 + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field2") == 3 + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field3") == 1 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items" + ) + == 3 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field1" + ) + == 2 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field2" + ) + == 1 + ) + assert ( + stats.get("spidermon_item_scraped_count/dict/field2/_items/nested_field4") == 1 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field1" + ) + == 1 + ) + assert ( + stats.get( + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field2" + ) + == 1 + ) diff --git a/tests/test_validators_schematics.py b/tests/test_validators_schematics.py deleted file mode 100644 index 22efa641..00000000 --- a/tests/test_validators_schematics.py +++ /dev/null @@ -1,779 +0,0 @@ -import schematics -from schematics.exceptions import ValidationError -from schematics.models import Model -from schematics.types import ( - StringType, - DateTimeType, - DateType, - FloatType, - IntType, - LongType, - DecimalType, - BooleanType, - EmailType, - URLType, - UUIDType, - IPv4Type, - MD5Type, - SHA1Type, -) -from schematics.types.compound import ListType, DictType, ModelType - -from spidermon.contrib.validation import SchematicsValidator, messages - - -SCHEMATICS1 = schematics.__version__.startswith("1.") - - -def test_rogue_fields(): - """ - messages: - - UNEXPECTED_FIELD - """ - _test_data( - model=Model, data={"a": 1}, expected=(False, {"a": 
[messages.UNEXPECTED_FIELD]}) - ) - _test_data(model=Model, data={"a": 1}, expected=(True, {}), strict=False) - - -def test_required(): - """ - messages: - - MISSING_REQUIRED_FIELD - """ - - class DataRequired(Model): - a = StringType(required=True) - - class DataNotRequired(Model): - a = StringType(required=False) - - _test_data( - model=DataRequired, - data={}, - expected=(False, {"a": [messages.MISSING_REQUIRED_FIELD]}), - ) - _test_data(model=DataNotRequired, data={}, expected=(True, {})) - - -def test_choices(): - """ - messages: - - VALUE_NOT_IN_CHOICES - """ - - class Data(Model): - a = StringType(choices=["a", "b"]) - b = IntType(choices=[1, 2, 3]) - - _test_data(model=Data, data={}, expected=(True, {})) - _test_data(model=Data, data={"a": "b", "b": 3}, expected=(True, {})) - _test_data( - model=Data, - data={"a": "c", "b": 4}, - expected=( - False, - { - "a": [messages.VALUE_NOT_IN_CHOICES], - "b": [messages.VALUE_NOT_IN_CHOICES], - }, - ), - ) - - -def test_string_valid(): - """ - messages: - - INVALID_STRING - """ - - class Data(Model): - a = StringType() - - _test_data(model=Data, data={"a": "hello there!"}, expected=(True, {})) - _test_data( - model=Data, data={"a": []}, expected=(False, {"a": [messages.INVALID_STRING]}) - ) - - -def test_string_lengths(): - """ - messages: - - FIELD_TOO_SHORT - - FIELD_TOO_LONG - """ - - class Data(Model): - a = StringType(min_length=2, max_length=5) - - _test_data(model=Data, data={"a": "12"}, expected=(True, {})) - _test_data(model=Data, data={"a": "12345"}, expected=(True, {})) - _test_data( - model=Data, data={"a": "1"}, expected=(False, {"a": [messages.FIELD_TOO_SHORT]}) - ) - _test_data( - model=Data, - data={"a": "123456"}, - expected=(False, {"a": [messages.FIELD_TOO_LONG]}), - ) - - -def test_string_regex(): - """ - messages: - - REGEX_NOT_MATCHED - """ - - class Data(Model): - a = StringType(regex=".*def.*") - - _test_data( - model=Data, - data={"a": "abc"}, - expected=(False, {"a": [messages.REGEX_NOT_MATCHED]}), - ) - _test_data(model=Data, data={"a": "abcdefghi"}, expected=(True, {})) - _test_data(model=Data, data={"a": "def"}, expected=(True, {})) - - -def test_datetime(): - """ - messages: - - INVALID_DATETIME - """ - - class Data(Model): - a = DateTimeType() - - class DataWithFormats(Model): - a = DateTimeType(formats=("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S")) - - if SCHEMATICS1: - INVALID = ["2015-05-13 13:35:15.718978", "2015-05-13 13:35:15"] - else: - INVALID = ["foo", "1-2-3"] - CUSTOM_FORMAT = ["2015-05-13 13:35:15.718978", "2015-05-13 13:35:15"] - VALID = ["2015-05-13T13:35:15.718978", "2015-05-13T13:35:15"] - _test_valid_invalid( - model=Data, - valid=VALID, - invalid=INVALID, - expected_error=messages.INVALID_DATETIME, - ) - if SCHEMATICS1: - for dt in INVALID: - _test_data(model=DataWithFormats, data={"a": dt}, expected=(True, {})) - else: - for dt in CUSTOM_FORMAT: - _test_data(model=DataWithFormats, data={"a": dt}, expected=(True, {})) - - -def test_date(): - """ - messages: - - INVALID_DATE - """ - - class Data(Model): - a = DateType() - - INVALID = ["2015-05-13 13:35:15", "13-05-2013", "2015-20-13", "2015-01-40"] - VALID = ["2015-05-13", "2050-01-01"] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_DATE - ) - - -def test_int(): - """ - messages: - - INVALID_INT - - NUMBER_TOO_LOW - - NUMBER_TOO_HIGH - """ - - class Data(Model): - a = IntType(min_value=-10, max_value=10) - b = IntType() - - INVALID = ["", "a", "2a", "2015-05-13 13:35:15", "7.2"] - VALID = ["1", 
"8", "-2", "-7", 1, 8, -2, -7] - if SCHEMATICS1: - VALID.append(7.2) - else: - INVALID.append(7.2) - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_INT - ) - _test_data( - model=Data, data={"a": -20}, expected=(False, {"a": [messages.NUMBER_TOO_LOW]}) - ) - _test_data( - model=Data, data={"a": 11}, expected=(False, {"a": [messages.NUMBER_TOO_HIGH]}) - ) - - -def test_float(): - """ - messages: - - INVALID_FLOAT - - NUMBER_TOO_LOW - - NUMBER_TOO_HIGH - """ - - class Data(Model): - a = FloatType(min_value=-10, max_value=10) - - INVALID = ["", "a", "2a", "2015-05-13 13:35:15"] - VALID = [ - "1", - "-2", - "8", - "2.3", - "5.2354958", - "-9.231", - 1, - -2, - 8, - 2.3, - 5.2354958, - -9.231, - ] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_FLOAT - ) - _test_data( - model=Data, data={"a": -20}, expected=(False, {"a": [messages.NUMBER_TOO_LOW]}) - ) - _test_data( - model=Data, data={"a": 11}, expected=(False, {"a": [messages.NUMBER_TOO_HIGH]}) - ) - - -def test_long(): - """ - messages: - - INVALID_LONG - - NUMBER_TOO_LOW - - NUMBER_TOO_HIGH - """ - - class Data(Model): - a = LongType(min_value=-10, max_value=10) - - INVALID = ["", "a", "2a", "2015-05-13 13:35:15", "2.3", "5.2354958"] - VALID = ["1", "-2", "8", 1, -2, 8] - if SCHEMATICS1: - expected_error = messages.INVALID_LONG - else: - expected_error = messages.INVALID_INT - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=expected_error - ) - _test_data( - model=Data, data={"a": -20}, expected=(False, {"a": [messages.NUMBER_TOO_LOW]}) - ) - _test_data( - model=Data, data={"a": 11}, expected=(False, {"a": [messages.NUMBER_TOO_HIGH]}) - ) - - -def test_decimal(): - """ - messages: - - INVALID_DECIMAL - - NUMBER_TOO_LOW - - NUMBER_TOO_HIGH - """ - - class Data(Model): - a = DecimalType(min_value=-10, max_value=10) - - INVALID = ["", "a", "2a", "2015-05-13 13:35:15"] - VALID = [ - "1", - "-2", - "8", - "2.3", - "5.2354958", - "-9.231", - 1, - -2, - 8, - 2.3, - 5.2354958, - -9.231, - ] - _test_valid_invalid( - model=Data, - valid=VALID, - invalid=INVALID, - expected_error=messages.INVALID_DECIMAL, - ) - _test_data( - model=Data, data={"a": -20}, expected=(False, {"a": [messages.NUMBER_TOO_LOW]}) - ) - _test_data( - model=Data, data={"a": 11}, expected=(False, {"a": [messages.NUMBER_TOO_HIGH]}) - ) - - -def test_boolean(): - """ - messages: - - INVALID_BOOLEAN - - NUMBER_TOO_LOW - - NUMBER_TOO_HIGH - """ - - class Data(Model): - a = BooleanType() - - INVALID = ["", "a", "2" "TRUE", "FALSE", "TruE", "FalsE"] - VALID = [0, 1, "0", "1", "True", "False", "true", "false", True, False] - _test_valid_invalid( - model=Data, - valid=VALID, - invalid=INVALID, - expected_error=messages.INVALID_BOOLEAN, - ) - - -def test_email(): - """ - messages: - - INVALID_EMAIL - - NUMBER_TOO_LOW - - NUMBER_TOO_HIGH - """ - - class Data(Model): - a = EmailType() - - INVALID = [ - "", - "johndoe", - "johndoe@domain" "johndoe@domain." "@domain" "@domain.com" "domain.com", - ] - VALID = [ - "johndoe@domain.com", - "john.doe@domain.com", - "john.doe@sub.domain.com", - "j@sub.domain.com", - "j@d.com", - "j@domain.co.uk", - ] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_EMAIL - ) - - -def test_url(): - """ - messages: - - INVALID_URL - """ - - class Data(Model): - a = URLType() - - INVALID = [ - "", - "http://", - "http://www.", - "www.", - "http://www. 
.com", - "domain.com", - "www.domain.com", - "http:/www.domain.com", - "http//www.domain.com", - "http:www.domain.com", - "htp://domain.com/", - "http://sub.domain.com\\en-us\\default.aspx\\", - "http:\\\\msdn.domain.com\\en-us\\library\\default.aspx\\", - "http:\\\\www.domain.com\\leafnode-L1.html", - "./", - "../", - "http:\\\\www.domain.com\\leafnode-L1.xhtml\\", - ] - VALID = [ - "http://www.domain", - "http://www.com", - "http://www.domain.com.", - "http://www.domain.com/.", - "http://www.domain.com/..", - "http://www.domain.com//cataglog//index.html", - "http://www.domain.net/", - "http://www.domain.com/level2/leafnode-L2.xhtml/", - "http://www.domain.com/level2/level3/leafnode-L3.xhtml/", - "http://www.domain.com?pageid=123&testid=1524", - "http://www.domain.com/do.html#A", - ] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_URL - ) - - -def test_uuid(): - """ - messages: - - INVALID_UUID - """ - - class Data(Model): - a = UUIDType() - - INVALID = [ - "", - "678as6sd88ads67", - "678as6sd88ads67-alskjlasd", - "xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx", - "2.25.290383009913173870543740933812899923227", - ] - VALID = [ - "12345678-1234-5678-1234-567812345678", - "12345678123456781234567812345678", - "urn:uuid:12345678-1234-5678-1234-567812345678", - "cfc63f3f-f3a7-465a-8183-acf055c6d472", - "00000000-0000-0000-0000-000000000000", - "01234567-89ab-cdef-0123456789abcdef", - "0123456789abcdef0123456789abcdef", - ] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_UUID - ) - - -def test_ipv4type(): - """ - messages: - - INVALID_IPV4 - """ - - class Data(Model): - a = IPv4Type() - - INVALID = [ - "", - "0", - "0.", - "0.0", - "0.0.", - "0.0.0", - "0.0.0.0.", - "0.0.0.0.0", - "256.256.256.256", - "2002:4559:1FE2::4559:1FE2", - "2002:4559:1FE2:0:0:0:4559:1FE2", - "2002:4559:1FE2:0000:0000:0000:4559:1FE2", - ] - VALID = [ - "98.139.180.149", - "69.89.31.226", - "192.168.1.1", - "127.0.0.0", - "0.0.0.0", - "255.255.255.255", - ] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_IPV4 - ) - - -def test_md5(): - """ - messages: - - INVALID_HASH - - INVALID_HASH_LENGTH - """ - - class Data(Model): - a = MD5Type() - - INVALID = [ - "_b1a9953c4611296a827abf8a47804d7", - "zb1a9953c4611296a827abf8a47804d7", - "Gb1a9953c4611296a827abf8a47804d7", - # FIXME: PY3: schematics uses integer conversion for validating hex and - # Py3 integers can contain underscores. 
- # '8b1_9953c4611296a827abf8c47804d1', - ] - VALID = [ - "8b1a9953c4611296a827abf8c47804d7", - "7dd4bbe8a38600b556f79ca44c9b5132", - "11111111111111111111111111111111", - "8B1A9953C4611296A827ABF8C47804D1", - ] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_HASH - ) - _test_data( - model=Data, - data={"a": "8b1a9953c4611296a827abf8c47804d"}, - expected=(False, {"a": [messages.INVALID_HASH_LENGTH]}), - ) - _test_data( - model=Data, - data={"a": "8b1a9953c4611296"}, - expected=(False, {"a": [messages.INVALID_HASH_LENGTH]}), - ) - _test_data( - model=Data, - data={"a": "8b1a9953c46112968b1a9953c46112968b1a9953c4611296"}, - expected=(False, {"a": [messages.INVALID_HASH_LENGTH]}), - ) - - -def test_sha1(): - """ - messages: - - INVALID_HASH - - INVALID_HASH_LENGTH - """ - - class Data(Model): - a = SHA1Type() - - INVALID = [ - "_03d40e1a2ede7e31f3c3b45a9e87d12ed33402e", - "g03d40e1a2ede7e31f3c3b45a9e87d12ed33402e", - "z03d40e1a2ede7e31f3c3b45a9e87d12ed33402e", - "G03d40e1a2ede7e31f3c3b45a9e87d12ed33402e", - # FIXME: PY3: schematics uses integer conversion for validating hex and - # Py3 integers can contain underscores. - # 'a03d_0e1a2ede7e31f3c3b45a9e87d12ed33402e', - ] - VALID = ["a03d70e1a2ede7e31f3c3b45a9e87d12ed33402e"] - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_HASH - ) - _test_data( - model=Data, - data={"a": "a03d70e1a2ede7e31f3c3b45a9e87d12ed33402"}, - expected=(False, {"a": [messages.INVALID_HASH_LENGTH]}), - ) - _test_data( - model=Data, - data={"a": "8b1a9953c4611296"}, - expected=(False, {"a": [messages.INVALID_HASH_LENGTH]}), - ) - _test_data( - model=Data, - data={"a": "8b1a9953c46112968b1a9953c46112968b1a9953c4611296"}, - expected=(False, {"a": [messages.INVALID_HASH_LENGTH]}), - ) - - -def test_list(): - """ - messages: - - INVALID_LIST - - LIST_TOO_SHORT - - LIST_TOO_LARGE - - INVALID_INT - """ - - class Data(Model): - a = ListType(field=IntType(), min_size=3, max_size=5) - - _test_data(model=Data, data={"a": [1, 2, 3]}, expected=(True, {})) - _test_data(model=Data, data={"a": ["1", "2", "3"]}, expected=(True, {})) - _test_data( - model=Data, data={"a": Data}, expected=(False, {"a": [messages.INVALID_LIST]}) - ) - if SCHEMATICS1: - _test_data( - model=Data, - data={"a": ["a", "b", "c"]}, - expected=(False, {"a": [messages.INVALID_INT]}), - ) - else: - _test_data( - model=Data, - data={"a": ["a", "b", "c"]}, - expected=( - False, - { - "a.0": [messages.INVALID_INT], - "a.1": [messages.INVALID_INT], - "a.2": [messages.INVALID_INT], - }, - ), - ) - _test_data( - model=Data, - data={"a": [1, 2]}, - expected=(False, {"a": [messages.LIST_TOO_SHORT]}), - ) - _test_data( - model=Data, - data={"a": [1, 2, 3, 4, 5, 6]}, - expected=(False, {"a": [messages.LIST_TOO_LONG]}), - ) - - -def test_dict(): - """ - messages: - - INVALID_DICT - - INVALID_INT - """ - - class Data(Model): - a = DictType(field=IntType) - - INVALID = ["a", Data] - VALID = [{}, {"some": 1}] - if SCHEMATICS1: - VALID.append([]) - else: - INVALID.append([]) - _test_valid_invalid( - model=Data, valid=VALID, invalid=INVALID, expected_error=messages.INVALID_DICT - ) - if SCHEMATICS1: - _test_data( - model=Data, - data={"a": {"some": "a"}}, - expected=(False, {"a": [messages.INVALID_INT]}), - ) - else: - _test_data( - model=Data, - data={"a": {"some": "a"}}, - expected=(False, {"a.some": [messages.INVALID_INT]}), - ) - - -def test_models(): - """ - messages: - - UNEXPECTED_FIELD - - MISSING_REQUIRED_FIELD - - 
INVALID_FLOAT - """ - - class Coordinates(Model): - latitude = FloatType(required=True) - longitude = FloatType(required=True) - - class Geo(Model): - coordinates = ModelType(Coordinates, required=True) - - class Data(Model): - geo = ModelType(Geo, required=True) - - _test_data( - model=Data, - data={"a": {}}, - expected=( - False, - { - "a": [messages.UNEXPECTED_FIELD], - "geo": [messages.MISSING_REQUIRED_FIELD], - }, - ), - ) - _test_data( - model=Data, - data={"geo": None}, - expected=(False, {"geo": [messages.MISSING_REQUIRED_FIELD]}), - ) - _test_data( - model=Data, - data={"geo": {}}, - expected=(False, {"geo.coordinates": [messages.MISSING_REQUIRED_FIELD]}), - ) - _test_data( - model=Data, - data={"geo": {"coordinates": None}}, - expected=(False, {"geo.coordinates": [messages.MISSING_REQUIRED_FIELD]}), - ) - _test_data( - model=Data, - data={"geo": {"coordinates": {}}}, - expected=( - False, - { - "geo.coordinates.latitude": [messages.MISSING_REQUIRED_FIELD], - "geo.coordinates.longitude": [messages.MISSING_REQUIRED_FIELD], - }, - ), - ) - _test_data( - model=Data, - data={"geo": {"coordinates": {"latitude": None, "longitude": None}}}, - expected=( - False, - { - "geo.coordinates.latitude": [messages.MISSING_REQUIRED_FIELD], - "geo.coordinates.longitude": [messages.MISSING_REQUIRED_FIELD], - }, - ), - ) - _test_data( - model=Data, - data={"geo": {"coordinates": {"latitude": "y", "longitude": "x"}}}, - expected=( - False, - { - "geo.coordinates.latitude": [messages.INVALID_FLOAT], - "geo.coordinates.longitude": [messages.INVALID_FLOAT], - }, - ), - ) - _test_data( - model=Data, - data={"geo": {"coordinates": {"latitude": 40.42, "longitude": -3.71}}}, - expected=(True, {}), - ) - - -def test_multiple_errors_per_field(): - """ - messages: - - FIELD_TOO_SHORT - - REGEX_NOT_MATCHED - """ - - class Data(Model): - a = StringType(min_length=3, regex=r"foo") - - data = {"a": "z"} - v = SchematicsValidator(Data) - result = v.validate(data, strict=True) - assert result[0] is False - error_messages = result[1] - assert "a" in error_messages - expected = [messages.FIELD_TOO_SHORT, messages.REGEX_NOT_MATCHED] - assert sorted(error_messages["a"]) == expected - - -def _test_data(model, data, expected, strict=True): - v = SchematicsValidator(model) - assert expected == v.validate(data, strict=strict) - - -def _test_valid_invalid(model, valid, invalid, expected_error, expected_field="a"): - for dt in valid: - _test_data(model=model, data={expected_field: dt}, expected=(True, {})) - for dt in invalid: - _test_data( - model=model, - data={expected_field: dt}, - expected=(False, {expected_field: [expected_error]}), - ) - - -def test_validation_error_on_model_level_validation(): - class TestModel(Model): - field_a = StringType() - - def validate_field_a(self, data, value): - raise ValidationError("Model-level validation failed.") - - _test_data( - model=TestModel, - data={"field_a": "some_data"}, - expected=(False, {"field_a": ["Model-level validation failed."]}), - ) diff --git a/tests/utils/test_field_coverage.py b/tests/utils/test_field_coverage.py index 698997f1..86ffb184 100644 --- a/tests/utils/test_field_coverage.py +++ b/tests/utils/test_field_coverage.py @@ -24,3 +24,48 @@ def test_calculate_field_coverage_from_stats(): coverage = calculate_field_coverage(spider_stats) assert coverage == expected_coverage + + +def test_calculate_field_coverage_from_stats_with_nested_fields(): + spider_stats = { + "finish_reason": "finished", + "spidermon_item_scraped_count": 100, + 
"spidermon_item_scraped_count/dict": 100, + "spidermon_item_scraped_count/dict/field1": 100, + "spidermon_item_scraped_count/dict/field2": 90, + "spidermon_item_scraped_count/dict/field2/_items": 1000, + "spidermon_item_scraped_count/dict/field2/_items/nested_field1": 550, + "spidermon_item_scraped_count/dict/field2/_items/nested_field2": 1000, + "spidermon_item_scraped_count/dict/field2/_items/nested_field3": 300, + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items": 500, + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field1": 500, + "spidermon_item_scraped_count/dict/field2/_items/nested_field3/_items/deep_field2": 250, + "spidermon_item_scraped_count/dict/field2/_items/nested_field4": 500, + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field1": 500, + "spidermon_item_scraped_count/dict/field2/_items/nested_field4/deep_field2": 250, + } + + expected_coverage = { + "spidermon_field_coverage/dict/field1": 1.0, + "spidermon_field_coverage/dict/field2": 0.9, + "spidermon_field_coverage/dict/field2/_items/nested_field1": 0.55, + "spidermon_field_coverage/dict/field2/_items/nested_field2": 1.0, + "spidermon_field_coverage/dict/field2/_items/nested_field3": 0.3, + "spidermon_field_coverage/dict/field2/_items/nested_field3/_items/deep_field1": 1.0, + "spidermon_field_coverage/dict/field2/_items/nested_field3/_items/deep_field2": 0.5, + "spidermon_field_coverage/dict/field2/_items/nested_field4": 0.5, + "spidermon_field_coverage/dict/field2/_items/nested_field4/deep_field1": 0.5, + "spidermon_field_coverage/dict/field2/_items/nested_field4/deep_field2": 0.25, + "spidermon_field_coverage/dict/field2/nested_field1": 5.5, + "spidermon_field_coverage/dict/field2/nested_field2": 10.0, + "spidermon_field_coverage/dict/field2/nested_field3": 3.0, + "spidermon_field_coverage/dict/field2/nested_field3/deep_field1": 5.0, + "spidermon_field_coverage/dict/field2/nested_field3/deep_field2": 2.5, + "spidermon_field_coverage/dict/field2/nested_field4": 5.0, + "spidermon_field_coverage/dict/field2/nested_field4/deep_field1": 5.0, + "spidermon_field_coverage/dict/field2/nested_field4/deep_field2": 2.5, + } + + coverage = calculate_field_coverage(spider_stats) + + assert coverage == expected_coverage diff --git a/tox.ini b/tox.ini index c24d4811..cf65c734 100644 --- a/tox.ini +++ b/tox.ini @@ -6,7 +6,7 @@ skip_missing_interpreters = True extras = tests validation -commands = pytest -s -W ignore::schematics.deprecated.SchematicsDeprecationWarning --cov=spidermon --cov-report= {posargs:tests} +commands = pytest -s --cov=spidermon --cov-report= {posargs:tests} [testenv:min] basepython = python3.6