Merge branch 'scrapinghub:master' into issue-403-custom-job-tags
VMRuiz committed Jun 30, 2023
2 parents 856bc46 + 2c487b7 commit 2bd22e8
Showing 26 changed files with 576 additions and 1,259 deletions.
59 changes: 36 additions & 23 deletions docs/source/getting-started.rst
@@ -315,12 +315,8 @@ Item validation
---------------

Item validators allow you to match your returned items against a predetermined structure,
ensuring that all fields contain data in the expected format. Spidermon allows
you to choose between schematics_ or `JSON Schema`_ to define the structure
of your item.

In this tutorial, we will use a schematics_ model to make sure that all required
fields are populated and they are all of the correct format.
ensuring that all fields contain data in the expected format. Spidermon supports `JSON Schema`_
to define the structure of your item.

First step is to change our actual spider code to use `Scrapy items`_. Create a
new file called `items.py`:
@@ -367,25 +363,43 @@ And then modify the spider code to use the newly defined item:
)
)
Now we need to create our schematics model in `validators.py` file that will contain
Now we need to create our jsonschema model in the `schemas/quote_item.json` file that will contain
all the validation rules:

.. _quote-item-validation-schema:

.. code-block:: python

    # tutorial/validators.py
    from schematics.models import Model
    from schematics.types import URLType, StringType, ListType


    class QuoteItem(Model):
        quote = StringType(required=True)
        author = StringType(required=True)
        author_url = URLType(required=True)
        tags = ListType(StringType)

.. code-block:: json

    {
        "$schema": "http://json-schema.org/draft-07/schema",
        "type": "object",
        "properties": {
            "quote": {
                "type": "string"
            },
            "author": {
                "type": "string"
            },
            "author_url": {
                "type": "string",
                "pattern": ""
            },
            "tags": {
                "type": "array",
                "items": {
                    "type": "string"
                }
            }
        },
        "required": [
            "quote",
            "author",
            "author_url"
        ]
    }

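As a rough illustration of what this schema accepts and rejects, the sketch below checks two sample items directly with the ``jsonschema`` library. It is not how Spidermon's pipeline is implemented; the schema is inlined rather than loaded from ``schemas/quote_item.json``, and the ``validation_errors`` helper and both sample items are hypothetical names introduced only for this example.

```python
from jsonschema import Draft7Validator

# Same rules as schemas/quote_item.json, inlined so the sketch is self-contained.
QUOTE_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "object",
    "properties": {
        "quote": {"type": "string"},
        "author": {"type": "string"},
        "author_url": {"type": "string", "pattern": ""},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["quote", "author", "author_url"],
}


def validation_errors(item):
    """Hypothetical helper: list the message of every rule the item violates."""
    validator = Draft7Validator(QUOTE_SCHEMA)
    return sorted(error.message for error in validator.iter_errors(item))


# Made-up sample data: one item that satisfies the schema, one that does not.
good_item = {
    "quote": "A sample quote.",
    "author": "Sample Author",
    "author_url": "http://example.com/author/sample",
    "tags": ["life", "books"],
}
bad_item = {
    "quote": "A quote without an author.",
    "tags": ["life", 42],  # 42 is not a string
}

print(validation_errors(good_item))
for message in validation_errors(bad_item):
    print(message)
```

The second item is missing both ``author`` and ``author_url`` and carries a non-string tag, so each of those rules is reported separately. Note that the empty ``pattern`` on ``author_url`` matches any string, so the schema as written does not enforce a URL format.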
To allow Spidermon to validate your items, you need to include an item pipeline and
inform the name of the model class used for validation:
provide the path to the JSON schema used for validation:

.. code-block:: python
@@ -394,8 +408,8 @@ inform the name of the model class used for validation:
'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}
SPIDERMON_VALIDATION_MODELS = (
'tutorial.validators.QuoteItem',
SPIDERMON_VALIDATION_SCHEMAS = (
'./schemas/quote_item.json',
)
After that, every time you run your spider you will have a new set of stats in
@@ -408,7 +422,7 @@ your spider log providing information about the results of the validations:
'spidermon/validation/fields': 400,
'spidermon/validation/items': 100,
'spidermon/validation/validators': 1,
'spidermon/validation/validators/item/schematics': True,
'spidermon/validation/validators/item/jsonschema': True,
[scrapy.core.engine] INFO: Spider closed (finished)
You can then create a new monitor that will check these new statistics and raise
@@ -473,7 +487,6 @@ The resulted item will look like this:
}
.. _`JSON Schema`: https://json-schema.org/
.. _`schematics`: https://schematics.readthedocs.io/en/latest/
.. _`Scrapy`: https://scrapy.org/
.. _`Scrapy items`: https://docs.scrapy.org/en/latest/topics/items.html
.. _`Scrapy Tutorial`: https://doc.scrapy.org/en/latest/intro/tutorial.html
7 changes: 2 additions & 5 deletions docs/source/index.rst
@@ -11,11 +11,8 @@ following features:

* It can check the output data produced by Scrapy (or other sources) and
verify it against a schema or model that defines the expected structure,
data types and value restrictions. It supports data validation based on two
external libraries:

* jsonschema: `<https://github.com/Julian/jsonschema>`_
* Schematics: `<https://github.com/schematics/schematics>`_
data types and value restrictions. It supports data validation based on
the jsonschema library (`<https://github.com/Julian/jsonschema>`_).
* It allows you to define conditions that should trigger an alert based on
Scrapy stats.
* It supports notifications via email, Slack, Telegram and Discord.
5 changes: 1 addition & 4 deletions docs/source/installation.rst
@@ -9,15 +9,12 @@ build your monitors on top of it. The library depends on jsonschema_ and

If you want to set up any notifications, additional `monitoring` dependencies will help with that.

If you want to use schematics_ validation, you probably want `validation`.

So the recommended way to install the library is by adding both:

.. code-block:: bash
pip install "spidermon[monitoring,validation]"
pip install "spidermon[monitoring]"
.. _`jsonschema`: https://pypi.org/project/jsonschema/
.. _`python-slugify`: https://pypi.org/project/python-slugify/
.. _`schematics`: https://pypi.org/project/schematics/
66 changes: 2 additions & 64 deletions docs/source/item-validation.rst
@@ -21,37 +21,8 @@ the first step is to enable the built-in item pipeline in your project settings:
subsequent pipeline changes the content of the item, ignoring the
validation already performed.

After that, you need to choose which validation library will be used. Spidermon
accepts schemas defined using schematics_ or `JSON Schema`_.

With schematics
---------------

Schematics_ is a validation library based on ORM-like models. These models include
some common data types and validators, but they can also be extended to define
custom validation rules.

.. warning::

You need to install `schematics`_ to use this feature.

.. code-block:: python

    # Usually placed in validators.py file
    from schematics.models import Model
    from schematics.types import URLType, StringType, ListType


    class QuoteItem(Model):
        quote = StringType(required=True)
        author = StringType(required=True)
        author_url = URLType(required=True)
        tags = ListType(StringType)

Check `schematics documentation`_ to learn how to define a model and how to extend the
built-in data types.

With JSON Schema
----------------
Using JSON Schema
-----------------

`JSON Schema`_ is a powerful tool for validating the structure of JSON data. You can
define which fields are required, the type assigned to each field, a regular expression
@@ -133,36 +104,6 @@ Default: ``_validation``
The name of the field added to the item when a validation error happens and
`SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS`_ is enabled.

SPIDERMON_VALIDATION_MODELS
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``None``

A `list` containing the `schematics models`_ that contain the definition of the items
that need to be validated.

.. code-block:: python
# settings.py
SPIDERMON_VALIDATION_MODELS = [
'tutorial.validators.DummyItemModel'
]
If you are working on a spider that produces multiple items types, you can define it
as a `dict`:

.. code-block:: python
# settings.py
from tutorial.items import DummyItem, OtherItem
SPIDERMON_VALIDATION_MODELS = {
DummyItem: 'tutorial.validators.DummyItemModel',
OtherItem: 'tutorial.validators.OtherItemModel',
}
SPIDERMON_VALIDATION_SCHEMAS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -235,9 +176,6 @@ Some examples:
# checks that no errors are present in any field
self.check_field_errors_percent()
.. _`schematics`: https://schematics.readthedocs.io/en/latest/
.. _`schematics documentation`: https://schematics.readthedocs.io/en/latest/
.. _`JSON Schema`: https://json-schema.org/
.. _`guide`: http://json-schema.org/learn/getting-started-step-by-step.html
.. _`schematics models`: https://schematics.readthedocs.io/en/latest/usage/models.html
.. _`jsonschema`: https://pypi.org/project/jsonschema/
82 changes: 82 additions & 0 deletions docs/source/settings.rst
@@ -182,3 +182,85 @@ If this setting is not provided or set to ``False``, spider statistics will be:
'spidermon_item_scraped_count/dict/field_2': 2,
'spidermon_field_coverage/dict/field_1': 1, # Did not ignore None value
'spidermon_item_scraped_count/dict/field_2': 1,
SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS
-------------------------------------
Default: ``0``

If greater than 0, field coverage will also be computed for items inside fields that are lists.
The number represents how deep in the object tree the coverage is computed.
Be aware that enabling this might have a significant impact on performance.

Considering your spider returns the following items:

.. code-block:: python
[
{
"field_1": None,
"field_2": [{"nested_field1": "value", "nested_field2": "value"}],
},
{
"field_1": "value",
"field_2": [
{"nested_field2": "value", "nested_field3": {"deeper_field1": "value"}}
],
},
{
"field_1": "value",
"field_2": [
{
"nested_field2": "value",
"nested_field4": [
{"deeper_field41": "value"},
{"deeper_field41": "value"},
],
}
],
},
]
If this setting is not provided or set to ``0``, spider statistics will be:

.. code-block:: python
'item_scraped_count': 3,
'spidermon_item_scraped_count': 3,
'spidermon_item_scraped_count/dict': 3,
'spidermon_item_scraped_count/dict/field_1': 3,
'spidermon_item_scraped_count/dict/field_2': 3
If set to ``1``, spider statistics will be:

.. code-block:: python
'item_scraped_count': 3,
'spidermon_item_scraped_count': 3,
'spidermon_item_scraped_count/dict': 3,
'spidermon_item_scraped_count/dict/field_1': 3,
'spidermon_item_scraped_count/dict/field_2': 3,
'spidermon_item_scraped_count/dict/field_2/_items': 3,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field1': 1,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field2': 3,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field3': 1,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field3/deeper_field1': 1,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field4': 1
If set to ``2``, spider statistics will be:

.. code-block:: python
'item_scraped_count': 3,
'spidermon_item_scraped_count': 3,
'spidermon_item_scraped_count/dict': 3,
'spidermon_item_scraped_count/dict/field_1': 3,
'spidermon_item_scraped_count/dict/field_2': 3,
'spidermon_item_scraped_count/dict/field_2/_items': 3,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field1': 1,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field2': 3,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field3': 1,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field3/deeper_field1': 1,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field4': 1,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field4/_items': 2,
'spidermon_item_scraped_count/dict/field_2/_items/nested_field4/_items/deeper_field41': 2
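The counting rule implied by the three listings above can be sketched in a few lines of plain Python. This is an assumption-laden reconstruction from the example statistics, not Spidermon's actual implementation: ``count_fields`` and ``coverage_counts`` are hypothetical names, nested dicts are assumed to be traversed at every level, and each list consumes one level and contributes one ``_items`` count per element.

```python
from collections import Counter

# The three example items from the setting's description.
ITEMS = [
    {
        "field_1": None,
        "field_2": [{"nested_field1": "value", "nested_field2": "value"}],
    },
    {
        "field_1": "value",
        "field_2": [
            {"nested_field2": "value", "nested_field3": {"deeper_field1": "value"}}
        ],
    },
    {
        "field_1": "value",
        "field_2": [
            {
                "nested_field2": "value",
                "nested_field4": [
                    {"deeper_field41": "value"},
                    {"deeper_field41": "value"},
                ],
            }
        ],
    },
]


def count_fields(item, levels, prefix, stats):
    """Hypothetical helper: count every field of ``item`` under ``prefix``."""
    for key, value in item.items():
        path = f"{prefix}/{key}"
        stats[path] += 1  # None values are counted too, as in the listings
        if isinstance(value, dict):
            # Nested dicts do not consume a level.
            count_fields(value, levels, path, stats)
        elif isinstance(value, list) and levels > 0:
            items_path = f"{path}/_items"
            for element in value:
                stats[items_path] += 1
                if isinstance(element, dict):
                    # Entering a list consumes one level.
                    count_fields(element, levels - 1, items_path, stats)


def coverage_counts(items, levels):
    """Hypothetical helper: per-field counts for a list of dict items."""
    stats = Counter()
    for item in items:
        count_fields(item, levels, "spidermon_item_scraped_count/dict", stats)
    return stats


for key, value in sorted(coverage_counts(ITEMS, levels=2).items()):
    print(f"{key!r}: {value}")
```

With ``levels=0`` only ``field_1`` and ``field_2`` are counted; with ``levels=1`` the ``field_2/_items`` entries appear; with ``levels=2`` the list inside ``nested_field4`` is traversed as well. The extra recursion per list element is also where the performance cost mentioned above would come from.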
Binary file not shown.
27 changes: 27 additions & 0 deletions examples/tutorial/tutorial/schemas/quote_item.json
@@ -0,0 +1,27 @@
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "object",
    "properties": {
        "quote": {
            "type": "string"
        },
        "author": {
            "type": "string"
        },
        "author_url": {
            "type": "string",
            "pattern": ""
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string"
            }
        }
    },
    "required": [
        "quote",
        "author",
        "author_url"
    ]
}
2 changes: 1 addition & 1 deletion examples/tutorial/tutorial/settings.py
@@ -15,7 +15,7 @@
SPIDERMON_SLACK_RECIPIENTS = ["@yourself", "#yourprojectchannel"]

ITEM_PIPELINES = {"spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800}
SPIDERMON_VALIDATION_MODELS = ("tutorial.validators.QuoteItem",)
SPIDERMON_VALIDATION_SCHEMAS = ("../schemas/quote_item.json",)

SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True

9 changes: 0 additions & 9 deletions examples/tutorial/tutorial/validators.py

This file was deleted.

1 change: 0 additions & 1 deletion requirements.txt
@@ -3,7 +3,6 @@ slack-sdk
boto
premailer
jsonschema[format]
schematics==2.1.0
python-slugify
scrapy
pytest
2 changes: 0 additions & 2 deletions setup.py
@@ -43,8 +43,6 @@
"premailer",
"sentry-sdk",
],
# Data validation
"validation": ["schematics"],
# Tools to run the tests
"tests": test_requirements,
# Tools to build and publish the documentation