scrapy-spider-metadata documentation

scrapy-spider-metadata is a Python 3.8+ library of utilities to extend Scrapy spiders with usable metadata.

In particular, it provides a nice way for Scrapy spiders to define, validate, document and pre-process their arguments as pydantic models.

Spider metadata

This library allows retrieving spider metadata defined in spider classes.

If a spider class defines spider parameters, their schema will also be included in the retrieved metadata.

Defining spider metadata

You can declare arbitrary metadata in your spider classes as a dictionary attribute named metadata:

from scrapy import Spider

class MySpider(Spider):
    name = "my_spider"
    metadata = {
        "description": "This is my spider.",
        "category": "My basic spiders",
    }

As this attribute is shared between instances of the class and of its subclasses, be careful not to modify it in place. Here is a simple way to add or change some values in a subclass:

from scrapy import Spider

class BaseSpider(Spider):
    metadata = {
        "description": "Base spider.",
        "category": "Base spiders",
    }

class BaseNewsSpider(BaseSpider):
    metadata = {
        **BaseSpider.metadata,
        "description": "Base news spider.",
    }

class CNNSpider(BaseNewsSpider):
    metadata = {
        **BaseNewsSpider.metadata,
        "description": "CNN spider.",
        "category": "Concrete spiders",
        "website": "CNN",
    }

Getting spider metadata

scrapy-spider-metadata provides the following function for retrieving the metadata for a specific spider class:

scrapy_spider_metadata.get_spider_metadata(spider_cls: Type[Spider], *, normalize: bool = False) → Dict[str, Any]

Return the metadata for the spider class.

The returned dict is a copy of the metadata dict of the spider class. If the spider class defines spider parameters, the returned dict also includes a param_schema key whose value is the JSON Schema for those parameters.

Parameters:
  • spider_cls – The spider class.

  • normalize – If True, normalize the param_schema value so that it does not depend on the installed Pydantic version. See get_param_schema().

Returns:

The complete spider metadata.
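
For example, a minimal sketch of retrieving the metadata of the CNNSpider class defined above (it declares no spider parameters, so no param_schema key is added):

>>> from scrapy_spider_metadata import get_spider_metadata
>>> get_spider_metadata(CNNSpider)
{'description': 'CNN spider.', 'category': 'Concrete spiders', 'website': 'CNN'}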

Parameters

Spider arguments have some limitations:

  • There is no standard way for spiders to indicate which parameters they expect or support.

  • Since arguments can come from the command line, and those are always strings, you have to write code to convert arguments to a different type when needed.

  • If you want argument validation, such as making sure that arguments are of the right type, or that required arguments are present, you must implement validation yourself.

scrapy-spider-metadata overcomes those limitations.

Defining supported parameters

To define spider parameters, declare a subclass of pydantic.BaseModel, and then make your spider also inherit from Args, parameterized with your specification class:

from pydantic import BaseModel
from scrapy import Request, Spider
from scrapy_spider_metadata import Args

class MyParams(BaseModel):
    pages: int

class MySpider(Args[MyParams], Spider):
    name = "my_spider"

    def start_requests(self):
        for index in range(1, self.args.pages + 1):
            yield Request(f"https://books.toscrape.com/catalogue/page-{index}.html")
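
Spider arguments are typically passed from the command line with Scrapy's -a option, so a run of the spider above could look like this:

scrapy crawl my_spider -a pages=42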

To learn how to define parameters in your pydantic.BaseModel subclass, see the Pydantic usage documentation.

Defined parameters make your spider:

  • Halt with an exception if there are missing arguments or any provided argument does not match the defined parameter validation rules.

    Note

    By default extra arguments are silently ignored, but you can change that; see the sketch below.

  • Expose an instance of your parameter specification class, the args attribute, which contains the parsed version of your spider arguments, e.g. converted to their expected type.

For example, if you run the spider in the example above with the pages parameter set to the string "42", self.args.pages is the int value 42 (self.pages remains "42").

Also, if you do not pass a value for pages at all, the spider will not start, because pages is a required parameter. All parameters without a default value are considered required parameters.
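
For instance, here is a sketch of a parameter model, assuming Pydantic 2.x syntax and illustrative field names, that combines a required parameter, an optional parameter with a default value, and a configuration that rejects extra arguments instead of silently ignoring them:

from typing import Optional

from pydantic import BaseModel, ConfigDict, Field

class MyParams(BaseModel):
    # Reject unexpected spider arguments instead of silently ignoring them.
    model_config = ConfigDict(extra="forbid")

    # Required: no default value, so omitting it keeps the spider from starting.
    pages: int = Field(ge=1, description="Number of pages to crawl")

    # Optional: the default value is used when the argument is not passed.
    category: Optional[str] = None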

Getting the parameter specification as JSON Schema

Given a spider class with defined parameters, you can get a JSON Schema representation of the parameter specification of that spider using the get_param_schema() class method:

>>> MySpider.get_param_schema()
{'properties': {'pages': {'title': 'Pages', 'type': 'integer'}}, 'required': ['pages'], 'title': 'MyParams', 'type': 'object'}

scrapy-spider-metadata uses Pydantic to generate the JSON Schema, so your version of pydantic can affect the resulting output.

Parameters API

class scrapy_spider_metadata.Args(*args, **kwargs)

Validates and type-converts spider arguments into the args instance attribute according to the spider parameter specification.

classmethod get_param_schema(normalize: bool = False) → Dict[Any, Any]

Return a dict with the parameter definition as JSON Schema.

If normalize is True, the returned schema will be the same regardless of whether you are using Pydantic 1.x or Pydantic 2.x. The normalized schema may not match the output of any Pydantic version, but it will be functionally equivalent where possible.
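
For example, for the simple MyParams model above, where Pydantic 1.x and 2.x should already agree, the normalized schema is presumably identical to the default output:

>>> MySpider.get_param_schema(normalize=True)
{'properties': {'pages': {'title': 'Pages', 'type': 'integer'}}, 'required': ['pages'], 'title': 'MyParams', 'type': 'object'}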

Changes

0.1.2 (2023-10-09)

  • Fixed the params schema output for optional fields and enums.

0.1.1 (2023-10-09)

  • Added “normalize” parameter to get_spider_metadata function.

0.1.0 (2023-09-29)

Initial version.