Parameters

Spider arguments have some limitations:

  • There is no standard way for spiders to indicate which parameters they expect or support.

  • Since arguments can come from the command line, and those are always strings, you have to write code to convert arguments to a different type when needed.

  • If you want argument validation, such as making sure that arguments are of the right type, or that required arguments are present, you must implement validation yourself.

scrapy-spider-metadata allows overcoming those limitations.

Defining supported parameters

To define spider parameters, define a subclass of pydantic.BaseModel, and then make your spider also inherit from Args with your parameter specification class as its parameter:

from pydantic import BaseModel
from scrapy import Request, Spider
from scrapy_spider_metadata import Args

class MyParams(BaseModel):
    pages: int

class MySpider(Args[MyParams], Spider):
    name = "my_spider"

    def start_requests(self):
        for index in range(1, self.args.pages + 1):
            yield Request(f"https://books.toscrape.com/catalogue/page-{index}.html")

To learn how to define parameters in your pydantic.BaseModel subclass, see the Pydantic usage documentation.

Defined parameters make your spider:

  • Halt with an exception if there are missing arguments or any provided argument does not match the defined parameter validation rules.

    Note

    By default extra arguments are silently ignored, but you can change that.

  • Expose an instance of your parameter specification class, that contains the parsed version of your spider arguments, e.g. converted to their expected type.

For example, if you run the spider in the example above with the pages parameter set to the string "42", self.args.pages is the int value 42 (self.pages remains "42").

Also, if you do not pass a value for pages at all, the spider will not start, because pages is a required parameter. All parameters without a default value are considered required parameters.

Getting the parameter specification as JSON Schema

Given a spider class with defined parameters, you can get a JSON Schema representation of the parameter specification of that spider using the get_param_schema() class function:

>>> MySpider.get_param_schema()
{'properties': {'pages': {'title': 'Pages', 'type': 'integer'}}, 'required': ['pages'], 'title': 'MyParams', 'type': 'object'}

scrapy-spider-metadata uses Pydantic to generate the JSON Schema, so your version of pydantic can affect the resulting output.

Parameters API

class scrapy_spider_metadata.Args(*args, **kwargs)[source]

Validates and type-converts spider arguments into the args instance attribute according to the spider parameter specification.

classmethod get_param_schema(normalize: bool = False) Dict[Any, Any][source]

Return a dict with the parameter definition as JSON Schema.

If normalize is True, the returned schema will be the same regardless of whether you are using Pydantic 1.x or Pydantic 2.x. The normalized schema may not match the output of any Pydantic version, but it will be functionally equivalent where possible.