Parameters
Spider arguments have some limitations:

- There is no standard way for spiders to indicate which parameters they expect or support.
- Since arguments can come from the command line, and command-line arguments are always strings, you have to write code to convert arguments to a different type when needed.
- If you want argument validation, such as making sure that arguments are of the right type or that required arguments are present, you must implement that validation yourself.
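Without a parameter framework, every spider ends up hand-rolling that conversion and validation logic. A minimal sketch of what such boilerplate looks like (the `parse_args` helper and its error messages are hypothetical, not part of any library):

```python
# Hypothetical hand-rolled argument handling that a spider would
# otherwise need to write for every typed argument it accepts.
def parse_args(raw_args: dict) -> dict:
    """Convert and validate raw string arguments."""
    if "pages" not in raw_args:
        raise ValueError("missing required argument: pages")
    try:
        pages = int(raw_args["pages"])
    except ValueError:
        raise ValueError(f"pages must be an integer, got {raw_args['pages']!r}")
    return {"pages": pages}
```

Multiply this by every argument and every spider, and the appeal of a declarative specification becomes clear.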
scrapy-spider-metadata removes those limitations.
Defining supported parameters
To define spider parameters, create a subclass of pydantic.BaseModel, and then make your spider also inherit from Args, with your parameter specification class as its parameter:
```python
from pydantic import BaseModel
from scrapy import Request, Spider

from scrapy_spider_metadata import Args


class MyParams(BaseModel):
    pages: int


class MySpider(Args[MyParams], Spider):
    name = "my_spider"

    def start_requests(self):
        for index in range(1, self.args.pages + 1):
            yield Request(f"https://books.toscrape.com/catalogue/page-{index}.html")
```
To learn how to define parameters in your pydantic.BaseModel subclass, see the Pydantic usage documentation.
Defined parameters make your spider:

- Halt with an exception if there are missing arguments or any provided argument does not match the defined parameter validation rules.

  Note: by default, extra arguments are silently ignored, but you can change that.

- Expose an instance of your parameter specification class, containing the parsed version of your spider arguments, e.g. converted to their expected type.
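One way to change how extra arguments are handled is through Pydantic's own extra setting on the model. A sketch assuming scrapy-spider-metadata honors the model configuration (Pydantic 2.x syntax):

```python
from pydantic import BaseModel, ConfigDict


class StrictParams(BaseModel):
    # Reject unknown arguments instead of silently ignoring them.
    model_config = ConfigDict(extra="forbid")

    pages: int
```

With this configuration, passing an argument not declared in the model raises a validation error instead of being dropped.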
For example, if you run the spider in the example above with the pages parameter set to the string "42", self.args.pages is the int value 42 (self.pages remains "42"). Also, if you do not pass a value for pages at all, the spider will not start, because pages is a required parameter. All parameters without a default value are considered required.
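To make a parameter optional, give it a default value in the model. A minimal sketch using Pydantic's Field, which also lets you attach validation rules (the ge=1 constraint here is an illustrative assumption, not something the library requires):

```python
from pydantic import BaseModel, Field


class PagedParams(BaseModel):
    # Optional: defaults to 1 when the argument is not passed.
    # ge=1 additionally rejects zero or negative page counts.
    pages: int = Field(default=1, ge=1)
```

A spider using this model starts even when pages is omitted, while still rejecting values that fail the constraint.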
Getting the parameter specification as JSON Schema
Given a spider class with defined parameters, you can get a JSON Schema representation of its parameter specification using the get_param_schema() class method:
```python
>>> MySpider.get_param_schema()
{'properties': {'pages': {'title': 'Pages', 'type': 'integer'}}, 'required': ['pages'], 'title': 'MyParams', 'type': 'object'}
```
scrapy-spider-metadata uses Pydantic to generate the JSON Schema, so your version of pydantic can affect the resulting output.
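Since the schema comes from Pydantic, you can preview roughly what it will contain without Scrapy by calling Pydantic's own schema generation on the model directly (a Pydantic 2.x sketch; the exact output shape can differ between Pydantic versions):

```python
from pydantic import BaseModel


class MyParams(BaseModel):
    pages: int


# Pydantic's native JSON Schema generation, which get_param_schema()
# builds on; a field without a default ends up in "required".
schema = MyParams.model_json_schema()
```

Here `schema["required"]` lists `"pages"` and `schema["properties"]["pages"]["type"]` is `"integer"`, matching the output shown above.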
Parameters API
- class scrapy_spider_metadata.Args(*args, **kwargs)

  Validates and type-converts spider arguments into the args instance attribute according to the spider parameter specification.

  - classmethod get_param_schema(normalize: bool = False) → Dict[Any, Any]

    Return a dict with the parameter definition as JSON Schema.

    If normalize is True, the returned schema will be the same regardless of whether you are using Pydantic 1.x or Pydantic 2.x. The normalized schema may not match the output of any Pydantic version, but it will be functionally equivalent where possible.