Item
Item is the key to the whole system: it determines what the result is and where it comes from.
    from toapi import XPath, Item

    class MovieList(Item):
        __base_url__ = 'http://www.dy2018.com'
        url = XPath('//b//a[@class="ulink"]/@href')
        title = XPath('//b//a[@class="ulink"]/text()')

        class Meta:
            source = XPath('//table[@class="tbspan"]')
            route = {'/movies/?page=1': '/html/gndy/dyzz/',
                     '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html',
                     '/movies/': '/html/gndy/dyzz/'}
When you visit http://127.0.0.1:/movies/?page=2, you get the items parsed from http://www.dy2018.com/html/gndy/dyzz/index_2.html
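As you can see, the fields of an item are selector instances, and the Meta class defines the basic attributes of the item:
- Meta.source: A section of the HTML that should contain one complete item. It is a selector instance.
- Meta.route: A mapping from the URL path patterns served by the API to the corresponding paths on the source site. A placeholder such as :page is carried over from the requested path into the source path.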
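To make the route mapping concrete, here is a minimal sketch of how a requested path could be resolved to a source path. It is not toapi's actual implementation: the `resolve` helper and its regex-based placeholder handling are assumptions made for illustration, and only the route table is taken from the MovieList example.

```python
import re

# Route table copied from the MovieList example: API path pattern -> source path.
ROUTE = {
    '/movies/?page=1': '/html/gndy/dyzz/',
    '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html',
    '/movies/': '/html/gndy/dyzz/',
}


def resolve(path, route=ROUTE):
    """Hypothetical helper: map a requested API path to a source-site path."""
    for pattern, target in route.items():
        # Collect placeholder names such as "page" from ":page".
        names = re.findall(r':(\w+)', pattern)
        # Escape the literal parts, then turn each :name into a named group.
        regex = re.escape(pattern)
        for name in names:
            regex = regex.replace(re.escape(':' + name), '(?P<%s>[^/&]+)' % name)
        match = re.fullmatch(regex, path)
        if match:
            # Substitute the captured values into the source path.
            for name, value in match.groupdict().items():
                target = target.replace(':' + name, value)
            return target
    return None  # No route matches this path.
```

Under these assumptions, `resolve('/movies/?page=2')` yields the source path used in the example above, and a path with no matching pattern yields `None`. Patterns are tried in the order they are written, so the more specific `?page=1` entry takes precedence over the `:page` placeholder.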
Clean Data
The clean_{field} method of an item instance is for further processing of the value extracted for that field. For example:
    from toapi import XPath, Item

    class Post(Item):
        url = XPath('//a[@class="storylink"]/@href')
        title = XPath('//a[@class="storylink"]/text()')

        class Meta:
            source = XPath('//tr[@class="athing"]')
            route = {'/': '/'}

        def clean_url(self, url):
            return 'http://127.0.0.1%s' % url
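The naming convention can be illustrated with a small sketch of how clean_{field} methods might be looked up and applied. This is an assumption about the mechanism, not toapi's own code: `apply_clean_methods` is a hypothetical helper, and the `Post` class here is a plain stand-in so the snippet runs without toapi installed.

```python
def apply_clean_methods(item, raw):
    """Hypothetical helper: run clean_<field> on each raw value if the item defines it."""
    cleaned = {}
    for field, value in raw.items():
        # Look up a method named after the field, e.g. clean_url for "url".
        cleaner = getattr(item, 'clean_%s' % field, None)
        cleaned[field] = cleaner(value) if callable(cleaner) else value
    return cleaned


class Post:
    # Stand-in for the Post item above; only the clean method matters here.
    def clean_url(self, url):
        return 'http://127.0.0.1%s' % url


# Sample raw values standing in for what the selectors would extract.
result = apply_clean_methods(Post(), {'url': '/item?id=1', 'title': 'Hello'})
```

In this sketch the `url` value is passed through `clean_url`, while `title` is returned unchanged because no `clean_title` method exists.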