Item

An Item is the key to the whole system: it determines what the result contains and where the result comes from.

from toapi import XPath, Item

class MovieList(Item):
    __base_url__ = 'http://www.dy2018.com'

    url = XPath('//b//a[@class="ulink"]/@href')
    title = XPath('//b//a[@class="ulink"]/text()')

    class Meta:
        source = XPath('//table[@class="tbspan"]')
        route = {'/movies/?page=1': '/html/gndy/dyzz/',
                 '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html',
                 '/movies/': '/html/gndy/dyzz/'}

When you visit http://127.0.0.1:/movies/?page=2, you get the items parsed from http://www.dy2018.com/html/gndy/dyzz/index_2.html

As you can see, the fields of an item are selector instances, and the Meta class determines the basic attributes of the item.

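  • Meta.source: A section of the HTML page that should contain exactly one complete item. It is a selector instance.
  • Meta.route: A mapping from URL path patterns of your API to the matching paths on the source site. Placeholders such as :page are carried over from the API path to the source path.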
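
The route mapping described above can be illustrated with a small standalone sketch. It does not use toapi and is not how toapi resolves routes internally; the resolve function and the ROUTE name are hypothetical, and the matching rule (first pattern that matches the whole path wins, each :name captures one value) is an assumption made for the illustration.

```python
import re

# Hypothetical route table, copied from the MovieList example.
ROUTE = {
    '/movies/?page=1': '/html/gndy/dyzz/',
    '/movies/?page=:page': '/html/gndy/dyzz/index_:page.html',
    '/movies/': '/html/gndy/dyzz/',
}


def resolve(path, route=ROUTE):
    """Return the source-site path for an API path, or None if nothing matches.

    Hypothetical helper, not part of toapi.
    """
    for pattern, target in route.items():
        # re.split with a capturing group yields alternating
        # literal text and placeholder names.
        parts = re.split(r':(\w+)', pattern)
        regex = ''.join(
            re.escape(part) if index % 2 == 0 else '(?P<%s>[^/&]+)' % part
            for index, part in enumerate(parts)
        )
        match = re.fullmatch(regex, path)
        if match:
            # Copy each captured value into the target template.
            for name, value in match.groupdict().items():
                target = target.replace(':' + name, value)
            return target
    return None


source_path = resolve('/movies/?page=2')
```

Under these assumptions, source_path is the path that gets appended to __base_url__ to fetch the page, and a path with no matching pattern yields None.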

Clean Data

A clean_{field} method on an item further processes the value extracted for that field. For example:

from toapi import XPath, Item

class Post(Item):
    url = XPath('//a[@class="storylink"]/@href')
    title = XPath('//a[@class="storylink"]/text()')

    class Meta:
        source = XPath('//tr[@class="athing"]')
        route = {'/':'/'}

    def clean_url(self, url):
        return 'http://127.0.0.1%s' % url
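
To make the clean_{field} convention concrete, here is a minimal sketch of how such hooks could be looked up and applied by name. It is an assumption about the general pattern, not toapi's actual implementation; SketchItem and apply_clean are hypothetical names, and the raw dictionary stands in for values a selector would have extracted.

```python
class SketchItem:
    """Hypothetical stand-in for an item; not toapi's Item class."""

    def apply_clean(self, raw):
        """Run clean_<field> on each extracted value when such a method exists."""
        cleaned = {}
        for field, value in raw.items():
            hook = getattr(self, 'clean_%s' % field, None)
            # Fields without a clean_<field> method pass through unchanged.
            cleaned[field] = hook(value) if callable(hook) else value
        return cleaned


class Post(SketchItem):
    def clean_url(self, url):
        # Same processing as the Post example: prefix the host.
        return 'http://127.0.0.1%s' % url


result = Post().apply_clean({'url': '/item?id=1', 'title': 'Hello'})
```

In this sketch only url has a hook, so title is returned as extracted while url gains the host prefix.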