hanccc/crawler

This package is abandoned and no longer maintained. The author suggests using the hanccc/crawler package.

0.2.51 2016-06-05 08:19 UTC

This package is not auto-updated.

Last update: 2022-02-01 12:58:26 UTC


README

A package that makes it easy to crawl a website's list pages and detail pages.

Installation

composer require hanccc/crawler

Usage

This package requires Goutte. In both the list crawler and the detail crawler you can get the DOM through $this->crawler();

Example

        //or $listCrawler = new ExampleListCrawler(storage_path('logs'));
        $listCrawler = new ExampleListCrawler('http://example.com', storage_path('logs'));
        $listCrawler->setDetailCrawler(new ExampleDetailCrawler());
        $listCrawler->start();

List crawler

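class ExampleListCrawler extends ListCrawler
{
    public $url = 'http://example.com';

    // Return the URL of the given list page.
    public function getEachPageUrl($page)
    {
        return 'http://example.com/list?page=' . $page;
    }

    // Set the maximum number of list pages to crawl.
    public function setMaxPage()
    {
        // Placeholder value (an assumption, the original left $num undefined):
        // replace it with the real page count for your site.
        $num = 10;
        $this->maxPage = $num;
    }
}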
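To illustrate what getEachPageUrl() and the maximum page count amount to together, here is a minimal standalone sketch that builds one URL per page from 1 up to a limit. The buildPageUrls() function is hypothetical and not part of hanccc/crawler, and the assumption that pages are numbered from 1 is mine; the library's own iteration may differ.

```php
<?php
// Hypothetical helper, not part of hanccc/crawler: builds the list-page
// URLs for pages 1..$maxPage (assumes 1-based page numbering).
function buildPageUrls(string $baseUrl, int $maxPage): array
{
    $urls = [];
    for ($page = 1; $page <= $maxPage; $page++) {
        // http_build_query() encodes the query string safely.
        $urls[] = $baseUrl . '?' . http_build_query(['page' => $page]);
    }
    return $urls;
}

// Print the URLs for the first three list pages.
print_r(buildPageUrls('http://example.com/list', 3));
```

Building the query string with http_build_query() avoids hand-concatenating separators such as ? and &.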

Detail crawler

class ExampleDetailCrawler extends DetailCrawler
{
    // Return true if $url points to a detail page, false otherwise.
    public function isDetailUrl($url)
    {
        return preg_match('/example\.com\/id(\d+)/', $url) === 1;
    }

    // Process the detail page; here it simply prints the page title.
    public function handle()
    {
        echo $this->crawler->filter('title')->text();
    }
}
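The URL check in isDetailUrl() can be tried on its own. The sketch below uses a hypothetical matchDetailId() function (not part of hanccc/crawler) that applies the same pattern and also returns the captured numeric ID, which may be handy when storing results.

```php
<?php
// Hypothetical standalone helper, not part of hanccc/crawler.
// Returns the numeric ID from a detail URL, or null if the URL does not match.
function matchDetailId(string $url): ?int
{
    // The dot is escaped so that "example.com" is matched literally.
    if (preg_match('/example\.com\/id(\d+)/', $url, $matches) === 1) {
        return (int) $matches[1];
    }
    return null;
}

var_dump(matchDetailId('http://example.com/id42'));  // int(42)
var_dump(matchDetailId('http://example.com/about')); // NULL
```

Comparing the result of preg_match() with 1 keeps the match case separate from both "no match" (0) and a pattern error (false).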

License

Crawler is open-source software licensed under the MIT license.