The latest version of this package (dev-master) does not provide license information.

PHP web crawler framework

dev-master 2018-12-19 01:18 UTC

This package is auto-updated.

Last update: 2024-09-12 11:18:03 UTC


README

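Phrawl is a library that aims to be a web crawler framework for PHP. Its goal is to bring the best PHP libraries together in one place to help anyone who wants to crawl the web. It is inspired by Scrapy.

It is a work in progress, so some features are still missing, such as integration with a webdriver (for example Selenium) and asynchronous I/O.

Installation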

$ composer global require matheusfaustino/phrawl:dev-master

Example

$ phrawl <<EOF
<?php
class SpiderStackOverflow extends Phrawl\BaseCrawler
{
    public \$name = 'stackoverflow';
    protected \$configs = ['concurrency' => 5];
    public \$start_urls = [
        'https://stackoverflow.com/questions/10720325/selenium-webdriver-wait-for-complex-page-with-javascriptjs-to-load',
        'https://stackoverflow.com/questions/9291898/selenium-wait-for-javascript-function-to-execute-before-continuing',
        'https://stackoverflow.com/questions/23050430/does-selenium-wait-for-javascript-to-complete',
        'https://stackoverflow.com/questions/835501/how-do-you-stash-an-untracked-file',
        'https://stackoverflow.com/questions/5355121/passing-dict-to-constructor/5355152#5355152',
    ];
    public function parser(\Phrawl\Response \$response)
    {
        \$crawler = \$response->getCrawler();
        printf(
            "Title: %s \nQuestion: %s \n\n",
            \$crawler->filterXPath('//title')->text(),
            \$crawler->filterXPath('//a[@class="question-hyperlink"]')->text()
        );
        \$url = \$crawler->evaluate('//div[contains(@class, "module")]/div/div/a[@class="question-hyperlink"]')->first();
        \$url = \$url->getBaseHref().\$url->attr('href');
        yield new \Phrawl\Request(\$url, 'first');
    }
    public function first(\Phrawl\Response \$response)
    {
        printf("First method -- Title: %s \n\n", \$response->getCrawler()->filterXPath('//title')->text());
    }
}
EOF
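
The example above follows a Scrapy-style pattern: a parse method yields a request that names the method which should handle the next response. Below is a minimal standalone sketch of that dispatch loop in plain PHP. It is not Phrawl's actual implementation. `SketchRequest`, `SketchSpider`, and `runSpider` are hypothetical names invented for illustration, and the pages come from an in-memory array instead of HTTP.

```php
<?php
// Hypothetical stand-in for a request: a URL plus the name of the
// spider method that should handle its response.
final class SketchRequest
{
    public $url;
    public $callback;

    public function __construct(string $url, string $callback = 'parser')
    {
        $this->url = $url;
        $this->callback = $callback;
    }
}

// Hypothetical spider: 'parser' handles the start URLs and hands one
// follow-up page to 'first', mirroring the shape of the example above.
final class SketchSpider
{
    public $start_urls = ['page-a'];
    public $seen = [];

    public function parser(string $body): \Generator
    {
        $this->seen[] = "parser: $body";
        yield new SketchRequest('page-b', 'first');
    }

    public function first(string $body): void
    {
        $this->seen[] = "first: $body";
    }
}

// Drain a queue of requests: call the named method with the page body
// and enqueue every request that the method yields.
function runSpider(SketchSpider $spider, array $pages): void
{
    $queue = [];
    foreach ($spider->start_urls as $url) {
        $queue[] = new SketchRequest($url);
    }
    while ($queue) {
        $request = array_shift($queue);
        $result = $spider->{$request->callback}($pages[$request->url] ?? '');
        if ($result instanceof \Traversable) {
            foreach ($result as $next) {
                $queue[] = $next;
            }
        }
    }
}

$spider = new SketchSpider();
runSpider($spider, ['page-a' => 'Hello', 'page-b' => 'World']);
print_r($spider->seen);
```

Because `parser` is a generator, its body only runs once the loop iterates over the result. That is why the sketch checks for `\Traversable` rather than assuming every callback yields.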

I will add more examples soon...