matheusfaustino / phrawl
The latest version of this package (dev-master) does not provide license information.
PHP web crawler framework
dev-master
2018-12-19 01:18 UTC
Requires
- php: ^7.2
- ext-json: *
- fabpot/goutte: ^3.2
- guzzlehttp/guzzle: ~6.0
- monolog/monolog: ^1.23
This package is auto-updated.
Last update: 2024-09-12 11:18:03 UTC
README
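Phrawl is a library that aims to be a web crawler framework for PHP. It brings the best libraries of the PHP ecosystem together in one place to help anyone who wants to crawl the web. It is inspired by Scrapy.
It is a work in progress, so some features are still missing, such as integration with a WebDriver (for example Selenium) and asynchronous I/O.
Installation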
$ composer global require matheusfaustino/phrawl:dev-master
Example
$ phrawl <<EOF
<?php
class SpiderStackOverflow extends Phrawl\BaseCrawler
{
    public \$name = 'stackoverflow';
    protected \$configs = ['concurrency' => 5];
    public \$start_urls = [
        'https://stackoverflow.com/questions/10720325/selenium-webdriver-wait-for-complex-page-with-javascriptjs-to-load',
        'https://stackoverflow.com/questions/9291898/selenium-wait-for-javascript-function-to-execute-before-continuing',
        'https://stackoverflow.com/questions/23050430/does-selenium-wait-for-javascript-to-complete',
        'https://stackoverflow.com/questions/835501/how-do-you-stash-an-untracked-file',
        'https://stackoverflow.com/questions/5355121/passing-dict-to-constructor/5355152#5355152',
    ];

    // Receives the response for each start URL.
    public function parser(\Phrawl\Response \$response)
    {
        \$crawler = \$response->getCrawler();
        printf(
            "Title: %s \nQuestion: %s \n\n",
            \$crawler->filterXPath('//title')->text(),
            \$crawler->filterXPath('//a[@class="question-hyperlink"]')->text()
        );

        // Pick the first matching question link and request it, handled by first().
        \$url = \$crawler->evaluate('//div[contains(@class, "module")]/div/div/a[@class="question-hyperlink"]')->first();
        \$url = \$url->getBaseHref().\$url->attr('href');
        yield new \Phrawl\Request(\$url, 'first');
    }

    // Callback named in the Request yielded above.
    public function first(\Phrawl\Response \$response)
    {
        printf("First method -- Title: %s \n\n", \$response->getCrawler()->filterXPath('//title')->text());
    }
}
EOF
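In the example, parser() yields a \Phrawl\Request whose second argument ('first') names the method that should handle the follow-up response. How Phrawl dispatches this internally is not shown here. The following is only a minimal, offline sketch of the general pattern (a queue that routes each response to a named callback and enqueues whatever the callback yields). SketchRequest, SketchSpider, runSpider, and the fake fetcher are hypothetical names for illustration and are not part of Phrawl.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a request: a URL plus the name of the handler method.
final class SketchRequest
{
    public $url;
    public $callback;

    public function __construct(string $url, string $callback = 'parser')
    {
        $this->url = $url;
        $this->callback = $callback;
    }
}

// Hypothetical spider: parser() yields a follow-up request that first() handles.
final class SketchSpider
{
    public $visited = [];

    public function parser(string $body): \Generator
    {
        $this->visited[] = 'parser:' . $body;
        yield new SketchRequest('https://example.com/next', 'first');
    }

    public function first(string $body): void
    {
        $this->visited[] = 'first:' . $body;
    }
}

// Drains a FIFO queue, calling the named handler and enqueueing anything it yields.
function runSpider(SketchSpider $spider, array $startUrls, callable $fetch): void
{
    $queue = [];
    foreach ($startUrls as $url) {
        $queue[] = new SketchRequest($url);
    }

    while ($queue !== []) {
        $request = array_shift($queue);
        $result = $spider->{$request->callback}($fetch($request->url));

        // A handler may return nothing, or a generator of follow-up requests.
        if ($result instanceof \Traversable) {
            foreach ($result as $next) {
                if ($next instanceof SketchRequest) {
                    $queue[] = $next;
                }
            }
        }
    }
}

// Fake fetcher so the sketch runs without network access (assumption for the demo).
$fetch = function (string $url): string {
    return 'body of ' . $url;
};

$spider = new SketchSpider();
runSpider($spider, ['https://example.com/start'], $fetch);
print_r($spider->visited);
```

This sketch processes requests one at a time. A setting such as the 'concurrency' => 5 entry in the example would need a concurrent HTTP client in place of the fake fetcher.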
I will add more examples soon...