The most powerful, most popular, production-ready crawling/scraping package for PHP
v1.0.0
2019-12-11 02:05 UTC
Requires
- php: >=7.0
- ext-json: *
- amphp/amp: ^2.4
- amphp/artax: ^3.0
- psr/log: ^1.0.1
- symfony/css-selector: ^5.0
- symfony/dom-crawler: ^5.0
Requires (Dev)
- monolog/monolog: ^2.0
This package is auto-updated.
Last update: 2024-09-11 15:03:48 UTC
README
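The most powerful, most popular, production-ready crawling/scraping package for PHP. Happy hacking :)
Features
- Server-side DOM with automatic DomParser injection, using Symfony\Component\DomCrawler
- Configurable pool size and number of retries
- Rate limit control
- forceUTF8 mode, which lets the crawler handle charset detection and conversion
- Compatible with PHP 7.2
Thanks to
- Amp: a non-blocking concurrency framework for PHP
- Artax: an asynchronous HTTP client for PHP
- node-crawler: the most powerful, most popular, production-ready crawling/scraping package for Node
node-crawler is a really great crawler. PHPCrawler tries its best to stay similar to it.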
Getting Started
Installation
$ composer require "coooold/crawler"
Basic usage
use PHPCrawler\PHPCrawler;
use PHPCrawler\Response;
use Symfony\Component\DomCrawler\Crawler;

$logger = new Monolog\Logger("fox");
try {
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));
} catch (\Exception $e) {
    // Handler setup failed; continue without it.
}

$crawler = new PHPCrawler([
    'maxConnections' => 2,
    'domParser' => true,
    'timeout' => 3000,
    'retries' => 3,
    'logger' => $logger,
]);

$crawler->on('response', function (Response $res) {
    // Skip failed requests.
    if (!$res->success) {
        return;
    }

    $title = $res->dom->filter("title")->html();
    echo ">>> title: {$title}\n";

    // Print the text of every related link on the page.
    $res->dom
        ->filter('.related-item a')
        ->each(function (Crawler $link) {
            echo ">>> links: ", $link->text(), "\n";
        });
});

$crawler->queue('https://www.foxnews.com/');
$crawler->run();
Slow down
Use rateLimit to slow down when visiting web sites.
$crawler = new PHPCrawler([
    'maxConnections' => 10,
    'rateLimit' => 2, // reqs per second
    'domParser' => true,
    'timeout' => 30000,
    'retries' => 3,
    'logger' => $logger,
]);

for ($page = 1; $page <= 100; $page++) {
    $crawler->queue([
        'uri' => "http://www.qbaobei.com/jiaoyu/gshb/List_{$page}.html",
        'type' => 'list',
    ]);
}

$crawler->run(); // between two tasks, the average time gap is 1000 / 2 (ms)
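The closing comment above ties rateLimit to the spacing between tasks. A minimal sketch of that arithmetic, independent of PHPCrawler, is given here; the function name averageGapMs is hypothetical and not part of the library.

```php
<?php
declare(strict_types=1);

/**
 * Average delay between two tasks, in milliseconds, for a given
 * rate limit in requests per second. Hypothetical helper for
 * illustration only; it is not part of PHPCrawler.
 */
function averageGapMs(float $requestsPerSecond): float
{
    if ($requestsPerSecond <= 0) {
        throw new InvalidArgumentException('rate limit must be positive');
    }
    return 1000 / $requestsPerSecond;
}

// With 'rateLimit' => 2, tasks are spaced about 500 ms apart on average,
// so the 100 queued pages take roughly 50 seconds in total.
$gap = averageGapMs(2);
$total = $gap * 100 / 1000;
echo "gap: {$gap} ms, 100 pages: ~{$total} s\n";
```

Under this reading, maxConnections caps parallelism while rateLimit caps throughput, so raising maxConnections alone would not shorten the run.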
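Custom parameters
Sometimes you need to access variables from a previous request/response session. To do so, pass them as options when queuing the task:
$crawler->queue([
    'uri' => 'http://www.google.com',
    'parameter1' => 'value1',
    'parameter2' => 'value2',
]);
Then access them in the callback via $res->task['parameter1'], $res->task['parameter2'], and so on.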
Raw body
If you are downloading files such as images, PDFs, or Word documents, you must save the raw response body, which means the crawler should not convert it to a string. To do so, set encoding to null.
$crawler = new PHPCrawler([
    'maxConnections' => 10,
    'rateLimit' => 2, // reqs per second
    'domParser' => false,
    'timeout' => 30000,
    'retries' => 3,
    'logger' => $logger,
]);

$crawler->on('response', function (Response $res, PHPCrawler $crawler) {
    if (!$res->success) {
        return;
    }

    // The target file name travels with the task as a custom parameter.
    echo "write " . $res->task['fileName'] . "\n";
    file_put_contents($res->task['fileName'], $res->body);
});

$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60881.txt.utf-8",
    'fileName' => '60881.txt',
]);
$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60882.txt.utf-8",
    'fileName' => '60882.txt',
]);
$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60883.txt.utf-8",
    'fileName' => '60883.txt',
]);

$crawler->run();
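To illustrate why binary downloads must skip charset conversion, here is a small standalone sketch using only PHP built-ins. It does not use PHPCrawler, and the ISO-8859-1 source charset is an assumption chosen for the demonstration.

```php
<?php
declare(strict_types=1);

// First 8 bytes of every PNG file (the PNG signature).
$raw = "\x89PNG\r\n\x1a\n";

// Treating the bytes as ISO-8859-1 text and converting to UTF-8
// rewrites byte 0x89 as the two bytes 0xC2 0x89, corrupting the file.
$converted = iconv('ISO-8859-1', 'UTF-8', $raw);

// Writing the untouched bytes keeps the file intact.
$path = tempnam(sys_get_temp_dir(), 'raw');
file_put_contents($path, $raw);
$roundTrip = file_get_contents($path);
unlink($path);

echo 'raw:       ', bin2hex($raw), "\n";
echo 'converted: ', bin2hex($converted), "\n";
echo 'file intact: ', var_export($roundTrip === $raw, true), "\n";
```

PHP strings are plain byte sequences, so file_put_contents writes them unchanged; the damage only comes from a conversion step applied before the write.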
Events
Event::RESPONSE
Triggered when a request is done.
$crawler->on('response', function (Response $res, PHPCrawler $crawler) {
    if (!$res->success) {
        return;
    }
});
Event::DRAIN
Triggered when the queue is empty.
$crawler->on('drain', function () {
    echo "queue is drained\n";
});
Advanced
Encoding
The HTTP body is converted to UTF-8 from the configured default encoding.
$crawler = new PHPCrawler([
    'encoding' => 'gbk',
]);
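As a rough picture of what such a conversion involves, the sketch below converts a GBK byte string to UTF-8 with PHP's built-in iconv. The helper toUtf8 is hypothetical; PHPCrawler's own conversion may be implemented differently.

```php
<?php
declare(strict_types=1);

/**
 * Convert a response body from a source charset to UTF-8.
 * Hypothetical helper; PHPCrawler's own conversion may differ.
 */
function toUtf8(string $body, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $body);
    if ($converted === false) {
        throw new RuntimeException("cannot convert from {$sourceEncoding}");
    }
    return $converted;
}

// Simulate a GBK page by encoding a UTF-8 string as GBK first.
$utf8 = '中文';
$gbkBody = iconv('UTF-8', 'GBK', $utf8);

// GBK uses 2 bytes per Chinese character, UTF-8 uses 3.
$decoded = toUtf8($gbkBody, 'GBK');
echo strlen($gbkBody), ' bytes as GBK, ', strlen($decoded), " bytes as UTF-8\n";
```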
Logging
A PSR logger instance can be used.
$logger = new Monolog\Logger("fox");
$logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));

$crawler = new PHPCrawler([
    'logger' => $logger,
]);
See the Monolog Reference.
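Coroutines
Built on the Amp non-blocking concurrency framework, PHPCrawler works with coroutines, which keeps performance high. Use Amp async packages inside callbacks; that is, PHP's native MySQL client and native file I/O are not recommended there. The yield keyword, similar to await in JavaScript, introduces non-blocking I/O.
// Assumed setup: the original snippet does not show how $cli is created.
$cli = new \Amp\Artax\DefaultClient();

$crawler->on('response', function (Response $res) use ($cli) {
    // yield suspends the callback until the request resolves.
    /** @var \Amp\Artax\Response $page */
    $page = yield $cli->request("https://www.foxnews.com/politics/lindsey-graham-adam-schiff-is-doing-a-lot-of-damage-to-the-country-and-he-needs-to-stop");
    $body = yield $page->getBody();
    echo "=======> body " . strlen($body) . " bytes \n";
});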
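For readers new to generator-based coroutines, the toy driver below shows how a yield expression receives its result. It is a synchronous stand-in written with plain PHP generators, not Amp's event loop; runCoroutine and the fake fetch callables are hypothetical names used only for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Drive a generator to completion. Each yielded value is a callable
 * standing in for a pending operation; its result is sent back in as
 * the value of the yield expression. A real event loop such as Amp
 * resolves promises asynchronously; this toy runs everything in order.
 */
function runCoroutine(Generator $coroutine)
{
    while ($coroutine->valid()) {
        $pending = $coroutine->current();
        $coroutine->send($pending());
    }
    return $coroutine->getReturn();
}

// Hypothetical callback body: "fetch" a page, then measure it.
$handler = function (string $url): Generator {
    $body = yield function () use ($url) {
        return "<html>fake body for {$url}</html>";
    };
    $length = yield function () use ($body) {
        return strlen($body);
    };
    return $length;
};

$bytes = runCoroutine($handler('https://example.com/'));
echo "=======> body {$bytes} bytes\n";
```

The callback reads top to bottom like blocking code, while the driver decides when each step resumes. That separation is what lets an event loop interleave many callbacks.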
Working with DomParser
Symfony\Component\DomCrawler is a handy tool for crawling pages. A Symfony\Component\DomCrawler\Crawler instance is injected into Response::dom.
$crawler->on('response', function (Response $res) {
    if (!$res->success) {
        return;
    }

    $title = $res->dom->filter("title")->html();
    echo ">>> title: {$title}\n";

    // Print the text of every related link on the page.
    $res->dom
        ->filter('.related-item a')
        ->each(function (Crawler $link) {
            echo ">>> links: ", $link->text(), "\n";
        });
});
See the DomCrawler Reference.