tallesairan / crawler
The most powerful and popular production-grade crawling/scraping package for PHP
1.0
2023-08-15 16:50 UTC
Requires
- php: >=7.0
- ext-json: *
- amphp/amp: ^2.4
- amphp/artax: ^3.0
- psr/log: ^3.0.0
- symfony/css-selector: ^6.3
- symfony/dom-crawler: ^6.3
Requires (Dev)
- monolog/monolog: ^3.3
This package is auto-updated.
Last update: 2024-09-15 18:59:10 UTC
README
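The most powerful and popular production-grade crawling/scraping package for PHP. Happy hacking :)
Features
- Server-side DOM and automatic DomParser injection with Symfony\Component\DomCrawler
- Configurable pool size and retries
- Rate limit control
- forceUTF8 mode, which lets the crawler handle charset detection and conversion for you
- Compatible with PHP 7.2
Thanks to
- Amp: a non-blocking concurrency framework for PHP
- Artax: an asynchronous HTTP client for PHP
- node-crawler: the most powerful and popular production-grade crawling/scraping package for Node
node-crawler is a really great crawler. PHPCrawler tries its best to stay similar to it.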
Getting started
Installation
$ composer require "coooold/crawler"
Basic usage
    // Assumes Composer's default autoloader location.
    require __DIR__ . '/vendor/autoload.php';

    use PHPCrawler\PHPCrawler;
    use PHPCrawler\Response;
    use Symfony\Component\DomCrawler\Crawler;

    // Log to stdout at INFO level.
    $logger = new Monolog\Logger("fox");
    try {
        $logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));
    } catch (\Exception $e) {
        // Continue without a log handler if the stream cannot be opened.
    }

    $crawler = new PHPCrawler([
        'maxConnections' => 2,
        'domParser' => true,
        'timeout' => 3000,
        'retries' => 3,
        'logger' => $logger,
    ]);

    $crawler->on('response', function (Response $res) {
        // Skip failed requests.
        if (!$res->success) {
            return;
        }

        $title = $res->dom->filter("title")->html();
        echo ">>> title: {$title}\n";

        // Print the text of every related link on the page.
        $res->dom
            ->filter('.related-item a')
            ->each(function (Crawler $crawler) {
                echo ">>> links: ", $crawler->text(), "\n";
            });
    });

    $crawler->queue('https://www.foxnews.com/');
    $crawler->run();
Slow down
Use rateLimit to slow down when you are visiting web sites.
    $crawler = new PHPCrawler([
        'maxConnections' => 10,
        'rateLimit' => 2,   // requests per second
        'domParser' => true,
        'timeout' => 30000,
        'retries' => 3,
        'logger' => $logger,
    ]);

    for ($page = 1; $page <= 100; $page++) {
        $crawler->queue([
            'uri' => "http://www.qbaobei.com/jiaoyu/gshb/List_{$page}.html",
            'type' => 'list',
        ]);
    }

    $crawler->run(); // between two tasks, the average time gap is 1000 / 2 (ms)
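The closing comment in the example above gives the average gap as 1000 / rateLimit milliseconds. The stand-alone sketch below works through that arithmetic for the 100 queued pages. The helper names `averageGapMs` and `minimumQueueTimeSeconds` are hypothetical and not part of PHPCrawler, and the estimate assumes evenly spaced requests and ignores response time and retries.

```php
<?php
// Hypothetical helper (not part of PHPCrawler): average gap in milliseconds
// between two tasks for a given rateLimit in requests per second.
function averageGapMs(float $rateLimit): float
{
    if ($rateLimit <= 0) {
        throw new InvalidArgumentException('rateLimit must be positive');
    }
    return 1000 / $rateLimit;
}

// Hypothetical helper: rough lower bound, in seconds, on the time needed to
// start every task when requests are evenly spaced (one gap between each pair).
function minimumQueueTimeSeconds(int $tasks, float $rateLimit): float
{
    return max(0, $tasks - 1) * averageGapMs($rateLimit) / 1000;
}

echo averageGapMs(2), " ms\n";                 // 500 ms
echo minimumQueueTimeSeconds(100, 2), " s\n";  // 49.5 s
```

Under these assumptions, a rateLimit of 2 means the 100 list pages take at least about 50 seconds to dispatch, regardless of maxConnections.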
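Custom parameters
Sometimes you need to access variables from a previous request/response session. To do so, pass them as parameters in the same way as options:
    $crawler->queue([
        'uri' => 'http://www.google.com',
        'parameter1' => 'value1',
        'parameter2' => 'value2',
    ]);
Then access them in the callback via $res->task['parameter1'], $res->task['parameter2'], and so on.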
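One common use of custom parameters is to route responses by a `type` value, like the `'type' => 'list'` parameter queued in the rate limit example. The sketch below is stand-alone and does not call PHPCrawler: a plain array stands in for `$res->task`, and `dispatchTask` and the handler map are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical handlers keyed by the custom 'type' parameter of a task.
$handlers = [
    'list' => function (array $task) {
        return "list page: {$task['uri']}";
    },
    'detail' => function (array $task) {
        return "detail page: {$task['uri']}";
    },
];

// Hypothetical helper: pick a handler based on the task's custom 'type'.
// Inside a real response callback, $res->task would take the place of $task.
function dispatchTask(array $handlers, array $task): string
{
    $type = $task['type'] ?? 'unknown';
    if (!isset($handlers[$type])) {
        return "no handler for type '{$type}'";
    }
    return $handlers[$type]($task);
}

echo dispatchTask($handlers, [
    'uri' => 'http://www.qbaobei.com/jiaoyu/gshb/List_1.html',
    'type' => 'list',
]), "\n";
```

Keeping the routing key in the task itself means a single response callback can serve several kinds of pages.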
Raw body
If you are downloading files such as images, PDFs, or Word documents, you have to save the raw response body, which means Crawler should not convert it to a string. To do so, set encoding to null.
    $crawler = new PHPCrawler([
        'maxConnections' => 10,
        'rateLimit' => 2,   // requests per second
        'domParser' => false,
        'encoding' => null, // keep the raw body, as described above
        'timeout' => 30000,
        'retries' => 3,
        'logger' => $logger,
    ]);

    $crawler->on('response', function (Response $res, PHPCrawler $crawler) {
        if (!$res->success) {
            return;
        }

        // Write the untouched body to the file name passed as a custom parameter.
        echo "write " . $res->task['fileName'] . "\n";
        file_put_contents($res->task['fileName'], $res->body);
    });

    $crawler->queue([
        'uri' => "http://www.gutenberg.org/ebooks/60881.txt.utf-8",
        'fileName' => '60881.txt',
    ]);

    $crawler->queue([
        'uri' => "http://www.gutenberg.org/ebooks/60882.txt.utf-8",
        'fileName' => '60882.txt',
    ]);

    $crawler->queue([
        'uri' => "http://www.gutenberg.org/ebooks/60883.txt.utf-8",
        'fileName' => '60883.txt',
    ]);

    $crawler->run();
Events
Event::RESPONSE
Triggered when a request is done.
    $crawler->on('response', function (Response $res, PHPCrawler $crawler) {
        if (!$res->success) {
            return;
        }
    });
Event::DRAIN
Triggered when the queue is empty.
    $crawler->on('drain', function () {
        echo "queue is drained\n";
    });
Advanced
Encoding
The HTTP body is converted to UTF-8 from the configured default encoding.
    $crawler = new PHPCrawler([
        'encoding' => 'gbk',
    ]);
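To make the conversion concrete, the stand-alone sketch below converts a GBK-encoded byte string to UTF-8 with PHP's `mb_convert_encoding`. It assumes the mbstring extension is installed and is not a description of how PHPCrawler performs the conversion internally. The helper name `toUtf8` is hypothetical.

```php
<?php
// Hypothetical helper (not part of PHPCrawler): convert a response body
// from its source charset to UTF-8. Assumes ext-mbstring is available.
function toUtf8(string $body, string $sourceEncoding): string
{
    return mb_convert_encoding($body, 'UTF-8', $sourceEncoding);
}

$gbkBody = "\xC4\xE3\xBA\xC3";          // the two characters "你好" encoded as GBK
$utf8Body = toUtf8($gbkBody, 'GBK');

echo $utf8Body, "\n";
echo strlen($gbkBody), ' -> ', strlen($utf8Body), " bytes\n"; // 4 -> 6 bytes
```

The byte length changes because GBK stores each of these characters in two bytes while UTF-8 uses three, which is also why binary downloads should skip the conversion.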
Logger
Any PSR logger instance can be used.
    $logger = new Monolog\Logger("fox");
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));

    $crawler = new PHPCrawler([
        'logger' => $logger,
    ]);
See the Monolog reference.
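Coroutine
PHPCrawler is built on the Amp non-blocking concurrency framework and works with coroutines, which keeps performance high. Use Amp's async packages inside callbacks; the native PHP MySQL client and native PHP file I/O are not recommended because they block. The yield keyword, like await in JavaScript (ES2017), is what makes the I/O non-blocking.
    // Assumption: $cli is an Artax HTTP client created outside the callback.
    $cli = new \Amp\Artax\DefaultClient();

    $crawler->on('response', function (Response $res) use ($cli) {
        // Fetch a second page without blocking the event loop.
        /** @var \Amp\Artax\Response $page */
        $page = yield $cli->request("https://www.foxnews.com/politics/lindsey-graham-adam-schiff-is-doing-a-lot-of-damage-to-the-country-and-he-needs-to-stop");
        $body = yield $page->getBody();
        echo "=======> body " . strlen($body) . " bytes \n";
    });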
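For readers new to generator-based coroutines, the sketch below shows the mechanism in plain PHP: each `yield` hands a pending value to a driver, which resumes the generator with the result. It is a deliberately simplified, synchronous stand-in and not Amp's scheduler. `runCoroutine` is a hypothetical name, and the closure that builds a 42-byte string merely simulates an I/O operation.

```php
<?php
// Hypothetical mini-driver (not Amp): resumes a generator with the result
// of whatever it yielded, one step at a time.
function runCoroutine(Generator $gen)
{
    while ($gen->valid()) {
        $pending = $gen->current();   // the value handed over by `yield`
        // A closure stands in for a pending operation; run it to "resolve" it.
        $resolved = $pending instanceof Closure ? $pending() : $pending;
        $gen->send($resolved);        // resume the callback with the result
    }
    return $gen->getReturn();
}

// A callback written in the same style as the crawler callbacks above.
$callback = function () {
    // Simulated I/O: the driver runs the closure and sends back its result.
    $body = yield function () {
        return str_repeat('x', 42);
    };
    $length = yield strlen($body);    // plain values are sent back unchanged
    return "body {$length} bytes";
};

echo runCoroutine($callback()), "\n"; // body 42 bytes
```

A real event loop differs in that it can switch to other coroutines while an operation is pending, which is where the concurrency comes from.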
Work with DomParser
Symfony\Component\DomCrawler is a handy tool for scraping pages. A Symfony\Component\DomCrawler\Crawler instance is injected into Response::dom.
    $crawler->on('response', function (Response $res) {
        if (!$res->success) {
            return;
        }

        $title = $res->dom->filter("title")->html();
        echo ">>> title: {$title}\n";

        // Print the text of every related link on the page.
        $res->dom
            ->filter('.related-item a')
            ->each(function (Crawler $crawler) {
                echo ">>> links: ", $crawler->text(), "\n";
            });
    });