The most powerful, popular and production-ready crawling/scraping package for PHP

v1.0.0 2019-12-11 02:05 UTC



README

The most powerful, popular and production-ready crawling/scraping package for PHP. Happy hacking :)

Features

  • Server-side DOM with automatic DomParser injection, backed by Symfony\Component\DomCrawler
  • Configurable pool size and retries
  • Rate-limit control
  • forceUTF8 mode, letting the crawler handle charset detection and conversion for you
  • Compatible with PHP 7.2
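The features above correspond to constructor options. A minimal sketch of how they might be combined; `forceUTF8` is assumed from the feature list, while the other option names appear in the examples below:

```php
<?php
use PHPCrawler\PHPCrawler;

// Sketch only: maxConnections, retries, rateLimit and domParser are taken
// from the usage examples below; forceUTF8 is assumed from the feature list.
$crawler = new PHPCrawler([
    'maxConnections' => 5,    // connection pool size
    'retries'        => 3,    // retry count per task
    'rateLimit'      => 2,    // requests per second
    'forceUTF8'      => true, // charset detection and conversion
    'domParser'      => true, // inject a Symfony DomCrawler into Response::dom
]);
```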

Thanks

  • Amp, a non-blocking concurrency framework for PHP
  • Artax, an asynchronous HTTP client for PHP
  • node-crawler, the most powerful, popular and production crawling/scraping package for Node

node-crawler is truly a great crawler. PHPCrawler tries hard to stay similar to it.

中文说明 (documentation in Chinese)

Table of Contents

Getting Started

Install

$ composer require "coooold/crawler"

Basic Usage

use PHPCrawler\PHPCrawler;
use PHPCrawler\Response;
use Symfony\Component\DomCrawler\Crawler;

$logger = new Monolog\Logger("fox");
try {
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));
} catch (\Exception $e) {
    // StreamHandler may throw if the stream cannot be opened; ignored here.
}

$crawler = new PHPCrawler([
    'maxConnections' => 2,
    'domParser' => true,
    'timeout' => 3000,
    'retries' => 3,
    'logger' => $logger,
]);

$crawler->on('response', function (Response $res) {
    if (!$res->success) {
        return;
    }

    $title = $res->dom->filter("title")->html();
    echo ">>> title: {$title}\n";
    $res->dom
        ->filter('.related-item a')
        ->each(function (Crawler $crawler) {
            echo ">>> links: ", $crawler->text(), "\n";
        });
});

$crawler->queue('https://www.foxnews.com/');
$crawler->run();

Slow Down

Use the rateLimit option to slow down when visiting websites.

$crawler = new PHPCrawler([
    'maxConnections' => 10,
    'rateLimit' => 2,   // reqs per second
    'domParser' => true,
    'timeout' => 30000,
    'retries' => 3,
    'logger' => $logger,
]);

for ($page = 1; $page <= 100; $page++) {
    $crawler->queue([
        'uri' => "http://www.qbaobei.com/jiaoyu/gshb/List_{$page}.html",
        'type' => 'list',
    ]);
}

$crawler->run(); // between two tasks, the average time gap is 1000 / 2 = 500 (ms)

Custom Parameters

Sometimes you need to access variables from a previous request/response session. All you need to do is pass them into the options as parameters:

$crawler->queue([
    'uri' => 'http://www.google.com',
    'parameter1' => 'value1',
    'parameter2' => 'value2',
]);

Then access them in the callback via $res->task['parameter1'], $res->task['parameter2'], and so on.
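For example, a sketch reusing the basic-usage setup (the `parameter1`/`parameter2` keys are the illustrative names from above, not a fixed API):

```php
<?php
use PHPCrawler\PHPCrawler;
use PHPCrawler\Response;

// Sketch: custom keys queued with a task come back on $res->task
// in the response callback.
$crawler = new PHPCrawler(['maxConnections' => 2]);

$crawler->queue([
    'uri' => 'http://www.google.com',
    'parameter1' => 'value1',
    'parameter2' => 'value2',
]);

$crawler->on('response', function (Response $res) {
    if (!$res->success) {
        return;
    }
    // Read the custom parameters back from the originating task.
    echo "parameter1: {$res->task['parameter1']}\n";
    echo "parameter2: {$res->task['parameter2']}\n";
});

$crawler->run();
```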

Raw Body

If you are downloading files such as images, PDFs, or Word documents, you must keep the raw response body, which means the crawler should not convert it. In the example below this is done by disabling the DOM parser ('domParser' => false).

$crawler = new PHPCrawler([
    'maxConnections' => 10,
    'rateLimit' => 2,   // req per second
    'domParser' => false,
    'timeout' => 30000,
    'retries' => 3,
    'logger' => $logger,
]);

$crawler->on('response', function (Response $res, PHPCrawler $crawler) {
    if (!$res->success) {
        return;
    }

    echo "write ".$res->task['fileName']."\n";
    file_put_contents($res->task['fileName'], $res->body);
});

$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60881.txt.utf-8",
    'fileName' => '60881.txt',
]);

$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60882.txt.utf-8",
    'fileName' => '60882.txt',
]);

$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60883.txt.utf-8",
    'fileName' => '60883.txt',
]);

$crawler->run();

Events

Event::RESPONSE

Triggered when a request is completed.

$crawler->on('response', function (Response $res, PHPCrawler $crawler) {
    if (!$res->success) {
        return;
    }
});

Event::DRAIN

Triggered when the queue is empty.

$crawler->on('drain', function () {
    echo "queue is drained\n";
});

Advanced

Encoding

The HTTP body will be converted from the given encoding to UTF-8.

$crawler = new PHPCrawler([
    'encoding' => 'gbk',
]);

Logging

A PSR logger instance can be used.

$logger = new Monolog\Logger("fox");
$logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));

$crawler = new PHPCrawler([
    'logger' => $logger,
]);

See the Monolog Reference.

Coroutines

Built on the Amp non-blocking concurrency framework, PHPCrawler works with coroutines, ensuring excellent performance. Amp async packages should be used inside callbacks; that is, PHP's native MySQL client and native file I/O are not recommended. The yield keyword, much like await in ES6, introduces non-blocking I/O.

$crawler->on('response', function (Response $res) use ($cli) {
    // $cli is assumed to be an Amp Artax HTTP client created beforehand.
    /** @var \Amp\Artax\Response $artaxRes */
    $artaxRes = yield $cli->request("https://www.foxnews.com/politics/lindsey-graham-adam-schiff-is-doing-a-lot-of-damage-to-the-country-and-he-needs-to-stop");
    $body = yield $artaxRes->getBody();
    echo "=======> body " . strlen($body) . " bytes \n";
});

Working with DomParser

Symfony\Component\DomCrawler is a handy tool for scraping pages. Response::dom is injected with a Symfony\Component\DomCrawler\Crawler instance.

$crawler->on('response', function (Response $res) {
    if (!$res->success) {
        return;
    }

    $title = $res->dom->filter("title")->html();
    echo ">>> title: {$title}\n";
    $res->dom
        ->filter('.related-item a')
        ->each(function (Crawler $crawler) {
            echo ">>> links: ", $crawler->text(), "\n";
        });
});

See the DomCrawler Reference.

Others

API Reference

Configuration