README

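This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently.

Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, Chrome and Puppeteer are used to power this feature.

Support us

We invest a lot of resources into creating best-in-class open source packages. You can support us by buying one of our paid products.

We highly appreciate you sending us a postcard from your hometown, mentioning which of our packages you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Installation

This package can be installed via Composer: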

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class:

namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /*
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /*
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /*
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
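
As an illustration of the abstract class above, a concrete observer might look like the following minimal sketch. The class name StatusCollector and its $statuses property are hypothetical and not part of the package. The sketch also assumes PHP 8 and that Composer's autoloader is available at vendor/autoload.php.

```php
<?php

// Assumption: the package was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical observer that records one entry per crawled URL.
class StatusCollector extends CrawlObserver
{
    /** @var array<string, int|string> URL => status code, or error message on failure */
    public array $statuses = [];

    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Remember the HTTP status code of every successful request.
        $this->statuses[(string) $url] = $response->getStatusCode();
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Remember why the request failed.
        $this->statuses[(string) $url] = $requestException->getMessage();
    }

    public function finishedCrawling(): void
    {
        // Print a short report once the crawl has ended.
        foreach ($this->statuses as $url => $status) {
            echo "{$url}: {$status}" . PHP_EOL;
        }
    }
}
```

An instance of such a class could then be passed to setCrawlObserver before calling startCrawling.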

Using multiple observers

You can set multiple observers with setCrawlObservers:

Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
     ])
    ->startCrawling($url);

Alternatively, you can add observers one by one with addCrawlObserver:

Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable it:

Crawler::create()
    ->executeJavaScript()
    ...

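To get the body HTML after JavaScript has been executed, this package depends on our Browsershot package. That package uses Puppeteer under the hood. Here are some pointers on how to install it on your system.

Browsershot will make an educated guess as to where its dependencies are installed on your system. By default, the crawler will instantiate a new Browsershot instance. If you need a custom-configured instance, you can pass one with the setBrowsershot(Browsershot $browsershot) method.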

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...
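
The $browsershot variable above is not defined in the snippet. One way it could be created is sketched below, using Browsershot's setNodeBinary, setNpmBinary and setChromePath methods. The binary paths are placeholders and will differ per system, and Composer's autoloader is assumed to be available at vendor/autoload.php.

```php
<?php

// Assumption: the package was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

// Point Browsershot at explicit binaries instead of letting it guess.
// All three paths are placeholders; adjust them to your system.
$browsershot = (new Browsershot())
    ->setNodeBinary('/usr/local/bin/node')
    ->setNpmBinary('/usr/local/bin/npm')
    ->setChromePath('/usr/bin/chromium');

// Nothing is crawled yet; the crawl only starts once startCrawling() is called.
$crawler = Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript();
```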

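Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you call executeJavaScript().

Filtering certain URLs

You can tell the crawler not to visit certain URLs with the setCrawlProfile method. It expects an object that extends Spatie\Crawler\CrawlProfiles\CrawlProfile: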

/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;

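This package comes with three CrawlProfiles out of the box:

CrawlAllUrls: this profile will crawl all URLs on all pages, including URLs to external sites.
CrawlInternalUrls: this profile will only crawl the internal URLs on the pages of a host.
CrawlSubdomains: this profile will only crawl the internal URLs and its subdomains on the pages of a host.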
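
When none of the bundled profiles fits, a custom profile only has to implement shouldCrawl. The following is a minimal sketch of one that skips every URL whose path starts with a given prefix. The class name SkipPathPrefix and its constructor argument are hypothetical, and the sketch assumes PHP 8 and Composer's autoloader at vendor/autoload.php.

```php
<?php

// Assumption: the package was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

// Hypothetical profile: crawl everything except paths under $prefix.
class SkipPathPrefix extends CrawlProfile
{
    public function __construct(private string $prefix)
    {
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // Returning false tells the crawler not to visit this URL.
        return ! str_starts_with($url->getPath(), $this->prefix);
    }
}
```

An instance such as new SkipPathPrefix('/admin') could then be handed to setCrawlProfile. Note that this sketch does not restrict the crawl to a single host; it only filters by path.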

Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. You can disable these checks like so:

Crawler::create()
    ->ignoreRobots()
    ...

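Robots data can come from a robots.txt file, meta tags, or response headers. More information on the spec can be found at http://www.robotstxt.org/.

Parsing robots data is done by our package spatie/robots-txt.

Accepting links with the rel="nofollow" attribute

By default, the crawler will reject all links containing the attribute rel="nofollow". You can disable these checks like so: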

Crawler::create()
    ->acceptNofollowLinks()
    ...

Using a custom user agent

To respect robots.txt rules for a custom user agent, you can specify your own user agent:

Crawler::create()
    ->setUserAgent('my-agent')

You can add a specific crawl rule group for 'my-agent' in robots.txt. This example disallows crawling the entire site for crawlers identified by 'my-agent'.

# Disallow crawling for my-agent
User-agent: my-agent
Disallow: /

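Setting the number of concurrent requests

To improve crawl speed, the package crawls 10 URLs concurrently by default. To change that number, use the setConcurrency method.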

Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one

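Defining crawl limits

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in an environment with server limitations.

The crawl behavior can be controlled with the following two options:

Total Crawl Limit (setTotalCrawlLimit): this limit defines the maximum number of URLs to crawl.
Current Crawl Limit (setCurrentCrawlLimit): this limit defines how many URLs are processed during the current crawl.

Let's look at some examples to clarify the difference between these two methods.

Example 1: Using the total crawl limit

The setTotalCrawlLimit method limits the total number of URLs to crawl, no matter how often you call the crawler.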

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

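Example 2: Using the current crawl limit

The setCurrentCrawlLimit method sets a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages with each execution, without a total limit on pages to crawl.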

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

Example 3: Combining the total and current crawl limits

Both limits can be combined to control the crawler:

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

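Example 4: Crawling across requests

You can use setCurrentCrawlLimit to break up long-running crawls. The following example demonstrates a (simplified) approach. It consists of an initial request and any number of follow-up requests that continue the crawl.

Initial request

To start crawling across different requests, you need to create a new queue using the queue driver of your choice. Start by passing the queue instance to the crawler. The crawler will begin filling the queue as pages are processed and new URLs are discovered. After the crawler has finished (because the current crawl limit was reached), serialize and store the queue reference.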

// Create a queue using your queue-driver.
$queue = <your selection/implementation of a queue>;

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

Follow-up requests

For any follow-up request, you need to unserialize the original queue and pass it to the crawler:

// Unserialize queue
$queue = unserialize($serializedQueue);

// Crawls the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

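The behavior is based on the information in the queue. It only works as described when the same queue instance is passed in. When a completely new queue is passed in, the limits of previous crawls won't apply, even for the same website.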
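
The snippets above leave open where the serialized queue is kept between requests. As a minimal sketch, it could be written to a file. The helper functions loadQueue, storeQueue and crawlNextBatch are hypothetical and not part of the package, the built-in ArrayCrawlQueue stands in for the queue driver of your choice, and Composer's autoloader is assumed at vendor/autoload.php.

```php
<?php

// Assumption: the package was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;
use Spatie\Crawler\CrawlQueues\CrawlQueue;

// Hypothetical helper: restore the stored queue, or start with an empty one.
function loadQueue(string $file): CrawlQueue
{
    if (! is_file($file)) {
        return new ArrayCrawlQueue();
    }

    return unserialize(file_get_contents($file));
}

// Hypothetical helper: persist the queue so the next request can continue.
function storeQueue(string $file, CrawlQueue $queue): void
{
    file_put_contents($file, serialize($queue));
}

// Hypothetical helper: crawl at most $limit URLs, then save the progress.
function crawlNextBatch(string $url, string $file, int $limit = 10): void
{
    $queue = loadQueue($file);

    Crawler::create()
        ->setCrawlQueue($queue)
        ->setCurrentCrawlLimit($limit)
        ->startCrawling($url);

    storeQueue($file, $queue);
}
```

Each request would then call crawlNextBatch with the same URL and file path. Because unserialize can instantiate arbitrary classes, the file should only ever contain data written by your own application.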

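A more detailed example can be found here.

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler, use the setMaximumDepth method.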

Crawler::create()
    ->setMaximumDepth(2)

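Setting the maximum response size

Most HTML pages are quite small, but the crawler could accidentally pick up large files such as PDFs and MP3s. To keep memory usage low in such cases, the crawler only uses responses smaller than 2 MB. If a response grows larger than 2 MB while it is being streamed, the crawler stops streaming it and assumes an empty response body.

You can change the maximum response size.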

// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)

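Adding a delay between requests

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the setDelayBetweenRequests() method to add a pause between every request. The value is expressed in milliseconds.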

Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150ms

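Limiting which content types to parse

By default, every found page will be downloaded (up to setMaximumResponseSize() in size) and parsed for additional links. You can limit which content types should be downloaded and parsed by calling setParseableMimeTypes() with an array of allowed types.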

Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain'])

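This will prevent downloading the body of pages with other MIME types, such as binary files or audio/video, which are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue

When crawling a site, the crawler puts URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue.

When a site is very large, you may want to store that queue elsewhere, for example in a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueues\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.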

Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueues\CrawlQueue>)

Changing the default base URL scheme

By default, the crawler sets the base URL scheme to http if none is given. You can change that with setDefaultScheme.

Crawler::create()
    ->setDefaultScheme('https')

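Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

First, install the Puppeteer dependency, or your tests will fail.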

npm install puppeteer

To run the tests, you first have to start the included node-based server in a separate terminal window.

cd tests/server
npm install
node server.js

With the server running, you can start testing.

composer test

Security

If you've found a security-related bug, please email security@spatie.be instead of using the issue tracker.

Postcardware

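You're free to use this package, but if it makes it to your production environment, we highly appreciate you sending us a postcard from your hometown, mentioning which of our packages you are using.

Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

License

The MIT License (MIT). Please see the License File for more information.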
