johnroyer/crawler-php

A web crawler implemented in PHP

0.3.6 2024-02-08 05:56 UTC

This package is auto-updated.

Last update: 2024-09-08 07:16:22 UTC


README

A simple web crawler.

Note: this is a website project. Do not use it in production.

Usage

Create a handler by extending AbstractHandler, and set the domain that the handler should process:

class MyHandler extends \Zeroplex\Crawler\Handler\AbstractHandler
{
    public function getDomain(): string
    {
        return 'test.com';
    }

    public function shouldFetch(\Psr\Http\Message\RequestInterface $request): bool
    {
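        // ignore css, js and common images; match on the URL path only,
        // so a query string such as ?v=2 does not hide the extension
        if (1 === preg_match('/\.(css|js|jpg|png|gif)$/i', $request->getUri()->getPath())) {
            return false;
        }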
        return true;
    }

    public function handle(\Psr\Http\Message\ResponseInterface $response): void
    {
        // get content using $response->getBody()->getContents()
    }
}
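
The extension check inside shouldFetch() can be tried on its own, without running the crawler. The sketch below is a standalone helper; the name isStaticAsset is hypothetical and not part of the package. It uses a slightly stricter variant of the pattern: it requires a literal dot before the extension and looks only at the URL path, so a page such as /posts/nodejs is not mistaken for a JavaScript file.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of crawler-php: returns true when the URL
 * path ends in one of the static-asset extensions a handler may want to skip.
 */
function isStaticAsset(string $url): bool
{
    // parse_url() returns null when there is no path and false for a malformed URL.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }

    // Require a literal dot before the extension; match case-insensitively.
    return 1 === preg_match('/\.(css|js|jpg|png|gif)$/i', $path);
}

var_dump(isStaticAsset('https://test.com/style.css?v=2')); // bool(true)
var_dump(isStaticAsset('https://test.com/posts/nodejs'));  // bool(false)
```

A handler could call such a helper from shouldFetch() with (string) $request->getUri() as the argument.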

Then configure the crawler and run it:

$crawler = new \Zeroplex\Crawler\Crawler();

$crawler->setDelay(0)
    ->setTimeout(3)
    ->setFollowRedirect(true)
    ->setUserAgent('Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/100.1');

$crawler->addHandler(new MyHandler());

// URL to start
$crawler->run('https://test.com');

Extending

For example, implement the URL queue with Predis.

Install Predis with Composer:

composer require predis/predis

Implement UrlQueueInterface:

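class RedisQueue implements \Zeroplex\Crawler\UrlQueue\UrlQueueInterface
{
    // Redis list key holding pending URLs (the name is chosen for this example)
    private const KEY = 'crawler:urls';

    private \Predis\Client $redis;

    public function __construct(string $host, int $port)
    {
        $this->redis = new \Predis\Client(['host' => $host, 'port' => $port]);
    }

    public function push(string $url): void
    {
        // LPUSH takes the list key first, then the values to add
        $this->redis->lpush(self::KEY, [$url]);
    }

    public function pop(): string
    {
        // LPOP returns null on an empty list; falling back to '' is an assumption
        // made here to satisfy the return type. Paired with LPUSH this returns the
        // newest URL first; use rpop() instead for first-in, first-out order.
        return $this->redis->lpop(self::KEY) ?? '';
    }

    // and so on
}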
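
Before wiring up Redis, the push/pop contract can be exercised with an in-memory stand-in. The sketch below rests on several assumptions: the class name ArrayQueue is hypothetical, it deliberately does not implement UrlQueueInterface because the full method list of that interface is not shown here, and returning an empty string from pop() on an empty queue is a choice made for this example rather than documented behavior.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical in-memory queue mirroring the push()/pop() signatures above.
 * URLs come out in the order they went in (first in, first out).
 */
final class ArrayQueue
{
    private SplQueue $urls;

    public function __construct()
    {
        $this->urls = new SplQueue();
    }

    public function push(string $url): void
    {
        $this->urls->enqueue($url);
    }

    public function pop(): string
    {
        // Assumption: an empty string signals that nothing is left to crawl.
        return $this->urls->isEmpty() ? '' : $this->urls->dequeue();
    }
}

$queue = new ArrayQueue();
$queue->push('https://test.com/a');
$queue->push('https://test.com/b');

echo $queue->pop(), PHP_EOL; // https://test.com/a
echo $queue->pop(), PHP_EOL; // https://test.com/b
```

A first-in, first-out queue makes the crawl proceed breadth-first from the start URL, while a last-in, first-out structure makes it go depth-first; which one fits depends on the site being crawled.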