johnroyer / crawler-php
A web crawler implemented in PHP
0.3.6
2024-02-08 05:56 UTC
Requires
- php: ^8.1|^8.2
- ext-intl: *
- ext-mbstring: *
- guzzlehttp/guzzle: ^7.5
- johnroyer/url-normalizer: ^2.1.0
- symfony/css-selector: ^6.2
- symfony/dom-crawler: ^6.2
Requires (Dev)
- phpunit/phpunit: ^9.0
- squizlabs/php_codesniffer: ^3.6
README
A simple web crawler.
Note: this is a website project. Do not use it in production.
Usage
Create a handler by extending AbstractHandler, and set the domain that the handler should process:
class MyHandler extends \Zeroplex\Crawler\Handler\AbstractHandler
{
    public function getDomain(): string
    {
        return 'test.com';
    }

    public function shouldFetch(\Psr\Http\Message\RequestInterface $request): bool
    {
        // Ignore CSS, JS, and common image files. The URI object is cast to a
        // string explicitly, and the dot is matched so that a path such as
        // "/nodejs" is not skipped by accident.
        if (1 === preg_match('/\.(css|js|jpg|png|gif)$/', (string) $request->getUri())) {
            return false;
        }

        return true;
    }

    public function handle(\Psr\Http\Message\ResponseInterface $response): void
    {
        // Read the page content with $response->getBody()->getContents().
    }
}
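The extension check in shouldFetch can also be tried out on its own, without the crawler installed. The following is a minimal sketch rather than part of the library: `hasIgnoredExtension` is a hypothetical helper name, and the default extension list simply repeats the one used in the handler. It inspects only the URL path, so a query string such as `?v=2` does not hide the extension.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of crawler-php): reports whether a URL
 * points at a static asset that a handler may want to skip.
 *
 * @param string[] $ignored lower-case file extensions without the dot
 */
function hasIgnoredExtension(string $url, array $ignored = ['css', 'js', 'jpg', 'png', 'gif']): bool
{
    // Look only at the path so "?v=2" or "#top" cannot mask the extension.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path or unparsable URL: nothing to ignore
    }

    // Compare case-insensitively so "LOGO.PNG" is treated like "logo.png".
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return in_array($extension, $ignored, true);
}
```

Inside a handler, this could be called as `return !hasIgnoredExtension((string) $request->getUri());`.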
Then configure the crawler and run it:
$crawler = new \Zeroplex\Crawler\Crawler();
$crawler->setDelay(0)
    ->setTimeout(3)
    ->setFollowRedirect(true)
    ->setUserAgent('Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/100.1');

// Register the handler defined above.
$crawler->addHandler(new MyHandler());

// URL to start crawling from
$crawler->run('https://test.com');
Extending
For example, implement the URL queue with Predis.
composer install
composer require predis/predis
Implement UrlQueueInterface:
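class RedisQueue implements \Zeroplex\Crawler\UrlQueue\UrlQueueInterface
{
    // Redis list key holding pending URLs (the name is chosen for this example).
    private const KEY = 'crawler:urls';

    private \Predis\Client $redis;

    public function __construct(string $host, int $port)
    {
        $this->redis = new \Predis\Client(['scheme' => 'tcp', 'host' => $host, 'port' => $port]);
    }

    public function push(string $url): void
    {
        // LPUSH needs the list key as well as the value.
        $this->redis->lpush(self::KEY, [$url]);
    }

    public function pop(): string
    {
        // LPOP returns null on an empty list; returning '' here is an assumption.
        // LPUSH + LPOP yields newest-first order; use rpop() for first-in, first-out.
        return $this->redis->lpop(self::KEY) ?? '';
    }

    // and so on
}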
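For trying out handlers without a Redis server, the same push/pop contract can be sketched in memory. This is only an illustration under stated assumptions: `InMemoryUrlQueue` is a hypothetical class that mirrors the two method signatures shown above, it does not implement the library's UrlQueueInterface (whose remaining methods are not listed here), and both the duplicate filtering and the empty-string return on an empty queue are choices made for this sketch.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative in-memory queue (not part of crawler-php). It mirrors only
 * the push()/pop() signatures shown in the README.
 */
final class InMemoryUrlQueue
{
    /** @var SplQueue<string> URLs waiting to be fetched, oldest first */
    private SplQueue $pending;

    /** @var array<string, true> every URL pushed so far, used to skip duplicates */
    private array $seen = [];

    public function __construct()
    {
        $this->pending = new SplQueue();
    }

    public function push(string $url): void
    {
        if (isset($this->seen[$url])) {
            return; // already queued once; skip to avoid crawling it twice
        }
        $this->seen[$url] = true;
        $this->pending->enqueue($url);
    }

    public function pop(): string
    {
        // Assumption: an empty string signals that nothing is left.
        return $this->pending->isEmpty() ? '' : $this->pending->dequeue();
    }
}
```

Unlike the Redis version, this queue returns URLs in first-in, first-out order and loses its contents when the process exits.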