webimage/spider

Crawl websites

0.0.6 2024-08-30 11:23 UTC

This package is auto-updated.

Last update: 2024-09-30 11:30:54 UTC


README

A Symfony/Browser-Kit wrapper for downloading, caching, and crawling URLs.

Usage

use WebImage\Spider\UrlFetcher;
use WebImage\Spider\Url; // assumed namespace for Url; verify against the package source
use Symfony\Component\HttpClient\HttpClient;

$logger = new \Monolog\Logger('spider');
$fetcher = new UrlFetcher('/path/to/cache', $logger, HttpClient::create());
$result = $fetcher->fetch(new Url('https://www.domain.com'));

It is a good idea to create the HttpClient with a User-Agent header, e.g.

use Symfony\Component\HttpClient\HttpClient;
HttpClient::create([
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    ]
]);

A spider that recursively crawls URLs can be set up by registering an onFetch(FetchHandlerInterface) or onFetchCallback listener.

use WebImage\Spider\FetchResponseEvent;
use WebImage\Spider\Url; // assumed namespace for Url; verify against the package source

/** @var \WebImage\Spider\UrlFetcher $fetcher */
$fetcher->onFetchCallback(function(FetchResponseEvent $ev) {
    // Perform some logic here, then queue another URL to crawl
    $ev->getTarget()->fetch(new Url('https://www.another.com/path'));
});

Both onFetch(...) and onFetchCallback(...) push URLs onto a stack, and the queued URLs are processed recursively in the order they were added.
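The callback form is shown above; the onFetch(FetchHandlerInterface) form accepts a reusable handler object instead. The following is only a sketch of what such a handler might look like: it assumes FetchHandlerInterface declares a single onFetch(FetchResponseEvent) method (the method name and the Url namespace are assumptions, not confirmed by the package docs; check the package source before relying on them).

```php
<?php
use WebImage\Spider\FetchHandlerInterface;
use WebImage\Spider\FetchResponseEvent;
use WebImage\Spider\Url; // assumed namespace for Url

// Hypothetical handler that queues a fixed list of seed URLs whenever a
// fetch completes. The onFetch() method name is an assumption inferred from
// the onFetch(...) registration call shown above.
class SeedListHandler implements FetchHandlerInterface
{
    /** @var string[] */
    private array $seeds;

    public function __construct(array $seeds)
    {
        $this->seeds = $seeds;
    }

    public function onFetch(FetchResponseEvent $ev): void
    {
        // Each fetch() call pushes the URL onto the fetcher's stack;
        // URLs are then processed in the order they were added.
        foreach ($this->seeds as $seed) {
            $ev->getTarget()->fetch(new Url($seed));
        }
    }
}

/** @var \WebImage\Spider\UrlFetcher $fetcher */
$fetcher->onFetch(new SeedListHandler([
    'https://www.another.com/a',
    'https://www.another.com/b',
]));
```

A class-based handler like this is easier to unit-test and reuse across fetchers than an inline closure, at the cost of a little boilerplate.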