dezento / crawlify
Fast concurrent crawler
1.0
2021-05-31 19:28 UTC
Requires
- php: ^8.0
- dezento/effective-url-middleware: ^1.0
- guzzlehttp/guzzle: ^7.3
- illuminate/collections: ^8.38
- symfony/css-selector: ^5.2
- symfony/dom-crawler: ^5.2
Requires (Dev)
- symfony/var-dumper: ^5.2
README
Installation
composer require dezento/crawlify
Overview
Crawlify is a lightweight crawler for working with HTML, XML, and JSON through DomCrawler. It uses GuzzleHttp\Pool to execute requests concurrently, which means you can use all of the available request options.
Returned results are wrapped in Laravel Collections.
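A minimal sketch of what that wrapping looks like. The 'fulfilled' key is taken from the examples below; the 'rejected' key is an assumption based on GuzzleHttp\Pool's fulfilled/rejected terminology, so verify it against your installed version:

use Dezento\Crawlify;

$results = (new Crawlify(['https://example.com']))->fetch();

// Successful responses, as a Laravel Collection:
$ok = $results->get('fulfilled');

// Failed requests presumably land under a separate key (assumption):
$failed = $results->get('rejected');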
Examples
Crawling JSON
use Dezento\Crawlify;

$links = [];
for ($i = 1; $i <= 100; $i++) {
    $links[] = 'https://jsonplaceholder.typicode.com/posts/' . $i;
}

$json = (new Crawlify(collect($links))) // you can pass an array or a Collection of links
    ->settings([
        'type' => 'JSON' // this is a Crawlify option
    ])
    ->fetch()
    ->get('fulfilled')
    ->map(fn ($p) => collect(json_decode($p->response)))
    ->dd();
Crawling XML
For traversing XML, see the DomCrawler documentation.
$xml = (new Crawlify([
    'https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml',
]))
    ->fetch()
    ->get('fulfilled')
    ->map(fn ($item) =>
        collect($item->response->filter('item')->children())
            ->map(fn ($data) => $data->textContent)
    )->dd();
Crawling HTML
For traversing HTML, see the DomCrawler documentation.
$html = (new Crawlify([
    'https://en.wikipedia.org/wiki/Category:Lists_of_spider_species_by_family'
]))
    ->settings([
        // 'proxy' => 'http://username:password@192.168.16.1:10',
        'concurrency' => 5,
        'delay' => 0
    ])
    ->fetch()
    ->get('fulfilled')
    ->map(fn ($item) =>
        collect($item->response->filter('a')->links())
            ->map(fn ($el) => $el->getUri())
    )
    ->reject(fn ($a) => $a->isEmpty())
    ->dd();
Options
->settings([
    'proxy' => 'http://username:password@192.168.16.1:10',
    'concurrency' => 5,
    'delay' => 0,
    // ...
])
For the available options, refer to the Guzzle request options documentation. Crawlify's only custom option is 'type' => 'JSON'.
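A sketch of mixing standard Guzzle request options with Crawlify's own option. 'concurrency' and 'delay' appear in the examples above; 'timeout' and 'headers' are ordinary Guzzle request options and are shown here only as an illustration:

$crawler = (new Crawlify($links))
    ->settings([
        'type'        => 'JSON',  // Crawlify-specific option
        'concurrency' => 10,      // number of parallel requests
        'delay'       => 100,     // Guzzle option: delay in milliseconds
        'timeout'     => 5,       // Guzzle option: timeout in seconds
        'headers'     => ['User-Agent' => 'crawlify-demo'], // Guzzle option
    ]);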
Note
Before using the dd() helper, you must install symfony/var-dumper:
composer require symfony/var-dumper