dezento/crawlify

Fast concurrent crawler

1.0 2021-05-31 19:28 UTC



README

Installation

composer require dezento/crawlify

Overview

Crawlify is a lightweight crawler for working with HTML, XML and JSON via DomCrawler. It uses GuzzleHttp\Pool to execute concurrent requests, which means you can use all of the available request options. The returned results are wrapped in Laravel Collections.
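
The general call chain is the same in every example below: construct Crawlify with an array or Collection of links, optionally pass settings, call fetch(), and read the fulfilled responses from the returned Collection. A minimal sketch (the URL and settings here are placeholders):

use Dezento\Crawlify;

// Construct with links, configure, fetch, then work with the
// resulting Laravel Collection of fulfilled responses.
$results = (new Crawlify(['https://example.com']))
->settings(['concurrency' => 2]) // optional
->fetch()
->get('fulfilled');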

Examples

Crawling JSON

use Dezento\Crawlify;


$links = [];
for ($i = 1; $i <= 100; $i++) {
    $links[] = 'https://jsonplaceholder.typicode.com/posts/' . $i;
}

$json = (new Crawlify(collect($links))) // you can pass an array or a Collection of links
->settings([
  'type' => 'JSON' // this is Crawlify's custom option
])
->fetch()
->get('fulfilled')
->map(fn ($p) => collect(json_decode($p->response)))
->dd();
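
If you want to keep working with the data instead of dumping it, drop the dd() call and carry on with ordinary Collection methods. A sketch that keeps only the post titles, reusing the $links array from above (JSONPlaceholder posts expose a title field; the rest is regular Laravel Collection usage):

$titles = (new Crawlify(collect($links)))
->settings(['type' => 'JSON'])
->fetch()
->get('fulfilled')
// json_decode() yields one object per post; keep only its title
->map(fn ($p) => json_decode($p->response)->title ?? null)
->filter() // drop anything that failed to decode
->values();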

Crawling XML

For traversing XML, see the DomCrawler documentation.

$xml = (new Crawlify([
    'https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml',
]))
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('item')->children())
  ->map(fn ($data) => $data->textContent)
)->dd();
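
If you only need specific fields from the feed, DomCrawler's each() and filter() can narrow things down before the data reaches the Collection. A sketch that keeps just the <title> text of every RSS item (it assumes the items carry a title element, as RSS items normally do):

$titles = (new Crawlify([
    'https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml',
]))
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  // each() returns an array with one entry per matched <item> node
  collect($item->response->filter('item')->each(
    fn ($node) => $node->filter('title')->text()
  ))
)->dd();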

Crawling HTML

For traversing HTML, see the DomCrawler documentation.

$html = (new Crawlify([
  'https://en.wikipedia.org/wiki/Category:Lists_of_spider_species_by_family'
]))
->settings([
  #'proxy' => 'http://username:password@192.168.16.1:10',
  'concurrency' => 5,
  'delay' => 0
])
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('a')->links())
  ->map(fn($el) => $el->getUri())
)
->reject(fn($a) => $a->isEmpty())
->dd();
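
The per-page link Collections usually contain duplicates (navigation links, repeated anchors), so it is often worth flattening and deduplicating the result. A sketch of the same crawl, finished off with ordinary Laravel Collection methods instead of dd():

$uris = (new Crawlify([
  'https://en.wikipedia.org/wiki/Category:Lists_of_spider_species_by_family'
]))
->settings(['concurrency' => 5])
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('a')->links())
  ->map(fn ($el) => $el->getUri())
)
->flatten() // merge the per-page Collections into one flat list
->unique()  // drop duplicate URIs
->values();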

Options
->settings([
  'proxy' => 'http://username:password@192.168.16.1:10',
  'concurrency' => 5,
  'delay' => 0,
  ....
])

For the available options you can refer to the request options documentation. Crawlify's only custom option is 'type' => 'JSON'.
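
Any standard Guzzle request option can be mixed into the same settings() array as Crawlify's 'type' option. A sketch (the 'headers' and 'timeout' keys are ordinary Guzzle request options, shown here purely as an illustration):

->settings([
  'type' => 'JSON',                              // Crawlify's custom option
  'concurrency' => 10,                           // how many requests are in flight at once
  'delay' => 100,                                // Guzzle option: delay in milliseconds before sending
  'headers' => ['User-Agent' => 'Crawlify/1.0'], // standard Guzzle request option
  'timeout' => 10,                               // standard Guzzle request option, in seconds
])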

Note

The dd() helper must be installed before you can use it:

composer require symfony/var-dumper