dezento/crawlify

Fast concurrent crawler

1.0 2021-05-31 19:28 UTC



README

Installation

composer require dezento/crawlify

Overview

Crawlify is a lightweight crawler for working with HTML, XML and JSON via DomCrawler. It uses GuzzleHttp\Pool to execute concurrent requests, which means you can use all of the available request options. The returned results are wrapped in Laravel Collections.
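
The general call chain is the same in every example below: construct Crawlify with an array or Collection of links, optionally pass settings, call fetch(), and read the fulfilled responses from the returned Collection. A minimal sketch (the URL and settings here are placeholders):

use Dezento\Crawlify;

// Construct with links, configure, fetch, then work with the
// resulting Laravel Collection of fulfilled responses.
$results = (new Crawlify(['https://example.com']))
->settings(['concurrency' => 2]) // optional
->fetch()
->get('fulfilled');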

Examples

Crawling JSON

use Dezento\Crawlify;


$links = [];
for ($i = 1; $i <= 100; $i++) {
    $links[] = 'https://jsonplaceholder.typicode.com/posts/' . $i;
}

$json = (new Crawlify(collect($links))) // you can pass an array or a Collection of links
->settings([
  'type' => 'JSON' // this is Crawlify's custom option
])
->fetch()
->get('fulfilled')
->map(fn ($p) => collect(json_decode($p->response)))
->dd();
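
If you want to keep working with the data instead of dumping it, drop the dd() call and carry on with ordinary Collection methods. A sketch that keeps only the post titles, reusing the $links array from above (JSONPlaceholder posts expose a title field; the rest is regular Laravel Collection usage):

$titles = (new Crawlify(collect($links)))
->settings(['type' => 'JSON'])
->fetch()
->get('fulfilled')
// json_decode() yields one object per post; keep only its title
->map(fn ($p) => json_decode($p->response)->title ?? null)
->filter() // drop anything that failed to decode
->values();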

Crawling XML

For traversing XML, see the DomCrawler documentation.

$xml = (new Crawlify([
    'https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml',
]))
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('item')->children())
  ->map(fn ($data) => $data->textContent)
)->dd();
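
If you only need specific fields from the feed, DomCrawler's each() and filter() can narrow things down before the data reaches the Collection. A sketch that keeps just the <title> text of every RSS item (it assumes the items carry a title element, as RSS items normally do):

$titles = (new Crawlify([
    'https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml',
]))
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  // each() returns an array with one entry per matched <item> node
  collect($item->response->filter('item')->each(
    fn ($node) => $node->filter('title')->text()
  ))
)->dd();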

Crawling HTML

For traversing HTML, see the DomCrawler documentation.

$html = (new Crawlify([
  'https://en.wikipedia.org/wiki/Category:Lists_of_spider_species_by_family'
]))
->settings([
  #'proxy' => 'http://username:password@192.168.16.1:10',
  'concurrency' => 5,
  'delay' => 0
])
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('a')->links())
  ->map(fn($el) => $el->getUri())
)
->reject(fn($a) => $a->isEmpty())
->dd();
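
The per-page link Collections usually contain duplicates (navigation links, repeated anchors), so it is often worth flattening and deduplicating the result. A sketch of the same crawl, finished off with ordinary Laravel Collection methods instead of dd():

$uris = (new Crawlify([
  'https://en.wikipedia.org/wiki/Category:Lists_of_spider_species_by_family'
]))
->settings(['concurrency' => 5])
->fetch()
->get('fulfilled')
->map(fn ($item) =>
  collect($item->response->filter('a')->links())
  ->map(fn ($el) => $el->getUri())
)
->flatten() // merge the per-page Collections into one flat list
->unique()  // drop duplicate URIs
->values();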

Options
->settings([
  'proxy' => 'http://username:password@192.168.16.1:10',
  'concurrency' => 5,
  'delay' => 0,
  ....
])

For the available options you can refer to the request options documentation. Crawlify's only custom option is 'type' => 'JSON'.
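
Any standard Guzzle request option can be mixed into the same settings() array as Crawlify's 'type' option. A sketch (the 'headers' and 'timeout' keys are ordinary Guzzle request options, shown here purely as an illustration):

->settings([
  'type' => 'JSON',                              // Crawlify's custom option
  'concurrency' => 10,                           // how many requests are in flight at once
  'delay' => 100,                                // Guzzle option: delay in milliseconds before sending
  'headers' => ['User-Agent' => 'Crawlify/1.0'], // standard Guzzle request option
  'timeout' => 10,                               // standard Guzzle request option, in seconds
])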

Note

The dd() helper must be installed before you can use it:

composer require symfony/var-dumper