mjorgens/web-crawler

A PHP web crawler library

V1.0.3 2021-02-15 17:22 UTC

This package is auto-updated.

Last update: 2024-09-11 02:21:44 UTC


README


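This is a PHP library that takes a starting URL, parses that page's HTML, and extracts the URLs it contains. It then follows those URLs and parses the pages they point to, repeating until the maximum number of URLs is reached.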
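The crawl loop described above can be sketched in plain PHP. This is a minimal illustration of the general technique (a breadth-first crawl with a page limit), not this library's actual implementation. The function names `extractLinks` and `crawl`, and the `$fetch` callable, are hypothetical and exist only for this example. The sketch assumes PHP 7.4 or later with the DOM extension enabled, and it skips relative links to stay short.

```php
<?php
declare(strict_types=1);

/**
 * Extract absolute http(s) link targets from an HTML string.
 * Relative links are skipped to keep the sketch short.
 */
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them for malformed HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('#^https?://#i', $href) === 1) {
            $links[] = $href;
        }
    }
    return $links;
}

/**
 * Breadth-first crawl: fetch a page, queue its unseen links,
 * and stop once $maxUrls pages have been stored.
 * $fetch is a hypothetical callable(string $url): ?string that
 * returns the page HTML, or null when the page cannot be fetched.
 */
function crawl(string $startUrl, int $maxUrls, callable $fetch): array
{
    $queue = [$startUrl];
    $seen = [$startUrl => true];
    $pages = [];

    while ($queue !== [] && count($pages) < $maxUrls) {
        $url = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue; // Page could not be fetched; move on.
        }
        $pages[$url] = $html;

        foreach (extractLinks($html) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $pages;
}

// Usage with an in-memory "site" so the sketch runs without network access.
$site = [
    'https://example.com'   => '<a href="https://example.com/a">A</a><a href="https://example.com/b">B</a>',
    'https://example.com/a' => '<a href="https://example.com/c">C</a>',
    'https://example.com/b' => '<a href="https://example.com">home</a>',
    'https://example.com/c' => '<p>no links</p>',
];
$pages = crawl('https://example.com', 3, fn (string $url): ?string => $site[$url] ?? null);
print_r(array_keys($pages));
```

With the limit set to 3, the crawl stores the start page and its two direct links, then stops before reaching `https://example.com/c`. The `$seen` map is what prevents the link back to the start page from being queued a second time.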

Requirements

PHP (the supported version is listed on the package's Packagist page)

Installation

The recommended way to install this library is through Composer.

composer require mjorgens/web-crawler

Usage

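// Assumed imports: the original snippet does not show where Uri and Crawler come from.
use GuzzleHttp\Psr7\Uri;      // Assumption: any PSR-7 Uri implementation, e.g. guzzlehttp/psr7
use Mjorgens\Crawler\Crawler; // Assumption: inferred from the repository's namespace
use Mjorgens\Crawler\CrawledRepository\CrawledMemoryRepository;

$repository = new CrawledMemoryRepository(); // The collection of crawled pages
$url = new Uri('https://example.com');       // Starting URL
$maxUrls = 5;                                // Maximum number of URLs to crawl

Crawler::create()
    ->setRepository($repository)
    ->setMaxCrawl($maxUrls)
    ->startCrawling($url); // Start the crawler

// Each stored page exposes its URL and its HTML
foreach ($repository as $page) {
    echo $page->url;
    echo $page->html;
}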