radowoj / crawla
Simple web crawler based on Symfony components and Guzzle
v0.4.0
2022-07-03 09:20 UTC
Requires
- php: ^8.1
- guzzlehttp/guzzle: ^7.4
- symfony/css-selector: ^6.1
- symfony/dom-crawler: ^6.1
Requires (Dev)
- phpunit/phpunit: ^9.5
This package is auto-updated.
Last update: 2024-09-30 01:18:01 UTC
README
Installation
Via composer
$ composer require radowoj/crawla
Example 1 - fetch the title, commit count and README from pages linked from the entry page
<?php

use Symfony\Component\DomCrawler\Crawler as DomCrawler;

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://github.com/radowoj'
);

$dataGathered = [];

//configure our crawler
//first - set the CSS selector for links that should be visited
$crawler->setLinkSelector('span.pinned-repo-item-content span.d-block a.text-bold')

    //second - customize the guzzle client used for requests
    ->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100,
    ]))

    //third - define what should be done when a page is visited
    ->setPageVisitedCallback(function (DomCrawler $domCrawler) use (&$dataGathered) {
        //the callback is called for every visited page, including the base url,
        //so make sure repo data is gathered only on repo pages
        if (!preg_match('/radowoj\/\w+/', $domCrawler->getUri())) {
            return;
        }

        $readme = $domCrawler->filter('#readme');

        $dataGathered[] = [
            'title' => trim($domCrawler->filter('span[itemprop="about"]')->text()),
            'commits' => trim($domCrawler->filter('li.commits span.num')->text()),
            'readme' => $readme->count() ? trim($readme->text()) : '',
        ];
    });

//now crawl, following links up to 1 level deep from the entry point
$crawler->crawl(1);

var_dump($dataGathered);
var_dump($crawler->getVisited()->all());
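Because the crawler accepts any Guzzle client, request behaviour such as throttling, timeouts, or a custom User-Agent can be tuned in one place. A minimal sketch, using only standard Guzzle 7 request options; the delay, timeout, and User-Agent values are illustrative, not library defaults:

<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler('https://github.com/radowoj');

//a more defensive Guzzle client: DELAY, TIMEOUT and HEADERS
//are standard Guzzle request options
$crawler->setClient(new GuzzleHttp\Client([
    GuzzleHttp\RequestOptions::DELAY => 500,    //wait 500 ms before each request
    GuzzleHttp\RequestOptions::TIMEOUT => 10,   //give up on a page after 10 seconds
    GuzzleHttp\RequestOptions::HEADERS => [
        'User-Agent' => 'crawla-example/0.4',   //identify the crawler politely
    ],
]));

$crawler->crawl(1);
var_dump($crawler->getVisited()->all());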
Example 2 - simple sitemap
<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://developer.github.com/'
);

//configure our crawler
$crawler->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100,
    ]))
    //set the link selector (all links - this is the default value)
    ->setLinkSelector('a');

//crawl up to 1 level deep
$crawler->crawl(1);

//get the urls of all visited pages
var_dump($crawler->getVisited()->all());

//get the urls that were too deep to visit
var_dump($crawler->getTooDeep()->all());
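The collected links can then be written out as an actual sitemap.xml file. A minimal sketch using PHP's built-in SimpleXMLElement; it assumes getVisited()->all() returns a plain array of URL strings, as in the example above:

<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler('https://developer.github.com/');
$crawler->crawl(1);

//build a sitemap.xml document from the visited urls
$xml = new SimpleXMLElement(
    '<?xml version="1.0" encoding="UTF-8"?>'
    . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
);

foreach ($crawler->getVisited()->all() as $url) {
    //each visited url becomes a <url><loc>...</loc></url> entry
    $xml->addChild('url')->addChild('loc', htmlspecialchars($url));
}

$xml->asXML('sitemap.xml');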