webdl / panther-crawler
A web crawler based on Panther
v0.1
2022-03-13 09:22 UTC
Requires
- php: >=8.0
- symfony/event-dispatcher: ^5.3 || ^6.0
- symfony/panther: ^2.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.7
- symfony/var-dumper: ^6.0
README
A (very basic) web crawler based on Panther
Installing panther-crawler
Use Composer to install panther-crawler in your project:
composer req webdl/panther-crawler
Installing ChromeDriver and geckodriver
Panther uses the WebDriver protocol to control the browser used to crawl websites.
On all systems, you can use dbrekelmans/browser-driver-installer
to install ChromeDriver and geckodriver locally:
composer require --dev dbrekelmans/bdi
vendor/bin/bdi detect drivers
Basic usage
<?php

use Symfony\Component\Panther\Client;
use Webdl\PantherCrawler\Config\ScraperConfig;
use Webdl\PantherCrawler\Scraper\Scraper;

require __DIR__.'/vendor/autoload.php'; // Composer's autoloader

$client = Client::createChromeClient();
// Or, if you care about the open web and prefer to use Firefox,
// replace the line above with:
// $client = Client::createFirefoxClient();

// Adjust the config: start URL and maximum number of links
$scraperConfig = ScraperConfig::create('https://fr.wikipedia.org/', maxLinks: 200);

$crawler = new Scraper($client, $scraperConfig);
$crawler->crawl();
Basic usage with event dispatching
<?php

use Symfony\Component\EventDispatcher\EventDispatcher;
use Symfony\Component\Panther\Client;
use Webdl\PantherCrawler\Config\ScraperConfig;
use Webdl\PantherCrawler\Event\PageCrawledEvent;
use Webdl\PantherCrawler\Scraper\Scraper;

require __DIR__.'/vendor/autoload.php'; // Composer's autoloader

$eventDispatcher = new EventDispatcher();

$client = Client::createChromeClient();
// Or, if you care about the open web and prefer to use Firefox,
// replace the line above with:
// $client = Client::createFirefoxClient();

// Register a listener on the dispatcher for PageCrawledEvent
$eventDispatcher->addListener(PageCrawledEvent::NAME, function (PageCrawledEvent $event) {
    echo 'A page was crawled!' . PHP_EOL;
});

$scraperConfig = ScraperConfig::create('https://fr.wikipedia.org/', maxLinks: 200);

// Pass the dispatcher as the third argument so the scraper can dispatch events
$crawler = new Scraper($client, $scraperConfig, $eventDispatcher);
$crawler->crawl();