dr4g0nsr / sitemap-crawler
A crawler for any type of website that uses robots.txt and sitemap.xml as its URL sources. Very useful for refreshing caches.
1.0
2022-11-11 21:16 UTC
Requires
- php: >=7.2
- ext-curl: *
- guzzlehttp/guzzle: ^7.5
- vipnytt/sitemapparser: 1.1.5
Requires (Dev)
- dealerdirect/phpcodesniffer-composer-installer: ^0.7.0
- doctrine/annotations: ^1.2
- php-parallel-lint/php-console-highlighter: ^1.0.0
- php-parallel-lint/php-parallel-lint: ^1.3.2
- phpcompatibility/php-compatibility: ^9.3.5
- roave/security-advisories: dev-latest
- squizlabs/php_codesniffer: ^3.6.2
- yoast/phpunit-polyfills: ^1.0.0
Suggests
- ext-curl: Required for CURL handler support
- ext-intl: Required for Internationalized Domain Name (IDN) support
- psr/log: Required for using the Log middleware
This package is auto-updated.
Last update: 2024-09-12 01:25:43 UTC
README
Sitemap Crawler
Crawls a website and refreshes its cache using the site's sitemap.
No files are stored; the URLs are only requested.
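As a rough illustration of where the crawled URLs come from, the sketch below pulls every `<loc>` entry out of a sitemap document. It is not the package's own code (the package lists vipnytt/sitemapparser for parsing); `extractSitemapLocs` is a hypothetical helper, the example.com URLs are placeholders, and the sketch assumes the SimpleXML extension is available.

```php
<?php
// Hypothetical helper, NOT part of dr4g0nsr/sitemap-crawler.
// Returns every <loc> value found in a sitemap (or sitemap index) document.
function extractSitemapLocs(string $xml): array
{
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc === false) {
        return []; // input was not well-formed XML
    }

    $locs = [];
    // local-name() matches <loc> elements in any namespace.
    foreach ($doc->xpath('//*[local-name()="loc"]') ?: [] as $node) {
        $locs[] = trim((string) $node);
    }
    return $locs;
}

// Placeholder sitemap used only for this example.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about </loc></url>
</urlset>
XML;

print_r(extractSitemapLocs($xml));
```

A crawler that only warms a cache would then request each returned URL and discard the response body.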
Get the code with Composer
composer require dr4g0nsr/sitemap-crawler
How to implement
Create config.php
<?php
$settings = [
    "sleep" => 0,
    "excluded" => []
];
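The README does not document what the two settings mean. The sketch below assumes that "excluded" is a list of URL fragments to skip and that "sleep" is a pause in seconds between requests; both readings are guesses, and `filterExcluded` is a hypothetical helper rather than the package's API.

```php
<?php
// ASSUMED semantics (not confirmed by the README):
//   "excluded" = substrings; a URL containing any of them is skipped
//   "sleep"    = seconds to pause between requests
// filterExcluded() is a hypothetical helper, NOT part of the package.
function filterExcluded(array $urls, array $excluded): array
{
    $kept = array_filter($urls, function (string $url) use ($excluded): bool {
        foreach ($excluded as $needle) {
            if ($needle !== '' && strpos($url, $needle) !== false) {
                return false; // URL contains an excluded fragment
            }
        }
        return true;
    });
    return array_values($kept); // reindex from 0
}

// Placeholder settings and URLs used only for this example.
$settings = ["sleep" => 0, "excluded" => ["/admin", "/cart"]];
$urls = [
    "https://example.com/",
    "https://example.com/admin/login",
    "https://example.com/blog",
];

foreach (filterExcluded($urls, $settings["excluded"]) as $url) {
    echo "would request: {$url}" . PHP_EOL;
    if ($settings["sleep"] > 0) {
        sleep($settings["sleep"]); // throttle between requests
    }
}
```

With the empty "excluded" list from the config above, every URL would be kept.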
Use code like the following
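<?php
// Composer autoloader and the settings file created above
require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/config.php';
use dr4g0nsr\Crawler;
// Site whose sitemap will be crawled
$url = 'https://candymapper.com';
print "Crawler version: " . Crawler::version() . PHP_EOL;
// Create the crawler with inline options, then load config.php
$crawler = new Crawler(['sleep' => 0, 'verbose' => true]);
$crawler->loadConfig(__DIR__ . '/config.php');
// Fetch the sitemap for the site, then request each URL it lists
$sitemap = $crawler->getSitemap($url);
$crawler->crawlURLS($sitemap);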
This is the simplest possible code; you can also find it in the vendor/dr4g0nsr/SitemapCrawler/test directory.
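The package description names robots.txt as a URL source. Under the Sitemaps protocol, a robots.txt file can point to sitemaps with `Sitemap:` lines, and the sketch below shows one way such lines might be collected. It illustrates the idea only and makes no claim about how `getSitemap` works internally; `sitemapsFromRobots` is a hypothetical helper and the example.com URLs are placeholders.

```php
<?php
// Hypothetical helper, NOT part of dr4g0nsr/sitemap-crawler.
// Collects the URLs given on "Sitemap:" lines of a robots.txt body.
function sitemapsFromRobots(string $robotsTxt): array
{
    $sitemaps = [];
    // \R splits on any newline style (LF, CRLF, CR).
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // The field name is matched case-insensitively.
        if (preg_match('/^\s*sitemap\s*:\s*(\S+)/i', $line, $m)) {
            $sitemaps[] = $m[1];
        }
    }
    return array_values(array_unique($sitemaps)); // drop duplicates, reindex
}

// Placeholder robots.txt content used only for this example.
$robots = "User-agent: *\n"
    . "Disallow: /private/\n"
    . "Sitemap: https://example.com/sitemap.xml\n"
    . "sitemap: https://example.com/news.xml\n";

print_r(sitemapsFromRobots($robots));
```

Each URL found this way could then be fetched and parsed as a sitemap before its entries are crawled.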