README

Robots.txt parser

An easy-to-use, extensible robots.txt parser library with full support for every directive and specification on the Internet.

Use cases

Permission checks
Fetching crawler rules
Sitemap discovery
Host preference
Dynamic URL parameter discovery
robots.txt rendering

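Advantages

(Compared to most other robots.txt libraries)

Automatic robots.txt download. (optional)
Integrated caching system. (optional)
Crawl-delay handler.
Documentation available.
Support for every directive, from every specification.
HTTP status code handler, according to Google's specification.
Dedicated User-Agent parser and group determiner library, for maximum accuracy.
Provides additional data such as the preferred host, dynamic URL parameters, Sitemap locations, etc.
Supported protocols: HTTP, HTTPS, FTP, SFTP and FTP/S.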
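
To make the HTTP status code handling mentioned above more concrete, here is a minimal sketch of the kind of policy Google's robots.txt guidance describes, as far as I understand it: 2xx means the file is parsed, 4xx (except 429) is treated as if no robots.txt exists, and 429 or 5xx is treated as a temporary full disallow. The function name `robotsTxtPolicy` and its string return values are hypothetical and are not part of this library's API.

```php
<?php
declare(strict_types=1);

/**
 * Map the HTTP status of a robots.txt request to a crawl policy.
 * Hypothetical helper: the name and return values are illustrative only.
 */
function robotsTxtPolicy(int $statusCode): string
{
    if ($statusCode >= 200 && $statusCode < 300) {
        return 'parse';         // Success: apply the rules found in the file.
    }
    if ($statusCode === 429 || $statusCode >= 500) {
        return 'disallow-all';  // Rate limited or server error: stay away for now.
    }
    if ($statusCode >= 400) {
        return 'allow-all';     // Other client errors: behave as if no robots.txt exists.
    }
    return 'unknown';           // 1xx and 3xx: follow redirects before deciding.
}
```

This is also why `TxtClient` accepts a status code next to the robots.txt content: the same body can lead to different crawl rules depending on how the server responded.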

Requirements

PHP 7.3+ or 8.0+
PHP extensions:
- cURL
- mbstring

Installation

The recommended way to install the robots.txt parser is through Composer. Add this to your composer.json file:

{
  "require": {
    "vipnytt/robotstxtparser": "^2.1"
  }
}

Then run: composer update (or php composer.phar update if you use the Composer PHAR directly)

Getting started

Basic usage example

<?php
$client = new vipnytt\RobotsTxtParser\UriClient('http://example.com');

if ($client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html')) {
    // Access is granted
}
if ($client->userAgent('MyBot')->isDisallowed('http://example.com/admin')) {
    // Access is denied
}
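
For readers new to robots.txt, the sketch below illustrates what a permission check like the one above has to decide. It is not this library's implementation: it is a simplified model that assumes plain path prefixes (no `*` or `$` wildcards) and the commonly documented precedence in which the longest matching rule wins and `allow` wins a tie. The function `isPathAllowed` and the `$rules` array are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether $path is allowed under a list of [directive, pathPrefix] rules.
 * Simplified model: plain prefix matching only, no "*" or "$" wildcards.
 */
function isPathAllowed(array $rules, string $path): bool
{
    $allowed = true;   // No matching rule means the path is allowed.
    $bestLength = -1;

    foreach ($rules as [$directive, $prefix]) {
        if ($prefix === '' || strncmp($path, $prefix, strlen($prefix)) !== 0) {
            continue;  // Empty or non-matching rules are ignored.
        }
        $length = strlen($prefix);
        $isAllow = strtolower($directive) === 'allow';
        // A longer prefix wins; on a tie, the less restrictive "allow" wins.
        if ($length > $bestLength || ($length === $bestLength && $isAllow)) {
            $bestLength = $length;
            $allowed = $isAllow;
        }
    }

    return $allowed;
}

// Hypothetical rule set for a single user-agent group.
$rules = [
    ['disallow', '/admin'],
    ['allow', '/admin/public'],
];

var_dump(isPathAllowed($rules, '/somepage.html'));    // no rule matches
var_dump(isPathAllowed($rules, '/admin/settings'));   // only the disallow rule matches
var_dump(isPathAllowed($rules, '/admin/public/faq')); // the longer allow rule wins
```

A real parser also has to pick the right user-agent group and handle wildcards and encoding, which is what the library takes care of.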

A small excerpt of basic methods

<?php
// Syntax: $baseUri, [$statusCode:int|null], [$robotsTxtContent:string], [$encoding:string], [$byteLimit:int|null]
// $robotsTxtContent is the raw robots.txt body, fetched by your own code
$client = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);

// Permission checks
$allowed = $client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html'); // bool
$denied = $client->userAgent('MyBot')->isDisallowed('http://example.com/admin'); // bool

// Crawl delay rules
$crawlDelay = $client->userAgent('MyBot')->crawlDelay()->getValue(); // float | int

// Dynamic URL parameters
$cleanParam = $client->cleanParam()->export(); // array

// Preferred host
$host = $client->host()->export(); // string | null
$hostOrFallback = $client->host()->getWithUriFallback(); // string
$isPreferred = $client->host()->isPreferred(); // bool

// XML Sitemap locations
$sitemaps = $client->sitemap()->export(); // array

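The above is just a taste of the basics; many more advanced and/or specialized methods are available for almost any purpose. Visit the cheat sheet for the technical details.

Visit the documentation for more information.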
