README

Introduction
- Overview
- Robots.txt file format refresher
Usage
Building

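Introduction

Overview

Handles robots.txt files:

Parse a robots.txt file into a model
Get the directives for a user agent
Check whether a user agent is allowed to access a URL path
Extract sitemap URLs
Programmatically create a model and cast it to a string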

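Robots.txt file format refresher

Let's quickly go over the format of a robots.txt file so that you understand what you can get out of a \webignition\RobotsTxt\File\File object.

A robots.txt file contains a collection of records. A record provides a set of directives to a specified user agent. A directive instructs a user agent to do something (or not to do something). A blank line separates records.

Here is an example with two records: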

User-agent: Slurp
Disallow: /

User-Agent: *
Disallow: /private

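This tells the user agent 'Slurp' that it is not allowed to access '/' (i.e. the whole site), and tells all other user agents that they are not allowed to access '/private'.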

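A robots.txt file can also contain directives that apply to all user agents, irrespective of the specified records. These are included as a set of directives that are not part of any record. A common use is the sitemap directive.

Here is an example with directives that apply to everyone and everything: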

User-agent: Slurp
Disallow: /

User-Agent: *
Disallow: /private

Sitemap: http://example.com/sitemap.xml
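
To make the record and non-record distinction concrete, here is a minimal sketch in plain PHP that splits the example above on blank lines. It is an illustration of the file format only, not the library's parser: the helper name splitRobotsTxt() is hypothetical, and the sketch assumes that any block not starting with a user-agent field holds record-independent directives. Comment lines and other edge cases are ignored.

```php
<?php
// Illustrative sketch only: this is NOT how the library parses files.
// splitRobotsTxt() is a hypothetical helper name.

/**
 * Split robots.txt source into records and record-independent directives.
 *
 * @return array{records: string[][], nonGroup: string[]}
 */
function splitRobotsTxt(string $source): array
{
    $result = ['records' => [], 'nonGroup' => []];

    // Normalise line endings, then split into blocks on blank lines.
    $normalised = str_replace(["\r\n", "\r"], "\n", trim($source));
    $blocks = preg_split('/\n\s*\n/', $normalised);

    foreach ($blocks as $block) {
        // Trim each line and drop empty ones.
        $lines = array_values(array_filter(array_map('trim', explode("\n", $block)), 'strlen'));
        if ($lines === []) {
            continue;
        }

        // Assumption: a block whose first line is a user-agent field is a record.
        if (stripos($lines[0], 'user-agent:') === 0) {
            $result['records'][] = $lines;
        } else {
            $result['nonGroup'] = array_merge($result['nonGroup'], $lines);
        }
    }

    return $result;
}

$source = <<<TXT
User-agent: Slurp
Disallow: /

User-Agent: *
Disallow: /private

Sitemap: http://example.com/sitemap.xml
TXT;

$parts = splitRobotsTxt($source);

echo count($parts['records']), " records\n";
echo implode("\n", $parts['nonGroup']), "\n";
```

For the example file this yields two records plus one record-independent Sitemap line, which mirrors what getRecords() and getNonGroupDirectives() expose on the model.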

Usage

Parsing a robots.txt file from a string into a model

<?php
use webignition\RobotsTxt\File\Parser;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$robotsTxtFile = $parser->getFile();
 
// Get an array of records
$robotsTxtFile->getRecords();

// Get the list of record-independent directives (such as sitemap directives):
$robotsTxtFile->getNonGroupDirectives()->get();

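This might not be very useful on its own. You would normally retrieve information from a robots.txt file because you are a crawler and need to know what you are allowed (or not allowed) to access, or because you are a tool or service that needs to locate a site's sitemap.xml file.

Inspecting a model to get the directives for a user agent

Let's say we are the 'Slurp' user agent and we want to know what has been specified for us: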

<?php
use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$inspector = new Inspector($parser->getFile());
$inspector->setUserAgent('slurp');

$slurpDirectiveList = $inspector->getDirectives();

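OK, now we have a DirectiveList containing a collection of directives. We can call $slurpDirectiveList->get() to get the directives that apply to us.

This raw set of directives is available in the model because it is present in the source robots.txt file. Often this raw data is not immediately useful as-is. Maybe we want to inspect it further?

Checking whether a user agent is allowed to access a URL path

That's more like it. Let's inspect some of the data in the model.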

<?php
use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$inspector = new Inspector($parser->getFile());
$inspector->setUserAgent('slurp');

if ($inspector->isAllowed('/foo')) {
    // Do whatever is needed; access to /foo is allowed
}
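
A crawler usually has many paths to check rather than one. The sketch below filters a list of candidate paths through isAllowed(). It uses only the calls shown above. The autoloader path, the inline robots.txt source and the candidate paths are assumptions made for illustration.

```php
<?php
// Assumption: the library was installed with Composer, so the autoloader
// lives at vendor/autoload.php next to this script.
require __DIR__ . '/vendor/autoload.php';

use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

// Inline source so the sketch does not depend on network access.
$source = "User-agent: Slurp\nDisallow: /\n\nUser-agent: *\nDisallow: /private\n";

$parser = new Parser();
$parser->setSource($source);

$inspector = new Inspector($parser->getFile());
$inspector->setUserAgent('slurp');

// Hypothetical list of paths the crawler is considering.
$candidatePaths = ['/', '/private/report', '/public/index.html'];

// Keep only the paths that the inspector reports as allowed.
$allowedPaths = array_values(array_filter(
    $candidatePaths,
    function (string $path) use ($inspector) {
        return $inspector->isAllowed($path);
    }
));

print_r($allowedPaths);
```

Changing the argument to setUserAgent() lets you compare what different crawlers may fetch from the same file.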

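Extracting sitemap URLs

A robots.txt file can list the URLs of all relevant sitemaps. These directives are not specific to a user agent.

Let's say we are an automated web frontend testing service and we need to find a site's sitemap.xml to get the list of URLs that need testing. We know the site's domain, we know where to look for the robots.txt file, and we know that it might specify the location of the sitemap.xml file.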

<?php
use webignition\RobotsTxt\File\Parser;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$robotsTxtFile = $parser->getFile();

$sitemapDirectives = $robotsTxtFile->getNonGroupDirectives()->getByField('sitemap');
$sitemapUrl = (string)$sitemapDirectives->first()->getValue();

Great, we have found the URL of the first sitemap listed in the robots.txt file. There may be several, although a single one is most common.
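
When a file lists several sitemaps, or none at all, it can be safer to loop over the whole list than to rely on first(). This is a minimal sketch that assumes get() returns an iterable collection of directive objects, each exposing getValue() as in the example above. The autoloader path and the second sitemap URL are hypothetical.

```php
<?php
// Assumption: the library was installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use webignition\RobotsTxt\File\Parser;

// Inline source with two sitemap directives; the second URL is made up.
$source = "User-agent: *\nDisallow: /private\n\n"
    . "Sitemap: http://example.com/sitemap.xml\n"
    . "Sitemap: http://example.com/sitemap-news.xml\n";

$parser = new Parser();
$parser->setSource($source);

$sitemapDirectives = $parser->getFile()->getNonGroupDirectives()->getByField('sitemap');

// Collect every sitemap URL as a plain string.
$sitemapUrls = [];
foreach ($sitemapDirectives->get() as $directive) {
    $sitemapUrls[] = (string)$directive->getValue();
}

if ($sitemapUrls === []) {
    echo "No sitemap directives found\n";
} else {
    echo implode("\n", $sitemapUrls), "\n";
}
```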

Filtering the directives for a user agent by field type

Let's get all the disallow directives for Slurp:

<?php
use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$robotsTxtFile = $parser->getFile();

$inspector = new Inspector($robotsTxtFile);
$inspector->setUserAgent('slurp');

$slurpDisallowDirectiveList = $inspector->getDirectives()->getByField('disallow');

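Building

Using as a library in a project

If this library is used as a dependency of another project, add it to that project's composer.json and update your dependencies.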

"require": {
    "webignition/robots-txt-file": "*"      
}

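This will get you the latest version. Check the list of releases if you need a specific version.

Developing

This project uses composer to manage its external dependencies. Get and install composer first.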

# Make a suitable project directory
mkdir ~/robots-txt-file && cd ~/robots-txt-file

# Clone repository
git clone git@github.com:webignition/robots-txt-file.git .

# Retrieve/update dependencies
composer update

# Run code sniffer and unit tests
composer cs
composer test

Testing

Have a look at the project on Travis for the latest build status, or run the tests yourself.

cd ~/robots-txt-file
composer test
