README

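URLResolver.php is a PHP class that attempts to resolve a URL to its final, canonical link. On today's web, link shorteners, tracking codes, and the like can leave many different links pointing at the same resource. URLResolver.php tries to address this by following HTTP redirects and by parsing the Open Graph and canonical URLs found in web pages.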

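Patterns Recognized

Follows 301, 302, and 303 redirects found in HTTP headers
Follows Open Graph URL <meta> tags found in web pages
Follows canonical URL <link> tags found in web pages
Follows the URL in meta refresh tags
Aborts the download quickly if the content type is not an HTML page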

I am open to suggestions for further improvement.
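
To make the page-level patterns above more concrete, here is a minimal sketch of how the og:url, rel=canonical, and meta refresh hints could be pulled out of an HTML string. It is not the code URLResolver.php uses internally (the library ships with PHP Simple HTML DOM Parser); it assumes PHP's bundled DOM extension instead, and the function name findCandidateUrls() is hypothetical.

```php
<?php

/**
 * Illustrative helper (not part of URLResolver.php): collect the URL hints
 * a page may expose. Assumes the PHP DOM extension is available.
 */
function findCandidateUrls(string $html): array
{
    $found = ['og_url' => null, 'canonical' => null, 'meta_refresh' => null];
    if (trim($html) === '') {
        return $found; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property  = strtolower($meta->getAttribute('property'));
        $httpEquiv = strtolower($meta->getAttribute('http-equiv'));
        $content   = trim($meta->getAttribute('content'));

        if ($property === 'og:url' && $content !== '') {
            $found['og_url'] = $content;
        } elseif ($httpEquiv === 'refresh'
            && preg_match('/url\s*=\s*[\'"]?([^\'"\s;]+)/i', $content, $m)) {
            // A refresh value looks like "0; url=http://example.com/next".
            $found['meta_refresh'] = $m[1];
        }
    }

    foreach ($doc->getElementsByTagName('link') as $link) {
        $href = trim($link->getAttribute('href'));
        if (strtolower($link->getAttribute('rel')) === 'canonical' && $href !== '') {
            $found['canonical'] = $href;
        }
    }

    return $found;
}
```

A resolver would typically fetch each hinted URL in turn and repeat until no new hint appears or a request limit is reached.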

Usage

Resolving a URL can be as simple as this:

<?php 

use Mdf\PhpUrlResolver;

$resolver = new URLResolver();
print $resolver->resolveURL('http://goo.gl/0GMP1')->getURL();

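In most cases, however, you will want some additional setup. The following code sets a user agent to identify your crawler (otherwise the default is used) and designates a temporary file that can store cookies during the session. Some web sites test whether the browser supports cookies, so this will improve your results.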

<?php 

use Mdf\PhpUrlResolver;

$resolver = new URLResolver();

# Identify your crawler (otherwise the default will be used)
$resolver->setUserAgent('Mozilla/5.0 (compatible; YourAppName/1.0; +http://www.example.com)');

# Designate a temporary file that will store cookies during the session.
# Some web sites test the browser for cookie support, so this enhances results.
$resolver->setCookieJar('/tmp/url_resolver.cookies');

# resolveURL() returns an object that allows for additional information.
$url = 'http://goo.gl/0GMP1';
$url_result = $resolver->resolveURL($url);

# Test to see if any error occurred while resolving the URL:
if ($url_result->didErrorOccur()) {
	print "there was an error resolving $url:\n  ";
	print $url_result->getErrorMessageString();
}

# Otherwise, print out the resolved URL.  The [HTTP status code] will tell you
# additional information about the success/failure. For instance, if the
# link resulted in a 404 Not Found error, it would print '404: http://...'
# The successful status code is 200.
else {
	print $url_result->getHTTPStatusCode();
	print ': ';
	print $url_result->getURL();
}

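Download and Requirements

License

URLResolver.php is licensed under the MIT License, which can be viewed in the source code.

Download

Download URLResolver.php as a .tar.gz or .zip file.

Requirements

PHP must have the curl extension installed
PHP Simple HTML DOM Parser is required and is included with the download.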

API

URLResolver()

$resolver = new URLResolver();
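Creates a URL resolver object on which you can call additional methods.

$resolver->resolveURL($url);
$url is the link you want to resolve.
Returns a URLResolverResult object that contains the final, resolved URL.

$resolver->setUserAgent($user_agent);
Pass in a string that is sent to each web server to identify your crawler.

$resolver->setCookieJar($cookie_file); # Cookies are disabled by default
*** This file will be deleted at the end of each resolveURL() call ***
Pass in the path to a file used to store cookies during each resolveURL() call.
If no cookie file is set, cookies are disabled and results may suffer.
This file must not already exist. If it does, pass true as the second argument to allow it to be overwritten.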

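$resolver->setMaxRedirects($max_redirects); # Defaults to 10
Sets the maximum number of URL requests to attempt during each resolveURL() call.

$resolver->setMaxResponseDataSize($max_bytes); # Defaults to 120000
Pass in an integer specifying the maximum amount of data, in bytes, to download per request.
Multiple URL requests may occur during each resolveURL() call.
Setting this too low may limit the usefulness of the results (default 120000).

$resolver->setRequestTimeout($num_seconds); # Defaults to 30
Sets the maximum time, in seconds, that any single URL request may take.
Multiple URL requests may occur during each resolveURL() call.

$resolver->isDebugMode($value); # Defaults to false
Set $value to true to enable debug mode, or to false to disable it (the default).
Debug mode prints each link visited, along with its status code and link type.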
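
Because setCookieJar() expects a file that does not exist yet and the file is deleted after every resolveURL() call, it can help to generate a fresh path each time. The following is a minimal sketch of one way to do that; makeCookieJarPath() and the url_resolver_ prefix are hypothetical names, not part of URLResolver.php.

```php
<?php

/**
 * Illustrative helper (not part of URLResolver.php): build a path in the
 * system temp directory that does not currently exist, suitable for
 * passing to setCookieJar() without the overwrite flag.
 */
function makeCookieJarPath(string $prefix = 'url_resolver_'): string
{
    $dir = rtrim(sys_get_temp_dir(), DIRECTORY_SEPARATOR);

    do {
        // 16 random hex characters make a collision very unlikely;
        // the loop covers the rare case where the name is already taken.
        $path = $dir . DIRECTORY_SEPARATOR . $prefix . bin2hex(random_bytes(8)) . '.cookies';
    } while (file_exists($path));

    return $path;
}

// Usage with a resolver configured as described above:
// $resolver->setCookieJar(makeCookieJarPath());
```

The helper only picks a name and does not create the file, which keeps it compatible with the "must not exist" requirement.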

URLResolverResult()

$url_result = $resolver->resolveURL($url);
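Retrieves the URLResolverResult() object representing the resolution of $url.

$url_result->getURL();
This is the best resolved URL we could obtain after following redirects.

$url_result->getHTTPStatusCode();
Returns the integer HTTP status code for the resolved URL.
Examples: 200 - OK (success), 404 - Not Found, 301 - Moved Permanently, ...

$url_result->hasSuccessHTTPStatus();
Returns true if the HTTP status code for the resolved URL is 200.

$url_result->hasRedirectHTTPStatus();
Returns true if the HTTP status code for the resolved URL is 301, 302, or 303.

$url_result->getContentType();
Returns the value of the Content-Type HTTP header for the resolved URL.
Returns null if the header was not provided. Examples: text/html, image/jpeg, ...

$url_result->getContentLength();
Returns the size of the resolved URL in bytes.
Determined only from the Content-Length HTTP header. Returns null otherwise.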

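$url_result->isOpenGraphURL();
Returns true if the resolved URL was marked as the Open Graph URL (og:url).

$url_result->isCanonicalURL();
Returns true if the resolved URL was marked as the Canonical URL (rel=canonical).

$url_result->isStartingURL();
Returns true if the resolved URL is also the URL you passed to resolveURL().

$url_result->didErrorOccur();
Returns true if an error occurred while resolving the URL.
If this returns false, $url_result is guaranteed to have a status code.

$url_result->getErrorMessageString();
Returns an explanation of the error if didErrorOccur() returned true.

$url_result->didConnectionFail();
Returns true if there was a connection error (no headers or no body returned).
This may indicate a situation where retrying at least once is worthwhile.
If this returns true, didErrorOccur() will also be true.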
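
Since didConnectionFail() hints that another attempt may be worthwhile, a small retry wrapper is one way to act on it. This is a sketch only: resolveWithRetry() and its $maxAttempts and $delaySeconds parameters are hypothetical and not part of URLResolver.php. It relies solely on the resolveURL() and didConnectionFail() methods documented above.

```php
<?php

/**
 * Illustrative helper (not part of URLResolver.php): call resolveURL()
 * again while the result reports a connection failure.
 *
 * @param object $resolver Any object exposing resolveURL($url).
 * @return object The last result returned by resolveURL().
 */
function resolveWithRetry($resolver, string $url, int $maxAttempts = 2, int $delaySeconds = 1)
{
    $maxAttempts = max(1, $maxAttempts); // always try at least once
    $result = null;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $resolver->resolveURL($url);

        // Only connection failures are retried; other errors are returned as-is.
        if (!$result->didConnectionFail()) {
            break;
        }

        // Pause briefly before the next attempt, but not after the last one.
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    return $result;
}

// Usage with a resolver configured as described above:
// $url_result = resolveWithRetry($resolver, 'http://goo.gl/0GMP1');
```

The caller should still check didErrorOccur() on the returned result, because the final attempt may have failed as well.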

Changelog

v1.1 - June 3, 2014
- Support for HTTP redirect code 303
v1.0 - December 3, 2011
- Initial release supporting HTTP header redirects, og:url, and rel=canonical

mdf / php-url-resolver
