README

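An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in the blink of an eye.

You can use the familiar jQuery/CSS selector syntax to easily find the data you need.

In my unit tests, I require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. According to my humble tests, in some cases it is two to three orders of magnitude faster than DOMCrawler, and on average it uses half the memory.

See tests/README.md.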

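API Documentation

💡 Features

Very fast parsing and lookup
Parses broken HTML
jQuery-like style of DOM traversal
Low memory usage
Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
Caches responses for multiple processing tasks
PSR-7 friendly (see hQuery::fromHTML($message))
PHP 5.3+
No dependencies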

🛠 Install

Just add this folder to your project, then include_once 'hquery.php'; and you are ready to use hQuery.

Alternatively, use composer require duzun/hquery

Or use npm install hquery.php, then require_once 'node_modules/hquery.php/hquery.php';.

⚙ Usage

Basic setup

// Optionally use namespaces
use duzun\hQuery;

// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";

// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour

I recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.
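The cache path has to be a writable folder, so it can help to prepare it before assigning it. The following is a minimal sketch, not part of hQuery: the helper name ensure_cache_dir() and the hquery-cache folder under the system temp directory are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not part of hQuery): make sure a cache folder
// exists and is writable before handing it to hQuery::$cache_path.
function ensure_cache_dir($dir)
{
    // Create the folder (and any missing parents) if it is not there yet
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        return false;
    }
    // The cache files are written into this folder
    return is_writable($dir) ? $dir : false;
}

// Assumed location: a sub-folder of the system temp directory
$cache_dir = ensure_cache_dir(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'hquery-cache');

// Only configure the cache when hQuery is loaded and the folder is usable
if ($cache_dir !== false && class_exists('duzun\hQuery')) {
    \duzun\hQuery::$cache_path    = $cache_dir;
    \duzun\hQuery::$cache_expires = 3600; // seconds
}
```

If the folder cannot be created, $cache_dir is false and the cache settings are left untouched.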

Load HTML from a file

hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )

// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);

Where $context is created with stream_context_create().

For an example of using $context to make an HTTP request through a proxy, see #26.
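As a rough illustration of what such a $context might look like, here is a minimal sketch built only from PHP's standard HTTP context options (proxy, request_fulluri, timeout). It is not necessarily the approach shown in #26; the helper name make_proxy_context() and the proxy address are placeholders.

```php
<?php
// Hypothetical helper (not part of hQuery): build a stream context that
// routes HTTP(S) requests through a proxy.
function make_proxy_context($proxy, $timeout = 10)
{
    return stream_context_create(array(
        'http' => array(
            'proxy'           => $proxy,   // e.g. tcp://host:port
            'request_fulluri' => true,     // send the full URI in the request line, as many proxies expect
            'timeout'         => $timeout, // read timeout, seconds
        ),
    ));
}

// Placeholder proxy address, replace with a real one
$context = make_proxy_context('tcp://127.0.0.1:8080', 7);

// With hQuery loaded, the context is passed as the third argument:
// $doc = hQuery::fromFile('https://example.com/', false, $context);
```

The same context works for any PHP stream function that accepts one, such as file_get_contents().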

Load HTML from a string

hQuery::fromHTML( string $html, string $url = NULL )

$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';

Load a remote HTML document

hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )

use duzun\hQuery;

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);

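For building advanced requests (POST, parameters, etc.) see hQuery::http_wr(), though I recommend using a specialized (PSR-7) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results. See Guzzle, for example.

PSR-7 example

composer require php-http/message php-http/discovery php-http/curl-client

If you don't have the cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the command above.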

use duzun\hQuery;

use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;

$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
  'GET',
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).

Processing the results

hQuery::find( string $sel, array|string $attr = NULL, hQuery\Node $ctx = NULL )

// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery\Element)
if ( $banners ) {

    // Iterate over the result
    foreach($banners as $pos => $a) {
        $links[$pos] = $a->attr('href'); // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style')
            if ( strtolower($a->style['position']) == 'fixed' ) continue;

            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;

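Note: If the charset meta attribute has a wrong value, or the internal conversion fails for any other reason, hQuery ignores the error and continues processing with the original HTML, but registers an error message on $doc->html_errors['convert_encoding'].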
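Checking for that error after loading a document could look like the sketch below. The helper name encoding_conversion_error() is hypothetical, and the code assumes html_errors behaves like a plain array keyed by error type, which the note above suggests but does not spell out.

```php
<?php
// Hypothetical helper (not part of hQuery): return whatever is registered
// under html_errors['convert_encoding'], or null when there is nothing.
function encoding_conversion_error($doc)
{
    $errors = $doc->html_errors;
    if (is_array($errors) && isset($errors['convert_encoding'])) {
        return $errors['convert_encoding'];
    }
    return null;
}

// Usage with a document loaded by hQuery:
// $doc = hQuery::fromHTML($html);
// if (($err = encoding_conversion_error($doc)) !== null) {
//     error_log('hQuery encoding conversion failed: ' . print_r($err, true));
// }
```

Since the document is then processed in its original encoding, extracted text may need a manual conversion when this error is present.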

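🖧 Live Demo

On DUzun.Me

A lot of people ask for the source of my Live Demo page. Here it is:

view-source:https://duzun.me/playground/hquery

🏃 Run the playground

You can easily run any of the examples in examples/ on your local machine. All you need is PHP installed on your system. After cloning the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options for starting a web server.

Option 1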

cd hQuery.php/examples
php -S localhost:8000

# open browser http://localhost:8000/

Option 2 (browser-sync)

This option starts a live-reload server and is good for playing with the code.

npm install
gulp

# open browser http://localhost:8080/

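Option 3 (VSCode)

If you are using VSCode, simply open the project and run the debugger (F5).

🔧 TODO

Unit test everything
Document everything
~~Cookie support~~ (implemented in memory to handle redirects)
~~Improve selectors to be able to select by attributes~~
Add more selectors
Use HTTPlug internally

💖 Support my projects

I love Open Source. Whenever possible, I share cool things with the world (check out NPM and GitHub).

If you like what I'm doing and this project helps you reduce development time, please consider the following:

★ Star and share the projects you like (and use)
☕ Buy me a coffee - PayPal.me/duzuns (contact at duzun.me)
₿ Send me some Bitcoin at this address: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa (or use the QR code below)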
