ixnode/php-web-crawler

PHP Web Crawler - This PHP class lets you recursively crawl a given HTML page (or a given HTML file) and collect data from it.

0.1.24 2024-02-28 01:18 UTC


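This PHP class lets you recursively crawl a given HTML page (or a given HTML file) and collect data from it. Define the URL (or HTML file) and a set of XPath expressions that map to the fields of the output data object. The final representation is a PHP array, which can easily be converted to JSON for further processing.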
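
The idea of mapping XPath expressions to output fields can be sketched with PHP's built-in DOM extension alone. This is only an illustration of the concept, not the library's implementation: the helper name `extractFields` and its array-based signature are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of ixnode/php-web-crawler): evaluates each
 * XPath expression against the HTML and returns the trimmed text of the
 * first match per field name, or null if nothing matches.
 *
 * @param array<string, string> $expressions field name => XPath expression
 * @return array<string, string|null>
 */
function extractFields(string $html, array $expressions): array
{
    $document = new DOMDocument();

    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    $result = [];
    foreach ($expressions as $name => $expression) {
        $nodes = $xpath->query($expression);
        $first = $nodes !== false ? $nodes->item(0) : null;
        $result[$name] = $first !== null ? trim($first->textContent) : null;
    }

    return $result;
}

$html = '<html><head><title>Test Page</title></head>'
    . '<body><h1>Test Title</h1><p>Test Paragraph</p></body></html>';

$data = extractFields($html, [
    'title' => '//h1',
    'paragraph' => '//p',
]);

// The PHP array converts directly to JSON.
echo json_encode($data, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR), PHP_EOL;
```

The library wraps this kind of lookup in composable objects (`Field`, `Group`, and so on), as the usage sections show.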

1. Installation

composer require ixnode/php-web-crawler
vendor/bin/php-web-crawler -V
php-web-crawler 0.1.0 (02-24-2024 14:46:26) - Björn Hempel <bjoern@hempel.li>

2. Usage

2.1 PHP code

use Ixnode\PhpWebCrawler\Output\Field;
use Ixnode\PhpWebCrawler\Source\Raw;
use Ixnode\PhpWebCrawler\Value\Text;
use Ixnode\PhpWebCrawler\Value\XpathTextNode;

$rawHtml = <<<HTML
<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <h1>Test Title</h1>
        <p>Test Paragraph</p>
    </body>
</html>
HTML;

$html = new Raw(
    $rawHtml,
    new Field('version', new Text('1.0.0')),
    new Field('title', new XpathTextNode('//h1')),
    new Field('paragraph', new XpathTextNode('//p'))
);

$html->parse()->getJsonStringFormatted();
// See below

2.2 JSON result

{
    "version": "1.0.0",
    "title": "Test Title",
    "paragraph": "Test Paragraph"
}
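
For further processing, a JSON string like the one above can be decoded back into a PHP array with the standard `json_decode` function. A minimal sketch, assuming the JSON has been captured in a variable (the variable names are illustrative):

```php
<?php

declare(strict_types=1);

// The JSON document from this section, held in a string for illustration.
$json = <<<'JSON'
{
    "version": "1.0.0",
    "title": "Test Title",
    "paragraph": "Test Paragraph"
}
JSON;

// Decode into an associative array; throw on malformed input
// instead of silently returning null.
$data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

printf("%s (v%s)\n", $data['title'], $data['version']);
```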

3. Advanced usage

3.1 Group

PHP code

use Ixnode\PhpWebCrawler\Output\Field;
use Ixnode\PhpWebCrawler\Output\Group;
use Ixnode\PhpWebCrawler\Source\Raw;
use Ixnode\PhpWebCrawler\Value\XpathTextNode;

$rawHtml = <<<HTML
<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <h1>Test Title</h1>
        <p class="paragraph-1">Test Paragraph 1</p>
        <p class="paragraph-2">Test Paragraph 2</p>
    </body>
</html>
HTML;

$html = new Raw(
    $rawHtml,
    new Field('title', new XpathTextNode('/html/head/title')),
    new Group(
        'content',
        new Group(
            'header',
            new Field('h1', new XpathTextNode('/html/body//h1')),
        ),
        new Group(
            'text',
            new Field('p1', new XpathTextNode('/html/body//p[@class="paragraph-1"]')),
            new Field('p2', new XpathTextNode('/html/body//p[@class="paragraph-2"]')),
        )
    )
);

$html->parse()->getJsonStringFormatted();
// See below

JSON result

{
  "title": "Test Page",
  "content": {
    "header": {
      "h1": "Test Title"
    },
    "text": {
      "p1": "Test Paragraph 1",
      "p2": "Test Paragraph 2"
    }
  }
}

3.2 XpathSection

PHP code

use Ixnode\PhpWebCrawler\Output\Field;
use Ixnode\PhpWebCrawler\Output\Group;
use Ixnode\PhpWebCrawler\Source\Raw;
use Ixnode\PhpWebCrawler\Source\XpathSection;
use Ixnode\PhpWebCrawler\Value\XpathTextNode;

$rawHtml = <<<HTML
<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <div class="content">
            <h1>Test Title</h1>
            <p class="paragraph-1">Test Paragraph 1</p>
            <p class="paragraph-2">Test Paragraph 2</p>
        </div>
    </body>
</html>
HTML;

$html = new Raw(
    $rawHtml,
    new Field('title', new XpathTextNode('/html/head/title')),
    new Group(
        'content',
        new XpathSection(
            '/html/body//div[@class="content"]',
            new Group(
                'header',
                new Field('h1', new XpathTextNode('./h1')),
            ),
            new Group(
                'text',
                new Field('p1', new XpathTextNode('./p[@class="paragraph-1"]')),
                new Field('p2', new XpathTextNode('./p[@class="paragraph-2"]')),
            )
        )
    )
);

$html->parse()->getJsonStringFormatted();
// See below

JSON result

{
    "title": "Test Page",
    "content": {
        "header": {
            "h1": "Test Title"
        },
        "text": {
            "p1": "Test Paragraph 1",
            "p2": "Test Paragraph 2"
        }
    }
}
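
The relative expressions (`./h1`, `./p[...]`) are evaluated against the node selected by the section expression. The same scoping can be reproduced with PHP's built-in `DOMXPath`, which accepts a context node as the second argument of `query()`. This is a sketch of the underlying XPath behaviour only and makes no claim about how `XpathSection` is implemented; the extra "Outer Title" heading is added here to show the effect of the scope.

```php
<?php

declare(strict_types=1);

$html = <<<'HTML'
<html>
    <body>
        <h1>Outer Title</h1>
        <div class="content">
            <h1>Test Title</h1>
            <p class="paragraph-1">Test Paragraph 1</p>
        </div>
    </body>
</html>
HTML;

$document = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($document);

// Select the section once; it becomes the context node for relative queries.
$section = $xpath->query('/html/body//div[@class="content"]')->item(0);

// "./h1" only matches <h1> elements that are direct children of the section,
// so the outer heading is ignored.
$heading = $xpath->query('./h1', $section)->item(0);
$paragraph = $xpath->query('./p[@class="paragraph-1"]', $section)->item(0);

$result = [
    'h1' => trim($heading->textContent),
    'p1' => trim($paragraph->textContent),
];

echo json_encode($result, JSON_PRETTY_PRINT), PHP_EOL;
```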

3.3 XpathSections (flat)

PHP code

use Ixnode\PhpWebCrawler\Output\Field;
use Ixnode\PhpWebCrawler\Output\Group;
use Ixnode\PhpWebCrawler\Source\Raw;
use Ixnode\PhpWebCrawler\Source\XpathSections;
use Ixnode\PhpWebCrawler\Value\XpathTextNode;

$rawHtml = <<<HTML
<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <div class="content">
            <h1>Test Title</h1>
            <p class="paragraph-1">Test Paragraph 1</p>
            <p class="paragraph-2">Test Paragraph 2</p>
            <ul>
                <li>Test Item 1</li>
                <li>Test Item 2</li>
            </ul>
        </div>
    </body>
</html>
HTML;

$html = new Raw(
    $rawHtml,
    new Field('title', new XpathTextNode('/html/head/title')),
    new Group(
        'hits',
        new XpathSections(
            '/html/body//div[@class="content"]/ul',
            new XpathTextNode('./li/text()'),
        )
    )
);

$html->parse()->getJsonStringFormatted();
// See below

JSON result

{
    "title": "Test Page",
    "hits": [
        [
            "Test Item 1",
            "Test Item 2"
        ]
    ]
}
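
The nested array in the result reflects one entry per matched section (here a single `<ul>`), each holding the text nodes found inside it. A plain `DOMXPath` sketch of that shape, which is an assumption about the behaviour based on the output above rather than the library's own code:

```php
<?php

declare(strict_types=1);

$html = <<<'HTML'
<html>
    <body>
        <ul>
            <li>Test Item 1</li>
            <li>Test Item 2</li>
        </ul>
    </body>
</html>
HTML;

$document = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($document);

$hits = [];
// Outer loop: one iteration per matched section (each <ul>).
foreach ($xpath->query('/html/body//ul') as $list) {
    $texts = [];
    // Inner loop: every text node matched relative to that section.
    foreach ($xpath->query('./li/text()', $list) as $textNode) {
        $texts[] = trim($textNode->textContent);
    }
    $hits[] = $texts;
}

echo json_encode(['hits' => $hits], JSON_PRETTY_PRINT), PHP_EOL;
```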

3.4 XpathSections (structured)

PHP code

use Ixnode\PhpWebCrawler\Output\Field;
use Ixnode\PhpWebCrawler\Output\Group;
use Ixnode\PhpWebCrawler\Source\Raw;
use Ixnode\PhpWebCrawler\Source\XpathSections;
use Ixnode\PhpWebCrawler\Value\XpathTextNode;

$rawHtml = <<<HTML
<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <div class="content">
            <h1>Test Title</h1>
            <p class="paragraph-1">Test Paragraph 1</p>
            <p class="paragraph-2">Test Paragraph 2</p>
            <table>
                <tbody>
                    <tr>
                        <th>Caption 1</th>
                        <td>Cell 1</td>
                    </tr>
                    <tr>
                        <th>Caption 2</th>
                        <td>Cell 2</td>
                    </tr>
                </tbody>
            </table>
        </div>
    </body>
</html>
HTML;

$html = new Raw(
    $rawHtml,
    new Field('title', new XpathTextNode('/html/head/title')),
    new Group(
        'hits',
        new XpathSections(
            '/html/body//div[@class="content"]/table/tbody/tr',
            new Field('caption', new XpathTextNode('./th/text()')),
            new Field('content', new XpathTextNode('./td/text()')),
        )
    )
);

$html->parse()->getJsonStringFormatted();
// See below

JSON result

{
    "title": "Test Page",
    "hits": [
        {
            "caption": "Caption 1",
            "content": "Cell 1"
        },
        {
            "caption": "Caption 2",
            "content": "Cell 2"
        }
    ]
}
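
Here the section expression matches every table row, and the fields are resolved once per row. The equivalent loop with PHP's built-in `DOMXPath` might look like the following. It is a sketch under the assumption that each match yields one object, as the JSON above suggests, and is not taken from the library's source.

```php
<?php

declare(strict_types=1);

$html = <<<'HTML'
<html>
    <body>
        <table>
            <tbody>
                <tr><th>Caption 1</th><td>Cell 1</td></tr>
                <tr><th>Caption 2</th><td>Cell 2</td></tr>
            </tbody>
        </table>
    </body>
</html>
HTML;

$document = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($document);

$hits = [];
// Each matched <tr> is used as the context node for the field expressions.
foreach ($xpath->query('/html/body//table/tbody/tr') as $row) {
    $hits[] = [
        'caption' => trim($xpath->query('./th/text()', $row)->item(0)->textContent),
        'content' => trim($xpath->query('./td/text()', $row)->item(0)->textContent),
    ];
}

echo json_encode(['hits' => $hits], JSON_PRETTY_PRINT), PHP_EOL;
```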

4. More examples

5. Development

git clone git@github.com:ixnode/php-web-crawler.git && cd php-web-crawler
composer install
composer test

6. License

This library is licensed under the MIT License. See the LICENSE.md file for details.