README

The most integrated web scraper package for Laravel.

Top Features

Scavenger provides the following features and more out of the box.

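Ease of use
- Scavenger is easy to configure. Simply publish the configuration file and set your targets.
Scrape data from multiple sources at once.
Convert scraped data into usable Laravel model objects.
- e.g. You may scrape an article, convert it into an object of your choice, and save it to your database, where it is immediately available to your viewers.
Perform one or more operations on each attribute of any scraped entity.
- e.g. You may call a paraphrasing service from a model or package of your choice before the data is saved to your database.
Data integrity constraints
- Scavenger uses a hashing algorithm of your choice to maintain data integrity. The hash ensures that one scrap (source article) is never converted into more than one output object (model duplicate).
Console command
- Once Scavenger is configured, a simple Artisan command launches the seeker. Because it is a console command, it is more efficient and timeouts are less likely.
- Artisan command: php artisan scavenger:seek
Schedule ready
- Scavenger can easily be set up to scrape on a schedule, which makes a relatively autonomous website simple to build.
Search engine results pages (SERP)
- Scavenger can be used to scrape search engine results pages.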
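
As an illustration of the "Schedule ready" point, the seek command can be registered with Laravel's task scheduler. This is a minimal sketch, assuming the conventional app/Console/Kernel.php of a Laravel application and a system cron entry that runs php artisan schedule:run every minute. The hourly interval is an arbitrary choice, not a Scavenger recommendation.

```php
<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Run the Scavenger seek command once an hour (interval chosen for illustration).
        // withoutOverlapping() skips a run while the previous one is still in progress.
        $schedule->command('scavenger:seek')
            ->hourly()
            ->withoutOverlapping();
    }
}
```

Long scrapes are the reason for withoutOverlapping() here: a run that takes more than an hour would otherwise start a second seeker alongside the first.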

Installation

Install via composer; in your terminal:

composer require reliqarts/laravel-scavenger

or require it in composer.json:

{
    "require": {
        "reliqarts/laravel-scavenger": "^3.1"
    }
}

Then run composer update in your terminal to pull it in.

(Optional) Publish package resources and configuration:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider"

You may opt to publish only the configuration by using the scavenger-config tag:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-config"

or only the migrations via the scavenger-migrations tag:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-migrations"

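Configuration

Scavenger's configuration is highly flexible. Once configured, the settings are used for every scrape.

Structure

The following is an example of a typical configuration file, with comments explaining each setting.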

<?php

return [
    // debug mode?
    'debug' => false,

    // whether log file should be written
    'log' => true,

    // How much detail is expected in output, 1 being the lowest, 3 being highest.
    'verbosity' => 1,

    // Set the database config
    'database' => [
        // Scraps table
        'scraps_table' => env('SCAVENGER_SCRAPS_TABLE', 'scavenger_scraps'),
    ],

    // Daemon config - used to build daemon user
    'daemon' => [
        // Model to use for Daemon identification and login
        'model' => 'App\\User',

        // Model property to check for daemon ID
        'id_prop' => 'email',

        // Daemon ID
        'id' => 'daemon@scavenger.reliqarts.com',

        // Any additional information required to create a user:
        // NB. this is only used when creating a daemon user, there is no "safe" way
        // to change the daemon's password once he has been created.
        'info' => [
            'name' => 'Scavenger Daemon',
            'password' => 'pass',
        ],
    ],

    // guzzle settings
    'guzzle_settings' => [
        'timeout' => 60,
    ],

    // hashing algorithm to use
    'hash_algorithm' => 'sha512',

    // storage
    'storage' => [
        // This directory will live inside your application's log directory.
        'log_dir' => env('SCAVENGER_LOG_DIR', 'scavenger'),
    ],

    // different model entities and mapping information
    'targets' => [
        // NB. the "rooms" target shown below is for example purposes only. It has all possible keys explicitly.
        'rooms' => [
            'example' => true,
            'serp' => false,
            'model' => 'App\\Room',
            'source' => 'http://myroomslistingsite.1demo/section/rooms',
            'search' => [
                // keywords
                'keywords' => ['professional'],
                // form markup
                'form' => [
                    // search form selector (important)
                    'selector' => '#form',
                    // input element name for search term/keyword
                    'keyword_input_name' => 'keyword',
                    'submit_button' => [
                        // text on submit button (optional)
                        'text' => null,
                        // submit element id, use if button doesn't have text (optional)
                        'id' => null,
                    ],
                ],
            ],
            'pager' => [
                // link (a tag) selector
                'selector' => 'div.content #page a.pagingnav',
            ],
            // max. number of pages to scrape (0 is unlimited)
            'pages' => 0,
            // content markup: actual data to be scraped
            'markup' => [
                'title' => 'div.content section > table tr h3',
                // inside: content to be found upon clicking title link
                '__inside' => [
                    'title' => '#ad-title > h1 > a',
                    'body' => 'article .adcontent > p[align="LEFT"]:last-of-type',
                    // focus: focus detail on the following section
                    '__focus' => 'section section > .content #ad-detail > article',
                ],
                // wrapper/item/result: wrapping selector for each item on single page.
                // If inside special key is set this key becomes invalid (i.e. inside takes preference)
                '__result' => null,
            ],
            // split single attributes into multiple based on regex
            'dissect' => [
                'body' => [
                    'email' => '(([eE]mail)*:*\s*\w+\@(\s*\w)*\.(net|com))',
                    'phone' => '((([cC]all|[tT]el|[Pp][Hh](one)*)[:\d\-,\sDL\/]*\d)|(\d{3}\-?\d{4}))',
                    'beds' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]edroom|b\/r|[Bb]ed)s?)',
                    'baths' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]athroom|bth|[Bb]ath)s?)',
                    // retain:  whether details should be left in source attribute after extraction
                    '__retain' => true,
                ],
            ],
            // modify attributes by calling functions
            'preprocess' => [
                // takes a callable
                // optional third parameter of array if callable method needs an instance
                // e.g. ['App\\Item', 'foo', true] or 'bar'
                'title' => null,
            ],
            // remap entity attributes to model properties (optional)
            'remap' => [
                'title' => null,
                'body' => null,
            ],
            // scraps containing any of these words will be rejected (optional)
            'bad_words' => [
                'office',
            ],
        ],

        // Google SERP example:
        'google' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\GoogleResult',
            'source' => 'https://www.google.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form[name="f"]',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 2,
            'pager' => [
                'selector' => '#foot > table > tr > td.b:last-child a',
            ],
            'markup' => [
                '__result' => 'div.g',
                'title' => 'h3 > a',
                'description' => '.st',
                // the 'link' and 'position' attributes make use of some of Scavenger's available properties
                'link' => '__link',
                'position' => '__position',
            ],
        ],

        // Bing SERP example:
        'bing' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\BingResult',
            'source' => 'https://www.bing.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form#sb_form',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 3,
            'pager' => [
                'selector' => '.sb_pagN',
            ],
            'markup' => [
                '__result' => '.b_algo',
                'title' => 'h2 a',
                'description' => '.b_caption p',
                'link' => '__link',
                'position' => '__position',
            ],
        ],
    ],
];
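
For comparison with the exhaustive "rooms" example, a smaller target might look like the fragment below. It is a sketch only: the "articles" key, the App\Article model, the URL, the CSS selectors, and the App\Services\Paraphraser class are all hypothetical, and this README does not state which keys are strictly required, so treat the selection of keys as an assumption.

```php
<?php

// Fragment of a published Scavenger config file: only the 'targets' key is shown.
return [
    'targets' => [
        // hypothetical target: every value below is a placeholder
        'articles' => [
            'serp' => false,
            'model' => 'App\\Article',
            'source' => 'https://news.example.com/latest',
            'pager' => [
                // link (a tag) selector for the next page
                'selector' => 'nav.pagination a.next',
            ],
            // stop after five pages (0 would be unlimited)
            'pages' => 5,
            'markup' => [
                'title' => 'ul.articles li h2',
                // content found on the page behind each title link
                '__inside' => [
                    'title' => 'article h1',
                    'body' => 'article .content',
                ],
            ],
            'preprocess' => [
                // a plain function name is a callable
                'title' => 'trim',
                // [class, method, true] when the method needs an instance
                'body' => ['App\\Services\\Paraphraser', 'paraphrase', true],
            ],
            'remap' => [
                // save the scraped "body" attribute as "content" on the model
                'body' => 'content',
            ],
            'bad_words' => [
                'sponsored',
            ],
        ],
    ],
];
```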

Target Breakdown

The targets array contains a list of entities (for scraping), indexed by a unique identifier key. The structure is as follows.

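model: The Laravel DB model to create from the target.
source: The source URL to scrape.
search: Search settings. Use these if a search must be performed before the target data is shown. (optional)
- keywords: Array of keywords to search for.
- keyword_input: Keyword input text markup.
- form_markup: CSS selector for the search form.
- submit_button_text: Text on the form's submit button.
pager: CSS selector for the "next" link, used to move to the next page.
markup: Array of attributes to scrape from the main list. [attributeName => CSS selector]
- __inside: Sub-markup for the detail page, i.e. the markup of the page shown when an article title is clicked/opened. (optional)
dissect: Splits compound attributes into smaller attributes via regex. (optional)
preprocess: Array of attributes that need preprocessing. [attributeName => callable] (optional)
remap: Array of attributes that need to be renamed before being saved on the target object. [attributeName => newName] (optional)
bad_words: Any scrap containing one of these words is discarded. (optional)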
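
The dissect, remap, and bad_words settings are easier to picture with a small example. The sketch below is plain PHP that mimics what those three settings describe; it is not Scavenger's own code, and the function names dissect, remap, and hasBadWords are invented for illustration. It assumes each dissect pattern is wrapped in "/" delimiters, which is why the patterns escape their slashes.

```php
<?php

declare(strict_types=1);

/**
 * Pull sub-attributes out of a compound attribute, one regex per new attribute.
 * When $retain is false, matched text is removed from the source attribute.
 */
function dissect(array $scrap, string $source, array $patterns, bool $retain = true): array
{
    foreach ($patterns as $name => $pattern) {
        // assumption: patterns are written without delimiters, slashes escaped
        if (preg_match('/' . $pattern . '/', $scrap[$source], $matches) === 1) {
            $scrap[$name] = trim($matches[0]);
            if (!$retain) {
                $scrap[$source] = trim(str_replace($matches[0], '', $scrap[$source]));
            }
        }
    }

    return $scrap;
}

/** Rename attributes: [attributeName => newName]; null entries are left alone. */
function remap(array $scrap, array $map): array
{
    foreach ($map as $from => $to) {
        if ($to !== null && $to !== $from && array_key_exists($from, $scrap)) {
            $scrap[$to] = $scrap[$from];
            unset($scrap[$from]);
        }
    }

    return $scrap;
}

/** True when any string attribute contains one of the bad words (case-insensitive). */
function hasBadWords(array $scrap, array $badWords): bool
{
    $text = implode(' ', array_filter($scrap, 'is_string'));
    foreach ($badWords as $word) {
        if (stripos($text, $word) !== false) {
            return true;
        }
    }

    return false;
}

// Usage with the "beds" and "baths" patterns from the sample configuration.
$scrap = ['title' => 'Room for rent', 'body' => 'Sunny flat, 2 bedrooms, 1 bath.'];
$scrap = dissect($scrap, 'body', [
    'beds' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]edroom|b\/r|[Bb]ed)s?)',
    'baths' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]athroom|bth|[Bb]ath)s?)',
]);
$scrap = remap($scrap, ['body' => 'description']);
$rejected = hasBadWords($scrap, ['office']);
```

With $retain left at true, the matched text stays in the source attribute, which corresponds to '__retain' => true in the sample configuration.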

Glossary

The following terms may appear in the context above.

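Daemon: The user instance the Scavenger service will use.
Scrap: Scraped data before it is converted into a target object.
Target: A configured source-to-model mapping for a single entity.
Target Object: The Eloquent model object generated from a scrape.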
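Acknowledgements

This library was greatly inspired by, and depends on, the Guzzle library, although several concepts may have been adjusted.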
