reliqarts / laravel-scavenger
The most integrated web scraper package for Laravel.
v3.5.2
2023-03-05 01:08 UTC
Requires
- php: ^7.4 || ^8.0
- ext-dom: *
- ext-iconv: *
- ext-json: *
- fabpot/goutte: ^4.0
- illuminate/support: 6 - 10
- monolog/monolog: 1.24 - 3
Requires (Dev)
- orchestra/testbench: 4 - 8
- phpro/grumphp: ^1.0
- phpspec/prophecy-phpunit: ^2.0
- phpunit/phpunit: ^9.3
- symplify/easy-coding-standard: >=8.2
README
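The most integrated web scraper package for Laravel.
Top Features
Scavenger provides the following features, and more, out of the box.
- Ease of use
- Scavenger is very easy to configure. Simply publish the config file and set your targets.
- Scrape data from multiple sources at once.
- Convert scraped data into usable Laravel model objects.
- E.g. you may scrape an article, have it converted into an object of your choice, and save it in your database, where it is immediately available to your viewers.
- You can easily perform one or more operations on each property of any scraped entity.
- E.g. you may call a paraphrasing service from a model or package of your choice before the data is saved to your database.
- Data integrity constraints
- Scavenger uses a hashing algorithm of your choice to maintain data integrity. This hash ensures that one scrap (source article) is not converted into multiple output objects (model duplicates).
- Console command
- Once Scavenger is configured, a simple Artisan command launches the seeker. Because it runs as a console command, it is more efficient and timeouts are less likely to occur.
- Artisan command: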
php artisan scavenger:seek
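- Schedule ready
- Scavenger can easily be set to scrape on a schedule, which makes it simple to build a largely autonomous website.
- Search engine result pages (SERP)
- Scavenger can be used flexibly to scrape search engine result pages.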
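As a minimal sketch of the scheduling point above, the scavenger:seek command can be registered with Laravel's task scheduler. This assumes a standard app/Console/Kernel.php (as found in Laravel 10 and earlier) and a cron entry that already runs php artisan schedule:run every minute. The daily frequency is only an illustration.

```php
<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    /**
     * Register scheduled commands for the application.
     */
    protected function schedule(Schedule $schedule): void
    {
        // Run the Scavenger seeker once a day (frequency chosen for illustration).
        // withoutOverlapping() keeps a slow scrape from starting twice.
        $schedule->command('scavenger:seek')
            ->daily()
            ->withoutOverlapping();
    }
}
```

Any other scheduler frequency method (for example hourly()) could be used instead, depending on how often the sources change.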
Installation
- Install via composer; in your terminal:
composer require reliqarts/laravel-scavenger
or require it in composer.json
{ "require": { "reliqarts/laravel-scavenger": "^3.1" } }
then, in your terminal, run
composer update
to pull it in.
- (Optional) Publish package resources and configuration:
php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider"
You may opt to publish only the configuration by using the scavenger-config tag:
php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-config"
or only the migrations via the scavenger-migrations tag:
php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-migrations"
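Configuration
Scavenger's configuration is highly flexible. Once configured, the settings are applied on every scrape.
Structure
Below is an example of a typical configuration file, with comments explaining each setting.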
<?php

return [
    // debug mode?
    'debug' => false,

    // whether log file should be written
    'log' => true,

    // How much detail is expected in output, 1 being the lowest, 3 being highest.
    'verbosity' => 1,

    // Set the database config
    'database' => [
        // Scraps table
        'scraps_table' => env('SCAVENGER_SCRAPS_TABLE', 'scavenger_scraps'),
    ],

    // Daemon config - used to build daemon user
    'daemon' => [
        // Model to use for Daemon identification and login
        'model' => 'App\\User',

        // Model property to check for daemon ID
        'id_prop' => 'email',

        // Daemon ID
        'id' => 'daemon@scavenger.reliqarts.com',

        // Any additional information required to create a user:
        // NB. this is only used when creating a daemon user, there is no "safe" way
        // to change the daemon's password once he has been created.
        'info' => [
            'name' => 'Scavenger Daemon',
            'password' => 'pass',
        ],
    ],

    // guzzle settings
    'guzzle_settings' => [
        'timeout' => 60,
    ],

    // hashing algorithm to use
    'hash_algorithm' => 'sha512',

    // storage
    'storage' => [
        // This directory will live inside your application's log directory.
        'log_dir' => env('SCAVENGER_LOG_DIR', 'scavenger'),
    ],

    // different model entities and mapping information
    'targets' => [
        // NB. the "rooms" target shown below is for example purposes only. It has all possible keys explicitly.
        'rooms' => [
            'example' => true,
            'serp' => false,
            'model' => 'App\\Room',
            'source' => 'http://myroomslistingsite.1demo/section/rooms',
            'search' => [
                // keywords
                'keywords' => ['professional'],
                // form markup
                'form' => [
                    // search form selector (important)
                    'selector' => '#form',
                    // input element name for search term/keyword
                    'keyword_input_name' => 'keyword',
                    'submit_button' => [
                        // text on submit button (optional)
                        'text' => null,
                        // submit element id, use if button doesn't have text (optional)
                        'id' => null,
                    ],
                ],
            ],
            'pager' => [
                // link (a tag) selector
                'selector' => 'div.content #page a.pagingnav',
            ],
            // max. number of pages to scrape (0 is unlimited)
            'pages' => 0,
            // content markup: actual data to be scraped
            'markup' => [
                'title' => 'div.content section > table tr h3',
                // inside: content to be found upon clicking title link
                '__inside' => [
                    'title' => '#ad-title > h1 > a',
                    'body' => 'article .adcontent > p[align="LEFT"]:last-of-type',
                    // focus: focus detail on the following section
                    '__focus' => 'section section > .content #ad-detail > article',
                ],
                // wrapper/item/result: wrapping selector for each item on single page.
                // If inside special key is set this key becomes invalid (i.e. inside takes precedence)
                '__result' => null,
            ],
            // split single attributes into multiple based on regex
            'dissect' => [
                'body' => [
                    'email' => '(([eE]mail)*:*\s*\w+\@(\s*\w)*\.(net|com))',
                    'phone' => '((([cC]all|[[tT]el|[Pp][Hh](one)*)[:\d\-,\sDL\/]*\d)|(\d{3}\-?\d{4}))',
                    'beds' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]edroom|b\/r|[Bb]ed)s?)',
                    'baths' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]athroom|bth|[Bb]ath)s?)',
                    // retain: whether details should be left in source attribute after extraction
                    '__retain' => true,
                ],
            ],
            // modify attributes by calling functions
            'preprocess' => [
                // takes a callable
                // optional third parameter of array if callable method needs an instance
                // e.g. ['App\\Item', 'foo', true] or 'bar'
                'title' => null,
            ],
            // remap entity attributes to model properties (optional)
            'remap' => [
                'title' => null,
                'body' => null,
            ],
            // scraps containing any of these words will be rejected (optional)
            'bad_words' => [
                'office',
            ],
        ],

        // Google SERP example:
        'google' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\GoogleResult',
            'source' => 'https://www.google.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form[name="f"]',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 2,
            'pager' => [
                'selector' => '#foot > table > tr > td.b:last-child a',
            ],
            'markup' => [
                '__result' => 'div.g',
                'title' => 'h3 > a',
                'description' => '.st',
                // the 'link' and 'position' attributes make use of some of Scavenger's available properties
                'link' => '__link',
                'position' => '__position',
            ],
        ],

        // Bing SERP example:
        'bing' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\BingResult',
            'source' => 'https://www.bing.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form#sb_form',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 3,
            'pager' => [
                'selector' => '.sb_pagN',
            ],
            'markup' => [
                '__result' => '.b_algo',
                'title' => 'h2 a',
                'description' => '.b_caption p',
                'link' => '__link',
                'position' => '__position',
            ],
        ],
    ],
];
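The hash_algorithm setting ties in with the data integrity feature: a hash of each scrap guards against model duplicates. The sketch below shows the general idea in plain PHP. The scrapFingerprint() helper and the in-memory $seen array are hypothetical stand-ins, not Scavenger's actual implementation, which may hash different fields and persist hashes in the scraps table.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: derive a stable fingerprint for scraped data.
 * Keys are sorted so that attribute order does not change the hash.
 */
function scrapFingerprint(array $attributes, string $algorithm = 'sha512'): string
{
    ksort($attributes);

    return hash($algorithm, json_encode($attributes, JSON_THROW_ON_ERROR));
}

// In-memory stand-in for stored scraps, keyed by fingerprint.
$seen = [];

$scraped = [
    ['title' => 'Room for rent', 'body' => '2 bedrooms, call 555-1234'],
    // Same data as above, only the key order differs.
    ['body' => '2 bedrooms, call 555-1234', 'title' => 'Room for rent'],
    ['title' => 'Studio downtown', 'body' => '1 bed, 1 bath'],
];

foreach ($scraped as $attributes) {
    $fingerprint = scrapFingerprint($attributes);

    if (isset($seen[$fingerprint])) {
        continue; // duplicate scrap: skip instead of creating a second model
    }

    $seen[$fingerprint] = $attributes;
}

echo count($seen), " unique scraps\n"; // prints "2 unique scraps"
```

Any algorithm name accepted by PHP's hash() function (see hash_algos()) could be passed as the second argument.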
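Target Breakdown
The targets array contains a list of entities (to be scraped), indexed by a unique identifier key. Its structure is as follows.
- model: Laravel DB model to create from the target.
- source: The source URL to scrape.
- search: Search settings. Use if a search must be performed before the target data is shown. (optional)
  - keywords: Array of keywords to search for.
  - keyword_input: Keyword input text markup (form.keyword_input_name in the example above).
  - form_markup: CSS selector for the search form (form.selector in the example above).
  - submit_button_text: Text on the form's submit button (form.submit_button.text in the example above).
- pager: CSS selector for the "next" link. Used to move to the next page.
- markup: Array of attributes to scrape from the main list. [attributeName => CSS selector]
  - __inside: Sub-markup for the detail page, i.e. the page shown when an article title is clicked/opened. (optional)
- dissect: Split compound attributes into smaller attributes via regex. (optional)
- preprocess: Array of attributes that need to be preprocessed. [attributeName => callable] (optional)
- remap: Array of attributes that need to be renamed in order to be saved on the target object. [attributeName => newName] (optional)
- bad_words: Any scrap containing one of these words is discarded. (optional)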
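To make the dissect, preprocess, and bad_words options more concrete, here is a rough sketch of what each step amounts to in plain PHP. The dissect(), preprocess(), and hasBadWord() functions are hypothetical illustrations written for this example and are not part of Scavenger's API. The two regular expressions are taken from the sample configuration.

```php
<?php

declare(strict_types=1);

// Hypothetical helper: extract smaller attributes from $scrap[$source] via regex.
function dissect(array $scrap, string $source, array $patterns, bool $retain = true): array
{
    foreach ($patterns as $name => $pattern) {
        // '~' delimiters avoid clashing with the escaped slashes in the patterns.
        if (preg_match('~' . $pattern . '~', $scrap[$source], $matches) === 1) {
            $scrap[$name] = trim($matches[0]);

            if (!$retain) {
                // Drop the extracted detail from the source attribute.
                $scrap[$source] = str_replace($matches[0], '', $scrap[$source]);
            }
        }
    }

    return $scrap;
}

// Hypothetical helper: run one callable per configured attribute.
function preprocess(array $scrap, array $callables): array
{
    foreach ($callables as $attribute => $callable) {
        if ($callable !== null && isset($scrap[$attribute])) {
            $scrap[$attribute] = $callable($scrap[$attribute]);
        }
    }

    return $scrap;
}

// Hypothetical helper: true when any string attribute contains a rejected word.
function hasBadWord(array $scrap, array $badWords): bool
{
    foreach ($scrap as $value) {
        foreach ($badWords as $word) {
            if (is_string($value) && stripos($value, $word) !== false) {
                return true;
            }
        }
    }

    return false;
}

$scrap = [
    'title' => '  Cozy apartment  ',
    'body' => 'Bright unit with 2 bedrooms and 1 bathroom.',
];

$scrap = dissect($scrap, 'body', [
    'beds' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]edroom|b\/r|[Bb]ed)s?)',
    'baths' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]athroom|bth|[Bb]ath)s?)',
]);
$scrap = preprocess($scrap, ['title' => 'trim']);
$rejected = hasBadWord($scrap, ['office']);

print_r($scrap);     // now also holds 'beds' => '2 bedrooms' and 'baths' => '1 bathroom'
var_dump($rejected); // bool(false): no bad word present
```

The order of the steps here (dissect, then preprocess, then the bad-word check) is an assumption made for the example, not a documented guarantee.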
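Glossary
The following terms may appear in the context above.
- Daemon: The user instance used by the Scavenger service.
- Scrap: Scraped data before it is converted into the target object.
- Target: The configured source-to-model mapping for a single entity.
- Target Object: The Eloquent model object generated from a scrap.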
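Acknowledgements
This library was heavily inspired by, and depends on, the Guzzle library, although several concepts may have been adjusted.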