README

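Crawler

An easy-to-use full-site page crawler for providing search results on your website. The crawler module collects information from all pages on the configured domain and stores the index in the database. From there you can run search queries to provide search results. Helper methods are also included that provide smarter search results by splitting the input into multiple search queries (used by default).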

Installation

Install the module with Composer:

composer require luyadev/luya-module-crawler:^3.0

After installing via Composer, add the module to the modules section of your configuration file:

'modules' => [
    //...
    'crawler' => [
        'class' => 'luya\crawler\frontend\Module',
        'baseUrl' => 'https://luya.io',
        /*
        'filterRegex' => [
            '#\.html#i', // filter all links containing `.html` (the dot is escaped so it matches literally)
            '#/agenda#i', // filter all links containing `/agenda`
            '#date\=#i', // filter all links containing `date=`, for example an agenda that would otherwise generate infinite links
        ],
        'on beforeProcess' => function() {
            // optional add or filter data from the BuilderIndex, which will be processed to the Index afterwards
        },
        'on afterIndex' => function() {
            // optional add or filter data from the freshly built Index
        }
        */
    ],
    'crawleradmin' => 'luya\crawler\admin\Module',
]

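Here baseUrl is the domain whose pages you want to crawl.

After setting up the module in your config, you must run the migrate and import commands (the import sets up the permissions):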
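To get a feel for which links the example filterRegex patterns would exclude, the following is a minimal sketch. It assumes that each pattern is tested against the full URL and that any match excludes the link; how the module applies the patterns internally is not shown in this README. The helper isFiltered() and the sample URLs are hypothetical and not part of the module. The dot in the first pattern is escaped here so that it matches a literal `.`.

```php
<?php
// Hypothetical helper, not part of luya-module-crawler.
// Assumption: a link is excluded as soon as one pattern matches its URL.
function isFiltered(string $url, array $filterRegex): bool
{
    foreach ($filterRegex as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }

    return false;
}

// Patterns modelled on the configuration example above.
$filterRegex = [
    '#\.html#i',  // links containing ".html"
    '#/agenda#i', // links containing "/agenda"
    '#date\=#i',  // links containing "date="
];

// Made-up sample URLs for illustration only.
$urls = [
    'https://luya.io/guide',
    'https://luya.io/old/page.html',
    'https://luya.io/agenda/2024',
    'https://luya.io/events?date=2024-01-01',
];

foreach ($urls as $url) {
    echo $url, ' => ', isFiltered($url, $filterRegex) ? 'filtered' : 'crawled', PHP_EOL;
}
```

With these sample URLs only the first one is reported as crawled; the other three each match one of the patterns.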

./vendor/bin/luya migrate
./vendor/bin/luya import

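Running the crawler

To run the crawler process, execute the crawler/crawl command. You should put this command in a cronjob so that your index stays up to date.

Make sure your pages are in UTF-8 mode (<meta charset="utf-8"/>) and that the language is set: <html lang="<?= Yii::$app->composition->langShortCode; ?>">.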

./vendor/bin/luya crawler/crawl

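To keep the crawl results current, create a cronjob that crawls the site every night: cd httpdocs/current && ./vendor/bin/luya crawler/crawl

Crawler arguments

The crawler/crawl command accepts the following arguments, for example: crawler/crawl --pdfs=0 --concurrent=5 --linkcheck=0

Name	Description	Default
linkcheck	Whether the crawler should check all links after indexing your site	yes
pdfs	Whether the crawler should index PDF files	yes
concurrent	The number of pages crawled concurrently	15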

Statistics

You can also collect statistics by enabling a cronjob that runs once a week:

./vendor/bin/luya crawler/statistic

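Creating the search form

Send a request with the query parameter to the crawler/default/index route (the example form below submits it with method="get") and render the view as follows: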

<?php
use luya\helpers\Url;
use yii\widgets\LinkPager;
use luya\crawler\widgets\DidYouMeanWidget;
/* @var $query string The lookup query encoded */
/* @var $language string */
/* @var $this \luya\web\View */
/* @var $provider \yii\data\ActiveDataProvider */
/* @var $searchModel \luya\crawler\models\Searchdata */
?>

<form class="searchpage__searched-form" action="<?= Url::toRoute(['/crawler/default/index']); ?>" method="get">
    <input id="search" name="query" type="search" value="<?= $query ?>">
    <input type="submit" value="Search"/>
</form>

<h2><?= $provider->totalCount; ?> Results</h2>

<?php if ($query && $provider->totalCount == 0): ?>
    <div>No results found for &laquo;<?= $query; ?>&raquo;.</div>
<?php endif; ?>

<?= DidYouMeanWidget::widget(['searchModel' => $searchModel]); ?>
<?php foreach($provider->models as $item): /* @var $item \luya\crawler\models\Index */ ?>
    <h3><?= $item->title; ?></h3>
    <p><?= $item->preview($query); ?></p>
    <a href="<?= $item->url; ?>"><?= $item->url; ?></a>
<?php endforeach; ?>
<?= LinkPager::widget(['pagination' => $provider->pagination]); ?>

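Crawler settings

You can use crawler tags to trigger certain behavior or to store information:

Tag	Example	Description
CRAWL_IGNORE	`<!-- [CRAWL_IGNORE] -->Ignore this content<!-- [/CRAWL_IGNORE] -->`	Excludes the enclosed content from the index.
CRAWL_FULL_IGNORE	`<!-- [CRAWL_FULL_IGNORE] -->`	Excludes the whole page from the index. Note that links found inside the ignored page are still added to the index.
CRAWL_GROUP	`<!-- [CRAWL_GROUP]api[/CRAWL_GROUP] -->`	Tells the crawler which group/section the current page belongs to, so that you can group results by the `group` field.
CRAWL_TITLE	`<!-- [CRAWL_TITLE]My Title[/CRAWL_TITLE] -->`	Ensures that your custom title is always used as the title of the page.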
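As an illustration of how the tags from the table might be combined on one page, here is a minimal sketch. The function renderArticle() and its arguments are hypothetical and not part of the module; in a real project the tags would simply be placed in your layout or view files. Only the tag syntax itself is taken from the table above.

```php
<?php
// Hypothetical helper for illustration; not part of luya-module-crawler.
// It wraps a page body with the crawler tags described in the table above.
function renderArticle(string $title, string $group, string $body): string
{
    // Escape the values so they cannot break out of the surrounding markup.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $safeGroup = htmlspecialchars($group, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<!-- [CRAWL_GROUP]{$safeGroup}[/CRAWL_GROUP] -->
<!-- [CRAWL_TITLE]{$safeTitle}[/CRAWL_TITLE] -->
<!-- [CRAWL_IGNORE] -->
<nav>Navigation that should not end up in the index</nav>
<!-- [/CRAWL_IGNORE] -->
<article>{$body}</article>
HTML;
}

// Example call with placeholder values.
echo renderArticle('My Title', 'api', '<p>Indexed content.</p>');
```

Wrapping recurring elements such as navigation in CRAWL_IGNORE keeps them out of the indexed content, while CRAWL_GROUP and CRAWL_TITLE attach the group and title to the page.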
