README

爬虫

一个易于使用的全站页面爬虫，用于在您的页面上提供搜索结果。爬虫模块收集配置域上的所有网站信息并将索引存储在数据库中。从这里，您可以创建搜索查询以提供搜索结果。还有帮助方法，通过将输入分割成多个搜索查询来提供智能搜索结果（默认使用）。

安装

通过composer安装模块

composer require luyadev/luya-module-crawler:^3.0

通过Composer安装后，在配置文件的模块部分包含该模块。

'modules' => [
    //...
    'crawler' => [
        'class' => 'luya\crawler\frontend\Module',
        'baseUrl' => 'https://luya.io',
        /*
        'filterRegex' => [
            '#.html#i', // filter all links with `.html`
            '#/agenda#i', // filter all links which contain the word with leading slash agenda,
            '#date\=#i, // filter all links with the word date inside. for example when using an agenda which will generate infinite links
        ],
        'on beforeProcess' => function() {
            // optional add or filter data from the BuilderIndex, which will be processed to the Index afterwards
        },
        'on afterIndex' => function() {
            // optional add or filter data from the freshly built Index
        }
        */
    ],
    'crawleradmin' => 'luya\crawler\admin\Module',
]

其中 baseUrl 是您想要爬取所有信息的域名。

在您的配置中设置模块后，您必须运行迁移和导入命令（以设置权限）

./vendor/bin/luya migrate
./vendor/bin/luya import

运行爬虫

要执行命令（并运行爬虫过程），使用爬虫命令 crawl，您应该将此命令放入cronjob以确保您的索引是最新的

确保您的页面在utf8模式下（<meta charset="utf-8"/>）并且确保设置语言 <html lang="<?= Yii::$app->composition->langShortCode; ?>">。

./vendor/bin/luya crawler/crawl

为了提供当前的爬虫结果，您应该创建一个cronjob，每天晚上爬取页面：cd httpdocs/current && ./vendor/bin/luya crawler/crawl

爬虫参数

crawler/crawl的所有爬虫参数，例如：crawler/crawl --pdfs=0 --concurrent=5 --linkcheck=0

统计信息

您还可以通过启用每周执行一次的cronjob来获取统计结果

./vendor/bin/luya crawler/statistic

创建搜索表单

向crawler/default/index路由发送带有query的POST请求，并按以下方式渲染视图

<?php
use luya\helpers\Url;
use yii\widgets\LinkPager;
use luya\crawler\widgets\DidYouMeanWidget;
/* @var $query string The lookup query encoded */
/* @var $language string */
/* @var $this \luya\web\View */
/* @var $provider \yii\data\ActiveDataProvider */
/* @var $searchModel \luya\crawler\models\Searchdata */
?>

<form class="searchpage__searched-form" action="<?= Url::toRoute(['/crawler/default/index']); ?>" method="get">
    <input id="search" name="query" type="search" value="<?= $query ?>">
    <input type="submit" value="Search"/>
</form>

<h2><?= $provider->totalCount; ?> Results</h2>

<?php if ($query && $provider->totalCount == 0): ?>
    <div>No results found for &laquo;<?= $query; ?>&raquo;.</div>
<?php endif; ?>

<?= DidYouMeanWidget::widget(['searchModel' => $searchModel]); ?>
<?php foreach($provider->models as $item): /* @var $item \luya\crawler\models\Index */ ?>
    <h3><?= $item->title; ?></h3>
    <p><?= $item->preview($query); ?></p>
    <a href="<?= $item->url; ?>"><?= $item->url; ?></a>
<?php endforeach; ?>
<?= LinkPager::widget(['pagination' => $provider->pagination]); ?>

爬虫设置

您可以使用爬虫标签来触发某些事件或存储信息

luyadev / luya-module-crawler

维护者

详细信息