truercm/laravel-webscrape

Scrape web pages with a Laravel application.

1.2.0 2024-09-06 14:43 UTC

This package is auto-updated.

Last update: 2024-09-06 14:44:09 UTC


README

Webscrape

Scrape web pages with a Laravel application.

Installation

You can install the package via Composer:

composer require truercm/laravel-webscrape

You can publish and run the migrations with:

php artisan vendor:publish --tag="laravel-webscrape-migrations"
php artisan migrate

You can publish the config file with:

php artisan vendor:publish --tag="laravel-webscrape-config"

These are the contents of the published config file:

return [

    /*
    |--------------------------------------------------------------------------
    | Webscrape models
    |--------------------------------------------------------------------------
    */
    'models' => [

        /*
        |--------------------------------------------------------------------------
        | Subject model holds the credentials, target_id and the final scraping result
        |--------------------------------------------------------------------------
        */
        'subject' => TrueRcm\LaravelWebscrape\Models\CrawlSubject::class,

        /*
        |--------------------------------------------------------------------------
        | Target model stores the remote target, authentication url and processing job
        |--------------------------------------------------------------------------
        */
        'target' => TrueRcm\LaravelWebscrape\Models\CrawlTarget::class,

        /*
        |--------------------------------------------------------------------------
        | TargetUrl model collects all URLs for the Target
        |--------------------------------------------------------------------------
        */
        'target_url' => TrueRcm\LaravelWebscrape\Models\CrawlTargetUrl::class,

        /*
        |--------------------------------------------------------------------------
        | Url Result model stores processed results
        |--------------------------------------------------------------------------
        */
        'result' => TrueRcm\LaravelWebscrape\Models\CrawlResult::class,
    ],

    /*
     |--------------------------------------------------------------------------
     | Selenium driver url
     |--------------------------------------------------------------------------
     */
    'selenium_driver_url' => env('SELENIUM_DRIVER_URL', null),
];
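The `models` array above maps each role to a class, which suggests a model can be swapped for your own. The excerpt below is a minimal sketch of such an override, assuming the package resolves its models through this config. `App\Models\ScrapeSubject` is a hypothetical class, not something the package ships.

```php
// Excerpt of the config file published by the "laravel-webscrape-config" tag.
'models' => [
    // Hypothetical subclass of TrueRcm\LaravelWebscrape\Models\CrawlSubject,
    // e.g. to add your own relationships or accessors.
    'subject' => App\Models\ScrapeSubject::class,

    // The remaining entries keep the package defaults.
    'target' => TrueRcm\LaravelWebscrape\Models\CrawlTarget::class,
    'target_url' => TrueRcm\LaravelWebscrape\Models\CrawlTargetUrl::class,
    'result' => TrueRcm\LaravelWebscrape\Models\CrawlResult::class,
],
```

Extending the package model rather than replacing it keeps the table, casts, and relationships the package relies on.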

Laravel Webscrape uses Selenium to crawl pages, so make sure you have it installed.

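Usage

This is a general-purpose package; you need to implement all of the scraping steps yourself.

The high-level overview involves:

  1. Having a CrawlTarget - a model that holds the entry point for the list of pages you need to crawl
  2. Having a CrawlSubject - a model that connects credentials with the crawl target

Once you have registered a target, you can:

  1. Initialize a subject with the credentials and target URL:

    $crawlSubject = \TrueRcm\LaravelWebscrape\Actions\StoreCrawlSubject::run([
        'model_type' => App\Models\User::class,
        'model_id' => 1,
        'crawl_target_id' => 1,
        'credentials' => ['values' => 'that would be piped', 'into' => 'crawl target'],
    ]);

  2. Start crawling the remote URLs and process the results:

    resolve($crawlSubject->crawlTarget->crawling_job)
        ->dispatch($crawlSubject);

  3. Once the job has finished, read the final result from the CrawlSubject's result column:

    $crawlSubject->result;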
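The class stored in `crawling_job` is yours to write. The following is a minimal sketch of what such a job could look like, assuming a standard Laravel queued job. The class name `CrawlExampleSite`, the placeholder result, and the assumption that `credentials` and `result` are array-cast attributes are all illustrative and not taken from the package.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use TrueRcm\LaravelWebscrape\Models\CrawlSubject;

// Hypothetical job class; its fully qualified name would be stored
// in the crawl target's crawling_job column.
class CrawlExampleSite implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Receives the subject passed to ->dispatch($crawlSubject).
    public function __construct(public CrawlSubject $crawlSubject)
    {
    }

    public function handle(): void
    {
        // Credentials supplied when the subject was stored.
        $credentials = $this->crawlSubject->credentials;

        // Site-specific steps go here: authenticate with $credentials,
        // visit the target's URLs through Selenium, parse each page.
        // This placeholder only records when the job ran.
        $result = ['finished_at' => now()->toIso8601String()];

        // Assumes the model casts `result` to an array/JSON column.
        $this->crawlSubject->result = $result;
        $this->crawlSubject->save();
    }
}
```

When the job runs on a queue, the `$crawlSubject` instance held by the caller is not updated in place; calling `$crawlSubject->refresh()` before reading `result` reloads it from the database.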

Testing

composer test

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security Vulnerabilities

Please review our security policy on how to report security vulnerabilities.

Credits

License

The MIT License (MIT). Please see the License File for more information.