truercm / laravel-webscrape
Scrape web pages with a Laravel application
1.2.0
2024-09-06 14:43 UTC
Requires
- php: ^8.0
- dbrekelmans/bdi: ^1.2
- frictionlessdigital/actions: ^9.0|^10.0|^11.0
- illuminate/contracts: ^8.0|^9.0|^10.0|^11.0
- spatie/laravel-package-tools: ^1.12
- symfony/browser-kit: ^6.0|^7.0
- symfony/http-client: ^6.0|^7.0
- symfony/panther: ^2.1
Requires (Dev)
- dg/bypass-finals: ^1.7
- nunomaduro/collision: ^5.0|^6.0|^7.0|^8.0
- orchestra/testbench: ^6.0|^7.0|^8.0|^9.0
- pestphp/pest: ^1.0|^2.0
- phpspec/prophecy: ~1.0
README
Webscrape
Scrape web pages with a Laravel application.
Installation
You can install the package via Composer:
composer require truercm/laravel-webscrape
You can publish and run the migrations with:
php artisan vendor:publish --tag="laravel-webscrape-migrations"
php artisan migrate
You can publish the config file with:
php artisan vendor:publish --tag="laravel-webscrape-config"
These are the contents of the published config file:
return [
    /*
    |--------------------------------------------------------------------------
    | Webscrape models
    |--------------------------------------------------------------------------
    */
    'models' => [
        /*
        |--------------------------------------------------------------------------
        | Subject model holds the credentials, target_id and the final scraping result
        |--------------------------------------------------------------------------
        */
        'subject' => TrueRcm\LaravelWebscrape\Models\CrawlSubject::class,

        /*
        |--------------------------------------------------------------------------
        | Target model stores the remote target, authentication url and processing job
        |--------------------------------------------------------------------------
        */
        'target' => TrueRcm\LaravelWebscrape\Models\CrawlTarget::class,

        /*
        |--------------------------------------------------------------------------
        | TargetUrl model collects all URLs for the Target
        |--------------------------------------------------------------------------
        */
        'target_url' => TrueRcm\LaravelWebscrape\Models\CrawlTargetUrl::class,

        /*
        |--------------------------------------------------------------------------
        | Url Result model stores processed results
        |--------------------------------------------------------------------------
        */
        'result' => TrueRcm\LaravelWebscrape\Models\CrawlResult::class,
    ],

    /*
    |--------------------------------------------------------------------------
    | Selenium driver url
    |--------------------------------------------------------------------------
    */
    'selenium_driver_url' => env('SELENIUM_DRIVER_URL', null),
];
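Laravel Webscrape uses Selenium to scrape pages, so make sure you have it installed.
Usage
This is a generic package: you need to implement all of the scraping steps yourself.
At a high level, this involves:
- a CrawlTarget: a model that holds the entry point for the list of pages you need to scrape
- a CrawlSubject: a model that connects credentials to a crawl target
Once you have registered a target, you can:
- initialize a subject with the credentials and the target URL
- start crawling the remote URLs and process the results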
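As a sketch of what this implies for configuration: the config file reads the SELENIUM_DRIVER_URL environment variable, so it can be set in the application's .env file. The address below is an assumption (a Selenium server running on the local machine on its default port, 4444); replace it with the address of your own Selenium server.

```dotenv
# Assumed address of a locally running Selenium server (default port 4444).
# Older Selenium 3 servers typically expect the /wd/hub suffix instead:
# SELENIUM_DRIVER_URL=http://localhost:4444/wd/hub
SELENIUM_DRIVER_URL=http://localhost:4444
```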
$crawlSubject = \TrueRcm\LaravelWebscrape\Actions\StoreCrawlSubject::run([
    'model_type' => App\Models\User::class,
    'model_id' => 1,
    'crawl_target_id' => 1,
    'credentials' => ['values' => 'that would be piped', 'into' => 'crawl target'],
]);
Then:
resolve($crawlSubject->crawlTarget->crawling_job)
    ->dispatch($crawlSubject);
- Once the job has finished, the final result is available in the result column of the CrawlSubject:
$crawlSubject->result;
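Because the package leaves the scraping steps to you, the class stored in crawling_job has to be written by the application. The following is only a minimal sketch of what such a job might look like, not the package's own implementation. The class name CrawlExampleSite, the URL, the h1 selector and the config key webscrape.selenium_driver_url are all hypothetical, and the sketch assumes that result is a writable attribute cast to an array or JSON. It uses Symfony Panther, which the package already requires.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Symfony\Component\Panther\Client;
use TrueRcm\LaravelWebscrape\Models\CrawlSubject;

// Hypothetical job class: the name, URL and selector are illustrative only.
class CrawlExampleSite implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public CrawlSubject $crawlSubject)
    {
    }

    public function handle(): void
    {
        // Assumption: the published config file is named "webscrape.php".
        $client = Client::createSeleniumClient(config('webscrape.selenium_driver_url'));

        try {
            // Load the page through Selenium and read the first <h1>.
            $crawler = $client->request('GET', 'https://example.com/');
            $title = $crawler->filter('h1')->text();
        } finally {
            // Always release the browser session, even if scraping fails.
            $client->quit();
        }

        // Assumption: "result" is cast to an array/JSON on the model.
        $this->crawlSubject->result = ['title' => $title];
        $this->crawlSubject->save();
    }
}
```

With a class like this, the crawling_job value of the CrawlTarget would be App\Jobs\CrawlExampleSite::class, and the dispatch call shown above would queue it for the given subject.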
Testing
composer test
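Changelog
Please see the changelog for more information on what has changed recently.
Contributing
Please see the contributing guide for details.
Security Vulnerabilities
Please review our security policy for details on how to report security vulnerabilities.
Credits
License
The MIT License (MIT). Please see the license file for more information.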