fievel / webspider
webspider
0.1.0
2016-07-21 13:16 UTC
Requires
- php: >=5.5
- doctrine/orm: >=2.2
- guzzlehttp/guzzle: ^6.0
- symfony/css-selector: >=2.7
- symfony/dom-crawler: >=2.7
- symfony/framework-bundle: >=2.7
This package is not auto-updated.
Last update: 2024-09-28 18:05:26 UTC
README
This repository wraps Guzzle and several Symfony components to provide a simple way to crawl websites.
Requirements
- PHP >=5.5
- Guzzle >= 6.0
- Doctrine ORM >= 2.2
- Symfony Components >= 2.7
Installation
Add fievel/webspider as a require dependency in your composer.json file:
composer require fievel/webspider
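Usage
Extend the WebSpiderAbstract class as needed and implement the following methods:
getDataFromResponse: extracts the data from the response; the default behavior treats the body as plain text;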
protected function getDataFromResponse(ResponseInterface $response)
{
    return (string) $response->getBody();
}
parseData: extracts the information you need from the data; it can initialize a Symfony DomCrawler if needed;
protected function parseData($data)
{
    $this->crawler->addHtmlContent($data);
    $node = $this->crawler->filter('input');
    $value = null;
    if ($node->count() > 0) {
        $value = $node->first()->attr('value');
    }
    return $value;
}
handleException: handles Guzzle exceptions;
protected function handleException(\Exception $e)
{
    return null;
}
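To make the parseData step easier to try in isolation, here is a minimal standalone sketch of the same extraction using Symfony DomCrawler directly. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer; the HTML string and the "token" field name are hypothetical and only serve as illustration.

```php
<?php
// Assumes a Composer install providing symfony/dom-crawler and symfony/css-selector.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical response body, as getDataFromResponse might return it.
$html = '<html><body><form>'
    . '<input type="hidden" name="token" value="abc123">'
    . '</form></body></html>';

$crawler = new Crawler();
$crawler->addHtmlContent($html);

// filter() takes a CSS selector and returns a (possibly empty) node list.
$node = $crawler->filter('input[name="token"]');
$value = $node->count() > 0 ? $node->first()->attr('value') : null;

// A selector that matches nothing yields an empty node list, hence the count() check.
$missing = $crawler->filter('input[name="missing"]');
```

Checking count() before calling attr() matters because reading an attribute from an empty node list throws an exception.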
The only thing left is to launch the spider you created; to do so, you can use the SpiderManager service.
// Requires "use GuzzleHttp\RequestOptions;" at the top of the file.
$manager = $this->container->get('fievel_web_spider.manager.spider');
$manager->setLogger($this->logger);
$response = null;
try {
    $response = $manager->runSpider([
        AppBundle\Spiders\CustomSpider::class, // Spider class you created
        'https:///test-spider', // URL to crawl
        'post', // HTTP method supported by Guzzle
        ['cookies' => true], // Custom config supported by the Guzzle client
        [ // Custom request options supported by the Guzzle client
            RequestOptions::FORM_PARAMS => [
                'full_name' => 'John Doe'
            ]
        ]
    ]);
} catch (\Exception $e) {
    // Handle or log the failure here.
}
Features
Storage can be shared between subsequent spider calls.
$storage = new SpiderStorage();
$storage->add($sharedData);
$response = $manager->runSpider([
    AppBundle\Spiders\CustomSpider::class, // Spider class you created
    'https:///test-spider', // URL to crawl
    'post', // HTTP method supported by Guzzle
    ['cookies' => true], // Custom config supported by the Guzzle client
    [ // Custom request options supported by the Guzzle client
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ],
    $storage // Shared storage
]);
It is even possible to create a queue and leave the whole execution to the manager.
$queue = new SpiderCallQueue();
$queue->enqueue(
    AppBundle\Spiders\FirstPageSpider::class,
    'https:///test-spider',
    'post',
    ['cookies' => true],
    [
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ]
);
$queue->enqueue(
    AppBundle\Spiders\SecondPageSpider::class,
    'https:///test-spider',
    'get',
    ['cookies' => true],
    []
);
$response = $manager->runSpiderQueue($queue);
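The idea behind combining a queue with shared storage can be sketched in plain PHP. This is not the package's implementation: ArrayObject and SplQueue are hypothetical stand-ins for SpiderStorage and SpiderCallQueue, and the closures stand in for spider calls. It only illustrates how calls executed in order can pass data to one another.

```php
<?php
// Hypothetical stand-ins, NOT the package's SpiderStorage / SpiderCallQueue.
$storage = new ArrayObject(); // data shared between all calls
$queue = new SplQueue();      // FIFO queue of pending calls

// Each queued entry is a callable that receives the shared storage.
$queue->enqueue(function (ArrayObject $storage) {
    // First call: pretend a token was scraped from the first page.
    $storage['token'] = 'abc123';
    return 'first page done';
});
$queue->enqueue(function (ArrayObject $storage) {
    // Second call: reuse what the first call stored.
    return 'second page fetched with token ' . $storage['token'];
});

// Run the calls in insertion order, collecting each result.
$results = [];
while (!$queue->isEmpty()) {
    $call = $queue->dequeue();
    $results[] = $call($storage);
}
```

Because the queue is first-in, first-out, the second call can rely on whatever the first call put into the storage.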
Last but not least, SpiderManager handles retries of failed requests through a custom GuzzleMiddleware.
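The package's own middleware is not shown here. As a rough illustration of how retries can be wired into a Guzzle client, the sketch below uses Guzzle's built-in Middleware::retry together with a MockHandler so that it runs without network access. The retry limit, the backoff formula, and the example URL are assumptions chosen for the example, not values taken from the package.

```php
<?php
// Assumes a Composer install providing guzzlehttp/guzzle ^6.0.
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\Psr7\Response;

$maxRetries = 3; // hypothetical limit

// Decide whether an attempt should be retried.
$decider = function ($retries, $request, $response = null, $exception = null) use ($maxRetries) {
    if ($retries >= $maxRetries) {
        return false; // give up after too many attempts
    }
    if ($exception instanceof ConnectException) {
        return true; // connection problems are worth retrying
    }
    // Retry on server errors only.
    return $response !== null && $response->getStatusCode() >= 500;
};

// Delay before each retry, in milliseconds (simple exponential backoff).
$delay = function ($retries) {
    return (int) (100 * pow(2, $retries - 1));
};

// Queue two failing responses followed by a successful one.
$mock = new MockHandler([
    new Response(503),
    new Response(500),
    new Response(200, [], 'ok'),
]);

$stack = HandlerStack::create($mock);
$stack->push(Middleware::retry($decider, $delay));

$client = new Client(['handler' => $stack]);
$response = $client->request('GET', 'http://example.test/');
```

In real use, the MockHandler would be dropped so that HandlerStack::create() picks the default HTTP handler.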