fievel/webspider

0.1.0 2016-07-21 13:16 UTC

This package is not auto-updated.

Last update: 2024-09-28 18:05:26 UTC


README

This repository wraps Guzzle and several Symfony components to provide a simple way to crawl websites.

Requirements

  • PHP >= 5.5
  • Guzzle >= 6.0
  • Doctrine ORM >= 2.2
  • Symfony Components >= 2.7

Installation

Add fievel/webspider to the require dependencies in your composer.json file:

composer require fievel/webspider

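Usage

Extend the WebSpiderAbstract class as needed and implement the following methods.

getDataFromResponse: extracts the data from the response. The default behaviour treats the body as plain text: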

protected function getDataFromResponse(ResponseInterface $response)
{
    return (string) $response->getBody();
}

parseData: extracts the information you need from the data. If required, it can initialise the Symfony DomCrawler:

protected function parseData($data)
{
    $this->crawler->addHtmlContent($data);

    $node = $this->crawler->filter('input');

    $value = null;
    if ($node->count() > 0) {
        $value = $node->first()->attr('value');
    }

    return $value;
}
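
The same DomCrawler calls can be tried outside a spider, which helps when working out what parseData should return. The following is a minimal sketch, assuming Composer autoloading and the symfony/dom-crawler package; the function name collectInputValues and the sample markup are hypothetical and not part of this package. It uses filterXPath rather than filter, so it does not need the CSS selector component.

```php
<?php
// Assumption: dependencies are installed through Composer.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: returns a name => value map of every named <input>.
 */
function collectInputValues($html)
{
    $crawler = new Crawler();
    $crawler->addHtmlContent($html);

    $values = [];
    // Iterating a Crawler yields the underlying DOMElement nodes.
    foreach ($crawler->filterXPath('//input[@name]') as $element) {
        $values[$element->getAttribute('name')] = $element->getAttribute('value');
    }

    return $values;
}

$fields = collectInputValues(
    '<form><input name="token" value="abc"><input name="page" value="2"></form>'
);
```

A parseData implementation could return such a map so that a later spider call can post the collected fields back.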

handleException: handles Guzzle exceptions:

protected function handleException(\Exception $e)
{
    return null;
}
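
Returning null discards every failure. One possible refinement is to distinguish HTTP errors from transport errors before deciding what to return. The sketch below assumes Composer autoloading and Guzzle 6 or later; the helper name describeFailure and the shape of its return value are hypothetical, chosen only for illustration.

```php
<?php
// Assumption: dependencies are installed through Composer.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Exception\RequestException;
use GuzzleHttp\Exception\TransferException;

/**
 * Hypothetical helper that a handleException implementation could call.
 */
function describeFailure(\Exception $e)
{
    // The server answered, but with an error status (4xx or 5xx).
    if ($e instanceof RequestException && $e->hasResponse()) {
        return ['type' => 'http', 'status' => $e->getResponse()->getStatusCode()];
    }

    // Guzzle failed without a response, for example a connection error.
    if ($e instanceof TransferException) {
        return ['type' => 'transport', 'status' => null];
    }

    // Anything else did not originate from Guzzle.
    return ['type' => 'other', 'status' => null];
}
```

Inside a spider, handleException could log this description and still return null, or return a fallback value for specific status codes.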

The only thing left is to launch the spider you created. You can use the SpiderManager service for that.

// Requires: use GuzzleHttp\RequestOptions;
$manager = $this->container->get('fievel_web_spider.manager.spider');
$manager->setLogger($this->logger);

$response = null;
try {
    $response = $manager->runSpider([
        AppBundle\Spiders\CustomSpider::class,  // Spider class you created
        'https:///test-spider',                 // URL to crawl
        'post',                                 // HTTP method supported by Guzzle
        ['cookies' => true],                    // Custom config supported by the Guzzle Client
        [                                       // Custom request options supported by the Guzzle Client
            RequestOptions::FORM_PARAMS => [
                'full_name' => 'John Doe'
            ]
        ]
    ]);
} catch (\Exception $e) {
    // The failure is swallowed here, so $response stays null; log or rethrow as needed.
}

Features

Storage can be shared between subsequent spider calls.

$storage = new SpiderStorage();
$storage->add($sharedData);

$response = $manager->runSpider([
    AppBundle\Spiders\CustomSpider::class,  // Spider class you created
    'https:///test-spider',                 // URL to crawl
    'post',                                 // HTTP method supported by Guzzle
    ['cookies' => true],                    // Custom config supported by the Guzzle Client
    [                                       // Custom request options supported by the Guzzle Client
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ],
    $storage                                // Shared storage
]);

You can even create a queue and leave the whole execution to the manager.

$queue = new SpiderCallQueue();

$queue->enqueue(
    AppBundle\Spiders\FirstPageSpider::class,
    'https:///test-spider',
    'post',
    ['cookies' => true],
    [
        RequestOptions::FORM_PARAMS => [
            'full_name' => 'John Doe'
        ]
    ]
);
$queue->enqueue(
    AppBundle\Spiders\SecondPageSpider::class,
    'https:///test-spider',
    'get',
    ['cookies' => true],
    []
);

$response = $manager->runSpiderQueue($queue);

Last but not least, the SpiderManager handles retries of failed requests through a custom GuzzleMiddleware.
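
The package's own middleware is not shown here. As an illustration of the general mechanism only, the sketch below wires Guzzle's stock Middleware::retry into a handler stack. It assumes Composer autoloading and Guzzle 6 or later; the retry limit, the delay policy and the example.test URL are assumptions, and a MockHandler stands in for the network so the snippet runs offline.

```php
<?php
// Assumption: dependencies are installed through Composer.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\Psr7\Response;

$maxRetries = 3; // Assumed limit, not taken from the package.

// Decides whether a finished attempt should be repeated.
$decider = function ($retries, $request, $response = null, $exception = null) use ($maxRetries) {
    if ($retries >= $maxRetries) {
        return false;
    }
    if ($exception !== null) {
        return true; // Transport-level failure, e.g. connection refused.
    }

    return $response !== null && $response->getStatusCode() >= 500;
};

// Waits 100 ms before the first retry, 200 ms before the second, and so on.
$delay = function ($retries) {
    return 100 * $retries;
};

// Queued fake responses: the first attempt fails, the second succeeds.
$mock = new MockHandler([
    new Response(503),
    new Response(200, [], 'ok'),
]);

$stack = HandlerStack::create($mock);
$stack->push(Middleware::retry($decider, $delay));

$client = new Client(['handler' => $stack]);
$response = $client->request('GET', 'http://example.test/');
```

In this sketch the 503 response triggers one retry and the client then receives the queued 200 response. With a real handler in place of the mock, the same decider and delay callbacks apply unchanged.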

Proxy

Links