hindmost / rolling-curl-mini
The latest version of this package (1.0.6) provides no license information.
1.0.6
2015-01-21 19:46 UTC
Requires
- php: >=5.0.0
This package is not auto-updated.
Last update: 2024-09-21 18:24:49 UTC
README
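Rolling Curl Mini is a fork of Rolling Curl. It processes multiple HTTP requests in parallel using the PHP cURL library.
For more information, read this article (in Russian).
Basic usage example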
    ...
    require "RollingCurlMini.php";
    ...
    $o_mc = new RollingCurlMini(10);
    ...
    $o_mc->add($url, $postdata, $callback, $userdata, $options, $headers);
    ...
    $o_mc->execute();
    ...
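To illustrate what "parallel processing with cURL" means in practice, here is a minimal sketch of a rolling request window built directly on PHP's native `curl_multi_*` functions. It is not the source code of RollingCurlMini: the function name `rolling_fetch` and its parameters are hypothetical, and the three-argument callback is a simplification of the one described in this README.

```php
<?php
/**
 * Fetch URLs in parallel, keeping at most $window transfers active at once.
 * Hypothetical helper for illustration only; not part of RollingCurlMini.
 *
 * @param string[] $urls     URLs to request
 * @param int      $window   max. number of simultaneous transfers
 * @param callable $callback function ($content, $url, $info)
 */
function rolling_fetch(array $urls, $window, $callback)
{
    $mh = curl_multi_init();
    $queue = array_values($urls);
    $active = 0; // number of handles currently attached to $mh

    // Attach the next queued URL to the multi handle, if there is one
    $addNext = function () use (&$queue, &$active, $mh) {
        if (!$queue) {
            return;
        }
        $ch = curl_init(array_shift($queue));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $active++;
    };

    // Fill the initial window
    for ($i = 0; $i < $window; $i++) {
        $addNext();
    }

    while ($active > 0) {
        curl_multi_exec($mh, $running);
        $done = curl_multi_info_read($mh);
        if ($done === false) {
            // Nothing has finished yet: wait for activity instead of busy-looping
            if (curl_multi_select($mh, 1.0) === -1) {
                usleep(1000);
            }
            continue;
        }
        // One transfer has finished: hand its result to the callback
        $ch = $done['handle'];
        $info = curl_getinfo($ch);
        call_user_func($callback, curl_multi_getcontent($ch), $info['url'], $info);
        curl_multi_remove_handle($mh, $ch);
        $active--;
        $addNext(); // start the next queued request to keep the window full
    }
    curl_multi_close($mh);
}
```

Calling `rolling_fetch($urls, 10, $callback)` would keep up to 10 transfers in flight. That is presumably similar to what the constructor argument in `new RollingCurlMini(10)` controls, although the README does not state this explicitly.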
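Callbacks
Each request can have its own callback: a function or method that is called when the request completes. A callback accepts 4 arguments and may look like this: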
    /**
     * @param string $content - content of request response
     * @param string $url - URL of requested resource
     * @param array $info - cURL handle info
     * @param mixed $userdata - user-defined data passed with add() method
     */
    function request_callback($content, $url, $info, $userdata) {
    }
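As a sketch of a callback with a non-empty body, the class below sorts responses by HTTP status. The class name `ResponseCollector` and the use of `$userdata` as a string label are hypothetical. The example also assumes that the callback argument accepts any standard PHP callable such as `array($object, 'method')`, which the README implies ("function/method") but does not spell out. The callback is invoked by hand here so that the snippet runs without the library or network access.

```php
<?php
/**
 * Hypothetical collector whose method has the 4-argument callback signature.
 */
class ResponseCollector
{
    public $ok = array();     // label => response length in bytes
    public $failed = array(); // label => HTTP status code

    public function onResponse($content, $url, $info, $userdata)
    {
        // 'http_code' is a standard key of the array returned by curl_getinfo()
        $code = isset($info['http_code']) ? (int)$info['http_code'] : 0;
        if ($code >= 200 && $code < 300) {
            $this->ok[$userdata] = strlen($content);
        } else {
            $this->failed[$userdata] = $code;
        }
    }
}

$collector = new ResponseCollector();
$callback = array($collector, 'onResponse');

// Simulated invocations; a real run would receive these values from cURL
call_user_func($callback, '<html>hello</html>', 'http://example.com/', array('http_code' => 200), 'home');
call_user_func($callback, '', 'http://example.com/missing', array('http_code' => 404), 'missing');
```

Keeping results on an object rather than in `$userdata` avoids relying on the user data being passed by reference.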
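License
Rolling Scraper Abstract
Rolling Scraper Abstract is a multipurpose crawling (scraping) framework built on multi-cURL and the RollingCurlMini class. It is a base PHP class that implements the common functionality of a multi-cURL scraper; anything specific to a particular site belongs in a derived class. A concrete scraper class should extend RollingScraperAbstract and implement (override) two mandatory methods: _initPages and _handlePage.
For more information, read this article (in Russian).
Basic usage example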
    class MyScraper extends RollingScraperAbstract {
        ...
        public function __construct() {
            ...
            $this->modConfig(array(
                'state_time_storage' => '...', // temporal section of state storage (file path)
                'state_data_storage' => '...', // data section of state storage (file path)
                'scrape_life' => 0, // expiration time (secs) of scraped data
                'run_timeout' => 30, // max. time (secs) to execute scraper script
                'run_pages_loops' => 20, // max. number of loops through pages
                'run_pages_buffer' => 500, // page requests buffer size
                'curl_threads' => 10, // number of multi-curl threads
                'curl_options' => array(...), // CURL options used in multi-curl requests
            ));
            parent::__construct();
        }

        /**
         * Initialize the starting list of page requests
         */
        protected function _initPages() {
            ...
            // add page request. $url - page URL
            $this->addPage($url);
            ...
        }

        /**
         * Process response of a page request
         * @param string $cont - page content
         * @param string $url - url of request
         * @param array $aInfo - CURL info data
         * @param int $index - # of page request
         * @param array $aData - custom request data (part of request data)
         * @return bool
         */
        protected function _handlePage($cont, $url, $aInfo, $index, $aData) {
            ...
        }
        ...
    }

    $scraper = new MyScraper();
    $bool = $scraper->run();
    list($time_start, $time_end, , $time_run_start, , $n_pages_total, $n_pages_passed) = $scraper->getStateProgress();
    if ($time_end) {
        echo sprintf('Completed at %s', date('Y.m.d, H:i:s', $time_end));
    } else {
        if ($bool) {
            echo sprintf('In progress: %d/%d pages', $n_pages_passed, $n_pages_total);
        } else {
            echo 'Cancelled since another script instance is still running';
        }
    }
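The body of `_handlePage` is elided above. As one possible illustration of the page-processing step, the sketch below extracts absolute links from page content with PHP's DOM extension. The helper name `extract_links` is hypothetical and is not part of RollingScraperAbstract.

```php
<?php
/**
 * Hypothetical helper: collect unique absolute http(s) links from page content.
 *
 * @param string $cont - page content (HTML)
 * @return string[] list of link URLs in order of first appearance
 */
function extract_links($cont)
{
    $links = array();
    if ($cont === '') {
        return $links; // DOMDocument::loadHTML() rejects empty input
    }
    // Collect parser warnings internally instead of emitting them on malformed HTML
    $prev = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($cont);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if (preg_match('~^https?://~i', $href)) {
            $links[$href] = true; // array keys deduplicate repeated links
        }
    }
    return array_keys($links);
}
```

Inside `_handlePage`, each returned link could then be queued with `$this->addPage($link)`. Note that the README only shows `addPage` being called from `_initPages`, so calling it during page handling is an assumption.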