maurice2k / multicurl
Object-based asynchronous multi-curl wrapper
1.3.1
2022-03-24 15:10 UTC
Requires
- php: ^7.4 || ~8.0
- ext-curl: ^7.4 || ~8.0
- ext-json: ^7.4 || ~8.0
README
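Maurice\Multicurl provides a simple object-oriented interface to PHP's curl_multi_* functions. It is more than a wrapper: it also provides an event loop, ensures that no more than a specified number of connections run concurrently, and handles timeouts (both connection and total timeouts).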
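In essence, Maurice\Multicurl consists of a Manager that orchestrates multiple Channel-based instances. In theory, a channel can be any connection type that cURL supports; in practice, the current version of Maurice\Multicurl implements only HttpChannel on top of Channel.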
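For context, here is a rough sketch of what capping concurrency looks like when written directly against PHP's curl_multi_* functions, which is the kind of bookkeeping the Manager is described as taking over. It is not the library's implementation. The function name fetchAllLimited, its parameters, and the shape of the returned array are hypothetical and chosen only for illustration.

```php
<?php
// Illustrative sketch only: a hand-rolled loop over curl_multi_* that keeps
// at most $maxConcurrent transfers in flight. fetchAllLimited() is a
// hypothetical helper, not part of Maurice\Multicurl.
function fetchAllLimited(array $urls, int $maxConcurrent, int $connectTimeoutMs = 200, int $timeoutMs = 5000): array
{
    $mh = curl_multi_init();
    $queue = array_values($urls);
    $active = 0;
    $results = [];

    while ($queue !== [] || $active > 0) {
        // Top up the multi handle until the concurrency cap is reached.
        while ($active < $maxConcurrent && $queue !== []) {
            $url = array_shift($queue);
            $ch = curl_init($url);
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER    => true,
                CURLOPT_FOLLOWLOCATION    => true,
                CURLOPT_CONNECTTIMEOUT_MS => $connectTimeoutMs,
                CURLOPT_TIMEOUT_MS        => $timeoutMs,
                CURLOPT_PRIVATE           => $url, // remember the original URL
            ]);
            curl_multi_add_handle($mh, $ch);
            $active++;
        }

        // Drive all active transfers one step.
        curl_multi_exec($mh, $stillRunning);

        // Collect every transfer that has finished (successfully or not).
        while (($done = curl_multi_info_read($mh)) !== false) {
            $ch = $done['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = [
                'errno'     => $done['result'], // CURLE_* code, 0 on success
                'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
                'bytes'     => strlen((string) curl_multi_getcontent($ch)),
            ];
            curl_multi_remove_handle($mh, $ch);
            $active--;
        }

        // Wait for socket activity instead of spinning.
        if ($active > 0 && curl_multi_select($mh, 0.1) === -1) {
            usleep(1000);
        }
    }

    curl_multi_close($mh);
    return $results;
}
```

Even this reduced version has to track the queue, the number of active handles, and per-handle results by hand, and it has no callbacks and no distinction between connection and total timeouts.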
Installation
Install using Composer:
$ composer require maurice2k/multicurl
Compatibility
Maurice\Multicurl requires PHP 7.4 (or later) with the curl extension enabled.
Usage
Basic example
use Maurice\Multicurl\{Manager, Channel, HttpChannel};

$urls = [
    'https://www.google.com/',
    'https://#/',
    'https://www.amazon.com/',
    'https://www.ebay.com/',
    'https://www.example.org/',
    'https://non-existant.this-is-a-dns-error.org/',
    'https://www.netflix.com/',
    'https://www.microsoft.com/',
];

$manager = new Manager(2);  // allow two concurrent connections

// set defaults for all channels that are being instantiated using HttpChannel::create()
HttpChannel::prototype()->setConnectionTimeout(200);  // milliseconds
HttpChannel::prototype()->setTimeout(5000);           // milliseconds
HttpChannel::prototype()->setFollowRedirects(true);
HttpChannel::prototype()->setCookieJarFile('cookies.txt');

foreach ($urls as $url) {
    $chan = HttpChannel::create($url);

    // called when the transfer has completed
    $chan->setOnReadyCallback(function(Channel $channel, array $info, $content) {
        echo "[X] Successfully loaded '" . $channel->getURL() . "' (" . strlen($content) . " bytes, status code " . $info['http_code'] . ")\n";
    });

    // called on a connection timeout or a global (total) timeout
    $chan->setOnTimeoutCallback(function(Channel $channel, int $timeoutType, int $elapsedMS, Manager $manager) {
        echo "[T] " . ($timeoutType == Channel::TIMEOUT_CONNECTION ? "Connection" : "Global") . " timeout after {$elapsedMS} ms for '" . $channel->getURL() . "'\n";
    });

    // called on any other cURL error (e.g. DNS resolution failure)
    $chan->setOnErrorCallback(function(Channel $channel, string $message, $errno, $info) {
        echo "[E] cURL error #{$errno}: '{$message}' for '" . $channel->getURL() . "'\n";
    });

    $manager->addChannel($chan);
}

$manager->run();
The output looks similar to the following:
[X] Successfully loaded 'https://www.google.com/' (47769 bytes, status code 200)
[X] Successfully loaded 'https://#/' (136682 bytes, status code 200)
[X] Successfully loaded 'https://www.ebay.com/' (287403 bytes, status code 200)
[X] Successfully loaded 'https://www.amazon.com/' (102336 bytes, status code 200)
[E] cURL error #6: 'Couldn't resolve host name' for 'https://non-existant.this-is-a-dns-error.org/'
[T] Connection timeout after 200 ms for 'https://www.example.org/'
[X] Successfully loaded 'https://www.microsoft.com/' (183702 bytes, status code 200)
[X] Successfully loaded 'https://www.netflix.com/' (428858 bytes, status code 200)
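Adding new channels dynamically
In this example, we implement a very simple web crawler that starts at Wikipedia's "Web crawler" page and extracts up to five new pages from each page it crawls, until 20 further pages have been added to the manager (whether they are then crawled successfully or fail).
A cleaner approach would be to create an HttpCrawlChannel (extending HttpChannel) that directly implements and overrides HttpChannel's onReady method (as well as onTimeout and onError).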
use Maurice\Multicurl\{Manager, Channel, HttpChannel};

// shared state: how many more pages may still be added to the manager
$counter = new \stdClass();
$counter->links = 20;

$manager = new Manager(2);  // allow two concurrent connections
$manager->setContext($counter);

$chan = new HttpChannel('https://en.wikipedia.org/wiki/Web_crawler');
$chan->setConnectionTimeout(500);  // milliseconds
$chan->setTimeout(5000);           // milliseconds
$chan->setFollowRedirects(true);
$chan->setCookieJarFile('cookies.txt');

$chan->setOnReadyCallback(function(Channel $channel, array $info, $content, Manager $manager) {
    echo "[X] Successfully loaded '" . $channel->getURL() . "' (" . strlen($content) . " bytes)\n";

    if ($manager->getContext()->links > 0) {
        // collect relative /wiki/ links, skipping namespaced pages such as "File:..."
        if (!preg_match_all('#<a[^>]+?href="(/wiki/[^:]+?)"[^>]*?>#', $content, $matches)) {
            return;
        }

        // pick up to five random, distinct links (fewer if the budget is nearly used up)
        $relativeLinks = array_unique($matches[1]);
        shuffle($relativeLinks);
        $relativeLinks = array_slice($relativeLinks, 0, min($manager->getContext()->links, 5));

        foreach ($relativeLinks as $relativeLink) {
            $urlinfo = parse_url($info['url']);
            $newUrl = $urlinfo['scheme'] . '://' . $urlinfo['host'] . $relativeLink;

            // clone the current channel so the new one inherits its settings and callbacks
            $newChan = clone $channel;
            $newChan->setUrl($newUrl);
            $manager->addChannel($newChan);

            $manager->getContext()->links--;
        }
    }
});

$chan->setOnTimeoutCallback(function(Channel $channel, int $timeoutType, int $elapsedMS, Manager $manager) {
    echo "[T] " . ($timeoutType == Channel::TIMEOUT_CONNECTION ? "Connection" : "Global") . " timeout after {$elapsedMS} ms for '" . $channel->getURL() . "'\n";
});

$chan->setOnErrorCallback(function(Channel $channel, string $message, $errno, $info, Manager $manager) {
    echo "[E] cURL error #{$errno}: '{$message}' for '" . $channel->getURL() . "'\n";
});

$manager->addChannel($chan);
$manager->run();
The output looks similar to the following:
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Web_crawler' (175116 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Video_search_engine' (57989 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Selection-based_search' (43810 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Unix' (221806 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Online_search' (37052 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Bing_(search_engine)' (274871 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/FAST_Crawler' (175471 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Google_Videos' (34469 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Speech_recognition' (283155 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Phonemes' (163499 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/CastTV' (32544 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Multisearch' (34057 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Federated_search' (58325 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Natural_language_search_engine' (77872 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Search_engine_optimization' (178195 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/User_space' (65509 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Cloud_services' (311577 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Pax_(Unix)' (59822 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Paging' (142101 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/Nice_(Unix)' (57852 bytes)
[X] Successfully loaded 'https://en.wikipedia.org/wiki/AIX_operating_system' (187416 bytes)
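The link-selection step inside the crawler's onReady callback can also be pulled out into a small pure function, which makes it easy to try without any network access. The sketch below reuses the same regular expression as the example; the function name extractWikiLinks and its parameters are hypothetical. It assumes $baseUrl is an absolute URL, and it leaves out the shuffle() call so that the result is deterministic.

```php
<?php
// Illustrative sketch only: extractWikiLinks() is a hypothetical helper that
// isolates the link-selection logic from the crawler example.
function extractWikiLinks(string $html, string $baseUrl, int $limit): array
{
    // Same pattern as the crawler: relative /wiki/ links without a colon,
    // which skips namespaced pages such as "File:..." or "Special:...".
    if ($limit <= 0 || !preg_match_all('#<a[^>]+?href="(/wiki/[^:]+?)"[^>]*?>#', $html, $matches)) {
        return [];
    }

    // Assumption: $baseUrl is absolute, so scheme and host are present.
    $parts = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // De-duplicate, keep document order, and cap the number of links.
    $paths = array_slice(array_values(array_unique($matches[1])), 0, $limit);

    // Turn each relative path into an absolute URL.
    return array_map(function (string $path) use ($origin): string {
        return $origin . $path;
    }, $paths);
}
```

In the callback, the returned URLs would then be assigned to cloned channels exactly as in the example, with the limit set to min($manager->getContext()->links, 5).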