# restyler / scrapeninja-api-php-client

Web scraping API with rotating proxies, retries, and Chrome TLS fingerprint emulation.

Requires:
- guzzlehttp/guzzle: ^7.5
This library is a thin Guzzle-based wrapper around the ScrapeNinja web scraping API.
## What is ScrapeNinja?

A simple, high-performance web scraping API with the following features:
- Two rendering modes for target websites:
  - `scrape()`: fast; emulates the Chrome TLS fingerprint without the overhead of Puppeteer/Playwright.
  - `scrapeJs()`: a full, real Chrome with JavaScript rendering and basic interactions (clicking, filling in forms).
- Backed by rotating proxies (geos: US, EU, Brazil, France, Germany; 4G residential proxies are available, and you can also plug in your own proxies on request).
- Timeouts with smart retries and sane defaults.
- Lets you extract arbitrary data from raw HTML without touching PHP HTML parsing libraries: just pass an `extractor` function written in JavaScript, and it will be executed on ScrapeNinja servers. ScrapeNinja uses Cheerio, a jQuery-like library, to extract data from HTML, and you can quickly build and test your extractor functions in the live Cheerio sandbox. See /examples/extractor.php for a sample extractor that retrieves plain data from the Hacker News HTML source.
## ScrapeNinja full API documentation

https://rapidapi.com/restyler/api/scrapeninja
## ScrapeNinja live sandbox

ScrapeNinja lets you quickly create and test your web scrapers right in the browser: https://scrapeninja.net/scraper-sandbox
## Use cases

A common use case for ScrapeNinja is when plain Guzzle/cURL cannot reliably retrieve a response from the target website and gets 403 or 5xx errors instead, even with headers identical to a real browser's.

Another major use case is when you want to avoid setting up and maintaining Puppeteer but still need real JavaScript rendering instead of sending raw network requests.

ScrapeNinja helps reduce the amount of code needed to retrieve HTTP responses and to handle retries, proxies, and timeouts.
## Learn more about ScrapeNinja

https://pixeljets.com/blog/bypass-cloudflare/ https://scrapeninja.net

Get your free access key here: https://rapidapi.com/restyler/api/scrapeninja

See the examples in the /examples folder.
## Installation

```shell
composer require restyler/scrapeninja-api-php-client
```
## Examples

The /examples folder contains quick-start examples of how ScrapeNinja can be used. To run them in your terminal, retrieve your API key and set it as an environment variable:

```shell
export SCRAPENINJA_RAPIDAPI_KEY=YOUR-KEY
php ./examples/extractor.php
```
## Basic scrape request

```php
use ScrapeNinja\Client;

$client = new Client([
    "rapidapi_key" => getenv('SCRAPENINJA_RAPIDAPI_KEY')
]);

$response = $client->scrape([
    // Target website URL
    "url" => "https://news.ycombinator.com/",

    // Proxy geo. eu, br, de, fr, 4g-eu, us proxy locations are available. Default: "us"
    "geo" => "us",

    // Custom headers to pass to the target website. The space after ':' is mandatory per the HTTP spec.
    // The User-Agent header is not required; it is attached automatically.
    "headers" => ["Some-custom-header: header1-val", "Other-header: header2-val"],

    // HTTP method to use. Default: "GET". Allowed: "GET", "POST", "PUT".
    "method" => "GET"
]);

echo '<h2>Basic scrape response:</h2><pre>';

// The response is an associative array:
// 'body' contains the target website response (as a string) and
// 'info' contains all the metadata.
echo 'HTTP Response status: ' . $response['info']['statusCode'] . "\n";
echo 'HTTP Response headers: ' . print_r($response['info']['headers'], 1) . "\n";
echo 'HTTP Response body (truncated): ' . mb_substr($response['body'], 0, 300) . '...' . "\n";

/*
Array
(
    [info] => Array
        (
            [version] => 1.1 (string)
            [statusCode] => 200 (integer)
            [statusMessage] => OK (string)
            [headers] => Array
                (
                    [server] => nginx
                    [date] => Mon, 02 May 2022 04:38:12 GMT
                    [content-type] => text/html; charset=utf-8
                    [content-encoding] => gzip
                )
        )
    [body] => <html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?5eYyZbFhPFukXyt5EaSy">...
)
*/
```
## Getting the full HTML rendered by a real browser (Puppeteer) in PHP

```php
$response = $client->scrapeJs([
    "url" => "https://news.ycombinator.com/"
]);
```
## Extracting data from raw HTML

```php
// JavaScript extractor function, executed on ScrapeNinja servers
$extractor = "// define a function which accepts body and cheerio as args
function extract(input, cheerio) {
    // return an object with the extracted values
    let $ = cheerio.load(input);
    let items = [];
    $('.titleline').map(function() {
        let infoTr = $(this).closest('tr').next();
        let commentsLink = infoTr.find('a:contains(comments)');
        items.push([
            $(this).text(),
            $('a', this).attr('href'),
            infoTr.find('.hnuser').text(),
            parseInt(infoTr.find('.score').text()),
            infoTr.find('.age').attr('title'),
            parseInt(commentsLink.text()),
            'https://news.ycombinator.com/' + commentsLink.attr('href'),
            new Date()
        ]);
    });
    return { items };
}";

// The extractor function works identically with both scrape() and scrapeJs() rendering modes
$response = $client->scrapeJs([
    'url' => 'https://scrapeninja.net/samples/hackernews.html',
    'extractor' => $extractor
]);

echo '<h2>Extractor function test:</h2><pre>';
print_r($response['extractor']);
```
The response will contain a PHP array with the plain data:

```
(
    [result] => Array
        (
            [items] => Array
                (
                    [0] => Array
                        (
                            [0] => A bug fix in the 8086 microprocessor, revealed in the die's silicon (righto.com)
                            [1] => https://www.righto.com/2022/11/a-bug-fix-in-8086-microprocessor.html
                            [2] => _Microft
                            [3] => 216
                            [4] => 2022-11-26T22:28:40
                            [5] => 66
                            [6] => https://news.ycombinator.com/item?id=33757484
                            [7] => 2022-12-19T09:20:53.875Z
                        )
                    [1] => Array
                        (
                            [0] => Cache invalidation is one of the hardest problems in computer science (surfingcomplexity.blog)
                            [1] => https://surfingcomplexity.blog/2022/11/25/cache-invalidation-really-is-one-of-the-hardest-things-in-computer-science/
                            [2] => azhenley
                            [3] => 126
                            [4] => 2022-11-26T03:43:06
                            [5] => 66
                            [6] => https://news.ycombinator.com/item?id=33749677
                            [7] => 2022-12-19T09:20:53.878Z
                        )
                    [2] => Array
                        (
                            [0] => FCC Bans Authorizations for Devices That Pose National Security Threat (fcc.gov)
                            [1] => https://www.fcc.gov/document/fcc-bans-authorizations-devices-pose-national-security-threat
                            [2] => terramex
                            [3] => 236
                            [4] => 2022-11-26T20:01:49
                            [5] => 196
                            [6] => https://news.ycombinator.com/item?id=33756089
                            [7] => 2022-12-19T09:20:53.881Z
                        )
                    ....
```
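Since the extractor result is a plain nested PHP array, consuming it requires no extra tooling. Below is a minimal sketch with the data inlined (trimmed to two rows from the output above, dropping the trailing date field) that prints a score and title per item; the numeric indexes follow the order in which the extractor pushes the fields:

```php
<?php
// Trimmed copy of $response['extractor']['result'] from the sample output above.
$result = [
    'items' => [
        [
            "A bug fix in the 8086 microprocessor, revealed in the die's silicon (righto.com)",
            'https://www.righto.com/2022/11/a-bug-fix-in-8086-microprocessor.html',
            '_Microft',
            216,
            '2022-11-26T22:28:40',
            66,
            'https://news.ycombinator.com/item?id=33757484',
        ],
        [
            'Cache invalidation is one of the hardest problems in computer science (surfingcomplexity.blog)',
            'https://surfingcomplexity.blog/2022/11/25/cache-invalidation-really-is-one-of-the-hardest-things-in-computer-science/',
            'azhenley',
            126,
            '2022-11-26T03:43:06',
            66,
            'https://news.ycombinator.com/item?id=33749677',
        ],
    ],
];

// Index meaning, per the extractor's push order:
// 0 = title, 1 = article URL, 2 = user, 3 = score,
// 4 = posted at, 5 = comment count, 6 = comments URL
foreach ($result['items'] as $row) {
    echo $row[3] . ' points - ' . $row[0] . "\n";
}
```

In a real script you would iterate over `$response['extractor']['result']['items']` directly instead of the inlined array.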
## Sending POST requests

ScrapeNinja can perform POST requests.

### Sending a JSON POST

```php
$response = $client->scrape([
    "url" => "https://news.ycombinator.com/",
    "headers" => ["Content-Type: application/json"],
    "method" => "POST",
    "data" => "{\"fefe\":\"few\"}"
]);
```
### Sending a www-encoded POST

```php
$response = $client->scrape([
    "url" => "https://news.ycombinator.com/",
    "headers" => ["Content-Type: application/x-www-form-urlencoded"],
    "method" => "POST",
    "data" => "key1=val1&key2=val2"
]);
```
## Retry logic

By default, ScrapeNinja retries a request 2 times (so, 3 requests in total) in case of failure (target website timeout, proxy timeout, a captcha response from certain providers). This behavior can be modified or disabled.

ScrapeNinja can also be told to retry when it encounters specific HTTP response status codes, or specific text in the response body (useful for custom captchas):

```php
$response = $client->scrape([
    "url" => "https://news.ycombinator.com/",
    "retryNum" => 1, // 0 to disable retries
    "textNotExpected" => [
        "random-captcha-text-which-might-appear"
    ],
    "statusNotExpected" => [
        403, 502
    ]
]);
```
## Error handling

You should definitely wrap your scrape() calls in a try-catch handler and log your errors. RapidAPI might be down, ScrapeNinja servers might be down, and the target website might be down.

- If RapidAPI or ScrapeNinja fail, you will get a Guzzle exception, since Guzzle treats non-200 responses from the ScrapeNinja server as exceptional situations (which is a good thing). You may get a 429 error if you exceed your plan limits.
- If ScrapeNinja fails to retrieve a "good" response after 3 attempts, it may throw a 503 error.

In these cases it is useful to retrieve the failed HTTP response:

```php
try {
    $response = $ninja->scrape($requestOpts);

    // you might want to add your own custom errors here
    if ($response['info']['statusCode'] != 200) {
        throw new \Exception('your custom exception, because you didn\'t expect this from the target website');
    }
} catch (GuzzleHttp\Exception\BadResponseException $e) {
    // BadResponseException covers both ClientException (4xx, e.g. 429)
    // and ServerException (5xx, e.g. 503), so the failed response is available here
    $response = $e->getResponse();
    echo 'Status code: ' . $response->getStatusCode() . "\n";
    echo 'Err message: ' . $e->getMessage() . "\n";
} catch (\Exception $e) {
    // your custom error handling logic; this is a non-Guzzle error
}
```

(See the examples/ folder for a complete error-handling example.)