restyler/scrapeninja-api-php-client

Web scraping API with proxy rotation, retries, and Chrome TLS fingerprint emulation

v1.0.6 2022-12-19 09:50 UTC

This package is auto-updated.

Last update: 2024-09-19 13:29:20 UTC


README

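This library is a thin Guzzle-based wrapper around the ScrapeNinja Web Scraping API.

What is ScrapeNinja?

A simple, high-performance web scraping API which:

  • Has 2 modes of website rendering:
    • scrape(): fast, emulates the Chrome TLS fingerprint without Puppeteer/Playwright overhead
    • scrapeJs(): a full real Chrome with JavaScript rendering and basic interaction (clicking, filling in forms).
  • Is backed by rotating proxies (geos: US, EU, Brazil, France, Germany; 4g residential proxies are available, and your own proxy can be specified on request).
  • Has smart retries and timeouts that work with default settings
  • Allows extracting arbitrary data from raw HTML without dealing with PHP HTML parsing libraries: just pass an extractor function written in JavaScript, and it will be executed on ScrapeNinja servers. ScrapeNinja uses Cheerio, a jQuery-like library, to extract data from HTML. You can quickly build and test your extractor function in the Live Cheerio Sandbox. For an extractor example that gets pure data from the HackerNews HTML source, see /examples/extractor.php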

ScrapeNinja full API documentation

https://rapidapi.com/restyler/api/scrapeninja

ScrapeNinja Live Sandbox

ScrapeNinja lets you quickly create and test your web scraper in the browser: https://scrapeninja.net/scraper-sandbox

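Use cases

The most common use case for ScrapeNinja is when regular Guzzle/cURL cannot reliably get a response from the scraped website and receives 403 or 5xx errors, even when the headers are identical to those of a real browser.

Another major use case is when you want to avoid Puppeteer setup and maintenance but still need real JavaScript rendering instead of sending raw network requests.

ScrapeNinja helps reduce the amount of code needed to retrieve HTTP responses and to deal with retries, proxy handling, and timeouts.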

Read more about ScrapeNinja

https://pixeljets.com/blog/bypass-cloudflare/
https://scrapeninja.net

Get your free access key here

https://rapidapi.com/restyler/api/scrapeninja

See the /examples folder for examples

Installation

composer require restyler/scrapeninja-api-php-client

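Examples

The /examples folder contains quick-start examples of how ScrapeNinja can be used. To run them in a terminal, obtain an API key and set it as an environment variable: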

export SCRAPENINJA_RAPIDAPI_KEY=YOUR-KEY
php ./examples/extractor.php 

Basic scrape request

use ScrapeNinja\Client;

// the same $client instance is used in all examples below
$client = new Client([
    "rapidapi_key" => getenv('SCRAPENINJA_RAPIDAPI_KEY')
]);

$response = $client->scrape([
  // target website URL
  "url" => "https://news.ycombinator.com/", 
  
  // Proxy geo. eu, br, de, fr, 4g-eu, us proxy locations are available. Default: "us"
  "geo" => "us", 
  
  // Custom headers to pass to target website. Space after ':' is mandatory according to HTTP spec. 
  // User-agent header is not required, it is attached automatically.
  "headers" => ["Some-custom-header: header1-val", "Other-header: header2-val"], 
  
  "method" => "GET" // HTTP method to use. Default: "GET". Allowed: "GET", "POST", "PUT". 
]);

echo '<h2>Basic scrape response:</h2><pre>';

// response contains associative array with response, with 
// 'body'  containing target website response (as a string) and 
// 'info' property containing all the metadata.
echo 'HTTP Response status: ' . $response['info']['statusCode'] . "\n";
echo 'HTTP Response headers: ' . print_r($response['info']['headers'], 1) . "\n";
echo 'HTTP Response body (truncated): ' . mb_substr($response['body'], 0, 300) . '...' . "\n";


/*
    Array
(
    [info] => Array
        (
            [version] => 1.1 (string)
            [statusCode] => 200 (integer)
            [statusMessage] => OK (string)
            [headers] => Array
                (
                    [server] => nginx
                    [date] => Mon, 02 May 2022 04:38:12 GMT
                    [content-type] => text/html; charset=utf-8
                    [content-encoding] => gzip
                )

        )

    [body] => <html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?5eYyZbFhPFukXyt5EaSy">...
)
    */
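
The "headers" option above takes a plain list of "Name: value" strings. If your code already keeps headers in an associative array, a small helper can do the conversion. This is a minimal sketch: formatHeaders() is a hypothetical helper written for illustration and is not part of the ScrapeNinja client.

```php
<?php
// Hypothetical helper (NOT part of the ScrapeNinja client): converts an
// associative array of headers into the "Name: value" strings that the
// "headers" request option expects.
function formatHeaders(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        // Always emit a single space after the colon, as the README requires.
        $lines[] = trim((string) $name) . ': ' . trim((string) $value);
    }
    return $lines;
}

// Request options in the same shape as the basic scrape request above.
$requestOpts = [
    'url' => 'https://news.ycombinator.com/',
    'geo' => 'us',
    'headers' => formatHeaders([
        'Some-custom-header' => 'header1-val',
        'Other-header' => 'header2-val',
    ]),
];

print_r($requestOpts['headers']);
```

The resulting $requestOpts array can then be passed to $client->scrape().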

Get full HTML rendered by a real browser (Puppeteer) in PHP

$response = $client->scrapeJs([
    "url" => "https://news.ycombinator.com/"
]);

Extract data from raw HTML

// javascript extractor function, executed on ScrapeNinja servers 
$extractor = "// define function which accepts body and cheerio as args
    function extract(input, cheerio) {
        // return object with extracted values              
        let $ = cheerio.load(input);
      
        let items = [];
        $('.titleline').map(function() {
                  let infoTr = $(this).closest('tr').next();
                  let commentsLink = infoTr.find('a:contains(comments)');
                items.push([
                    $(this).text(),
                      $('a', this).attr('href'),
                      infoTr.find('.hnuser').text(),
                      parseInt(infoTr.find('.score').text()),
                      infoTr.find('.age').attr('title'),
                      parseInt(commentsLink.text()),
                      'https://news.ycombinator.com/' + commentsLink.attr('href'),
                      new Date()
                ]);
            });
      
      return { items };
    }";

// the extractor function works identically with both scrape() and scrapeJs() ScrapeNinja rendering modes
$response = $client->scrapeJs([
    'url' => 'https://scrapeninja.net/samples/hackernews.html',
    'extractor' => $extractor
]);


echo '<h2>Extractor function test:</h2><pre>';
print_r($response['extractor']);

The response will contain a PHP array with pure data:

(
    [result] => Array
        (
            [items] => Array
                (
                    [0] => Array
                        (
                            [0] => A bug fix in the 8086 microprocessor, revealed in the die's silicon (righto.com)
                            [1] => https://www.righto.com/2022/11/a-bug-fix-in-8086-microprocessor.html
                            [2] => _Microft
                            [3] => 216
                            [4] => 2022-11-26T22:28:40
                            [5] => 66
                            [6] => https://news.ycombinator.com/item?id=33757484
                            [7] => 2022-12-19T09:20:53.875Z
                        )

                    [1] => Array
                        (
                            [0] => Cache invalidation is one of the hardest problems in computer science (surfingcomplexity.blog)
                            [1] => https://surfingcomplexity.blog/2022/11/25/cache-invalidation-really-is-one-of-the-hardest-things-in-computer-science/
                            [2] => azhenley
                            [3] => 126
                            [4] => 2022-11-26T03:43:06
                            [5] => 66
                            [6] => https://news.ycombinator.com/item?id=33749677
                            [7] => 2022-12-19T09:20:53.878Z
                        )

                    [2] => Array
                        (
                            [0] => FCC Bans Authorizations for Devices That Pose National Security Threat (fcc.gov)
                            [1] => https://www.fcc.gov/document/fcc-bans-authorizations-devices-pose-national-security-threat
                            [2] => terramex
                            [3] => 236
                            [4] => 2022-11-26T20:01:49
                            [5] => 196
                            [6] => https://news.ycombinator.com/item?id=33756089
                            [7] => 2022-12-19T09:20:53.881Z
                        )
    ....
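
Because the extractor above pushes each story as a positional array, the indexes 0 to 7 carry no names on the PHP side. One way to make the result easier to work with is to label each row after the call. This is a sketch only: labelItems() and the field names in HN_FIELDS are assumptions chosen to match the order used in the extractor, not something the API returns.

```php
<?php
// Field names are an ASSUMPTION that mirrors the order of values pushed
// by the JavaScript extractor above; rename them to suit your code.
const HN_FIELDS = [
    'title', 'url', 'user', 'score',
    'posted_at', 'comments', 'comments_url', 'scraped_at',
];

// Hypothetical helper: turns $response['extractor'] into a list of
// associative arrays keyed by HN_FIELDS.
function labelItems(array $extractorResponse): array
{
    $rows = $extractorResponse['result']['items'] ?? [];
    $labeled = [];
    foreach ($rows as $row) {
        // Skip rows that do not have exactly one value per field.
        if (is_array($row) && count($row) === count(HN_FIELDS)) {
            $labeled[] = array_combine(HN_FIELDS, $row);
        }
    }
    return $labeled;
}

// Sample data shaped like the first item of the output above.
$sample = ['result' => ['items' => [[
    "A bug fix in the 8086 microprocessor, revealed in the die's silicon (righto.com)",
    'https://www.righto.com/2022/11/a-bug-fix-in-8086-microprocessor.html',
    '_Microft',
    216,
    '2022-11-26T22:28:40',
    66,
    'https://news.ycombinator.com/item?id=33757484',
    '2022-12-19T09:20:53.875Z',
]]]];

$items = labelItems($sample);
echo $items[0]['user'] . ' / ' . $items[0]['score'] . "\n";
```

An alternative is to return objects with named keys directly from the extractor function, which avoids the second mapping step.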

Sending POST requests

ScrapeNinja can perform POST requests.

Sending a JSON POST

$response = $client->scrape([
    "url" => "https://news.ycombinator.com/", 
    "headers" => ["Content-Type: application/json"], 
    "method" => "POST",
    "data" => "{\"fefe\":\"few\"}"
]);

Sending a www-form-urlencoded POST

$response = $client->scrape([
    "url" => "https://news.ycombinator.com/", 
    "headers" => ["Content-Type: application/x-www-form-urlencoded"], 
    "method" => "POST",
    "data" => "key1=val1&key2=val2"
]);
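
Hand-escaped payload strings like the ones above are easy to get wrong. The request bodies can instead be generated with PHP's built-in json_encode() and http_build_query(). The sketch below assumes only the option names shown in this README ("url", "method", "headers", "data"); jsonPostOptions() and formPostOptions() are hypothetical helpers, not part of the client.

```php
<?php
// Hypothetical helper: builds scrape() options for a JSON POST.
function jsonPostOptions(string $url, array $payload): array
{
    return [
        'url' => $url,
        'method' => 'POST',
        'headers' => ['Content-Type: application/json'],
        // JSON_THROW_ON_ERROR (PHP 7.3+) raises JsonException on bad input
        // instead of silently returning false.
        'data' => json_encode($payload, JSON_THROW_ON_ERROR),
    ];
}

// Hypothetical helper: builds scrape() options for a form-encoded POST.
function formPostOptions(string $url, array $fields): array
{
    return [
        'url' => $url,
        'method' => 'POST',
        'headers' => ['Content-Type: application/x-www-form-urlencoded'],
        // Pass '&' explicitly so the separator does not depend on php.ini.
        'data' => http_build_query($fields, '', '&'),
    ];
}

$jsonOpts = jsonPostOptions('https://news.ycombinator.com/', ['fefe' => 'few']);
$formOpts = formPostOptions('https://news.ycombinator.com/', ['key1' => 'val1', 'key2' => 'val2']);

echo $jsonOpts['data'] . "\n";
echo $formOpts['data'] . "\n";
```

Either array can be passed to $client->scrape() in place of the literal arrays shown above.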

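Retries logic

By default, ScrapeNinja retries a request 2 times (so 3 requests in total) when it fails (target website timeout, proxy timeout, captcha requests from certain providers). This behavior can be modified or disabled.

ScrapeNinja can also be instructed to retry on specific HTTP response status codes and on specific text in the response body (useful for custom captchas).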

$response = $client->scrape([
    "url" => "https://news.ycombinator.com/",
    "retryNum" => 1, // 0 to disable retries
    "textNotExpected" => [
        "random-captcha-text-which-might-appear"
    ],
    "statusNotExpected" => [
        403,
        502
    ]
]);
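
The retry rules above are applied on ScrapeNinja servers. If you also want to check a returned response against the same criteria in your own code, for example to decide whether to log it, a local predicate can be sketched as follows. The matching semantics used here (exact status code match, plain substring match on the body) are assumptions for illustration and may differ from what the service does; isUnexpectedResponse() is a hypothetical helper.

```php
<?php
// Illustrative only: a client-side check whose semantics are ASSUMED to
// resemble the "statusNotExpected" and "textNotExpected" options.
function isUnexpectedResponse(array $response, array $statusNotExpected, array $textNotExpected): bool
{
    // Status codes are compared strictly, so pass integers (403, not "403").
    $status = $response['info']['statusCode'] ?? null;
    if (in_array($status, $statusNotExpected, true)) {
        return true;
    }

    $body = (string) ($response['body'] ?? '');
    foreach ($textNotExpected as $needle) {
        // Ignore empty needles; they would match every body.
        if ($needle !== '' && strpos($body, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Sample responses in the shape returned by scrape().
$blocked = ['info' => ['statusCode' => 403], 'body' => 'Forbidden'];
$captcha = ['info' => ['statusCode' => 200], 'body' => '<p>random-captcha-text-which-might-appear</p>'];
$good    = ['info' => ['statusCode' => 200], 'body' => '<html lang="en">...</html>'];

$statuses = [403, 502];
$texts = ['random-captcha-text-which-might-appear'];

var_dump(isUnexpectedResponse($blocked, $statuses, $texts));
var_dump(isUnexpectedResponse($captcha, $statuses, $texts));
var_dump(isUnexpectedResponse($good, $statuses, $texts));
```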

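Error handling

You should definitely wrap scrape() calls in a try-catch handler and log your errors. RapidAPI might go down, ScrapeNinja servers might go down, and the target website might go down.

  • If RapidAPI or ScrapeNinja is down, you will get a Guzzle exception, because Guzzle treats a non-200 response from the ScrapeNinja server as an exceptional situation (this is a good thing). You may also get a 429 error if you exceed your plan limits.
  • If ScrapeNinja fails to get a "good" response after 3 attempts, it might throw a 503 error.

In all these cases, it is useful to get the HTTP response of the failure.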

try {
   $response = $client->scrape($requestOpts);
   
   // you might want to add your custom errors here
   if ($response['info']['statusCode'] != 200) {
     throw new \Exception('your custom exception because you didn\'t expect this from the target website');
   }
} catch (GuzzleHttp\Exception\BadResponseException $e) {
    // BadResponseException is the parent of ClientException (4xx, e.g. 429)
    // and ServerException (5xx, e.g. 503), so both cases land here
    $response = $e->getResponse();
    
    echo 'Status code: ' . $response->getStatusCode() . "\n";
    echo 'Err message: ' . $e->getMessage() . "\n";

} catch (\Exception $e) {
   // your custom error handling logic; this is any other error,
   // including the custom exception thrown above
}

(See the examples/ folder for a complete error handling example)
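
If the same try-catch logic is needed in several places, it can be wrapped once and reused. The following is a minimal sketch under stated assumptions: tryScrape() is a hypothetical helper, and the request itself is passed in as a callable (for example a closure that calls $client->scrape()) so that the wrapper does not depend on the client class and can be exercised with stubs.

```php
<?php
// Hypothetical wrapper (NOT part of the ScrapeNinja client). $scrape is any
// callable that performs the request and returns the response array, e.g.
//   function (array $opts) use ($client) { return $client->scrape($opts); }
function tryScrape(callable $scrape, array $requestOpts): array
{
    try {
        $response = $scrape($requestOpts);
        $status = $response['info']['statusCode'] ?? null;
        if ($status !== 200) {
            // The API call worked, but the target website did not return 200.
            return [
                'ok' => false,
                'error' => 'Target website returned status ' . var_export($status, true),
                'response' => $response,
            ];
        }
        return ['ok' => true, 'error' => null, 'response' => $response];
    } catch (\Throwable $e) {
        // Covers Guzzle exceptions and anything else thrown by the call.
        error_log('Scrape failed: ' . $e->getMessage());
        return ['ok' => false, 'error' => $e->getMessage(), 'response' => null];
    }
}

// Stub callables standing in for real requests.
$okResult = tryScrape(function (array $opts) {
    return ['info' => ['statusCode' => 200], 'body' => '<html></html>'];
}, ['url' => 'https://news.ycombinator.com/']);

$blockedResult = tryScrape(function (array $opts) {
    return ['info' => ['statusCode' => 403], 'body' => 'Forbidden'];
}, ['url' => 'https://news.ycombinator.com/']);

$failedResult = tryScrape(function (array $opts) {
    throw new \RuntimeException('network down');
}, ['url' => 'https://news.ycombinator.com/']);

var_dump($okResult['ok'], $blockedResult['ok'], $failedResult['error']);
```

Returning a result array instead of rethrowing keeps calling code linear; if you prefer exceptions, the catch branch can rethrow after logging.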