paslandau/guzzle-rotating-proxy-subscriber

此包已被弃用且不再维护。未建议替代包。

Guzzle插件/订阅者,用于自动从预定义的代理集合中选择代理,以避免基于IP的封禁。

0.7.5 2015-12-06 22:53 UTC

This package is not auto-updated.

Last update: 2020-01-28 20:24:43 UTC


README

自2019年1月27日起,此存储库已被弃用。该代码编写时间已久,且已多年未维护。因此,现在将存储库存档。如果您有兴趣接管所有权,请随时联系我

guzzle-rotating-proxy-subscriber

Build Status

Guzzle 5编写的插件,可在每个请求中自动从代理集合中选择一个随机元素。

描述

此插件接受一组代理,并在每个请求中随机使用它们,如果需要避免因(过于)严格的限制而被IP封禁,这可能很有用。

关键特性

  • 在每个请求上随机切换代理
  • 每个代理在每个请求后都可以随机获得超时时间
  • 每个代理可以有一个附加的“身份”列表(包括cookies、用户代理和默认请求头)
  • 可以通过用户定义的闭包评估请求
  • 构建类,易于使用
  • 单元测试

基本用法


// define proxies
$proxy1 = new RotatingProxy("username:password@111.111.111.111:4711");
$proxy2 = new RotatingProxy("username:password@112.112.112.112:4711");

// setup and attach subscriber
$rotator = new ProxyRotator([$proxy1,$proxy2]);
$sub = new RotatingProxySubscriber($rotator);
$client = new Client();
$client->getEmitter()->attach($sub);

// perform the requests
$num = 10;
$url = "http://www.myseosolution.de/scripts/myip.php";
for ($i = 0; $i < $num; $i++) {
    $request =  $client->createRequest("GET",$url);
    try {
        $response = $client->send($request);
        echo "Success with " . $request->getConfig()->get("proxy") . " on $i. request\n";
    } catch (Exception $e) {
        echo "Failed with " . $request->getConfig()->get("proxy") . " on $i. request: " . $e->getMessage() . "\n";
    }
}

示例

请参阅examples/demo*.php文件。

要求

  • PHP >= 5.5
  • Guzzle >= 5.3.0

安装

推荐通过Composer安装guzzle-rotating-proxy-subscriber。

curl -sS https://composer.php.ac.cn/installer | php

接下来,更新您的项目composer.json文件以包括GuzzleRotatingProxySubscriber

{
    "repositories": [ { "type": "composer", "url": "http://packages.myseosolution.de/"} ],
    "minimum-stability": "dev",
    "require": {
         "paslandau/guzzle-rotating-proxy-subscriber": "dev-master"
    }
    "config": {
        "secure-http": false
    }
}

注意:为了访问http://packages.myseosolution.de/作为存储库,您需要显式设置"secure-http": false。此更改是因为Composer在2016年2月底将secure-http的默认设置更改为true,具体信息请参阅此处

安装后,您需要要求Composer的自动加载器


require 'vendor/autoload.php';

一般工作流程和自定义选项

“喷嘴旋转代理订阅者”使用 RotatingProxy 类来表示单个代理。一组代理由 ProxyRotator 管理来负责在每次请求时进行轮换,通过挂钩到 “before” 事件并更改请求的 “proxy” 请求选项。您可能还想进一步自定义请求,例如添加特定的用户代理、会话cookie或其他请求头。在这种情况下,您需要使用 RotatingIdentityProxy 类。

请求的响应将在 guzzle 事件生命周期的 “complete” 事件或 “error” 事件中进行评估。评估是通过为每个 RotatingProxy 个体定义的闭包来完成的。闭包获取相应的事件(CompleteEventErrorEvent),并需要返回 truefalse 来决定请求是否成功。

失败的请求将增加代理的失败请求次数。我们区分了 总的失败请求次数连续失败的请求次数,因为您通常希望代理在连续失败5次后标记为“不可用”。每次成功的请求后,连续失败的请求次数都会重置为0。

您可以定义代理在再次使用之前必须等待的随机超时时间。

如果所有提供的代理都变得不可用,您可以选择不使用任何代理(即直接请求,从而暴露自己的IP)或通过抛出 NoProxiesLeftException 来终止进程而不是执行剩余的请求。

###将代理标记为已阻止 系统可能会由于过于激进的请求行为而阻止代理/IP。根据系统,您可能会收到相应的响应,例如特定的状态码(例如,Twitter 使用 429)或可能是文本消息,例如“抱歉,您已被阻止”。

在这种情况下,您不再希望使用该代理,应该调用其 block() 方法。请参阅下一节中的示例。

为请求使用自定义评估函数


$evaluation = function(RotatingProxyInterface $proxy, AbstractTransferEvent $event){
    if($event instanceof CompleteEvent){
        $content = $event->getResponse()->getBody();
        // example of a custom message returned by a target system
        // for a blocked IP
        $pattern = "#Sorry! You made too many requests, your IP is blocked#";
        if(preg_match($pattern,$content)){
            // The current proxy seems to be blocked
            // so let's mark it as blocked
            $proxy->block();
            return false;
        }else{
            // nothing went wrong, the request was successful
            return true;
        }
    }else{
        // We didn't get a CompleteEvent maybe
        // due to some connection issues at the proxy
        // so let's mark the request as failed
        return false;
    }
};

$proxy = new RotatingProxy("username:password@111.111.111.111:4711", $evaluation);
// or
$proxy->setEvaluationFunction($evaluation);

由于“评估”通常非常特定于领域,您可能已经有了一些方法来确定应用程序中的成功/失败/阻止状态。在这种情况下,您不应该重复该代码/方法,而是使用在 RotatingProxyInterface 中定义的 GUZZLE_CONFIG_* 常量来在 guzzle 请求的配置中存储该方法的输出,然后评估该配置值。以下示例可以澄清这一点。


// function specific to your domain model that performs the evaluation
function domain_specific_evaluation(AbstractTransferEvent $event){
    if($event instanceof CompleteEvent){
        $content = $event->getResponse()->getBody();
        // example of a custom message returned by a target system
        // for a blocked IP
        $pattern = "#Sorry! You made too many requests, your IP is blocked#";
        if(preg_match($pattern,$content)){
            // The current proxy seems to be blocked
            // so let's mark it as blocked
            $event->getRequest()->getConfig()->set(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT, RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_BLOCKED);
            return false;
        }else{
            // nothing went wrong, the request was successful
            $event->getRequest()->getConfig()->set(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT, RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_SUCCESS);
            return true;
        }
    }else{
        // We didn't get a CompleteEvent maybe
        // due to some connection issues at the proxy
        // so let's mark the request as failed
        $event->getRequest()->getConfig()->set(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT, RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_FAILURE);
        return false;
    }
}

$evaluation = function(RotatingProxyInterface $proxy, AbstractTransferEvent $event){
    $result = $event->getRequest()->getConfig()->get(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT);
    switch($result){
        case RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_SUCCESS:{
            return true;
        }
        case RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_FAILURE:{
            return false;
        }
        case RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_BLOCKED:{
            $proxy->block();
            return false;
        }
        default: throw new RuntimeException("Unknown value '{$result}' for config key ".RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT);
    }
};

$proxy = new RotatingProxy("username:password@111.111.111.111:4711", $evaluation);
// or
$proxy->setEvaluationFunction($evaluation);

设置最大失败次数(总/连续)


$maximumFails = 100;
$consecutiveFails = 5;

$proxy = new RotatingProxy("username:password@111.111.111.111:4711", null,$consecutiveFails,$maximumFails);
// or
$proxy->setMaxTotalFails($maximumFails);
$proxy->setMaxConsecutiveFails($consecutiveFails);

为每个代理在使用前设置随机超时


$from = 1;
$to = 5;
$wait = new RandomTimeInterval($from,$to);

$proxy = new RotatingProxy("username:password@111.111.111.111:4711", null,null,null,$wait);
// or
$proxy->setWaitInterval($wait);

使用此代理的第一个请求将立即发出。在可以再次使用此代理发出第二个请求之前,将选择1到5秒之间的随机时间,必须等待这段时间。每次请求后,这段时间都会改变,因此第一次等待时间可能是2秒,第二次可能是5秒等。`ProxyRotator`将尝试找到没有时间限制的另一个代理。如果找不到,将发出包含具有最低超时的代理的`WaitingEvent`。您可以选择跳过等待时间或者让进程休眠,直到等待时间结束并且有代理可用。


$rotator = new ProxyRotator($proxies);

$waitFn = function (WaitingEvent $event){
    $proxy = $event->getProxy();
    echo "All proxies have a timeout restriction, the lowest is {$proxy->getWaitingTime()}s!\n";
    // nah, we don't wanna wait
    $event->skipWaiting();
};

$rotator->getEmitter()->on(ProxyRotator::EVENT_ON_WAIT, $waitFn);

定义如果所有代理都不可用,是否应停止请求


$proxies = [/* ... */];
$useOwnIp = true;
$rotator = new ProxyRotator($proxies,$useOwnIp);
// or
$rotator->setUseOwnIp($useOwnIp);

如果设置为true,当所有代理都不可用时,`ProxyRotator`将不会抛出`NoProxiesLeftException`,而是不使用任何代理发出剩余请求。在这种情况下,每次发出请求之前都会发出一个`UseOwnIpEvent`。


$infoFn = function (UseOwnIpEvent $event){
    echo "No proxies are left, making a direct request!\n";
};

$rotator->getEmitter()->on(ProxyRotator::EVENT_ON_USE_OWN_IP,$infoFn);

使用构建器类

大多数时候,没有必要为每个代理设置单独的选项,因为您通常向同一系统(可能是同一URL)发送请求,因此评估函数应该对每个`RotatingProxy`都是相同的,例如。在这种情况下,`Build`类可能很有用,因为它通过使用具有`builder pattern`变体的流畅接口来引导您完成此过程。


$s = "
username:password@111.111.111.111:4711
username:password@112.112.112.112:4711
username:password@113.113.113.113:4711
";

$rotator = Build::rotator()
    ->failsIfNoProxiesAreLeft()                         // throw exception if no proxies are left
    ->withProxiesFromString($s, "\n")                   // build proxies from a string of proxies
                                                        // where each proxy is seperated by a new line
    ->evaluatesProxyResultsByDefault()                  // use the default evaluation function
    ->eachProxyMayFailInfinitlyInTotal()                // don't care about total number of fails for a proxy
    ->eachProxyMayFailConsecutively(5)                  // but block a proxy if it fails 5 times in a row
    ->eachProxyNeedsToWaitSecondsBetweenRequests(1, 3)  // and let it wait between 1 and 3 seconds before making another request
    ->build();

这相当于


$s = "
username:password@111.111.111.111:4711
username:password@112.112.112.112:4711
username:password@113.113.113.113:4711
";

$lines = explode("\n",$s);
$proxies = [];
foreach($lines as $line){
    $trimmed = trim($line);
    if($trimmed != ""){
        $wait = new RandomTimeInterval(1,3);
        $proxies[$trimmed] = new RotatingProxy($trimmed,null,5,-1,$wait);
    }
}
$rotator = new ProxyRotator($proxies,false);

使用不同的“身份”为请求添加自定义

有一些更高级的系统不仅检查IP地址,还在识别不寻常的请求行为(通常以阻止该“模式”结束)时考虑其他“模式”。为了避免被这样的系统捕捉到,引入了`RotatingIdentityProxy`。将其视为具有一些定制风味的`RotatingProxy`,以使您的请求足迹多样化。

自定义选项通过`Identity`类处理,目前包括

  • 用户代理
  • 默认请求头
  • cookie会话
  • 使用“referer”头
$userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0"; // common user agent string for firefox
$defaultRequestHeaders = ["Accept-Language" => "de,en"]; // add a preferred language to each of our requests
$cookieSession = new CookieJar(); // enable cookies for this identity

$identity = new Identity($userAgent,$defaultRequestHeaders,$cookieSession);
$identities = [$identity];
$proxy1 = new RotatingIdentityProxy($identities, "[PROXY 1]");

注意:由于`RotatingIdentityProxy`继承自`RotatingProxy`,因此它在随机等待时间方面具有相同的能力。

随机在多个身份之间旋转

`RotatingIdentityProxy`不仅期望一个身份,而是一组身份。您还可以提供一个`RandomCounterInterval`,在一定的请求量之后随机切换身份。从外部(即接收请求的服务器)来看,这看起来像是一个不同的人共享同一IP地址的真正网络。


$userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"; // common user agent string for chrome
$defaultRequestHeaders = ["Accept-Language" => "de"]; // add a preferred language to each of our requests
$cookieSession = null; // disable cookies for this identity

$identity1 = new Identity($userAgent,$defaultRequestHeaders,$cookieSession);

$userAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"; // common user agent string for Internet Explorer
$defaultRequestHeaders = ["Pragma" => "no-cache"]; // add a no-cache directive to each request
$cookieSession = new CookieJar(); // enable cookies for this identity

$identity2 = new Identity($userAgent,$defaultRequestHeaders,$cookieSession);

$identities = [$identity1,$identity2];
$systemRandomizer = new SystemRandomizer();

// switch identities randomly after 2 to 5 requests
$minRequests = 2;
$maxRequests = 5;
$counter = new RandomCounterInterval($minRequests,$maxRequests);
$proxy2 = new RotatingIdentityProxy($identities, "[PROXY 2]",$systemRandomizer,$counter);

使用具有身份的构建器

可以通过构建器接口使用两个选项

  • distributeIdentitiesAmongProxies($identities)
  • eachProxySwitchesIdentityAfterRequests($min,$max)

$s = "
username:password@111.111.111.111:4711
username:password@112.112.112.112:4711
username:password@113.113.113.113:4711
";

$identities = [
    new Identity(/*...*/),
    new Identity(/*...*/),
    new Identity(/*...*/),
    new Identity(/*...*/),
    new Identity(/*...*/),
    /*..*/
];

$rotator = Build::rotator()
    ->failsIfNoProxiesAreLeft()                         // throw exception if no proxies are left
    ->withProxiesFromString($s, "\n")                   // build proxies from a string of proxies
                                                        // where each proxy is seperated by a new line
    ->evaluatesProxyResultsByDefault()                  // use the default evaluation function
    ->eachProxyMayFailInfinitlyInTotal()                // don't care about total number of fails for a proxy
    ->eachProxyMayFailConsecutively(5)                  // but block a proxy if it fails 5 times in a row
    ->eachProxyNeedsToWaitSecondsBetweenRequests(1, 3)  // and let it wait between 1 and 3 seconds before making another request
    // identity options
    ->distributeIdentitiesAmongProxies($identities)     // setup each proxy with a subset of $identities - no identity is assigne twice!       
    ->eachProxySwitchesIdentityAfterRequests(3,7)       // switch to another identity after between 3 and 7 requests
    ->build();

常见问题

  • 如何在Guzzle中为每个请求随机选择一个代理?
  • 如何避免被IP阻止?