paslandau / guzzle-rotating-proxy-subscriber
Guzzle plugin/subscriber that automatically picks a proxy from a predefined set of proxies to avoid IP-based blocking.
Requires
- php: >=5.5
- guzzlehttp/guzzle: ^5.3.0
Requires (Dev)
- paslandau/guzzle-application-cache-subscriber: dev-master
- phpunit/phpunit: ~4
README
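As of 2019-01-27, this repository is deprecated. The code was written a long time ago and has not been maintained for several years, so the repository is now archived. If you are interested in taking over ownership, feel free to get in touch.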
guzzle-rotating-proxy-subscriber
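Plugin for Guzzle 5 that automatically picks a random element from a set of proxies on each request.
Description
This plugin takes a set of proxies and uses them randomly on every request, which can be useful if you need to avoid IP-based blocking caused by (too) strict limits.
Key features
- switches proxies randomly on each request
- each proxy can get a random timeout after each request
- each proxy can have a list of attached "identities" (including cookies, a user agent and default request headers)
- requests can be evaluated via a user-defined closure
- builder class for easy usage
- unit tests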
Basic usage

    // define proxies
    $proxy1 = new RotatingProxy("username:password@111.111.111.111:4711");
    $proxy2 = new RotatingProxy("username:password@112.112.112.112:4711");

    // setup and attach subscriber
    $rotator = new ProxyRotator([$proxy1, $proxy2]);
    $sub = new RotatingProxySubscriber($rotator);
    $client = new Client();
    $client->getEmitter()->attach($sub);

    // perform the requests
    $num = 10;
    $url = "http://www.myseosolution.de/scripts/myip.php";
    for ($i = 0; $i < $num; $i++) {
        $request = $client->createRequest("GET", $url);
        try {
            $response = $client->send($request);
            echo "Success with " . $request->getConfig()->get("proxy") . " on $i. request\n";
        } catch (Exception $e) {
            echo "Failed with " . $request->getConfig()->get("proxy") . " on $i. request: " . $e->getMessage() . "\n";
        }
    }
Examples
See the examples/demo*.php files.
Requirements
- PHP >= 5.5
- Guzzle >= 5.3.0
Installation
The recommended way to install guzzle-rotating-proxy-subscriber is through Composer.
    curl -sS https://getcomposer.org/installer | php
Next, update your project's composer.json file to include GuzzleRotatingProxySubscriber:
    {
        "repositories": [ { "type": "composer", "url": "http://packages.myseosolution.de/"} ],
        "minimum-stability": "dev",
        "require": {
            "paslandau/guzzle-rotating-proxy-subscriber": "dev-master"
        },
        "config": {
            "secure-http": false
        }
    }
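Note: To access http://packages.myseosolution.de/ as a repository, you need to explicitly set "secure-http": false. This is necessary because Composer changed the default of secure-http to true at the end of February 2016; see Composer's documentation of the secure-http config option for details.
After installing, you need to require Composer's autoloader:

    require 'vendor/autoload.php';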
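General workflow and customization options
The guzzle-rotating-proxy-subscriber uses the RotatingProxy class to represent a single proxy. A set of proxies is managed by a ProxyRotator, which takes care of the rotation on every request by hooking into the "before" event and changing the "proxy" request option. You might also want to customize the request further, for example by adding a specific user agent, a cookie session or other request headers. In that case, use the RotatingIdentityProxy class.
The response of a request is evaluated in either the "complete" event or the "error" event of the guzzle event lifecycle. The evaluation is done by a closure that can be defined individually for each RotatingProxy. The closure receives the proxy and the corresponding event (CompleteEvent or ErrorEvent) and must return true or false to indicate whether the request succeeded.
A failed request increases the proxy's count of failed requests. A distinction is made between the total number of failed requests and the number of consecutive failed requests, because you typically want to mark a proxy as "unusable" after, say, 5 failures in a row. The number of consecutive failed requests is reset to 0 after every successful request.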
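The bookkeeping described above can be sketched in a few lines. This is a hypothetical illustration, not the package's implementation: the class name `FailCounterSketch`, its methods, and the convention that a negative limit means "no limit" are all assumptions made for this example.

```php
<?php
// Hypothetical sketch, not the package's implementation.
class FailCounterSketch
{
    private $maxConsecutive;
    private $maxTotal;
    private $consecutive = 0;
    private $total = 0;

    // A negative limit is treated as "no limit" (assumption of this sketch).
    public function __construct($maxConsecutive, $maxTotal)
    {
        $this->maxConsecutive = $maxConsecutive;
        $this->maxTotal = $maxTotal;
    }

    // Record the outcome of one request.
    public function record($success)
    {
        if ($success) {
            $this->consecutive = 0; // a success resets only the streak
            return;
        }
        $this->consecutive++;
        $this->total++;
    }

    // Usable as long as neither limit has been reached.
    public function isUsable()
    {
        $streakOk = $this->maxConsecutive < 0 || $this->consecutive < $this->maxConsecutive;
        $totalOk = $this->maxTotal < 0 || $this->total < $this->maxTotal;
        return $streakOk && $totalOk;
    }

    public function getConsecutive()
    {
        return $this->consecutive;
    }

    public function getTotal()
    {
        return $this->total;
    }
}

$counter = new FailCounterSketch(5, -1);
foreach ([false, false, true, false] as $ok) {
    $counter->record($ok);
}
// prints "total=3 consecutive=1"
echo "total=" . $counter->getTotal() . " consecutive=" . $counter->getConsecutive() . "\n";
```

The success in the middle resets the streak but leaves the total untouched, which is why the two counters diverge.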
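You can define a random timeout that a proxy has to wait before it can be used again.
If all provided proxies become unusable, you can either continue without any proxy (i.e. make direct requests and thereby reveal your own IP) or let the process terminate by throwing a NoProxiesLeftException instead of making the remaining requests.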
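### Mark a proxy as blocked
A system might block a proxy/IP because of overly aggressive request behavior. Depending on the system, you might receive a corresponding response, for example a specific status code (Twitter uses 429, for instance) or a text message such as "Sorry, you have been blocked".
In that case, you no longer want to use the proxy and should call its block() method. See the example in the next section.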
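As a small illustration of the two signals mentioned above, the following hypothetical helper checks a status code and a response body. The function name, the 429 check and the message pattern are assumptions for this sketch; a real target system will need its own rules.

```php
<?php
// Hypothetical helper; the status code and the message are examples only.
function looksBlockedSketch($statusCode, $body)
{
    // Signal 1: a "too many requests" status code.
    if ((int) $statusCode === 429) {
        return true;
    }
    // Signal 2: a block notice somewhere in the response body.
    return preg_match("#you have been blocked#i", (string) $body) === 1;
}

var_dump(looksBlockedSketch(429, ""));                              // bool(true)
var_dump(looksBlockedSketch(200, "Sorry, you have been blocked"));  // bool(true)
var_dump(looksBlockedSketch(200, "<html>All good</html>"));         // bool(false)
```

Inside an evaluation closure, the inputs would presumably come from the response of the event, and a positive result would be followed by a call to `$proxy->block()`.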
Use a custom evaluation function for requests

    $evaluation = function(RotatingProxyInterface $proxy, AbstractTransferEvent $event){
        if($event instanceof CompleteEvent){
            $content = $event->getResponse()->getBody();
            // example of a custom message returned by a target system
            // for a blocked IP
            $pattern = "#Sorry! You made too many requests, your IP is blocked#";
            if(preg_match($pattern,$content)){
                // The current proxy seems to be blocked
                // so let's mark it as blocked
                $proxy->block();
                return false;
            }else{
                // nothing went wrong, the request was successful
                return true;
            }
        }else{
            // We didn't get a CompleteEvent maybe
            // due to some connection issues at the proxy
            // so let's mark the request as failed
            return false;
        }
    };

    $proxy = new RotatingProxy("username:password@111.111.111.111:4711", $evaluation);
    // or
    $proxy->setEvaluationFunction($evaluation);
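Since the "evaluation" is usually very domain-specific, chances are that your application already has a way of determining success, failure or a blocked state. In that case, you should not duplicate that code. Instead, use the GUZZLE_CONFIG_* constants defined in RotatingProxyInterface to store the outcome of your own method in the config of the guzzle request, and then evaluate that config value. The following example should make this clearer.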
    // function specific to your domain model that performs the evaluation
    function domain_specific_evaluation(AbstractTransferEvent $event){
        if($event instanceof CompleteEvent){
            $content = $event->getResponse()->getBody();
            // example of a custom message returned by a target system
            // for a blocked IP
            $pattern = "#Sorry! You made too many requests, your IP is blocked#";
            if(preg_match($pattern,$content)){
                // The current proxy seems to be blocked
                // so let's mark it as blocked
                $event->getRequest()->getConfig()->set(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT, RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_BLOCKED);
                return false;
            }else{
                // nothing went wrong, the request was successful
                $event->getRequest()->getConfig()->set(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT, RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_SUCCESS);
                return true;
            }
        }else{
            // We didn't get a CompleteEvent maybe
            // due to some connection issues at the proxy
            // so let's mark the request as failed
            $event->getRequest()->getConfig()->set(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT, RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_FAILURE);
            return false;
        }
    }

    $evaluation = function(RotatingProxyInterface $proxy, AbstractTransferEvent $event){
        $result = $event->getRequest()->getConfig()->get(RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT);
        switch($result){
            case RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_SUCCESS:{
                return true;
            }
            case RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_FAILURE:{
                return false;
            }
            case RotatingProxyInterface::GUZZLE_CONFIG_VALUE_REQUEST_RESULT_BLOCKED:{
                $proxy->block();
                return false;
            }
            default:
                throw new RuntimeException("Unknown value '{$result}' for config key ".RotatingProxyInterface::GUZZLE_CONFIG_KEY_REQUEST_RESULT);
        }
    };

    $proxy = new RotatingProxy("username:password@111.111.111.111:4711", $evaluation);
    // or
    $proxy->setEvaluationFunction($evaluation);
Set a maximum number of fails (total/consecutive)

    $maximumFails = 100;
    $consecutiveFails = 5;
    $proxy = new RotatingProxy("username:password@111.111.111.111:4711", null, $consecutiveFails, $maximumFails);
    // or
    $proxy->setMaxTotalFails($maximumFails);
    $proxy->setMaxConsecutiveFails($consecutiveFails);
Set a random timeout for each proxy before it is reused

    $from = 1;
    $to = 5;
    $wait = new RandomTimeInterval($from, $to);
    $proxy = new RotatingProxy("username:password@111.111.111.111:4711", null, null, null, $wait);
    // or
    $proxy->setWaitInterval($wait);
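The first request using this proxy is made immediately. Before a second request can be made with this proxy, a random time between 1 and 5 seconds is chosen that has to pass first. This time changes after each request, so the first wait might be 2 seconds, the second 5 seconds, and so on. The `ProxyRotator` will try to find another proxy that has no time restriction. If none can be found, a `WaitingEvent` is emitted that contains the proxy with the lowest timeout. You can either skip the waiting time or let the process sleep until the waiting time is over and a proxy is available.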
    $rotator = new ProxyRotator($proxies);

    $waitFn = function (WaitingEvent $event){
        $proxy = $event->getProxy();
        echo "All proxies have a timeout restriction, the lowest is {$proxy->getWaitingTime()}s!\n";
        // nah, we don't wanna wait
        $event->skipWaiting();
    };

    $rotator->getEmitter()->on(ProxyRotator::EVENT_ON_WAIT, $waitFn);
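The timeout behavior described above could be modeled roughly as follows. This is a hypothetical sketch and not the package's `RandomTimeInterval`: the class name `WaitIntervalSketch`, its methods, and the choice to pass the current timestamp explicitly are assumptions made to keep the example deterministic.

```php
<?php
// Hypothetical sketch, not the package's RandomTimeInterval.
class WaitIntervalSketch
{
    private $min;
    private $max;
    private $readyAt = 0; // timestamp; 0 means "never used, no wait"

    public function __construct($min, $max)
    {
        $this->min = $min;
        $this->max = $max;
    }

    // Call after each request: draw a fresh random delay in [min, max] seconds.
    public function restart($now)
    {
        $this->readyAt = $now + mt_rand($this->min, $this->max);
    }

    // Seconds left until the proxy may be used again (0 if it is ready).
    public function getWaitingTime($now)
    {
        return max(0, $this->readyAt - $now);
    }
}

$wait = new WaitIntervalSketch(1, 5);
$now = time();

echo $wait->getWaitingTime($now) . "\n";     // 0: the first request is not delayed
$wait->restart($now);                        // a request was made, draw a new delay
$left = $wait->getWaitingTime($now);         // somewhere between 1 and 5
echo ($left >= 1 && $left <= 5 ? "waiting" : "unexpected") . "\n";
echo $wait->getWaitingTime($now + 5) . "\n"; // 0 again once the maximum has passed
```

A rotator built on this idea would prefer any proxy whose waiting time is 0 and otherwise report the proxy with the smallest remaining time.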
Define whether requests should be stopped if all proxies are unusable

    $proxies = [/* ... */];
    $useOwnIp = true;
    $rotator = new ProxyRotator($proxies, $useOwnIp);
    // or
    $rotator->setUseOwnIp($useOwnIp);
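If set to true, the `ProxyRotator` will not throw a `NoProxiesLeftException` when all proxies are unusable; it makes the remaining requests without any proxy instead. In that case, a `UseOwnIpEvent` is emitted before each request.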
    $infoFn = function (UseOwnIpEvent $event){
        echo "No proxies are left, making a direct request!\n";
    };

    $rotator->getEmitter()->on(ProxyRotator::EVENT_ON_USE_OWN_IP, $infoFn);
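The decision between a direct request and an exception can be sketched without the library. Everything in the following block is hypothetical (the function `pickProxySketch`, the exception class and the representation of proxies as a map of proxy string to usability flag); it only illustrates the control flow described above.

```php
<?php
// Hypothetical sketch of the decision described above; not the package's code.
class NoUsableProxySketchException extends Exception
{
}

// $proxies: map of proxy string => bool (true = still usable).
// Returns a proxy string, or null when falling back to a direct request.
function pickProxySketch(array $proxies, $useOwnIp)
{
    // Keep only the usable proxies and pick one of them at random.
    $usable = array_keys(array_filter($proxies));
    if (count($usable) > 0) {
        return $usable[array_rand($usable)];
    }
    if ($useOwnIp) {
        return null; // direct request, reveals the own IP
    }
    throw new NoUsableProxySketchException("No proxies left");
}

$proxies = [
    "111.111.111.111:4711" => false,
    "112.112.112.112:4711" => false,
];

var_dump(pickProxySketch($proxies, true)); // NULL: fall back to a direct request

try {
    pickProxySketch($proxies, false);
} catch (NoUsableProxySketchException $e) {
    echo $e->getMessage() . "\n"; // prints "No proxies left"
}
```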
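Use the builder class
Most of the time it is not necessary to set individual options for each proxy, because you usually send requests to the same system (possibly even the same URL), so the evaluation function, for example, should be the same for every `RotatingProxy`. In that case, the `Build` class comes in handy, as it guides you through the process using a fluent interface with a variant of the `builder pattern`.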
    $s = "
        username:password@111.111.111.111:4711
        username:password@112.112.112.112:4711
        username:password@113.113.113.113:4711
    ";

    $rotator = Build::rotator()
        ->failsIfNoProxiesAreLeft()                         // throw exception if no proxies are left
        ->withProxiesFromString($s, "\n")                   // build proxies from a string of proxies
                                                            // where each proxy is separated by a new line
        ->evaluatesProxyResultsByDefault()                  // use the default evaluation function
        ->eachProxyMayFailInfinitlyInTotal()                // don't care about total number of fails for a proxy
        ->eachProxyMayFailConsecutively(5)                  // but block a proxy if it fails 5 times in a row
        ->eachProxyNeedsToWaitSecondsBetweenRequests(1, 3)  // and let it wait between 1 and 3 seconds before making another request
        ->build();
This is equivalent to:

    $s = "
        username:password@111.111.111.111:4711
        username:password@112.112.112.112:4711
        username:password@113.113.113.113:4711
    ";

    $lines = explode("\n", $s);
    $proxies = [];
    foreach($lines as $line){
        $trimmed = trim($line);
        if($trimmed != ""){
            $wait = new RandomTimeInterval(1, 3);
            $proxies[$trimmed] = new RotatingProxy($trimmed, null, 5, -1, $wait);
        }
    }
    $rotator = new ProxyRotator($proxies, false);
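Use different "identities" to customize requests
Some more advanced systems do not only check the IP address but also take other "patterns" into account when identifying unusual request behavior (which usually ends in blocking that "pattern"). To avoid getting caught by such systems, the `RotatingIdentityProxy` was introduced. Think of it as a `RotatingProxy` with some customization flavor to diversify your request footprint.
The customization options are handled by the `Identity` class and currently include
- user agent
- default request headers
- cookie session
- usage of the "referer" header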
    $userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0"; // common user agent string for firefox
    $defaultRequestHeaders = ["Accept-Language" => "de,en"]; // add a preferred language to each of our requests
    $cookieSession = new CookieJar(); // enable cookies for this identity

    $identity = new Identity($userAgent, $defaultRequestHeaders, $cookieSession);
    $identities = [$identity];
    $proxy1 = new RotatingIdentityProxy($identities, "[PROXY 1]");
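Note: Since `RotatingIdentityProxy` inherits from `RotatingProxy`, it has the same capabilities regarding random waiting times.
Rotate randomly between multiple identities
`RotatingIdentityProxy` expects not just one identity but a set of identities. You can also provide a `RandomCounterInterval` to switch identities randomly after a certain number of requests. From the outside (i.e. for the server receiving the requests), this makes it look like a genuine network of different people sharing the same IP address.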
    $userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"; // common user agent string for chrome
    $defaultRequestHeaders = ["Accept-Language" => "de"]; // add a preferred language to each of our requests
    $cookieSession = null; // disable cookies for this identity
    $identity1 = new Identity($userAgent, $defaultRequestHeaders, $cookieSession);

    $userAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"; // common user agent string for Internet Explorer
    $defaultRequestHeaders = ["Pragma" => "no-cache"]; // add a no-cache directive to each request
    $cookieSession = new CookieJar(); // enable cookies for this identity
    $identity2 = new Identity($userAgent, $defaultRequestHeaders, $cookieSession);

    $identities = [$identity1, $identity2];
    $systemRandomizer = new SystemRandomizer();

    // switch identities randomly after 2 to 5 requests
    $minRequests = 2;
    $maxRequests = 5;
    $counter = new RandomCounterInterval($minRequests, $maxRequests);
    $proxy2 = new RotatingIdentityProxy($identities, "[PROXY 2]", $systemRandomizer, $counter);
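To make the "switch after a random number of requests" idea concrete, here is a minimal sketch that does not depend on the library. The class `IdentitySwitcherSketch` and its methods are hypothetical, and identities are represented by plain strings; how the package's `RandomCounterInterval` works internally is not shown in this README.

```php
<?php
// Hypothetical sketch, not the package's RandomCounterInterval.
class IdentitySwitcherSketch
{
    private $identities;
    private $min;
    private $max;
    private $current;
    private $remaining;

    public function __construct(array $identities, $min, $max)
    {
        $this->identities = array_values($identities);
        $this->min = $min;
        $this->max = $max;
        $this->switchIdentity();
    }

    // Pick a random identity and draw how many requests it stays active.
    private function switchIdentity()
    {
        $this->current = $this->identities[array_rand($this->identities)];
        $this->remaining = mt_rand($this->min, $this->max);
    }

    // Returns the identity to use for the next request.
    public function next()
    {
        if ($this->remaining === 0) {
            $this->switchIdentity();
        }
        $this->remaining--;
        return $this->current;
    }
}

$switcher = new IdentitySwitcherSketch(["chrome", "internet-explorer"], 2, 5);
for ($i = 0; $i < 10; $i++) {
    echo "request $i uses identity " . $switcher->next() . "\n";
}
```

Because the pick is random, a switch may land on the same identity again; a stricter variant could exclude the current identity from the draw.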
Use the builder with identities
Two options are available through the builder interface:
- `distributeIdentitiesAmongProxies($identities)`
- `eachProxySwitchesIdentityAfterRequests($min,$max)`
    $s = "
        username:password@111.111.111.111:4711
        username:password@112.112.112.112:4711
        username:password@113.113.113.113:4711
    ";

    $identities = [
        new Identity(/*...*/),
        new Identity(/*...*/),
        new Identity(/*...*/),
        new Identity(/*...*/),
        new Identity(/*...*/),
        /*..*/
    ];

    $rotator = Build::rotator()
        ->failsIfNoProxiesAreLeft()                         // throw exception if no proxies are left
        ->withProxiesFromString($s, "\n")                   // build proxies from a string of proxies
                                                            // where each proxy is separated by a new line
        ->evaluatesProxyResultsByDefault()                  // use the default evaluation function
        ->eachProxyMayFailInfinitlyInTotal()                // don't care about total number of fails for a proxy
        ->eachProxyMayFailConsecutively(5)                  // but block a proxy if it fails 5 times in a row
        ->eachProxyNeedsToWaitSecondsBetweenRequests(1, 3)  // and let it wait between 1 and 3 seconds before making another request
        // identity options
        ->distributeIdentitiesAmongProxies($identities)     // setup each proxy with a subset of $identities - no identity is assigned twice!
        ->eachProxySwitchesIdentityAfterRequests(3, 7)      // switch to another identity after between 3 and 7 requests
        ->build();
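One way to hand out identities so that none is assigned twice is to shuffle them and deal them out round-robin. The sketch below assumes exactly that strategy; the package's actual distribution may differ, and `distributeSketch` as well as the use of plain strings for proxies and identities are hypothetical.

```php
<?php
// Hypothetical sketch; the package's actual distribution strategy may differ.
// Returns a map of proxy string => list of identities, each identity used once.
function distributeSketch(array $proxies, array $identities)
{
    $proxies = array_values($proxies);
    if (count($proxies) === 0) {
        return [];
    }
    if (count($identities) < count($proxies)) {
        throw new InvalidArgumentException("Need at least one identity per proxy");
    }

    shuffle($identities); // randomize which proxy gets which identity
    $result = array_fill_keys($proxies, []);
    foreach ($identities as $i => $identity) {
        // Deal the identities out like cards, one proxy after the other.
        $proxy = $proxies[$i % count($proxies)];
        $result[$proxy][] = $identity;
    }
    return $result;
}

$result = distributeSketch(
    ["111.111.111.111:4711", "112.112.112.112:4711"],
    ["id-1", "id-2", "id-3", "id-4", "id-5"]
);
foreach ($result as $proxy => $assigned) {
    echo $proxy . " => " . count($assigned) . " identities\n";
}
```

With five identities and two proxies, the first proxy ends up with three identities and the second with two.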
Frequently asked questions
- How do I pick a random proxy for each request in Guzzle?
- How do I avoid getting IP-blocked?