vantoozz / proxy-scraper
Free proxy scraper
Requires
- php: ^7.3|^8.0
- psr/http-client: ~1
- psr/http-client-implementation: ~1
- psr/http-factory: ~1
- psr/http-message-implementation: ~1
- symfony/css-selector: ~5
- symfony/dom-crawler: >=5.1.4
Requires (Dev)
Suggests
- ext-json: *
- ext-simplexml: *
- guzzlehttp/guzzle: to use Guzzle as HTTP client
- guzzlehttp/psr7: to use Guzzle as HTTP client
- hanneskod/classtools: to enable scrapers auto-discovery
- php-http/message: to use Psr18HttpClient (deprecated)
- php-http/message-factory: to use Psr18HttpClient (deprecated)
Conflicts
- nikic/php-parser: <3
This package is auto-updated.
Last update: 2024-09-06 23:51:50 UTC
README
A library for scraping free proxy lists, written in PHP.
Quick start
composer require vantoozz/proxy-scraper:~3 guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools
<?php declare(strict_types=1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
}
Older versions
This is version 3 of the library. For version 2, see the v2 branch; for version 1, see the v1 branch.
Upgrade
Setup
The library requires a PSR-18 compatible HTTP client. To use the library, you must install one, for example:
composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7
All available clients are listed on Packagist: https://packagist.org.cn/providers/psr/http-client-implementation.
Then install the proxy-scraper library itself:
composer require vantoozz/proxy-scraper:~3
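Usage
Auto-configuration
The simplest way to start using the library is the proxyScraper() function, which instantiates and configures all the scrapers.
Note that, in addition to guzzlehttp/guzzle:~7 and guzzlehttp/psr7, the auto-configuration function requires the hanneskod/classtools dependency.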
composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools
<?php declare(strict_types=1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
}
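Because get() hands back proxies one at a time (the scraper interface shown later in this README declares a Generator return type), it may be useful to stop after the first few results instead of draining every source. The following is a minimal sketch of that idea, not part of the library: takeFirst() and fakeProxies() are hypothetical helpers, and fakeProxies() merely stands in for proxyScraper()->get() so the snippet runs without any dependencies.

```php
<?php declare(strict_types=1);

/**
 * Yields at most $limit items from any iterable, then stops pulling from it.
 * Hypothetical helper, not provided by the library.
 */
function takeFirst(iterable $items, int $limit): Generator
{
    if ($limit <= 0) {
        return;
    }

    $taken = 0;
    foreach ($items as $item) {
        yield $item;
        if (++$taken >= $limit) {
            return;
        }
    }
}

// Stand-in for proxyScraper()->get(); the addresses are documentation-range placeholders.
function fakeProxies(): Generator
{
    yield '192.0.2.1:8080';
    yield '192.0.2.2:3128';
    yield '192.0.2.3:1080';
}

foreach (takeFirst(fakeProxies(), 2) as $proxy) {
    echo $proxy . "\n";
}
```

With the library installed, the same helper could presumably wrap the real generator, for example takeFirst(proxyScraper()->get(), 10).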
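HTTP client
If you are not using auto-configuration, you will need an HTTP client.
The library provides the guzzleHttpClient() function, which creates and configures a client.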
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use function Vantoozz\ProxyScraper\guzzleHttpClient;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = guzzleHttpClient();
$scraper = proxyScraper($httpClient);

try {
    echo $scraper->get()->current()->getIpv4() . "\n";
} catch (ScraperException $e) {
    echo $e->getMessage() . "\n";
}
You can create your own HTTP client by implementing HttpClientInterface:
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\HttpClient\HttpClientInterface;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = new class implements HttpClientInterface {
    /**
     * @param string $uri
     * @return string
     */
    public function get(string $uri): string
    {
        return "some string";
    }
};

$scraper = proxyScraper($httpClient);

try {
    echo $scraper->get()->current()->getIpv4() . "\n";
} catch (ScraperException $e) {
    echo $e->getMessage() . "\n";
}
Of course, you can also configure the scrapers and the underlying HTTP client manually.
Single scraper
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Scrapers;
use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());

foreach ($scraper->get() as $proxy) {
    echo $proxy . "\n";
}
Composite scraper
You can easily fetch data from multiple scrapers:
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Scrapers;
use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = guzzleHttpClient();

$compositeScraper = new Scrapers\CompositeScraper;

$compositeScraper->addScraper(new Scrapers\FreeProxyListScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\CoolProxyScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\SocksProxyScraper($httpClient));

foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}
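Several sources may list the same proxy, and this README does not say whether the composite scraper removes duplicates on its own. A small generator can filter them, relying only on the fact that proxies can be cast to strings, as the examples here do when echoing them. This is a sketch under that assumption: uniqueByString() and overlappingProxies() are hypothetical names, and overlappingProxies() stands in for $compositeScraper->get() so the snippet runs on its own.

```php
<?php declare(strict_types=1);

/**
 * Yields each item whose string form has not been seen before.
 * Hypothetical helper, not provided by the library.
 */
function uniqueByString(iterable $items): Generator
{
    $seen = [];
    foreach ($items as $item) {
        $key = (string)$item;
        if (isset($seen[$key])) {
            continue; // Already yielded an item with the same string form.
        }
        $seen[$key] = true;
        yield $item;
    }
}

// Stand-in for $compositeScraper->get(): two sources reporting one overlapping proxy.
function overlappingProxies(): Generator
{
    yield '192.0.2.1:8080';
    yield '192.0.2.2:3128';
    yield '192.0.2.1:8080';
}

foreach (uniqueByString(overlappingProxies()) as $proxy) {
    echo $proxy . "\n";
}
```

The helper keeps every seen string in memory, which is fine for proxy lists of ordinary size but worth bearing in mind for very long runs.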
Error handling
Sometimes things go wrong. The following example shows how to handle errors when fetching data from multiple scrapers:
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;

require_once __DIR__ . '/vendor/autoload.php';

$compositeScraper = new Scrapers\CompositeScraper;

// Set exception handler
$compositeScraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurred: ' . $e->getMessage() . "\n";
});

// Fake scraper throwing an exception
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
    public function get(): Generator
    {
        throw new ScraperException('some error');
    }
});

// Fake scraper with no exceptions
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
    public function get(): Generator
    {
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
});

// Run composite scraper
foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}
Output
An error occurred: some error
192.168.0.1:8888
In the same way, you can configure exception handling for a scraper created with the proxyScraper() function, because it returns a CompositeScraper instance:
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = proxyScraper();

$scraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurs: ' . $e->getMessage() . "\n";
});
Validating proxies
A validation step can be added:
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ValidationException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
use Vantoozz\ProxyScraper\Validators;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new class implements Scrapers\ScraperInterface {
    public function get(): \Generator
    {
        yield new Proxy(new Ipv4('104.202.117.106'), new Port(1234));
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
};

$validator = new Validators\ValidatorPipeline;
$validator->addStep(new Validators\Ipv4RangeValidator);

foreach ($scraper->get() as $proxy) {
    try {
        $validator->validate($proxy);
        echo '[OK] ' . $proxy . "\n";
    } catch (ValidationException $e) {
        echo '[Error] ' . $e->getMessage() . ': ' . $proxy . "\n";
    }
}
Output
[OK] 104.202.117.106:1234
[Error] IPv4 is in private range: 192.168.0.1:8888
Metrics
A Proxy object can have metrics (metadata) associated with it.
By default, a Proxy object has a source metric:
<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());

/** @var Proxy $proxy */
$proxy = $scraper->get()->current();

foreach ($proxy->getMetrics() as $metric) {
    echo $metric->getName() . ': ' . $metric->getValue() . "\n";
}
Output
source: Vantoozz\ProxyScraper\Scrapers\UsProxyScraper
Note: the examples use Guzzle as the HTTP client.
Testing
Unit tests
./vendor/bin/phpunit --testsuite=unit
Integration tests
./vendor/bin/phpunit --testsuite=integration
System tests
php ./tests/systemTests.php
Upgrading from version 2
The biggest difference from version 2 is the HTTP client configuration.
Instead of
$httpClient = new \Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient(
    new \Http\Adapter\Guzzle6\Client(new \GuzzleHttp\Client([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new \Http\Message\MessageFactory\GuzzleMessageFactory
);
the client should be instantiated as follows:
$httpClient = \Vantoozz\ProxyScraper\guzzleHttpClient();