vantoozz/proxy-scraper

Free proxies scraper

v3.0.0 2021-02-21 14:53 UTC

A PHP library for scraping free proxy lists

Quick start

composer require vantoozz/proxy-scraper:~3 guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools

<?php declare(strict_types=1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
}

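Older versions

This is version 3 of the library. For version 2, see the v2 branch; for version 1, see the v1 branch.

Upgrade

How to upgrade

Setup

The library requires a PSR-18 compatible HTTP client. To use the library you have to install one of them, for example: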

composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7

All available clients are listed on Packagist: https://packagist.org.cn/providers/psr/http-client-implementation

Then install the proxy-scraper library itself:

composer require vantoozz/proxy-scraper:~3

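Usage

Auto configuration

The simplest way to start using the library is the proxyScraper() function, which instantiates and configures all the scrapers.

Note that, in addition to guzzlehttp/guzzle:~7 and guzzlehttp/psr7, the auto-configuration function requires the hanneskod/classtools dependency.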

composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools

<?php declare(strict_types=1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
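}

HTTP client

If you are not using auto configuration, you will need an HTTP client.

The library provides the guzzleHttpClient() function, which creates and configures a client.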

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;

use function Vantoozz\ProxyScraper\guzzleHttpClient;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = guzzleHttpClient();

$scraper = proxyScraper($httpClient);

try {
    echo $scraper->get()->current()->getIpv4(). "\n";
} catch (ScraperException $e) {
    echo $e->getMessage() . "\n";
}

You can create your own HTTP client by implementing HttpClientInterface:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\HttpClient\HttpClientInterface;

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = new class implements HttpClientInterface {
    /**
     * @param string $uri
     * @return string
     */
    public function get(string $uri): string
    {
        return "some string";
    }
};

$scraper = proxyScraper($httpClient);

try {
    echo $scraper->get()->current()->getIpv4(). "\n";
} catch (ScraperException $e) {
    echo $e->getMessage() . "\n";
}

Of course, you can also configure the scrapers and the underlying HTTP client manually.

Single scraper

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Scrapers;

use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());

foreach ($scraper->get() as $proxy) {
    echo $proxy . "\n";
}
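
A single source can return a long list, and the loop above walks all of it. Because the result of get() is consumed with foreach, iteration can simply stop early. The following is a minimal sketch of a helper that yields at most a given number of items. The name take() is hypothetical and not part of the library, and plain strings stand in for Proxy objects so the snippet runs without network access.

```php
<?php declare(strict_types=1);

/**
 * Yields at most $limit items from any iterable, then stops.
 * Hypothetical helper: it is not part of vantoozz/proxy-scraper.
 */
function take(iterable $items, int $limit): Generator
{
    if ($limit <= 0) {
        return;
    }

    $count = 0;
    foreach ($items as $item) {
        yield $item;
        if (++$count >= $limit) {
            return; // stop before pulling the next item from the source
        }
    }
}

// Stand-in data; with the library this would be $scraper->get()
$proxies = ['192.0.2.1:8080', '192.0.2.2:3128', '192.0.2.3:1080'];

foreach (take($proxies, 2) as $proxy) {
    echo $proxy . "\n";
}
```

With a real scraper the call would presumably look like take($scraper->get(), 10).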

Composite scraper

You can easily get data from several scrapers at once:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Scrapers;

use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = guzzleHttpClient();

$compositeScraper = new Scrapers\CompositeScraper;

$compositeScraper->addScraper(new Scrapers\FreeProxyListScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\CoolProxyScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\SocksProxyScraper($httpClient));

foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}
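
Different sources may list the same proxy, so a combined stream can contain repeats. The following is a minimal sketch of de-duplicating it by string form; it assumes that casting a proxy to a string gives its ip:port representation, as the echo statements in the examples suggest. The name uniqueProxies() is hypothetical and not part of the library, and plain strings stand in for Proxy objects.

```php
<?php declare(strict_types=1);

/**
 * Yields each item once, keyed by its string form.
 * Hypothetical helper: it is not part of vantoozz/proxy-scraper.
 */
function uniqueProxies(iterable $proxies): Generator
{
    $seen = [];
    foreach ($proxies as $proxy) {
        $key = (string)$proxy; // assumed to be "ip:port" for Proxy objects
        if (isset($seen[$key])) {
            continue; // already yielded, skip the repeat
        }
        $seen[$key] = true;
        yield $proxy;
    }
}

// Stand-in data; with the library this would be $compositeScraper->get()
$proxies = ['192.0.2.1:8080', '192.0.2.2:3128', '192.0.2.1:8080'];

foreach (uniqueProxies($proxies) as $proxy) {
    echo $proxy . "\n";
}
```

The helper keeps every seen key in memory, which should be acceptable for lists of the size free proxy sites usually publish.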

Error handling

Sometimes things go wrong. The following example shows how to handle errors while getting data from several scrapers:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;

require_once __DIR__ . '/vendor/autoload.php';

$compositeScraper = new Scrapers\CompositeScraper;

// Set exception handler
$compositeScraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurred: ' . $e->getMessage() . "\n";
});

// Fake scraper throwing an exception
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
    public function get(): Generator
    {
        throw new ScraperException('some error');
    }
});

// Fake scraper with no exceptions
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
    public function get(): Generator
    {
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
});

//Run composite scraper
foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}

Output

An error occurred: some error
192.168.0.1:8888

In the same way, you can configure exception handling for a scraper created with the proxyScraper() function, because it returns a CompositeScraper instance:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = proxyScraper();

$scraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurred: ' . $e->getMessage() . "\n";
});

Validating proxies

A validation step can be added:

<?php declare(strict_types = 1);

use Vantoozz\ProxyScraper\Exceptions\ValidationException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
use Vantoozz\ProxyScraper\Validators;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new class implements Scrapers\ScraperInterface
{
    public function get(): \Generator
    {
        yield new Proxy(new Ipv4('104.202.117.106'), new Port(1234));
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
};

$validator = new Validators\ValidatorPipeline;
$validator->addStep(new Validators\Ipv4RangeValidator);

foreach ($scraper->get() as $proxy) {
    try {
        $validator->validate($proxy);
        echo '[OK] ' . $proxy . "\n";
    } catch (ValidationException $e) {
        echo '[Error] ' . $e->getMessage() . ': ' . $proxy . "\n";
    }
}

Output

[OK] 104.202.117.106:1234
[Error] IPv4 is in private range: 192.168.0.1:8888

Metrics

Proxy objects can have metrics (metadata) associated with them.

By default, a Proxy object has a source metric:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;

use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());

/** @var Proxy $proxy */
$proxy = $scraper->get()->current();

foreach ($proxy->getMetrics() as $metric) {
    echo $metric->getName() . ': ' . $metric->getValue() . "\n";
}

Output

source: Vantoozz\ProxyScraper\Scrapers\UsProxyScraper
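
When metrics need to be stored or logged, a plain array can be more convenient than a list of objects. The following is a minimal sketch that relies only on getName() and getValue(), the two accessors used in the example above. The name metricsToArray() is hypothetical and not part of the library, and the anonymous class is a stand-in for a real metric object so the snippet runs on its own.

```php
<?php declare(strict_types=1);

/**
 * Collects metrics into a name => value array.
 * Hypothetical helper: it is not part of vantoozz/proxy-scraper.
 */
function metricsToArray(iterable $metrics): array
{
    $result = [];
    foreach ($metrics as $metric) {
        // A later metric with the same name overwrites an earlier one
        $result[$metric->getName()] = $metric->getValue();
    }

    return $result;
}

// Stand-in metric object; with the library this would come from $proxy->getMetrics()
$fakeMetric = new class {
    public function getName(): string
    {
        return 'source';
    }

    public function getValue(): string
    {
        return 'Vantoozz\ProxyScraper\Scrapers\UsProxyScraper';
    }
};

print_r(metricsToArray([$fakeMetric]));
```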

Note: the examples use Guzzle as the HTTP client.

Testing

Unit tests

./vendor/bin/phpunit --testsuite=unit

Integration tests

./vendor/bin/phpunit --testsuite=integration

System tests

php ./tests/systemTests.php

Upgrade from version 2

The biggest difference from version 2 is the HTTP client configuration.

Instead of

$httpClient = new \Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient(
    new \Http\Adapter\Guzzle6\Client(new \GuzzleHttp\Client([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new \Http\Message\MessageFactory\GuzzleMessageFactory
);

the client should be instantiated like this:

$httpClient = \Vantoozz\ProxyScraper\guzzleHttpClient();