blogdaren/phpcreeper

基于Workerman的新一代多进程异步事件驱动爬虫引擎

v1.9.4 2024-09-13 06:47 UTC

README

php posix pcntl event redis license

这是什么

PHPCreeper是基于workerman开发的新一代多进程异步事件驱动爬虫引擎。

  • 专注于高效敏捷开发,使爬取工作变得更加简单
  • 解决传统爬虫框架的性能和扩展瓶颈问题
  • 充分利用多进程+分布式+分离式部署环境下的爬取优势
  • 支持无头浏览器,可执行JavaScript代码以爬取动态页面

爬山虎是基于workerman开发的全新一代多进程异步事件驱动型PHP爬虫引擎,它有助于:

  • 专注于高效敏捷开发,让爬取工作变得更加简单。
  • 解决传统型PHP爬虫框架的性能和扩展瓶颈问题。
  • 充分发挥多进程+分布式+分离式部署环境下的爬取优势。
  • 支持无头浏览器,即支持运行JavaScript代码及其渲染页。

文档

中文文档相对完整,英文文档将在此处不断更新。
注意: 爬山虎中文开发文档相对比较完善,各位小伙伴直接点击下方链接阅读即可.

  • 爬山虎中文官方网站:http://www.phpcreeper.com
  • 中文开发文档主节点:http://www.phpcreeper.com/docs/
  • 中文开发文档备节点:http://www.blogdaren.com/docs/
  • 爬山虎是一个免费开源的佛系爬虫项目,欢迎小星星Star支持,让更多的人发现、使用并受益。
  • 爬山虎源码根目录下有一个Examples/start.php样例脚本,开发之前建议先阅读它而后运行它。
  • 爬山虎提供的例子如果未能按照预期工作,请检查修改爬取规则,因为源站DOM极可能更新了。

技术交流

  • 下方绿色二维码为微信交流群:phpcreeper 【进群之前需先加此专属微信并备明来意或附上备注:爬山虎】
  • 若是奔着虎哥的原创视频《深入PHP内核源码》而来,务必添加专属微信方可获得配套教程文档,弥足珍贵。
  • 微信群主要围绕 爬山虎workerman深入PHP内核源码 开展技术交流,观看PHP内核视频请移步至B站

截图

DemoShow

特性

  • 几乎继承了workerman的所有功能
  • 支持无头浏览器爬取动态页面
  • 支持类似于Linux-Crontab的Crontab-Jobs
  • 支持分布式和分离式部署
  • 支持使用PHPCreeper-Application进行敏捷开发
  • 可以自由定制各种回调和第三方中间件
  • 使用PHPQuery作为优雅的内容提取器
  • 具有高性能和强可扩展性

先决条件

  • PHP_VERSION ≥ 7.0.0 (由于兼容性问题,最好选择PHP 7.4+)
  • POSIX兼容操作系统(Linux、OSX、BSD)
  • PHP的POSIX扩展(必选)
  • PHP的PCNTL扩展(必选)
  • PHP的REDIS扩展(可选,注意:从v1.4.2版本开始,默认使用predis客户端,因此不再强制依赖REDIS扩展)
  • PHP的EVENT扩展(可选,强烈建议安装,这是提升性能的主要支持;另外请注意需要优化Linux内核
  • 简而言之:只要能运行workerman,就能运行PHPCreeper,所以安装要求和workerman完全一致。
  • POSIX扩展和PCNTL扩展是必选项:PHP发行版通常都会默认安装这两个扩展,如果没有,请自行编译安装。
  • EVENT扩展是可选项:建议最好安装,这是提升性能的主要支持;另外请注意需要优化Linux内核
  • REDIS扩展是可选项: 注意:从v1.4.2版本开始,引擎默认采用predis客户端,因此不再强制依赖REDIS扩展。

安装

安装PHPCreeper的推荐方法是使用Composer

composer require blogdaren/phpcreeper

用法:不依赖于PHPCreeper应用框架

首先,还有一个名为 PHPCreeper-Application 的匹配应用程序框架,它同时发布以方便您的开发,尽管这个框架并非必需,但我们强烈建议您使用它,这将大大提高您的工作效率。此外,我们当然可以编写不依赖于框架的代码,而且也容易实现。

接下来,让我们通过一个例子来展示如何捕获按星级排名的 github top 10 repos

<?php 
require "./vendor/autoload.php";

use PHPCreeper\PHPCreeper;
use PHPCreeper\Producer;
use PHPCreeper\Downloader;
use PHPCreeper\Parser;
use PHPCreeper\Server;
use PHPCreeper\Crontab;
use PHPCreeper\Timer;

//enable the single worker mode so that we can run without redis, however, you should note 
//it will be limited to run only all the downloader workers in this case【version >= 1.3.2】
//PHPCreeper::enableMultiWorkerMode(false);

//switch runtime language between `zh` and `en`, default is `zh`【version >= 1.3.7】
PHPCreeper::setLang('en');

//set master pid file manually as needed【version >= 1.3.8】
//PHPCreeper::setMasterPidFile('/path/to/master.pid');

//set worker log file when start as daemon mode as needed【version >= 1.3.8】
//PHPCreeper::setLogFile('/path/to/phpcreeper.log');

//note that `predis` will be the default redis client since【version >= 1.4.2】
//but you could still switch it to be `redis` if you prefer to use ext-redis
//PHPCreeper::setDefaultRedisClient('redis');

//set default timezone, default is `Asia/Shanghai`【version >= 1.5.4】
//PHPCreeper::setDefaultTimezone('Asia/Shanghai');

//redirect all stdandard out to file when run as daemonize【version >= 1.7.0】
//PHPCreeper::setStdoutFile("/path/to/stdout.log");

//set default headless browser, default is `chrome`【version >= 1.8.7】
//PHPCreeper::setDefaultHeadlessBrowser('chrome');

//Global-Redis-Config: support array value with One-Dimension or Two-Dimension, 
//NOTE: since v1.6.4, it's been upgraded to use a more secure and officially
//recommended distributed red lock mechanism by default, but it will use the
//old version of the lock mechanism degenerate only when all the redis instances 
//are explicitly configured with the option [use_red_lock === false] as below.
//for details on how to configure the value, refer to the Follow-Up sections.
$config['redis'] = [
    [
        'host'      =>  '127.0.0.1',
        'port'      =>  6379,
        'database'  =>  '0',
        'auth'      =>  false,
        'pass'      =>  'guest',
        'prefix'    =>  'PHPCreeper', 
        'connection_timeout' => 5,
        'read_write_timeout' => 0,
        'use_red_lock'       => true,   //default to true since v1.6.4
    ],
];

//Global-Task-Config: the context member configured here is a global context,
//we can also set a private context for each task, finally the global context 
//and private task context will adopt the strategy of merging and covering.
//you can free to customize various context settings, including user-defined,
//for details on how to configure it, please refer to the Follow-Up sections.
$config['task'] = array( 
    //'crawl_interval'  => 1,
    //'max_number'      => 1000,
    //'max_connections' => 1,
    //'max_request'     => 1000,
    'context' => [
        'cache_enabled'    => true,
        'cache_directory'  => sys_get_temp_dir() . '/DownloadCache4PHPCreeper/',
        'allow_url_repeat' => true,
        'headless_browser' => ['headless' => false, /*more browser options*/],
        //please refer to the Follow-Up sections to find more context options
    ],
); 

function startAppProducer()
{
    global $config;
    $producer = new Producer($config);

    $producer->setName('AppProducer')->setCount(1);
    $producer->onProducerStart = function($producer){
        //private task context which will be merged with global context
        $private_task_context = [];

        //【version <  1.6.0】: we mainly use an OOP style API to create task     
        //$producer->newTaskMan()->setXXX()->setXXX()->createTask()
        //$producer->newTaskMan()->setXXX()->setXXX()->createTask($task)
        //$producer->newTaskMan()->setXXX()->setXXX()->createMultiTask()
        //$producer->newTaskMan()->setXXX()->setXXX()->createMultiTask($task)

        //【version >= 1.6.0】: we provide a shorter and easier API to create task    
        //with more rich parameter types, and the old OOP style API can still be used,    
        //and extension jobs are promoted just to maintain backward compatibility
        //1. Single-Task-API: $task parameter types supported: [string | 1D-array]    
        //2. Single-Task-API:$producer->createTask($task);   
        //3. Multi-Task-API:  $task parameter types supported: [string | 1D-array | 2D-array]   
        //4. Multi-Task-API: $producer->createMultiTask($task);

        //use string: not recommended to use because the configuration is limited.    
        //so the question is that you need to process the fetching result by yourself     
        //$task = "https://github.com/search?q=stars:%3E1&s=stars&type=Repositories";
        //$producer->createTask($task);
        //$producer->createMultiTask($task);

        //use 1D-array:recommeded to use, rich configuration, engine helps to deal with all    
        $task = array(
            'url'       =>  'https://github.com/search?q=stars:%3E1&s=stars&type=Repositories',
            'rule'      =>  [
                'title' => ['div.Box-sc-g0xbh4-0.bDcVHV div.search-title a span', 'text'],
                'stars' => ['div.Box-sc-g0xbh4-0.bDcVHV ul li a span.Text-sc-17v1xeu-0.gPDEWA', 'text'],
            ],
            'rule_name' =>  '',       //md5($task_id) will be the rule_name if leave empty  
            'refer'     =>  '',
            'type'      =>  'text',   //it has lost the original concept setting, which can be set freely
            'method'    =>  'get',
            "context"   =>  $private_task_context, 
        );
        $producer->createTask($task);
        $producer->createMultiTask($task);

        //use 2D-array: recommed to use, rich configuration,engine helps to deal with all
        //since it is multitasking, only the createMultiTask() API can be called
        $task = array(
            array(
                'url'       => 'https://github.com/search?q=stars:%3E1&s=stars&type=Repositories',
                'rule'      =>  [
                    'title' => ['div.Box-sc-g0xbh4-0.bDcVHV div.search-title a span', 'text'],
                    'stars' => ['div.Box-sc-g0xbh4-0.bDcVHV ul li a span.Text-sc-17v1xeu-0.gPDEWA', 'text'],
                ],
                'rule_name' => 'r1',
                "context"   => $private_task_context,
            ),
            array(
                'url'       => 'https://github.com/search?q=stars:%3E1&s=stars&type=Repositories',
                'rule'      =>  [
                    'title' => ['div.Box-sc-g0xbh4-0.bDcVHV div.search-title a span', 'text'],
                    'stars' => ['div.Box-sc-g0xbh4-0.bDcVHV ul li a span.Text-sc-17v1xeu-0.gPDEWA', 'text'],
                ],
                'rule_name' => 'r2', 
                "context"   => $private_task_context,
            ),
        );
        $producer->createMultiTask($task);

        //use headless browser to crawl dynamic page rendered by javascript
        $private_task_context['headless_browser']['headless'] = true;
        $dynamic_task = array(
            'url' => 'https://www.toutiao.com',
            'rule' => array(
                'title' => ['div.show-monitor ol li a', 'aria-label'],
                'link'  => ['div.show-monitor ol li a', 'href'],
            ), 
            'context' => $private_task_context,
        );
        $producer->createTask($dynamic_task);
    };
}

function startAppDownloader()
{
    global $config;
    $downloader = new Downloader($config);

    //set the client socket address based on the listening parser server 
    $downloader->setName('AppDownloader')->setCount(2)->setClientSocketAddress([
        'ws://127.0.0.1:8888',
    ]);

    $downloader->onDownloadBefore = function($downloader, $task){
        //disable http ssl verify in any of the following two ways 
        //$downloader->httpClient->disableSSL();
        //$downloader->httpClient->setOptions(['verify' => false]);
    }; 

    //use headless browser by user callback or API directly
    $downloader->onHeadlessBrowserOpenPage = function($downloader, $browser, $page, $url){
        //Note: keeping flexible types of return values helps to deal with various complex app scenarios.
        //1. Returning false  will trigger the interruption of subsequent business logic.
        //2. Returning string will trigger the interruption of subsequent business logic, 
        //   it is often used to return the HTML of the web page.
        //3. Returning array  will continue to execute subsequent business logic, 
        //   it is often used to return headless browser options.
        //4. Returning others will continue to execute subsequent business logic, 
        //   which is equivalent to do nothing.

        //Note: Generally, there is no need to call the following lines of code, because 
        //Note: PHPCreeper will automatically call the headless API by default to do the same work.
        //$page->navigate($url)->waitForNavigation('firstMeaningfulPaint');
        //$html = $page->getHtml();
        //return $html;
    };

    //more downloader or download callbacks frequently used
    //$downloader->onDownloaderStart = function($downloader){};
    //$downloader->onDownloaderStop  = function($downloader){};
    //$downloader->onDownloaderMessage = function($downloader, $parser_reply){};
    //$downloader->onDownloaderConnectToParser = function($connection){};
    //$downloader->onDownloadStart = function($downloader, $task){};
    //$downloader->onDownloadAfter = function($downloader, $download_data, $task){};
    //$downloader->onDownloadFail  = function($downloader, $error, $task){};
    //$downloader->onDownloadTaskEmpty = function($downloader){};
}

function startAppParser()
{
    $parser = new Parser();
    $parser->setName('AppParser')->setCount(1)->setServerSocketAddress('websocket://0.0.0.0:8888');
    $parser->onParserExtractField = function($parser, $download_data, $fields){
        pprint($fields);
    };

    //more parser callbacks frequently used
    //$parser->onParserStart = function($parser){};
    //$parser->onParserStop  = function($parser){};
    //$parser->onParserMessage = function($parser, $connection, $download_data){};
    //$parser->onParserFindUrl = function($parser, $sub_url){};
}

function startAppServer()
{
    $server = new Server();
    $server->onServerStart = function(){
        //execute the task every 1 second
        new Crontab('*/1 * * * * *', function(){
            pprint("print the current time every 1 second: " . time());
        });

        //execute the task every 2 minutes 
        new Crontab('*/2 * * * *', function(){
            pprint("print the current time every 2 minutes: " . time());
        });
    };
}

//start producer component
startAppProducer();

//start downloader component
startAppDownloader();

//start parser component 
startAppParser();

//start server component
startAppServer();

//start phpcreeper engine
PHPCreeper::start();

现在,将上面的示例代码保存到文件中,并将其命名为 github.php 作为启动脚本,然后按照以下方式运行:

/path/to/php github.php start

用法:依赖PHPCreeper应用程序框架

如果您想基于 PHPCreeper应用程序框架 开发应用程序或 查看更多配置,请点击此处

如何设置提取规则

//NOTE: this is new usage for【version >= v1.6.0】, strongly recommended to use.
$rule = array( 
    'field1' => ['selector', 'action', 'range', 'callback'],
    .....................................................,
    'fieldN' => ['selector', 'action', 'range', 'callback'],
);

//Single-Task
$task = array(
    'url'  => 'http://www.weather.com.cn/weather/101010100.shtml',
    'rule' => $rule,
    'rule_name' =>  'r1',   
); 

//Multi-Task
$task = array(
    array(
        'url'  => 'http://www.weather.com.cn/weather/101010100.shtml',
        'rule' => $rule,
        'rule_name' => 'r1', 
        "context" => $context,
    ),
    array(
        'url'  => 'http://www.weather.com.cn/weather/201010100.shtml',
        'rule' => $rule,
        'rule_name' => 'r2', 
        "context" => $context,
    ),
);
  • 每个URL配置项与一个唯一的规则配置项匹配,且 rule_name 必须是一对一对应
  • 规则值的类型必须是 数组
  • 对于单个任务,对应规则项的深度,即数组的深度,只能为2
  • 对于多个任务,对应规则项的深度,即数组的深度,只能为3
//NOTE: this is outdated usage for【version < v1.6.0】, not recommended to use.
<?php
$urls = array(
    'rule_name1' => 'http://www.blogdaren.com';
    '..........' => '........................';
    'rule_nameN' => 'http://www.phpcreeper.com';
);

$rule = array( 
    'rule_name1' => array(
        'field1' => ['selector', 'action', 'range', 'callback'],
        '......' => ['........', '....', '.....', '........'];
        'fieldN' => ['selector', 'action', 'range', 'callback'],
    );
    .........................................................,
    'rule_nameN' => array(
        'field1' => ['selector', 'action', 'range', 'callback'],
        '......' => ['........', '....', '.....', '........'];
        'fieldN' => ['selector', 'action', 'range', 'callback'],
    );
);
  • rule_name
    您应该为每个任务提供一个唯一的规则名称,这样我们就可以轻松地索引我们想要的数据,如果您留空,它将使用 md5($task_id) 而不是 md5($task_url) 作为唯一的规则名称,从v1.6.0版本开始,这存在潜在的陷阱。

  • selector
    必须提供选择器,否则它将被忽略,就像jQuery选择器一样,其值可以是 #idName.classNameHtml Element 等等。

  • action
    默认值为 text,表示我们应该采取什么操作,值可以是以下之一
    text:用于获取HTML元素的内部文本
    html:用于获取带有HTML元素标签的内部文本
    attr:用于获取HTML元素的属性值
      【注意:实际值应该是像 srchref 等属性,而不是 attr 本身
    css :  特别用于获取HTML元素的样式属性,并以数组形式返回
      【注意:还支持更变的形式,如 css:*css:prop1,prop2,...propN

  • range
    用于缩小条目,使其仅匹配,就像jQuery选择器一样,其值可以是 #idName.classNameHtml Element 等等。

  • callback
    您可以在这里触发一个 callback字符串callback函数,但请记住返回预期的数据。

    callback字符串:推荐使用,与PHP原生回调函数语义等效。
    callback函数:请注意,您应该使用 callback字符串 而不是 callback函数,因为PHP原生回调函数可能在跨多进程环境通信时工作不正常。

<?php
//extractor rule code example
$html = "<div><a href='http://www.phpcreeper.com' id='site' style='color:red;font-size:100px;'>PHPCreeper</a></div>";
$rule = array(
    'link_element'  => ['div',      'html'],
    'link_text '    => ['#site',    'text'],
    'link_address'  => ['a',        'href'],
    'link_css1'     => ['a',        'css'],
    'link_css2'     => ['div>a',    'css:font-size'],
    'callback_data' => ['#site',    'text', [], 'function($field_name, $data){
        return "Hello " . $data;
    }'], 
);  
$data = $parser->extractField($html, $rule, 'rule1');
pprint($data['rule1']);

//output
Array
(
    [0] => Array
        (
            [link_element] => <a href="http://www.phpcreeper.com" id="site" style="color:red;font-size:100px;">PHPCreeper</a>
            [link_text ] => PHPCreeper
            [link_address] => http://www.phpcreeper.com
            [link_css1] => Array
                (
                    [color] => red
                    [font-size] => 100px
                )

            [link_css2] => Array
                (
                    [font-size] => 100px
                )

            [callback_data] => Hello PHPCreeper
        )

)

使用数据库

PHPCreeper用类似于Medoo风格的轻量级数据库进行包装,如果您想了解更多关于它的用法,请访问Medoo官方网站。现在我们只需要找出如何获取DBO,实际上,这是非常简单的。

首先配置 database.php,然后添加以下代码

<?php
return array(
    'dbo' => array(
        'test' => array(
            'database_type' => 'mysql',
            'database_name' => 'test',
            'server'        => '127.0.0.1',
            'username'      => 'root',
            'password'      => 'root',
            'charset'       => 'utf8'
        ),
    ),
);

现在我们可以获取DBO并开始查询或其他操作,就像您喜欢的那样

<?php
$downloader->onAfterDownloader = function($downloader){
    //dbo single instance and we can pass the DSN string `test`
    $downloader->getDbo('test')->select('title', '*');
    
    //dbo single instance and we can pass the configuration array
    $config = Configurator::get('globalConfig/database/dbo/test')
    $downloader->getDbo($config)->select('title', '*');

    //dbo new instance and we can pass the DSN string `test`
    $downloader->newDbo('test')->select('title', '*');

    //dbo new instance and we can pass the configuration array
    $config = Configurator::get('globalConfig/database/dbo/test')
    $downloader->newDbo($config)->select('title', '*');
};

可用命令

请注意,PHPCreeper中的所有命令只能在命令行上运行,您必须在开始任何爬取任务之前编写一个全局入口启动脚本,其名称假定是start.php,但如果您使用PHPCreeper-Application框架进行开发,它将自动帮助您生成包括全局所需的全部启动脚本。

php start.php start
php start.php start -d
php start.php stop
php start.php restart
php start.php reload
php start.php reload -g
php start.php status
php start.php connections

相关链接和感谢

许可证

PHPCreeper采用MIT许可证发布。

免责声明

不要将PHPCreeper用于您所在国家法律不允许的任何业务。我对该代码不承担任何保证或责任。使用风险自负。