blogdaren / phpcreeper
基于Workerman的新一代多进程异步事件驱动爬虫引擎
Requires
- php: >=7.0.0
- blogdaren/configurator: ^1.0
- blogdaren/logger: ^1.1.2
- chrome-php/chrome: ^1.11
- guzzlehttp/guzzle: ^6.4 || ^7.0
- predis/predis: 2.0.*
- workerman/workerman: >=3.5.0,<5.0.0
Suggests
- ext-event: For better performance
- dev-master
- v1.9.4
- v1.9.3
- v1.9.2
- v1.9.1
- v1.9.0
- v1.8.9
- v1.8.8
- v1.8.7
- v1.8.5
- v1.8.3
- v1.8.1
- v1.8.0
- v1.7.2
- v1.7.1
- v1.7.0
- v1.6.9
- v1.6.8
- v1.6.7
- v1.6.6
- v1.6.5
- v1.6.4
- v1.6.3
- v1.6.2
- v1.6.1
- v1.6.0
- v1.5.6
- v1.5.5
- v1.5.4
- v1.5.1
- v1.5.0
- v1.4.9
- v1.4.8
- v1.4.7
- v1.4.6
- v1.4.5
- v1.4.4
- v1.4.3
- v1.4.2
- v1.4.1
- v1.3.9
- v1.3.8
- v1.3.7
- v1.3.6
- v1.3.4
- v1.3.3
- v1.3.2
- v1.3.1
- v1.3.0
- v1.2.2
- v1.2.1
- v1.2.0
- v1.1.6
- v1.1.5
- v1.1.4
- v1.1.3
- v1.1.2
- v1.1.1
This package is auto-updated.
Last update: 2024-09-13 06:56:27 UTC
README
这是什么
PHPCreeper是基于workerman开发的新一代多进程异步事件驱动爬虫引擎。
- 专注于高效敏捷开发,使爬取工作变得更加简单
- 解决传统爬虫框架的性能和扩展瓶颈问题
- 充分利用多进程+分布式+分离式部署环境下的爬取优势
- 支持无头浏览器,可执行JavaScript代码以爬取动态页面
爬山虎是基于workerman开发的全新一代多进程异步事件驱动型PHP爬虫引擎,它有助于:
- 专注于高效敏捷开发,让爬取工作变得更加简单。
- 解决传统型PHP爬虫框架的性能和扩展瓶颈问题。
- 充分发挥多进程+分布式+分离式部署环境下的爬取优势。
- 支持无头浏览器,即支持运行JavaScript代码及其渲染页。
文档
中文文档相对完整,英文文档将在此处不断更新。
注意: 爬山虎中文开发文档相对比较完善,各位小伙伴直接点击下方链接阅读即可.
- 爬山虎中文官方网站:http://www.phpcreeper.com
- 中文开发文档主节点:http://www.phpcreeper.com/docs/
- 中文开发文档备节点:http://www.blogdaren.com/docs/
- 爬山虎是一个免费开源的佛系爬虫项目,欢迎小星星Star支持,让更多的人发现、使用并受益。
- 爬山虎源码根目录下有一个
Examples/start.php
样例脚本,开发之前建议先阅读它而后运行它。 - 爬山虎提供的例子如果未能按照预期工作,请检查修改爬取规则,因为源站DOM极可能更新了。
技术交流
- 下方绿色二维码为微信交流群:phpcreeper 【进群之前需先加此专属微信并备明来意或附上备注:爬山虎】
- 若是奔着虎哥的原创视频《深入PHP内核源码》而来,务必添加专属微信方可获得配套教程文档,弥足珍贵。
- 微信群主要围绕 爬山虎 和 workerman 和 深入PHP内核源码 开展技术交流,观看PHP内核视频请移步至B站。
截图
特性
- 几乎继承了workerman的所有功能
- 支持无头浏览器爬取动态页面
- 支持类似于Linux-Crontab的Crontab-Jobs
- 支持分布式和分离式部署
- 支持使用PHPCreeper-Application进行敏捷开发
- 可以自由定制各种回调和第三方中间件
- 使用PHPQuery作为优雅的内容提取器
- 具有高性能和强可扩展性
先决条件
- PHP_VERSION ≥ 7.0.0 (由于兼容性问题,最好选择PHP 7.4+)
- POSIX兼容操作系统(Linux、OSX、BSD)
- PHP的POSIX扩展(必选)
- PHP的PCNTL扩展(必选)
- PHP的REDIS扩展(可选,注意:从v1.4.2版本开始,默认使用predis客户端,因此不再强制依赖REDIS扩展)
- PHP的EVENT扩展(可选,强烈建议安装,这是提升性能的主要支持;另外请注意需要优化Linux内核)
- 简而言之:只要能运行workerman,就能运行PHPCreeper,所以安装要求和workerman完全一致。
- POSIX扩展和PCNTL扩展是必选项:PHP发行版通常都会默认安装这两个扩展,如果没有,请自行编译安装。
- EVENT扩展是可选项:建议最好安装,这是提升性能的主要支持;另外请注意需要优化Linux内核。
- REDIS扩展是可选项: 注意:从v1.4.2版本开始,引擎默认采用predis客户端,因此不再强制依赖REDIS扩展。
安装
安装PHPCreeper的推荐方法是使用Composer。
composer require blogdaren/phpcreeper
用法:不依赖于PHPCreeper应用框架
首先,还有一个名为 PHPCreeper-Application 的匹配应用程序框架,它同时发布以方便您的开发,尽管这个框架并非必需,但我们强烈建议您使用它,这将大大提高您的工作效率。此外,我们当然可以编写不依赖于框架的代码,而且也容易实现。
接下来,让我们通过一个例子来展示如何捕获按星级排名的 github top 10 repos
:
<?php require "./vendor/autoload.php"; use PHPCreeper\PHPCreeper; use PHPCreeper\Producer; use PHPCreeper\Downloader; use PHPCreeper\Parser; use PHPCreeper\Server; use PHPCreeper\Crontab; use PHPCreeper\Timer; //enable the single worker mode so that we can run without redis, however, you should note //it will be limited to run only all the downloader workers in this case【version >= 1.3.2】 //PHPCreeper::enableMultiWorkerMode(false); //switch runtime language between `zh` and `en`, default is `zh`【version >= 1.3.7】 PHPCreeper::setLang('en'); //set master pid file manually as needed【version >= 1.3.8】 //PHPCreeper::setMasterPidFile('/path/to/master.pid'); //set worker log file when start as daemon mode as needed【version >= 1.3.8】 //PHPCreeper::setLogFile('/path/to/phpcreeper.log'); //note that `predis` will be the default redis client since【version >= 1.4.2】 //but you could still switch it to be `redis` if you prefer to use ext-redis //PHPCreeper::setDefaultRedisClient('redis'); //set default timezone, default is `Asia/Shanghai`【version >= 1.5.4】 //PHPCreeper::setDefaultTimezone('Asia/Shanghai'); //redirect all stdandard out to file when run as daemonize【version >= 1.7.0】 //PHPCreeper::setStdoutFile("/path/to/stdout.log"); //set default headless browser, default is `chrome`【version >= 1.8.7】 //PHPCreeper::setDefaultHeadlessBrowser('chrome'); //Global-Redis-Config: support array value with One-Dimension or Two-Dimension, //NOTE: since v1.6.4, it's been upgraded to use a more secure and officially //recommended distributed red lock mechanism by default, but it will use the //old version of the lock mechanism degenerate only when all the redis instances //are explicitly configured with the option [use_red_lock === false] as below. //for details on how to configure the value, refer to the Follow-Up sections. $config['redis'] = [ [ 'host' => '127.0.0.1', 'port' => 6379, 'database' => '0', 'auth' => false, 'pass' => 'guest', 'prefix' => 'PHPCreeper', 'connection_timeout' => 5, 'read_write_timeout' => 0, 'use_red_lock' => true, //default to true since v1.6.4 ], ]; //Global-Task-Config: the context member configured here is a global context, //we can also set a private context for each task, finally the global context //and private task context will adopt the strategy of merging and covering. //you can free to customize various context settings, including user-defined, //for details on how to configure it, please refer to the Follow-Up sections. $config['task'] = array( //'crawl_interval' => 1, //'max_number' => 1000, //'max_connections' => 1, //'max_request' => 1000, 'context' => [ 'cache_enabled' => true, 'cache_directory' => sys_get_temp_dir() . '/DownloadCache4PHPCreeper/', 'allow_url_repeat' => true, 'headless_browser' => ['headless' => false, /*more browser options*/], //please refer to the Follow-Up sections to find more context options ], ); function startAppProducer() { global $config; $producer = new Producer($config); $producer->setName('AppProducer')->setCount(1); $producer->onProducerStart = function($producer){ //private task context which will be merged with global context $private_task_context = []; //【version < 1.6.0】: we mainly use an OOP style API to create task //$producer->newTaskMan()->setXXX()->setXXX()->createTask() //$producer->newTaskMan()->setXXX()->setXXX()->createTask($task) //$producer->newTaskMan()->setXXX()->setXXX()->createMultiTask() //$producer->newTaskMan()->setXXX()->setXXX()->createMultiTask($task) //【version >= 1.6.0】: we provide a shorter and easier API to create task //with more rich parameter types, and the old OOP style API can still be used, //and extension jobs are promoted just to maintain backward compatibility //1. Single-Task-API: $task parameter types supported: [string | 1D-array] //2. Single-Task-API:$producer->createTask($task); //3. Multi-Task-API: $task parameter types supported: [string | 1D-array | 2D-array] //4. Multi-Task-API: $producer->createMultiTask($task); //use string: not recommended to use because the configuration is limited. //so the question is that you need to process the fetching result by yourself //$task = "https://github.com/search?q=stars:%3E1&s=stars&type=Repositories"; //$producer->createTask($task); //$producer->createMultiTask($task); //use 1D-array:recommeded to use, rich configuration, engine helps to deal with all $task = array( 'url' => 'https://github.com/search?q=stars:%3E1&s=stars&type=Repositories', 'rule' => [ 'title' => ['div.Box-sc-g0xbh4-0.bDcVHV div.search-title a span', 'text'], 'stars' => ['div.Box-sc-g0xbh4-0.bDcVHV ul li a span.Text-sc-17v1xeu-0.gPDEWA', 'text'], ], 'rule_name' => '', //md5($task_id) will be the rule_name if leave empty 'refer' => '', 'type' => 'text', //it has lost the original concept setting, which can be set freely 'method' => 'get', "context" => $private_task_context, ); $producer->createTask($task); $producer->createMultiTask($task); //use 2D-array: recommed to use, rich configuration,engine helps to deal with all //since it is multitasking, only the createMultiTask() API can be called $task = array( array( 'url' => 'https://github.com/search?q=stars:%3E1&s=stars&type=Repositories', 'rule' => [ 'title' => ['div.Box-sc-g0xbh4-0.bDcVHV div.search-title a span', 'text'], 'stars' => ['div.Box-sc-g0xbh4-0.bDcVHV ul li a span.Text-sc-17v1xeu-0.gPDEWA', 'text'], ], 'rule_name' => 'r1', "context" => $private_task_context, ), array( 'url' => 'https://github.com/search?q=stars:%3E1&s=stars&type=Repositories', 'rule' => [ 'title' => ['div.Box-sc-g0xbh4-0.bDcVHV div.search-title a span', 'text'], 'stars' => ['div.Box-sc-g0xbh4-0.bDcVHV ul li a span.Text-sc-17v1xeu-0.gPDEWA', 'text'], ], 'rule_name' => 'r2', "context" => $private_task_context, ), ); $producer->createMultiTask($task); //use headless browser to crawl dynamic page rendered by javascript $private_task_context['headless_browser']['headless'] = true; $dynamic_task = array( 'url' => 'https://www.toutiao.com', 'rule' => array( 'title' => ['div.show-monitor ol li a', 'aria-label'], 'link' => ['div.show-monitor ol li a', 'href'], ), 'context' => $private_task_context, ); $producer->createTask($dynamic_task); }; } function startAppDownloader() { global $config; $downloader = new Downloader($config); //set the client socket address based on the listening parser server $downloader->setName('AppDownloader')->setCount(2)->setClientSocketAddress([ 'ws://127.0.0.1:8888', ]); $downloader->onDownloadBefore = function($downloader, $task){ //disable http ssl verify in any of the following two ways //$downloader->httpClient->disableSSL(); //$downloader->httpClient->setOptions(['verify' => false]); }; //use headless browser by user callback or API directly $downloader->onHeadlessBrowserOpenPage = function($downloader, $browser, $page, $url){ //Note: keeping flexible types of return values helps to deal with various complex app scenarios. //1. Returning false will trigger the interruption of subsequent business logic. //2. Returning string will trigger the interruption of subsequent business logic, // it is often used to return the HTML of the web page. //3. Returning array will continue to execute subsequent business logic, // it is often used to return headless browser options. //4. Returning others will continue to execute subsequent business logic, // which is equivalent to do nothing. //Note: Generally, there is no need to call the following lines of code, because //Note: PHPCreeper will automatically call the headless API by default to do the same work. //$page->navigate($url)->waitForNavigation('firstMeaningfulPaint'); //$html = $page->getHtml(); //return $html; }; //more downloader or download callbacks frequently used //$downloader->onDownloaderStart = function($downloader){}; //$downloader->onDownloaderStop = function($downloader){}; //$downloader->onDownloaderMessage = function($downloader, $parser_reply){}; //$downloader->onDownloaderConnectToParser = function($connection){}; //$downloader->onDownloadStart = function($downloader, $task){}; //$downloader->onDownloadAfter = function($downloader, $download_data, $task){}; //$downloader->onDownloadFail = function($downloader, $error, $task){}; //$downloader->onDownloadTaskEmpty = function($downloader){}; } function startAppParser() { $parser = new Parser(); $parser->setName('AppParser')->setCount(1)->setServerSocketAddress('websocket://0.0.0.0:8888'); $parser->onParserExtractField = function($parser, $download_data, $fields){ pprint($fields); }; //more parser callbacks frequently used //$parser->onParserStart = function($parser){}; //$parser->onParserStop = function($parser){}; //$parser->onParserMessage = function($parser, $connection, $download_data){}; //$parser->onParserFindUrl = function($parser, $sub_url){}; } function startAppServer() { $server = new Server(); $server->onServerStart = function(){ //execute the task every 1 second new Crontab('*/1 * * * * *', function(){ pprint("print the current time every 1 second: " . time()); }); //execute the task every 2 minutes new Crontab('*/2 * * * *', function(){ pprint("print the current time every 2 minutes: " . time()); }); }; } //start producer component startAppProducer(); //start downloader component startAppDownloader(); //start parser component startAppParser(); //start server component startAppServer(); //start phpcreeper engine PHPCreeper::start();
现在,将上面的示例代码保存到文件中,并将其命名为 github.php
作为启动脚本,然后按照以下方式运行:
/path/to/php github.php start
用法:依赖PHPCreeper应用程序框架
如果您想基于 PHPCreeper应用程序框架
开发应用程序或 查看更多配置
,请点击此处
如何设置提取规则
//NOTE: this is new usage for【version >= v1.6.0】, strongly recommended to use. $rule = array( 'field1' => ['selector', 'action', 'range', 'callback'], ....................................................., 'fieldN' => ['selector', 'action', 'range', 'callback'], ); //Single-Task $task = array( 'url' => 'http://www.weather.com.cn/weather/101010100.shtml', 'rule' => $rule, 'rule_name' => 'r1', ); //Multi-Task $task = array( array( 'url' => 'http://www.weather.com.cn/weather/101010100.shtml', 'rule' => $rule, 'rule_name' => 'r1', "context" => $context, ), array( 'url' => 'http://www.weather.com.cn/weather/201010100.shtml', 'rule' => $rule, 'rule_name' => 'r2', "context" => $context, ), );
- 每个URL配置项与一个唯一的规则配置项匹配,且 rule_name 必须是一对一对应
- 规则值的类型必须是 数组
- 对于单个任务,对应规则项的深度,即数组的深度,只能为2
- 对于多个任务,对应规则项的深度,即数组的深度,只能为3
//NOTE: this is outdated usage for【version < v1.6.0】, not recommended to use. <?php $urls = array( 'rule_name1' => 'http://www.blogdaren.com'; '..........' => '........................'; 'rule_nameN' => 'http://www.phpcreeper.com'; ); $rule = array( 'rule_name1' => array( 'field1' => ['selector', 'action', 'range', 'callback'], '......' => ['........', '....', '.....', '........']; 'fieldN' => ['selector', 'action', 'range', 'callback'], ); ........................................................., 'rule_nameN' => array( 'field1' => ['selector', 'action', 'range', 'callback'], '......' => ['........', '....', '.....', '........']; 'fieldN' => ['selector', 'action', 'range', 'callback'], ); );
-
rule_name
您应该为每个任务提供一个唯一的规则名称,这样我们就可以轻松地索引我们想要的数据,如果您留空,它将使用md5($task_id)
而不是md5($task_url)
作为唯一的规则名称,从v1.6.0版本开始,这存在潜在的陷阱。 -
selector
必须提供选择器,否则它将被忽略,就像jQuery选择器一样,其值可以是#idName
或.className
或Html Element
等等。 -
action
默认值为text
,表示我们应该采取什么操作,值可以是以下之一
text
:用于获取HTML元素的内部文本
html
:用于获取带有HTML元素标签的内部文本
attr
:用于获取HTML元素的属性值
【注意:实际值应该是像src
、href
等属性,而不是attr
本身】
css
: 特别用于获取HTML元素的样式属性,并以数组形式返回
【注意:还支持更变的形式,如css:*
、css:prop1,prop2,...propN
】 -
range
用于缩小条目,使其仅匹配,就像jQuery选择器一样,其值可以是#idName
或.className
或Html Element
等等。 -
callback
您可以在这里触发一个callback字符串
或callback函数
,但请记住返回预期的数据。callback字符串:
推荐使用,与PHP原生回调函数语义等效。
callback函数:
请注意,您应该使用callback字符串
而不是,因为PHP原生回调函数可能在跨多进程环境通信时工作不正常。callback函数
<?php //extractor rule code example $html = "<div><a href='http://www.phpcreeper.com' id='site' style='color:red;font-size:100px;'>PHPCreeper</a></div>"; $rule = array( 'link_element' => ['div', 'html'], 'link_text ' => ['#site', 'text'], 'link_address' => ['a', 'href'], 'link_css1' => ['a', 'css'], 'link_css2' => ['div>a', 'css:font-size'], 'callback_data' => ['#site', 'text', [], 'function($field_name, $data){ return "Hello " . $data; }'], ); $data = $parser->extractField($html, $rule, 'rule1'); pprint($data['rule1']); //output Array ( [0] => Array ( [link_element] => <a href="http://www.phpcreeper.com" id="site" style="color:red;font-size:100px;">PHPCreeper</a> [link_text ] => PHPCreeper [link_address] => http://www.phpcreeper.com [link_css1] => Array ( [color] => red [font-size] => 100px ) [link_css2] => Array ( [font-size] => 100px ) [callback_data] => Hello PHPCreeper ) )
使用数据库
PHPCreeper用类似于Medoo风格的轻量级数据库进行包装,如果您想了解更多关于它的用法,请访问Medoo官方网站。现在我们只需要找出如何获取DBO,实际上,这是非常简单的。
首先配置 database.php
,然后添加以下代码
<?php return array( 'dbo' => array( 'test' => array( 'database_type' => 'mysql', 'database_name' => 'test', 'server' => '127.0.0.1', 'username' => 'root', 'password' => 'root', 'charset' => 'utf8' ), ), );
现在我们可以获取DBO并开始查询或其他操作,就像您喜欢的那样
<?php $downloader->onAfterDownloader = function($downloader){ //dbo single instance and we can pass the DSN string `test` $downloader->getDbo('test')->select('title', '*'); //dbo single instance and we can pass the configuration array $config = Configurator::get('globalConfig/database/dbo/test') $downloader->getDbo($config)->select('title', '*'); //dbo new instance and we can pass the DSN string `test` $downloader->newDbo('test')->select('title', '*'); //dbo new instance and we can pass the configuration array $config = Configurator::get('globalConfig/database/dbo/test') $downloader->newDbo($config)->select('title', '*'); };
可用命令
请注意,PHPCreeper中的所有命令只能在命令行上运行,您必须在开始任何爬取任务之前编写一个全局入口启动脚本,其名称假定是start.php,但如果您使用PHPCreeper-Application框架进行开发,它将自动帮助您生成包括全局所需的全部启动脚本。
php start.php start
php start.php start -d
php start.php stop
php start.php restart
php start.php reload
php start.php reload -g
php start.php status
php start.php connections
相关链接和感谢
许可证
PHPCreeper采用MIT许可证发布。
免责声明
请不要将PHPCreeper用于您所在国家法律不允许的任何业务。我对该代码不承担任何保证或责任。使用风险自负。