README

Spider

Spider is a simple, elegant, extensible PHP web crawler based on phpQuery.

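Features

CSS3 DOM selectors that match jQuery's
A DOM manipulation API that matches jQuery's
A generic list crawling routine
A full HTTP request suite that makes complex requests such as simulated login, browser spoofing, and HTTP proxies easy to implement
Handling for garbled text (mojibake) caused by mismatched character encodings
Content filtering driven by jQuery selectors
A highly modular, extensible design
An expressive API
A wide range of plug-ins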

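With plug-ins you can easily implement the following

Multithreaded crawling
Crawling pages rendered dynamically by JavaScript (PhantomJS/headless WebKit)
Downloading images to local storage
Simulating browser behavior, such as submitting forms
Web crawlers
.....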

Requirements

PHP >= 7.1

Installation

Install via Composer

composer require laoqianjunzi/spider

Usage

DOM traversal and manipulation

Crawl all image links on GitHub

Spider::get('https://github.com')->find('img')->attrs('src');

Crawl Google search results

$ql = Spider::get('https://www.google.co.jp/search?q=Spider');

$ql->find('title')->text(); //The page title
$ql->find('meta[name=keywords]')->content; //The page keywords

$ql->find('h3>a')->texts(); //Get a list of search results titles
$ql->find('h3>a')->attrs('href'); //Get a list of search results links

$ql->find('img')->src; //Gets the link address of the first image
$ql->find('img:eq(1)')->src; //Gets the link address of the second image
$ql->find('img')->eq(2)->src; //Gets the link address of the third image
// Loop all the images
$ql->find('img')->map(function($img){
    echo $img->alt;  //Print the alt attribute of the image
});

More usage

$ql->find('#head')->append('<div>Append content</div>')->find('div')->htmls();
$ql->find('.two')->children('img')->attrs('alt'); // Get the alt attributes of all img children of the element with class "two"
// Loop over all child nodes of the element with class "two"
$data = $ql->find('.two')->children()->map(function ($item){
    // Use "is" to determine the node type
    if($item->is('a')){
        return $item->text();
    }elseif($item->is('img'))
    {
        return $item->alt;
    }
});

$ql->find('a')->attr('href', 'newVal')->removeClass('className')->html('newHtml')->...
$ql->find('div > p')->add('div > ul')->filter(':has(a)')->find('p:first')->nextAll()->andSelf()->...
$ql->find('div.old')->replaceWith( $ql->find('div.new')->clone())->appendTo('.trash')->prepend('Deleted')->...

List crawling

Crawl the titles and links of the Google search result list

$data = Spider::get('https://www.google.co.jp/search?q=Spider')
    // Set the crawl rules
    ->rules([
        'title' => array('h3','text'),
        'link' => array('h3>a','href')
    ])
    ->query()->getData();

print_r($data->all());

Result

Array
(
    [0] => Array
        (
            [title] => Angular - Spider
            [link] => https://angular.io/api/core/Spider
        )
    [1] => Array
        (
            [title] => Spider | @angular/core - Angularリファレンス - Web Creative Park
            [link] => http://www.webcreativepark.net/angular/Spider/
        )
    [2] => Array
        (
            [title] => SpiderにQueryを追加したり、追加されたことを感知する | TIPS ...
            [link] => http://www.webcreativepark.net/angular/Spider_query_add_subscribe/
        )
        //...
)
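
To make the shape of the rules array more concrete, here is a minimal plain-PHP sketch of the same idea: each field maps to a pair of (selector, attribute), and the special attribute `text` means "take the node text". It is not Spider's implementation. The helper name `extractList` is hypothetical, it uses XPath through the built-in DOM extension instead of CSS selectors, and the HTML is inline sample data.

```php
<?php
// Hypothetical helper, not part of Spider: applies "field => [XPath, attribute]"
// rules to every row node matched by $rowXPath.
function extractList(string $html, string $rowXPath, array $rules): array
{
    // Collect libxml warnings internally instead of printing them,
    // because real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $rows = [];
    foreach ($xpath->query($rowXPath) as $row) {
        $item = [];
        foreach ($rules as $field => [$path, $attr]) {
            // Evaluate the rule's path relative to the current row node.
            $node = $xpath->query($path, $row)->item(0);
            if (!($node instanceof DOMElement)) {
                $item[$field] = null;               // Nothing matched for this field.
            } elseif ($attr === 'text') {
                $item[$field] = trim($node->textContent);
            } else {
                $item[$field] = $node->getAttribute($attr);
            }
        }
        $rows[] = $item;
    }
    return $rows;
}

// Inline sample markup standing in for a fetched page.
$html = '<html><body>'
      . '<h3><a href="https://example.com/a">First</a></h3>'
      . '<h3><a href="https://example.com/b">Second</a></h3>'
      . '</body></html>';

$data = extractList($html, '//h3', [
    'title' => ['.', 'text'],
    'link'  => ['./a', 'href'],
]);
print_r($data);
```

Each row becomes one associative array keyed by the rule names, which is the same shape as the result printed above.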

Encoding conversion

// Output charset: UTF-8
// Input charset: GB2312
Spider::get('https://top.etao.com')->encoding('UTF-8','GB2312')->find('a')->texts();

// Output charset: UTF-8
// Input charset: detected automatically
Spider::get('https://top.etao.com')->encoding('UTF-8')->find('a')->texts();
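
For readers who want to see what such a conversion involves, the following is a rough sketch using PHP's mbstring extension. It is not how Spider implements `encoding()`. The helper name `toUtf8` and the short candidate list used for detection are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not Spider's implementation. Requires the mbstring extension.
function toUtf8(string $bytes, ?string $from = null): string
{
    if ($from === null) {
        // Strict detection over a short candidate list (the list is an assumption).
        // EUC-CN is the name mbstring uses for GB2312.
        $from = mb_detect_encoding($bytes, ['UTF-8', 'EUC-CN'], true);
        if ($from === false) {
            return $bytes; // Unknown encoding: leave the input untouched.
        }
    }
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

$gb2312 = "\xD6\xD0\xCE\xC4"; // The two characters "中文" encoded as GB2312

echo toUtf8($gb2312, 'GB2312'), PHP_EOL; // Explicit input charset
echo toUtf8($gb2312), PHP_EOL;           // Input charset detected from the candidates
```

Automatic detection is only a heuristic: many byte sequences are valid in several encodings, so passing the input charset explicitly is the safer choice when it is known.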

HTTP client (GuzzleHttp)

Log in to GitHub with a cookie

//Crawl GitHub content
$ql = Spider::get('https://github.com','param1=testvalue&params2=somevalue',[
    'headers' => [
        // Fill in the cookie from the browser
        'Cookie' => 'SINAGLOBAL=546064; wb_cmtLike_2112031=1; wvr=6;....'
    ]
]);
//echo $ql->getHtml();
$userName = $ql->find('.header-nav-current-user>.css-truncate-target')->text();
echo $userName;

Simulated login

// Post login
$ql = Spider::post('http://xxx.com/login',[
  'username' => 'admin',
  'password' => '123456'
])->get('http://xxx.com/admin');
// Crawl pages that need to be logged in to access
$ql->get('http://xxx.com/admin/page');
//echo $ql->getHtml();

Submit a form

// Get the Spider instance
$ql = Spider::getInstance();
// Get the login form
$form = $ql->get('https://github.com/login')->find('form');

// Fill in the GitHub username and password
$form->find('input[name=login]')->val('your github username or email');
$form->find('input[name=password]')->val('your github password');

// Serialize the form data
$formData = $form->serializeArray();
$postData = [];
foreach ($formData as $item) {
    $postData[$item['name']] = $item['value'];
}

// Submit the login form
$actionUrl = 'https://github.com'.$form->attr('action');
$ql->post($actionUrl,$postData);
// Check whether the login succeeded
// echo $ql->getHtml();
$userName = $ql->find('.header-nav-current-user>.css-truncate-target')->text();
if($userName)
{
    echo 'Login successful ! Welcome:'.$userName;
}else{
    echo 'Login failed !';
}
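
The loop above that flattens the serialized form into a name => value array can also be written with PHP's built-in `array_column`. This sketch assumes `serializeArray()` returns a plain array of `['name' => ..., 'value' => ...]` pairs, as the loop implies. The sample values are made up.

```php
<?php
// Sample data in the shape the loop above assumes serializeArray() returns.
$formData = [
    ['name' => 'login',    'value' => 'octocat'],
    ['name' => 'password', 'value' => 'secret'],
    ['name' => 'commit',   'value' => 'Sign in'],
];

// array_column(input, value column, key column) builds name => value in one call.
$postData = array_column($formData, 'value', 'name');

print_r($postData);
```

As with the loop, a later field overwrites an earlier one that has the same name.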

Binding function extensions

Extend Spider with a custom myHttp method

$ql = Spider::getInstance();

//Bind a `myHttp` method to the Spider object
$ql->bind('myHttp',function ($url){
	// $this is the current Spider object
    $html = file_get_contents($url);
    $this->setHtml($html);
    return $this;
});

// Then call it by the name it was bound under
$data = $ql->myHttp('https://toutiao.io')->find('h3 a')->texts();
print_r($data->all());

Or package it into a class, then bind it

$ql->bind('myHttp',function ($url){
    return new MyHttp($this,$url);
});
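
One way a `bind` mechanism like this could work is by storing closures and rebinding them to the object, so that `$this` inside the closure refers to the crawler instance. The sketch below illustrates that technique with `Closure::bindTo` and `__call`. `MiniSpider` and the bound method `fromString` are hypothetical stand-ins, and the real Spider class may be implemented differently.

```php
<?php
// MiniSpider is a hypothetical stand-in, not the real Spider class.
class MiniSpider
{
    /** @var Closure[] Bound closures, keyed by method name. */
    private $bound = [];
    /** @var string */
    private $html = '';

    public function bind(string $name, Closure $fn): self
    {
        // Rebind the closure so that $this inside it refers to this object.
        $this->bound[$name] = $fn->bindTo($this, static::class);
        return $this;
    }

    // Called by PHP whenever an undefined method is invoked on the object.
    public function __call(string $name, array $args)
    {
        if (!isset($this->bound[$name])) {
            throw new BadMethodCallException("Method {$name} is not bound");
        }
        return ($this->bound[$name])(...$args);
    }

    public function setHtml(string $html): self
    {
        $this->html = $html;
        return $this;
    }

    public function getHtml(): string
    {
        return $this->html;
    }
}

$ql = new MiniSpider();
$ql->bind('fromString', function (string $markup) {
    // $this is the MiniSpider instance the closure was bound to.
    $this->setHtml($markup);
    return $this;
});

echo $ql->fromString('<h3>Hello</h3>')->getHtml(), PHP_EOL;
```

Returning `$this` from the bound closure is what keeps the call chainable, as in the `myHttp` example.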

Use the CURL multithreading plug-in to crawl GitHub trending pages concurrently

$ql = Spider::use(CurlMulti::class);
$ql->curlMulti([
    'https://github.com/trending/php',
    'https://github.com/trending/go',
    //.....more urls
])
// Called when a task succeeds
->success(function (Spider $ql,CurlMulti $curl,$r){
    echo "Current url:{$r['info']['url']} \r\n";
    $data = $ql->find('h3 a')->texts();
    print_r($data->all());
})
// Called when a task fails
->error(function ($errorInfo,CurlMulti $curl){
    echo "Current url:{$errorInfo['info']['url']} \r\n";
    print_r($errorInfo['error']);
})
->start([
    // Maximum number of threads
    'maxThread' => 10,
    // Number of error retries
    'maxTry' => 3,
]);
