sleimanx2 / grawler
Guided HTML crawler with media metadata extraction
0.2.4
2019-09-29 11:29 UTC
Requires
- php: >=5.5
- fabpot/goutte: 3.1.*
- google/apiclient: 1.*
- hassankhan/config: 0.8.*
- njasm/soundcloud: 2.2.*
- vimeo/vimeo-api: 1.2.*
- vlucas/phpdotenv: 2.2.*
Requires (Dev)
- mockery/mockery: ^0.9.4
- phpunit/php-code-coverage: ^2.1
- phpunit/phpunit: ~4.0
README
Installation
Via Composer
$ composer require sleimanx2/grawler
Basic usage
Fetch the page DOM
require_once('vendor/autoload.php');
$client = new Bowtie\Grawler\Client();
$grawler = $client->download('http://example.com');
Find basic attributes
$grawler->title();
// provide a CSS path to find the attribute
$grawler->body($path = '.main-content');
// extracts meta keywords (array)
$grawler->keywords();
// extracts meta description
$grawler->description();
Find media
$grawler->images('.content img');
$grawler->videos('iframe');
$grawler->audio('.audio iframe');
Resolve media attributes
To resolve media attributes, you need to load the providers' configuration.
Video
Current video resolvers (YouTube, Vimeo)
// resolve all videos at once
$videos = $grawler->videos('iframe')->resolve();
You can then access the video attributes as follows
foreach ($videos as $video) {
    $video->id; // the video provider id
    $video->title;
    $video->description;
    $video->url;
    $video->embedUrl;
    $video->images; // Collection of Image instances
    $video->author;
    $video->authorId;
    $video->duration;
    $video->provider; // video source
}
You can also resolve videos individually as follows
$videos = $grawler->videos('iframe');
foreach ($videos as $video) {
    $video->resolve();
    $video->title;
    //...
}
Audio
Current audio resolvers (SoundCloud)
// resolve all audio at once
$audio = $grawler->audio('.audio iframe')->resolve();
You can then access the audio attributes as follows
foreach ($audio as $track) {
    $track->id; // the audio provider id
    $track->title;
    $track->description;
    $track->url;
    $track->embedUrl;
    $track->images; // Collection of cover photo instances
    $track->author;
    $track->authorId;
    $track->duration;
    $track->provider; // audio source
}
You can also resolve audio tracks individually as follows
$audio = $grawler->audio('.audio iframe');
foreach ($audio as $track) {
    $track->resolve();
    $track->title;
    //...
}
Resolve page links
$links = $grawler->links('.main thumb a');
foreach ($links as $link) {
    print $link;
    // or
    print $link->uri;
    // or
    print $link->getUri();
}
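Before passing discovered links back to download(), it can help to drop duplicates and off-site URLs. The helper below is a minimal sketch in plain PHP and is not part of Grawler: the function name filterSameHost and the sample URLs are hypothetical, and it assumes the links have already been converted to absolute URI strings (for example with $link->getUri()).

```php
<?php

/**
 * Keep only unique URIs whose host matches $host (case-insensitive).
 * Hypothetical helper for illustration; not part of Grawler.
 */
function filterSameHost(array $uris, $host)
{
    $kept = [];
    foreach ($uris as $uri) {
        $uriHost = parse_url($uri, PHP_URL_HOST);
        // parse_url() gives null (or false on malformed input) when there is no host
        if (!is_string($uriHost) || strcasecmp($uriHost, $host) !== 0) {
            continue;
        }
        $kept[$uri] = true; // array keys deduplicate the URIs
    }
    return array_keys($kept);
}

// Sample input: one duplicate, one off-site URL, one relative path
$uris = [
    'http://example.com/a',
    'http://example.com/a',
    'http://other.test/b',
    '/relative/path',
];
print_r(filterSameHost($uris, 'example.com'));
```

Relative paths are dropped here because they carry no host; whether that is desirable depends on how the target site writes its links.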
Configuration
Client configuration
Set the user agent
$client->agent('Googlebot/2.1')->download('http://example.com');
Recommended reading: http://webmasters.stackexchange.com/questions/6205/what-user-agent-should-i-set
Set request authentication
$client->auth('me', '**');
You can change the authentication type as follows
$client->auth('me', '**', $type = 'basic');
Set the request method
$client->method('post');
Grawler configuration
By default, Grawler tries to read the following environment variables
GRAWLER_YOUTUBE_KEY
GRAWLER_VIMEO_KEY
GRAWLER_VIMEO_SECRET
GRAWLER_SOUNDCLOUD_KEY
GRAWLER_SOUNDCLOUD_SECRET
If you do not use environment variables, you can load the configuration as follows.
$config = [
    'youtubeKey' => '',
    'vimeoKey' => '',
    'vimeoSecret' => '',
    'soundcloudKey' => '',
    'soundcloudSecret' => '',
];
$grawler->loadConfig($config);
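If the keys are stored in the environment but you prefer to pass them explicitly, the array can be assembled with getenv(). This is a sketch under assumptions: configFromEnv is a hypothetical helper, and the mapping from environment variable names to array keys is inferred from the two lists above rather than taken from Grawler's source.

```php
<?php

/**
 * Build a config array from the GRAWLER_* environment variables.
 * Hypothetical helper; the variable-to-key mapping is an assumption.
 */
function configFromEnv()
{
    $map = [
        'youtubeKey'       => 'GRAWLER_YOUTUBE_KEY',
        'vimeoKey'         => 'GRAWLER_VIMEO_KEY',
        'vimeoSecret'      => 'GRAWLER_VIMEO_SECRET',
        'soundcloudKey'    => 'GRAWLER_SOUNDCLOUD_KEY',
        'soundcloudSecret' => 'GRAWLER_SOUNDCLOUD_SECRET',
    ];

    $config = [];
    foreach ($map as $key => $variable) {
        $value = getenv($variable);
        // getenv() returns false when the variable is unset; fall back to ''
        $config[$key] = ($value === false) ? '' : $value;
    }
    return $config;
}

$config = configFromEnv();
// Then pass it on as shown above: $grawler->loadConfig($config);
```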
Testing
$ phpunit --testsuite unit
$ phpunit --testsuite integration
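Note: you should set your provider keys (YouTube, Vimeo, SoundCloud...) in order to run the integration tests.
Contributing
Please see CONTRIBUTING
Security
If you discover any security-related issues, please email sleiman@bowtie.land instead of using the issue tracker.
License
The MIT License (MIT). Please see the License File for more information.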
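A quick pre-flight check can show which provider keys are still unset before the integration suite is started. The snippet below is a rough sketch in plain PHP: missingEnv is a hypothetical helper that is not shipped with Grawler, and it assumes the integration tests rely on the GRAWLER_* environment variables listed under the Grawler configuration.

```php
<?php

/**
 * Return the names of the given environment variables that are unset or empty.
 * Hypothetical helper; not part of Grawler or its test suite.
 */
function missingEnv(array $names)
{
    return array_values(array_filter($names, function ($name) {
        $value = getenv($name);
        return $value === false || $value === '';
    }));
}

$missing = missingEnv([
    'GRAWLER_YOUTUBE_KEY',
    'GRAWLER_VIMEO_KEY',
    'GRAWLER_VIMEO_SECRET',
    'GRAWLER_SOUNDCLOUD_KEY',
    'GRAWLER_SOUNDCLOUD_SECRET',
]);

if ($missing) {
    echo 'Missing provider keys: ' . implode(', ', $missing) . PHP_EOL;
}
```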