sleimanx2/grawler

A guided HTML scraper with media metadata extraction

0.2.4 2019-09-29 11:29 UTC

This package is auto-updated.

Last update: 2024-09-14 07:32:39 UTC


README

Installation

Via Composer

$ composer require sleimanx2/grawler

Basic usage

Get the page DOM
require_once('vendor/autoload.php');

$client = new Bowtie\Grawler\Client();

$grawler = $client->download('http://example.com');
Find basic attributes
$grawler->title();
// pass a CSS selector to locate the body content
$grawler->body($path = '.main-content');
// extracts the meta keywords (array)
$grawler->keywords();
// extracts the meta description
$grawler->description();
Find media
$grawler->images('.content img');
$grawler->videos('iframe');
$grawler->audio('.audio iframe');

Resolving media attributes

To resolve media attributes, you need to load the providers' configuration.

Video

Current video resolvers: YouTube, Vimeo

// resolve all videos at once 
$videos = $grawler->videos('iframe')->resolve();

You can then access the video attributes as follows

foreach($videos as $video)
{
  $video->id; // the video provider id
  $video->title;
  $video->description;
  $video->url;
  $video->embedUrl;
  $video->images; // Collection of Image instances
  $video->author;
  $video->authorId;
  $video->duration;
  $video->provider; //video source
}

You can also resolve videos individually as follows

$videos = $grawler->videos('iframe');

foreach($videos as $video)
{
  $video->resolve();
  $video->title;
  //...
}

Audio

Current audio resolvers: SoundCloud

// resolve all audio at once 
$audio = $grawler->audio('.audio iframe')->resolve();

You can then access the track attributes as follows

foreach($audio as $track)
{
  $track->id; // the audio provider id
  $track->title;
  $track->description;
  $track->url;
  $track->embedUrl;
  $track->images; // Collection of cover photo instances
  $track->author;
  $track->authorId;
  $track->duration;
  $track->provider; // audio source
}

You can also resolve audio tracks individually as follows

$audio = $grawler->audio('.audio iframe');

foreach($audio as $track)
{
  $track->resolve();
  $track->title;
  //...
}

Resolving page links

$links = $grawler->links('.main thumb a');

foreach($links as $link)
{
  print $link;
  //or
  print $link->uri;
  //or
  print $link->getUri();
}
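
Once the links are printed or cast to strings, a common next step is to keep only the ones that point at the site being crawled. The sketch below is not part of Grawler: `sameHostLinks()` is a hypothetical helper, and it works on a plain array of URI strings (for example, the values of `$link->getUri()` collected in the loop above), using only PHP's built-in `parse_url()`.

```php
<?php
// Hypothetical helper, not provided by Grawler.
// Keeps the URIs whose host matches $host (case-insensitive) and drops duplicates.
function sameHostLinks(array $uris, string $host): array
{
    $kept = [];
    foreach ($uris as $uri) {
        // parse_url() returns null or false when no host can be extracted.
        $linkHost = parse_url($uri, PHP_URL_HOST);
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $kept[$uri] = true; // array keys deduplicate identical URIs
        }
    }
    return array_keys($kept);
}

// Sample input standing in for URIs collected from $grawler->links(...).
$uris = [
    'http://example.com/a',
    'http://example.com/a',
    'http://other.test/b',
    'http://EXAMPLE.com/c',
];

$result = sameHostLinks($uris, 'example.com');
print_r($result);
```

Relative URIs have no host component and are dropped by this sketch; resolve them against the page URL first if they should be kept.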

Configuration

Client configuration

Setting the user agent
$client->agent('Googlebot/2.1')->download('http://example.com');

Recommended reading: http://webmasters.stackexchange.com/questions/6205/what-user-agent-should-i-set

Setting request authentication
$client->auth('me', '**');

You can change the authentication type as follows

$client->auth('me', '**', $type = 'basic');
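
For context on what the `'basic'` type refers to: HTTP Basic authentication sends the credentials as a base64-encoded `user:password` pair in the `Authorization` header. The sketch below only illustrates that encoding with PHP built-ins. `basicAuthHeader()` is a hypothetical helper, and the README does not say how Grawler builds the header internally.

```php
<?php
// Hypothetical helper illustrating the HTTP Basic scheme; not Grawler's own code.
function basicAuthHeader(string $user, string $password): string
{
    // Basic credentials are "user:password", base64-encoded.
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

$header = basicAuthHeader('me', 'secret');
print $header . PHP_EOL;
```

Because base64 is an encoding and not encryption, Basic credentials should only be sent over HTTPS.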
Setting the request method
$client->method('post');

Grawler configuration

By default, Grawler tries to read the following environment variables

GRAWLER_YOUTUBE_KEY

GRAWLER_VIMEO_KEY
GRAWLER_VIMEO_SECRET

GRAWLER_SOUNDCLOUD_KEY
GRAWLER_SOUNDCLOUD_SECRET

If you are not using environment variables, you can load the configuration as follows.

$config = [
  'youtubeKey' => '',

  'vimeoKey'    => '',
  'vimeoSecret' => '',

  'soundcloudKey'    => '',
  'soundcloudSecret' => '',
];

$grawler->loadConfig($config);
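
If some keys live in the environment and others are supplied in code, the two sources can be merged into one array before it is passed to `loadConfig()`. The following is a minimal sketch under stated assumptions: `grawlerConfigFromEnv()` is a hypothetical helper, and the pairing of each config key with an environment variable is inferred from the names listed above, not confirmed by the library.

```php
<?php
// Hypothetical helper, not part of Grawler.
// Builds a config array from environment variables, with explicit overrides.
function grawlerConfigFromEnv(array $overrides = []): array
{
    // Assumed mapping: config key => environment variable (inferred from naming).
    $map = [
        'youtubeKey'       => 'GRAWLER_YOUTUBE_KEY',
        'vimeoKey'         => 'GRAWLER_VIMEO_KEY',
        'vimeoSecret'      => 'GRAWLER_VIMEO_SECRET',
        'soundcloudKey'    => 'GRAWLER_SOUNDCLOUD_KEY',
        'soundcloudSecret' => 'GRAWLER_SOUNDCLOUD_SECRET',
    ];

    $config = [];
    foreach ($map as $key => $envName) {
        $value = getenv($envName);
        // getenv() returns false when the variable is not set.
        $config[$key] = ($value === false) ? '' : $value;
    }

    // Values passed explicitly take precedence over the environment.
    return array_merge($config, $overrides);
}

// Demo values only; real keys come from the providers.
putenv('GRAWLER_YOUTUBE_KEY=demo-key');
$config = grawlerConfigFromEnv(['vimeoKey' => 'override']);
print_r($config);
```

The resulting array has the same shape as the `$config` example above, so it could be handed to `$grawler->loadConfig($config)` unchanged.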

Testing

$ phpunit --testsuite unit
$ phpunit --testsuite integration

Note: you need to set your provider keys (YouTube, Vimeo, SoundCloud...) to run the integration tests.

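Contributing

Please see CONTRIBUTING

Security

If you discover any security-related issues, please email sleiman@bowtie.land instead of using the issue tracker.

License

The MIT License (MIT). Please see the License File for more information.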