malahierba-lab/web-harvester

Laravel HTTP client with JavaScript support

1.2.2 2016-08-27 16:56 UTC

README

A tool for retrieving information from external websites. Powered by PhantomJS and the malahierba.cl dev team.

Installation

Add the package to your composer.json:

{
    "require": {
        "malahierba-lab/web-harvester": "1.*"
    }
}

Then run the composer update command.

After installation, you must register the service provider. Add it to the providers section of the config/app.php file:

Malahierba\WebHarvester\WebHarvesterServiceProvider::class
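
For orientation, here is a minimal sketch of where that line would sit, assuming the stock Laravel layout in which config/app.php returns an array with a providers key. The other keys and providers are elided.

```php
<?php
// config/app.php (fragment; other keys and providers omitted)
return [
    'providers' => [
        // ...framework and application providers...

        Malahierba\WebHarvester\WebHarvesterServiceProvider::class,
    ],
];
```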

Now publish the configuration file by running php artisan vendor:publish

Configuration

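Laravel Web Harvester runs on the PhantomJS headless WebKit browser. PhantomJS is bundled as a binary, so you must specify your operating system before using this package. This is done in the configuration file config/webharvester.php.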

Set the environment option to one of the supported values:

  • linux-i686-32
  • linux-i686-64
  • macosx
  • windows

Example: 'environment' => 'macosx'
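
If the same code base is deployed to several platforms, the value could be derived rather than hard-coded. The sketch below is one possible approach, not something the package provides: detectHarvesterEnvironment is a hypothetical helper, it needs PHP 7.2 or later for PHP_OS_FAMILY, and it assumes that PHP's integer width matches the word size of the operating system.

```php
<?php
// Hypothetical helper, not part of the package: maps the running PHP
// platform to one of the four environment strings listed above.
function detectHarvesterEnvironment(
    string $osFamily = PHP_OS_FAMILY,
    int $intSize = PHP_INT_SIZE
): string {
    switch ($osFamily) {
        case 'Windows':
            return 'windows';
        case 'Darwin':
            return 'macosx';
        case 'Linux':
            // Assumption: an 8-byte integer means a 64-bit system.
            return $intSize === 8 ? 'linux-i686-64' : 'linux-i686-32';
        default:
            throw new RuntimeException(
                "No supported environment for OS family: {$osFamily}"
            );
    }
}
```

The returned string is the value that would go into the environment option.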

Usage

Important: for illustrative purposes, the examples below always assume that you have imported the library into your namespace with use Malahierba\WebHarvester;

Get web page components

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {

    //Page Title
    $title                   = $webharvester->getTitle();

    //Page Description
    $description             = $webharvester->getDescription();

    //Get Status Code (if the URL redirects to another page, this is the status code of the final page)
    $status_code             = $webharvester->getStatusCode();

    //Page Featured Image as URL
    $featured_image_url      = $webharvester->getFeaturedImage();

    //Page Featured Image as Base64
    $featured_image_base_64  = $webharvester->getFeaturedImage('base64');

    //Page real URL (if $url redirects to another URL, this is the final one)
    $real_url                = $webharvester->getRealURL();

    //Site Name
    $sitename                = $webharvester->getSiteName();
}

Get the intended behavior for robots (based on meta name="robots")

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {

    //check for index
    if ($webharvester->isIndexable()) {

        //...some code

    }

    //check for follow
    if ($webharvester->isFollowable()) {

        //...some code
        
    }

}
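
To make the two checks concrete, the following sketch shows how the content attribute of a robots meta tag is conventionally interpreted: missing directives allow indexing and following, and none is shorthand for noindex, nofollow. It is illustrative only. parseRobotsContent is a hypothetical function, and the package's own handling may differ.

```php
<?php
// Illustrative only: not the package's implementation.
// Interprets the content attribute of <meta name="robots" content="...">.
function parseRobotsContent(string $content): array
{
    // Directives are comma-separated and case-insensitive.
    $directives = array_map('trim', explode(',', strtolower($content)));

    // "none" blocks both indexing and following.
    $blockAll = in_array('none', $directives, true);

    return [
        'indexable'  => !$blockAll && !in_array('noindex', $directives, true),
        'followable' => !$blockAll && !in_array('nofollow', $directives, true),
    ];
}
```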

Get the links found in a web page (useful for web crawlers, spiders, etc.)

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {

    //all full links as array

    $links = $webharvester->getLinks();  //retrieve an array with found links

    //all links as array, but query component removed (from the character "?" onwards)

    $links = $webharvester->getLinks([
        'remove' => ['query']
    ]);

    //retrieve links as an array of objects (properties: url, follow)
    //follow is false when the source website marked the link as nofollow (rel='nofollow')

    $links = $webharvester->getLinks(['only_urls' => false]); //default true

}

Important: for security reasons, links with embedded JavaScript are never included in the output array
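
A crawler would typically filter and normalize these links before queueing them. The sketch below is a hypothetical post-processing step, not part of the package. It assumes link objects shaped as described above (url and follow properties) and uses stand-in data in place of a real getLinks(['only_urls' => false]) result.

```php
<?php
// Hypothetical post-processing, not part of the package.
// $links mimics the documented shape: objects with url and follow properties.
function followableUrls(array $links): array
{
    $urls = [];
    foreach ($links as $link) {
        if (!$link->follow) {
            continue; // the source site marked this link rel="nofollow"
        }
        // Drop everything from the "?" character onwards.
        $urls[] = explode('?', $link->url, 2)[0];
    }
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

// Stand-in data; real input would come from getLinks(['only_urls' => false]).
$links = [
    (object) ['url' => 'http://someurl/a?page=1', 'follow' => true],
    (object) ['url' => 'http://someurl/a?page=2', 'follow' => true],
    (object) ['url' => 'http://someurl/b',        'follow' => false],
];

$queue = followableUrls($links);
```

Here the two /a links collapse into a single entry once their query strings are removed, and /b is skipped because it is marked nofollow.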

Get the raw content of a web page

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {
    $raw = $webharvester->content();
}

Take a screenshot of a web page

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->takeScreenshot($url)) {
    $image_base_64 = $webharvester->content();  //returns a base64 string
}
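
Since the screenshot comes back as a base64 string, writing it to disk takes a decoding step. A minimal sketch using only PHP built-ins follows. saveBase64Image is a hypothetical helper, and the image format (and therefore the right file extension) is not stated here, so the target path is left to the caller.

```php
<?php
// Hypothetical helper, not part of the package: decodes a base64 string
// and writes the resulting bytes to $path.
function saveBase64Image(string $base64, string $path): bool
{
    // Strict mode makes base64_decode return false on invalid characters.
    $bytes = base64_decode($base64, true);
    if ($bytes === false) {
        return false;
    }

    // file_put_contents returns the byte count, or false on failure.
    return file_put_contents($path, $bytes) !== false;
}
```

It would be called as saveBase64Image($image_base_64, $path) with the string obtained above.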

Setting options

You can customize the web harvester with a few options:

$webharvester = new WebHarvester;

//Custom User Agent
$webharvester->setUserAgent('your user agent');

//Ignore SSL Errors
$webharvester->setIgnoreSSLErrors(true);

//Resource Timeout (in milliseconds)
$webharvester->setResourceTimeout(3000);

//Wait after load (in milliseconds)
$webharvester->setWaitAfterLoad(3000);  // <- useful for capturing content loaded asynchronously

License

This project is released under the MIT License. For more information, read the LICENCE file.