malahierba-lab / web-harvester
Laravel HTTP client with JavaScript capabilities
1.2.2
2016-08-27 16:56 UTC
Requires
- php: >=5.5.18
- laravel/framework: 5.*
README
A tool for getting information from external websites. Powered by PhantomJS and the malahierba.cl dev team.
Installation
Add to your composer.json:
{
    "require": {
        "malahierba-lab/web-harvester": "1.*"
    }
}
Then run the composer update command.
After installation, you must register the service provider. Add it to the providers section of the config/app.php file:
Malahierba\WebHarvester\WebHarvesterServiceProvider::class
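For orientation, here is a sketch of where that line typically goes. It assumes the standard Laravel 5 layout of config/app.php, where providers is an array key; the surrounding entries are elided and will differ in your application.

```php
<?php
// config/app.php (excerpt, assuming the default Laravel 5 structure).
return [
    // ...other configuration keys...

    'providers' => [
        // ...providers already registered by Laravel and other packages...

        // Added for this package:
        Malahierba\WebHarvester\WebHarvesterServiceProvider::class,
    ],
];
```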
Now publish the configuration file by running php artisan vendor:publish
Configuration
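Laravel Web Harvester runs on the PhantomJS headless WebKit browser. The tool is bundled as a binary, so before using this package you need to specify your operating system. You can do this in the configuration file config/webharvester.php.
Set the environment option to one of the supported values: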
- linux-i686-32
- linux-i686-64
- macosx
- windows
Example: 'environment' => 'macosx'
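As a rough sketch, the published configuration file might look like the following for a 64-bit Linux host. It assumes the file returns a plain array, as Laravel configuration files usually do; only the environment key is documented here, and the published file may contain other keys that should be left in place.

```php
<?php
// config/webharvester.php (sketch; only 'environment' is documented in this README).
return [
    // One of: linux-i686-32, linux-i686-64, macosx, windows
    'environment' => 'linux-i686-64',
];
```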
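Usage
Important: for illustration purposes, the examples below always assume that you import the library into your namespace with use Malahierba\WebHarvester;
Get web page components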
$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check whether the URL can be processed, and load it
if ($webharvester->load($url)) {
    // Page title
    $title = $webharvester->getTitle();

    // Page description
    $description = $webharvester->getDescription();

    // Status code (if the URL redirects, this is the status code of the final page)
    $status_code = $webharvester->getStatusCode();

    // Featured image as a URL
    $featured_image_url = $webharvester->getFeaturedImage();

    // Featured image as Base64
    $featured_image_base_64 = $webharvester->getFeaturedImage('base64');

    // Real URL (if $url redirects, this is the final URL)
    $real_url = $webharvester->getRealURL();

    // Site name
    $sitename = $webharvester->getSiteName();
}
Get the intended behavior for robots (based on meta name="robots")
$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check whether the URL can be processed, and load it
if ($webharvester->load($url)) {
    // Check whether the page may be indexed
    if ($webharvester->isIndexable()) {
        //...some code
    }

    // Check whether the links on the page may be followed
    if ($webharvester->isFollowable()) {
        //...some code
    }
}
Get the links found on a web page (useful for web crawlers, web spiders, etc.)
$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check whether the URL can be processed, and load it
if ($webharvester->load($url)) {
    // All full links, as an array of URLs
    $links = $webharvester->getLinks();

    // All links as an array, with the query component removed (from the "?" character onwards)
    $links = $webharvester->getLinks([
        'remove' => ['query']
    ]);

    // Links as an array of objects (properties: url, follow).
    // follow is false when the source website marked the link as rel='nofollow'.
    $links = $webharvester->getLinks(['only_urls' => false]); // only_urls defaults to true
}
Important: for security reasons, links with embedded JavaScript are not included in the output array.
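When building a crawler on top of the object form of getLinks(), a common next step is to keep only the links the source site allows you to follow. The sketch below shows one way to do that. followableUrls() is a hypothetical helper, not part of the package, and it assumes each link object exposes the url and follow properties described above.

```php
<?php

/**
 * Return the unique URLs of links that are not marked rel="nofollow".
 * Hypothetical helper for illustration; not part of the package.
 *
 * @param  array $links Objects with "url" and "follow" properties.
 * @return array List of URL strings, in first-seen order.
 */
function followableUrls(array $links)
{
    $urls = [];

    foreach ($links as $link) {
        // Skip links whose follow flag is false or missing.
        if (!empty($link->follow)) {
            // Use the URL as the array key so duplicates collapse.
            $urls[$link->url] = true;
        }
    }

    return array_keys($urls);
}

// Usage with the package (assumes load() has already succeeded):
// $links = $webharvester->getLinks(['only_urls' => false]);
// $queue = followableUrls($links);
```

Deduplicating at this stage keeps a crawl queue from revisiting the same URL when a page links to it more than once.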
Get the raw content of a web page
$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check whether the URL can be processed, and load it
if ($webharvester->load($url)) {
    $raw = $webharvester->content();
}
Take a screenshot of a web page
$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check whether the URL can be processed, and take the screenshot
if ($webharvester->takeScreenshot($url)) {
    $image_base_64 = $webharvester->content(); // returns a base64 string
}
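Since the screenshot comes back as a base64 string, you will usually want to decode it before storing it. The following is a minimal sketch of that step. saveBase64ToFile() is a hypothetical helper, not part of the package; it assumes content() returns bare base64 without a data URI prefix. The image format is not stated in this README, so any file extension you choose is an assumption.

```php
<?php

/**
 * Decode a base64 string and write the resulting bytes to a file.
 * Hypothetical helper for illustration; not part of the package.
 *
 * @param  string $base64 Base64-encoded data.
 * @param  string $path   Destination file path.
 * @return bool   True on success, false on invalid base64 or a failed write.
 */
function saveBase64ToFile($base64, $path)
{
    // Strict mode returns false if the input contains characters
    // outside the base64 alphabet.
    $bytes = base64_decode($base64, true);

    if ($bytes === false) {
        return false;
    }

    // file_put_contents returns the number of bytes written, or false on failure.
    return file_put_contents($path, $bytes) !== false;
}

// Usage with the package (assumes takeScreenshot() has already succeeded):
// saveBase64ToFile($webharvester->content(), '/tmp/screenshot');
```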
Setting options
You can customize the web harvester with the following options:
$webharvester = new WebHarvester;

// Custom user agent
$webharvester->setUserAgent('your user agent');

// Ignore SSL errors
$webharvester->setIgnoreSSLErrors(true);

// Resource timeout (in milliseconds)
$webharvester->setResourceTimeout(3000);

// Wait after load (in milliseconds); useful for pages that load content asynchronously
$webharvester->setWaitAfterLoad(3000);
License
This project is licensed under the MIT License. For more information, read the LICENCE file.