j0k3r / graby
Graby helps you extract article content from web pages
Requires
- php: >=7.1.3
- ext-curl: *
- ext-tidy: *
- fossar/htmlawed: ^1.2.7
- guzzlehttp/psr7: ^1.5.0|^2.0
- http-interop/http-factory-guzzle: ^1.1
- j0k3r/graby-site-config: ^1.0.181
- j0k3r/httplug-ssrf-plugin: ^2.0
- j0k3r/php-readability: ^1.2.10
- monolog/monolog: ^1.18.0|^2.0
- php-http/client-common: ^2.7
- php-http/discovery: ^1.19
- php-http/httplug: ^2.4
- php-http/message: ^1.14
- simplepie/simplepie: ^1.7
- smalot/pdfparser: ^1.1
- symfony/options-resolver: ^3.4|^4.4|^5.3|^6.0|^7.0
- true/punycode: ^2.1
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.0
- guzzlehttp/guzzle: ^6.3.0
- php-http/guzzle6-adapter: ^2.0
- php-http/mock-client: ^1.4
- phpstan/extension-installer: ^1.0
- phpstan/phpstan: ^0.12
- phpstan/phpstan-deprecation-rules: ^0.12
- phpstan/phpstan-phpunit: ^0.12
- symfony/phpunit-bridge: ^6.4.1
- dev-master
- 2.x-dev
- 2.4.5
- 2.4.4
- 2.4.3
- 2.4.2
- 2.4.1
- 2.4.0
- 2.3.5
- 2.3.4
- 2.3.3
- 2.3.2
- 2.3.1
- 2.3.0
- 2.2.7
- 2.2.6
- 2.2.5
- 2.2.4
- v2.2.3
- v2.2.2
- 2.2.1
- 2.2.0
- 2.1.1
- 2.1.0
- 2.0.2
- 2.0.1
- 2.0.0
- 2.0.0-alpha.0
- 1.x-dev
- 1.20.1
- 1.20.0
- 1.19.1
- 1.19.0
- 1.18.1
- 1.18.0
- 1.17.0
- 1.16.0
- 1.15.5
- 1.15.4
- 1.15.3
- 1.15.2
- 1.15.1
- 1.15.0
- 1.14.0
- 1.13.6
- 1.13.5
- 1.13.4
- 1.13.3
- 1.13.2
- 1.13.1
- 1.13.0
- 1.12.1
- 1.12.0
- 1.11.1
- 1.11.0
- 1.10.1
- 1.10.0
- 1.9.3
- 1.9.2
- 1.9.1
- 1.9.0
- 1.8.2
- 1.8.1
- 1.8.0
- 1.7.1
- 1.7.0
- 1.6.2
- 1.6.1
- 1.6.0
- 1.5.4
- 1.5.3
- 1.5.2
- 1.5.1
- 1.5.0
- 1.4.5
- 1.4.4
- 1.4.3
- 1.4.2
- 1.4.1
- 1.4.0
- 1.3.0
- 1.2.0
- 1.1.0
- 1.0.8
- 1.0.7
- 1.0.6
- 1.0.5
- 1.0.4
- 1.0.3
- 1.0.2
- 1.0.1
- 1.0.0
- 1.0.0-alpha.2
- 1.0.0-alpha.1
- 1.0.0-alpha.0
- dev-fix/cookie-multiple-pages
This package is auto-updated.
Last update: 2024-09-16 00:03:02 UTC
README

Graby helps you extract article content from web pages
- it's based on php-readability
- it uses site_config to extract content from websites
- it's a fork of Full-Text RSS v3.3 from @fivefilters
Why this fork?
Full-Text RSS works great as a standalone application. But when you need to encapsulate it in your own library, it becomes a mess. You need this kind of ugly thing:

```php
$article = 'http://www.bbc.com/news/entertainment-arts-32547474';
$request = 'http://example.org/full-text-rss/makefulltextfeed.php?format=json&url='.urlencode($article);
$result = @file_get_contents($request);
```

Also, if you want to understand how things work internally, it's hard to read and understand. And finally, there were no tests at all.
That's why I made this fork:
- easy to integrate (using composer)
- fully tested
- (hopefully) easier to understand
- a bit more decoupled
How to use it
Note: these instructions are for the development version of Graby, whose API is incompatible with the stable release. Check the README in the 2.x branch for usage instructions for the stable version.
Requirements
- PHP >= 7.4
- Tidy & cURL extensions enabled
Installation
Add the library using Composer:

```shell
composer require 'j0k3r/graby dev-master' php-http/guzzle7-adapter
```
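Alternatively, you can declare the requirements in your composer.json yourself. The fragment below mirrors the command above; the open `*` constraint for the adapter is only an illustration, pick a real constraint for your project:

```json
{
    "require": {
        "j0k3r/graby": "dev-master",
        "php-http/guzzle7-adapter": "*"
    }
}
```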
Why php-http/guzzle7-adapter? Because Graby is decoupled from any HTTP client implementation, thanks to HTTPlug.

Graby is tested and should work well with:
- Guzzle 7 (using php-http/guzzle7-adapter)
- Guzzle 5 (using php-http/guzzle5-adapter)
- cURL (using php-http/curl-client)

Note: if you want to use Guzzle 6, use Graby 2 (support was dropped in v3 because of a dependency conflict, unlike for Guzzle 5 🤷)
Retrieve content from a URL
Use the class to retrieve content:

```php
use Graby\Graby;

$article = 'http://www.bbc.com/news/entertainment-arts-32547474';

$graby = new Graby();
$result = $graby->fetchContent($article);

var_dump($result->getResponse()->getStatus()); // 200
var_dump($result->getHtml()); // "[Fetched and readable content…]"
var_dump($result->getTitle()); // "Ben E King: R&B legend dies at 76"
var_dump($result->getLanguage()); // "en-GB"
var_dump($result->getDate()); // "2015-05-01T16:24:37+01:00"
var_dump($result->getAuthors()); // ["BBC News"]
var_dump((string) $result->getResponse()->getEffectiveUri()); // "http://www.bbc.com/news/entertainment-arts-32547474"
var_dump($result->getImage()); // "https://ichef-1.bbci.co.uk/news/720/media/images/82709000/jpg/_82709878_146366806.jpg"
var_dump($result->getSummary()); // "Ben E King received an award from the Songwriters Hall of Fame in …"
var_dump($result->getIsNativeAd()); // false
var_dump($result->getResponse()->getHeaders());
/*
[
    'server' => ['Apache'],
    'content-type' => ['text/html; charset=utf-8'],
    'x-news-data-centre' => ['cwwtf'],
    'content-language' => ['en'],
    'x-pal-host' => ['pal074.back.live.cwwtf.local:80'],
    'x-news-cache-id' => ['13648'],
    'content-length' => ['157341'],
    'date' => ['Sat, 29 Apr 2017 07:35:39 GMT'],
    'connection' => ['keep-alive'],
    'cache-control' => ['private, max-age=60, stale-while-revalidate'],
    'x-cache-action' => ['MISS'],
    'x-cache-age' => ['0'],
    'x-lb-nocache' => ['true'],
    'vary' => ['X-CDN,X-BBC-Edge-Cache,Accept-Encoding'],
]
*/
```
In case of an error while fetching the URL, graby won't throw an exception but will return information about the error (at least the status code):

```php
var_dump($result->getResponse()->getStatus()); // 200
var_dump($result->getHtml()); // "[unable to retrieve full-text content]"
var_dump($result->getTitle()); // "BBC - 404: Not Found"
var_dump($result->getLanguage()); // "en-GB"
var_dump($result->getDate()); // null
var_dump($result->getAuthors()); // []
var_dump((string) $result->getResponse()->getEffectiveUri()); // "http://www.bbc.co.uk/404"
var_dump($result->getImage()); // null
var_dump($result->getSummary()); // "[unable to retrieve full-text content]"
var_dump($result->getIsNativeAd()); // false
var_dump($result->getResponse()->getHeaders()); // […]
```
The date result is the same as the one displayed in the content. If the date in the result isn't null, we recommend you parse it using date_parse (this is what we use to validate the date).
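For example, PHP's built-in date_parse() applied to a date string like the one returned by getDate() in the example above:

```php
<?php

// Date string as returned by $result->getDate() in the example above
$date = '2015-05-01T16:24:37+01:00';

$parts = date_parse($date);

// date_parse() returns an array of components; error_count is 0 for a valid date
if (0 === $parts['error_count'] && false !== $parts['year']) {
    printf("%04d-%02d-%02d\n", $parts['year'], $parts['month'], $parts['day']);
    // prints 2015-05-01
}
```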
Retrieve content from a prefetched page
If you want to extract content from a page you fetched outside of Graby, you can call setContentAsPrefetched() before calling fetchContent(), e.g.:

```php
use Graby\Graby;

$article = 'http://www.bbc.com/news/entertainment-arts-32547474';
$input = '<html>[...]</html>';

$graby = new Graby();
$graby->setContentAsPrefetched($input);
$result = $graby->fetchContent($article);
```
Cleanup content
Since version 1.9.0 you can also send HTML content to be cleaned up the same way as content retrieved by graby from a URL. The URL is still required, in order to convert links to absolute URLs, etc.

```php
use Graby\Graby;

$article = 'http://www.bbc.com/news/entertainment-arts-32547474';

// use your own way to retrieve html or to provide html
$html = ...

$graby = new Graby();
$result = $graby->cleanupHtml($html, $article);
```
Use a custom handler & formatter to see output logs
You can use them to display graby output logs to the end user. This is meant to be used in a Symfony project with Monolog.
Define the graby handler service (somewhere in your service.yml):

```yaml
services:
    # ...
    graby.log_handler:
        class: Graby\Monolog\Handler\GrabyHandler
```
Then define the Monolog handler in your app/config/config.yml:

```yaml
monolog:
    handlers:
        graby:
            type: service
            id: graby.log_handler
            # use "debug" to get a lot of data (like HTML at each step), otherwise "info" is fine
            level: debug
            channels: ['graby']
```
You can then retrieve graby logs in your controller using:

```php
$logs = $this->get('monolog.handler.graby')->getRecords();
```
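getRecords() returns an array of plain Monolog record arrays. A minimal sketch of displaying them to the user, where the $logs content below is hypothetical and 'level_name' / 'message' are standard Monolog record keys:

```php
<?php

// Hypothetical records, shaped like Monolog record arrays
$logs = [
    ['level_name' => 'INFO', 'message' => 'Fetching url: http://example.com'],
    ['level_name' => 'DEBUG', 'message' => 'Trying extraction with site config'],
];

foreach ($logs as $record) {
    printf("[%s] %s\n", $record['level_name'], $record['message']);
}
```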
Timeout configuration
If you need to define a timeout, you must create the Http\Client\HttpClient manually, configure it and inject it into Graby\Graby.

- For Guzzle 5:

```php
use Graby\Graby;
use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle5\Client as GuzzleAdapter;

$guzzle = new GuzzleClient([
    'defaults' => [
        'timeout' => 2,
    ],
]);
$graby = new Graby([], new GuzzleAdapter($guzzle));
```
- For Guzzle 7:

```php
use Graby\Graby;
use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle7\Client as GuzzleAdapter;

$guzzle = new GuzzleClient([
    'timeout' => 2,
]);
$graby = new Graby([], new GuzzleAdapter($guzzle));
```
Full configuration
This is the full documented configuration, which is also the default one.
```php
$graby = new Graby([
    // Enable or disable debugging.
    // This will only generate log information in a file (log/graby.log)
    'debug' => false,
    // use 'debug' value if you want more data (HTML at each step for example) to be dumped in a different file (log/html.log)
    'log_level' => 'info',
    // If enabled, relative URLs found in the extracted content are automatically rewritten as absolute URLs.
    'rewrite_relative_urls' => true,
    // If enabled, we will try to follow single page links (e.g. print view) on multi-page articles.
    // Currently this only happens for sites where single_page_link has been defined
    // in a site config file.
    'singlepage' => true,
    // If enabled, we will try to follow next page links on multi-page articles.
    // Currently this only happens for sites where next_page_link has been defined
    // in a site config file.
    'multipage' => true,
    // Error message when content extraction fails
    'error_message' => '[unable to retrieve full-text content]',
    // Default title when we won't be able to extract a title
    'error_message_title' => 'No title found',
    // List of URLs (or parts of a URL) which will be accepted.
    // If the list is empty, all URLs (except those specified in the blocked list below)
    // will be permitted.
    // Example: array('example.com', 'anothersite.org');
    'allowed_urls' => [],
    // List of URLs (or parts of a URL) which will not be accepted.
    // Note: this list is ignored if allowed_urls is not empty
    'blocked_urls' => [],
    // If enabled, we'll pass retrieved HTML content through htmLawed with
    // safe flag on and style attributes denied, see
    // http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/htmLawed_README.htm#s3.6
    // Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.
    'xss_filter' => true,
    // Here you can define different actions based on the Content-Type header returned by server.
    // MIME type as key, action as value.
    // Valid actions:
    // * 'exclude' - exclude this item from the result
    // * 'link' - create HTML link to the item
    'content_type_exc' => [
        'application/zip' => ['action' => 'link', 'name' => 'ZIP'],
        'application/pdf' => ['action' => 'link', 'name' => 'PDF'],
        'image' => ['action' => 'link', 'name' => 'Image'],
        'audio' => ['action' => 'link', 'name' => 'Audio'],
        'video' => ['action' => 'link', 'name' => 'Video'],
        'text/plain' => ['action' => 'link', 'name' => 'Plain text'],
    ],
    // How we handle links in content
    // Valid values:
    // * preserve: nothing is done
    // * footnotes: convert links as footnotes
    // * remove: remove all links
    'content_links' => 'preserve',
    'http_client' => [
        // User-Agent used to fetch content
        'ua_browser' => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2',
        // default referer when fetching content
        'default_referer' => 'http://www.google.co.uk/url?sa=t&source=web&cd=1',
        // Currently allows simple string replace of URLs.
        // Useful for rewriting certain URLs to point to a single page or HTML view.
        // Although using the single_page_link site config instruction is the preferred way to do this, sometimes, as
        // with Google Docs URLs, it's not possible.
        'rewrite_url' => [
            'docs.google.com' => ['/Doc?' => '/View?'],
            'tnr.com' => ['tnr.com/article/' => 'tnr.com/print/article/'],
            '.m.wikipedia.org' => ['.m.wikipedia.org' => '.wikipedia.org'],
            'm.vanityfair.com' => ['m.vanityfair.com' => 'www.vanityfair.com'],
        ],
        // Prevent certain file/mime types
        // HTTP responses which match these content types will
        // be returned without body.
        'header_only_types' => [
            'image',
            'audio',
            'video',
        ],
        // URLs ending with one of these extensions will
        // prompt Humble HTTP Agent to send a HEAD request first
        // to see if returned content type matches $headerOnlyTypes.
        'header_only_clues' => ['mp3', 'zip', 'exe', 'gif', 'gzip', 'gz', 'jpeg', 'jpg', 'mpg', 'mpeg', 'png', 'ppt', 'mov'],
        // User Agent strings - mapping domain names
        'user_agents' => [],
        // AJAX triggers to search for.
        // for AJAX sites, e.g. Blogger with its dynamic views templates.
        'ajax_triggers' => [
            "<meta name='fragment' content='!'",
            '<meta name="fragment" content="!"',
            "<meta content='!' name='fragment'",
            '<meta content="!" name="fragment"',
        ],
        // number of redirections allowed until we assume the request won't be completed
        'max_redirect' => 10,
    ],
    'extractor' => [
        'default_parser' => 'libxml',
        // key is fingerprint (fragment to find in HTML)
        // value is host name to use for site config lookup if fingerprint matches
        // \s* match anything INCLUDING new lines
        'fingerprints' => [
            '/\<meta\s*content=([\'"])blogger([\'"])\s*name=([\'"])generator([\'"])/i' => 'fingerprint.blogspot.com',
            '/\<meta\s*name=([\'"])generator([\'"])\s*content=([\'"])Blogger([\'"])/i' => 'fingerprint.blogspot.com',
            '/\<meta\s*name=([\'"])generator([\'"])\s*content=([\'"])WordPress/i' => 'fingerprint.wordpress.com',
        ],
        'config_builder' => [
            // Directory path to the site config folder WITHOUT trailing slash
            'site_config' => [],
            'hostname_regex' => '/^(([a-zA-Z0-9-]*[a-zA-Z0-9])\.)*([A-Za-z0-9-]*[A-Za-z0-9])$/',
        ],
        'readability' => [
            // filters might be like array('regex' => 'replace with')
            // for example, to remove script content: array('!<script[^>]*>(.*?)</script>!is' => '')
            'pre_filters' => [],
            'post_filters' => [],
        ],
        'src_lazy_load_attributes' => [
            'data-src',
            'data-lazy-src',
            'data-original',
            'data-sources',
            'data-hi-res-src',
        ],
        // these JSON-LD types will be ignored
        'json_ld_ignore_types' => ['Organization', 'WebSite', 'Person', 'VideoGame'],
    ],
]);
```
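You don't have to pass the whole array: the configuration is resolved against these defaults (Graby depends on symfony/options-resolver for this), so, assuming nested sections are resolved against the defaults as well, a sketch like the following should be enough to change only a few keys. The User-Agent string below is hypothetical:

```php
use Graby\Graby;

// Override only the keys that differ from the defaults above
$graby = new Graby([
    'xss_filter' => false,
    'http_client' => [
        'ua_browser' => 'MyBot/1.0', // hypothetical User-Agent
    ],
]);
```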
Credits
- FiveFilters for Full-Text RSS
- Caneco for the awesome logo ✨