j0k3r/graby

Graby helps you extract article content from web pages


2.4.5 2024-01-04 08:46 UTC

README







Why this fork?

Full-Text RSS works great as a standalone application. But when you need to encapsulate it in your own library, it becomes a mess. You need this kind of ugly thing:

$article = 'http://www.bbc.com/news/entertainment-arts-32547474';
$request = 'http://example.org/full-text-rss/makefulltextfeed.php?format=json&url='.urlencode($article);
$result  = @file_get_contents($request);

Also, if you want to understand how things work internally, it's hard to read and understand. And finally, there are no tests at all.

That's why I made this fork:

  1. Easiest way to integrate it (using composer)
  2. Fully tested
  3. (Hopefully) easier to understand
  4. A bit more decoupled

How to use it

Note: These instructions are for the development version of Graby, which is not API-compatible with the stable release. Check out the README in the 2.x branch for instructions on using the stable version.

Requirements

  • PHP >= 7.4
  • Tidy & cURL extensions enabled

Installation

Add the library using Composer:

composer require 'j0k3r/graby dev-master' php-http/guzzle7-adapter

Why php-http/guzzle7-adapter? Because Graby is decoupled from any particular HTTP client implementation thanks to HTTPlug, so you must also install an adapter for the HTTP client you want to use.

Graby is tested with, and should work fine with:

  • Guzzle 7 (using php-http/guzzle7-adapter)
  • Guzzle 5 (using php-http/guzzle5-adapter)
  • cURL (using php-http/curl-client)

Note: if you want to use Guzzle 6, use Graby 2 (support was dropped in v3 because of a dependency conflict, unlike Guzzle 5 🤷).

Retrieve content from a URL

Use the class to retrieve content:

use Graby\Graby;

$article = 'http://www.bbc.com/news/entertainment-arts-32547474';

$graby = new Graby();
$result = $graby->fetchContent($article);

var_dump($result->getResponse()->getStatus()); // 200
var_dump($result->getHtml()); // "[Fetched and readable content…]"
var_dump($result->getTitle()); // "Ben E King: R&B legend dies at 76"
var_dump($result->getLanguage()); // "en-GB"
var_dump($result->getDate()); // "2015-05-01T16:24:37+01:00"
var_dump($result->getAuthors()); // ["BBC News"]
var_dump((string) $result->getResponse()->getEffectiveUri()); // "http://www.bbc.com/news/entertainment-arts-32547474"
var_dump($result->getImage()); // "https://ichef-1.bbci.co.uk/news/720/media/images/82709000/jpg/_82709878_146366806.jpg"
var_dump($result->getSummary()); // "Ben E King received an award from the Songwriters Hall of Fame in …"
var_dump($result->getIsNativeAd()); // false
var_dump($result->getResponse()->getHeaders()); /*
[
  'server' => ['Apache'],
  'content-type' => ['text/html; charset=utf-8'],
  'x-news-data-centre' => ['cwwtf'],
  'content-language' => ['en'],
  'x-pal-host' => ['pal074.back.live.cwwtf.local:80'],
  'x-news-cache-id' => ['13648'],
  'content-length' => ['157341'],
  'date' => ['Sat, 29 Apr 2017 07:35:39 GMT'],
  'connection' => ['keep-alive'],
  'cache-control' => ['private, max-age=60, stale-while-revalidate'],
  'x-cache-action' => ['MISS'],
  'x-cache-age' => ['0'],
  'x-lb-nocache' => ['true'],
  'vary' => ['X-CDN,X-BBC-Edge-Cache,Accept-Encoding'],
]
*/

In case of an error while fetching the URL, graby won't throw an exception but will return information about the error (at least the status code):

var_dump($result->getResponse()->getStatus()); // 200
var_dump($result->getHtml()); // "[unable to retrieve full-text content]"
var_dump($result->getTitle()); // "BBC - 404: Not Found"
var_dump($result->getLanguage()); // "en-GB"
var_dump($result->getDate()); // null
var_dump($result->getAuthors()); // []
var_dump((string) $result->getResponse()->getEffectiveUri()); // "http://www.bbc.co.uk/404"
var_dump($result->getImage()); // null
var_dump($result->getSummary()); // "[unable to retrieve full-text content]"
var_dump($result->getIsNativeAd()); // false
var_dump($result->getResponse()->getHeaders()); // […]

The date result is the same as the one displayed in the content. If the date in the result isn't null, we recommend you parse it using date_parse (this is what we use to validate that the date is correct).
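As a minimal sketch of that validation step, PHP's built-in date_parse() can be used to check the string before converting it to a date object (the date string below is the one from the sample output above):

```php
<?php

// Date string as returned by $result->getDate() in the example above.
$date = '2015-05-01T16:24:37+01:00';

// date_parse() returns an associative array of date components,
// plus error/warning counts we can use for validation.
$parts = date_parse($date);

if ($parts['error_count'] === 0 && false !== $parts['year']) {
    // The date is usable; convert it to an immutable date object.
    $published = new DateTimeImmutable($date);
    echo $published->format(DATE_ATOM); // prints "2015-05-01T16:24:37+01:00"
}
```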

Retrieve content from a prefetched page

If you want to extract content from a page you fetched outside of Graby, you can call setContentAsPrefetched() before calling fetchContent(), e.g.:

use Graby\Graby;

$article = 'http://www.bbc.com/news/entertainment-arts-32547474';

$input = '<html>[...]</html>';

$graby = new Graby();
$graby->setContentAsPrefetched($input);
$result = $graby->fetchContent($article);

Cleanup content

Since version 1.9.0, you can also send HTML content to be cleaned up in the same way as content graby retrieves from a URL. The URL is still required, in order to convert relative links to absolute ones, etc.:

use Graby\Graby;

$article = 'http://www.bbc.com/news/entertainment-arts-32547474';
// use your own way to retrieve html or to provide html
$html = ...

$graby = new Graby();
$result = $graby->cleanupHtml($html, $article);

Use a custom handler & formatter to see the output log

You can use them to display graby output logs to your end user. They are meant to be used in a Symfony project with Monolog.

Define the graby handler service (somewhere in service.yml):

services:
    # ...
    graby.log_handler:
        class: Graby\Monolog\Handler\GrabyHandler

Then define the Monolog handler in your app/config/config.yml:

monolog:
    handlers:
        graby:
            type: service
            id: graby.log_handler
            # use "debug" to get a lot of data (like HTML at each step), otherwise "info" is fine
            level: debug
            channels: ['graby']

You can then retrieve the graby logs in your controller using:

$logs = $this->get('monolog.handler.graby')->getRecords();

Timeout configuration

If you need to define a timeout, you have to create the Http\Client\HttpClient manually, configure it and inject it into Graby\Graby:

  • For Guzzle 5:

    use Graby\Graby;
    use GuzzleHttp\Client as GuzzleClient;
    use Http\Adapter\Guzzle5\Client as GuzzleAdapter;

    $guzzle = new GuzzleClient([
        'defaults' => [
            'timeout' => 2,
        ]
    ]);
    $graby = new Graby([], new GuzzleAdapter($guzzle));
  • For Guzzle 7:

    use Graby\Graby;
    use GuzzleHttp\Client as GuzzleClient;
    use Http\Adapter\Guzzle7\Client as GuzzleAdapter;
    
    $guzzle = new GuzzleClient([
        'timeout' => 2,
    ]);
    $graby = new Graby([], new GuzzleAdapter($guzzle));

Full configuration

This is the full documented configuration, which is also the default one:

$graby = new Graby([
    // Enable or disable debugging.
    // This will only generate log information in a file (log/graby.log)
    'debug' => false,
    // use 'debug' value if you want more data (HTML at each step for example) to be dumped in a different file (log/html.log)
    'log_level' => 'info',
    // If enabled relative URLs found in the extracted content are automatically rewritten as absolute URLs.
    'rewrite_relative_urls' => true,
    // If enabled, we will try to follow single page links (e.g. print view) on multi-page articles.
    // Currently this only happens for sites where single_page_link has been defined
    // in a site config file.
    'singlepage' => true,
    // If enabled, we will try to follow next page links on multi-page articles.
    // Currently this only happens for sites where next_page_link has been defined
    // in a site config file.
    'multipage' => true,
    // Error message when content extraction fails
    'error_message' => '[unable to retrieve full-text content]',
    // Default title when we are not able to extract a title
    'error_message_title' => 'No title found',
    // List of URLs (or parts of a URL) which will be accepted.
    // If the list is empty, all URLs (except those specified in the blocked list below)
    // will be permitted.
    // Example: array('example.com', 'anothersite.org');
    'allowed_urls' => [],
    // List of URLs (or parts of a URL) which will not be accepted.
    // Note: this list is ignored if allowed_urls is not empty
    'blocked_urls' => [],
    // If enabled, we'll pass retrieved HTML content through htmLawed with
    // safe flag on and style attributes denied, see
    // http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/htmLawed_README.htm#s3.6
    // Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.
    'xss_filter' => true,
    // Here you can define different actions based on the Content-Type header returned by server.
    // MIME type as key, action as value.
    // Valid actions:
    // * 'exclude' - exclude this item from the result
    // * 'link' - create HTML link to the item
    'content_type_exc' => [
       'application/zip' => ['action' => 'link', 'name' => 'ZIP'],
       'application/pdf' => ['action' => 'link', 'name' => 'PDF'],
       'image' => ['action' => 'link', 'name' => 'Image'],
       'audio' => ['action' => 'link', 'name' => 'Audio'],
       'video' => ['action' => 'link', 'name' => 'Video'],
       'text/plain' => ['action' => 'link', 'name' => 'Plain text'],
    ],
    // How we handle links in content
    // Valid values:
    // * preserve: nothing is done
    // * footnotes: convert links as footnotes
    // * remove: remove all links
    'content_links' => 'preserve',
    'http_client' => [
        // User-Agent used to fetch content
        'ua_browser' => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2',
        // default referer when fetching content
        'default_referer' => 'http://www.google.co.uk/url?sa=t&source=web&cd=1',
        // Currently allows simple string replace of URLs.
        // Useful for rewriting certain URLs to point to a single page or HTML view.
        // Although using the single_page_link site config instruction is the preferred way to do this, sometimes, as
        // with Google Docs URLs, it's not possible.
        'rewrite_url' => [
            'docs.google.com' => ['/Doc?' => '/View?'],
            'tnr.com' => ['tnr.com/article/' => 'tnr.com/print/article/'],
            '.m.wikipedia.org' => ['.m.wikipedia.org' => '.wikipedia.org'],
            'm.vanityfair.com' => ['m.vanityfair.com' => 'www.vanityfair.com'],
        ],
        // Prevent certain file/mime types
        // HTTP responses which match these content types will
        // be returned without body.
        'header_only_types' => [
           'image',
           'audio',
           'video',
        ],
        // URLs ending with one of these extensions will
        // prompt Humble HTTP Agent to send a HEAD request first
        // to see if returned content type matches $headerOnlyTypes.
        'header_only_clues' => ['mp3', 'zip', 'exe', 'gif', 'gzip', 'gz', 'jpeg', 'jpg', 'mpg', 'mpeg', 'png', 'ppt', 'mov'],
        // User Agent strings - mapping domain names
        'user_agents' => [],
        // AJAX triggers to search for.
        // for AJAX sites, e.g. Blogger with its dynamic views templates.
        'ajax_triggers' => [
            "<meta name='fragment' content='!'",
            '<meta name="fragment" content="!"',
            "<meta content='!' name='fragment'",
            '<meta content="!" name="fragment"',
        ],
        // number of redirections allowed before we assume the request won't complete
        'max_redirect' => 10,
    ],
    'extractor' => [
        'default_parser' => 'libxml',
        // key is fingerprint (fragment to find in HTML)
        // value is host name to use for site config lookup if fingerprint matches
        // \s* match anything INCLUDING new lines
        'fingerprints' => [
            '/\<meta\s*content=([\'"])blogger([\'"])\s*name=([\'"])generator([\'"])/i' => 'fingerprint.blogspot.com',
            '/\<meta\s*name=([\'"])generator([\'"])\s*content=([\'"])Blogger([\'"])/i' => 'fingerprint.blogspot.com',
            '/\<meta\s*name=([\'"])generator([\'"])\s*content=([\'"])WordPress/i' => 'fingerprint.wordpress.com',
        ],
        'config_builder' => [
            // Directory path to the site config folder WITHOUT trailing slash
            'site_config' => [],
            'hostname_regex' => '/^(([a-zA-Z0-9-]*[a-zA-Z0-9])\.)*([A-Za-z0-9-]*[A-Za-z0-9])$/',
        ],
        'readability' => [
            // filters might be like array('regex' => 'replace with')
            // for example, to remove script content: array('!<script[^>]*>(.*?)</script>!is' => '')
            'pre_filters' => [],
            'post_filters' => [],
        ],
        'src_lazy_load_attributes' => [
            'data-src',
            'data-lazy-src',
            'data-original',
            'data-sources',
            'data-hi-res-src',
        ],
        // these JSON-LD types will be ignored
        'json_ld_ignore_types' => ['Organization', 'WebSite', 'Person', 'VideoGame'],
    ],
]);
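In practice you rarely pass the whole array. A minimal sketch of overriding just a few keys, assuming (as the default configuration above suggests) that the options you pass are merged with these defaults:

```php
use Graby\Graby;

// Override only the options you need; unspecified keys
// keep the default values documented above.
$graby = new Graby([
    'debug' => true,
    'content_links' => 'footnotes',
    'http_client' => [
        'max_redirect' => 5,
    ],
]);
```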

Credits