pforret/pf-article-extractor

PhpArticleExtractor. Boilerplate removal and full-text extraction from HTML pages

0.3.0 2024-06-03 22:57 UTC

This package is auto-updated.

Last update: 2024-09-13 08:49:12 UTC



Boilerplate removal and full-text extraction from HTML pages.

A rewrite of dotpack/php-boiler-pipe for PHP 8.2 and above, with tests.

Installation

composer require pforret/pf-article-extractor

Usage

use Pforret\PfArticleExtractor\ArticleExtractor;

$articleData = ArticleExtractor::getArticle($html);
/*
 * $articleData = Pforret\PfArticleExtractor\Formats\ArticleContents Object
(
    [title] => Film Podcast: Wicked Little Letters Named Film of the Month
    [content] => UK Film Club was back in March with a new episode of their film podcast. (...)
    [date] =>
    [images] => Array
        (
            [0] => https://static.wixstatic.com/media/.../b19cd0_dde0d59546f84127865267f43994f39b~mv2.jpg
        )

    [links] => Array
        (
            [0] => https://www.chrisolson.co.uk/
            (...)
        )

)

 */

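How it works

  • The package takes a full HTML page as input
  • It walks the DOM tree and tries to locate the main content
  • It removes boilerplate (headers, footers, sidebars, and so on)
  • It extracts the main content where it can
  • It tries to extract the title, date, images and links from the article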
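The steps above can be illustrated with a much simplified sketch. This is not the package's actual algorithm: `sketchExtractArticle()` is a hypothetical helper, and the "parent element with the most paragraph text wins" rule is an assumption chosen only to keep the example short. It relies on nothing but PHP's bundled DOM extension.

```php
<?php
// Illustrative sketch only: NOT the package's actual algorithm.
// sketchExtractArticle() is a hypothetical helper using PHP's bundled DOM extension.

function sketchExtractArticle(string $html): array
{
    $empty = ['title' => '', 'content' => '', 'images' => [], 'links' => []];
    if (trim($html) === '') {
        return $empty; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Collect parser warnings (e.g. about HTML5 tags) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    // Read the title before anything is removed from the tree.
    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    // Drop obvious boilerplate containers. Copy the node list to an array
    // first so that removing nodes does not disturb the iteration.
    $boilerplate = iterator_to_array(
        $xpath->query('//header | //footer | //nav | //aside | //script | //style')
    );
    foreach ($boilerplate as $node) {
        $node->parentNode?->removeChild($node);
    }

    // Assumed heuristic: score each parent by the text length of its <p> children.
    $scores = [];
    foreach ($xpath->query('//p') as $p) {
        $key = $p->parentNode->getNodePath();
        $scores[$key] = ($scores[$key] ?? 0) + strlen(trim($p->textContent));
    }
    if ($scores === []) {
        return ['title' => $title] + $empty;
    }
    arsort($scores);
    $best = $xpath->query(array_key_first($scores))->item(0);

    // Main content: the non-empty paragraphs directly under the winning element.
    $paragraphs = [];
    foreach ($xpath->query('./p', $best) as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }

    // Collect attribute values (image sources, link targets) below that element.
    $values = fn (string $query): array => array_map(
        fn (DOMNode $attr): string => $attr->nodeValue,
        iterator_to_array($xpath->query($query, $best))
    );

    return [
        'title'   => $title,
        'content' => implode("\n\n", $paragraphs),
        'images'  => $values('.//img/@src'),
        'links'   => $values('.//a/@href'),
    ];
}
```

The returned keys mirror the title, content, images and links fields shown in the usage output; date extraction is left out of the sketch. Real pages need far more robust scoring than a single paragraph-length count, which is the part a dedicated package takes care of.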

So far it has been tested with sample pages from:

  • Blogger
  • Drupal
  • Jekyll
  • Mkdocs
  • Wix
  • WordPress

Similar packages