pforret / pf-article-extractor
PhpArticleExtractor. Removes boilerplate from HTML pages and extracts the full text
0.3.0
2024-06-03 22:57 UTC
Requires
- php: ^8.2
- ext-dom: *
- ext-libxml: *
- ext-mbstring: *
- fivefilters/readability.php: ^3.2
Requires (Dev)
- ext-curl: *
- laravel/pint: ^1.16
- phpunit/phpunit: ^11.1
README
Removes boilerplate from HTML pages and extracts the full text.
A rewrite of dotpack/php-boiler-pipe for PHP 8.2 and later, with tests.
Installation
composer require pforret/pf-article-extractor
Usage
use Pforret\PfArticleExtractor\ArticleExtractor;

$articleData = ArticleExtractor::getArticle($html);

/*
 * $articleData = Pforret\PfArticleExtractor\Formats\ArticleContents Object
 * (
 *     [title] => Film Podcast: Wicked Little Letters Named Film of the Month
 *     [content] => UK Film Club was back in March with a new episode of their film podcast. (...)
 *     [date] =>
 *     [images] => Array
 *         (
 *             [0] => https://static.wixstatic.com/media/.../b19cd0_dde0d59546f84127865267f43994f39b~mv2.jpg
 *         )
 *     [links] => Array
 *         (
 *             [0] => https://www.chrisolson.co.uk/
 *             (...)
 *         )
 * )
 */
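How it works
- The package takes a complete HTML page as input
- It walks the DOM tree and tries to locate the main content
- It removes boilerplate content (headers, footers, sidebars, etc.)
- It tries to extract the main content
- It tries to extract the title, date, images and links from the article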
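The steps above can be illustrated with a minimal sketch built only on PHP's bundled DOM extension. This is not the package's actual algorithm (which, per the Requires list, builds on fivefilters/readability.php); the function name `naiveExtract()`, the list of boilerplate tags, and the "first article or main element" heuristic are all assumptions made for illustration.

```php
<?php
// Illustrative sketch only: NOT how pf-article-extractor works internally.
// naiveExtract() and its heuristics are hypothetical.

function naiveExtract(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML rarely validates; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common trick to make libxml read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Step 1: read the page title before anything is removed.
    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    // Step 2: drop elements that usually hold boilerplate.
    $boilerplate = '//header | //footer | //nav | //aside | //script | //style';
    foreach (iterator_to_array($xpath->query($boilerplate)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Step 3: assume the first <article> or <main> is the main content,
    // falling back to <body>.
    $main = $xpath->query('//article | //main')->item(0)
        ?? $xpath->query('//body')->item(0);

    $paragraphs = [];
    $images = [];
    $links = [];

    if ($main !== null) {
        // Step 4: collect paragraph text with whitespace normalised.
        foreach ($xpath->query('.//p', $main) as $p) {
            $text = trim(preg_replace('/\s+/u', ' ', $p->textContent));
            if ($text !== '') {
                $paragraphs[] = $text;
            }
        }
        // Step 5: collect image sources and link targets.
        foreach ($xpath->query('.//img/@src', $main) as $attr) {
            $images[] = $attr->nodeValue;
        }
        foreach ($xpath->query('.//a/@href', $main) as $attr) {
            $links[] = $attr->nodeValue;
        }
    }

    return [
        'title' => $title,
        'content' => implode("\n", $paragraphs),
        'images' => $images,
        'links' => $links,
    ];
}
```

Calling `naiveExtract($html)` returns a plain array with `title`, `content`, `images` and `links` keys, loosely mirroring the fields shown in the usage output. A fixed tag list like this breaks down on pages that mark up navigation with generic `div` elements, which is why dedicated extractors score content blocks instead.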
So far it has been tested against sample pages from:
- Blogger
- Drupal
- Jekyll
- Mkdocs
- Wix
- WordPress
Similar packages
- beautifulsoup4 - Python, MIT
- html-text - Python, MIT
- kohlschutter/boilerpipe - Java, Apache 2.0
- fivefilters/readability.php - PHP, GPL-3.0
- miso-belica/jusText - Python, BSD2
- codelucas/newspaper - Python, Apache