j0k3r / php-readability
Automatic article extraction from HTML
2.0.3
2023-04-03 12:47 UTC
Requires
- php: >=7.2.0
- ext-mbstring: *
- masterminds/html5: ^2.7
- psr/log: ^1.0.1 || ^2.0 || ^3.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.0
- monolog/monolog: ^1.24|^2.1
- phpstan/phpstan: ^1.3
- phpstan/phpstan-phpunit: ^1.0
- rector/rector: ^0.15.0
- symfony/phpunit-bridge: ^4.4|^5.3|^6.0
Suggests
- ext-tidy: Used to clean up given HTML and to avoid problems with bad HTML structure.
This package is auto-updated.
Last update: 2024-09-19 05:58:51 UTC
README
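This is an extract of the Readability class from a full-text-rss fork. It can be described as a better version of the original php-readability.
Differences
The default php-readability library is very old and needs improvement. I found @Dither's fork of full-text-rss, which improves the Readability class.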
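- I extracted the class from that fork so it can be used on its own
- I added some simple tests
- I changed the code style by running php-cs-fixer and added a namespace
But the code is still really hard to understand and read...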
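Requirements
By default, this library uses the Tidy extension if it is available. Tidy is only used to clean up the given HTML and avoid problems with bad HTML structure. Composer lists it as a suggested extension.
If you have trouble parsing content without Tidy installed, install it and try again.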
Usage
```php
use Readability\Readability;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';

// you can use whatever you want to retrieve the html content (Guzzle, Buzz, cURL ...)
$html = file_get_contents($url);

$readability = new Readability($html, $url);
// or without Tidy
// $readability = new Readability($html, $url, 'libxml', false);
$result = $readability->init();

if ($result) {
    // display the title of the page
    echo $readability->getTitle()->textContent;
    // display the *readability* content
    echo $readability->getContent()->textContent;
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
```
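The commented-out call in the usage example passes `'libxml'` and `false` as the third and fourth constructor arguments to run without Tidy. As a minimal sketch, assuming the fourth argument is a plain on/off switch for Tidy, you could decide it explicitly from whether the extension is loaded. The helper name `readabilityArgs` is hypothetical and not part of the library.

```php
<?php

/**
 * Hypothetical helper (not part of php-readability): builds the four
 * constructor arguments used in the usage example, enabling Tidy only
 * when the PHP tidy extension is actually loaded.
 *
 * Assumption: the fourth constructor argument toggles Tidy, as the
 * "without Tidy" call in the usage example suggests.
 */
function readabilityArgs(string $html, string $url): array
{
    // extension_loaded() is a PHP built-in; it returns true if ext-tidy is available.
    $useTidy = extension_loaded('tidy');

    return [$html, $url, 'libxml', $useTidy];
}

// Usage, assuming Composer's autoloader has already been required:
// $readability = new Readability\Readability(...readabilityArgs($html, $url));
```

This makes the choice visible in your own code, which can help when comparing results between machines with and without Tidy.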
If you want to debug it or check what is going on, you can inject a logger. It must implement `Psr\Log\LoggerInterface`; Monolog is one example.
```php
use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';
$html = file_get_contents($url);

$logger = new Logger('readability');
$logger->pushHandler(new StreamHandler('path/to/your.log', Logger::DEBUG));

$readability = new Readability($html, $url);
$readability->setLogger($logger);
```