phikhi / url-to-text

Extract text from a URL

v1.0.5 2023-03-02 15:35 UTC

This package is auto-updated.

Last update: 2024-09-30 01:32:46 UTC


README

Extract any text from a remote HTML page. 🚧 Work in progress (do not use) 🚧

Installation

composer require phikhi/url-to-text

Usage

Basic usage

use Phikhi\UrlToText\UrlToText;

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toArray();
/*
[
    'lorem ipsum dolor sit amet',
    'non gloriam sine audentes',
    '...'
];
*/

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toJson();
// '["lorem ipsum dolor sit amet","non gloriam sine audentes","..."]' (a JSON-encoded string)

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toText();
/*
lorem ipsum dolor sit amet
non gloriam sine audentes
...
*/
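
The three output formats above appear to be different views of the same list of extracted lines. The sketch below illustrates that relationship with plain PHP, without calling the package. It assumes toJson() behaves like json_encode() on the array and that toText() joins the lines with newlines; neither assumption is confirmed by the package source, and $lines is a hypothetical stand-in for the result of toArray().

```php
<?php
// Hypothetical stand-in for the array returned by ->toArray().
$lines = ['lorem ipsum dolor sit amet', 'non gloriam sine audentes'];

// Assumption: the JSON form is the same array passed through json_encode().
$json = json_encode($lines);

// Assumption: the text form is the same array joined with newlines.
$text = implode("\n", $lines);

echo $json, PHP_EOL;
echo $text, PHP_EOL;
```

If these assumptions hold, toArray() is the most convenient form for further processing, while the other two are ready for transport or display.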

Advanced usage

You can customize the tags you want to parse.

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span']) // will add these tags to the existing allowed tags array (H*, p, li, a).
    ->extract()
    ->toArray();

If you want to overwrite the allowed tags array instead of extending it, pass a second argument to the allow() method.

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span'], overwrite: true) // will replace the existing allowed tags array with this one.
    ->extract()
    ->toArray();
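
The difference between extending and overwriting can be sketched as a small merge function. This is only an illustration of the semantics described above, not the package's actual code: mergeTags() is a hypothetical helper, and the default list assumes that "H*" means h1 through h6.

```php
<?php
/**
 * Hypothetical helper: combine a current tag list with new tags.
 * With $overwrite = false the new tags are appended (duplicates dropped);
 * with $overwrite = true the new tags replace the current list entirely.
 */
function mergeTags(array $current, array $tags, bool $overwrite = false): array
{
    if ($overwrite) {
        return array_values(array_unique($tags));
    }

    return array_values(array_unique(array_merge($current, $tags)));
}

// Assumed defaults, based on the "(H*, p, li, a)" comment in the README.
$defaults = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li', 'a'];

$extended = mergeTags($defaults, ['div', 'span']);                  // defaults plus div and span
$replaced = mergeTags($defaults, ['div', 'span'], overwrite: true); // only div and span

echo implode(', ', $extended), PHP_EOL;
echo implode(', ', $replaced), PHP_EOL;
```

The named argument overwrite: true mirrors the call style used in the README and requires PHP 8.0 or later.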

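By default, script and style tags are automatically removed from the DOM before the allowed tags are extracted, to prevent odd behavior during extraction. You can still customize them with the deny() method.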

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg']) // will add the `svg` tag to the existing denied tags array (script, style).
    ->extract()
    ->toArray();

If you want to overwrite the denied tags array instead of extending it, pass a second argument to the deny() method.

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg'], overwrite: true) // will replace the existing denied tags array with this one.
    ->extract()
    ->toArray();
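
To see why denied tags are removed before extraction, here is a minimal sketch of the deny-then-allow order using PHP's built-in DOMDocument and DOMXPath. It is not the package's implementation: extractText() is a hypothetical function, and the HTML string is made up for the example. Without the removal step, the text of a container such as div would include the contents of any script or style nested inside it.

```php
<?php
/**
 * Hypothetical sketch: strip denied tags, then collect the text
 * of every allowed tag in document order.
 */
function extractText(string $html, array $allowed, array $denied): array
{
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // Step 1: remove every denied element from the DOM.
    foreach ($denied as $tag) {
        // Copy the node list first so removals do not disturb iteration.
        foreach (iterator_to_array($xpath->query('//' . $tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Step 2: gather the trimmed text of every allowed element.
    $query = implode(' | ', array_map(fn (string $tag) => '//' . $tag, $allowed));
    $lines = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $lines[] = $text;
        }
    }

    return $lines;
}

// Made-up input for illustration only.
$html = '<html><body><h1>Title</h1><div><script>var x = 1;</script><p>Hello</p></div></body></html>';

$result = extractText($html, ['h1', 'p'], ['script', 'style']);
print_r($result);
```

Running both steps on one DOM instance keeps the order explicit: anything denied is gone before any allowed tag is read.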