phikhi / url-to-text
Extract text from a URL
v1.0.5
2023-03-02 15:35 UTC
Requires
- php: ^8.1
Requires (Dev)
- laravel/pint: ^1.6.0
- nunomaduro/collision: ^7.0.5
- pestphp/pest: ^2.0.0
- pestphp/pest-plugin-mock: ^2.0.0
- phpstan/phpstan: ^1.10.3
- rector/rector: ^0.14.8
- symfony/var-dumper: ^6.2.7
README
Extract any text from a remote HTML page.
🚧 Work in progress (do not use) 🚧
Installation
composer require phikhi/url-to-text
Usage
Basic usage
use Phikhi\UrlToText\UrlToText;

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toArray();
/*
[
    'lorem ipsum dolor sit amet',
    'non gloriam sine audentes',
    '...'
];
*/

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toJson();
// ['lorem ipsum dolor sit amet', 'non gloriam sine audentes', '...'];

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toText();
/*
lorem ipsum dolor sit amet
non gloriam sine audentes
...
*/
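To make the three output shapes concrete, here is a minimal sketch of how an array of extracted strings could map to a JSON string and to plain text. The functions `linesToJson()` and `linesToText()` are hypothetical names used only for illustration, and the exact JSON flags and line separator are assumptions; the library's own `toJson()` and `toText()` may behave differently.

```php
<?php

// Hypothetical stand-in for the result of ->toArray(); not fetched from a real page.
$lines = ['lorem ipsum dolor sit amet', 'non gloriam sine audentes'];

// Assumed JSON shape: a flat JSON array of strings.
function linesToJson(array $lines): string
{
    return json_encode(array_values($lines), JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
}

// Assumed text shape: one extracted string per line.
function linesToText(array $lines): string
{
    return implode(PHP_EOL, $lines);
}

echo linesToJson($lines), PHP_EOL;
echo linesToText($lines), PHP_EOL;
```

If the array form is already available, converting it yourself this way keeps control over the encoding flags and separators.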
Advanced usage
You can customize which tags are parsed.
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span']) // adds these tags to the existing allowed tags array (h*, p, li, a).
    ->extract()
    ->toArray();
If you want to overwrite the allowed tags array instead of extending it, pass a second parameter to the allow() method.
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span'], overwrite: true) // replaces the existing allowed tags array with this one.
    ->extract()
    ->toArray();
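The difference between extending and overwriting a tag list can be sketched in isolation. `mergeTags()` below is a hypothetical helper, not part of phikhi/url-to-text, and the default list is abbreviated for illustration; lowercasing and de-duplication are assumptions about sensible behavior rather than documented features.

```php
<?php

// Hypothetical helper: extend the current tag list, or replace it when $overwrite is true.
function mergeTags(array $current, array $tags, bool $overwrite = false): array
{
    // Assumption: tag names are compared case-insensitively.
    $tags = array_map('strtolower', $tags);

    if ($overwrite) {
        return array_values(array_unique($tags));
    }

    // Extend: keep existing tags first, append new ones, drop duplicates.
    return array_values(array_unique(array_merge($current, $tags)));
}

$defaults = ['h1', 'p', 'li', 'a']; // abbreviated, for illustration only

print_r(mergeTags($defaults, ['div', 'span']));                  // extended list
print_r(mergeTags($defaults, ['div', 'span'], overwrite: true)); // replaced list
```

The same extend-or-replace pattern applies to the denied tags list.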
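By default, script and style tags are automatically removed from the DOM before the allowed tags are extracted, to prevent odd behavior during extraction. You can still customize them with the deny() method.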
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg']) // adds the `svg` tag to the existing denied tags array (script, style).
    ->extract()
    ->toArray();
If you want to overwrite the denied tags array instead of extending it, pass a second parameter to the deny() method.
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg'], overwrite: true) // replaces the existing denied tags array with this one.
    ->extract()
    ->toArray();
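The deny-then-extract order described above can be illustrated with PHP's built-in DOM extension. This is a sketch under assumptions, not the library's implementation: `extractText()` is a hypothetical function, it works on an HTML string rather than fetching a URL, and it assumes lowercase tag names and ASCII or properly declared encodings.

```php
<?php

// Hypothetical helper: strip denied elements first, then collect text from allowed ones.
function extractText(string $html, array $allowed, array $denied): array
{
    if ($allowed === []) {
        return [];
    }

    $dom = new DOMDocument();
    // Real-world markup is often invalid; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 1: remove denied elements so their content never reaches extraction.
    foreach ($denied as $tag) {
        // Copy to an array first: the node list is live and shrinks as nodes are removed.
        foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            $node->parentNode?->removeChild($node);
        }
    }

    // Step 2: collect the text of allowed elements, in document order.
    $xpath = new DOMXPath($dom);
    $query = implode(' | ', array_map(fn (string $tag): string => '//' . $tag, $allowed));

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        // Collapse runs of whitespace and skip empty results.
        $text = trim((string) preg_replace('/\s+/', ' ', $node->textContent));
        if ($text !== '') {
            $texts[] = $text;
        }
    }

    return $texts;
}

$html = '<html><body><h1>Title</h1><script>var x = 1;</script>'
    . '<p>Hello <a href="#">world</a></p><style>p{}</style></body></html>';

print_r(extractText($html, ['h1', 'p'], ['script', 'style']));
```

Note that in this sketch nested allowed elements (for example an `a` inside a `p`) each contribute their own text, so allowing both would repeat the link text.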