stevebauman / hypertext
The best HTML to text converter
v1.1.1
2024-05-19 04:56 UTC
Requires
- ezyang/htmlpurifier: ^4.16
Requires (Dev)
- pestphp/pest: ^2.24
- spatie/ray: ^1.39
Suggests
- ext-dom: Required for filtering HTML.
- ext-libxml: Required for filtering HTML.
README
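A PHP HTML to plain text transformer that gracefully handles all kinds of malformed HTML.
Hypertext is good at extracting text content from any HTML-based document and automatically:
- Removes CSS
- Removes scripts
- Removes headers
- Removes non-HTML content
- Preserves spacing
- Preserves links (optional)
- Preserves new lines (optional)
It is intended primarily for LLM-related tasks, such as prompts and embeddings.
Installation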
composer require stevebauman/hypertext
Usage
use Stevebauman\Hypertext\Transformer;

$transformer = new Transformer();

// (Optional) Filter out specific elements by their XPath.
$transformer->filter("//*[@id='some-element']");

// (Optional) Retain new line characters.
$transformer->keepNewLines();

// (Optional) Retain anchor tags and their href attribute.
$transformer->keepLinks();

$text = $transformer->toText($html);
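To make the XPath filter easier to picture, here is a minimal sketch of which elements an expression such as "//*[@id='some-element']" selects. It uses only PHP's built-in DOM extension, not Hypertext itself. The helper name removeByXPath is hypothetical, and the way Hypertext applies its filters internally may differ.

```php
<?php

// Hypothetical helper for illustration only; it is not part of Hypertext.
// Assumes $expression is a valid XPath expression.
function removeByXPath(string $html, string $expression): string
{
    // Collect libxml warnings internally instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Copy the matches into an array first so removal does not disturb iteration.
    foreach (iterator_to_array($xpath->query($expression)) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Return whatever text remains in the document.
    return trim($dom->documentElement->textContent);
}

echo removeByXPath(
    '<div><p id="some-element">Remove me</p><p>Keep me</p></div>',
    "//*[@id='some-element']"
), PHP_EOL; // prints "Keep me"
```

The expression matches any element whose id attribute equals some-element, wherever it sits in the tree, so the element and everything nested inside it are dropped before the text is read.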
Example
For larger examples, see the tests/Fixtures directory.
Input:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <p>This is a paragraph of text on my webpage.</p>
    <a href="https://blog.com/posts">Click here</a> to view my posts.
</body>
</html>
Output (plain text):
echo (new Transformer)->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. Click here to view my posts.
Output (preserving new lines):
echo (new Transformer)->keepNewLines()->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
Click here to view my posts.
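The difference between the two outputs above comes down to how whitespace is normalized. The sketch below illustrates that idea with plain regular expressions. Both function names are hypothetical, and this is an assumption about the general technique, not a copy of what Hypertext does internally.

```php
<?php

// Hypothetical helpers for illustration only; they are not part of Hypertext.

function collapseAllWhitespace(string $text): string
{
    // Every run of whitespace, line breaks included, becomes a single space.
    return trim(preg_replace('/\s+/', ' ', $text));
}

function collapseKeepingNewLines(string $text): string
{
    // Split on any line break sequence, then tidy each line on its own.
    $lines = preg_split('/\R/', $text);

    // Collapse runs of spaces and tabs inside a line and trim its ends.
    $lines = array_map(
        fn (string $line): string => trim(preg_replace('/[^\S\r\n]+/', ' ', $line)),
        $lines
    );

    // Drop empty lines and rejoin with a single line break.
    return implode("\n", array_filter($lines, fn (string $line): bool => $line !== ''));
}

$text = "Welcome to My Blog\n\n   This is a paragraph.\n";

echo collapseAllWhitespace($text), PHP_EOL;
echo collapseKeepingNewLines($text), PHP_EOL;
```

The first call yields one line of text, while the second keeps each non-empty line separate, which mirrors the shape of the plain text and new line outputs shown above.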
Output (preserving links):
echo (new Transformer)->keepLinks()->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. <a href="https://blog.com/posts">Click here</a> to view my posts.
Output (preserving both):
echo (new Transformer)
    ->keepLinks()
    ->keepNewLines()
    ->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
<a href="https://blog.com/posts">Click here</a> to view my posts.