stevebauman/hypertext

The best HTML-to-text converter

v1.1.1 2024-05-19 04:56 UTC

This package is auto-updated.

Last update: 2024-09-19 05:43:42 UTC


README

A PHP HTML-to-plain-text transformer that gracefully handles varied and malformed HTML.

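Hypertext excels at extracting text content from any HTML-based document and automatically:

  • Removes CSS
  • Removes scripts
  • Removes headers
  • Removes non-HTML content
  • Preserves spacing
  • Preserves links (optional)
  • Preserves new lines (optional)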

It is intended for LLM-related tasks, such as prompts and embeddings.
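
For embedding workloads, extracted text often has to be split into smaller pieces before it is sent to a model. The following is a minimal sketch of one way to do that in plain PHP. `chunkWords()` and the five-word limit are hypothetical illustrations and are not part of Hypertext; the sample string is taken from the example output in this README.

```php
<?php

/**
 * Split plain text into chunks of at most $maxWords words.
 * Hypothetical helper for illustration; not provided by Hypertext.
 *
 * @return string[]
 */
function chunkWords(string $text, int $maxWords): array
{
    if ($maxWords < 1) {
        throw new InvalidArgumentException('$maxWords must be at least 1.');
    }

    // Split on any run of whitespace and drop empty pieces.
    // preg_split() returns false on failure, so fall back to an empty list.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Group the words and join each group back into a single string.
    return array_map(
        fn (array $group): string => implode(' ', $group),
        array_chunk($words, $maxWords)
    );
}

$text = 'Welcome to My Blog This is a paragraph of text on my webpage.';

foreach (chunkWords($text, 5) as $index => $chunk) {
    echo $index, ': ', $chunk, PHP_EOL;
}
```

Because this sketch splits on all whitespace, it discards line breaks. It is best suited to text produced without `keepNewLines()`.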

Installation

composer require stevebauman/hypertext

Usage

use Stevebauman\Hypertext\Transformer;

$transformer = new Transformer();

// (Optional) Filter out specific elements by their XPath.
$transformer->filter("//*[@id='some-element']");

// (Optional) Retain new line characters.
$transformer->keepNewLines();

// (Optional) Retain anchor tags and their href attribute.
$transformer->keepLinks();

$text = $transformer->toText($html);
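
Before passing an expression to `filter()`, it can help to confirm which elements it matches. The sketch below uses PHP's built-in `DOMDocument` and `DOMXPath` classes for that check. `matchXPath()` is a hypothetical helper and the sample markup is made up for illustration. It does not call Hypertext, and it makes no claim about how the package evaluates filters internally.

```php
<?php

/**
 * Return the text content of every element matched by an XPath expression.
 * Hypothetical helper built on PHP's DOM extension; not part of Hypertext.
 *
 * @return string[]
 */
function matchXPath(string $html, string $expression): array
{
    $document = new DOMDocument();

    // Collect libxml warnings internally so malformed markup does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($document))->query($expression);

    if ($nodes === false) {
        throw new InvalidArgumentException("Invalid XPath expression: {$expression}");
    }

    $matches = [];
    foreach ($nodes as $node) {
        $matches[] = trim($node->textContent);
    }

    return $matches;
}

$html = '<div><p id="some-element">Remove me</p><p>Keep me</p></div>';

print_r(matchXPath($html, "//*[@id='some-element']"));
```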

Example

For larger examples, see the tests/Fixtures directory.

Input:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <p>This is a paragraph of text on my webpage.</p>
    <a href="https://blog.com/posts">Click here</a> to view my posts.
</body>
</html>

Output (plain text):

echo (new Transformer)->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. Click here to view my posts.

Output (retaining new lines):

echo (new Transformer)->keepNewLines()->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
Click here to view my posts.

Output (retaining links):

echo (new Transformer)->keepLinks()->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. <a href="https://blog.com/posts">Click here</a> to view my posts.

Output (retaining both):

echo (new Transformer)
    ->keepLinks()
    ->keepNewLines()
    ->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
<a href="https://blog.com/posts">Click here</a> to view my posts.
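
When links are retained, the output still contains anchor tags, so a downstream step may want to pull out the URLs separately. The following is a rough sketch of that step. `extractLinks()` is a hypothetical helper, and the pattern assumes anchors appear exactly as in the output above, with a double-quoted `href` as the first attribute. It is not part of Hypertext.

```php
<?php

/**
 * Collect the href and label of each anchor tag found in a string.
 * Hypothetical helper; assumes anchors shaped like <a href="...">label</a>.
 *
 * @return array<int, array{href: string, label: string}>
 */
function extractLinks(string $text): array
{
    // "~" is used as the delimiter so the closing </a> needs no escaping.
    $pattern = '~<a\s+href="([^"]*)"[^>]*>(.*?)</a>~is';

    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER) === false) {
        return [];
    }

    return array_map(
        fn (array $match): array => ['href' => $match[1], 'label' => $match[2]],
        $matches
    );
}

$text = 'This is a paragraph of text on my webpage. '
    . '<a href="https://blog.com/posts">Click here</a> to view my posts.';

foreach (extractLinks($text) as $link) {
    echo $link['label'], ' -> ', $link['href'], PHP_EOL;
}
```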