xatham/text-extraction

Simple text extraction for a variety of file types

0.0.2 2021-09-25 19:25 UTC

This package is auto-updated.

Last update: 2024-09-26 01:56:23 UTC


README


text-extraction

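About

This PHP library lets you extract plain text from a variety of document types.

Extraction is currently supported for files with the following MIME types: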

text/plain

text/csv

application/vnd.ms-excel

application/vnd.oasis.opendocument.text

application/pdf

application/msword
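
To check ahead of time whether a file falls into one of the MIME types listed above, something like the following sketch may help. It relies only on PHP's bundled `fileinfo` extension. The constant `SUPPORTED_MIME_TYPES` and the helper `detectSupportedMimeType()` are hypothetical names introduced here for illustration. They are not part of xatham/text-extraction, and the library may detect MIME types differently internally.

```php
<?php
declare(strict_types=1);

// MIME types listed in this README as supported for extraction.
const SUPPORTED_MIME_TYPES = [
    'text/plain',
    'text/csv',
    'application/vnd.ms-excel',
    'application/vnd.oasis.opendocument.text',
    'application/pdf',
    'application/msword',
];

/**
 * Hypothetical helper, not part of the library: returns the detected MIME
 * type of $path if it is in $supported, or null otherwise.
 */
function detectSupportedMimeType(string $path, array $supported = SUPPORTED_MIME_TYPES): ?string
{
    if (!is_file($path)) {
        return null; // Missing file or not a regular file.
    }

    // finfo inspects the file contents rather than trusting the extension.
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mimeType = $finfo->file($path);
    if ($mimeType === false) {
        return null; // Detection failed.
    }

    return in_array($mimeType, $supported, true) ? $mimeType : null;
}
```

Content-based detection can disagree with a file's extension (a CSV file, for example, is often reported as `text/plain`), so treat the result as a hint rather than a guarantee of what the extractor will accept.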

Installation

composer require xatham/text-extraction

Usage

/**
 * Extract text from PDF files only, with OCR disabled.
 */
$textExtractor = (new TextExtractionBuilder())->buildTextExtractor(
    [
        'withOcr' => false,
        'validMimeTypes' => ['application/pdf'],
    ],
);

$target = dirname(__DIR__) . '/examples/sample.pdf';
$plainTextDocument = $textExtractor->extractByFilePath($target);
if ($plainTextDocument === null) {
    exit('Could not extract any data');
}
$texts = $plainTextDocument->getTextItems();

foreach ($texts as $text) {
    var_dump($text);
}

License

text-extraction is licensed under the MIT License.