xatham / text-extraction
Simple text extraction for multiple file types
0.0.2
2021-09-25 19:25 UTC
Requires
- php: >=7.4
- ext-fileinfo: *
- ext-imagick: *
- league/flysystem: ^2.0
- phpoffice/phpspreadsheet: ^1.15
- phpoffice/phpword: ^0.17.0 | ^0.18.2
- shuchkin/simplexlsx: ^0.8.19
- smalot/pdfparser: ^0.17.1
- symfony/finder: ^5.2
- thiagoalessio/tesseract_ocr: ^2.9
Requires (Dev)
- friendsofphp/php-cs-fixer: ^2.17
- phpmd/phpmd: ^2.9
- phpspec/prophecy-phpunit: ^2.0
- phpstan/phpstan: ^0.12.62
- phpunit/phpunit: ^9.5
This package is auto-updated.
Last update: 2024-09-26 01:56:23 UTC
README
text-extraction
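About
This PHP library lets you extract plain text from a variety of document types.
The following file MIME types are currently supported for extraction: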
text/plain
text/csv
application/vnd.ms-excel
application/vnd.oasis.opendocument.text
application/pdf
application/msword
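As a rough sketch, a file can be checked against the list above before it is handed to the extractor. The helper name `isSupportedForExtraction` and the constant `SUPPORTED_MIME_TYPES` are hypothetical and not part of this package. The sketch uses only PHP's `finfo` class from ext-fileinfo, and how the library itself detects MIME types is not described here.

```php
<?php

declare(strict_types=1);

// MIME types copied from the supported list in this README.
const SUPPORTED_MIME_TYPES = [
    'text/plain',
    'text/csv',
    'application/vnd.ms-excel',
    'application/vnd.oasis.opendocument.text',
    'application/pdf',
    'application/msword',
];

/**
 * Hypothetical helper (not part of xatham/text-extraction): detects the
 * MIME type of a file with ext-fileinfo and reports whether that type
 * appears in the supported list above.
 */
function isSupportedForExtraction(string $path): bool
{
    // A missing path or a directory can never be extracted.
    if (!is_file($path)) {
        return false;
    }

    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mimeType = $finfo->file($path);

    // finfo::file() returns false when detection fails.
    return $mimeType !== false
        && in_array($mimeType, SUPPORTED_MIME_TYPES, true);
}
```

Detection is based on file content rather than the file extension, so a mislabeled file may be reported differently than its name suggests.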
Installation
composer require xatham/text-extraction
Usage
/**
 * Extracting only pdf files, without ocr capturing
 */
$textExtractor = (new TextExtractionBuilder())->buildTextExtractor(
    [
        'withOcr' => false,
        'validMimeTypes' => ['application/pdf'],
    ],
);

$target = dirname(__DIR__) . '/examples/sample.pdf';
$plainTextDocument = $textExtractor->extractByFilePath($target);

if ($plainTextDocument === null) {
    exit('Could not extract any data');
}

$texts = $plainTextDocument->getTextItems();
foreach ($texts as $text) {
    var_dump($text);
}
License
text-extraction is licensed under the MIT License.