ahmedghanem00/tesseract-ocr

A PHP wrapper for the Tesseract-OCR binary

1.0.11 2024-06-07 10:57 UTC

This package is auto-updated.

Last update: 2024-09-07 11:41:36 UTC


README

A PHP wrapper for the Tesseract-OCR binary.

Originally inspired by ddeboer/tesseract, with new features and several improvements added.

Installation

$ composer require ahmedghanem00/tesseract-ocr

Usage

If tesseract is on your PATH, you can create an instance directly:

$tesseract = new \ahmedghanem00\TesseractOCR\Tesseract();

Otherwise, pass the location of the binary:

$tesseract = new \ahmedghanem00\TesseractOCR\Tesseract("path/to/binary/location");
# OR, if you already have an instance
$tesseract->setBinaryPath("path/to/binary/location");

Specify a timeout for the tesseract process

$tesseract = new \ahmedghanem00\TesseractOCR\Tesseract(processTimeout: 3);
# OR
$tesseract->setProcessTimeout(2.5);

Specify a custom tessdata-dir

$tesseract->setTessDataDirPath("path/to/data/dir");

Reset tessdata-dir to the default

$tesseract->resetTessDataDirPath();

Get the version of the binary

$version = $tesseract->getVersion();

Get all supported languages

$languages = $tesseract->getSupportedLanguages();

Run OCR on an image

$result = $tesseract->recognize("test.png");
## OR
$result = $tesseract->recognize("https://example.com/test.png");
## etc.

Thanks to the Intervention/image package, the recognize method accepts images from several sources:

- Path of the image in filesystem.
- URL of an image (allow_url_fopen must be enabled).
- Binary image data.
- Data-URL encoded image data.
- Base64 encoded image data.
- PHP resource of type gd
- Imagick instance
- Intervention\Image\Image instance
- SplFileInfo instance (To handle Laravel file uploads via Symfony\Component\HttpFoundation\File\UploadedFile)
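
As a rough illustration of the Base64 and Data-URL forms listed above, the sketch below builds both from raw image bytes using only PHP built-ins. The helper names toBase64 and toDataUrl are hypothetical and not part of this package; the resulting strings are the kind of value that could then be passed to recognize.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers for illustration only; they are NOT part of the package.

/** Encode raw image bytes as a plain Base64 string. */
function toBase64(string $imageBytes): string
{
    return base64_encode($imageBytes);
}

/** Wrap raw image bytes in a Data-URL; the MIME type is supplied by the caller. */
function toDataUrl(string $imageBytes, string $mimeType = "image/png"): string
{
    return "data:" . $mimeType . ";base64," . base64_encode($imageBytes);
}

// Stand-in bytes (just the 8-byte PNG signature); real code would read a full
// image, for example with file_get_contents("test.png").
$bytes = "\x89PNG\r\n\x1a\n";

$base64  = toBase64($bytes);
$dataUrl = toDataUrl($bytes);

// Either string could then be handed to the wrapper, e.g.:
// $result = $tesseract->recognize($dataUrl);
```

The MIME type is left to the caller because the sketch does not inspect the bytes; real code may want to detect it instead of assuming image/png.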

Specify the language(s)

$result = $tesseract->recognize("test.png", langs: ["eng", "ara"]);

Specify the page segmentation mode (PSM)

use ahmedghanem00\TesseractOCR\Enum\PSM;

# using PSM enum
$result = $tesseract->recognize("test.png", psm: PSM::SINGLE_BLOCK);
# OR pass the numeric id directly
$result = $tesseract->recognize("test.png", psm: 3);

Specify the OCR engine mode (OEM)

use ahmedghanem00\TesseractOCR\Enum\OEM;

# using OEM enum
$result = $tesseract->recognize("test.png", oem: OEM::LEGACY_WITH_LSTM);
# OR pass the numeric id directly
$result = $tesseract->recognize("test.png", oem: 3);

Specify the DPI of the input image

$result = $tesseract->recognize("test.png", dpi: 200);

Make the recognize method output a searchable PDF instead of raw text

$pdfBinaryData = $tesseract->recognize("test.png", outputAsPDF: true);

file_put_contents("result.pdf", $pdfBinaryData);

Specify a words file or a patterns file

$result = $tesseract->recognize("test.png", wordsFilePath: "/path/to/file");
# OR
$result = $tesseract->recognize("test.png", patternsFilePath: "/path/to/file");

Set config parameters

use ahmedghanem00\TesseractOCR\ConfigBag;

$config = ConfigBag::new()
    ->setParameter("tessedit_char_whitelist", "abcrety")
    ->setParameter("textord_pitch_range", 3);

$result = $tesseract->recognize("test.png", config: $config);

You can also run tesseract --print-parameters to see the list of available config parameters.
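
To work with that parameter list from PHP, the output can be parsed into an array. The sketch below assumes each line of tesseract --print-parameters has the form name, tab, value, tab, description; this should be checked against your installed tesseract version. The function name parseTesseractParameters is hypothetical and not part of this package, and the sample text is made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the package): parse text in the assumed
 * "name<TAB>value<TAB>description" format into
 * [name => ["value" => ..., "description" => ...]].
 */
function parseTesseractParameters(string $output): array
{
    $parameters = [];
    foreach (preg_split('/\R/', $output) as $line) {
        $fields = explode("\t", $line, 3);
        if (count($fields) < 2 || $fields[0] === "") {
            continue; // skips the header line and blank lines
        }
        $parameters[$fields[0]] = [
            "value" => $fields[1],
            "description" => $fields[2] ?? "",
        ];
    }
    return $parameters;
}

// Made-up sample; real input could come from
// shell_exec("tesseract --print-parameters").
$sample = "Tesseract parameters:\n"
    . "textord_pitch_range\t2\tSample description\n"
    . "tessedit_char_whitelist\t\tSample description\n";

$parameters = parseTesseractParameters($sample);
```

The parsed names can then be checked before they are passed to ConfigBag::setParameter, so that a misspelled parameter is caught early.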

License

The package is licensed under the MIT License. For more information, see the license file.