ahmedghanem00 / tesseract-ocr
A PHP wrapper for the Tesseract-OCR binary
1.0.11
2024-06-07 10:57 UTC
Requires
- php: ^8.2
- intervention/image: ^2.7.0
- symfony/process: ^6.0 || ^7.0
- symfony/string: ^6.0 || ^7.0
Requires (Dev)
- ext-gd: *
- ext-imagick: *
- friendsofphp/php-cs-fixer: ^3.54
- phpstan/extension-installer: ^1.3
- phpstan/phpstan: 1.10.x-dev
- phpunit/phpunit: dev-main
- smalot/pdfparser: dev-master
This package is auto-updated.
Last update: 2024-09-07 11:41:36 UTC
README
A PHP wrapper for the Tesseract-OCR binary.
Originally inspired by ddeboer/tesseract, with new features and several improvements added.
Installation
$ composer require ahmedghanem00/tesseract-ocr
Usage
If tesseract is on your PATH, you can create an instance directly:
$tesseract = new \ahmedghanem00\TesseractOCR\Tesseract();
Otherwise, pass the location of the binary:
$tesseract = new \ahmedghanem00\TesseractOCR\Tesseract("path/to/binary/location");
# OR, if you already have an instantiated object
$tesseract->setBinaryPath("path/to/binary/location");
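When it is not known in advance whether tesseract is on the PATH, the two cases above can be combined. The following is a minimal sketch, not part of the package: findBinary is a hypothetical helper that scans a PATH-style string for an executable file, and the fallback location is the same placeholder used above.

```php
<?php
// Hypothetical helper (not provided by the package): search a PATH-style
// string for an executable file and return its full path, or null if absent.
function findBinary(string $name, string $pathList): ?string
{
    foreach (explode(PATH_SEPARATOR, $pathList) as $dir) {
        if ($dir === '') {
            continue; // ignore empty PATH entries
        }
        $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Usage sketch: prefer the binary found on PATH, else an explicit location.
// $path = findBinary('tesseract', (string) getenv('PATH')) ?? 'path/to/binary/location';
// $tesseract = new \ahmedghanem00\TesseractOCR\Tesseract($path);
```

On Windows the file name would normally need the .exe suffix, which this sketch does not add.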
Specifying a timeout for the tesseract process
$tesseract = new \ahmedghanem00\TesseractOCR\Tesseract(processTimeout: 3);
# OR
$tesseract->setProcessTimeout(2.5);
Specifying a custom tessdata-dir
$tesseract->setTessDataDirPath("path/to/data/dir");
Resetting tessdata-dir to its default
$tesseract->resetTessDataDirPath();
Getting the version of the binary
$version = $tesseract->getVersion();
Getting all supported languages
$languages = $tesseract->getSupportedLanguages();
Running OCR on an image
$result = $tesseract->recognize("test.png");
## OR
$result = $tesseract->recognize("https://example.com/test.png");
## etc.
Thanks to the Intervention/image package, the recognize method accepts images from a variety of sources:
- Path of the image in filesystem.
- URL of an image (allow_url_fopen must be enabled).
- Binary image data.
- Data-URL encoded image data.
- Base64 encoded image data.
- PHP resource of type gd
- Imagick instance
- Intervention\Image\Image instance
- SplFileInfo instance (To handle Laravel file uploads via Symfony\Component\HttpFoundation\File\UploadedFile)
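As an illustration of the in-memory sources in the list above, the sketch below builds a data URL from raw image bytes. toDataUrl is a hypothetical helper, not part of the package, and the commented calls only assume that recognize accepts the source types listed above.

```php
<?php
// Hypothetical helper: wrap raw image bytes in a Data-URL string.
function toDataUrl(string $binary, string $mimeType = 'image/png'): string
{
    return 'data:' . $mimeType . ';base64,' . base64_encode($binary);
}

// Usage sketch, one call per source type:
// $binary = file_get_contents('test.png');
// $tesseract->recognize($binary);                      // binary image data
// $tesseract->recognize(base64_encode($binary));       // Base64 encoded data
// $tesseract->recognize(toDataUrl($binary));           // Data-URL encoded data
// $tesseract->recognize(new SplFileInfo('test.png'));  // SplFileInfo instance
```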
Specifying language(s)
$result = $tesseract->recognize("test.png", langs: ["eng", "ara"]);
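Requesting a language whose trained data is not installed is a common source of failures, so it can help to check the request first. The sketch below assumes that getSupportedLanguages returns a flat array of language codes such as "eng"; missingLanguages is a hypothetical helper, not part of the package.

```php
<?php
// Hypothetical helper: return the requested language codes that are absent
// from the list of installed languages.
function missingLanguages(array $requested, array $supported): array
{
    return array_values(array_diff($requested, $supported));
}

// Usage sketch (assumes getSupportedLanguages() returns codes like "eng"):
// $langs = ['eng', 'ara'];
// $missing = missingLanguages($langs, $tesseract->getSupportedLanguages());
// if ($missing !== []) {
//     throw new RuntimeException('Missing languages: ' . implode(', ', $missing));
// }
// $result = $tesseract->recognize('test.png', langs: $langs);
```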
Specifying the page segmentation mode (PSM)
use ahmedghanem00\TesseractOCR\Enum\PSM;

# using the PSM enum
$result = $tesseract->recognize("test.png", psm: PSM::SINGLE_BLOCK);
# OR by using the id directly
$result = $tesseract->recognize("test.png", psm: 3);
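For reference when passing a numeric id, the table below lists the ids printed by tesseract --help-psm. In that numbering, 3 is fully automatic segmentation and 6 is a single uniform block of text, so the two calls above probably select different modes. How the package's PSM enum maps onto these ids is not shown here; PSM_MODES and describePsm are hypothetical names used only for this sketch.

```php
<?php
// Page segmentation mode ids as listed by `tesseract --help-psm`.
const PSM_MODES = [
    0  => 'Orientation and script detection (OSD) only',
    1  => 'Automatic page segmentation with OSD',
    2  => 'Automatic page segmentation, no OSD or OCR',
    3  => 'Fully automatic page segmentation, no OSD (default)',
    4  => 'Single column of text of variable sizes',
    5  => 'Single uniform block of vertically aligned text',
    6  => 'Single uniform block of text',
    7  => 'Single text line',
    8  => 'Single word',
    9  => 'Single word in a circle',
    10 => 'Single character',
    11 => 'Sparse text, in no particular order',
    12 => 'Sparse text with OSD',
    13 => 'Raw line, bypassing Tesseract-specific hacks',
];

// Hypothetical helper: describe a PSM id or reject an unknown one.
function describePsm(int $id): string
{
    return PSM_MODES[$id] ?? throw new InvalidArgumentException("Unknown PSM id: {$id}");
}
```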
Specifying the OCR engine mode (OEM)
use ahmedghanem00\TesseractOCR\Enum\OEM;

# using the OEM enum
$result = $tesseract->recognize("test.png", oem: OEM::LEGACY_WITH_LSTM);
# OR by using the id directly
$result = $tesseract->recognize("test.png", oem: 3);
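The numeric engine ids follow tesseract --help-oem. In that numbering, 2 combines the legacy and LSTM engines while 3 lets Tesseract pick a default, so the id in the second call above is probably not the id behind the enum case in the first. OEM_MODES and describeOem are hypothetical names for this sketch, and the enum's backing values are not shown here.

```php
<?php
// OCR engine mode ids as listed by `tesseract --help-oem`.
const OEM_MODES = [
    0 => 'Legacy engine only',
    1 => 'Neural nets LSTM engine only',
    2 => 'Legacy + LSTM engines',
    3 => 'Default, based on what is available',
];

// Hypothetical helper: describe an OEM id or reject an unknown one.
function describeOem(int $id): string
{
    return OEM_MODES[$id] ?? throw new InvalidArgumentException("Unknown OEM id: {$id}");
}
```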
Specifying the DPI of the input image
$result = $tesseract->recognize("test.png", dpi: 200);
Making the recognize method output a searchable PDF instead of plain text
$pdfBinaryData = $tesseract->recognize("test.png", outputAsPDF: true);
file_put_contents("result.pdf", $pdfBinaryData);
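Before writing the result to disk, it can be worth checking that the returned data really is a PDF and that the write succeeded. A minimal sketch, in which savePdf is a hypothetical helper: it relies only on the fact that PDF files begin with the bytes %PDF- and that file_put_contents returns false on failure.

```php
<?php
// Hypothetical helper: write $data to $path only if it starts with the PDF
// magic bytes, and fail loudly if the file cannot be written.
function savePdf(string $data, string $path): void
{
    if (!str_starts_with($data, '%PDF-')) {
        throw new UnexpectedValueException('Data does not look like a PDF document.');
    }
    if (file_put_contents($path, $data) === false) {
        throw new RuntimeException("Could not write {$path}");
    }
}

// Usage sketch:
// savePdf($tesseract->recognize('test.png', outputAsPDF: true), 'result.pdf');
```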
Specifying a words file or a patterns file
$result = $tesseract->recognize("test.png", wordsFilePath: "/path/to/file");
# OR
$result = $tesseract->recognize("test.png", patternsFilePath: "/path/to/file");
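A Tesseract user-words file is plain text with one word per line, so it can be generated on the fly. The sketch below writes such a file to a temporary location; writeWordsFile is a hypothetical helper, not part of the package.

```php
<?php
// Hypothetical helper: write a word list to a temporary file, one word per
// line, and return the path of that file.
function writeWordsFile(array $words): string
{
    $path = tempnam(sys_get_temp_dir(), 'words_');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }
    file_put_contents($path, implode("\n", $words) . "\n");
    return $path;
}

// Usage sketch:
// $wordsFile = writeWordsFile(['Tesseract', 'OCR']);
// $result = $tesseract->recognize('test.png', wordsFilePath: $wordsFile);
// unlink($wordsFile); // remove the temporary file afterwards
```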
Setting config parameters
use ahmedghanem00\TesseractOCR\ConfigBag;

$config = ConfigBag::new()
    ->setParameter("tessedit_char_whitelist", "abcrety")
    ->setParameter("textord_pitch_range", 3);

$result = $tesseract->recognize("test.png", config: $config);
You can also run tesseract --print-parameters to see the list of available config parameters.
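To look up parameter names programmatically, the output of that command can be parsed. The sketch below assumes each parameter is printed as a tab-separated line of name, current value, and description, preceded by a header line. That layout is an assumption about the CLI output and may differ between Tesseract versions. parseParameters is a hypothetical helper.

```php
<?php
// Hypothetical helper: parse `tesseract --print-parameters` output into
// [name => ['value' => ..., 'description' => ...]].
// Assumes tab-separated "name<TAB>value<TAB>description" lines.
function parseParameters(string $output): array
{
    $params = [];
    foreach (preg_split('/\R/', $output) as $line) {
        $fields = explode("\t", $line, 3);
        if (count($fields) < 2) {
            continue; // header line, blank line, or anything without tabs
        }
        $params[$fields[0]] = [
            'value' => $fields[1],
            'description' => $fields[2] ?? '',
        ];
    }
    return $params;
}

// Usage sketch:
// $params = parseParameters((string) shell_exec('tesseract --print-parameters'));
// var_dump(isset($params['tessedit_char_whitelist']));
```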