README

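Extractor: an AI-powered data extraction library for Laravel.

Using OpenAI, easily extract structured data from a variety of sources, including images, PDFs, and emails.

Features

A convenient wrapper around the OpenAI Chat and Completion endpoints.
Supports multiple input formats such as plain text, PDF, RTF, images, Word documents, and web content.
Includes a flexible Field Extractor that can extract arbitrary data without writing custom logic.
Can return a regular array or a spatie/laravel-data object.
Integrates with Textract for OCR.
Uses JSON mode with the latest GPT-3.5 and GPT-4 models.

Example

Example code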

<?php

use TheAi\Extractor\Facades\Extractor;
use TheAi\Extractor\Facades\Text;
use Illuminate\Support\Facades\Storage;

$image = Storage::get("restaurant_menu.png");

// Extract text from images
$textFromImage = Text::textract($image);

// Extract structured data from plain text
$menu = Extractor::fields($textFromImage,
    fields: [
        'restaurantName',
        'phoneNumber',
        'dishes' => [
            'name' => 'name of the dish',
            'description' => 'description of the dish',
            'price' => 'price of the dish as a number',
        ],
    ],
    model: "gpt-3.5-turbo-1106",
    maxTokens: 4000,
);

Installation

Install the package via Composer:

composer require batnieluyo/extractor

Publish the config file:

php artisan vendor:publish --tag="extractor-config"

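You can find all the configuration options in the config file.

Since this package depends on the OpenAI Laravel Package, you also need to publish its config and add OPENAI_API_KEY to your .env file: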

php artisan vendor:publish --provider="OpenAI\Laravel\ServiceProvider"

OPENAI_API_KEY="your-key-here"

# Optional: Set request timeout (default: 30s).
OPENAI_REQUEST_TIMEOUT=60

Usage

Extracting plain text from documents

use TheAi\Extractor\Facades\Text;

$textPlainText = Text::text(file_get_contents('./data.txt'));
$textPdf = Text::pdf(file_get_contents('./data.pdf'));
$textImageOcr = Text::textract(file_get_contents('./data.jpg'));
$textPdfOcr = Text::textractUsingS3Upload(file_get_contents('./data.pdf'));
$textWord = Text::word(file_get_contents('./data.doc'));
$textWeb = Text::web('https://example.com');
$textHtml = Text::html(file_get_contents('./data.html'));
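When files of mixed types arrive from uploads or a queue, it can help to pick the Text method from the file extension. The sketch below is a minimal plain-PHP illustration: textMethodFor is a hypothetical helper (not part of the package), and the extension mapping only covers the file types shown in the snippet above.

```php
<?php

/**
 * Hypothetical helper (not part of the package): picks the name of the
 * Text facade method to call based on a file's extension.
 */
function textMethodFor(string $path): string
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return match ($extension) {
        'txt' => 'text',
        'pdf' => 'pdf',
        'jpg', 'png' => 'textract',
        'doc' => 'word',
        'html' => 'html',
        default => throw new InvalidArgumentException("Unsupported file type: {$extension}"),
    };
}

echo textMethodFor('./data.pdf'), "\n"; // pdf

// Inside a Laravel app the result could then be used as:
// $method = textMethodFor($path);
// $text = Text::$method(file_get_contents($path));
```

Unknown extensions throw instead of silently falling back to OCR, which keeps unexpected Textract calls out of the pipeline.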

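Extracting structured data

The Extractor package includes a set of pre-built extractors designed to simplify extracting structured data from various types of text. Each extractor is optimized for a specific data format, making it easy to work with different kinds of information. Every included extractor has a brief description and a convenient shortcut method.

These extractors are available out of the box and provide a convenient way to extract specific types of structured data from text. You can use the shortcut methods to easily access each extractor's functionality.

Using the Field Extractor

The Field Extractor is a good fit if you don't need much custom logic or validation and just want to pull some structured data out of a piece of text.

Here is an example of extracting information from a CV. Note that you can provide descriptions to guide the AI model, and that nested items are supported (useful for lists of sub-items such as work history, line items, or product reviews):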

$sample = Text::pdf(file_get_contents(__DIR__.'/../samples/helge-cv.pdf'));

$data = Extractor::fields($sample,
    fields: [
        'name' => 'the name of the candidate',
        'email',
        'certifications' => 'list of certifications, if any',
        'workHistory' => [
            'companyName',
            'from' => 'Y-m-d if available, Year only if not, null if missing',
            'to' => 'Y-m-d if available, Year only if not, null if missing',
            'text',
        ],
    ],
    model: Engine::GPT_3_TURBO_1106,
);
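The fields array above mixes three shapes: bare names ('email'), name => description pairs, and nested arrays for lists of sub-items. The plain-PHP sketch below shows one way to tell these shapes apart by walking the array. describeFields is a hypothetical helper written for illustration; it is not the package's implementation, and the wording of the generated lines is an assumption.

```php
<?php

/**
 * Hypothetical helper (not part of the package): renders a field
 * specification as an indented bullet list, one line per field.
 */
function describeFields(array $fields, int $depth = 0): array
{
    $lines = [];
    $indent = str_repeat('  ', $depth);

    foreach ($fields as $key => $value) {
        if (is_int($key)) {
            // Bare entry such as 'email': a field name with no description.
            $lines[] = "{$indent}- {$value}";
        } elseif (is_array($value)) {
            // Nested entry such as 'workHistory' => [...]: a list of sub-items.
            $lines[] = "{$indent}- {$key} (list of items with):";
            array_push($lines, ...describeFields($value, $depth + 1));
        } else {
            // Keyed entry such as 'name' => '...': field name plus description.
            $lines[] = "{$indent}- {$key}: {$value}";
        }
    }

    return $lines;
}

$fields = [
    'name' => 'the name of the candidate',
    'email',
    'workHistory' => [
        'companyName',
        'from' => 'Y-m-d if available, Year only if not, null if missing',
    ],
];

echo implode("\n", describeFields($fields)), "\n";
```

PHP assigns integer keys to the bare entries, which is why checking is_int($key) is enough to separate them from the described fields.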

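Using GPT-4-Vision with Extractor

Note: This feature is still under development.

The Extractor package also integrates with OpenAI's new Vision API, using the gpt-4-vision-preview model to extract structured data from images. This lets you analyze and interpret visual content, whether that means reading text from an image, extracting data from a chart, or understanding a complex visual scene.

How to use ImageContent with OpenAI's Vision API

To use the Vision features in Extractor, you need to provide an image as input. This can be done in a few different ways:

Using a file path: load the image from a file path.
Using raw image data: use the raw bytes of the image, for example from an uploaded file.
Using an image URL: load the image directly from a URL.

Here is how to use each method:

Using a file path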

use TheAi\Extractor\Text\ImageContent;

$imagePath = __DIR__ . '/../samples/sample-image.jpg';
$imageContent = ImageContent::file($imagePath);

Using raw image data

use TheAi\Extractor\Text\ImageContent;

$rawImageData = file_get_contents(__DIR__ . '/../samples/sample-image.jpg');
$imageContent = ImageContent::raw($rawImageData);

Using an image URL

use TheAi\Extractor\Text\ImageContent;

$imageUrl = 'https://example.com/sample-image.jpg';
$imageContent = ImageContent::url($imageUrl);
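For background, OpenAI's Vision endpoint accepts an image either as a regular URL or as a base64 data URL. The plain-PHP sketch below shows how raw image bytes can be turned into such a data URL. toDataUrl is a hypothetical helper, and whether ImageContent::raw or ImageContent::file do something similar internally is an assumption.

```php
<?php

/**
 * Hypothetical helper (not part of the package): encodes raw image bytes
 * as a base64 data URL.
 */
function toDataUrl(string $rawImageData, string $mimeType = 'image/jpeg'): string
{
    return 'data:' . $mimeType . ';base64,' . base64_encode($rawImageData);
}

// Stand-in bytes; in practice these would come from file_get_contents() or an upload.
$rawImageData = "\xFF\xD8\xFF\xE0";

echo toDataUrl($rawImageData), "\n"; // data:image/jpeg;base64,/9j/4A==
```

Base64 adds roughly a third to the payload size, so passing a URL is usually the lighter option when the image is already publicly reachable.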

Extracting data from images with OpenAI's Vision API

Once your ImageContent object is ready, you can pass it to the Extractor::fields method to extract structured data using OpenAI's Vision API. For example:

use TheAi\Extractor\Facades\Extractor;
use TheAi\Extractor\Text\ImageContent;

$imageContent = ImageContent::file(__DIR__ . '/../samples/product-catalog.jpg');

$data = Extractor::fields(
    $imageContent,
    fields: [
        'productName',
        'price',
        'description',
    ],
    model: Engine::GPT_4_VISION,
);

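Creating custom extractors

Custom extractors let you tailor data extraction to specific needs. Here is how to create and use one, using a job posting extractor as the example.

Implementing a custom extractor

Create a new class that extends the Extractor class to serve as your custom extractor. In this example, we create a JobPostingExtractor that extracts key information from job postings: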

<?php

namespace App\Extractors;

use TheAi\Extractor\Extraction\Extractor;
use TheAi\Extractor\Text\TextContent;

class JobPostingExtractor extends Extractor
{
    public function prompt(string|TextContent $input): string
    {
        $outputKey = $this->expectedOutputKey();

        return "Extract the following fields from the job posting below:"
            . "\n- jobTitle: The title or designation of the job."
            . "\n- companyName: The name of the company or organization posting the job."
            . "\n- location: The geographical location or workplace where the job is based."
            . "\n- jobType: The nature of employment (e.g., Full-time, Part-time, Contract)."
            . "\n- description: A brief summary or detailed description of the job."
            . "\n- applicationDeadline: The closing date for applications, if specified."
            . "\n\nThe output should be a JSON object under the key '{$outputKey}'."
            . "\n\nINPUT STARTS HERE\n\n$input\n\nOUTPUT IN JSON:\n";
    }

    public function expectedOutputKey(): string
    {
        return 'extractedData';
    }
}

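Note: It is recommended to include an instruction stating which $outputKey key the data should be nested under, because OpenAI's JSON mode tends to place everything under a single root key. Overriding the expectedOutputKey() method tells the base Extractor class which key to read the data from.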
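To make the root-key idea concrete, the plain-PHP sketch below decodes a JSON response and reads the payload from under the expected key. readUnderKey and the sample response are hypothetical and only illustrate the shape of the data; the base Extractor class handles this step itself.

```php
<?php

/**
 * Hypothetical helper (not part of the package): decodes a JSON response
 * and returns the data nested under the expected root key.
 */
function readUnderKey(string $json, string $outputKey): array
{
    // JSON_THROW_ON_ERROR turns malformed JSON into a JsonException.
    $decoded = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($decoded) || !array_key_exists($outputKey, $decoded)) {
        throw new UnexpectedValueException("Missing root key '{$outputKey}' in response.");
    }

    return (array) $decoded[$outputKey];
}

// Sample response shaped the way the JobPostingExtractor prompt asks for.
$response = '{"extractedData": {"jobTitle": "Backend Developer", "companyName": "Acme"}}';

$data = readUnderKey($response, 'extractedData');

echo $data['jobTitle'], "\n"; // Backend Developer
```

If the prompt and expectedOutputKey() disagree on the key name, a lookup like this is where the mismatch would surface.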

Registering the custom extractor

After defining your custom extractor, register it with the main Extractor class using the extend method:

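use App\Extractors\JobPostingExtractor;
use TheAi\Extractor\Extractor;

// Register the custom extractor under the "job-posting" name
Extractor::extend("job-posting", fn() => new JobPostingExtractor());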

Using the custom extractor

Once registered, you can use your custom extractor just like the built-in ones. Here is an example of how to use the JobPostingExtractor:

use TheAi\Extractor\Facades\Text;
use TheAi\Extractor\Extractor;

$jobPostingContent = Text::web("https://www.finn.no/job/fulltime/ad.html?finnkode=329443482");

$extractedData = Extractor::extract('job-posting', $jobPostingContent);
// Or you can specify the class-string instead
// ex: Extractor::extract(JobPostingExtractor::class, $jobPostingContent);

// $extractedData now contains structured information from the job posting

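With the JobPostingExtractor, you can parse job postings and extract their key information, structured in a way that is easy to manage and use within your Laravel application.

Adding validation to the job posting extractor

To ensure the integrity of the extracted data, you can add validation rules to the job posting extractor. This is done by using the HasValidation trait and defining the validation rules in the rules method: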

<?php

namespace App\Extractors;

use TheAi\Extractor\Extraction\Concerns\HasValidation;
use TheAi\Extractor\Extraction\Extractor;

class JobPostingExtractor extends Extractor
{
    use HasValidation;

    public function rules(): array
    {
        return [
            'jobTitle' => ['required', 'string'],
            'companyName' => ['required', 'string'],
            'location' => ['required', 'string'],
            'jobType' => ['required', 'string'],
            'salary' => ['required', 'numeric'],
            'description' => ['required', 'string'],
            'applicationDeadline' => ['required', 'date']
        ];
    }
}

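This ensures that each key field in the job posting data meets the specified criteria, which improves the reliability of your data extraction.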
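To see what these rules catch, the plain-PHP sketch below runs simplified stand-ins for the required, string, numeric, and date rules against a sample payload. failedRules and the sample data are hypothetical; Laravel's own rules are stricter (for example, its date rule also checks that the calendar date exists), so treat this as an approximation only.

```php
<?php

/**
 * Hypothetical helper: simplified stand-ins for four Laravel validation
 * rules. Returns the first failing rule for each field.
 */
function failedRules(array $data, array $rules): array
{
    $checks = [
        'required' => fn ($v) => $v !== null && $v !== '' && $v !== [],
        'string' => fn ($v) => is_string($v),
        'numeric' => fn ($v) => is_numeric($v),
        'date' => fn ($v) => is_string($v) && strtotime($v) !== false,
    ];

    $failed = [];
    foreach ($rules as $field => $fieldRules) {
        $value = $data[$field] ?? null;
        foreach ($fieldRules as $rule) {
            if (!$checks[$rule]($value)) {
                // Stop at the first failing rule for this field.
                $failed[$field] = $rule;
                break;
            }
        }
    }

    return $failed;
}

$rules = [
    'jobTitle' => ['required', 'string'],
    'salary' => ['required', 'numeric'],
    'applicationDeadline' => ['required', 'date'],
];

// Sample payload in which the model returned free text for the salary.
$extracted = [
    'jobTitle' => 'Backend Developer',
    'salary' => 'competitive',
    'applicationDeadline' => '2024-03-01',
];

echo json_encode(failedRules($extracted, $rules)), "\n"; // {"salary":"numeric"}
```

A field marked required also fails when the prompt never asks the model for it, so the rule keys should line up with the fields named in the prompt.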

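Extracting data into a DTO

Extractor can integrate with spatie/laravel-data to convert the extracted data into a Data Transfer Object (DTO) of your choice. To do so, add the HasDto trait to your extractor and specify the DTO class in the dataClass method: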

<?php

namespace App\Extractors;

use DateTime;
use App\Extractors\JobPostingDto;
use TheAi\Extractor\Extraction\Concerns\HasDto;
use TheAi\Extractor\Extraction\Extractor;
use Spatie\LaravelData\Data;

class JobPostingDto extends Data
{
    public function __construct(
        public string $jobTitle,
        public string $companyName,
        public string $location,
        public string $jobType,
        public int|float $salary,
        public string $description,
        public DateTime $applicationDeadline
    ) {
    }
}

class JobPostingExtractor extends Extractor
{
    use HasDto;

    public function dataClass(): string
    {
        return JobPostingDto::class;
    }

    public function isCollection(): bool
    {
        return false; 
    }
}
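The mapping from extracted fields to DTO properties relies on matching names. The plain-PHP sketch below illustrates that idea without any framework: the array keys line up with the constructor parameters and are spread in as named arguments (PHP 8.1+). JobSummary and the sample values are hypothetical; spatie/laravel-data performs its own hydration and casting, so this is not how the package does it.

```php
<?php

// Plain-PHP stand-in for a DTO (the real one extends Spatie\LaravelData\Data).
final class JobSummary
{
    public function __construct(
        public string $jobTitle,
        public string $companyName,
        public DateTime $applicationDeadline,
    ) {
    }
}

// Extracted data as it might come back from an extractor (hypothetical values).
$extracted = [
    'jobTitle' => 'Backend Developer',
    'companyName' => 'Acme',
    'applicationDeadline' => '2024-03-01',
];

// Cast the date string first, then spread the array into named arguments.
$extracted['applicationDeadline'] = new DateTime($extracted['applicationDeadline']);
$dto = new JobSummary(...$extracted);

echo $dto->jobTitle, ' @ ', $dto->companyName, "\n"; // Backend Developer @ Acme
```

An extracted key with no matching parameter makes the constructor call fail, which is a useful early signal that the prompt and the DTO have drifted apart.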

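Configuring OCR with AWS Textract

To extract text from large images and multi-page PDFs with AWS Textract, the package needs to upload the file to S3 and pass the S3 object location to the Textract service.

You therefore need to configure your AWS credentials for the config/extractor.php file, for example through the following .env values: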

TEXTRACT_KEY="your-aws-access-key"
TEXTRACT_SECRET="your-aws-security"
TEXTRACT_REGION="your-textract-region"

# Can be omitted
TEXTRACT_VERSION="2018-06-27"

You also need to configure a separate Textract disk where the files will be stored. Open your config/filesystems.php configuration file and add the following:

'textract' => [
    'driver' => 's3',
    'key' => env('TEXTRACT_KEY'),
    'secret' => env('TEXTRACT_SECRET'),
    'region' => env('TEXTRACT_REGION'),
    'bucket' => env('TEXTRACT_BUCKET'),
],

Make sure the textract_disk setting in config/extractor.php matches the disk name in your filesystems.php configuration. You can change it with the TEXTRACT_DISK value in .env.

return [
    "textract_disk" => env("TEXTRACT_DISK")
];

.env

TEXTRACT_DISK="textract"

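Deleting files after processing them with Textract

Using S3 lifecycle rules

You can configure a lifecycle rule on your S3 bucket to delete files after a specified amount of time. For more information, see the AWS documentation: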

https://repost.aws/knowledge-center/s3-empty-bucket-lifecycle-rule

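Using the `cleanupFileUsing` hook

By default, the package does not delete files that have been uploaded to the Textract S3 bucket. If you want to delete them, you can do so with the TextractUsingS3Upload::cleanupFileUsing(Closure) hook.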

use Illuminate\Support\Facades\Storage;

// Delete the file from the S3 bucket
TextractUsingS3Upload::cleanupFileUsing(function (string $filePath) {
    Storage::disk('textract')->delete($filePath);
});

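Note

Textract is not available in all regions

Q: In which AWS regions is Amazon Textract available? Amazon Textract is currently available in the US East (Northern Virginia), US East (Ohio), US West (Oregon), US West (N. California), AWS GovCloud (US-West), AWS GovCloud (US-East), Canada (Central), EU (Ireland), EU (London), EU (Frankfurt), EU (Paris), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), and Asia Pacific (Mumbai) regions.

See: https://aws.amazon.com/textract/faqs/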

All parameters and what they do

$input (TextContent|string)

The input text or data to be processed. It accepts either a TextContent object or a string.

$model (Model)

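This parameter specifies the OpenAI model used for the extraction.

It accepts a string value. Different models have different speed/accuracy characteristics and use cases; for convenience, most of the accepted models are provided as constants on the Engine class.

Available models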

$maxTokens (int)

The maximum number of tokens the model will process. The default is 2000, which is usually sufficient; you may need to adjust it for very long texts.

$temperature (float)

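Controls the randomness/creativity of the model's output.

Higher values (e.g., 0.8) make the output more random, which is usually undesirable for data extraction. The recommended value is 0.1 or 0.2; values above 0.5 are generally less useful. The default is 0.1.

License

This package is licensed under the MIT License. See the license file for more details.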
