nai-php/naipostagger

A part-of-speech tagger written in PHP.

v0.2 2021-08-22 08:39 UTC

Last update: 2024-09-16 14:08:52 UTC


README

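This is a lightweight, framework-independent, pure-PHP library for part-of-speech tagging. It can be used in chatbots, personal assistants, keyword extraction, and similar tasks. Because it is written in PHP, it integrates easily into existing or new applications and gives them a way to analyze what the user has written.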

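It is based on a lexicon and predefined grammar rules, and it needs no third-party systems, neural networks, machine learning, or resource-hungry models.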
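To make the lexicon-plus-rules idea concrete, here is a minimal sketch of how such a tagger can work in principle. It is not the library's code: the toy lexicon, the rule format, the function name `tagSentence`, and the `DET`, `PRE`, and `UNK` tags are all assumptions made for illustration. Each word first receives its most likely tag from the lexicon, and a second pass applies context rules in the spirit of Brill-style corrections.

```php
<?php
// Hypothetical toy lexicon: each lowercase form maps to its candidate tags,
// most likely first. A real dictionary would be far larger.
$lexicon = [
    'my'   => ['ADJ'],
    'name' => ['NOUN', 'VER'],
    'is'   => ['VER'],
    'the'  => ['DET'],
    'to'   => ['PRE'],
];

// Hypothetical context rules: change tag "from" into tag "to"
// when the previous token carries the tag "prev".
$rules = [
    ['from' => 'NOUN', 'to' => 'VER', 'prev' => 'PRE'],
];

function tagSentence(string $sentence, array $lexicon, array $rules): array
{
    $tokens = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);

    // Pass 1: assign the first (most likely) tag found in the lexicon.
    $tagged = [];
    foreach ($tokens as $token) {
        $key = strtolower($token);
        if (isset($lexicon[$key])) {
            $tag = $lexicon[$key][0];
        } elseif (preg_match('/^[A-Z]/', $token) === 1) {
            $tag = 'NPR'; // unknown capitalized word: guess a proper noun
        } else {
            $tag = 'UNK'; // unknown word
        }
        $tagged[] = ['form' => $token, 'tag' => $tag];
    }

    // Pass 2: apply each context rule, but only if the lexicon allows the new tag.
    for ($i = 1; $i < count($tagged); $i++) {
        $candidates = $lexicon[strtolower($tagged[$i]['form'])] ?? [];
        foreach ($rules as $rule) {
            if ($tagged[$i]['tag'] === $rule['from']
                && $tagged[$i - 1]['tag'] === $rule['prev']
                && in_array($rule['to'], $candidates, true)
            ) {
                $tagged[$i]['tag'] = $rule['to'];
                break;
            }
        }
    }

    return $tagged;
}

foreach (tagSentence('my name is Fred', $lexicon, $rules) as $t) {
    echo $t['form'] . '/' . $t['tag'] . PHP_EOL;
}
// prints: my/ADJ, name/NOUN, is/VER, Fred/NPR (one pair per line)
```

With the same toy data, `tagSentence('to name', $lexicon, $rules)` retags "name" from `NOUN` to `VER`, because the preceding token is tagged `PRE` and the lexicon lists `VER` as a candidate for "name".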

This is the English version. Documentation and a TODO list are coming soon; for more information, visit n-ai.cloud

Accuracy

The table in this section will show results on corpora of different sentence types.

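Installation

  1. In your project folder (for example "myproject"), install this package with composer;

  2. Create a "dictionaries" folder;

  3. Inside the "dictionaries" folder, clone or download the English dictionary repository;

  4. Run this sample script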

<?php

use NaiPosTagger\Pipelines\PipelinePosTagging;
use NaiPosTagger\Models\NaiPosArr;

// Composer autoloader and the package's helper functions
require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/vendor/nai-php/naipostagger/src/Utilities/common_functions_helper.php';

// Path prefix of the dictionaries folder and path to the package sources
define('DICTIONARIES_PATH', __DIR__ . '/dictionaries/dictionaries-');
define('TRAITS_PATH', __DIR__ . '/vendor/nai-php/naipostagger/src/');

$sentence = 'my name is Fred';

$PipelinePosTagging = new PipelinePosTagging();
$PipelinePosTagging->language = 'en';

// tag the sentence
$pos_arr = $PipelinePosTagging->transform($sentence);

// for a clear output, better hide metadata
$pos_arr = NaiPosArr::clearMetadata($pos_arr);

// and further simplify the output
$pos_arr = NaiPosArr::flatPosArr($pos_arr);

// diex() is one of the package's helper functions; it prints the result
diex($pos_arr);

The output will look like this:

Array
(
    [0] => Array
        (
            [form] => .
            [lemma] => .
            [features] => SENT
            [sh-feat] => SENT
            [label] => 
            [rule] => 
            [pos_score] => 0
        )

    [1] => Array
        (
            [form] => my
            [lemma] => my
            [features] => ADJ:pos+m+s
            [sh-feat] => ADJ
            [label] => 
            [rule] => 
            [pos_score] => 0
        )

    [2] => Array
        (
            [form] => name
            [lemma] => name
            [features] => NOUN-m:s
            [sh-feat] => NOUN
            [label] => 
            [rule] => 
            [pos_score] => 0
        )

    [3] => Array
        (
            [form] => is
            [lemma] => is
            [features] => VER:ind+pres+3+s
            [sh-feat] => VER
            [label] => 
            [rule] => 
            [pos_score] => 0
        )

    [4] => Array
        (
            [form] => Fred
            [lemma] => Fred
            [features] => NPR
            [sh-feat] => NPR
            [label] => 
            [rule] => 
            [pos_score] => 0
        )

    [5] => Array
        (
            [form] => .
            [lemma] => .
            [features] => SENT
            [sh-feat] => SENT
            [label] => 
            [rule] => 
            [pos_score] => 0
        )

)
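As one possible next step, for instance for the keyword-extraction use case, the flattened array can be filtered by its short tag. The sketch below is self-contained: the sample data is copied from the output above and reduced to the `form` and `sh-feat` keys, and the helper name `formsWithTags` is hypothetical, not part of the package.

```php
<?php
// Sample data copied from the example output, reduced to the two keys used here.
$pos_arr = [
    ['form' => '.',    'sh-feat' => 'SENT'],
    ['form' => 'my',   'sh-feat' => 'ADJ'],
    ['form' => 'name', 'sh-feat' => 'NOUN'],
    ['form' => 'is',   'sh-feat' => 'VER'],
    ['form' => 'Fred', 'sh-feat' => 'NPR'],
    ['form' => '.',    'sh-feat' => 'SENT'],
];

/**
 * Hypothetical helper: return the forms of all tokens whose short tag
 * ("sh-feat") is one of the wanted tags, in sentence order.
 */
function formsWithTags(array $posArr, array $wanted): array
{
    $forms = [];
    foreach ($posArr as $token) {
        if (in_array($token['sh-feat'] ?? '', $wanted, true)) {
            $forms[] = $token['form'];
        }
    }
    return $forms;
}

// Keep common nouns and proper nouns as candidate keywords.
$keywords = formsWithTags($pos_arr, ['NOUN', 'NPR']);
print_r($keywords); // the array contains 'name' and 'Fred'
```

Filtering on `sh-feat` rather than `features` keeps the comparison simple, since `features` also carries morphological detail such as `NOUN-m:s`.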

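TODO

  • Find contributors
  • Clean, check, fix, and tag the terms in the dictionaries
  • Clean, check, and fix the Brill rules
  • Add more n-grams
  • Add more tests, especially for the filters
  • Collect and load frill words
  • Better OOP for some classes?
  • Collect synonyms and time expressions in the module for logical analysis (not yet released)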