geocurly/name-splitter

Name splitter tool

0.1 2020-05-29 19:53 UTC

This package is not auto-updated.

Last update: 2024-09-23 03:16:34 UTC


README

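Unfortunately, this tool supports only names written in Cyrillic.

NameSplitter takes an input string containing a person's full name and parses it into a result object with separate name parts.

Usage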

<?php

declare(strict_types=1);

use NameSplitter\NameSplitter;

$splitter = new NameSplitter(['enc' => 'CP1251']);
$result = $splitter->split('Иванов Иван Иванович');
[$surname, $name, $middleName] = [
    $result->getSurname(),
    $result->getName(),
    $result->getMiddleName(),
];

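Quality

NameSplitter is tested on about 13,000 Russian names with an accuracy of 99.65%. Each name is run through several templates, so the resulting number of test cases is 124,283. You can run the tests on your own dataset (add the --verbose option to see the errors per template).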

[aleksandr@aleksandr name-splitter]$ ./bin/name-split-test --file=$(realpath fio.csv)

TESTED TEMPLATES:
%Surname %Name %Middle
%Name %Middle %Surname
%Name %Middle
%Name %Surname
%Surname %Name
%Surname %StrictInitials
%StrictInitials %Surname
%Surname %SplitInitials
%SplitInitials %Surname

ACCURACY: 99.65
COUNT CASE TOTAL: 124283
COUNT CASE PASS:  123848
COUNT CASE ERROR: 435
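
The ACCURACY figure is consistent with the share of passed cases expressed as a percentage. The following is a minimal sketch of that arithmetic. The function name accuracyPercent is hypothetical and is not part of the package.

```php
<?php

declare(strict_types=1);

/**
 * Percentage of passed cases among all tested cases.
 * Hypothetical helper for illustration; not part of NameSplitter.
 */
function accuracyPercent(int $passed, int $total): float
{
    if ($total <= 0) {
        throw new InvalidArgumentException('Total must be positive.');
    }

    return $passed / $total * 100;
}

// Counts taken from the test output above: 123848 + 435 = 124283.
printf("%.2f\n", accuracyPercent(123848, 124283)); // prints 99.65
```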

The fio.csv file contains one name per line, with the fields separated by semicolons:

SomeSurname;SomeName;SomeMiddleName
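
To build such a file from your own data, something like the sketch below may be enough. The sample rows and the helper name buildFioCsv are hypothetical. The sketch also assumes that the test script needs no header row and no quoting, which the format line above suggests but does not state, and it does not address which character encoding the script expects.

```php
<?php

declare(strict_types=1);

// Hypothetical sample data: each row is [surname, name, middle name].
$rows = [
    ['Иванов', 'Иван', 'Иванович'],
    ['Петрова', 'Анна', 'Сергеевна'],
];

/**
 * Render rows as semicolon-separated lines, one name per line.
 * Hypothetical helper for illustration; not part of NameSplitter.
 */
function buildFioCsv(array $rows): string
{
    $lines = [];
    foreach ($rows as $row) {
        if (count($row) !== 3) {
            throw new InvalidArgumentException('Each row needs exactly three fields.');
        }
        $lines[] = implode(';', $row);
    }

    return implode("\n", $lines) . "\n";
}

// Write the dataset next to the script; pass its path via --file.
file_put_contents('fio.csv', buildFioCsv($rows));
```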

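Problems

  • When the surname looks like a middle name (for example, Иван Иванович), the tool cannot recognize templates such as %Name %Surname.
  • When the name being split is not in the dictionary, some templates may not work correctly.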
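
The first problem comes from a real ambiguity in Russian names: a word with a patronymic ending can be either a middle name or a surname. The sketch below illustrates the ambiguity with a simple suffix check. It is an illustrative heuristic with an incomplete suffix list, not the method NameSplitter uses, and looksLikePatronymic is a hypothetical name.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative heuristic only (not taken from NameSplitter):
 * report whether a word ends like a typical Russian patronymic.
 */
function looksLikePatronymic(string $word): bool
{
    // The "u" flag makes PCRE treat the input as UTF-8; "i" ignores case.
    return preg_match('/(ович|евич|ич|овна|евна|ична)$/iu', $word) === 1;
}

// "Иванович" may be a middle name or a surname, so "Иван Иванович"
// fits both %Name %Middle and %Name %Surname.
var_dump(looksLikePatronymic('Иванович')); // bool(true)
var_dump(looksLikePatronymic('Иванов'));   // bool(false)
```

Because both readings are plausible from the text alone, an explicit template for such names is the more reliable way to resolve them.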

Solution

You can pass your own "before" and "after" templates to the splitter:

<?php

declare(strict_types=1);

use NameSplitter\{
    NameSplitter,
    Template\SimpleMatch,
    Contract\TemplateInterface as TPL,
    Contract\StateInterface
};

$before = [
    // For this case, explicitly map the name parts with a template.
    new SimpleMatch([
        TPL::SURNAME => 'Difficult Surname',
        TPL::NAME => 'Difficult Name',
    ]),
    static function (StateInterface $state) {
        // TODO: put your own implementation here and fill in
        // $surname and $name; unset values fall back to null.
        return [
            TPL::SURNAME => $surname ?? null,
            TPL::NAME => $name ?? null,
        ];
    },
];

// Any callable can be used here, as long as it accepts a StateInterface as input.
$after = [];

$splitter = new NameSplitter([], $before, $after);
$result = $splitter->split('Difficult Surname Difficult Name');