devtronic/super-tokenizer

A powerful, dynamic tokenizer written in PHP

1.0.1 2017-02-26 15:28 UTC

This package is auto-updated.

Last update: 2024-09-22 00:09:34 UTC



Super Tokenizer

Super Tokenizer is a highly dynamic, easy-to-use tokenizer written in PHP.

Installation

composer require devtronic/super-tokenizer

Usage

Minimal Tokenizer

<?php

use Devtronic\SuperTokenizer\Tokenizer;

require_once __DIR__ . '/vendor/autoload.php';

$tokenizer = new Tokenizer();

$sample = 'Minimal tokenizer example';

$tokens = $tokenizer->tokenize($sample);
print_r($tokens);

Output

Array
(
    [0] => Array
        (
            [type] => 1
            [value] => Minimal
            [position] => 0
        )

    [1] => Array
        (
            [type] => 1
            [value] => tokenizer
            [position] => 8
        )

    [2] => Array
        (
            [type] => 1
            [value] => example
            [position] => 18
        )
)
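
As a quick sanity check on the output above, the `position` values appear to be zero-based offsets into the input string. The sketch below hard-codes the printed token array, so it runs without the library, and verifies each offset with `substr()`. The helper name `tokenMatchesSource` is hypothetical (not part of the package), and the check assumes single-byte input without escape characters.

```php
<?php

/**
 * Hypothetical helper (not part of the library): checks that a token's
 * value is found in the source string at the token's reported position.
 */
function tokenMatchesSource(string $source, array $token): bool
{
    $slice = substr($source, $token['position'], strlen($token['value']));

    return $slice === $token['value'];
}

$sample = 'Minimal tokenizer example';

// Token array copied from the print_r() output above.
$tokens = [
    ['type' => 1, 'value' => 'Minimal',   'position' => 0],
    ['type' => 1, 'value' => 'tokenizer', 'position' => 8],
    ['type' => 1, 'value' => 'example',   'position' => 18],
];

foreach ($tokens as $token) {
    $status = tokenMatchesSource($sample, $token) ? 'ok' : 'mismatch';
    echo $token['value'], ': ', $status, PHP_EOL;
}
```

The same check would not hold for tokens that contain escaped characters, because the backslash is dropped from the token value.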

You can also use the getTokenName() method to get the name of a token type

<?php
// ...
foreach ($tokens as &$token) {
    $token['name'] = $tokenizer->getTokenName($token['type']);
}
unset($token); // break the reference left over from the by-reference foreach

print_r($tokens);

Output

Array
(
    [0] => Array
        (
            [type] => 1
            [value] => Minimal
            [position] => 0
            [name] => TT_TOKEN
        )

    [1] => Array
        (
            [type] => 1
            [value] => tokenizer
            [position] => 8
            [name] => TT_TOKEN
        )

    [2] => Array
        (
            [type] => 1
            [value] => example
            [position] => 18
            [name] => TT_TOKEN
        )
)

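Simple Tokenizer

The simple tokenizer additionally supports strings ("hello" or 'hello'), brackets ('()', '[]' and '{}'), multiple delimiters (" ", "\t", "\n", "\r", "\0", "\x0B"), and escaping characters with a backslash (\)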

<?php

use Devtronic\SuperTokenizer\SimpleTokenizer;

require_once __DIR__ . '/vendor/autoload.php';

$tokenizer = new SimpleTokenizer();

$sample = '"Simple" \'Tokenizer\' with\ different brackets [a, b] (c,d), {0, 1}';

$tokens = $tokenizer->tokenize($sample);

foreach ($tokens as &$token) {
    $token['name'] = $tokenizer->getTokenName($token['type']);
}
unset($token); // break the reference left over from the by-reference foreach

print_r($tokens);

Output

Array
(
    [0] => Array
        (
            [type] => 10
            [value] => "Simple"
            [position] => 0
            [name] => TT_STRING
        )

    [1] => Array
        (
            [type] => 10
            [value] => 'Tokenizer'
            [position] => 9
            [name] => TT_STRING
        )

    [2] => Array
        (
            [type] => 1
            [value] => with different
            [position] => 21
            [name] => TT_TOKEN
        )

    [3] => Array
        (
            [type] => 1
            [value] => brackets
            [position] => 37
            [name] => TT_TOKEN
        )

    [4] => Array
        (
            [type] => 20
            [value] => [
            [position] => 46
            [name] => TT_BRACKET_OPEN
        )

    [5] => Array
        (
            [type] => 1
            [value] => a,
            [position] => 47
            [name] => TT_TOKEN
        )

    [6] => Array
        (
            [type] => 1
            [value] => b
            [position] => 50
            [name] => TT_TOKEN
        )

    [7] => Array
        (
            [type] => 21
            [value] => ]
            [position] => 51
            [name] => TT_BRACKET_CLOSE
        )

    [8] => Array
        (
            [type] => 20
            [value] => (
            [position] => 53
            [name] => TT_BRACKET_OPEN
        )

    [9] => Array
        (
            [type] => 1
            [value] => c,d
            [position] => 54
            [name] => TT_TOKEN
        )

    [10] => Array
        (
            [type] => 21
            [value] => )
            [position] => 57
            [name] => TT_BRACKET_CLOSE
        )

    [11] => Array
        (
            [type] => 1
            [value] => ,
            [position] => 58
            [name] => TT_TOKEN
        )

    [12] => Array
        (
            [type] => 20
            [value] => {
            [position] => 60
            [name] => TT_BRACKET_OPEN
        )

    [13] => Array
        (
            [type] => 1
            [value] => 0,
            [position] => 61
            [name] => TT_TOKEN
        )

    [14] => Array
        (
            [type] => 1
            [value] => 1
            [position] => 64
            [name] => TT_TOKEN
        )

    [15] => Array
        (
            [type] => 21
            [value] => }
            [position] => 65
            [name] => TT_BRACKET_CLOSE
        )
)
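
One thing the bracket tokens make easy is a balance check. The following is a minimal sketch, not part of the library: it hard-codes a shortened token list in the shape printed above and uses the numeric type IDs 20 and 21 as they appear in that output. The constant names `TYPE_BRACKET_OPEN` / `TYPE_BRACKET_CLOSE` and the function `bracketsBalanced` are assumptions made for this illustration.

```php
<?php

// Type IDs as they appear in the print_r() output above (assumed stable).
const TYPE_BRACKET_OPEN = 20;
const TYPE_BRACKET_CLOSE = 21;

/**
 * Hypothetical helper: returns true when every closing bracket matches
 * the most recently opened bracket and nothing is left open at the end.
 */
function bracketsBalanced(array $tokens): bool
{
    $pairs = [')' => '(', ']' => '[', '}' => '{'];
    $stack = [];

    foreach ($tokens as $token) {
        if ($token['type'] === TYPE_BRACKET_OPEN) {
            $stack[] = $token['value'];
        } elseif ($token['type'] === TYPE_BRACKET_CLOSE) {
            $expected = $pairs[$token['value']] ?? null;
            if ($expected === null || array_pop($stack) !== $expected) {
                return false;
            }
        }
    }

    return $stack === [];
}

// Shortened token list in the same shape as the output above.
$tokens = [
    ['type' => 20, 'value' => '[',   'position' => 46],
    ['type' => 1,  'value' => 'a,',  'position' => 47],
    ['type' => 1,  'value' => 'b',   'position' => 50],
    ['type' => 21, 'value' => ']',   'position' => 51],
    ['type' => 20, 'value' => '(',   'position' => 53],
    ['type' => 1,  'value' => 'c,d', 'position' => 54],
    ['type' => 21, 'value' => ')',   'position' => 57],
];

var_dump(bracketsBalanced($tokens)); // bool(true)
```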

Custom Tokens / Custom Tokenizer

To add your own tokens, simply create a custom tokenizer class like this

<?php

use Devtronic\SuperTokenizer\SimpleTokenizer;

require_once __DIR__ . '/vendor/autoload.php';

class CustomTokenizer extends SimpleTokenizer
{
    const TT_DOLLAR = 30;
    const TT_EQUALS = 35;

    public function __construct()
    {
        parent::__construct();

        $this->customTokens = [
            self::TT_DOLLAR => '$',
            self::TT_EQUALS => '='
        ];
    }
}

$tokenizer = new CustomTokenizer();

$sample = '$var = 1234';
$tokens = $tokenizer->tokenize($sample);

foreach ($tokens as &$token) {
    $token['name'] = $tokenizer->getTokenName($token['type']);
}
unset($token); // break the reference left over from the by-reference foreach

print_r($tokens);

Output

Array
(
    [0] => Array
        (
            [type] => 30
            [value] => $
            [position] => 0
            [name] => TT_DOLLAR
        )

    [1] => Array
        (
            [type] => 1
            [value] => var
            [position] => 1
            [name] => TT_TOKEN
        )

    [2] => Array
        (
            [type] => 35
            [value] => =
            [position] => 5
            [name] => TT_EQUALS
        )

    [3] => Array
        (
            [type] => 1
            [value] => 1234
            [position] => 7
            [name] => TT_TOKEN
        )
)
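
Once custom token types exist, downstream code can match on them. The sketch below is one way to recognize the `$name = value` shape from the output above. It works on a hard-coded copy of that output so it runs without the library; the function `readAssignment` and the `TYPE_*` constants are hypothetical names, with the numeric values taken from the example class and its printed result.

```php
<?php

// Type IDs taken from the example above.
const TYPE_TOKEN = 1;   // printed as TT_TOKEN
const TYPE_DOLLAR = 30; // CustomTokenizer::TT_DOLLAR
const TYPE_EQUALS = 35; // CustomTokenizer::TT_EQUALS

/**
 * Hypothetical helper: returns the variable name and value when the token
 * list has exactly the shape "$ name = value", otherwise null.
 */
function readAssignment(array $tokens): ?array
{
    $expected = [TYPE_DOLLAR, TYPE_TOKEN, TYPE_EQUALS, TYPE_TOKEN];

    if (array_column($tokens, 'type') !== $expected) {
        return null;
    }

    return ['name' => $tokens[1]['value'], 'value' => $tokens[3]['value']];
}

// Token array copied from the print_r() output above (without the names).
$tokens = [
    ['type' => 30, 'value' => '$',    'position' => 0],
    ['type' => 1,  'value' => 'var',  'position' => 1],
    ['type' => 35, 'value' => '=',    'position' => 5],
    ['type' => 1,  'value' => '1234', 'position' => 7],
];

print_r(readAssignment($tokens));
```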

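The preTokenize() method lets you modify the input source before tokenization (for example, normalizing line endings). With postTokenize() you can modify the result of the tokenize method (for example, detecting numbers)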
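
This README does not show the signatures of preTokenize() and postTokenize(), so the sketch below deliberately avoids overriding them. It only illustrates the two transformations mentioned above as standalone functions that such hooks could delegate to. The function names, the constant names, and the type ID 40 for numbers are all assumptions made for this example.

```php
<?php

const PLAIN_TOKEN_TYPE = 1;   // printed as TT_TOKEN in the examples above
const NUMBER_TOKEN_TYPE = 40; // hypothetical ID for a custom number type

/**
 * Pre-processing step: converts Windows (\r\n) and old Mac (\r) line
 * endings to \n so later positions and delimiters are consistent.
 */
function normalizeLineEndings(string $source): string
{
    return str_replace(["\r\n", "\r"], "\n", $source);
}

/**
 * Post-processing step: re-types plain tokens that consist only of
 * digits. Expects the token array shape shown in the examples above.
 */
function markNumbers(array $tokens): array
{
    foreach ($tokens as $index => $token) {
        if ($token['type'] === PLAIN_TOKEN_TYPE && ctype_digit($token['value'])) {
            $tokens[$index]['type'] = NUMBER_TOKEN_TYPE;
        }
    }

    return $tokens;
}

$tokens = [
    ['type' => 1, 'value' => 'var',  'position' => 1],
    ['type' => 1, 'value' => '1234', 'position' => 7],
];

print_r(markNumbers($tokens));
```

Keeping the transformations in plain functions also makes them easy to unit-test independently of whichever tokenizer class ends up calling them.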

Testing

phpunit

Contributing

  • Fork the repository
  • Create a pull request