devtronic / super-tokenizer
A powerful, dynamic tokenizer written in PHP
1.0.1
2017-02-26 15:28 UTC
Requires
- php: >=7.0.0
Requires (Dev)
- phpunit/phpunit: ^5.6
This package is auto-updated.
Last update: 2024-09-22 00:09:34 UTC
README
Super Tokenizer
Super Tokenizer is a highly dynamic, easy-to-use tokenizer written in PHP.
Installation
composer require devtronic/super-tokenizer
Usage
Minimal Tokenizer
<?php

use Devtronic\SuperTokenizer\Tokenizer;

require_once __DIR__ . '/vendor/autoload.php';

$tokenizer = new Tokenizer();

$sample = 'Minimal tokenizer example';
$tokens = $tokenizer->tokenize($sample);

print_r($tokens);
Output:
Array
(
    [0] => Array
        (
            [type] => 1
            [value] => Minimal
            [position] => 0
        )

    [1] => Array
        (
            [type] => 1
            [value] => tokenizer
            [position] => 8
        )

    [2] => Array
        (
            [type] => 1
            [value] => example
            [position] => 18
        )

)
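In this output, each token's position matches the zero-based offset at which its value starts in the input string. The following minimal sketch checks that reading without running the library: the token array is copied by hand from the printed output, and the $slices variable is introduced only for illustration. It holds for this unescaped ASCII sample; the README does not say whether positions are byte or character offsets.

```php
<?php

$sample = 'Minimal tokenizer example';

// Tokens copied by hand from the printed output (not produced by the library here).
$tokens = [
    ['type' => 1, 'value' => 'Minimal', 'position' => 0],
    ['type' => 1, 'value' => 'tokenizer', 'position' => 8],
    ['type' => 1, 'value' => 'example', 'position' => 18],
];

// Cut each token back out of the input using its position and value length.
$slices = [];
foreach ($tokens as $token) {
    $slices[] = substr($sample, $token['position'], strlen($token['value']));
}

print_r($slices);
```

Each slice equals the corresponding token value, so position can be used to map a token back to its place in the source text.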
You can also use the getTokenName() method to get the name of a token type.
<?php
// ...

foreach ($tokens as &$token) {
    $token['name'] = $tokenizer->getTokenName($token['type']);
}
unset($token); // break the reference left behind by the foreach

print_r($tokens);
Output:
Array
(
    [0] => Array
        (
            [type] => 1
            [value] => Minimal
            [position] => 0
            [name] => TT_TOKEN
        )

    [1] => Array
        (
            [type] => 1
            [value] => tokenizer
            [position] => 8
            [name] => TT_TOKEN
        )

    [2] => Array
        (
            [type] => 1
            [value] => example
            [position] => 18
            [name] => TT_TOKEN
        )

)
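Simple Tokenizer
The simple tokenizer additionally supports strings ("hello" or 'hello'), brackets ('()', '[]' and '{}'), multiple separators (" ", "\t", "\n", "\r", "\0", "\x0B"), and character escaping with a backslash (\).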
<?php

use Devtronic\SuperTokenizer\SimpleTokenizer;

require_once __DIR__ . '/vendor/autoload.php';

$tokenizer = new SimpleTokenizer();

$sample = '"Simple" \'Tokenizer\' with\ different brackets [a, b] (c,d), {0, 1}';
$tokens = $tokenizer->tokenize($sample);

foreach ($tokens as &$token) {
    $token['name'] = $tokenizer->getTokenName($token['type']);
}
unset($token); // break the reference left behind by the foreach

print_r($tokens);
Output:
Array
(
    [0] => Array
        (
            [type] => 10
            [value] => "Simple"
            [position] => 0
            [name] => TT_STRING
        )

    [1] => Array
        (
            [type] => 10
            [value] => 'Tokenizer'
            [position] => 9
            [name] => TT_STRING
        )

    [2] => Array
        (
            [type] => 1
            [value] => with different
            [position] => 21
            [name] => TT_TOKEN
        )

    [3] => Array
        (
            [type] => 1
            [value] => brackets
            [position] => 37
            [name] => TT_TOKEN
        )

    [4] => Array
        (
            [type] => 20
            [value] => [
            [position] => 46
            [name] => TT_BRACKET_OPEN
        )

    [5] => Array
        (
            [type] => 1
            [value] => a,
            [position] => 47
            [name] => TT_TOKEN
        )

    [6] => Array
        (
            [type] => 1
            [value] => b
            [position] => 50
            [name] => TT_TOKEN
        )

    [7] => Array
        (
            [type] => 21
            [value] => ]
            [position] => 51
            [name] => TT_BRACKET_CLOSE
        )

    [8] => Array
        (
            [type] => 20
            [value] => (
            [position] => 53
            [name] => TT_BRACKET_OPEN
        )

    [9] => Array
        (
            [type] => 1
            [value] => c,d
            [position] => 54
            [name] => TT_TOKEN
        )

    [10] => Array
        (
            [type] => 21
            [value] => )
            [position] => 57
            [name] => TT_BRACKET_CLOSE
        )

    [11] => Array
        (
            [type] => 1
            [value] => ,
            [position] => 58
            [name] => TT_TOKEN
        )

    [12] => Array
        (
            [type] => 20
            [value] => {
            [position] => 60
            [name] => TT_BRACKET_OPEN
        )

    [13] => Array
        (
            [type] => 1
            [value] => 0,
            [position] => 61
            [name] => TT_TOKEN
        )

    [14] => Array
        (
            [type] => 1
            [value] => 1
            [position] => 64
            [name] => TT_TOKEN
        )

    [15] => Array
        (
            [type] => 21
            [value] => }
            [position] => 65
            [name] => TT_BRACKET_CLOSE
        )

)
Custom Tokens / Custom Tokenizer
To add your own tokens, simply create a custom tokenizer class, as follows:
<?php

use Devtronic\SuperTokenizer\SimpleTokenizer;

require_once __DIR__ . '/vendor/autoload.php';

class CustomTokenizer extends SimpleTokenizer
{
    // Type IDs for the custom tokens
    const TT_DOLLAR = 30;
    const TT_EQUALS = 35;

    public function __construct()
    {
        parent::__construct();

        // Map each custom token type to the character it matches
        $this->customTokens = [
            self::TT_DOLLAR => '$',
            self::TT_EQUALS => '='
        ];
    }
}

$tokenizer = new CustomTokenizer();

$sample = '$var = 1234';
$tokens = $tokenizer->tokenize($sample);

foreach ($tokens as &$token) {
    $token['name'] = $tokenizer->getTokenName($token['type']);
}
unset($token); // break the reference left behind by the foreach

print_r($tokens);
Output:
Array
(
    [0] => Array
        (
            [type] => 30
            [value] => $
            [position] => 0
            [name] => TT_DOLLAR
        )

    [1] => Array
        (
            [type] => 1
            [value] => var
            [position] => 1
            [name] => TT_TOKEN
        )

    [2] => Array
        (
            [type] => 35
            [value] => =
            [position] => 5
            [name] => TT_EQUALS
        )

    [3] => Array
        (
            [type] => 1
            [value] => 1234
            [position] => 7
            [name] => TT_TOKEN
        )

)
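The preTokenize() method lets you modify the input source before tokenization (for example, to normalize line breaks). With postTokenize() you can modify the result of the tokenize method (for example, to detect numbers).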
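This README does not show the signatures of preTokenize() and postTokenize(), so the sketch below does not override them. Instead it uses two standalone, hypothetical helpers (normalizeLineBreaks and markNumericTokens, neither part of the library) to illustrate the kind of work each hook might do. The isNumber key is likewise an assumption made for this example, and the token array is written by hand in the shape the tokenizer prints.

```php
<?php

// Hypothetical helper: the sort of step a preTokenize() hook could perform.
// Converts Windows (\r\n) and classic Mac (\r) line breaks to \n.
function normalizeLineBreaks(string $input): string
{
    // "\r\n" is replaced first so it does not turn into two line breaks.
    return str_replace(["\r\n", "\r"], "\n", $input);
}

// Hypothetical helper: the sort of step a postTokenize() hook could perform.
// Adds an 'isNumber' flag (an assumed key) to every token.
function markNumericTokens(array $tokens): array
{
    foreach ($tokens as &$token) {
        $token['isNumber'] = is_numeric($token['value']);
    }
    unset($token); // break the reference left behind by the foreach

    return $tokens;
}

$normalized = normalizeLineBreaks("a\r\nb\rc");

// Hand-written tokens in the same shape as the tokenizer output.
$marked = markNumericTokens([
    ['type' => 1, 'value' => 'var', 'position' => 1],
    ['type' => 1, 'value' => '1234', 'position' => 7],
]);

print_r($marked);
```

In a custom tokenizer, logic like this would presumably live inside overridden preTokenize() and postTokenize() methods; check the library source for their exact parameters and return values before relying on that.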
Testing
phpunit
Contributing
- Fork the repository
- Create a pull request