org_heigl / tokenizer
Provides a way to tokenize strings
dev-master
2014-01-08 14:34 UTC
Requires (Dev)
- mockery/mockery: dev-master
Last update: 2024-09-23 09:38:41 UTC
README
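Provides methods for splitting a string into smaller entities; what counts as an entity depends on the tokenizer used.
You can chain different tokenizers into a tokenizer queue to get the result you need.
Currently the library provides the following tokenizers:
- WhitespaceTokenizer splits a string at whitespace. It can be used to split a sentence into single words.
- CamelCaseTokenizer splits a camel-cased string into separate tokens.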
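To make the two splitting rules concrete, here is a minimal sketch in plain PHP. The functions splitOnWhitespace() and splitCamelCase() are hypothetical names invented for this illustration; they are not part of the library, and the library's own tokenizers may treat edge cases (runs of whitespace, consecutive capitals) differently.

```php
<?php
// Hypothetical helpers for illustration only; not part of org_heigl/tokenizer.

/**
 * Split on runs of whitespace, keeping each run as its own piece.
 * Assumption: a run of several whitespace characters stays together.
 */
function splitOnWhitespace(string $input): array
{
    return preg_split(
        '/(\s+)/',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

/**
 * Split a camel-cased word at every position where a lowercase
 * ASCII letter is directly followed by an uppercase ASCII letter.
 */
function splitCamelCase(string $word): array
{
    return preg_split('/(?<=[a-z])(?=[A-Z])/', $word, -1, PREG_SPLIT_NO_EMPTY);
}

var_dump(splitOnWhitespace('A String'));
var_dump(splitCamelCase('WhiteSpace'));
```

The camel-case pattern uses a lookbehind and a lookahead, so the split happens between two characters and nothing is consumed.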
Installation
Installation is done using composer. Add the following line to the require section of your composer.json file:
"org_heigl/tokenizer" : "dev-master"
Usage
Usage is fairly simple:
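// Load Composer's autoloader (assumes the default vendor directory)
require 'vendor/autoload.php';
use Org_Heigl\Tokenizer\TokenizerQueue;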
use Org_Heigl\Tokenizer\Tokenizers;
// Create a new Tokenizer-Queue
$tokenizer = new TokenizerQueue();
// Add single tokenizers to the queue
// First a Whitespace tokenizer
$tokenizer->addTokenizer(new Tokenizers\WhitespaceTokenizer());
// Then a CamelCase-Tokenizer
$tokenizer->addTokenizer(new Tokenizers\CamelCaseTokenizer());
// Finally tokenize a given string
$tokenList = $tokenizer->tokenize('A String with WhiteSpace');
var_dump((array) $tokenList);
// This will print the following:
/*
array(8) {
  [0] =>
  class Org_Heigl\Tokenizer\Token#216 (3) {
    protected $token =>
    string(1) "A"
    protected $offset =>
    int(0)
    protected $type =>
    string(6) "string"
  }
  [1] =>
  class Org_Heigl\Tokenizer\Token#215 (3) {
    protected $token =>
    string(1) " "
    protected $offset =>
    int(1)
    protected $type =>
    string(10) "whitespace"
  }
  [2] =>
  class Org_Heigl\Tokenizer\Token#214 (3) {
    protected $token =>
    string(6) "String"
    protected $offset =>
    int(2)
    protected $type =>
    string(6) "string"
  }
  [3] =>
  class Org_Heigl\Tokenizer\Token#213 (3) {
    protected $token =>
    string(1) " "
    protected $offset =>
    int(8)
    protected $type =>
    string(10) "whitespace"
  }
  [4] =>
  class Org_Heigl\Tokenizer\Token#212 (3) {
    protected $token =>
    string(4) "with"
    protected $offset =>
    int(9)
    protected $type =>
    string(6) "string"
  }
  [5] =>
  class Org_Heigl\Tokenizer\Token#211 (3) {
    protected $token =>
    string(1) " "
    protected $offset =>
    int(13)
    protected $type =>
    string(10) "whitespace"
  }
  [6] =>
  class Org_Heigl\Tokenizer\Token#209 (3) {
    protected $token =>
    string(5) "White"
    protected $offset =>
    int(14)
    protected $type =>
    string(6) "string"
  }
  [7] =>
  class Org_Heigl\Tokenizer\Token#208 (3) {
    protected $token =>
    string(5) "Space"
    protected $offset =>
    int(19)
    protected $type =>
    string(6) "string"
  }
}
*/
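The dump above shows that each stage of the queue refines the tokens of the previous one while keeping offsets relative to the original string. The following sketch imitates that behaviour with plain arrays so the offset bookkeeping is visible. It is a stand-in under stated assumptions: whitespaceStage() and camelCaseStage() are hypothetical functions, tokens are modelled as arrays rather than Token objects, and nothing here reflects how the library is implemented internally.

```php
<?php
// Illustrative stand-in for a tokenizer chain; these functions are
// hypothetical and do not use or mirror the org_heigl/tokenizer classes.

/** First stage: split on whitespace, recording each piece's byte offset. */
function whitespaceStage(string $input): array
{
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE;
    $tokens = [];
    foreach (preg_split('/(\s+)/', $input, -1, $flags) as [$text, $offset]) {
        $isWhitespace = preg_match('/^\s+$/', $text) === 1;
        $tokens[] = [
            'token'  => $text,
            'offset' => $offset,
            'type'   => $isWhitespace ? 'whitespace' : 'string',
        ];
    }
    return $tokens;
}

/** Second stage: split every "string" token at lowercase/uppercase boundaries. */
function camelCaseStage(array $tokens): array
{
    $result = [];
    foreach ($tokens as $token) {
        if ($token['type'] !== 'string') {
            $result[] = $token; // whitespace tokens pass through unchanged
            continue;
        }
        $parts = preg_split(
            '/(?<=[a-z])(?=[A-Z])/',
            $token['token'],
            -1,
            PREG_SPLIT_OFFSET_CAPTURE
        );
        foreach ($parts as [$text, $offset]) {
            $result[] = [
                'token'  => $text,
                // offset inside the parent token + parent's offset in the input
                'offset' => $token['offset'] + $offset,
                'type'   => 'string',
            ];
        }
    }
    return $result;
}

$tokens = camelCaseStage(whitespaceStage('A String with WhiteSpace'));
foreach ($tokens as $t) {
    printf("%2d %-10s %s\n", $t['offset'], $t['type'], json_encode($t['token']));
}
```

For the input used above, the sketch yields eight tokens with offsets 0, 1, 2, 8, 9, 13, 14 and 19, matching the token, offset and type values in the dump. Order matters: running the camel-case stage first would still work here, but in general each stage only sees the pieces the previous one produced.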