jared/php-tokenizer

The latest version of this package (dev-main) has no license information available.

This package provides a simple, standalone, regex-based tokenizer.

dev-main 2023-09-13 20:39 UTC

This package is not auto-updated.

Last update: 2024-09-26 23:45:24 UTC


README

Simple regex-driven tokenizer

This package provides a simple, standalone, regex-based tokenizer.

Installation

composer require jared/php-tokenizer

Usage

Example usage

$rules = [ 
    'NON_SPACE_STRING' => '/\\G[^\\s]+/u',
    'ANY_CHARACTER' => '/\\G./u'
];

$string = 'abc 1qz';
$stream = ( new Falloff\Tokenizer\Factory( $rules ) )->getStream( $string );
while( $token = $stream->nextToken() ){
    print "Token has type `{$token->type}` and its value is `{$token->value}` at offset `{$token->offset}`\n";
}

// The output is:
# Token has type `NON_SPACE_STRING` and its value is `abc` at offset `0`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `3`
# Token has type `NON_SPACE_STRING` and its value is `1qz` at offset `4`

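Note: every regular expression used as a rule must begin with the \G assertion.

Note: input is interpreted as UTF-8, so regular expressions with the u modifier are recommended.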
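To make the role of \G concrete, here is a small sketch in plain PHP that does not use this package at all. It assumes a tokenizer of this kind calls preg_match with a start offset; that is an assumption about the implementation, not something the README states. Without \G the regex engine scans forward from the offset, while with \G the match has to begin exactly at the offset.

```php
<?php
$subject = 'abc 1qz';

// Unanchored: the engine scans forward from offset 0 until it finds a digit.
$unanchored = preg_match('/\\d/u', $subject, $m1, PREG_OFFSET_CAPTURE, 0);
$foundAt = $m1[0][1]; // byte position where the digit was found

// Anchored with \G: the match must begin exactly at the given offset.
$atStart = preg_match('/\\G\\d/u', $subject, $m2, 0, 0); // 'a' is not a digit
$atDigit = preg_match('/\\G\\d/u', $subject, $m3, 0, 4); // '1' sits at offset 4

var_dump($unanchored, $foundAt, $atStart, $atDigit);
```

Note that the offset argument of preg_match is counted in bytes even when the u modifier is set.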

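Rules can be added dynamically to either the factory or the stream itself. Rules added to the factory do not affect streams that have already been instantiated.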

$rules = [ 
    'NON_SPACE_STRING' => '/\\G[^\\s]+/u',
    'ANY_CHARACTER' => '/\\G./u'
];

$string = 'a b 1 qz';
$stream = ( new Falloff\Tokenizer\Factory( $rules ) )->getStream( $string );

// This rule will never trigger, because 'NON_SPACE_STRING' is matched earlier
$stream->appendRule('DIGIT', '/\\G\\d/u');

// Prepending rules, so these will be matched before the 'NON_SPACE_STRING'
$stream->prependRules([
    'Q_CHAR' => '/\\Gq/u',
    'Z_CHAR' => '/\\Gz/u',
]);

// The stream can also be invoked as if it were a function
while( $token = $stream() ){
    print "Token has type `{$token->type}` and its value is `{$token->value}` at offset `{$token->offset}`\n";
}

// The output is:
# Token has type `NON_SPACE_STRING` and its value is `a` at offset `0`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `1`
# Token has type `NON_SPACE_STRING` and its value is `b` at offset `2`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `3`
# Token has type `NON_SPACE_STRING` and its value is `1` at offset `4`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `5`
# Token has type `Q_CHAR` and its value is `q` at offset `6`
# Token has type `Z_CHAR` and its value is `z` at offset `7`
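The effect of rule order in the output above can be reproduced with a short, self-contained sketch. The function sketchTokenize below is a hypothetical helper written for illustration only; it is not the package's implementation, and it simply assumes that rules are tried in array order and that the first match wins.

```php
<?php
// Hypothetical helper, NOT part of the package: tries the rules in array
// order at the current byte offset and takes the first one that matches.
function sketchTokenize(array $rules, string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $matched = false;
        foreach ($rules as $type => $regex) {
            // Empty matches are skipped so the loop always advances.
            if (preg_match($regex, $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                $tokens[] = [$type, $m[0], $offset];
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("No rule matches at byte offset {$offset}");
        }
    }
    return $tokens;
}

$rules = [
    'NON_SPACE_STRING' => '/\\G[^\\s]+/u',
    'ANY_CHARACTER'    => '/\\G./u',
];
$digit = ['DIGIT' => '/\\G\\d/u'];

// Array union keeps the left operand's keys first, so this models
// appending versus prepending a rule.
$appended  = $rules + $digit;
$prepended = $digit + $rules;

$typesAppended  = array_column(sketchTokenize($appended, 'a 1'), 0);
$typesPrepended = array_column(sketchTokenize($prepended, 'a 1'), 0);

print implode(', ', $typesAppended) . "\n";
print implode(', ', $typesPrepended) . "\n";
```

With the DIGIT rule appended, the character 1 is consumed by NON_SPACE_STRING; with the rule prepended, it is reported as DIGIT.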

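When no rule matches the next chunk of the input stream, an UnknownTokenException is thrown. This exception is itself a token: its type is set to NULL, but its value and offset properties remain accessible.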
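The idea of an exception that doubles as a token can be sketched without the package. The class SketchUnknownToken and the function sketchNext below are hypothetical stand-ins, not the package's classes, and the choice to store the rest of the input as the value is an assumption made for this illustration; the README does not say what the real value contains.

```php
<?php
// Hypothetical stand-in, NOT the package's class: an exception that also
// carries token-like properties.
final class SketchUnknownToken extends RuntimeException
{
    public ?string $type = null; // an unknown token has no type

    public function __construct(public string $value, public int $offset)
    {
        parent::__construct("Unknown token at byte offset {$offset}");
    }
}

// Hypothetical helper: returns the first matching rule's token as an array,
// or throws when no rule matches at the given offset.
function sketchNext(array $rules, string $input, int $offset): array
{
    foreach ($rules as $type => $regex) {
        if (preg_match($regex, $input, $m, 0, $offset) === 1) {
            return ['type' => $type, 'value' => $m[0], 'offset' => $offset];
        }
    }
    // Assumption: the value of an unknown token is the unconsumed input.
    throw new SketchUnknownToken(substr($input, $offset), $offset);
}

$rules  = ['WORD' => '/\\G\\w+/u'];
$caught = null;
try {
    sketchNext($rules, 'abc def', 3); // offset 3 is a space, so no rule matches
} catch (SketchUnknownToken $e) {
    $caught = $e;
    print "No rule matched at offset {$e->offset}\n";
}
```

Because the caught object exposes the same properties as a regular token, error reporting code can read the offset directly from it.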

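When the stream ends, a request for the next token returns false. The stream state can be checked through the eof property without requesting the next token.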

if( $stream->eof ){
    print "Got all the tokens we had there";
} else{
    $token = $stream();
}

The remaining, not yet tokenized substring can be retrieved at any time with the tail method.

print "The untokenized substring currently is: " . $stream->tail();

A callback can be attached to the stream; it is triggered every time a token is requested from the tokenizer.

use \Falloff\Tokenizer\{UnknownTokenException,Token};

$stream->onTokenRequest(function( UnknownTokenException|Token $token ){
    print $token->type . ' token retrieved from the stream';
});

If this callback returns a Token instance, that instance is returned to the original caller.