jared/php-tokenizer
The latest version of this package (dev-main) provides no license information.
This package provides a simple, standalone, regex-based tokenizer.
dev-main
2023-09-13 20:39 UTC
Requires
- php: >=7.4
Requires (Dev)
- phpunit/phpunit: ^10
This package is not auto-updated.
Last update: 2024-09-26 23:45:24 UTC
README
Simple regex-driven tokenizer
This package provides a simple, standalone, regex-based tokenizer.
Installation
composer require jared/php-tokenizer
Usage
Example usage
$rules = [
    'NON_SPACE_STRING' => '/\\G[^\\s]+/u',
    'ANY_CHARACTER'    => '/\\G./u',
];
$string = 'abc 1qz';
$stream = ( new Falloff\Tokenizer\Factory( $rules ) )->getStream( $string );

while( $token = $stream->nextToken() ){
    print "Token has type `{$token->type}` and its value is `{$token->value}` at offset `{$token->offset}`\n";
}

// The output is:
# Token has type `NON_SPACE_STRING` and its value is `abc` at offset `0`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `3`
# Token has type `NON_SPACE_STRING` and its value is `1qz` at offset `4`
Note: every regular expression used as a rule must begin with the \G assertion.
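The effect of the \G assertion can be seen with PHP's built-in preg_match() alone. The snippet below is a minimal sketch, not code from this package: it assumes the tokenizer resumes matching at a byte offset inside the input, and the variable names are made up for illustration.

```php
<?php
$input  = 'abc 1qz';
$offset = 3; // byte position of the space character

// Without \G the engine scans forward from the offset and finds "1qz" at 4,
// silently skipping the space.
preg_match('/[^\\s]+/u', $input, $unanchored, PREG_OFFSET_CAPTURE, $offset);
print "unanchored: `{$unanchored[0][0]}` at {$unanchored[0][1]}\n";

// With \G the match has to begin exactly at the offset, so this rule
// fails on the space instead of skipping it.
$anchoredFound = preg_match('/\\G[^\\s]+/u', $input, $anchored, PREG_OFFSET_CAPTURE, $offset);
print "anchored: " . ($anchoredFound ? 'match' : 'no match') . "\n";
```

Anchoring each rule this way is what lets a tokenizer report input that no rule covers rather than jumping over it.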
Note: the input is interpreted as UTF-8, so it is recommended to use regular expressions with the u modifier.
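The difference the u modifier makes can be illustrated with plain preg_match(). This is a small sketch independent of the package; the sample string is an assumption chosen because "é" takes two bytes in UTF-8.

```php
<?php
$input = "\u{e9}!"; // "é!" where "é" is encoded as the bytes 0xC3 0xA9

// Without u, "." matches a single byte, which cuts the character in half.
preg_match('/\\G./', $input, $bytes);

// With u, "." matches one whole UTF-8 code point.
preg_match('/\\G./u', $input, $chars);

print strlen($bytes[0]) . "\n"; // 1
print strlen($chars[0]) . "\n"; // 2
```

A rule without u could therefore produce a token holding an incomplete character.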
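Rules can be added dynamically, either to the factory or to a stream itself. Rules added to the factory do not affect streams that have already been instantiated.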
$rules = [
    'NON_SPACE_STRING' => '/\\G[^\\s]+/u',
    'ANY_CHARACTER'    => '/\\G./u',
];
$string = 'a b 1 qz';
$stream = ( new Falloff\Tokenizer\Factory( $rules ) )->getStream( $string );

// This rule will never trigger, because 'NON_SPACE_STRING' is matched earlier
$stream->appendRule('DIGIT', '/\\G\\d/u');

// Prepended rules are matched before 'NON_SPACE_STRING'
$stream->prependRules([
    'Q_CHAR' => '/\\Gq/u',
    'Z_CHAR' => '/\\Gz/u',
]);

// The stream can also be invoked as if it were a function
while( $token = $stream() ){
    print "Token has type `{$token->type}` and its value is `{$token->value}` at offset `{$token->offset}`\n";
}

// The output is:
# Token has type `NON_SPACE_STRING` and its value is `a` at offset `0`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `1`
# Token has type `NON_SPACE_STRING` and its value is `b` at offset `2`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `3`
# Token has type `NON_SPACE_STRING` and its value is `1` at offset `4`
# Token has type `ANY_CHARACTER` and its value is ` ` at offset `5`
# Token has type `Q_CHAR` and its value is `q` at offset `6`
# Token has type `Z_CHAR` and its value is `z` at offset `7`
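Why rule order matters can be sketched with a first-match-wins loop. The helper firstMatchingRule() below is hypothetical and is not part of the package; it only assumes, as the comments in the example above suggest, that rules are tried in order and the first one that matches wins. The package's real implementation may differ.

```php
<?php
// Hypothetical helper: returns the type of the first rule matching at $offset,
// or null when no rule matches.
function firstMatchingRule(array $rules, string $input, int $offset): ?string
{
    foreach ($rules as $type => $pattern) {
        if (preg_match($pattern, $input, $m, 0, $offset) === 1) {
            return $type;
        }
    }
    return null;
}

$rules = [
    'NON_SPACE_STRING' => '/\\G[^\\s]+/u',
    'ANY_CHARACTER'    => '/\\G./u',
    'DIGIT'            => '/\\G\\d/u', // appended last
];

// Offset 4 of 'a b 1 qz' is the digit, but NON_SPACE_STRING is tried first.
print firstMatchingRule($rules, 'a b 1 qz', 4) . "\n"; // NON_SPACE_STRING

// Moving DIGIT to the front (the equivalent of prepending) changes the result.
$reordered = ['DIGIT' => $rules['DIGIT']] + $rules;
print firstMatchingRule($reordered, 'a b 1 qz', 4) . "\n"; // DIGIT
```

Under this assumption, narrow rules belong before broad ones, which is why the example prepends Q_CHAR and Z_CHAR instead of appending them.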
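When no rule matches the next chunk of the input, an UnknownTokenException is thrown. The exception itself acts as a token: its type is set to NULL, but the value and offset properties remain accessible.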
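Handling unmatched input might look like the sketch below. It assumes the package is installed through Composer with the autoloader at vendor/autoload.php, and it uses only the class names and properties mentioned in this README. The WORD rule and the sample input are made up for illustration, and because the README does not say exactly what value holds for unmatched input, the code prints it without further interpretation.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumed Composer autoloader path

use Falloff\Tokenizer\Factory;
use Falloff\Tokenizer\UnknownTokenException;

// A deliberately incomplete rule set: nothing matches spaces or digits.
$rules  = [ 'WORD' => '/\\G[a-z]+/u' ];
$stream = ( new Factory( $rules ) )->getStream( 'abc 123' );

try {
    while( $token = $stream->nextToken() ){
        print "Got `{$token->type}`: `{$token->value}`\n";
    }
}
catch( UnknownTokenException $e ){
    // The exception exposes the same properties as a token; its type is NULL.
    print "No rule matched at offset `{$e->offset}`, value is `{$e->value}`\n";
}
```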
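When the stream ends, a request for the next token returns false. The stream state can also be checked through the eof property, without requesting the next token.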
if( $stream->eof ){
    print "Got all the tokens we had there";
}
else{
    $token = $stream();
}
The remaining, not yet tokenized substring can be retrieved at any time with the tail method.
print "The untokenized substring currently is: " . $stream->tail();
A callback can be attached to a stream; it is fired every time a token is requested from the tokenizer.
use Falloff\Tokenizer\{UnknownTokenException,Token};

// Note: the union type in the signature requires PHP 8.0 or later
$stream->onTokenRequest(function( UnknownTokenException|Token $token ){
    print $token->type . ' token retrieved from the stream';
});
If the callback returns a Token instance, that instance is returned to the original caller.