nikic / phlexy
Lexing experiments in PHP
v0.2
2019-06-26 21:03 UTC
Requires
- php: ^7.1
Requires (Dev)
- phpunit/phpunit: ^7.0 || ^8.0
This package is auto-updated.
Last update: 2024-08-25 01:54:53 UTC
README
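This project is a follow-up to my blog post about fast lexing in PHP. It contains a few lexer implementations (both stateless and stateful) and the corresponding performance tests.
Usage
Lexers are created from a lexer definition using a factory class.
For example, to create a MARK-based stateless CSV lexer, you can use the following code: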
<?php
require 'path/to/vendor/autoload.php';

$factory = new Phlexy\LexerFactory\Stateless\UsingMarks(
    new Phlexy\LexerDataGenerator
);

$lexer = $factory->createLexer(array(
    '[^",\r\n]+'                     => 0, // 0, 1, 2, 3 are the tokens
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1, // they should really be constants
    ','                              => 2,
    '\r?\n'                          => 3,
));

$tokens = $lexer->lex("hallo world,foo bar,more foo,more bar,\"rare , escape\",some more,stuff\n...");
Similarly, for a stateful lexer:
<?php
require 'path/to/lib/Phlexy/bootstrap.php';

$factory = new Phlexy\LexerFactory\Stateful\UsingMarks(
    new Phlexy\LexerDataGenerator
);

// The "i" is an additional modifier (all createLexer methods accept it)
$lexer = $factory->createLexer($lexerDefinition, 'i');
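For an example of a stateful lexer definition, see the lexer definition for PHP source code.
Performance
The different lexer implementations can be compared using the performance test script: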
$ php-7.2 examples/performanceTests.php
Timing lexing of CVS data:
Took 0.55736708641052 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.526859998703 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.49272608757019 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.5570011138916 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.46333193778992 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing alphabet lexing of all "a":
Took 0.58650183677673 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.754310131073 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.70682787895203 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.76406478881836 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.62837815284729 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing alphabet lexing of all "z":
Took 0.79967403411865 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.30202317237854 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.29198718070984 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.36609601974487 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.12433409690857 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing alphabet lexing of random string:
Took 1.1720998287201 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.5946900844574 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.55696296691895 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.6708779335022 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.33155107498169 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing PHP lexing of this file:
Took 0.151211977005 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.025480031967163 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.007037878036499 seconds (Phlexy\Lexer\Stateful\UsingMarks)
Timing PHP lexing of larger TestAbstract file:
Took 0.49794602394104 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.083348035812378 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.019592046737671 seconds (Phlexy\Lexer\Stateful\UsingMarks)
Stateless\Simple and Stateful\Simple are trivial lexer implementations (they loop through the regular expressions one by one).
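To illustrate what looping through the regular expressions means, here is a minimal sketch of a stateless lexer of that kind. It is not Phlexy's actual code: the function name simpleLex and the [token, text] token format are assumptions made for this example, and the regexes are assumed not to contain the ~ delimiter.

```php
<?php
// Hypothetical helper, not part of Phlexy: a naive lexer that tries each
// regex in turn at the current offset and takes the first one that matches.
function simpleLex(array $regexToToken, string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        foreach ($regexToToken as $regex => $token) {
            // The A modifier anchors the match at $offset.
            if (preg_match('~' . $regex . '~A', $input, $matches, 0, $offset)
                && $matches[0] !== ''
            ) {
                $tokens[] = [$token, $matches[0]];
                $offset += strlen($matches[0]);
                continue 2; // restart with the first regex at the new offset
            }
        }
        throw new RuntimeException("Unexpected character at offset $offset");
    }

    return $tokens;
}

$tokens = simpleLex(['[^",\r\n]+' => 0, ',' => 2, '\r?\n' => 3], "foo,bar\n");
```

Every token costs up to one preg_match call per regex in the definition, which is why this approach gets slower as the definition grows.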
Stateless\WithoutCapturingGroups, Stateless\WithCapturingGroups and Stateful\UsingCompiledRegex use the compiled regex approach described in the blog post mentioned above.
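The idea of a compiled regex can be sketched as follows: all token regexes are joined into one alternation, and the capturing group that matched identifies the token. This is a simplified illustration rather than the library's implementation. The name compiledLex and the token format are hypothetical, and the sketch assumes that the individual regexes contain no capturing groups of their own and never match the empty string.

```php
<?php
// Hypothetical helper, not Phlexy's code: compile all token regexes into one
// alternation and detect the matching alternative by its capturing group.
function compiledLex(array $regexToToken, string $input): array
{
    $compiled = '~(' . implode(')|(', array_keys($regexToToken)) . ')~A';
    $groupToToken = array_values($regexToToken);

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (!preg_match($compiled, $input, $matches, 0, $offset)
            || $matches[0] === ''
        ) {
            throw new RuntimeException("Unexpected character at offset $offset");
        }

        // Groups before the matching alternative are empty strings and groups
        // after it are omitted, so the first non-empty group identifies it.
        for ($i = 1; $matches[$i] === ''; ++$i);

        $tokens[] = [$groupToToken[$i - 1], $matches[0]];
        $offset += strlen($matches[0]);
    }

    return $tokens;
}

$tokens = compiledLex(['[^",\r\n]+' => 0, ',' => 2, '\r?\n' => 3], "foo,bar\n");
```

Only one preg_match call is needed per token, at the cost of a linear scan over the groups to find the one that matched.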
Stateless\UsingPregReplace is an extension of the compiled regex approach in which preg_replace_callback is (ab)used to do the looping over the regular expression.
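A rough sketch of how preg_replace_callback can drive the loop: the callback is invoked once per match and records a token, while the replacement result itself is thrown away. The function name pregReplaceLex, the use of \G to force consecutive matches, and the length-based error check are assumptions of this example and not necessarily what the library does. As before, the regexes are assumed to have no capturing groups and to never match the empty string.

```php
<?php
// Hypothetical helper, not Phlexy's code: let preg_replace_callback iterate
// over the input instead of calling preg_match in a PHP-level loop.
function pregReplaceLex(array $regexToToken, string $input): array
{
    // \G forces every match to start exactly where the previous one ended.
    $compiled = '~\G(?:(' . implode(')|(', array_keys($regexToToken)) . '))~';
    $groupToToken = array_values($regexToToken);

    $tokens = [];
    $consumed = 0;

    preg_replace_callback(
        $compiled,
        function (array $matches) use (&$tokens, &$consumed, $groupToToken) {
            // The first non-empty group identifies the matching alternative.
            for ($i = 1; $matches[$i] === ''; ++$i);
            $tokens[] = [$groupToToken[$i - 1], $matches[0]];
            $consumed += strlen($matches[0]);
            return ''; // the replacement string is discarded
        },
        $input
    );

    // If matching stopped early, the input contains something unlexable.
    if ($consumed !== strlen($input)) {
        throw new RuntimeException("Unexpected character at offset $consumed");
    }

    return $tokens;
}

$tokens = pregReplaceLex(['[^",\r\n]+' => 0, ',' => 2, '\r?\n' => 3], "foo,bar\n");
```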
Stateless\UsingMarks and Stateful\UsingMarks use the (*MARK) mechanism, which is exposed in PHP 5.5.
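As a sketch of the (*MARK) technique: each alternative ends with a (*MARK:n) verb, and after a successful match PHP reports the last mark passed in $matches['MARK'], so no search over capturing groups is needed. The helper name markLex and the token format are hypothetical, and this is an illustration under those assumptions, not the library's code.

```php
<?php
// Hypothetical helper, not Phlexy's code: label every alternative with a
// (*MARK:n) verb and read the label back from $matches['MARK'].
function markLex(array $regexToToken, string $input): array
{
    $alternatives = [];
    $markToToken = [];
    $i = 0;
    foreach ($regexToToken as $regex => $token) {
        $alternatives[] = '(?:' . $regex . ')(*MARK:' . $i . ')';
        $markToToken[$i++] = $token;
    }
    $compiled = '~' . implode('|', $alternatives) . '~A';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (!preg_match($compiled, $input, $matches, 0, $offset)
            || $matches[0] === ''
        ) {
            throw new RuntimeException("Unexpected character at offset $offset");
        }

        // $matches['MARK'] holds the name of the mark on the matching path.
        $tokens[] = [$markToToken[$matches['MARK']], $matches[0]];
        $offset += strlen($matches[0]);
    }

    return $tokens;
}

$tokens = markLex(['[^",\r\n]+' => 0, ',' => 2, '\r?\n' => 3], "foo,bar\n");
```

Because the token is looked up directly from the mark, the individual regexes may also contain capturing groups without confusing the lexer.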
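As the performance measurements above show, the Simple approach is considerably slower than the compiled regex approach on most inputs (the all-"a" alphabet input is the exception). The MARK-based implementations in turn perform much better than the implementations based on group offsets. The advantage becomes more obvious the larger the lexer is: for the CSV lexer the differences are relatively small, whereas for the PHP lexer the MARK-based implementation is roughly 21 to 25 times faster than the simple implementation.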
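The speedup for the PHP lexer can be recomputed from the timings listed in the benchmark output. The snippet below only redoes that arithmetic; the array layout and variable names are made up for this example.

```php
<?php
// Timings in seconds, copied from the benchmark output in this README.
$timings = [
    'this file' => [
        'simple' => 0.151211977005,
        'marks'  => 0.007037878036499,
    ],
    'larger TestAbstract file' => [
        'simple' => 0.49794602394104,
        'marks'  => 0.019592046737671,
    ],
];

$speedups = [];
foreach ($timings as $name => $t) {
    // How many times faster Stateful\UsingMarks is than Stateful\Simple.
    $speedups[$name] = $t['simple'] / $t['marks'];
    printf("%s: %.1fx\n", $name, $speedups[$name]);
}
```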