README

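This project is a follow-up to my article on fast lexing in PHP. It contains several lexer implementations (both stateless and stateful) together with the related performance tests.

Usage

Lexers are created from a lexer definition by using a factory class.

For example, if you want to create a MARK-based stateless CSV lexer, you can use the following code: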

<?php
require 'path/to/vendor/autoload.php';

$factory = new Phlexy\LexerFactory\Stateless\UsingMarks(
    new Phlexy\LexerDataGenerator
);

$lexer = $factory->createLexer(array(
    '[^",\r\n]+'                     => 0, // 0, 1, 2, 3 are the tokens
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1, // they should really be constants
    ','                              => 2,
    '\r?\n'                          => 3,
));

$tokens = $lexer->lex("hallo world,foo bar,more foo,more bar,\"rare , escape\",some more,stuff\n...");

Similarly, for a stateful lexer:

<?php
require 'path/to/lib/Phlexy/bootstrap.php';

$factory = new Phlexy\LexerFactory\Stateful\UsingMarks(
    new Phlexy\LexerDataGenerator
);

// The "i" is an additional modifier (all createLexer methods accept it)
$lexer = $factory->createLexer($lexerDefinition, 'i');

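For an example of a stateful lexer definition, you can look at the lexer definition for PHP source code.

Performance

The performance test script can be used to compare the different lexer implementations: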

$ php-7.2 examples/performanceTests.php

Timing lexing of CSV data:
Took 0.55736708641052 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.526859998703 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.49272608757019 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.5570011138916 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.46333193778992 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of all "a":
Took 0.58650183677673 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.754310131073 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.70682787895203 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.76406478881836 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.62837815284729 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of all "z":
Took 0.79967403411865 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.30202317237854 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.29198718070984 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.36609601974487 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.12433409690857 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of random string:
Took 1.1720998287201 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.5946900844574 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.55696296691895 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.6708779335022 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.33155107498169 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing PHP lexing of this file:
Took 0.151211977005 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.025480031967163 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.007037878036499 seconds (Phlexy\Lexer\Stateful\UsingMarks)

Timing PHP lexing of larger TestAbstract file:
Took 0.49794602394104 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.083348035812378 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.019592046737671 seconds (Phlexy\Lexer\Stateful\UsingMarks)

Stateless\Simple and Stateful\Simple are trivial lexer implementations (they loop through the regular expressions one by one).
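To illustrate what looping through the regular expressions means, here is a minimal, self-contained sketch. The function simpleLex and its [token, text] output format are hypothetical and are not Phlexy's actual API; the sketch only assumes that each regex is tried in turn at the current offset.

```php
<?php
// Hypothetical helper (not part of Phlexy): tries every regex in turn
// at the current offset and takes the first one that matches.
function simpleLex(array $definition, string $input): array {
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        foreach ($definition as $regex => $token) {
            // The A modifier anchors the match at $offset.
            if (preg_match('~' . $regex . '~A', $input, $matches, 0, $offset)
                && $matches[0] !== ''
            ) {
                $tokens[] = [$token, $matches[0]];
                $offset += strlen($matches[0]);
                continue 2; // restart with the first regex at the new offset
            }
        }
        throw new RuntimeException("Unexpected character at offset $offset");
    }

    return $tokens;
}

$simpleTokens = simpleLex([
    '[^",\r\n]+' => 0,
    ','          => 2,
    '\r?\n'      => 3,
], "a,b\n");
```

The cost of this approach grows with the number of regexes, because in the worst case every one of them is tried at every position.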

Stateless\WithoutCapturingGroups, Stateless\WithCapturingGroups and Stateful\UsingCompiledRegex use the compiled regex approach described in the blog post mentioned above.
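The general idea of a compiled regex can be sketched as follows: all token regexes are joined into one alternation, and the group that matched identifies the token. The function compiledLex is hypothetical and not taken from Phlexy, and the sketch assumes that the individual regexes contain no capturing groups of their own.

```php
<?php
// Hypothetical helper (not part of Phlexy): one combined regex,
// one capturing group per token type.
function compiledLex(array $definition, string $input): array {
    $tokenTypes = array_values($definition);
    // Assumes the individual regexes contain no capturing groups themselves.
    $compiled = '~(' . implode(')|(', array_keys($definition)) . ')~A';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (!preg_match($compiled, $input, $matches, 0, $offset)
            || $matches[0] === ''
        ) {
            throw new RuntimeException("Unexpected character at offset $offset");
        }
        // preg_match omits trailing unmatched groups, so the last entry
        // in $matches belongs to the group that matched.
        $index = count($matches) - 2;
        $tokens[] = [$tokenTypes[$index], $matches[0]];
        $offset += strlen($matches[0]);
    }

    return $tokens;
}

$compiledTokens = compiledLex([
    '[^",\r\n]+' => 0,
    ','          => 2,
    '\r?\n'      => 3,
], "a,b\n");
```

Compared with the simple loop, only one preg_match call is needed per token, regardless of how many token types the definition contains.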

Stateless\UsingPregReplace is an extension of the compiled regex approach in which the looping over the regular expression is done by (ab)using preg_replace_callback.
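A rough sketch of this idea, under the same assumptions as the compiled regex approach (no capturing groups inside the individual regexes, no regex that matches the empty string): the callback is used only for its side effect of collecting tokens, and the replaced string is discarded. The function callbackLex and the length check at the end are illustrative choices, not Phlexy's actual code.

```php
<?php
// Hypothetical helper (not part of Phlexy): lets preg_replace_callback
// drive the loop and collects tokens as a side effect of the callback.
function callbackLex(array $definition, string $input): array {
    $tokenTypes = array_values($definition);
    $compiled = '~(' . implode(')|(', array_keys($definition)) . ')~A';

    $tokens = [];
    $consumed = 0;

    preg_replace_callback(
        $compiled,
        function (array $matches) use (&$tokens, &$consumed, $tokenTypes) {
            // Find the first non-empty group; it identifies the token type.
            for ($i = 1; $matches[$i] === ''; ++$i);
            $tokens[] = [$tokenTypes[$i - 1], $matches[0]];
            $consumed += strlen($matches[0]);
            return ''; // the replacement result is thrown away
        },
        $input
    );

    // If matching stopped early, part of the input was not lexed.
    if ($consumed !== strlen($input)) {
        throw new RuntimeException("Unexpected character at offset $consumed");
    }

    return $tokens;
}

$callbackTokens = callbackLex([
    '[^",\r\n]+' => 0,
    ','          => 2,
    '\r?\n'      => 3,
], "a,b\n");
```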

Stateless\UsingMarks and Stateful\UsingMarks use the (*MARK) mechanism, which is exposed as of PHP 5.5.
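As a minimal sketch of how (*MARK) can identify the matched alternative: each alternative ends with a named mark, and preg_match reports the mark of the successful alternative in $matches['MARK']. The function markLex and the numeric mark names are assumptions made for illustration, not Phlexy's actual implementation; the type declarations require PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of Phlexy): tags every alternative with
// a (*MARK:n) verb and reads the token type from $matches['MARK'].
function markLex(array $definition, string $input): array {
    $alternatives = [];
    $tokenTypes = [];
    $i = 0;
    foreach ($definition as $regex => $token) {
        // (?:...) keeps a top-level | inside $regex attached to its own mark.
        $alternatives[] = '(?:' . $regex . ')(*MARK:' . $i . ')';
        $tokenTypes[$i] = $token;
        $i++;
    }
    $compiled = '~' . implode('|', $alternatives) . '~A';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (!preg_match($compiled, $input, $matches, 0, $offset)
            || $matches[0] === ''
        ) {
            throw new RuntimeException("Unexpected character at offset $offset");
        }
        // $matches['MARK'] holds the name of the mark that was passed last.
        $tokens[] = [$tokenTypes[$matches['MARK']], $matches[0]];
        $offset += strlen($matches[0]);
    }

    return $tokens;
}

$markTokens = markLex([
    '[^",\r\n]+' => 0,
    ','          => 2,
    '\r?\n'      => 3,
], "a,b\n");
```

Since no capturing groups are needed, the token type is available directly from the mark instead of being derived from group offsets.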

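As the performance measurements above show, the Simple approach is generally slower than the compiled regex approach, often by a wide margin (the all-"a" input, where it is the fastest, is the exception). The MARK-based implementations in turn clearly outperform the implementations based on group offsets. This advantage grows with the size of the lexer: for the CSV lexer the difference is relatively small, whereas for the PHP lexer the MARK-based implementation is roughly 21 to 25 times faster than the simple one (0.151 s vs. 0.007 s for this file, 0.498 s vs. 0.020 s for the larger TestAbstract file).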
