nikic/phlexy

Lexing experiments in PHP

github.com/nikic/Phlexy

v0.2 2019-06-26 21:03 UTC

README

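This project is a follow-up to my post on fast lexing in PHP. It contains a few lexer implementations (both stateless and stateful) and the associated performance tests.

Usage

Lexers are created from a lexer definition using a factory class.

For example, if you want to create a MARK-based stateless CSV lexer, you can use the following code: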

<?php
require 'path/to/vendor/autoload.php';

$factory = new Phlexy\LexerFactory\Stateless\UsingMarks(
    new Phlexy\LexerDataGenerator
);

$lexer = $factory->createLexer(array(
    '[^",\r\n]+'                     => 0, // 0, 1, 2, 3 are the tokens
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1, // they should really be constants
    ','                              => 2,
    '\r?\n'                          => 3,
));

$tokens = $lexer->lex("hallo world,foo bar,more foo,more bar,\"rare , escape\",some more,stuff\n...");

Similarly, for a stateful lexer:

<?php
require 'path/to/lib/Phlexy/bootstrap.php';

$factory = new Phlexy\LexerFactory\Stateful\UsingMarks(
    new Phlexy\LexerDataGenerator
);

// The "i" is an additional modifier (all createLexer methods accept it)
$lexer = $factory->createLexer($lexerDefinition, 'i');

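For an example of a stateful lexer definition, you can look at the lexer definition for the PHP source code.

Performance

The performance test script can be used to compare the different lexer implementations: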

$ php-7.2 examples/performanceTests.php

Timing lexing of CVS data:
Took 0.55736708641052 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.526859998703 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.49272608757019 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.5570011138916 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.46333193778992 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of all "a":
Took 0.58650183677673 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.754310131073 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.70682787895203 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.76406478881836 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.62837815284729 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of all "z":
Took 0.79967403411865 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.30202317237854 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.29198718070984 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.36609601974487 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.12433409690857 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing alphabet lexing of random string:
Took 1.1720998287201 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.5946900844574 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.55696296691895 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.6708779335022 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.33155107498169 seconds (Phlexy\Lexer\Stateless\UsingMarks)

Timing PHP lexing of this file:
Took 0.151211977005 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.025480031967163 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.007037878036499 seconds (Phlexy\Lexer\Stateful\UsingMarks)

Timing PHP lexing of larger TestAbstract file:
Took 0.49794602394104 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.083348035812378 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.019592046737671 seconds (Phlexy\Lexer\Stateful\UsingMarks)

Stateless\Simple and Stateful\Simple are simple lexer implementations (they loop through the regular expressions one by one).
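To illustrate what "looping through the regular expressions" means, here is a minimal sketch of a stateless lexer in that style. It is not Phlexy's actual code: the function name simpleLex and the [token, text] result shape are assumptions made for this example.

```php
<?php
// Hypothetical helper, not part of Phlexy: a naive stateless lexer.
// $regexToToken maps a regex body (without delimiters) to a token id.
// Assumption: no regex contains the "~" delimiter or matches the empty string.
function simpleLex(array $regexToToken, string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        $matched = false;
        // Try every regex in definition order until one matches at $offset.
        foreach ($regexToToken as $regex => $token) {
            // The "A" modifier anchors the match at $offset.
            if (preg_match('~' . $regex . '~A', $input, $m, 0, $offset) && $m[0] !== '') {
                $tokens[] = [$token, $m[0]];
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("Unexpected character at offset $offset");
        }
    }

    return $tokens;
}

$tokens = simpleLex([
    '[^",\r\n]+' => 0,
    ','          => 2,
    '\r?\n'      => 3,
], "foo,bar\n");
```

Every token costs up to one preg_match call per regex in the definition, which is why this style slows down as the definition grows.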

Stateless\WithoutCapturingGroups, Stateless\WithCapturingGroups and Stateful\UsingCompiledRegex use the compiled-regex approach described in the blog post mentioned above.
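The compiled-regex idea can be sketched roughly as follows: all token regexes are joined into a single alternation with one capturing group each, and the index of the group that matched identifies the token. This is a simplified illustration, not Phlexy's implementation. The name compiledLex is hypothetical, and the sketch assumes the individual regexes contain no capturing groups of their own.

```php
<?php
// Hypothetical helper, not Phlexy's actual code.
// Assumptions: the token regexes contain no capturing groups, no "~",
// and never match the empty string.
function compiledLex(array $regexToToken, string $input): array
{
    $regexes  = array_keys($regexToToken);
    $tokenIds = array_values($regexToToken);

    // One big alternation with one capturing group per token regex.
    $compiled = '~(' . implode(')|(', $regexes) . ')~A';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (!preg_match($compiled, $input, $m, 0, $offset) || $m[0] === '') {
            throw new RuntimeException("Unexpected character at offset $offset");
        }
        // Groups before the matching alternative are empty strings,
        // so the first non-empty group tells us which regex matched.
        for ($i = 1; $m[$i] === ''; ++$i);
        $tokens[] = [$tokenIds[$i - 1], $m[0]];
        $offset += strlen($m[0]);
    }

    return $tokens;
}

$tokens = compiledLex([
    '[^",\r\n]+'                     => 0,
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1,
    ','                              => 2,
    '\r?\n'                          => 3,
], "foo,\"b,c\"\n");
```

Only one preg_match call is needed per token, but finding the matching group still takes a scan over the groups. A variant that allows capturing groups inside the token regexes would additionally need to track group offsets.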

Stateless\UsingPregReplace is an extension of the compiled-regex approach in which the looping over the regular expressions is done by (mis)using preg_replace_callback.
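One way such a preg_replace_callback-driven loop could look is sketched below. This is a guess at the general technique rather than a copy of Phlexy's code: the name pregReplaceLex is hypothetical, and using \G to force contiguous matches is an assumption of this sketch.

```php
<?php
// Hypothetical helper, not Phlexy's actual code.
// Assumptions: the token regexes contain no capturing groups, no "~",
// and never match the empty string.
function pregReplaceLex(array $regexToToken, string $input): array
{
    $regexes  = array_keys($regexToToken);
    $tokenIds = array_values($regexToToken);

    // \G forces each match to start exactly where the previous one ended.
    $compiled = '~\G(?:(' . implode(')|(', $regexes) . '))~';

    $tokens = [];
    $rest = preg_replace_callback(
        $compiled,
        function (array $m) use (&$tokens, $tokenIds) {
            // The first non-empty group identifies the token.
            for ($i = 1; $m[$i] === ''; ++$i);
            $tokens[] = [$tokenIds[$i - 1], $m[0]];
            return ''; // drop the lexed text from the result string
        },
        $input
    );

    // Anything left over could not be lexed (null signals a PCRE error).
    if ($rest !== '') {
        throw new RuntimeException('Could not lex the complete input');
    }

    return $tokens;
}

$tokens = pregReplaceLex([
    '[^",\r\n]+'                     => 0,
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1,
    ','                              => 2,
    '\r?\n'                          => 3,
], "foo,\"b,c\"\n");
```

Here the loop over the input runs inside the PCRE extension instead of in PHP code. The replacement result is only used to detect input that was left unlexed.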

Stateless\UsingMarks and Stateful\UsingMarks use the (*MARK) mechanism exposed since PHP 5.5.
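A rough sketch of the (*MARK) technique: each alternative of the compiled regex ends in a (*MARK:n) verb, and preg_match reports the mark of the alternative that matched in $matches['MARK']. The function name markLex and the exact shape of the compiled regex are assumptions of this example and not necessarily what Phlexy generates.

```php
<?php
// Hypothetical helper, not Phlexy's actual code.
// Assumptions: the token regexes contain no "~" and never match the empty string.
function markLex(array $regexToToken, string $input): array
{
    $tokenIds = array_values($regexToToken);

    // Build (?:re0)(*MARK:0)|(?:re1)(*MARK:1)|... so every alternative
    // carries its own index as a mark.
    $parts = [];
    foreach (array_keys($regexToToken) as $i => $regex) {
        $parts[] = '(?:' . $regex . ')(*MARK:' . $i . ')';
    }
    $compiled = '~' . implode('|', $parts) . '~A';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (!preg_match($compiled, $input, $m, 0, $offset) || $m[0] === '') {
            throw new RuntimeException("Unexpected character at offset $offset");
        }
        // $m['MARK'] holds the index of the alternative that matched,
        // so no scan over capturing groups is needed.
        $tokens[] = [$tokenIds[(int) $m['MARK']], $m[0]];
        $offset += strlen($m[0]);
    }

    return $tokens;
}

$tokens = markLex([
    '[^",\r\n]+'                     => 0,
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1,
    ','                              => 2,
    '\r?\n'                          => 3,
], "foo,\"b,c\"\n");
```

Because the matching alternative is read directly from the mark, the lookup cost per token does not depend on how many regexes the definition contains.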

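As the performance measurements above show, the Simple approach is much slower than the compiled-regex approach. The MARK-based implementations in turn perform much better than the implementations based on group offsets. The advantage grows with the size of the lexer: for the CSV lexer the difference is relatively small, whereas for the PHP lexer the MARK-based implementation is roughly 25 times faster than the simple one (0.0196 seconds versus 0.498 seconds on the larger test file).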