z1b/laravel-keywords

dev-master / dev-master 2019-04-26 16:47 UTC

This package is auto-updated.

Last update: 2020-05-26 19:08:45 UTC


README

Yet another PHP implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm.


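Why is this package useful?

Keywords describe the main topics expressed in a document or text. Keyword extraction pulls the important words and phrases out of a text. The result can be used to build a list of tags, to build a keyword search index, to group similar content by topic, and more. This library gives PHP developers a simple way to get a list of keywords and phrases from a string of text.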

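This project is based on another project called RAKE-PHP by Richard Filipčík, which is a translation of a Python implementation simply called RAKE.

The algorithm is described in Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010), "Automatic Keyword Extraction from Individual Documents", a chapter in M. W. Berry & J. Kogan (Eds.), Text Mining: Applications and Theory. [Link](https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents)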
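
The core idea of RAKE from the paper cited above can be sketched in plain PHP. This is a minimal illustration, not the code this package ships: the function name rakeSketch() and the short stopword list are assumptions made for the example. Candidate phrases are the word runs between stopwords and punctuation, and each phrase is scored as the sum of degree / frequency over its words.

```php
<?php

/**
 * Minimal RAKE sketch (hypothetical helper, not part of the package).
 *
 * @param string   $text      Text to analyse
 * @param string[] $stopwords Lowercase stopwords (illustrative list)
 * @return array<string, float> Phrase => score, highest score first
 */
function rakeSketch(string $text, array $stopwords): array
{
    $phrases = [];
    // Split on punctuation first; strtolower() is enough for ASCII text,
    // multi-byte text would need mb_strtolower().
    foreach (preg_split('/[.,;:!?()]+/', strtolower($text)) as $fragment) {
        $current = [];
        foreach (preg_split('/\s+/', trim($fragment), -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if (in_array($word, $stopwords, true)) {
                // A stopword ends the current candidate phrase.
                if ($current) {
                    $phrases[] = $current;
                }
                $current = [];
            } else {
                $current[] = $word;
            }
        }
        if ($current) {
            $phrases[] = $current;
        }
    }

    // frequency: how often a word occurs in the candidate phrases.
    // degree: total length of all candidate phrases the word occurs in.
    $frequency = [];
    $degree = [];
    foreach ($phrases as $words) {
        foreach ($words as $word) {
            $frequency[$word] = ($frequency[$word] ?? 0) + 1;
            $degree[$word] = ($degree[$word] ?? 0) + count($words);
        }
    }

    // Phrase score = sum of degree / frequency over its words.
    $scores = [];
    foreach ($phrases as $words) {
        $score = 0.0;
        foreach ($words as $word) {
            $score += $degree[$word] / $frequency[$word];
        }
        $scores[implode(' ', $words)] = $score;
    }

    arsort($scores); // highest score first, keys preserved
    return $scores;
}

$stopwords = ['of', 'a', 'and', 'are', 'for', 'all', 'given'];
$scores = rakeSketch(
    'Upper bounds for components of a minimal set of solutions and ' .
    'algorithms of construction of minimal generating sets of solutions ' .
    'for all types of systems are given.',
    $stopwords
);
print_r($scores);
```

With this input the sketch ranks "minimal generating sets" first with 8.5, followed by "minimal set" with 4.5 and "upper bounds" with 4, which agrees with the scores listed for those phrases in Example 2.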

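This package aims to offer the following advantages over the original RAKE-PHP package:

  1. Adheres to the PSR-2 coding standard.
  2. Implements PSR-4 so that it can be installed with Composer.
  3. Adds extra functionality such as method chaining.
  4. Offers multiple ways to supply source stopwords.
  5. Full unit test coverage.
  6. Performance improvements.
  7. Improved documentation.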

Currently supported languages

  • US English (en_US)
  • Spanish (es_AR)
  • French (fr_FR)
  • Polish (pl_PL)
  • Russian (ru_RU)

Version

v1.0.10

Special thanks

Special thanks to Jarosław Wasilewski for adding the Polish language and improving multi-byte support.

Installation

With Composer

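$ composer require z1b/laravel-keywords:dev-master

Or add the package to the require section of composer.json:

{
    "require": {
        "z1b/laravel-keywords": "dev-master"
    }
}

Then load the Composer autoloader and import the class:

<?php
require 'vendor/autoload.php';

use Z1b\LaravelKeywords\RakePlus;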

Without Composer

<?php

require 'path/to/AbstractStopwordProvider.php';
require 'path/to/StopwordArray.php';
require 'path/to/StopwordsPatternFile.php';
require 'path/to/StopwordsPHP.php';
require 'path/to/RakePlus.php';

use Z1b\LaravelKeywords\RakePlus;

Example 1

Creates a new instance of RakePlus, extracts the phrases and returns the results. The text is assumed to be English (US).


use Z1b\LaravelKeywords\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " . 
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

$phrases = RakePlus::create($text)->get();

print_r($phrases);

Array
(
    [0] => criteria
    [1] => compatibility
    [2] => system
    [3] => linear diophantine equations
    [4] => strict inequations
    [5] => nonstrict inequations
    [6] => considered
    [7] => upper bounds
    [8] => components
    [9] => minimal set
    [10] => solutions
    [11] => algorithms
    [12] => construction
    [13] => minimal generating sets
    [14] => types
    [15] => systems
)


Example 2

Creates a new instance of RakePlus, extracts the phrases in different orders and shows how to get the phrase scores.


use Z1b\LaravelKeywords\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " . 
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

// Note: en_US is the default language.
$rake = RakePlus::create($text, 'en_US');

// 'asc' is optional and is the default sort order
$phrases = $rake->sort('asc')->get();
print_r($phrases);

Array
(
    [0] => algorithms
    [1] => compatibility
    [2] => components
    [3] => considered
    [4] => construction
    [5] => criteria
    [6] => linear diophantine equations
    [7] => minimal generating sets
    [8] => minimal set
    [9] => nonstrict inequations
    [10] => solutions
    [11] => strict inequations
    [12] => system
    [13] => systems
    [14] => types
    [15] => upper bounds
)

// Sort in descending order
$phrases = $rake->sort('desc')->get();
print_r($phrases);

Array
(
    [0] => upper bounds
    [1] => types
    [2] => systems
    [3] => system
    [4] => strict inequations
    [5] => solutions
    [6] => nonstrict inequations
    [7] => minimal set
    [8] => minimal generating sets
    [9] => linear diophantine equations
    [10] => criteria
    [11] => construction
    [12] => considered
    [13] => components
    [14] => compatibility
    [15] => algorithms
)

// Sort the phrases by score and return the scores
$phrase_scores = $rake->sortByScore('desc')->scores();
print_r($phrase_scores);

Array
(
    [linear diophantine equations] => 9
    [minimal generating sets] => 8.5
    [minimal set] => 4.5
    [strict inequations] => 4
    [nonstrict inequations] => 4
    [upper bounds] => 4
    [criteria] => 1
    [compatibility] => 1
    [system] => 1
    [considered] => 1
    [components] => 1
    [solutions] => 1
    [algorithms] => 1
    [construction] => 1
    [types] => 1
    [systems] => 1
)


// Extract phrases from a new string on the same RakePlus instance. Using the 
// same RakePlus instance is faster than creating a new instance as the 
// language files do not have to be re-loaded and parsed.

$text = "A fast Fourier transform (FFT) algorithm computes...";
$phrases = $rake->extract($text)->sort()->get();
print_r($phrases);

Array
(
    [0] => algorithm computes
    [1] => fast fourier transform
    [2] => fft
)
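
The score array from Example 2 lends itself to building a tag list, one of the uses mentioned in the introduction. The following is a minimal sketch in plain PHP: topPhrases() is a hypothetical helper that is not part of the package, and the input is a shortened copy of the scores printed above.

```php
<?php

// Shortened copy of the phrase => score array from Example 2.
$phraseScores = [
    'linear diophantine equations' => 9,
    'minimal generating sets' => 8.5,
    'minimal set' => 4.5,
    'strict inequations' => 4,
    'criteria' => 1,
    'systems' => 1,
];

/**
 * Keep the phrases that score at least $minScore, at most $limit of them.
 * Hypothetical helper for illustration only.
 */
function topPhrases(array $scores, float $minScore, int $limit): array
{
    arsort($scores); // highest score first, keys preserved
    $kept = array_filter($scores, fn ($score) => $score >= $minScore);

    return array_slice(array_keys($kept), 0, $limit);
}

$tags = topPhrases($phraseScores, 4.0, 3);
print_r($tags);
```

Here $tags holds the three highest scoring phrases. A score threshold combined with a limit keeps single-word phrases with a score of 1 out of the tag list.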

Example 3

Creates a new instance of RakePlus and extracts the unique keywords from the phrases.


use Z1b\LaravelKeywords\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " . 
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

$keywords = RakePlus::create($text)->keywords();
print_r($keywords);

Array
(
    [0] => criteria
    [1] => compatibility
    [2] => system
    [3] => linear
    [4] => diophantine
    [5] => equations
    [6] => strict
    [7] => inequations
    [8] => nonstrict
    [9] => considered
    [10] => upper
    [11] => bounds
    [12] => components
    [13] => minimal
    [14] => set
    [15] => solutions
    [16] => algorithms
    [17] => construction
    [18] => generating
    [19] => sets
    [20] => types
    [21] => systems
)
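
The relation between phrases and keywords in Example 3 can be illustrated without the library: every phrase is split into words and only the first occurrence of each word is kept. This is a sketch under that assumption. uniqueKeywords() is a hypothetical helper, and the package's keywords() method may differ in its details.

```php
<?php

/**
 * Split phrases into single words and keep the first occurrence of each.
 * Hypothetical helper for illustration only.
 */
function uniqueKeywords(array $phrases): array
{
    $words = [];
    foreach ($phrases as $phrase) {
        foreach (preg_split('/\s+/', $phrase, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $words[] = $word;
        }
    }

    // array_unique() keeps the first occurrence; array_values() reindexes from 0.
    return array_values(array_unique($words));
}

$keywords = uniqueKeywords([
    'minimal set',
    'solutions',
    'minimal generating sets',
    'solutions',
]);
print_r($keywords);
```

The word "minimal" appears in two phrases but only once in the result, which matches the order seen in the output of Example 3 (minimal, set, solutions, then generating, sets).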

Example 4

Creates a new instance of RakePlus without using the static RakePlus::create method.


use Z1b\LaravelKeywords\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " . 
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

$rake = new RakePlus();
$phrases = $rake->extract($text)->get();

// Alternative method:
$phrases = (new RakePlus($text))->get();

Example 5

You can supply custom stopwords in four different ways:


use Z1b\LaravelKeywords\RakePlus;
// Assumption: StopwordArray lives in the same namespace as RakePlus.
use Z1b\LaravelKeywords\StopwordArray;

// 1: The standard way (provide a language code).
//    RakePlus will first look for ./lang/en_US.pattern; if that is
//    not found, it will look for ./lang/en_US.php.
$rake = RakePlus::create($text, 'en_US');

// 2: Pass an array containing stopwords
$rake = RakePlus::create($text, ['a', 'able', 'about', 'above', /* ... */]);

// 3: Pass the name of a PHP or pattern file,
//    see lang/en_US.php and lang/en_US.pattern for examples.
$rake = RakePlus::create($text, '/path/to/my/stopwords.pattern');

// 4: Create an instance of one of the stopword provider classes (or
//    create your own) and pass that to RakePlus:
$stopwords = StopwordArray::create(['a', 'able', 'about', 'above', /* ... */]);
$rake = RakePlus::create($text, $stopwords);
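
For the second option, the stopword array often comes from a plain text source. The following sketch shows one way to turn such text into a clean array. parseStopwords() is a hypothetical helper that is not part of the package, and the input format (words separated by commas or whitespace) is an assumption.

```php
<?php

/**
 * Turn raw stopword text (comma or whitespace separated) into a
 * lowercase, de-duplicated array. Hypothetical helper for illustration.
 */
function parseStopwords(string $raw): array
{
    // strtolower() is enough for ASCII lists; multi-byte lists would
    // need mb_strtolower().
    $words = preg_split('/[\s,]+/', strtolower($raw), -1, PREG_SPLIT_NO_EMPTY);

    return array_values(array_unique($words));
}

$stopwords = parseStopwords("a, able\nabout\nAbove, a\n");
print_r($stopwords);

// The array could then be passed to RakePlus as in option 2:
// $rake = RakePlus::create($text, $stopwords);
```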

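Example 6

You can specify the minimum number of characters that a phrase/keyword must have. Anything shorter than the minimum is filtered out. The default is 0 (no minimum).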


use Z1b\LaravelKeywords\RakePlus;

$text = '6462 Little Crest Suite, 413 Lake Carlietown, WA 12643';

// Without a minimum
$phrases = RakePlus::create($text, 'en_US', 0)->get();
print_r($phrases);

Array
(
    [0] => crest suite
    [1] => 413 lake carlietown
    [2] => wa 12643
)

// With a minimum
$phrases = RakePlus::create($text, 'en_US', 10)->get();
print_r($phrases);

Array
(
    [0] => crest suite
    [1] => 413 lake carlietown
)

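Example 7

You can specify whether phrases/keywords that consist only of a number are filtered out. By default, numerics are filtered out.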


use Z1b\LaravelKeywords\RakePlus;

$text = '6462 Little Crest Suite, 413 Lake Carlietown, WA 12643';

// Filter out numerics
$phrases = RakePlus::create($text, 'en_US', 0, true)->get();
print_r($phrases);

Array
(
    [0] => crest suite
    [1] => 413 lake carlietown
    [2] => wa 12643
)

// Do not filter out numerics
$phrases = RakePlus::create($text, 'en_US', 0, false)->get();
print_r($phrases);

Array
(
    [0] => 6462
    [1] => crest suite
    [2] => 413 lake carlietown
    [3] => wa 12643
)
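
The options from Examples 6 and 7 amount to two simple filters over the candidate phrases. The sketch below reproduces the outputs shown above in plain PHP. filterPhrases() and its parameter names are made up for this illustration and are not the package's internals.

```php
<?php

/**
 * Apply a minimum length filter and a numeric filter to candidate phrases.
 * Hypothetical helper for illustration only.
 */
function filterPhrases(array $phrases, int $minLength = 0, bool $filterNumerics = true): array
{
    $kept = array_filter(
        $phrases,
        function (string $phrase) use ($minLength, $filterNumerics) {
            if ($filterNumerics && is_numeric($phrase)) {
                return false; // purely numeric, e.g. "6462"
            }

            // strlen() counts bytes; multi-byte text would need mb_strlen().
            return $minLength <= 0 || strlen($phrase) >= $minLength;
        }
    );

    return array_values($kept); // reindex from 0
}

$candidates = ['6462', 'crest suite', '413 lake carlietown', 'wa 12643'];

print_r(filterPhrases($candidates));           // numerics dropped
print_r(filterPhrases($candidates, 0, false)); // everything kept
print_r(filterPhrases($candidates, 10));       // short phrases dropped as well
```

Note that "413 lake carlietown" and "wa 12643" survive the numeric filter because only phrases that are numeric as a whole are removed.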

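How to add additional languages

Using the stopwords extractor tool

The library requires a list of "stopwords" for each language. Stopwords are words that are commonly used in a language, such as "and", "are", "or", etc. Example lists are available online: a plain list for en_US, and a collection that covers 50 different languages in individual JSON files.

When working with a simple list such as the first one, you can copy the text and paste it into a text file, then use the extractor tool to convert it into a format that this library can read efficiently. An example of such a stopwords file, copied from such a list, is included for reference (console/stopwords_en_US.txt).

Alternatively, you can extract the stopwords from a JSON file. An example file is located at console/stopwords_en_US.json.

To extract stopwords from a text file, run the following on the command line: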

$ php -q extractor.php stopwords_en_US.txt

To extract stopwords from a JSON file, run the following on the command line:

$ php -q extractor.php stopwords_en_US.json

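The tool writes the results to the terminal. You will notice that the results look like PHP, and in fact they are PHP. You can write them directly to a PHP file by redirecting the output: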

$ php -q extractor.php stopwords_en_US.txt > en_US.php

Finally, copy the en_US.php file to the lang/ directory (you may have to set its permissions so that the web server can execute it) and then instantiate RakePlus like so:

$rake = RakePlus::create($text, 'en_US');

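To improve the initial loading speed of the language file within RakePlus, you can also make the extractor output the results as a regular expression pattern by using the -p switch: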

$ php -q extractor.php stopwords_en_US.txt -p > en_US.pattern

RakePlus will always look for a .pattern file first. If it is not found, it will look for a .php file in the lang directory.
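
The benefit of a pre-built pattern can be illustrated with a small sketch: instead of checking each word against an array, a single regular expression matches any stopword, and splitting on it leaves the candidate phrases. This is an assumption about the idea behind the .pattern files. The exact format that the extractor writes is not shown here, and buildStopwordPattern() is a hypothetical helper.

```php
<?php

/**
 * Build one regular expression that matches any stopword as a whole word.
 * Hypothetical helper for illustration only.
 */
function buildStopwordPattern(array $stopwords): string
{
    // preg_quote() escapes characters with a special meaning in a regex.
    $escaped = array_map(fn (string $word) => preg_quote($word, '/'), $stopwords);

    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

$pattern = buildStopwordPattern(['a', 'and', 'of', 'for']);

// Splitting on the pattern leaves the text between the stopwords.
$parts = preg_split($pattern, 'Upper bounds for components of a minimal set');

// Trim whitespace and drop the empty pieces between adjacent stopwords.
$phrases = array_values(array_filter(array_map('trim', $parts), 'strlen'));
print_r($phrases);
```

Building the expression once and storing it means that the per-word lookup work is done at extraction time rather than every time the language file is loaded.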

To run the tests

./vendor/bin/phpunit tests/RakePlusTest.php

License

Released under the MIT License (see LICENSE).