z1b / laravel-keywords
Depends on the rake-plus package
Requires
- php: >=5.4.0
- ext-json: *
- ext-mbstring: *
Requires (Dev)
- php: >=5.5.0
- phpunit/phpunit: ~4.0|~5.0
Last update: 2020-05-26 19:08:45 UTC
README
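This is another PHP implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm.
Why is this package useful?
Keywords describe the main topics expressed in a document or text. Keyword extraction pulls the important words and phrases out of a text. This can be used to build a list of tags, build a keyword search index, group similar content by topic, and more. This library gives PHP developers a simple way to get a list of keywords and phrases from a string of text.
This project is based on another project called RAKE-PHP by Richard Filipčík, which is a translation from a Python implementation simply called RAKE.
As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. [Link](https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents) In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications.
This particular package is intended to offer a number of benefits over the original RAKE-PHP package.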
Currently supported languages
- US English (en_US)
- Spanish (es_AR)
- French (fr_FR)
- Polish (pl_PL)
- Russian (ru_RU)
Version
v1.0.10
Special thanks
Special thanks to Jarosław Wasilewski for adding the Polish language and improving multibyte support.
Installation
With Composer
$ composer require donatello-za/rake-php-plus
{
"require": {
"donatello-za/rake-php-plus": "^1.0"
}
}
<?php
require 'vendor/autoload.php';
use Z1b\LaravelKeywords\RakePlus;
Without Composer
<?php
require 'path/to/AbstractStopwordProvider.php';
require 'path/to/StopwordArray.php';
require 'path/to/StopwordsPatternFile.php';
require 'path/to/StopwordsPHP.php';
require 'path/to/RakePlus.php';
use Z1b\LaravelKeywords\RakePlus;
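Example 1
Create a new instance of RakePlus, extract the phrases, and return the results. The specified text is assumed to be English (United States).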
use Z1b\LaravelKeywords\RakePlus;
$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
"strict inequations, and nonstrict inequations are considered. Upper bounds " .
"for components of a minimal set of solutions and algorithms of construction " .
"of minimal generating sets of solutions for all types of systems are given.";
$phrases = RakePlus::create($text)->get();
print_r($phrases);
Array
(
[0] => criteria
[1] => compatibility
[2] => system
[3] => linear diophantine equations
[4] => strict inequations
[5] => nonstrict inequations
[6] => considered
[7] => upper bounds
[8] => components
[9] => minimal set
[10] => solutions
[11] => algorithms
[12] => construction
[13] => minimal generating sets
[14] => types
[15] => systems
)
Example 2
Create a new instance of RakePlus, extract the phrases in different orders, and show how to get the phrase scores.
use Z1b\LaravelKeywords\RakePlus;
$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
"strict inequations, and nonstrict inequations are considered. Upper bounds " .
"for components of a minimal set of solutions and algorithms of construction " .
"of minimal generating sets of solutions for all types of systems are given.";
// Note: en_US is the default language.
$rake = RakePlus::create($text, 'en_US');
// 'asc' is optional and is the default sort order
$phrases = $rake->sort('asc')->get();
print_r($phrases);
Array
(
[0] => algorithms
[1] => compatibility
[2] => components
[3] => considered
[4] => construction
[5] => criteria
[6] => linear diophantine equations
[7] => minimal generating sets
[8] => minimal set
[9] => nonstrict inequations
[10] => solutions
[11] => strict inequations
[12] => system
[13] => systems
[14] => types
[15] => upper bounds
)
// Sort in descending order
$phrases = $rake->sort('desc')->get();
print_r($phrases);
Array
(
[0] => upper bounds
[1] => types
[2] => systems
[3] => system
[4] => strict inequations
[5] => solutions
[6] => nonstrict inequations
[7] => minimal set
[8] => minimal generating sets
[9] => linear diophantine equations
[10] => criteria
[11] => construction
[12] => considered
[13] => components
[14] => compatibility
[15] => algorithms
)
// Sort the phrases by score and return the scores
$phrase_scores = $rake->sortByScore('desc')->scores();
print_r($phrase_scores);
Array
(
[linear diophantine equations] => 9
[minimal generating sets] => 8.5
[minimal set] => 4.5
[strict inequations] => 4
[nonstrict inequations] => 4
[upper bounds] => 4
[criteria] => 1
[compatibility] => 1
[system] => 1
[considered] => 1
[components] => 1
[solutions] => 1
[algorithms] => 1
[construction] => 1
[types] => 1
[systems] => 1
)
// Extract phrases from a new string on the same RakePlus instance. Using the
// same RakePlus instance is faster than creating a new instance as the
// language files do not have to be re-loaded and parsed.
$text = "A fast Fourier transform (FFT) algorithm computes...";
$phrases = $rake->extract($text)->sort()->get();
print_r($phrases);
Array
(
[0] => algorithm computes
[1] => fast fourier transform
[2] => fft
)
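The scores printed in Example 2 are consistent with the usual RAKE scoring rule: each word scores its degree (the total length of the phrases it appears in) divided by its frequency, and a phrase scores the sum of its word scores. The following is a minimal sketch of that rule, not the library's own code; the function name `rakeScoreSketch` is hypothetical, and it takes the already extracted unique phrases as input.

```php
<?php
// Hypothetical helper: scores phrases with the standard RAKE rule
// (word score = degree / frequency, phrase score = sum of word scores).
function rakeScoreSketch(array $phrases)
{
    $frequency = [];
    $degree = [];
    foreach ($phrases as $phrase) {
        $words = explode(' ', $phrase);
        $length = count($words);
        foreach ($words as $word) {
            // Frequency counts the phrases a word occurs in.
            $frequency[$word] = (isset($frequency[$word]) ? $frequency[$word] : 0) + 1;
            // Degree accumulates the length of every phrase containing the word.
            $degree[$word] = (isset($degree[$word]) ? $degree[$word] : 0) + $length;
        }
    }

    $scores = [];
    foreach ($phrases as $phrase) {
        $score = 0;
        foreach (explode(' ', $phrase) as $word) {
            $score += $degree[$word] / $frequency[$word];
        }
        $scores[$phrase] = $score;
    }

    arsort($scores); // highest score first, keys preserved
    return $scores;
}

$scores = rakeScoreSketch([
    'criteria', 'compatibility', 'system', 'linear diophantine equations',
    'strict inequations', 'nonstrict inequations', 'considered', 'upper bounds',
    'components', 'minimal set', 'solutions', 'algorithms', 'construction',
    'minimal generating sets', 'types', 'systems',
]);
print_r($scores);
```

For instance, "minimal" appears in a two-word and a three-word phrase, so its score is (2 + 3) / 2 = 2.5, which gives "minimal generating sets" a score of 2.5 + 3 + 3 = 8.5.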
Example 3
Create a new instance of RakePlus and extract the unique keywords from the phrases.
use Z1b\LaravelKeywords\RakePlus;
$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
"strict inequations, and nonstrict inequations are considered. Upper bounds " .
"for components of a minimal set of solutions and algorithms of construction " .
"of minimal generating sets of solutions for all types of systems are given.";
$keywords = RakePlus::create($text)->keywords();
print_r($keywords);
Array
(
[0] => criteria
[1] => compatibility
[2] => system
[3] => linear
[4] => diophantine
[5] => equations
[6] => strict
[7] => inequations
[8] => nonstrict
[9] => considered
[10] => upper
[11] => bounds
[12] => components
[13] => minimal
[14] => set
[15] => solutions
[16] => algorithms
[17] => construction
[18] => generating
[19] => sets
[20] => types
[21] => systems
)
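The keyword list above can be reproduced from the phrases of Example 1 by splitting each phrase into words and keeping the first occurrence of each word. The sketch below assumes that is all `keywords()` needs to do; the helper name `keywordsFromPhrasesSketch` is hypothetical and the library's implementation may differ.

```php
<?php
// Hypothetical helper: flattens phrases into a list of unique words,
// preserving the order in which each word is first seen.
function keywordsFromPhrasesSketch(array $phrases)
{
    $keywords = [];
    foreach ($phrases as $phrase) {
        foreach (explode(' ', $phrase) as $word) {
            if ($word !== '' && !in_array($word, $keywords, true)) {
                $keywords[] = $word;
            }
        }
    }
    return $keywords;
}

$keywords = keywordsFromPhrasesSketch([
    'criteria', 'compatibility', 'system', 'linear diophantine equations',
    'strict inequations', 'nonstrict inequations', 'considered', 'upper bounds',
    'components', 'minimal set', 'solutions', 'algorithms', 'construction',
    'minimal generating sets', 'types', 'systems',
]);
print_r($keywords);
```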
Example 4
Create a new instance of RakePlus without using the static RakePlus::create method.
use Z1b\LaravelKeywords\RakePlus;
$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
"strict inequations, and nonstrict inequations are considered. Upper bounds " .
"for components of a minimal set of solutions and algorithms of construction " .
"of minimal generating sets of solutions for all types of systems are given.";
$rake = new RakePlus();
$phrases = $rake->extract($text)->get();
// Alternative method:
$phrases = (new RakePlus($text))->get();
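Example 5
You can provide custom stopwords in four different ways:
use Z1b\LaravelKeywords\RakePlus;
// Needed for option 4 below; assumes StopwordArray shares RakePlus's namespace.
use Z1b\LaravelKeywords\StopwordArray;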
// 1: The standard way (provide a language code)
// RakePlus will first look for ./lang/en_US.pattern, if
// not found, it will look for ./lang/en_US.php.
$rake = RakePlus::create($text, 'en_US');
// 2: Pass an array containing stopwords
$rake = RakePlus::create($text, ['a', 'able', 'about', 'above', ...]);
// 3: Pass the name of a PHP or pattern file,
// see lang/en_US.php and lang/en_US.pattern for examples.
$rake = RakePlus::create($text, '/path/to/my/stopwords.pattern');
// 4: Create an instance of one of the stopword provider classes (or
// create your own) and pass that to RakePlus:
$stopwords = StopwordArray::create(['a', 'able', 'about', 'above', ...]);
$rake = RakePlus::create($text, $stopwords);
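Example 6
You can specify the minimum number of characters that a phrase/keyword must have; anything shorter than the minimum is filtered out. The default is 0 (no minimum).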
use Z1b\LaravelKeywords\RakePlus;
$text = '6462 Little Crest Suite, 413 Lake Carlietown, WA 12643';
// Without a minimum
$phrases = RakePlus::create($text, 'en_US', 0)->get();
print_r($phrases);
Array
(
[0] => crest suite
[1] => 413 lake carlietown
[2] => wa 12643
)
// With a minimum
$phrases = RakePlus::create($text, 'en_US', 10)->get();
print_r($phrases);
Array
(
[0] => crest suite
[1] => 413 lake carlietown
)
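The effect of the minimum-length argument can be pictured as a simple filter over the extracted phrases. This is a sketch only: the helper name `filterByMinLengthSketch` is hypothetical, and treating the minimum as inclusive is an assumption that the example above does not confirm either way.

```php
<?php
// Hypothetical helper: keeps phrases with at least $minLength characters.
// mb_strlen counts characters rather than bytes, which matters for
// multibyte text (requires ext-mbstring).
function filterByMinLengthSketch(array $phrases, $minLength)
{
    $kept = array_filter($phrases, function ($phrase) use ($minLength) {
        return mb_strlen($phrase, 'UTF-8') >= $minLength;
    });
    return array_values($kept); // re-index from 0
}

$filtered = filterByMinLengthSketch(
    ['crest suite', '413 lake carlietown', 'wa 12643'],
    10
);
print_r($filtered);
```

Here "wa 12643" is 8 characters long, so it is dropped, while the other two phrases pass.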
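Example 7
You can specify whether phrases/keywords that consist solely of numbers should be filtered out. The default is to filter out numerics.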
use Z1b\LaravelKeywords\RakePlus;
$text = '6462 Little Crest Suite, 413 Lake Carlietown, WA 12643';
// Filter out numerics
$phrases = RakePlus::create($text, 'en_US', 0, true)->get();
print_r($phrases);
Array
(
[0] => crest suite
[1] => 413 lake carlietown
[2] => wa 12643
)
// Do not filter out numerics
$phrases = RakePlus::create($text, 'en_US', 0, false)->get();
print_r($phrases);
Array
(
[0] => 6462
[1] => crest suite
[2] => 413 lake carlietown
[3] => wa 12643
)
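The numeric filter only affects phrases that are numbers in their entirety, which is why "413 lake carlietown" and "wa 12643" survive in both outputs. A minimal sketch of such a filter follows; the helper name `filterNumericsSketch` is hypothetical, and using PHP's `is_numeric` for the check is an assumption about how the library decides what counts as a number.

```php
<?php
// Hypothetical helper: removes phrases that are entirely numeric.
function filterNumericsSketch(array $phrases)
{
    $kept = array_filter($phrases, function ($phrase) {
        return !is_numeric($phrase);
    });
    return array_values($kept); // re-index from 0
}

$withoutNumerics = filterNumericsSketch(
    ['6462', 'crest suite', '413 lake carlietown', 'wa 12643']
);
print_r($withoutNumerics);
```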
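How to add additional languages
Using the stopwords extractor tool
The library requires a list of "stopwords" for each language. Stopwords are words that are commonly used in a language, such as "and", "are", "or", etc. An example list of such stopwords for en_US is available online, as is a collection of stopwords for 50 different languages stored in individual JSON files.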
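To see why a stopword list is needed, it helps to look at how RAKE uses it: stopwords and punctuation act as delimiters, and the runs of words left between them become the candidate phrases. The sketch below illustrates the idea with a one-word stopword list; the function name `candidatePhrasesSketch` and the punctuation set are assumptions, and the library's own parsing is likely more thorough.

```php
<?php
// Hypothetical helper: splits text into candidate phrases, using
// punctuation and stopwords as delimiters.
function candidatePhrasesSketch($text, array $stopwords)
{
    // Build one alternation pattern that matches any stopword as a whole word.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/iu';

    $phrases = [];
    // First split on punctuation, then split each fragment on stopwords.
    $fragments = preg_split('/[.,;:!?()]+/u', mb_strtolower($text, 'UTF-8'));
    foreach ($fragments as $fragment) {
        foreach (preg_split($pattern, $fragment) as $candidate) {
            $candidate = trim(preg_replace('/\s+/u', ' ', $candidate));
            if ($candidate !== '') {
                $phrases[] = $candidate;
            }
        }
    }
    return $phrases;
}

$candidates = candidatePhrasesSketch(
    'A fast Fourier transform (FFT) algorithm computes...',
    ['a']
);
print_r($candidates);
```

With only "a" as a stopword, this yields the same three phrases as the end of Example 2, in extraction order rather than sorted.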
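When working with a simple list such as the first one mentioned above, you can copy the text, paste it into a text file, and use the extractor tool to convert it into a format that this library can read efficiently. An example stopwords file copied from that list is provided for reference at console/stopwords_en_US.txt.
Alternatively, you can extract the stopwords from a JSON file; an example is provided at console/stopwords_en_US.json.
To extract stopwords from a text file, run the following from the command line: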
$ php -q extractor.php stopwords_en_US.txt
To extract stopwords from a JSON file, run the following from the command line:
$ php -q extractor.php stopwords_en_US.json
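It will output the results to the terminal. You will notice that the results look like PHP, and in fact they are PHP. You can write the results directly to a PHP file by redirecting the output: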
$ php -q extractor.php stopwords_en_US.txt > en_US.php
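Finally, copy the en_US.php file to the lang/ directory (you may need to set its permissions so that the web server can execute it) and then instantiate php-rake-plus like so: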
$rake = RakePlus::create($text, 'en_US');
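To improve the initial loading speed of the language files in RakePlus, you can also have the extractor produce the results as a regular expression pattern by using the -p switch: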
$ php -q extractor.php stopwords_en_US.txt -p > en_US.pattern
RakePlus will always look for a .pattern file first and, if it is not found, will look for a .php file in the lang directory.
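The lookup order described here can be sketched as a small resolver that tries each extension in turn. The function name `resolveStopwordFileSketch` and its return convention (a path, or null when nothing is found) are assumptions for illustration, not the library's API.

```php
<?php
// Hypothetical helper: returns the first existing language file,
// preferring <language>.pattern over <language>.php.
function resolveStopwordFileSketch($langDir, $language)
{
    $base = rtrim($langDir, '/\\') . DIRECTORY_SEPARATOR . $language;
    foreach (['.pattern', '.php'] as $extension) {
        if (is_file($base . $extension)) {
            return $base . $extension;
        }
    }
    return null; // no language file found
}

// Usage: build a throwaway lang directory in the system temp folder.
$langDir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'rake_lang_' . uniqid();
mkdir($langDir);

touch($langDir . DIRECTORY_SEPARATOR . 'en_US.php');
$onlyPhp = resolveStopwordFileSketch($langDir, 'en_US');

touch($langDir . DIRECTORY_SEPARATOR . 'en_US.pattern');
clearstatcache(); // make sure the new file is seen
$withPattern = resolveStopwordFileSketch($langDir, 'en_US');

echo basename($onlyPhp), PHP_EOL;
echo basename($withPattern), PHP_EOL;
```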
To run tests
./vendor/bin/phpunit tests/RakePlusTest.php
License
Released under the MIT License (see LICENSE).