dangquangha/phpflashtext

A port of the Python FlashText implementation

dev-master 2021-08-07 11:08 UTC

This package is not auto-updated.

Last update: 2024-09-30 00:52:39 UTC


README


This is a port of the excellent Python project https://github.com/vi3k6i5/flashtext; refer to that project for the internal details of the algorithm.

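This algorithm lets you extract or replace many keywords in a single pass. With 300 keywords, each having 5 variants, a regular-expression approach is slower than FlashText. With 1,000 keywords of 5 variants each, the regular expression may fail to compile at all.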
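To illustrate why a single pass is enough, here is a minimal sketch of trie-based keyword matching. It is not the code of this package or of the Python original: the function names `buildTrie` and `extractKeywordsSketch`, the `_end_` marker key, and the word-boundary rule (ASCII letters, digits and underscore count as word characters) are all assumptions made for this example. It works byte by byte and assumes PHP 7 or later.

```php
<?php

// Hypothetical helper, not part of phpflashtext: builds a character trie
// that maps every keyword variant to its clean (canonical) name.
function buildTrie(array $keywords): array
{
    $trie = [];
    foreach ($keywords as $cleanName => $variants) {
        foreach ($variants as $variant) {
            $node = &$trie;
            foreach (str_split(strtolower($variant)) as $char) {
                if (!isset($node[$char])) {
                    $node[$char] = [];
                }
                $node = &$node[$char];
            }
            $node['_end_'] = $cleanName; // marks the end of a complete variant
            unset($node);                // drop the reference before the next variant
        }
    }
    return $trie;
}

// Hypothetical helper: walks the sentence once, following the trie and
// keeping the longest variant that ends on a word boundary.
function extractKeywordsSketch(array $trie, string $sentence): array
{
    $found = [];
    $text = strtolower($sentence);
    $length = strlen($text);
    $isWordChar = static function (string $c): bool {
        return ctype_alnum($c) || $c === '_';
    };

    $i = 0;
    while ($i < $length) {
        // A match may only start at the beginning of a word.
        if ($i > 0 && $isWordChar($text[$i - 1])) {
            $i++;
            continue;
        }
        $node = $trie;
        $j = $i;
        $match = null;
        $matchEnd = $i;
        while ($j < $length && isset($node[$text[$j]])) {
            $node = $node[$text[$j]];
            $j++;
            $atBoundary = $j === $length || !$isWordChar($text[$j]);
            if (isset($node['_end_']) && $atBoundary) {
                $match = $node['_end_']; // longest match seen so far
                $matchEnd = $j;
            }
        }
        if ($match !== null) {
            $found[] = $match;
            $i = $matchEnd; // resume scanning after the matched variant
        } else {
            $i++;
        }
    }
    return $found;
}

$trie = buildTrie([
    'java'               => ['java_2e', 'java programing'],
    'product management' => ['product management techniques', 'product management'],
]);
print_r(extractKeywordsSketch($trie, 'I know java_2e and product management techniques'));
```

In this sketch the work per start position is bounded by the length of the longest variant, whatever the number of keywords in the trie. That is the property that lets the approach scale where a large regex alternation struggles.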

Regular expressions are very slow in PHP 5.6; performance is better in newer versions.
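For comparison, the regular-expression approach referred to here might look like the sketch below: every variant is quoted and joined into one large alternation. This is an illustrative baseline only, not code from this package; `replaceWithRegex` is a hypothetical name, and the sketch assumes PHP 7 or later.

```php
<?php

// Hypothetical baseline, not part of phpflashtext: replaces keyword
// variants with their clean names using one big alternation regex.
function replaceWithRegex(array $keywords, string $sentence): string
{
    // Map each lower-cased variant to its clean name.
    $lookup = [];
    foreach ($keywords as $cleanName => $variants) {
        foreach ($variants as $variant) {
            $lookup[strtolower($variant)] = $cleanName;
        }
    }

    // Longest variants first, so that "product management techniques"
    // is tried before "product management".
    $variants = array_map('strval', array_keys($lookup));
    usort($variants, static function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    $quoted = array_map(static function (string $v): string {
        return preg_quote($v, '/');
    }, $variants);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    $result = preg_replace_callback(
        $pattern,
        static function (array $m) use ($lookup): string {
            return $lookup[strtolower($m[0])];
        },
        $sentence
    );

    // preg_replace_callback() returns null on failure, for example when
    // the pattern is too large to compile; fall back to the input then.
    return $result ?? $sentence;
}

$keywords = [
    'java'               => ['java_2e', 'java programing'],
    'product management' => ['product management techniques', 'product management'],
];
echo replaceWithRegex($keywords, 'I know java_2e and product management techniques'), PHP_EOL;
```

The pattern grows with every variant added, which is why this approach becomes slow, or fails to compile, once the keyword list reaches the sizes mentioned above.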

Installation

composer require dangquangha/phpflashtext

Usage

<?php

use Shdev\FlashText\KeywordProcessor;

$keywordProcessor = new KeywordProcessor();

$keywords = [
	'java'               => ['java_2e', 'java programing'],
	'product management' => ['product management techniques', 'product management'],
];

$keywordProcessor->addKeywordsFromAssocArray($keywords);

$sentence = 'I know java_2e and product management techniques';

$keywordsExtracted = $keywordProcessor->extractKeywords($sentence);
// $keywordsExtracted = ['java', 'product management']

$keywordsExtractedWithSpanInfo = $keywordProcessor->extractKeywords($sentence, true);
// $keywordsExtractedWithSpanInfo = [
//     ['java', 7, 14],
//     ['product management', 19, 48],
// ]


$sentenceNew = $keywordProcessor->replaceKeywords($sentence);
// $sentenceNew = 'I know java and product management';

Citation

The original paper on the FlashText algorithm (arXiv:1711.00046):

    @ARTICLE{2017arXiv171100046S,
       author = {{Singh}, V.},
        title = "{Replace or Retrieve Keywords In Documents at Scale}",
      journal = {ArXiv e-prints},
    archivePrefix = "arXiv",
       eprint = {1711.00046},
     primaryClass = "cs.DS",
     keywords = {Computer Science - Data Structures and Algorithms},
         year = 2017,
        month = oct,
       adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
    }

See also the article published on Medium (freeCodeCamp).