serafim / tf-idf
A library for computing TF-IDF (term frequency-inverse document frequency) for generic documents
Requires
- php: ^8.1
- ext-intl: *
- ext-mbstring: *
- voku/portable-utf8: ^6.0
- voku/stop-words: ^2.0
Requires (Dev)
- phpunit/phpunit: ^9.5.20
- squizlabs/php_codesniffer: ^3.7
- symfony/var-dumper: ^5.4|^6.0
- vimeo/psalm: ^5.6
This package is auto-updated.
Last update: 2024-09-21 16:38:28 UTC
README
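Introduction
TF-IDF is an information retrieval method used to rank the importance of words in a document. It is based on the idea that the more often a word appears in a document, the more relevant it is to that document.
TF-IDF is the product of term frequency and inverse document frequency. The formula for the TF-IDF calculation is: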
TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)
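Term Frequency
The ratio of the number of occurrences of a word to the total number of words in the document. This measures the importance of the word within a single document:
$$\mathrm{tf}(t, d) = \frac{n_t}{\sum_k n_k}$$
where $n_t$ is the number of occurrences of the word $t$ in the document $d$, and the denominator is the total number of words in that document.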
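Inverse Document Frequency
The inverse of the frequency with which a word occurs across the documents of a collection. The concept was introduced by Karen Spärck Jones. Taking IDF into account reduces the weight of commonly used words. Each unique word has exactly one IDF value within a given collection of documents:
$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d_i \in D : t \in d_i\}|}$$
where
- $|D|$: the number of documents in the collection;
- $|\{d_i \in D : t \in d_i\}|$: the number of documents that contain the word $t$ (that is, documents where $n_t > 0$).
The choice of logarithm base in the formula does not matter: changing the base only scales the weight of every word by the same constant factor, which leaves the ratios between weights unchanged.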
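The remark about the logarithm base can be checked numerically. The following is a minimal sketch in plain PHP, independent of this library; the function name `idf` and the corpus sizes are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * IDF of a term that occurs in $df out of $total documents,
 * computed with a logarithm of the given base.
 */
function idf(int $total, int $df, float $base = M_E): float
{
    return \log($total / $df, $base);
}

// Hypothetical corpus of 1000 documents: one rare term (10 documents)
// and one common term (500 documents).
foreach ([M_E, 2.0, 10.0] as $base) {
    $ratio = idf(1000, 10, $base) / idf(1000, 500, $base);
    \printf("base %.3f: rare/common ratio = %.6f\n", $base, $ratio);
}
```

Each iteration prints the same ratio, because switching the base multiplies both IDF values by the same constant.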
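Thus, the TF-IDF measure is the product of two factors:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
High TF-IDF weights go to words that occur frequently in a specific document and rarely in the other documents.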
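To make the definitions concrete, here is a minimal from-scratch sketch in plain PHP. It does not use this library and is not its implementation. The function name `tfidf` and the sample corpus are hypothetical, and the input is assumed to be already tokenized.

```php
<?php
declare(strict_types=1);

/**
 * @param array<string, list<string>> $corpus document name => list of terms
 * @return array<string, array<string, float>> document name => term => weight
 */
function tfidf(array $corpus): array
{
    // Document frequency: the number of documents each term occurs in.
    $df = [];
    foreach ($corpus as $terms) {
        foreach (\array_unique($terms) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $total = \count($corpus);
    $result = [];

    foreach ($corpus as $name => $terms) {
        $result[$name] = [];
        $length = \count($terms);

        foreach (\array_count_values($terms) as $term => $occurrences) {
            $tf = $occurrences / $length;          // share of the document
            $idf = \log($total / $df[$term]);      // natural logarithm
            $result[$name][$term] = $tf * $idf;
        }
    }

    return $result;
}

$weights = tfidf([
    'a' => ['php', 'runs', 'php', 'code'],
    'b' => ['php', 'is', 'popular'],
]);

\var_dump($weights['a']);
```

In this sample, "php" occurs in both documents, so its IDF and therefore its weight is zero, while "runs" and "code" each get a positive weight in document "a".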
Installation
TF-IDF is available as a Composer package and can be installed by running the following command in the root of your project:
$ composer require serafim/tf-idf
Quick Start
Getting information about words:
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file-1.txt');
$vectorizer->addFile(__DIR__ . '/path/to/file-2.txt');

foreach ($vectorizer->compute() as $document => $entries) {
    var_dump($document);

    foreach ($entries as $entry) {
        var_dump($entry);
    }
}
Example result:
Serafim\TFIDF\Document\FileDocument {
  locale: "ru_RU"
  pathname: "/home/example/how-it-works.md"
}
Serafim\TFIDF\Entry {
  term: "работает"
  occurrences: 4
  df: 1
  tf: 0.012012012012012
  idf: 0.69314718055995
  tfidf: 0.0083260922589783
}
Serafim\TFIDF\Entry {
  term: "php"
  occurrences: 26
  df: 2
  tf: 0.078078078078078
  idf: 0.0
  tfidf: 0.0
}
Serafim\TFIDF\Entry {
  term: "запуска"
  occurrences: 2
  df: 1
  tf: 0.006006006006006
  idf: 0.69314718055995
  tfidf: 0.0041630461294892
}
// ...etc...
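The numbers in the first entry of this dump can be reproduced by hand. The sketch below is plain PHP and does not call the library. The term count (333) and the corpus size (2) are not printed in the dump: they are inferred from `occurrences / tf` and from `idf` being the natural logarithm of 2 when `df` is 1, so treat them as assumptions.

```php
<?php
declare(strict_types=1);

// Values copied from the entry for "работает".
$occurrences = 4;
$df = 1;

// Inferred values (assumptions, see the note above).
$termsInDocument = 333;
$documentsInCorpus = 2;

$tf = $occurrences / $termsInDocument;
$idf = \log($documentsInCorpus / $df); // natural logarithm
$tfidf = $tf * $idf;

\printf("tf=%.6f idf=%.6f tfidf=%.6f\n", $tf, $idf, $tfidf);
// prints "tf=0.012012 idf=0.693147 tfidf=0.008326"
```

The same reasoning explains the zero weight of "php": it occurs in both documents, so its IDF is the logarithm of 1.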
Adding Documents
Computing IDF (inverse document frequency) requires a corpus of several documents. There are several ways to add them:
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file.txt');
$vectorizer->addFile(new \SplFileInfo(__DIR__ . '/path/to/file.txt'));
$vectorizer->addText('example text');
$vectorizer->addStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));

// OR
$vectorizer->add(new class implements \Serafim\TFIDF\Document\TextDocumentInterface {
    public function getLocale(): string { /* ... */ }
    public function getContent(): string { /* ... */ }
});
Creating Documents
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$file = $vectorizer->createFile(__DIR__ . '/path/to/file.txt');
$text = $vectorizer->createText('example text');
$stream = $vectorizer->createStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));
Computing
To compute TF-IDF across the loaded documents, use the compute(): iterable method:
foreach ($vectorizer->compute() as $document => $result) {
    // $document = object(Serafim\TFIDF\Document\DocumentInterface)
    // $result = list<object(Serafim\TFIDF\Entry)>
}
To compute TF-IDF between the loaded documents and a passed document, use the computeFor(StreamingDocumentInterface|TextDocumentInterface): iterable method:
$text = $vectorizer->createText('example text');

$result = $vectorizer->computeFor($text);

// $result = list<object(Serafim\TFIDF\Entry)>
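A typical next step is to rank the returned entries by weight. The sketch below is self-contained and uses a hypothetical stand-in class, `EntryStub`, instead of `Serafim\TFIDF\Entry`. It assumes, based only on the dump shown earlier, that an entry exposes `term` and `tfidf` values; how the real class exposes them (public properties or getters) is not confirmed here. The helper name `topTerms` is also made up.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for Serafim\TFIDF\Entry with only the two
// fields this example needs.
final class EntryStub
{
    public function __construct(
        public readonly string $term,
        public readonly float $tfidf,
    ) {}
}

/**
 * Returns the $limit terms with the highest TF-IDF weight.
 *
 * @param iterable<EntryStub> $entries
 * @return list<string>
 */
function topTerms(iterable $entries, int $limit): array
{
    $list = \is_array($entries)
        ? \array_values($entries)
        : \iterator_to_array($entries, false);

    // Sort by weight, highest first.
    \usort($list, static fn (EntryStub $a, EntryStub $b): int => $b->tfidf <=> $a->tfidf);

    return \array_map(
        static fn (EntryStub $entry): string => $entry->term,
        \array_slice($list, 0, $limit),
    );
}

$top = topTerms([
    new EntryStub('php', 0.0),
    new EntryStub('работает', 0.0083260922589783),
    new EntryStub('запуска', 0.0041630461294892),
], 2);
```

With the library itself, the same function body should apply to the result of computeFor() once the type hints are changed to the real entry class, provided the assumption about its fields holds.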
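Custom Memory Driver
By default, all operations are computed in memory. This is quite fast, but the memory may run out. If you need to save memory, you can write your own driver.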
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Memory\FactoryInterface;
use Serafim\TFIDF\Memory\MemoryInterface;

$vectorizer = new Vectorizer(
    memory: new class implements FactoryInterface {
        // Method for creating a memory area for counters
        public function create(): MemoryInterface
        {
            return new class implements MemoryInterface, \IteratorAggregate {
                // Increment counter for the given term.
                public function inc(string $term): void { /* ... */ }

                // Return counter value for the given term or
                // 0 if the counter is not found.
                public function get(string $term): int { /* ... */ }

                // Should return TRUE if there is a counter for the
                // specified term.
                public function has(string $term): bool { /* ... */ }

                // Returns the number of registered counters.
                public function count(): int { /* ... */ }

                // Returns a list of terms and counter values in
                // format: [ WORD => 42 ]
                public function getIterator(): \Traversable { /* ... */ }

                // Destruction of the allocated memory area.
                public function __destruct() { /* ... */ }
            };
        }
    }
);
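As a starting point for a custom driver, the counter contract described in the comments above can be sketched with a plain array. This is an illustration only: the class name `ArrayCounters` is hypothetical, the class deliberately does not declare `implements MemoryInterface` so that it runs without the library, and an array-backed store does not save memory by itself. The array would be replaced with a file or database backend for that.

```php
<?php
declare(strict_types=1);

// Standalone sketch of the counter contract; in a real project the class
// would also implement Serafim\TFIDF\Memory\MemoryInterface.
final class ArrayCounters implements \IteratorAggregate
{
    /** @var array<string, int> */
    private array $counters = [];

    // Increment the counter for the given term.
    public function inc(string $term): void
    {
        $this->counters[$term] = ($this->counters[$term] ?? 0) + 1;
    }

    // Counter value for the term, or 0 if it was never incremented.
    public function get(string $term): int
    {
        return $this->counters[$term] ?? 0;
    }

    // TRUE if a counter exists for the term.
    public function has(string $term): bool
    {
        return isset($this->counters[$term]);
    }

    // Number of distinct terms that have a counter.
    public function count(): int
    {
        return \count($this->counters);
    }

    // Pairs in the format [ WORD => 42 ].
    public function getIterator(): \Traversable
    {
        return new \ArrayIterator($this->counters);
    }
}

$memory = new ArrayCounters();
$memory->inc('php');
$memory->inc('php');
$memory->inc('code');
```

After these calls, `get('php')` returns 2, `get('missing')` returns 0, and `count()` returns 2.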
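Custom Stop Words
If you need to exclude certain "stop words" so that they are ignored in the results, provide a custom implementation.
Note that by default the stop word list from the voku/stop-words package is used.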
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\StopWords\FactoryInterface;
use Serafim\TFIDF\StopWords\StopWordsInterface;

$vectorizer = new Vectorizer(
    stopWords: new class implements FactoryInterface {
        public function create(string $locale): StopWordsInterface
        {
            // You can use a different set of stop word drivers depending
            // on the locale ("$locale" argument) of the document.
            return new class implements StopWordsInterface {
                // TRUE should be returned if the word should be ignored.
                // For example prepositions.
                public function match(string $term): bool
                {
                    return \in_array($term, ['and', 'or', /* ... */], true);
                }
            };
        }
    }
);
Custom Locale
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Locale\IntlRepository;

$vectorizer = new Vectorizer(
    locales: new class extends IntlRepository {
        // Specifying the default locale
        public function getDefault(): string
        {
            return 'en_US';
        }
    }
);
Custom Tokenizer
If for some reason the way words are extracted from the text does not suit you, you can write your own tokenizer.
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Tokenizer\TokenizerInterface;
use Serafim\TFIDF\Document\StreamingDocumentInterface;
use Serafim\TFIDF\Document\TextDocumentInterface;

$vectorizer = new Vectorizer(
    tokenizer: new class implements TokenizerInterface {
        // Please note that there can be several types of document:
        //  - Text Document: One that contains text in string representation.
        //  - Streaming Document: One that can be read and may contain a
        //    large amount of data.
        public function tokenize(StreamingDocumentInterface|TextDocumentInterface $document): iterable
        {
            $content = $document instanceof StreamingDocumentInterface
                ? \stream_get_contents($document->getContentStream())
                : $document->getContent();

            // Please note that the document also contains the locale, based on
            // which the term (word) separation logic can change.
            //
            // i.e. `if ($document->getLocale() === 'ar') { ... }`
            //
            return \preg_split('/[\s,]+/isum', $content);
        }
    }
);
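The splitting logic inside tokenize() can be developed and tested as a plain function first. The following is a minimal sketch with a hypothetical helper, `tokenizeWords`, that splits on any run of characters that are neither letters nor digits and skips empty pieces. Lowercasing the input is a choice made for this sketch, not something the library is known to require.

```php
<?php
declare(strict_types=1);

/**
 * Splits text into lowercase words using Unicode letter/digit classes.
 *
 * @return list<string>
 */
function tokenizeWords(string $content): array
{
    $parts = \preg_split(
        '/[^\p{L}\p{N}]+/u',                 // separators: anything but letters and digits
        \mb_strtolower($content, 'UTF-8'),
        -1,
        \PREG_SPLIT_NO_EMPTY,                // drop empty strings at the edges
    );

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? [] : $parts;
}

$tokens = tokenizeWords('Hello, world! Привет, мир.');
// $tokens is ['hello', 'world', 'привет', 'мир']
```

A custom tokenizer could return the result of such a function from its tokenize() method, after reading the document content as shown in the example above.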