serafim / tf-idf

A library for computing TF-IDF (term frequency-inverse document frequency) for generic documents

0.1.0 2023-02-06 03:52 UTC

This package is auto-updated.

Last update: 2024-09-21 16:38:28 UTC



Requirements: PHP 8.1+. License: MIT.

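Introduction

TF-IDF is an information retrieval technique for ranking how important a word is to a document. It rests on the idea that the more often a word appears in a document, the more relevant it is to that document.

TF-IDF is the product of term frequency and inverse document frequency. The formula is: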

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

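Term Frequency

The ratio of the number of times a word occurs in a document to the total number of words in that document. It measures the importance of the word within a single document:

$$\mathrm{tf}(t, d) = \frac{n_t}{\sum_k n_k}$$

where $n_t$ is the number of occurrences of the word $t$ in the document $d$, and the denominator is the total number of words in that document.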

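Inverse Document Frequency

The inverse of the frequency with which a word occurs across the documents of a collection. The concept was introduced by Karen Spärck Jones. Taking IDF into account reduces the weight of commonly used words. Each unique word has exactly one IDF value within a given collection of documents:

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{\, d_i \in D \mid t \in d_i \,\}|}$$

where

  • $|D|$: the number of documents in the collection;
  • $|\{\, d_i \in D \mid t \in d_i \,\}|$: the number of documents in the collection that contain the word $t$ (that is, where $n_t > 0$).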

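The choice of logarithm base in the formula does not matter: changing the base scales the weight of every word by the same constant factor, which leaves the ratios between weights unchanged.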
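The constant factor comes from the change-of-base identity. As a short worked step, for any two bases $a$ and $b$:

$$\log_a x = \frac{\log_b x}{\log_b a}$$

Switching from the natural logarithm to base 2, for example, divides every IDF value by $\ln 2 \approx 0.6931$, so the ordering of terms by weight stays the same.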

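Thus, the TF-IDF measure is the product of two factors:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

TF-IDF gives a high weight to words that occur frequently in a particular document and rarely in the other documents.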
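To make the definitions concrete, here is a minimal standalone sketch that computes TF-IDF for a tiny, already tokenized corpus. It does not use this library; the function name `tfidf()` and the sample documents are made up for illustration, and the natural logarithm is an arbitrary choice of base.

```php
<?php

declare(strict_types=1);

/**
 * Computes TF-IDF for every term of every document in a small corpus.
 * Hypothetical helper for illustration only, not part of serafim/tf-idf.
 *
 * @param array<string, list<string>> $corpus document name => list of terms
 * @return array<string, array<string, float>> document name => term => weight
 */
function tfidf(array $corpus): array
{
    // Document frequency: in how many documents each term occurs.
    $df = [];
    foreach ($corpus as $terms) {
        foreach (\array_unique($terms) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $total = \count($corpus);
    $result = [];

    foreach ($corpus as $name => $terms) {
        $result[$name] = [];
        $length = \count($terms);

        foreach (\array_count_values($terms) as $term => $occurrences) {
            $tf = $occurrences / $length;          // share of the document
            $idf = \log($total / $df[$term]);      // natural logarithm
            $result[$name][$term] = $tf * $idf;
        }
    }

    return $result;
}

$weights = tfidf([
    'a' => ['php', 'is', 'fast', 'php'],
    'b' => ['php', 'is', 'fun'],
]);

// "php" occurs in both documents, so its IDF (and weight) is 0.
// "fast" occurs only in document "a": (1 / 4) * ln(2 / 1) ≈ 0.1733.
\var_dump($weights['a']['php'], $weights['a']['fast']);
```

The word that appears in every document ends up with zero weight, while the word unique to one document gets the highest weight there.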

Installation

TF-IDF is available as a Composer package and can be installed by running the following command in the root of your project:

$ composer require serafim/tf-idf

Quick Start

Getting Information About Words

$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file-1.txt');
$vectorizer->addFile(__DIR__ . '/path/to/file-2.txt');

foreach ($vectorizer->compute() as $document => $entries) {
    var_dump($document);

    foreach ($entries as $entry) {
        var_dump($entry);
    }
}

Example Result

Serafim\TFIDF\Document\FileDocument {
    locale: "ru_RU"
    pathname: "/home/example/how-it-works.md"
}

Serafim\TFIDF\Entry {
    term: "работает"
    occurrences: 4
    df: 1
    tf: 0.012012012012012
    idf: 0.69314718055995
    tfidf: 0.0083260922589783
}

Serafim\TFIDF\Entry {
    term: "php"
    occurrences: 26
    df: 2
    tf: 0.078078078078078
    idf: 0.0
    tfidf: 0.0
}

Serafim\TFIDF\Entry {
    term: "запуска"
    occurrences: 2
    df: 1
    tf: 0.006006006006006
    idf: 0.69314718055995
    tfidf: 0.0041630461294892
}

// ...etc...
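The sample values can be checked by hand. They are consistent with a natural logarithm and a corpus of two documents; the document length of 333 terms is inferred from the ratios and is not stated anywhere in the output. For the first entry (4 occurrences, found in 1 document):

$$\mathrm{tf} = \frac{4}{333} \approx 0.012012, \qquad \mathrm{idf} = \ln\frac{2}{1} \approx 0.693147, \qquad \mathrm{tfidf} \approx 0.012012 \times 0.693147 \approx 0.008326$$

The term "php" occurs in both documents, so its weight is zero regardless of how often it appears:

$$\mathrm{idf} = \ln\frac{2}{2} = 0, \qquad \mathrm{tfidf} = \frac{26}{333} \times 0 = 0$$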

Adding Documents

Computing IDF (inverse document frequency) requires a corpus of several documents. You can add documents in several ways:

$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file.txt');
$vectorizer->addFile(new \SplFileInfo(__DIR__ . '/path/to/file.txt'));
$vectorizer->addText('example text');
$vectorizer->addStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));

// OR

$vectorizer->add(new class implements \Serafim\TFIDF\Document\TextDocumentInterface {
    public function getLocale(): string { /* ... */ }
    public function getContent(): string { /* ... */ }
});

Creating Documents

$vectorizer = new \Serafim\TFIDF\Vectorizer();

$file = $vectorizer->createFile(__DIR__ . '/path/to/file.txt');
$text = $vectorizer->createText('example text');
$stream = $vectorizer->createStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));

Computing

To compute TF-IDF across the loaded documents, use the compute(): iterable method:

foreach ($vectorizer->compute() as $document => $result) { 
    // $document = object(Serafim\TFIDF\Document\DocumentInterface)
    // $result   = list<object(Serafim\TFIDF\Entry)>
}

To compute TF-IDF between the loaded documents and a passed document, use the computeFor(StreamingDocumentInterface|TextDocumentInterface): iterable method:

$text = $vectorizer->createText('example text');

$result = $vectorizer->computeFor($text);

// $result = list<object(Serafim\TFIDF\Entry)>
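A common next step is to rank the entries and keep only the highest-weighted terms. The sketch below is one possible way to do that. It assumes each entry exposes public `term` and `tfidf` properties, as the dump in the example result suggests but does not confirm; `topTerms()` is a hypothetical helper, and plain stand-in objects are used so the snippet runs without the library.

```php
<?php

declare(strict_types=1);

/**
 * Returns the $limit entries with the highest TF-IDF weight.
 * Hypothetical helper; assumes public "term" and "tfidf" properties.
 *
 * @param iterable<object> $entries
 * @return list<object>
 */
function topTerms(iterable $entries, int $limit = 10): array
{
    $list = \is_array($entries)
        ? \array_values($entries)
        : \iterator_to_array($entries, false);

    // Sort by weight, highest first.
    \usort($list, static fn (object $a, object $b): int => $b->tfidf <=> $a->tfidf);

    return \array_slice($list, 0, $limit);
}

// Stand-in objects with values taken from the example result.
$entries = [
    (object) ['term' => 'php', 'tfidf' => 0.0],
    (object) ['term' => 'запуска', 'tfidf' => 0.0041630461294892],
    (object) ['term' => 'работает', 'tfidf' => 0.0083260922589783],
];

$top = topTerms($entries, 2);

foreach ($top as $entry) {
    echo $entry->term, ': ', $entry->tfidf, \PHP_EOL;
}
```

If the assumption about the properties holds, the result of computeFor() could be passed to the helper in place of the stand-in array.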

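Custom Memory Driver

By default, all operations are computed in memory. This is fairly fast, but memory may be exhausted. If you need to save memory, you can write your own driver.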

use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Memory\FactoryInterface;
use Serafim\TFIDF\Memory\MemoryInterface;

$vectorizer = new Vectorizer(
    memory: new class implements FactoryInterface {
        // Method for creating a memory area for counters
        public function create(): MemoryInterface
        {
            return new class implements MemoryInterface, \IteratorAggregate {
                // Increment counter for the given term.
                public function inc(string $term): void { /* ... */ }

                // Return counter value for the given term or
                // 0 if the counter is not found.
                public function get(string $term): int { /* ... */ }

                // Should return TRUE if there is a counter for the
                // specified term.
                public function has(string $term): bool { /* ... */ }

                // Returns the number of registered counters.
                public function count(): int { /* ... */ }

                // Returns a list of terms and counter values in
                // format: [ WORD => 42 ]
                public function getIterator(): \Traversable { /* ... */ }

                // Destruction of the allocated memory area.
                public function __destruct() { /* ... */ }
            };
        }
    }
);
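As a reference point for the method contract listed above, here is a minimal array-backed sketch. The class name `ArrayMemory` is hypothetical. In a real project the class would also declare that it implements Serafim\TFIDF\Memory\MemoryInterface; that is left out here so the sketch runs without the library installed. An array does not save any memory by itself; a memory-saving driver would keep the same methods and swap the array for external storage.

```php
<?php

declare(strict_types=1);

/**
 * Array-backed counter storage mirroring the methods listed above.
 * Hypothetical sketch: "implements MemoryInterface" is omitted on purpose.
 */
final class ArrayMemory implements \Countable, \IteratorAggregate
{
    /** @var array<string, int> */
    private array $counters = [];

    // Increment the counter for the given term.
    public function inc(string $term): void
    {
        $this->counters[$term] = ($this->counters[$term] ?? 0) + 1;
    }

    // Return the counter value, or 0 if the term is unknown.
    public function get(string $term): int
    {
        return $this->counters[$term] ?? 0;
    }

    public function has(string $term): bool
    {
        return isset($this->counters[$term]);
    }

    // Number of registered counters (distinct terms).
    public function count(): int
    {
        return \count($this->counters);
    }

    // Yields pairs in the format [ WORD => 42 ].
    public function getIterator(): \Traversable
    {
        return new \ArrayIterator($this->counters);
    }
}

$memory = new ArrayMemory();
$memory->inc('php');
$memory->inc('php');
$memory->inc('tf');
```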

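Custom Stop Words

If you need to exclude certain "stop words" so that they are not taken into account in the result, specify a custom implementation.

Note that by default the stop-word list from the voku/stop-words package is used.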

use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\StopWords\FactoryInterface;
use Serafim\TFIDF\StopWords\StopWordsInterface;

$vectorizer = new Vectorizer(
    stopWords: new class implements FactoryInterface {
        public function create(string $locale): StopWordsInterface
        {
            // You can use a different set of stop word drivers depending
            // on the locale ("$locale" argument) of the document.
            return new class implements StopWordsInterface {
                // TRUE should be returned if the word should be ignored.
                // For example prepositions.
                public function match(string $term): bool
                {
                    return \in_array($term, ['and', 'or', /* ... */], true);
                }
            };
        }
    }
);

Custom Locale

use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Locale\IntlRepository;

$vectorizer = new Vectorizer(
    locales: new class extends IntlRepository {
        // Specifying the default locale
        public function getDefault(): string
        {
            return 'en_US';
        }
    }
);

Custom Tokenizer

If for some reason the way words are extracted from the text does not suit you, you can write your own tokenizer.

use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Tokenizer\TokenizerInterface;
use Serafim\TFIDF\Document\StreamingDocumentInterface;
use Serafim\TFIDF\Document\TextDocumentInterface;

$vectorizer = new Vectorizer(
    tokenizer: new class implements TokenizerInterface {
        // Please note that there can be several types of document:
        //  - Text Document: One that contains text in string representation.
        //  - Streaming Document: One that can be read and may contain a
        //    large amount of data.
        public function tokenize(StreamingDocumentInterface|TextDocumentInterface $document): iterable 
        {
            $content = $document instanceof StreamingDocumentInterface
                ? \stream_get_contents($document->getContentStream())
                : $document->getContent();

            // Please note that the document also contains the locale, based on
            // which the term (word) separation logic can change.
            //
            // i.e. `if ($document->getLocale() === 'ar') { ... }`
            //

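            // Split on whitespace and commas, dropping empty tokens.
            // preg_split() returns false on failure, hence the fallback.
            return \preg_split('/[\s,]+/u', $content, -1, \PREG_SPLIT_NO_EMPTY) ?: [];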
        }
    }
);
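Splitting on whitespace and commas leaves other punctuation attached to words. One alternative, sketched below as a standalone function, is to extract runs of Unicode letters and digits instead. The function name `tokenizeWords()` is hypothetical and is not part of the library; its body could be used inside a tokenize() implementation like the one above.

```php
<?php

declare(strict_types=1);

/**
 * Extracts runs of Unicode letters and digits from the text.
 * Hypothetical helper for illustration only.
 *
 * @return list<string>
 */
function tokenizeWords(string $content): array
{
    // The "u" flag makes \p{L} and \p{N} work on UTF-8 input.
    // preg_match_all() returns false on failure, for example on invalid UTF-8.
    if (\preg_match_all('/[\p{L}\p{N}]+/u', $content, $matches) === false) {
        return [];
    }

    return $matches[0];
}

$tokens = tokenizeWords('TF-IDF работает, really!');
// $tokens = ['TF', 'IDF', 'работает', 'really']
```

Note that this splits hyphenated words such as "TF-IDF" into two terms; whether that is desirable depends on the corpus.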