permafrost-dev / text-classifier
Assigns input text to a category based on a trained model
1.2.0
2021-05-27 14:23 UTC
Requires
- php: ^7.3|^8.0
- ext-json: *
- skyeng/php-lemmatizer: ^1.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^2.16
- phpunit/phpunit: ^9.0
This package is auto-updated.
Last update: 2024-08-27 21:43:36 UTC
README
Performs basic text classification using algorithms such as Naive Bayes.
Installation
You can install text-classifier with Composer:
composer require permafrost-dev/text-classifier
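Note: higher-quality and more complete training data makes classification more accurate.
Example - Email Address Classification
A common use case for text classification is deciding whether an email message is spam. That is beyond the scope of this example, but we can try to judge whether an email address is spam based on the characteristics of the address itself.
Note: all email addresses used in the training data and examples were randomly generated. If you happen to find your email address in the sample data, please contact us at packages@permafrost.dev and we will remove it immediately.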
<?php

use Permafrost\TextClassifier\TextClassifier;
use Permafrost\TextClassifier\Classifiers\NaiveBayes;
use Permafrost\TextClassifier\Pipelines\TextProcessingPipeline;
use Permafrost\TextClassifier\Tokenizers\EmailAddressTokenizer;
use Permafrost\TextClassifier\Processors\EmailAddressNormalizer;

$processors = new TextProcessingPipeline([
    new EmailAddressNormalizer(),
]);

$tc = new TextClassifier($processors, [new EmailAddressTokenizer()], new NaiveBayes());

$tc = $tc->trainFromFile(__DIR__ . '/email-train.txt');

$emails = [
    'blah44657457@whatever.rut',
    'john@gmail.com',
];

foreach ($emails as $email) {
    echo "classification for '$email': " . $tc->classify($email) . PHP_EOL;
}
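Output
classification for 'blah44657457@whatever.rut': spam
classification for 'john@gmail.com': valid
The same approach can easily be applied to other kinds of spam checking, such as classifying user-supplied domain names.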
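To make the idea behind the classifier more concrete, here is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing in plain PHP. It is an illustration only: the class name `TinyNaiveBayes`, its methods, and the tiny training set are hypothetical, and nothing here is taken from the package's own `NaiveBayes` implementation, which may work differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical, minimal multinomial Naive Bayes with add-one smoothing.
 * Not the package's NaiveBayes class.
 */
final class TinyNaiveBayes
{
    /** @var array<string, int> number of training samples per label */
    private $docCounts = [];

    /** @var array<string, array<string, int>> token counts per label */
    private $tokenCounts = [];

    /** @var array<string, bool> every token seen during training */
    private $vocabulary = [];

    /** @param string[] $tokens */
    public function train(string $label, array $tokens): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;

        foreach ($tokens as $token) {
            $this->tokenCounts[$label][$token] = ($this->tokenCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    /**
     * Return the label with the highest log-probability score.
     * Assumes train() has been called with at least one token.
     *
     * @param string[] $tokens
     */
    public function classify(array $tokens): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $bestLabel = '';
        $bestScore = -INF;

        foreach ($this->docCounts as $label => $docCount) {
            $counts = $this->tokenCounts[$label] ?? [];
            $labelTotal = array_sum($counts);

            // Log prior plus the sum of smoothed log likelihoods.
            $score = log($docCount / $totalDocs);
            foreach ($tokens as $token) {
                $score += log((($counts[$token] ?? 0) + 1) / ($labelTotal + $vocabSize));
            }

            if ($score > $bestScore) {
                $bestScore = $score;
                $bestLabel = (string) $label;
            }
        }

        return $bestLabel;
    }
}

// Made-up training data, already split into tokens.
$nb = new TinyNaiveBayes();
$nb->train('spam', ['free', 'money', 'now']);
$nb->train('spam', ['win', 'free', 'prize']);
$nb->train('valid', ['meeting', 'at', 'noon']);
$nb->train('valid', ['lunch', 'at', 'noon']);

echo $nb->classify(['free', 'prize']) . PHP_EOL;   // prints: spam
echo $nb->classify(['meeting', 'noon']) . PHP_EOL; // prints: valid
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the add-one smoothing keeps a single unseen token from forcing a label's score to negative infinity.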
Example - Sentiment Analysis
See examples/sentiment.php for a working demo.
<?php

use Skyeng\Lemmatizer;
use Permafrost\TextClassifier\TextClassifier;
use Permafrost\TextClassifier\Classifiers\NaiveBayes;
use Permafrost\TextClassifier\Processors\TextLemmatizer;
use Permafrost\TextClassifier\Tokenizers\BasicTokenizer;
use Permafrost\TextClassifier\Tokenizers\NGramTokenizer;
use Permafrost\TextClassifier\Processors\StopwordRemover;
use Permafrost\TextClassifier\Processors\BasicTextNormalizer;
use Permafrost\TextClassifier\Pipelines\TextProcessingPipeline;

// Use different processors for training and classifying. Since we're using keyword tokens,
// add all lemmas for each token during training to increase the size of the training data.
$trainingProcessors = [new TextLemmatizer(new Lemmatizer()), new BasicTextNormalizer()];

// When classifying, let's remove stopwords in addition to basic text normalization, because
// we'll be processing phrases.
$classifyProcessors = [new StopwordRemover(), new BasicTextNormalizer()];

// Let's use a basic tokenizer (word-based tokens), and an NGram tokenizer, which creates
// trigrams (N=3). This should give us a good mix of keywords and partial keywords to look
// for when classifying text.
$tokenizers = [new BasicTokenizer(), new NGramTokenizer(3)];

$textClassifier = new TextClassifier(
    new TextProcessingPipeline($trainingProcessors, $classifyProcessors),
    $tokenizers,
    new NaiveBayes() // use Naive-Bayes as the classifier
);

$textClassifier->trainFromFile(__DIR__ . '/sentiment-train.txt');

$phrases = [
    'this is fantastic',
    'this is terrible',
];

foreach ($phrases as $phrase) {
    echo $phrase . ' - ' . $textClassifier->classify($phrase) . PHP_EOL;
}
Output
this is fantastic - positive
this is terrible - negative
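The comments in the example describe mixing word tokens with trigrams to get both keywords and partial keywords. The sketch below shows one way such a token set could be built in plain PHP. It assumes character-level trigrams computed per word; the helper names `wordTokens` and `charNGrams` are hypothetical, and the package's `BasicTokenizer` and `NGramTokenizer` may split text differently.

```php
<?php

declare(strict_types=1);

/**
 * Split text into lowercase word tokens (hypothetical helper).
 *
 * @return string[]
 */
function wordTokens(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return $words === false ? [] : $words;
}

/**
 * Build overlapping character n-grams for each word (hypothetical helper).
 * Words shorter than $n produce no n-grams.
 *
 * @return string[]
 */
function charNGrams(string $text, int $n = 3): array
{
    $grams = [];

    foreach (wordTokens($text) as $word) {
        for ($i = 0; $i + $n <= strlen($word); $i++) {
            $grams[] = substr($word, $i, $n);
        }
    }

    return $grams;
}

$phrase = 'this is fantastic';

// Combine whole words with trigrams, mirroring the two-tokenizer setup above.
$tokens = array_merge(wordTokens($phrase), charNGrams($phrase, 3));

echo implode(' ', $tokens) . PHP_EOL;
// prints: this is fantastic thi his fan ant nta tas ast sti tic
```

Partial tokens such as "fan" or "tas" let a classifier match related word forms it has not seen as whole words, at the cost of a larger and noisier vocabulary.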
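With more robust preprocessing and tokenization, these methods can be applied to other data, for example to decide whether an email message is likely to be spam, or whether a given article is likely to interest a user based on basic preferences.
This approach only goes so far, however: when highly accurate results are required, a more advanced machine learning approach is recommended.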
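As a rough illustration of the preprocessing mentioned above, the following sketch normalizes text and strips stopwords in plain PHP. The function names and the deliberately tiny stopword list are hypothetical; the package's `BasicTextNormalizer` and `StopwordRemover` are not shown here and may behave differently (the sentiment example, for instance, lists the stopword remover before the normalizer).

```php
<?php

declare(strict_types=1);

/**
 * Lowercase the text, replace anything that is not an ASCII letter, digit,
 * or space with a space, then collapse repeated whitespace.
 */
function normalizeText(string $text): string
{
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9 ]+/', ' ', $text) ?? '';

    return trim(preg_replace('/\s+/', ' ', $text) ?? '');
}

/**
 * Remove every word that appears in the given stopword list.
 *
 * @param string[] $stopwords
 */
function removeStopwords(string $text, array $stopwords): string
{
    $kept = array_filter(explode(' ', $text), function (string $word) use ($stopwords): bool {
        return $word !== '' && !in_array($word, $stopwords, true);
    });

    return implode(' ', $kept);
}

// Hypothetical stopword list, kept tiny for the example.
$stopwords = ['this', 'is', 'a', 'the'];

echo removeStopwords(normalizeText('This is TERRIBLE!!'), $stopwords) . PHP_EOL;
// prints: terrible
```

Normalizing before removing stopwords means the stopword list only needs lowercase entries; a real list would be much longer and language-specific.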