README

A PHP library that detects the language of any free text.

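It follows the approach described in the paper: the given text is tokenized into N-Grams (whitespace is cleaned up before this step). The tokens are then sorted and compared against the language model.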
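The tokenizing step can be sketched in plain PHP as follows. This is a minimal illustration, not the library's code: `extractNgrams()` and `rankNgrams()` are hypothetical helpers, the N-gram lengths (2 to 4) and the limit of 300 are assumed values, and the ranking here is a simple frequency count rather than the library's default PageRank sorting. It assumes the `mbstring` extension is available.

```php
<?php
// Hypothetical helper (not part of the library): split a text into
// character N-grams of length $min..$max.
function extractNgrams(string $text, int $min = 2, int $max = 4): array
{
    // Lowercase, then collapse every run of whitespace into a single "_"
    // and pad both ends, so word boundaries become part of the N-grams.
    $lower  = trim(mb_strtolower($text, 'UTF-8'));
    $clean  = '_' . preg_replace('/\s+/u', '_', $lower) . '_';
    $length = mb_strlen($clean, 'UTF-8');

    $ngrams = [];
    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= $length; $i++) {
            $ngrams[] = mb_substr($clean, $i, $n, 'UTF-8');
        }
    }
    return $ngrams;
}

// Hypothetical helper: rank N-grams by frequency, most frequent first,
// and keep only the top $limit entries.
function rankNgrams(array $ngrams, int $limit = 300): array
{
    $counts = array_count_values($ngrams);
    arsort($counts); // sort by count, descending, preserving keys
    // Numeric N-grams such as "12" become integer keys, so cast them back.
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}
```

Calling `rankNgrams(extractNgrams($text))` yields a ranked profile of the text, which is the kind of list that gets compared against each language's profile in the model.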

This is a fork of crodas/languagedetector, since the original package appears to be abandoned.

How it works

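The first thing we need is a language model file, which is used at classification time to compare texts against. The model must be built before anything else, and it can be generated with a script like this one: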

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// it could use a little bit of memory, but it's fine
// because this process runs once.
ini_set('memory_limit', '1G');

// we load the configuration (which will be serialized
// later into our language model file)
$config = new LanguageDetector\Config;

$c = new LanguageDetector\Learn($config);
foreach (glob(__DIR__ . '/samples/*') as $file) { 
    // feed with examples ('language', 'text');
    $c->addSample(basename($file), file_get_contents($file));
}

// some callback so we know where the process is 
$c->addStepCallback(function($lang, $status) {
    echo "Learning {$lang}: $status\n";
});

// save it in a data file.
// we currently support the `php` serialization format, but it's trivial
// to add other formats: just extend `\LanguageDetector\Format\AbstractFormat`.
// You can check an example at https://github.com/crodas/LanguageDetector/blob/master/lib/LanguageDetector/Format/PHP.php
$c->save(LanguageDetector\Format\AbstractFormat::initFormatByPath('language.php'));

Once we have our language model file (language.php in this example), we are ready to classify texts by language.

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// we load the language model, it would create
// the $config object for us.
$detect = LanguageDetector\Detect::initByPath('language.php');

$lang = $detect->detect("Agricultura (-ae, f.), sensu latissimo, 
est summa omnium artium et scientiarum et technologiarum quae de 
terris colendis et animalibus creandis curant, ut poma, frumenta, 
charas, carnes, textilia, et aliae res e terra bene producantur. 
Specialius, agronomia est ars et scientia quae terris colendis student, 
agricultio autem animalibus creandis.");

var_dump($lang);

And that's it.

Algorithm

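The project is designed to work with modules, which means you can provide your own algorithms for sorting and comparing N-Grams. By default the library implements PageRank as the sorting algorithm and out-of-place (described in the paper) as the comparison algorithm.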
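To make the comparison step concrete, here is a minimal sketch of an out-of-place distance in plain PHP. It is an illustration under stated assumptions, not the library's implementation: `outOfPlaceDistance()` and `closestLanguage()` are hypothetical helpers, profiles are assumed to be plain lists of N-grams ordered from most to least relevant, and an N-gram missing from the language profile is assumed to cost the full length of that profile.

```php
<?php
// Hypothetical helper: sum, over every N-gram of the text profile, how far
// its rank is from the rank of the same N-gram in the language profile.
function outOfPlaceDistance(array $textProfile, array $languageProfile): int
{
    $ranks   = array_flip(array_values($languageProfile)); // N-gram => rank
    $penalty = count($languageProfile); // assumed cost of an unknown N-gram

    $distance = 0;
    foreach (array_values($textProfile) as $rank => $ngram) {
        $distance += isset($ranks[$ngram])
            ? abs($rank - $ranks[$ngram])
            : $penalty;
    }
    return $distance;
}

// Hypothetical helper: pick the language whose profile is closest.
// $models maps a language name to its ranked N-gram list.
function closestLanguage(array $textProfile, array $models): ?string
{
    $best         = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($models as $language => $profile) {
        $distance = outOfPlaceDistance($textProfile, $profile);
        if ($distance < $bestDistance) {
            $best         = $language;
            $bestDistance = $distance;
        }
    }
    return $best; // null when $models is empty
}
```

The lower the distance, the more similar the ordering of the two profiles, so the language with the smallest total is reported as the match.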

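To supply your own algorithms, you must change $config at the learning stage so that it loads your own classes (which, by the way, must implement certain interfaces).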

Training files for language detection

See the example/samples directory. For more advanced training data, visit the Leipzig Corpora download page.

Languages with non-Latin characters

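Remember: if you are targeting a language based on non-Latin characters, set the mb property of Config, and do so before you create the language model. Use UTF-8 encoded texts.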
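Why multibyte handling matters can be shown with plain PHP string functions. This sketch does not use the library at all; it only assumes the `mbstring` extension and a UTF-8 encoded source file, and it illustrates how byte-based functions would cut N-grams in the wrong places.

```php
<?php
// Illustration only (plain PHP, not the library's code).
$word = 'язык'; // Russian for "language": 4 characters, 8 bytes in UTF-8

$byteLength = strlen($word);             // 8: counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // 4: counts characters

// A byte-based "bigram" covers only a single Cyrillic letter...
$byteBigram = substr($word, 0, 2);             // 'я'
// ...while a multibyte-aware bigram covers two letters, as intended.
$charBigram = mb_substr($word, 0, 2, 'UTF-8'); // 'яз'

echo "$byteLength bytes, $charLength characters\n"; // prints "8 bytes, 4 characters"
echo "$byteBigram vs $charBigram\n";                // prints "я vs яз"
```

With an odd byte offset or length, `substr()` would even split a letter in half and produce invalid UTF-8, which is why byte-based N-grams are unreliable for such languages.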

andywer / language-detector
