leg / simhash-php
2.0
2015-10-05 13:41 UTC
Requires
- php: >=5.3
- cocur/slugify: 1.*
This package is auto-updated.
Last update: 2022-02-01 12:20:53 UTC
README
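This is the second version of SimHashPHP. If you are using version 1 and do not want to update your code, see the 1.0-security branch (https://github.com/tgalopin/SimHashPhp/tree/1.0-security). The 1.0 branch will be maintained until v3 is released, but only v2 will receive new features.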
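What is SimHashPHP?
SimHashPHP is a PHP library that ports the SimHash algorithm to PHP. The algorithm, created by Moses Charikar, provides an efficient way to compute a similarity index between two texts. Google uses it internally to detect duplicate content.
For more information, see "SimHash, or how to quickly compare two datasets".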
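To make the idea concrete, here is a minimal sketch of a SimHash-style fingerprint in plain PHP. It is not the library's implementation: the word tokenizer, the use of `md5` as the per-token hash, the frequency weighting, and all `sketch_*` function names are assumptions chosen for illustration only.

```php
<?php
// Illustrative SimHash sketch. NOT the library's code: tokenization,
// md5 as the token hash, and every sketch_* name are assumptions.

// Split text into lowercase word tokens and count occurrences.
function sketch_tokens($text)
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Hash a token and return the first $size bits as a string of '0'/'1'.
// md5 yields 128 bits, so $size must be at most 128.
function sketch_token_bits($token, $size)
{
    $bits = '';
    foreach (str_split(md5((string) $token)) as $hexDigit) {
        $bits .= str_pad(base_convert($hexDigit, 16, 2), 4, '0', STR_PAD_LEFT);
    }
    return substr($bits, 0, $size);
}

// Build the fingerprint: each token votes +count or -count on every bit
// position, and the sign of the total decides the final bit.
function sketch_simhash($text, $size = 64)
{
    $weights = array_fill(0, $size, 0);
    foreach (sketch_tokens($text) as $token => $count) {
        $bits = sketch_token_bits($token, $size);
        for ($i = 0; $i < $size; $i++) {
            $weights[$i] += ($bits[$i] === '1') ? $count : -$count;
        }
    }
    $fingerprint = '';
    foreach ($weights as $weight) {
        $fingerprint .= ($weight > 0) ? '1' : '0';
    }
    return $fingerprint;
}

// Count the bit positions where two equal-length fingerprints differ.
function sketch_hamming($a, $b)
{
    $distance = 0;
    $length = strlen($a);
    for ($i = 0; $i < $length; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

$a = sketch_simhash('the quick brown fox jumps over the lazy dog');
$b = sketch_simhash('the quick brown fox jumped over the lazy dog');
echo $a, "\n", $b, "\n";
echo sketch_hamming($a, $b), "\n";
```

Texts that share most of their tokens tend to produce fingerprints that differ in only a few bits, so a small Hamming distance suggests near-duplicate content.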
How to use it?
Install it with Composer:
composer require tga/simhash-php
Once installed, include vendor/autoload.php to load the library.
The concept of SimHash is described in this article. Here is an example:
<?php

require 'vendor/autoload.php';

$text1 = <<<EOT
George Headley (1909–1983) was a West Indian cricketer who played 22 Test matches, mostly before the Second World War.
Considered one of the best batsmen to play for West Indies and one of the greatest cricketers of all time, he also represented Jamaica and played professionally in England.
Headley was born in Panama but raised in Jamaica where he quickly established a cricketing reputation as a batsman.
West Indies had a weak cricket team through most of Headley's career; as their one world-class player, he carried a heavy responsibility, and they depended on his batting.
He batted at number three, scoring 2,190 runs in Tests at an average of 60.83, and 9,921 runs in all first-class matches at an average of 69.86.
He was chosen as one of the Wisden Cricketers of the Year in 1934.
EOT;

$text2 = <<<EOT
George Headley was a West Indian cricketer who played 22 Test matches, mostly before the Second World War.
Considered one of the best batsmen to play for West Indies and one of the greatest cricketers of all time, he also represented Jamaica and played professionally in England.
Headley was born in Panama but raised in Jamaica where he quickly established a cricketing reputation as a batsman.
West Indies had a weak cricket team through most of Headley's career; as their one world-class player, he carried a heavy responsibility, and they depended on his batting.
He batted at number three, scoring 2,190 runs in tests at an average of 60.83, and 9,921 runs in all first-class matches at an average of 69.86.
He was chosen as one of the Wisden Cricketers of the Year.
EOT;

$simhash = new \Tga\SimHash\SimHash();
$extractor = new \Tga\SimHash\Extractor\SimpleTextExtractor();
$comparator = new \Tga\SimHash\Comparator\GaussianComparator(3);

// Extract the features of each text and hash them into 64-bit fingerprints
$fp1 = $simhash->hash($extractor->extract($text1), \Tga\SimHash\SimHash::SIMHASH_64);
$fp2 = $simhash->hash($extractor->extract($text2), \Tga\SimHash\SimHash::SIMHASH_64);

var_dump($fp1->getBinary());
var_dump($fp2->getBinary());

// Index between 0 and 1 : 0.80073740291681
var_dump($comparator->compare($fp1, $fp2));
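The index 0.80073740291681 shown above is numerically consistent with a Gaussian function of the Hamming distance between the two fingerprints, exp(-d² / (2σ²)), with σ = 3 (the constructor argument) and d = 2 differing bits. This reading is inferred from the number alone, not stated in this README, so treat the sketch below as an assumption about how such an index can be computed. The helper name `sketch_gaussian_index` is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper, NOT the library's API: maps a Hamming distance to a
// similarity index in (0, 1] using a Gaussian curve of width $deviation.
function sketch_gaussian_index($hammingDistance, $deviation)
{
    return exp(-pow($hammingDistance, 2) / (2 * pow($deviation, 2)));
}

printf("%.6f\n", sketch_gaussian_index(2, 3)); // prints 0.800737
printf("%.6f\n", sketch_gaussian_index(0, 3)); // prints 1.000000
```

Under this assumption, a larger deviation makes the comparison more tolerant: the same number of differing bits yields an index closer to 1.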
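License
This library is released under the MIT license (see LICENSE.md).
About
SimHashPHP is mainly developed by Titouan Galopin.
Reporting an issue or a feature request
Issues and feature requests are tracked in the GitHub issue tracker.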