tga/simhash-php

Implementation of the SimHash similarity algorithm for PHP 5.3

2.0 2015-10-05 13:41 UTC

This package is auto-updated.

Last update: 2024-09-15 08:57:35 UTC


README

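This is the second version of SimHashPHP. If you are using version 1 and do not want to update your code, please refer to the 1.0-security branch (https://github.com/tgalopin/SimHashPhp/tree/1.0-security). The 1.0 branch will be maintained until v3 is released, but only v2 will receive new features.

What is SimHashPHP?

SimHashPHP is a PHP library that ports the SimHash algorithm to PHP. The algorithm, created by Moses Charikar, provides an efficient way to compute a similarity index between two texts. It is used internally by Google to detect duplicate content.

For more information, see "SimHash, or a way to quickly compare two datasets".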
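To give a rough idea of what the algorithm does, here is a minimal sketch in plain PHP. It is not the code used by this library: the function names `tokenBits` and `simhashSketch`, the use of `md5` as the token hash, and the word-frequency weighting are all assumptions made for illustration. Each token votes for or against every bit position, and the sign of the total decides the corresponding bit of the fingerprint.

```php
<?php

/**
 * Hypothetical helper: hash a token to a string of $bits '0'/'1' characters.
 * md5 is used only because it is built in; it limits $bits to at most 128.
 */
function tokenBits($token, $bits = 64)
{
    $binary = '';
    // md5() returns 32 hex characters; expand each one to 4 bits.
    foreach (str_split(md5($token)) as $hexChar) {
        $binary .= str_pad(base_convert($hexChar, 16, 2), 4, '0', STR_PAD_LEFT);
    }

    return substr($binary, 0, $bits);
}

/**
 * Sketch of SimHash: every token adds +weight or -weight to each bit position,
 * and a positive total yields a '1' bit in the fingerprint.
 */
function simhashSketch($text, $bits = 64)
{
    // Naive tokenization (assumption): lowercase words, weighted by frequency.
    $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $weights = array_count_values($tokens);

    $votes = array_fill(0, $bits, 0);
    foreach ($weights as $token => $weight) {
        // Numeric tokens become integer array keys, so cast back to string.
        $hash = tokenBits((string) $token, $bits);
        for ($i = 0; $i < $bits; $i++) {
            $votes[$i] += ($hash[$i] === '1') ? $weight : -$weight;
        }
    }

    $fingerprint = '';
    foreach ($votes as $vote) {
        $fingerprint .= ($vote > 0) ? '1' : '0';
    }

    return $fingerprint;
}
```

Because small edits to a text change only a few votes, two similar texts tend to produce fingerprints that differ in only a few bit positions, which is what makes the comparison cheap.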

How to use it?

Install with Composer

composer require tga/simhash-php

Once installed, include vendor/autoload.php to load the library.

The SimHash concept is described in this article. Here is an example:

<?php

require 'vendor/autoload.php';

$text1 = <<<EOT
George Headley (1909–1983) was a West Indian cricketer who played 22 Test matches, mostly before the Second World War.
Considered one of the best batsmen to play for West Indies and one of the greatest cricketers of all time, he also
represented Jamaica and played professionally in England. Headley was born in Panama but raised in Jamaica where he
quickly established a cricketing reputation as a batsman. West Indies had a weak cricket team through most of Headley's
career; as their one world-class player, he carried a heavy responsibility, and they depended on his batting. He batted
at number three, scoring 2,190 runs in Tests at an average of 60.83, and 9,921 runs in all first-class matches at an
average of 69.86. He was chosen as one of the Wisden Cricketers of the Year in 1934.
EOT;

$text2 = <<<EOT
George Headley was a West Indian cricketer who played 22 Test matches, mostly before the Second World War.
Considered one of the best batsmen to play for West Indies and one of the greatest cricketers of all time, he also
represented Jamaica and played professionally in England. Headley was born in Panama but raised in Jamaica where he
quickly established a cricketing reputation as a batsman. West Indies had a weak cricket team through most of Headley's
career; as their one world-class player, he carried a heavy responsibility, and they depended on his batting. He batted
at number three, scoring 2,190 runs in tests at an average of 60.83, and 9,921 runs in all first-class matches at an
average of 69.86. He was chosen as one of the Wisden Cricketers of the Year.
EOT;

// Build the hasher, the text extractor and the comparator
$simhash = new \Tga\SimHash\SimHash();
$extractor = new \Tga\SimHash\Extractor\SimpleTextExtractor();
$comparator = new \Tga\SimHash\Comparator\GaussianComparator(3);

// Compute a 64-bit fingerprint for each text
$fp1 = $simhash->hash($extractor->extract($text1), \Tga\SimHash\SimHash::SIMHASH_64);
$fp2 = $simhash->hash($extractor->extract($text2), \Tga\SimHash\SimHash::SIMHASH_64);

// Dump the binary form of each fingerprint
var_dump($fp1->getBinary());
var_dump($fp2->getBinary());

// Index between 0 and 1 : 0.80073740291681
var_dump($comparator->compare($fp1, $fp2));
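As a rough illustration of how two fingerprints can be turned into an index between 0 and 1, the sketch below counts the differing bits (the Hamming distance) and passes that count through a Gaussian-shaped curve. The function names `hammingDistance` and `gaussianSimilarity` and the exact formula are assumptions, not the library's documented API. The formula is at least consistent with the example: with a deviation of 3, a distance of 2 bits gives exp(-2/9), which agrees with the index shown in the comment above.

```php
<?php

/**
 * Count the positions at which two equal-length bit strings differ.
 */
function hammingDistance($binary1, $binary2)
{
    if (strlen($binary1) !== strlen($binary2)) {
        throw new InvalidArgumentException('Fingerprints must have the same length.');
    }

    $distance = 0;
    for ($i = 0, $n = strlen($binary1); $i < $n; $i++) {
        if ($binary1[$i] !== $binary2[$i]) {
            $distance++;
        }
    }

    return $distance;
}

/**
 * Assumed mapping from distance d to a similarity in (0, 1]:
 * exp(-d^2 / (2 * deviation^2)). Identical fingerprints give 1.0.
 */
function gaussianSimilarity($binary1, $binary2, $deviation = 3.0)
{
    $d = hammingDistance($binary1, $binary2);

    return exp(-($d * $d) / (2 * $deviation * $deviation));
}

// Usage with two short, made-up bit strings that differ in 2 positions.
var_dump(gaussianSimilarity('1010', '1001'));
```

A larger deviation makes the index more tolerant of differing bits, while a smaller one makes it drop towards 0 faster.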

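License

This library is released under the MIT license (see LICENSE.md).

About

SimHashPHP is developed mainly by Titouan Galopin.

Reporting an issue or a feature request

Issues and feature requests are tracked in the GitHub issue tracker.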