pascalbaljetmedia/term-extractor

This package is abandoned and no longer maintained. No replacement package was suggested.

Term Extractor - PHP port of Topia's Term Extractor

1.0 2016-11-04 12:16 UTC

This package is auto-updated.

Last update: 2020-09-12 13:10:20 UTC


README

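This is a PHP port of Topia's Term Extractor (the original is available from the Topia's Term Extractor page).

It was ported to PHP for the Five Filters project.

It also includes some of Joseph Turian's changes to Topia's Term Extractor.

It is licensed under GPL 3.0 and the original license (ZPL 2.1).

We also offer a self-hosted web service built on this package. It replicates Yahoo's Term Extraction API and can extract terms from online web articles. If you want to run it as a web service or perform term extraction on web articles, see http://fivefilters.org/term-extraction/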

Simple example

<?php
// Text to extract terms from
$text = 'Inevitably, then, corporations do not restrict themselves merely to the arena of economics. Rather, as John Dewey observed, "politics is the shadow cast on society by big business". Over decades, corporations have worked together to ensure that the choices offered by \'representative democracy\' all represent their greed for maximised profits.

This is a sensitive task. We do not live in a totalitarian society - the public potentially has enormous power to interfere. The goal, then, is to persuade the public that corporate-sponsored political choice is meaningful, that it makes a difference. The task of politicians at all points of the supposed \'spectrum\' is to appear passionately principled while participating in what is essentially a charade.';

// TermExtractor PHP class (if not using Composer's autoloader)
require 'term-extractor/TermExtractor.php';

$extractor = new TermExtractor();
$terms = $extractor->extract($text);
// We're outputting results in plain text...
header('Content-Type: text/plain; charset=UTF-8');
// Loop through extracted terms and print each term on a new line
foreach ($terms as $term_info) {
	// index 0: term
	// index 1: number of occurrences in text
	// index 2: word count
	list($term, $occurrence, $word_count) = $term_info;
	echo "$term\n";
}
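
Each entry returned by extract() is a three-element array (term, number of occurrences, word count), so ordinary PHP array functions can post-process the results. The following is a minimal sketch that orders terms by occurrence, most frequent first. The $terms array is hypothetical sample data standing in for real extractor output, and sort_terms_by_occurrence() is a helper name invented for illustration; neither is part of the package. It assumes PHP 7 or later.

```php
<?php
// Hypothetical sample data in the shape described above:
// [term, number of occurrences, word count]. Not real extractor output.
$terms = [
    ['public', 2, 1],
    ['John Dewey', 1, 2],
    ['corporations', 3, 1],
];

// Illustrative helper (not part of the package): returns a copy of $terms
// sorted by occurrence, highest first; ties are broken by word count.
function sort_terms_by_occurrence(array $terms): array
{
    usort($terms, function (array $a, array $b): int {
        // Compare occurrence (index 1), then word count (index 2), both descending
        return [$b[1], $b[2]] <=> [$a[1], $a[2]];
    });
    return $terms;
}

// Print each term with its occurrence count, most frequent first
foreach (sort_terms_by_occurrence($terms) as list($term, $occurrence, $word_count)) {
    echo "$term ($occurrence)\n";
}
```

Because usort() works on the copy passed into the helper, the original $terms array keeps its extraction order.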

Advanced example

<?php
// Text to extract terms from
$text = 'Inevitably, then, corporations do not restrict themselves merely to the arena of economics. Rather, as John Dewey observed, "politics is the shadow cast on society by big business". Over decades, corporations have worked together to ensure that the choices offered by \'representative democracy\' all represent their greed for maximised profits.

This is a sensitive task. We do not live in a totalitarian society - the public potentially has enormous power to interfere. The goal, then, is to persuade the public that corporate-sponsored political choice is meaningful, that it makes a difference. The task of politicians at all points of the supposed \'spectrum\' is to appear passionately principled while participating in what is essentially a charade.';

// include PHP files (not needed if using Composer's autoloader)
require 'term-extractor/TermExtractor.php';
require 'term-extractor/DefaultFilter.php';


// Filters
// -------
// Permissive - accept everything
//require 'term-extractor/PermissiveFilter.php';
//$filter = new PermissiveFilter();

// Default - accept terms based on occurrence and word count
// min_occurrence - specify the number of times the term must appear in the original text for it to be accepted.
// keep_if_strength - keep a term if the term's word count is equal to or greater than this, regardless of occurrence.
$filter = new DefaultFilter($min_occurrence=2, $keep_if_strength=2);

// Tagger
// ------
// Create Tagger instance.
// English is the only supported language at the moment.
$tagger = new Tagger('english');
// Initialise the Tagger instance.
// Use APC if available to store the dictionary file in memory 
// (otherwise it gets loaded from disk every time the Tagger is initialised).
$tagger->initialize($use_apc=true); 

// Term Extractor
// --------------
// Create TermExtractor instance
$extractor = new TermExtractor($tagger, $filter);
// Extract terms from the text
$terms = $extractor->extract($text);
// We're outputting results in plain text...
header('Content-Type: text/plain; charset=UTF-8');
// Loop through extracted terms and print each term on a new line
foreach ($terms as $term_info) {
	// index 0: term
	// index 1: number of occurrences in text
	// index 2: word count
	list($term, $occurrence, $word_count) = $term_info;
	echo "$term\n";
	echo "  ->  occurrence: $occurrence, word count: $word_count\n\n";
}
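
The comments on DefaultFilter above describe an acceptance rule built on two thresholds. A minimal sketch of that rule as the comments describe it, assuming a term passes when either condition holds, might look like the following. accept_term() is a hypothetical standalone function written for illustration; it is not the package's DefaultFilter class, whose actual logic may differ in detail.

```php
<?php
// Hypothetical illustration of the rule described in the comments above;
// this is not the DefaultFilter implementation shipped with the package.
function accept_term(int $occurrence, int $word_count, int $min_occurrence = 2, int $keep_if_strength = 2): bool
{
    // Keep terms whose word count reaches the strength threshold, regardless of occurrence
    if ($word_count >= $keep_if_strength) {
        return true;
    }
    // Otherwise require the term to appear often enough in the text
    return $occurrence >= $min_occurrence;
}

var_dump(accept_term(1, 2)); // two-word term seen once
var_dump(accept_term(1, 1)); // single word seen once
var_dump(accept_term(3, 1)); // single word seen three times
```

With both thresholds set to 2, as in the advanced example, this sketch keeps a two-word term that appears once, drops a single word that appears once, and keeps a single word that appears three times.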