imonroe / corpora
A PHP-friendly version of the dariusk/corpora JavaScript library. It provides "[a] collection of small corpuses of interesting data for the creation of bots and similar stuff".
Requires
- php: ^7.0|^8.0
- squizlabs/php_codesniffer: ^3.6
Requires (Dev)
- phpunit/phpunit: ^9.5
- squizlabs/php_codesniffer: ^3.6
This package is auto-updated.
Last update: 2024-09-11 23:55:15 UTC
README
This is a PHP- and Composer-friendly fork of the dariusk/corpora package, designed for easy use with PHP projects.
Check the files in the /data directory to find out all the things in the corpora. Each JSON file is an array with about a thousand examples of whatever you're asking for.
Installation
composer require imonroe/corpora
Usage
use imonroe\corpora\Corpora;
$corpora = new Corpora;
// Returns an array of available categories
$categories = $corpora->getCategories();
// Returns an array of subcategories
$subcategories = $corpora->getCategories('architecture');
// Return just the description of a given data file, if one is available.
$description = $corpora->getDescription('words.nouns');
// $description == "A list of English nouns."
// Returns an array of data from the corpora.
// Specify the file you want in the form of "dirname.dirname.filename"
// Do not include the .json extension.
// Available files are included in the /data directory of this repo.
// For instance, if you wanted the contents of the ./data/words/nouns.json file, you'd
// request it like this:
$nouns = $corpora->getDataFile('words.nouns');
// If you want ./data/music/genres.json, you'd call it like:
$genres = $corpora->getDataFile('music.genres');
// If you want ./data/societies_and_groups/fraternities/fraternities.json, you'd call:
$fraternities = $corpora->getDataFile('societies_and_groups.fraternities.fraternities');
// You can inspect any of these arrays in the usual way to find out what they contain.
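To make the "dirname.dirname.filename" convention above concrete, here is a minimal sketch of how a dotted name could be mapped to a JSON file and decoded. It is not the library's actual implementation: the function names dottedNameToPath and loadCorpusFile and the $dataDir parameter are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helpers, NOT part of imonroe/corpora. They only illustrate
// the dotted-name convention described in the usage comments.

// Turn "music.genres" into "<dataDir>/music/genres.json".
function dottedNameToPath(string $name, string $dataDir): string
{
    // Accept only letters, digits, underscores and hyphens in each segment,
    // so a name cannot climb out of the data directory (e.g. via "..").
    if (!preg_match('/^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*$/', $name)) {
        throw new InvalidArgumentException("Invalid corpus name: {$name}");
    }
    return rtrim($dataDir, '/') . '/' . str_replace('.', '/', $name) . '.json';
}

// Read the file behind a dotted name and decode it into a PHP array.
function loadCorpusFile(string $name, string $dataDir): array
{
    $path = dottedNameToPath($name, $dataDir);
    if (!is_file($path)) {
        throw new RuntimeException("No such data file: {$path}");
    }
    $decoded = json_decode(file_get_contents($path), true);
    if (!is_array($decoded)) {
        throw new RuntimeException("Could not decode JSON in {$path}");
    }
    return $decoded;
}

// dottedNameToPath('music.genres', './data') returns './data/music/genres.json'
```

In a real project you would call $corpora->getDataFile() instead; the sketch is only meant to show why the .json extension is left off and why each dot stands for a directory separator.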
Testing
composer test
Style
composer check-style and composer fix-style
Original dariusk/corpora README
Corpora
This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.
I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.
I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.
License
Since Corpora is more data than code, I have chosen to CC0 license this (rather than MIT license or similar).
To the extent possible under law, Darius Kazemi has waived all copyright and related or neighboring rights to Corpora. This work is published from: United States.
What is Corpora NOT?
This project is not meant to replace exhaustive APIs -- if you want nouns, and you want every noun in the English language, replete with metadata, consider Wordnik. If you want the title of every Wikipedia article, use the MediaWiki API.
What is Corpora?
- Corpora is a repository of JSON files, meant to be language-neutral. If you want to create an NPM repo or whatever based on this, be my guest, but this repository will remain a collection of data files that can be interpreted by any language that can parse JSON.
- Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.
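- For example, Corpora will not contain any complete "dictionary" style files. Instead we provide a sample of 1000 common nouns, adjectives, and verbs.
- Some lists, by their nature, are small enough that we may include all of a thing in its category. For example, a list of heavily populated US cities may only be 75 cities long and be considered complete.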
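List of Corpora-related tools
- corpora-project, a Node.js NPM package for offline access to Corpora data.
- pycorpora, a simple Python interface for Corpora.
- corpora-api, a Node.js server that serves Corpora as a JSON API (now running at https://corpora-api.glitch.me)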
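I have some data, how do I submit?
We accept pull requests to this repository. Some guidelines:
- By submitting your data as a pull request, you agree to let us license your data under the CC0 Free Culture License, meaning that anyone can use the data in perpetuity without copyright and without attribution.
- Please submit all data in JSON format with the file extension .json, and check your files with JSONLint before submission. Thanks to Matt Rothenberg, we have Travis-CI testing that will automatically jsonlint your pull request. If you get a notification that the test failed after you submit, that means there's a problem with your JSON!
- Keep individual files to about 1000 "things" or so. Fewer than 1000 is fine.
- If you would like to be credited, I'm happy to include your name in the Readme file. Just keep in mind that people who use this data are under no obligation to include credit in their own projects.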
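The guidelines above ask for valid JSON and files of roughly 1000 items. As a rough local pre-check before opening a pull request, something like the following PHP sketch could be used. It is an assumption-laden illustration, not the project's CI: the function name checkCorpusJson and the $maxItems parameter are hypothetical, and the repository itself relies on jsonlint.

```php
<?php
// Hypothetical pre-submission check, NOT the project's own tooling.
// Returns a list of human-readable problems; an empty list means none found.
function checkCorpusJson(string $json, int $maxItems = 1000): array
{
    $decoded = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        return ['Invalid JSON: ' . json_last_error_msg()];
    }
    if (!is_array($decoded)) {
        return ['Top-level value should be a JSON object or array.'];
    }

    // Look at the top-level value and at every array one key down,
    // since a file may wrap its list in an object next to a description.
    $candidates = ['(top level)' => $decoded];
    foreach ($decoded as $key => $value) {
        if (is_array($value)) {
            $candidates[(string) $key] = $value;
        }
    }

    $problems = [];
    foreach ($candidates as $label => $list) {
        if (count($list) > $maxItems) {
            $problems[] = "{$label} has " . count($list)
                . " items (guideline: about {$maxItems}).";
        }
    }
    return $problems;
}
```

Because the guideline says "about 1000", treat a size warning from this sketch as a prompt to reconsider the file, not as a hard failure.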