acdh-oeaw/uri-normalizer

A simple class for normalizing the URIs of external entity reference sources (Geonames, GND, etc.).

3.0.0 2024-02-13 15:42 UTC


Last update: 2024-09-13 16:57:10 UTC


README


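A class for normalizing named entity URIs from services such as Geonames, GND, VIAF and ORCID, and for retrieving RDF metadata from them.

By default, the rules from the arche-assets library are used, but you can also supply your own.

Any PSR-16 compatible cache can be used to speed up normalization/retrieval of recurring URIs. A combined in-memory and sqlite-based persistent cache implementation is also provided.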
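To illustrate why caching helps with recurring URIs, here is a minimal sketch of memoizing a slow retrieval in memory. It is not the package's cache and does not implement PSR-16; the `MemoizingFetcher` class and the stub closure are hypothetical names introduced only for this example.

```php
<?php
// Hypothetical helper, NOT part of acdh-oeaw/uri-normalizer:
// wraps any slow "fetch" callable and remembers its results per URI.
final class MemoizingFetcher {
    /** @var array<string, string> results keyed by URI */
    private array $cache = [];
    /** number of times the wrapped callable was actually invoked */
    public int $misses = 0;

    public function __construct(private \Closure $fetch) {}

    public function fetch(string $uri): string {
        if (!array_key_exists($uri, $this->cache)) {
            // cache miss: pay the cost of the real retrieval once
            $this->misses++;
            $this->cache[$uri] = ($this->fetch)($uri);
        }
        return $this->cache[$uri];
    }
}

// Stub standing in for a network request (assumption for the example).
$fetcher = new MemoizingFetcher(fn(string $uri): string => "metadata for $uri");
$fetcher->fetch('https://sws.geonames.org/2761369/');
$fetcher->fetch('https://sws.geonames.org/2761369/');
echo $fetcher->misses, "\n";
```

The second call is served from the array, so the wrapped callable runs only once. A persistent cache follows the same idea but stores the results outside the PHP process.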

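Context

When looking at named entity database services, it is hard to tell which URL is the canonical URI of a given named entity.

Several different Geonames URLs (and there are surely more) describe exactly the same Geonames named entity with id 2761369, for example http://geonames.org/2761369/vienna.html and https://sws.geonames.org/2761369/.

Which one is correct? The answer is actually simple: the one the service uses as the RDF triple subject in the RDF metadata it returns. The first aim of this package is therefore to provide a tool that converts any URL from a given service into the canonical URI that the service uses in its RDF metadata.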
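The conversion described above can be sketched with a single regular expression rule. This is only an illustration: the `GEONAMES_RULE` pattern and the `normalizeGeonames()` function are hypothetical and written for this example, and the rules actually shipped with arche-assets may look different.

```php
<?php
// Hypothetical rule written for illustration only.
// The backticks are the PCRE delimiters.
const GEONAMES_RULE = [
    'match'   => '`^https?://(?:[a-z]+[.])?geonames[.]org/([0-9]+)(?:/.*)?$`',
    'replace' => 'https://sws.geonames.org/\1/',
];

/**
 * Maps any Geonames URL variant onto one canonical form
 * by keeping only the numeric id.
 */
function normalizeGeonames(string $url): string {
    $count = 0;
    $normalized = preg_replace(
        GEONAMES_RULE['match'], GEONAMES_RULE['replace'], $url, 1, $count
    );
    if ($normalized === null || $count === 0) {
        throw new InvalidArgumentException("$url does not match the rule");
    }
    return $normalized;
}

$variants = [
    'http://geonames.org/2761369/vienna.html',
    'https://www.geonames.org/2761369',
    'https://sws.geonames.org/2761369/',
];
foreach ($variants as $url) {
    echo normalizeGeonames($url), "\n";
}
```

All three variants print https://sws.geonames.org/2761369/, because the rule discards the scheme, the subdomain and everything after the id.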

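But this raises another question: given a named entity URI, how do you fetch its RDF metadata?

For some services (e.g. ORCID or VIAF) this can be done simply with HTTP content negotiation, by requesting one of the supported RDF formats. Other services require a service-specific method, e.g. in Geonames you need to append /about.rdf to the canonical URI. The second aim of this package is to let you retrieve RDF metadata from a named entity URI/URL without worrying about all these service-specific peculiarities. As such retrieval takes quite some time, a caching option is also provided.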
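The two retrieval strategies can be sketched as a small function that decides which URL to request and which headers to send. This is a simplified illustration, not the package's implementation: `buildRdfRequest()`, its `$resolveSuffix` parameter and the default `application/rdf+xml` format are assumptions made for the example, and the ORCID id is only a placeholder.

```php
<?php
/**
 * Hypothetical helper describing the HTTP request needed to get RDF metadata.
 *
 * @param string      $canonicalUri  canonical URI of the named entity
 * @param string|null $resolveSuffix service-specific path to append
 *                                   (null means plain content negotiation)
 * @param string      $format        RDF MIME type requested via Accept
 * @return array{url: string, headers: array<string, string>}
 */
function buildRdfRequest(
    string $canonicalUri,
    ?string $resolveSuffix = null,
    string $format = 'application/rdf+xml'
): array {
    if ($resolveSuffix !== null) {
        // service-specific method: the RDF document lives under its own URL
        $url = rtrim($canonicalUri, '/') . '/' . ltrim($resolveSuffix, '/');
        return ['url' => $url, 'headers' => []];
    }
    // content negotiation: same URL, the Accept header selects the format
    return ['url' => $canonicalUri, 'headers' => ['Accept' => $format]];
}

// Geonames: append about.rdf to the canonical URI
print_r(buildRdfRequest('https://sws.geonames.org/2761369/', 'about.rdf'));
// ORCID (placeholder id): rely on content negotiation
print_r(buildRdfRequest('https://orcid.org/0000-0002-1825-0097'));
```

Keeping the service-specific part in data (here a suffix) rather than in code is what lets one class handle many services.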

Automatically generated documentation

https://acdh-oeaw.github.io/arche-docs/devdocs/classes/acdhOeaw-UriNormalizer.html

Installation

composer require acdh-oeaw/uri-normalizer

Usage

###
# Initialization
###
$normalizer = new \acdhOeaw\UriNormalizer();

###
# string URL normalization
###
// returns 'https://sws.geonames.org/2761369/'
echo $normalizer->normalize('http://geonames.org/2761369/vienna.html');

###
# EasyRdf resource property normalization
###
$property = 'https://some.id/property';
$graph    = new EasyRdf\Graph();
$resource = $graph->resource('.');
$resource->addResource($property, 'http://aaa.geonames.org/276136/borj-ej-jaaiyat.html');
$normalizer->normalizeMeta($resource, $property);
// returns 'https://sws.geonames.org/276136/'
echo (string) $resource->getResource($property);

###
# Retrieve parsed/raw RDF metadata from URI/URL
###
// print parsed RDF metadata retrieved from the geonames
$metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
echo $metadata->dump('text') . "\n";

// get a PSR-7 request fetching the RDF metadata for a given geonames URL
$request = $normalizer->resolve('http://geonames.org/2761369/vienna.html');
echo $request->getUri() . "\n";

###
# Use your own normalization rules
# and supply a custom Guzzle HTTP client (can be any PSR-18 one) supplying authentication
###
$rules = [
  [
    "match"   => "^https://(?:my.)own.namespace/([0-9]+)(?:/.*)?$",
    "replace" => "https://own.namespace/\\1",
    "resolve" => "https://own.namespace/\\1",
    "format"  => "application/n-triples",
  ],
];
$client = new \GuzzleHttp\Client(['auth' => ['login', 'password']]);
$cache  = false;
$normalizer = new \acdhOeaw\UriNormalizer($rules, '', $client, $cache);
// returns 'https://own.namespace/123'
echo $normalizer->normalize('https://my.own.namespace/123/foo');
// obviously won't work, but if https://own.namespace existed,
// it would be queried with the HTTP Basic auth set up above
$normalizer->fetch('https://my.own.namespace/123/foo');

###
# Use cache
###
$cache = new \acdhOeaw\UriNormalizerCache('db.sqlite');
$normalizer = new \acdhOeaw\UriNormalizer(cache: $cache);
// first retrieval should take 0.1-1 second depending on your connection speed
$t = microtime(true);
$metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
$t = (microtime(true) - $t);
echo $metadata->dump('text') . "\ntime: $t s\n";
// second retrieval should be very quick thanks to in-memory cache
$t = microtime(true);
$metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
$t = (microtime(true) - $t);
echo $metadata->dump('text') . "\ntime: $t s\n";
// a completely separate UriNormalizer instance still benefits from the persistent
// sqlite cache
$cache2 = new \acdhOeaw\UriNormalizerCache('db.sqlite');
$normalizer2 = new \acdhOeaw\UriNormalizer(cache: $cache2);
$t = microtime(true);
$metadata = $normalizer2->fetch('http://geonames.org/2761369/vienna.html');
$t = (microtime(true) - $t);
echo $metadata->dump('text') . "\ntime: $t s\n";

###
# As a global singleton
###
// initialization is done with init() instead of a constructor;
// init() takes the same parameters as the constructor
\acdhOeaw\UriNormalizer::init();
// all other methods (gNormalize(), gFetch() and gResolve()) also work in
// the same way and take the same parameters as their non-static counterparts
// returns 'https://sws.geonames.org/2761369/'
echo \acdhOeaw\UriNormalizer::gNormalize('http://geonames.org/2761369/vienna.html');
// fetch and cache parsed RDF metadata
echo \acdhOeaw\UriNormalizer::gFetch('http://geonames.org/2761369/vienna.html')->dump('text');
// fetch and cache raw RDF metadata
echo \acdhOeaw\UriNormalizer::gResolve('http://geonames.org/2761369/vienna.html')->getBody();
// normalize EasyRdf Resource property
$property = 'https://some.id/property';
$graph    = new EasyRdf\Graph();
$resource = $graph->resource('.');
$resource->addResource($property, 'http://aaa.geonames.org/276136/borj-ej-jaaiyat.html');
\acdhOeaw\UriNormalizer::gNormalizeMeta($resource, $property);
// returns 'https://sws.geonames.org/276136/'
echo (string) $resource->getResource($property);