acdh-oeaw / uri-normalizer
A simple class for normalizing URIs of external entity reference sources (Geonames, GND, etc.).
3.0.0
2024-02-13 15:42 UTC
Requires
- php: >= 8.0
- acdh-oeaw/arche-assets: ^3.5
- guzzlehttp/guzzle: ^7.5
- psr/simple-cache: ^3.0
- sweetrdf/quick-rdf: ^2
- sweetrdf/quick-rdf-io: ^1.0.7
Requires (Dev)
- phpstan/phpstan: ^1
- phpunit/phpunit: ^10
README
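A class for normalizing named entity URIs from services such as Geonames, GND, VIAF and ORCID, and for retrieving RDF metadata from them.
By default the rules from the arche-assets library are used, but you can also supply your own.
Any PSR-16 compatible cache can be used to speed up normalization/retrieval of recurring URIs. An in-memory and sqlite-based persistent cache implementation is also provided.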
Context
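When looking at named entity database services, it is often hard to tell which URL is the canonical URI of a given named entity.
Let's take a quick look at some Geonames URLs (and there are surely more) that all describe exactly the same Geonames named entity with id 2761369: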
- http://geonames.org/2761369
- https://geonames.org/2761369
- http://www.geonames.org/2761369
- https://www.geonames.org/2761369
- http://geonames.org/2761369/vienna
- https://geonames.org/2761369/vienna
- http://www.geonames.org/2761369/vienna
- https://www.geonames.org/2761369/vienna
- https://www.geonames.org/2761369/vienna/about.rdf
- https://www.geonames.org/2761369/vienna.html
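Which one is the right one? The answer is actually simple: the one that the service uses as the subject of the RDF triples in the RDF metadata it returns. The first aim of this package is therefore to provide a tool that converts any URL coming from a given service into the canonical URI used by that service in its RDF metadata.
But this raises another problem: knowing the URI of a named entity, how do you fetch its RDF metadata?
For some services (e.g. ORCID or VIAF) this can be done simply with HTTP content negotiation, by requesting one of the supported RDF formats. Other services require a service-specific method, e.g. in Geonames you need to append /about.rdf to the canonical URI. The second aim of this package is to let you retrieve RDF metadata from named entity URIs/URLs without worrying about all these service-specific peculiarities. As such retrieval takes quite some time, a caching option is also provided.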
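To make the two aims concrete, here is a minimal standalone sketch of rule-based normalization. It does not use this package. The rule array borrows the match/replace/resolve keys from the custom rules example in the Usage section, but the regular expression itself and the applyRule() helper are assumptions written only for this illustration. The rules shipped with arche-assets may look different.

```php
<?php
// Hypothetical rule written for this example only (not taken from arche-assets).
// 'replace' builds the canonical URI, 'resolve' builds the URL serving RDF metadata.
$rule = [
    'match'   => '^https?://(?:[^/]+[.])?geonames[.]org/([0-9]+)(?:/.*)?$',
    'replace' => 'https://sws.geonames.org/\\1/',
    'resolve' => 'https://sws.geonames.org/\\1/about.rdf',
];

/**
 * Applies the rule's regular expression to a URL using the replacement
 * stored under $key ('replace' or 'resolve').
 * Returns null when the URL does not match the rule.
 */
function applyRule(array $rule, string $url, string $key): ?string
{
    // Backticks serve as regex delimiters because the pattern contains slashes.
    $pattern = '`' . $rule['match'] . '`';
    if (preg_match($pattern, $url) !== 1) {
        return null;
    }
    return preg_replace($pattern, $rule[$key], $url);
}

$urls = [
    'http://geonames.org/2761369',
    'https://www.geonames.org/2761369/vienna/about.rdf',
    'https://www.geonames.org/2761369/vienna.html',
];
foreach ($urls as $url) {
    // Every variant collapses to https://sws.geonames.org/2761369/
    echo applyRule($rule, $url, 'replace') . "\n";
}
```

Keeping the canonical form and the metadata URL as two replacements of the same match lets one rule answer both questions (which URI is canonical, and where the RDF lives) without any service-specific code.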
Automatically generated documentation
https://acdh-oeaw.github.io/arche-docs/devdocs/classes/acdhOeaw-UriNormalizer.html
Installation
composer require acdh-oeaw/uri-normalizer
Usage
    ###
    # Initialization
    ###
    $normalizer = new \acdhOeaw\UriNormalizer();

    ###
    # string URL normalization
    ###
    // returns 'https://sws.geonames.org/2761369/'
    echo $normalizer->normalize('http://geonames.org/2761369/vienna.html');

    ###
    # EasyRdf resource property normalization
    ###
    $property = 'https://some.id/property';
    $graph = new EasyRdf\Graph();
    $resource = $graph->resource('.');
    $resource->addResource($property, 'http://aaa.geonames.org/276136/borj-ej-jaaiyat.html');
    $normalizer->normalizeMeta($resource, $property);
    // returns 'https://sws.geonames.org/276136/'
    echo (string) $resource->getResource($property);

    ###
    # Retrieve parsed/raw RDF metadata from URI/URL
    ###
    // print parsed RDF metadata retrieved from the geonames
    $metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
    echo $metadata->dump('text') . "\n";
    // get a PSR-7 request fetching the RDF metadata for a given geonames URL
    $request = $normalizer->resolve('http://geonames.org/2761369/vienna.html');
    echo $request->getUri() . "\n";

    ###
    # Use your own normalization rules
    # and supply a custom Guzzle HTTP client (can be any PSR-18 one) supplying authentication
    ###
    $rules = [
        [
            "match"   => "^https://(?:my.)own.namespace/([0-9]+)(?:/.*)?$",
            "replace" => "https://own.namespace/\\1",
            "resolve" => "https://own.namespace/\\1",
            "format"  => "application/n-triples",
        ],
    ];
    $client = new \GuzzleHttp\Client(['auth' => ['login', 'password']]);
    $cache = false;
    $normalizer = new \acdhOeaw\UriNormalizer($rules, '', $client, $cache);
    // returns 'https://own.namespace/123'
    echo $normalizer->normalize('https://my.own.namespace/123/foo');
    // obviously won't work but if the https://own.namespace would exist,
    // it would be queried with the HTTP BASIC auth as set up above
    $normalizer->fetch('https://my.own.namespace/123/foo');

    ###
    # Use cache
    ###
    $cache = new \acdhOeaw\UriNormalizerCache('db.sqlite');
    $normalizer = new \acdhOeaw\UriNormalizer(cache: $cache);
    // first retrieval should take 0.1-1 second depending on your connection speed
    $t = microtime(true);
    $metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
    $t = (microtime(true) - $t);
    echo $metadata->dump('text') . "\ntime: $t s\n";
    // second retrieval should be very quick thanks to in-memory cache
    $t = microtime(true);
    $metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
    $t = (microtime(true) - $t);
    echo $metadata->dump('text') . "\ntime: $t s\n";
    // a completely separate UriNormalizer instance still benefits from the persistent
    // sqlite cache (note that it is given the second cache object, $cache2)
    $cache2 = new \acdhOeaw\UriNormalizerCache('db.sqlite');
    $normalizer2 = new \acdhOeaw\UriNormalizer(cache: $cache2);
    $t = microtime(true);
    $metadata = $normalizer2->fetch('http://geonames.org/2761369/vienna.html');
    $t = (microtime(true) - $t);
    echo $metadata->dump('text') . "\ntime: $t s\n";

    ###
    # As a global singleton
    ###
    // initialization is done with init() instead of a constructor
    // the init() takes same parameters as the constructor
    \acdhOeaw\UriNormalizer::init();
    // all other methods (gNormalize(), gFetch() and gResolve()) also work in
    // the same way and take same parameters as their non-static counterparts
    // returns 'https://sws.geonames.org/2761369/'
    echo \acdhOeaw\UriNormalizer::gNormalize('http://geonames.org/2761369/vienna.html');
    // fetch and cache parsed RDF metadata
    echo \acdhOeaw\UriNormalizer::gFetch('http://geonames.org/2761369/vienna.html')->dump('text');
    // fetch and cache raw RDF metadata
    echo \acdhOeaw\UriNormalizer::gResolve('http://geonames.org/2761369/vienna.html')->getBody();
    // normalize EasyRdf Resource property
    $property = 'https://some.id/property';
    $graph = new EasyRdf\Graph();
    $resource = $graph->resource('.');
    $resource->addResource($property, 'http://aaa.geonames.org/276136/borj-ej-jaaiyat.html');
    \acdhOeaw\UriNormalizer::gNormalizeMeta($resource, $property);
    // returns 'https://sws.geonames.org/276136/'
    echo (string) $resource->getResource($property);