uthmordar/cardator

Parses web pages and generates card objects from microdata.

dev-master 2015-07-06 13:39 UTC

This package is not auto-updated.

Last update: 2024-09-28 17:28:10 UTC


README

Cardator package

v 1.3.3

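Parses web pages and collects microdata.

Filters and hooks can be applied at card instantiation or during post-processing.

Output: a collection of cards, either as hydrated objects or as a JSON-encoded string.

Basic usage

Requires composer dump-autoload --optimize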

require_once "vendor/autoload.php";

use Uthmordar\Cardator\Card\CardGenerator;
use Uthmordar\Cardator\Card\CardProcessor;
use Uthmordar\Cardator\Cardator;
use Uthmordar\Cardator\Parser\Parser;

try{
    $cardator=new Cardator(new CardGenerator, new CardProcessor, new Parser);

    /* only Article cards are returned (addOnly takes priority over addExcept) */
    $cardator->addOnly('Article');

    /* Thing cards are excluded from the output */
    $cardator->addExcept('Thing');

    /* choose the URL to crawl and extract data from; throws RuntimeException if the HTTP status is 400 or above */
    $crawl=$cardator->crawl('http://google.fr');
    
    /* the given closure is applied to the given property on every card during post-processing */
    $cardator->addPostProcessTreatment('my_property_to_filter', function($name, $value){
        // what I want to do
    });
    $cardator->doPostProcess();
    
    /* get cards as json */
    $cards=$cardator->getCards(true);
    
    /* get cards as SplObjectStorage collection */
    $cards=$cardator->getCards();
    foreach($cards as $c){
        // do something with cards
    }
}catch(\RuntimeException $e){
    // do something with error 
}

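Extracted data format

This tool crawls a web page by searching for markup that follows the microdata specification.

It also tracks a few special attributes and links them to the given itemprop:

  • dk-raw is an attribute you should use to provide information meant only for developers or bots, for example a machine-readable datetime instead of a human-readable date.
  • The content attribute can be used on meta tags to mark up content hidden from the user.
  • The value attribute can be used to pass a numeric value associated with the tag.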
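
To make the role of these attributes concrete, here is a minimal sketch that reads itemprop values from a small HTML fragment using PHP's built-in DOM extension. It does not use Cardator itself. The helper name readItemprops, the sample markup, and above all the lookup order (dk-raw, then content, then value, then the visible text) are assumptions made for illustration; the README does not state how Cardator resolves them.

```php
<?php
// Hypothetical helper, not part of Cardator.
// Assumption: dk-raw wins over content, which wins over value, which wins over visible text.
function readItemprops(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings (e.g. unknown HTML5 tags) instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//*[@itemprop]') as $node) {
        // Default to the human-readable text of the element.
        $value = trim($node->textContent);
        foreach (['dk-raw', 'content', 'value'] as $attr) {
            if ($node->hasAttribute($attr)) {
                $value = $node->getAttribute($attr);
                break;
            }
        }
        $result[$node->getAttribute('itemprop')] = $value;
    }
    return $result;
}

$html = <<<'HTML'
<div itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="name">My Article</h1>
  <span itemprop="datePublished" dk-raw="2015-07-06">6 July 2015</span>
  <meta itemprop="keywords" content="php,microdata">
  <data itemprop="ratingValue" value="4.5">four and a half</data>
</div>
HTML;

$props = readItemprops($html);
```

In this sketch the readable date "6 July 2015" stays visible to the user, while the machine-readable value from dk-raw is what ends up in $props['datePublished'].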

You can access some information about the processing as follows:

    $cardator->getTotalCard(); // number of cards found
    $cardator->getExecutionTime(); // crawl duration in seconds
    $cardator->getStatus(); // HTTP status returned by the crawler

    $cardator->getExecutionData(); // the information above, as an array

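Card generation

You can easily create a Card object with:

    $cardator->createCard('Article');

You can change the card library by extending CardGenerator and providing a new namespace path, as long as your cards implement the card interface.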

Card properties

    $article=$cardator->createCard('Article');
    
    // GET
    $name=$article->name;
    $name=$article->name();
    // SET
    $article->name='My Article';
    $article->name('My Article');

    // Existing properties are hydrated; a non-existent property creates an entry in the $params array
    $article->params['non-existant'];

    // The names of all hydrated properties are available as an array
    $properties=$article->properties;

    // Card type and card hierarchy
    $cardName=$article->getQualifiedName();
    $cardType=$article->type;

    // Parents: getParents() returns an array such as ['Thing', 'CreativeWork'];
    // if an item has more than one parent chain: [['Thing', 'CreativeWork', 'SoftwareApplication'], ['Thing', 'CreativeWork', 'Game']]
    $cardParents=$article->getParents();
    // or, as a string such as 'Thing\CreativeWork\SoftwareApplication::Thing\CreativeWork\Game'
    $cardParents=$article->parents;

    // Return the name of the direct parent
    $cardDirectParent=$article->getDirectparent();
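
The string form of the hierarchy shown above uses "\" between levels and "::" between alternative parent chains. As a minimal sketch, the hypothetical helper below (splitParents is not part of Cardator) turns such a string back into nested arrays, assuming the format is exactly as in the comment above.

```php
<?php
// Hypothetical helper, not part of Cardator.
// Assumes "::" separates parent chains and "\" separates levels within a chain.
function splitParents(string $parents): array
{
    if ($parents === '') {
        return [];
    }
    $chains = [];
    foreach (explode('::', $parents) as $chain) {
        // '\\' in a single-quoted PHP string is one literal backslash.
        $chains[] = explode('\\', $chain);
    }
    return $chains;
}

$chains = splitParents('Thing\CreativeWork\SoftwareApplication::Thing\CreativeWork\Game');
```

A single-chain string yields an array containing one chain, so callers can always iterate over the result the same way.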

Data processing: filtering property values

As mentioned above, you can add a global post-processing treatment on the cardator:

    $cardator->addPostProcessTreatment('my_property_to_filter', function($name, $value){
        // what I want to do
    });
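
As an illustration of what such a closure body might look like, the sketch below normalizes whitespace in a property value. The README does not say whether the closure's return value is written back to the card, so that behaviour is an assumption here, as are the closure name $collapseWhitespace and the property name 'description'.

```php
<?php
// Sketch only. Assumption: the value returned by the closure replaces the property value.
$collapseWhitespace = function ($name, $value) {
    // Leave non-string values (arrays, nested cards, ...) untouched.
    if (!is_string($value)) {
        return $value;
    }
    // Trim the ends and collapse any run of whitespace into a single space.
    return preg_replace('/\s+/', ' ', trim($value));
};

// With a Cardator instance available, it would be registered like this:
// $cardator->addPostProcessTreatment('description', $collapseWhitespace);

$cleaned = $collapseWhitespace('description', "  Hello \n  world  ");
```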

If you need a more specific treatment, you can also edit a Card in the card library as follows:

    public function __construct(){
        $this->addFilter('my_property_to_filter', function($name, $value){
            // what I want to do
        });
    }

You can also define your own treatments in Card\lib\FilterCard:

    $filter=[
        'my_property_to_filter'=>'function to call'
    ];