uthmordar / cardator
Parse web pages and generate card objects from microdata
dev-master
2015-07-06 13:39 UTC
Requires
- php: >=5.4.0
- fabpot/goutte: 2.0.*@dev
This package is not auto-updated.
Last update: 2024-09-28 17:28:10 UTC
README
Cardator package
v 1.3.3
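Parses web pages and collects their microdata.
Filters and hooks can run at card instantiation or during post-processing.
Output: a collection of cards, either as hydrated objects or as a JSON-encoded string.
Basic usage
Requires composer dump-autoload --optimize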
require_once "vendor/autoload.php";
use Uthmordar\Cardator\Card\CardGenerator;
use Uthmordar\Cardator\Card\CardProcessor;
use Uthmordar\Cardator\Cardator;
use Uthmordar\Cardator\Parser\Parser;
try{
$cardator=new Cardator(new CardGenerator, new CardProcessor, new Parser);
/* output only Article cards ("only" takes priority over "except") */
$cardator->addOnly('Article');
/* Thing cards are excluded from the output */
$cardator->addExcept('Thing');
/* choose the URL to crawl and extract data from; throws RuntimeException if the HTTP status is 400 or higher */
$crawl=$cardator->crawl('http://google.fr');
/* the given closure is applied to the given property of every card during post-processing */
$cardator->addPostProcessTreatment('my_property_to_filter', function($name, $value){
// what I want to do
});
$cardator->doPostProcess();
/* get cards as json */
$cards=$cardator->getCards(true);
/* get cards as SplObjectStorage collection */
$cards=$cardator->getCards();
foreach($cards as $c){
// do something with cards
}
}catch(\RuntimeException $e){
// do something with error
}
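Since getCards() returns an SplObjectStorage, the collection can be counted and iterated like any other. The sketch below is a stand-alone illustration: it uses plain stdClass objects as stand-ins for real cards, and the name values are made up.

```php
<?php
// Stand-in for the collection returned by $cardator->getCards().
// Real cards are Cardator objects; stdClass is used here only so the
// snippet runs without the library.
$cards = new SplObjectStorage();
foreach (['First article', 'Second article'] as $title) {
    $card = new stdClass();
    $card->name = $title;
    $cards->attach($card);
}

// SplObjectStorage is Countable and iterates in insertion order.
$names = [];
foreach ($cards as $card) {
    $names[] = $card->name;
}

echo count($cards) . ' cards: ' . implode(', ', $names) . PHP_EOL;
```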
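Extracted data format
This tool crawls a web page looking for markup that follows the microdata specification.
It also tracks a few special attributes and links them to the corresponding itemprop:
- dk-raw is an attribute you should use to provide information meant only for developers or bots, such as a datetime instead of a human-readable date.
- the content attribute can be used on meta tags to mark up content that is hidden from users
- the value attribute can be used to pass a numeric value associated with a tag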
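To make the three attributes concrete, here is a minimal sketch of markup that uses them, together with a few lines of plain PHP DOM code that read them back. This is not Cardator's parser: the markup, the schema.org property names, and the precedence (dk-raw, then content, then value, then the visible text) are all assumptions made for illustration.

```php
<?php
// Hypothetical markup; the type and property names are only examples.
$html = <<<'HTML'
<div itemscope itemtype="http://schema.org/Article">
  <span itemprop="name">My Article</span>
  <span itemprop="datePublished" dk-raw="2015-07-06T13:39:00Z">6 July 2015</span>
  <meta itemprop="inLanguage" content="en">
  <span itemprop="wordCount" value="1200">about 1,200 words</span>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$props = [];
foreach ($xpath->query('//*[@itemprop]') as $node) {
    // Default to the visible text of the element.
    $value = trim($node->textContent);
    // Assumed precedence, not confirmed by this README.
    foreach (['dk-raw', 'content', 'value'] as $attr) {
        if ($node->hasAttribute($attr)) {
            $value = $node->getAttribute($attr);
            break;
        }
    }
    $props[$node->getAttribute('itemprop')] = $value;
}

print_r($props);
```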
You can access some processing information as follows:
$cardator->getTotalCard(); // returns the number of cards found
$cardator->getExecutionTime(); // returns the crawl duration in seconds
$cardator->getStatus(); // returns the crawler's HTTP status
$cardator->getExecutionData(); // returns the information above as an array
Card generation
You can easily create a Card object with:
$cardator->createCard('Article');
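You can change the card library by extending CardGenerator and providing a new namespace path, as long as your cards implement the card interface.
Card properties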
$article=$cardator->createCard('Article');
// GET
$name=$article->name;
$name=$article->name();
// SET
$article->name='My Article';
$article->name('My Article');
// Existing properties are hydrated; a non-existent property creates an entry in the $params array
$article->params['non-existant'];
// The names of all hydrated properties are available as an array
$properties=$article->properties;
// Card type and card hierarchy
$cardName=$article->getQualifiedName();
$cardType=$article->type;
// Parents: returns an array such as ['Thing', 'CreativeWork']
// if an item has more than one parent chain: [['Thing', 'CreativeWork', 'SoftwareApplication'], ['Thing', 'CreativeWork', 'Game']]
$cardParents=$article->getParents();
// or
$cardParents=$article->parents;
// returns 'Thing\CreativeWork\SoftwareApplication::Thing\CreativeWork\Game'
// Returns the direct parent's name
$cardDirectParent=$article->getDirectparent();
Data processing: filtering property values
As shown earlier, you can add a global post-processing treatment on the Cardator instance:
$cardator->addPostProcessTreatment('my_property_to_filter', function($name, $value){
// what I want to do
});
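As an illustration, a treatment that normalizes whitespace could look like the sketch below. The ($name, $value) signature comes from the example above; returning the new value from the closure is an assumption, since this README does not say how Cardator uses the result. The closure is called directly here so the snippet runs without the library.

```php
<?php
// Hypothetical treatment: trims the value and collapses runs of whitespace.
// Returning the cleaned value is an assumption about how Cardator consumes it.
$normalizeWhitespace = function ($name, $value) {
    if (!is_string($value)) {
        return $value; // leave arrays, numbers, etc. untouched
    }
    return trim(preg_replace('/\s+/', ' ', $value));
};

// Direct call, outside Cardator, to show the behaviour.
$cleaned = $normalizeWhitespace('headline', "  Hello \n  world ");
echo $cleaned . PHP_EOL;
```

It would then be registered with $cardator->addPostProcessTreatment('headline', $normalizeWhitespace); where 'headline' is only an example property name.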
For more specific processing, you can also edit a Card in the card library as follows:
public function __construct(){
$this->addFilter('my_property_to_filter', function($name, $value){
// what I want to do
});
}
You can also define your own treatments in Card\lib\FilterCard:
$filter=[
'my_property_to_filter'=>'function to call'
];