atrox / matcher
Powerful library for matching and extracting data from XML and HTML
v1.1.1
2018-02-10 20:51 UTC
Requires
- php: >=5.3.0
Requires (Dev)
- masterminds/html5: 2.*
- nette/tester: ~1.1
Suggests
- symfony/css-selector: Allows CSS selectors.
This package is not auto-updated.
Last update: 2024-09-14 15:47:37 UTC
README
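Matcher - a powerful tool for extracting data from XML and HTML using XPath and pure magic.
Why Matcher was made (in Czech), Introduction to XPath (in Czech)
Installation
Install Matcher with Composer: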
composer require atrox/matcher
Examples
use Atrox\Matcher;

// The first XPath selects every post; the paths in the array are evaluated relative to each post.
$m = Matcher::multi('//div[@id="siteTable"]/div[contains(@class, "thing")]', [
    'id'    => '@data-fullname',
    'title' => './/p[@class="title"]/a',
    'url'   => './/p[@class="title"]/a/@href',
    'date'  => './/time/@datetime',
    'img'   => 'a[contains(@class, "thumbnail")]/img/@src',
    'votes' => (object) [
        'ups'   => '@data-ups',
        'downs' => '@data-downs',
        'rank'  => 'span[@class="rank"]',
        'score' => './/div[contains(@class, "score")]',
    ],
])->fromHtml();

$f = file_get_contents('http://www.reddit.com/');
$extractedData = $m($f);
Result
[
    [
        "id" => "t3_1ep0c5",
        "title" => "Obligatory funny cat pictures.",
        "url" => "http://imgur.com/sGu0pEk",
        "date" => "2013-05-20T14:16:24+00:00",
        "img" => "http://e.thumbs.redditmedia.com/MZjtg3UnZ8MOVjcd.jpg",
        "votes" => (object) [
            "ups" => "115036",
            "downs" => "10266",
            "rank" => "1",
            "score" => "105650"
        ]
    ],
    [ ... ]
]
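To make the context-relative paths in the example above more concrete, here is a minimal sketch of the same idea using only PHP's built-in DOMDocument and DOMXPath. It is not Matcher's implementation and does not use Matcher at all; the inline HTML fragment and the variable names are made up for illustration.

```php
<?php
// Made-up HTML fragment standing in for a fetched page.
$html = <<<'HTML'
<div id="siteTable">
  <div class="thing" data-fullname="t3_a">
    <p class="title"><a href="http://example.com/a">First</a></p>
  </div>
  <div class="thing" data-fullname="t3_b">
    <p class="title"><a href="http://example.com/b">Second</a></p>
  </div>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
// The outer query selects one context node per post ...
foreach ($xpath->query('//div[@id="siteTable"]/div[contains(@class, "thing")]') as $node) {
    // ... and each inner path is evaluated relative to that node.
    $rows[] = [
        'id'    => $xpath->evaluate('string(@data-fullname)', $node),
        'title' => $xpath->evaluate('string(.//p[@class="title"]/a)', $node),
        'url'   => $xpath->evaluate('string(.//p[@class="title"]/a/@href)', $node),
    ];
}

print_r($rows);
```

The Matcher version expresses the same outer/inner relationship declaratively, which is what keeps the extraction rules short.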
Matchers can be chained and nested arbitrarily.
// A matcher can be defined once and reused inside other matchers.
$postMatcher = Matcher::single('.//div[@class="postInfo desktop"]', [
    'id'   => './input/@name',
    'name' => './span[@class="nameBlock"]/span[@class="name"]',
    'date' => './span/@data-utc',
]);

$m = Matcher::multi('//div[@class="thread"]', [
    'op'      => Matcher::single('./div[@class="postContainer opContainer"]', $postMatcher),
    'replies' => Matcher::multi('./div[@class="postContainer replyContainer"]', $postMatcher),
])->fromHtml();

$f = file_get_contents('http://boards.4chan.org/po/');
$extractedData = $m($f);
Result
[
    [
        "op" => [
            "id" => "481874858",
            "name" => "Anonymous",
            "date" => "1369242761"
        ],
        "replies" => [
            [
                "id" => "481879347",
                "name" => "moot",
                "date" => "1369244554"
            ],
            ...
        ]
    ],
    [ ... ],
    ...
]
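The nesting shown above can be pictured as ordinary function composition. The sketch below is only an illustration built on PHP's DOMXPath; the `$single`, `$multi` and `$post` closures are hypothetical helpers written for this example and are not part of Matcher's API or its internals. The HTML fragment is made up.

```php
<?php
// Made-up thread markup for illustration.
$html = <<<'HTML'
<div class="thread">
  <div class="post op"><span class="name">Alice</span></div>
  <div class="post reply"><span class="name">Bob</span></div>
  <div class="post reply"><span class="name">Carol</span></div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: wraps $inner so it runs on the first node matching $path.
$single = function ($path, $inner) use ($xpath) {
    return function (DOMNode $ctx) use ($xpath, $path, $inner) {
        $node = $xpath->query($path, $ctx)->item(0);
        return $node === null ? null : $inner($node);
    };
};

// Hypothetical helper: wraps $inner so it runs on every node matching $path.
$multi = function ($path, $inner) use ($xpath) {
    return function (DOMNode $ctx) use ($xpath, $path, $inner) {
        $out = [];
        foreach ($xpath->query($path, $ctx) as $node) {
            $out[] = $inner($node);
        }
        return $out;
    };
};

// One post extractor, reused for both the opening post and the replies.
$post = function (DOMNode $node) use ($xpath) {
    return ['name' => $xpath->evaluate('string(./span[@class="name"])', $node)];
};

$thread = $multi('//div[@class="thread"]', function (DOMNode $t) use ($single, $multi, $post) {
    $op      = $single('./div[@class="post op"]', $post);
    $replies = $multi('./div[@class="post reply"]', $post);
    return ['op' => $op($t), 'replies' => $replies($t)];
});

$threads = $thread($dom);
print_r($threads);
```

Because each extractor is just a function of a context node, the same `$post` definition serves both the single and the multi case, mirroring how `$postMatcher` is reused in the example above.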
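Use with external parsers
Because Matcher works internally with DOMDocument or SimpleXML objects, it can be combined with external HTML/XML parsers such as html5-php.
// $html holds the HTML source as a string.
$html5 = new Masterminds\HTML5(['disable_html_ns' => true]);
$dom = $html5->loadHTML($html);

// Pass the parsed DOM document straight to the matcher.
$m = Matcher::single('//h1');
$title = $m($dom);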
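The example above passes `disable_html_ns` to html5-php. A likely reason, stated here as an assumption rather than something the README confirms, is that a namespace-aware parser places elements in the XHTML namespace, and in XPath 1.0 an unprefixed name such as `//h1` only matches elements in no namespace. The sketch below demonstrates that XPath behaviour with PHP's built-in DOM classes only; it builds the document by hand and uses neither Matcher nor html5-php.

```php
<?php
// Build a tiny document whose <h1> lives in the XHTML namespace,
// similar to what a namespace-aware HTML5 parser would produce.
$ns  = 'http://www.w3.org/1999/xhtml';
$dom = new DOMDocument();
$root = $dom->createElementNS($ns, 'html');
$dom->appendChild($root);
$root->appendChild($dom->createElementNS($ns, 'h1', 'Hello'));

$xpath = new DOMXPath($dom);

// Unprefixed name test: does not match the namespaced element.
$plain = $xpath->query('//h1')->length;

// With a registered prefix (the prefix "x" is arbitrary) the element is found.
$xpath->registerNamespace('x', $ns);
$prefixed = $xpath->query('//x:h1')->length;

echo "plain: $plain, prefixed: $prefixed\n";
```

With the namespace disabled at parse time, plain expressions like `//h1` can be used without registering a prefix.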