valhook/dparse

dParse is a powerful, jQuery-style HTML/XML parser written in PHP.

1.0 2016-05-08 16:34 UTC

This package is not auto-updated.

Last update: 2024-09-20 19:40:17 UTC



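dParse is a powerful, jQuery-style HTML/XML parser written in PHP. I started this project when I realized that the parsers currently available on the internet (Simple HTML DOM, Ganon, etc.) could be improved. So I decided to write a PHP parser that does better in the following areas:

  • Speed
  • Features
  • Flexibility
  • Memory usage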

Installation

From your project's root directory, run this command to add the package to your application:

composer require valhook/dparse:*

Or add this package to the require section of your composer.json:

{
    "require": {
        "valhook/dparse": "*"
    }
}

Usage

The sections below explain how to use dParse's features.

Creating a DOM

$dom = createdParseDOM($source, $args);
  • $source can be a remote URL, raw HTML/XML code, or a local file path.
  • $args is an optional array of advanced options. The defaults are shown below.
$defaultargs = array("method" => "GET", // just concatenate the url and the http body in the source parameter and specify here the HTTP Method
                     "fake_user_agent" => NULL,
                     "fake_http_referer" => NULL,
                     "force_input_charset" => NULL,
                     "output_charset" => NULL,
                     "strip_whitespaces" => false,
                     "connect_timeout" => 10,
                     "transfer_timeout" => 40,
                     "verify_peer" => false,
                     "http_auth_username" => NULL,
                     "http_auth_password" => NULL,
                     "cookie_file" => NULL,
                     "is_xml" => FALSE,
                     "enable_logger" => FALSE);
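
As a minimal sketch, only the options that differ from the defaults need to be passed (this assumes omitted keys fall back to the defaults listed above). The user agent string and the example.com URL are placeholders, and the file check assumes dParse.php sits next to the script, as in the practical examples further down; it only keeps the script from failing when the library is not there.

```php
<?php
// Options that differ from the defaults; every key appears in $defaultargs above.
$args = array(
    "fake_user_agent"   => "Mozilla/5.0 (compatible; ExampleBot/1.0)", // hypothetical value
    "strip_whitespaces" => true,
    "connect_timeout"   => 5,
    "transfer_timeout"  => 20,
    "enable_logger"     => true,
);

// Assumption: dParse.php is in the same directory as this script.
$library = __DIR__ . "/dParse.php";
if (is_file($library)) {
    require_once $library;
    // https://example.com/ is a placeholder source URL.
    $dom = createdParseDOM("https://example.com/", $args);
    echo $dom->getSize(), " bytes\n";
}
```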

Or fetch the raw contents directly:

$contents = dParseGetContents($source, $args);

Raw content and DOM-level operations

The DOM object provides the following functions:

$dom->getRawContent(); /* or */ $dom; // output is a string
$dom->showRawContent(); // echoes the raw content
$dom->saveRawContent($filename); // writes the content to a file
$dom->getSize(); // returns the size of the content in bytes, as an int
$dom->setWhitespaceStripping($bool); // Tells dParse to strip all extra whitespace whenever a string is returned or echoed.
$dom->getWhitespaceStripping(); // Gets the current whitespace stripping status
$dom->setInputCharset($charset); // Tells dParse which charset to use to interpret the document data; by default it is deduced from the HTTP/HTML headers
$dom->getInputCharset();
$dom->setOutputCharset($charset); // Tells dParse whether to translate the charset when echoing or returning a string computed from the original DOM; by default no translation is done, so the output charset is the same as the input's.
$dom->getOutputCharset();
$dom->getNoise(); // Returns an array of strings of unparsed data
        /* Noise regexes used by dParse, you may add yours at line 270 */
        $noise = array("'<!--(.*?)-->'is",
                        "'<!DOCTYPE(.*?)>'is",
                        "'<!\[CDATA\[(.*?)\]\]>'is",
                        "'(<\?)(.*?)(\?>)'s",
                        "'(\{\w)(.*?)(\})'s"
                        );
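
To see what these patterns match, here is a small sketch that runs them over a made-up HTML string with PHP's own preg functions. It does not use dParse, and it only illustrates the regexes themselves; how dParse stores or restores the matched spans internally is not shown here.

```php
<?php
// The five noise patterns listed above, copied verbatim.
$noise = array("'<!--(.*?)-->'is",
               "'<!DOCTYPE(.*?)>'is",
               "'<!\[CDATA\[(.*?)\]\]>'is",
               "'(<\?)(.*?)(\?>)'s",
               "'(\{\w)(.*?)(\})'s");

// Made-up sample input containing one span of each kind.
$html = '<!DOCTYPE html><!-- banner --><p>Hello {name}</p><![CDATA[raw]]><?php echo 1; ?>';

// Collect every span matched by any pattern.
$found = array();
foreach ($noise as $pattern) {
    preg_match_all($pattern, $html, $matches);
    $found = array_merge($found, $matches[0]);
}
print_r($found);

// Removing all matched spans leaves only the markup between them.
$clean = preg_replace($noise, "", $html);
echo $clean, "\n"; // prints <p>Hello </p>
```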

Getting DOM nodes

Query DOM nodes/tags with CSS selectors, just as in jQuery:

$nodes_that_are_images = $dom('img');
$nodes_that_are_link_with_btn_class = $dom('a.btn');

// Most of the CSS3 selecting standard is supported
$nodes = $dom('a + a');
$nodes = $dom('div ~ div');
$nodes = $dom('div > p');
$nodes = $dom('ul ul > li');
$nodes = $dom('input[type=text]');
$nodes = $dom('img[src^="https"]');
$nodes = $dom('img[src$=".jpg"]');
$nodes = $dom('a[href*=google]');
$nodes = $dom('body > *');
// Of course, it is no fun if you cannot combine them
$nodes = $dom('article[class*=post] section > div + div.classy, #some_id ~ .classy');

// Getting the root element
$rootnode = $dom->root();

// Remaining bugs
// Multiple classes are not supported, use:
$nodes = $dom('a[class="btn btn-primary"]'); /* instead of */ $nodes = $dom('a.btn.btn-primary');

// Pseudo-selectors such as :not and :first-child are not yet supported

// For PHP < 5.3 users, use:
$dom->find('foo'); /* instead of */ $dom('foo');

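MetaNode objects

  • A CSS-style selector query returns a MetaNode object, which is a set of distinct nodes.
  • $dom('div'); returns a MetaNode object containing n Node objects.

Before looking at how to interact with the nodes inside a MetaNode object, let's go through the MetaNode-level operations.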

$nodes->merge($othernodes); // Returns a new MetaNode Object containing the union of all the nodes from both MetaNodes
$nodes->length(); // Returns the number of nodes inside this meta node.
$nodes->eq($n); /* or */ $nodes->elem($n); // Returns the nth node of this MetaNode.
    // If $n is a MetaNode or a Node, it returns the intersection of both sets.
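MetaNode versatility
  • The MetaNode is the interface between you and the DOM nodes your query returned.
  • If you have several nodes, you can call a node-level function on the MetaNode; it is passed to every node, and you get back an array containing each node's response.
/* Example */
$dom('a')->text(); // will return array("foo", "bar", "baz", ...)
  • However, if your MetaNode contains only one node, you can use the MetaNode directly as a single node and call any of the node functions listed below.
/* Example */
$dom('#unique-id')->text(); // will return "foo" and not array("foo")
  • If you would rather not rely on this versatility, or if you do not know how many nodes you have, you can iterate over the MetaNode with foreach, even when it contains only one node.
foreach($nodes as $node) {
    // $node->do_something();
}
  • You can pass any node-level function to your MetaNode. The node-level functions are listed below.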
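
The array-versus-single-value behavior can be pictured with a toy model. The sketch below is not dParse's implementation: ToyNode and ToyMetaNode are hypothetical stand-in classes, and forwarding calls through __call is just one plausible way to get the behavior described above.

```php
<?php
// Hypothetical stand-in for a node; only text() is modeled.
class ToyNode {
    private $text;
    public function __construct($text) { $this->text = $text; }
    public function text() { return $this->text; }
}

// Hypothetical stand-in for a MetaNode: a set of nodes that forwards calls.
class ToyMetaNode implements IteratorAggregate {
    private $nodes;
    public function __construct(array $nodes) { $this->nodes = $nodes; }
    public function length() { return count($this->nodes); }

    // Forward any node-level call to every node and collect the responses.
    public function __call($method, $args) {
        $results = array();
        foreach ($this->nodes as $node) {
            $results[] = call_user_func_array(array($node, $method), $args);
        }
        // A single node answers directly instead of through a one-element array.
        return count($results) === 1 ? $results[0] : $results;
    }

    // Lets foreach work whatever the number of nodes.
    public function getIterator(): Traversable { return new ArrayIterator($this->nodes); }
}

$many = new ToyMetaNode(array(new ToyNode("foo"), new ToyNode("bar")));
$one  = new ToyMetaNode(array(new ToyNode("foo")));

var_dump($many->text()); // an array with "foo" and "bar"
var_dump($one->text());  // the string "foo"

foreach ($one as $node) {
    echo $node->text(), "\n"; // prints foo
}
```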

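Node objects

  • A node is an HTML tag. For example, <a ...>Foo</a> is a node.
  • A Node object can be used to extract or modify its content.
  • If a modification is made, it directly updates the whole DOM.

Getters

Most getters work as they do in jQuery.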

$node->_dom(); // Returns the DOM linked to this node.
$node->index(); // Returns the index of the node in the DOM. HTML is 0, HEAD is 1, TITLE can be 2 etc...
$node->length(); // Always returns 1; it is a compatibility abstraction with the MetaNode object.
$node->tagName(); /* or */ $node->name(); // Returns the tag name (a, li, div etc...)
$node->attributes(); // Returns a dictionary of the attributes
   /* Example:
    array(2) {
    ["href"]=>
    string(8) "#contact"
    ["class"]=>
    string(23) "btn border-button-black"
    }
    Therefore you will get an array of arrays if you call it from a MetaNode
    */
$node->XXXX; // Will return the content of an attribute, examples:
    $node->href;
    $node->src;
    $node->type;
    $node->attr('XXXX'); /* or */ $node->prop('XXXX'); /* is the same as */ $node->XXXX;
$node->depth(); // Will return the depth (int) of the node inside the DOM
$node->breadcrumb(); // Will return the full path from the root element to this node
$node->breadcrumb_size(); // Returns the size of the breadcrumb
$node->breadcrumb_element($i); // Returns a sub-element of the node's breadcrumb
$node->val(); // Same as $node->value. Support for textareas and selects will be added later, as the value attribute is irrelevant for them
$node->html(); // Returns the inner HTML of the node as a string
$node->htmlLength();
$node->outerHTML(); // Returns the outer HTML of the node
$node->outerHTMLLength();
$node->text(); // Returns the inner text, therefore the inner HTML with HTML tags stripped
$node->textLenght();
$node; // This is the __toString method, it is the same as $node->outerHTML();

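CSS sub-queries

  • As in jQuery, you can run a CSS query from a node rather than from the whole DOM.
  • If you apply any CSS sub-query function to a MetaNode, you get a MetaNode that is the union of the results of applying that query to each node.
Selector types
  • No selector: the method takes no argument
  • Selector: the method takes a CSS selector as its argument
  • Smart selector: the method takes one argument, which can be
    1. A CSS selector
    2. A MetaNode or a node (e.g. parentsUntil($node)), which returns all parents that match the given node or belong to the MetaNode
    3. Sometimes, where it makes sense, an int (e.g. parentsUntil(2)), which returns the first two parents
Methods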
$node->find($smartselector); // Finds the subnodes of this node matching this CSS selector
$node->parent($smartselector = NULL); // Returns the first parent, or the parents that match the selector
$node->parents(); // Returns all parents
$node->parentsUntil($smartselector); // Return all the parents until the selector
$node->prev($smartselector = NULL); // Returns the first previous element, same depth level, or the previous one that matches the selector.
$node->prevAll();
$node->prevUntil($smartSelector);
$node->next($smartselector = NULL);
$node->nextAll();
$node->nextUntil($smartselector);
$node->children($smartselector = NULL); // If the selector is empty it returns all the children, if it is an int *i* it returns the first i children in the order of declaration inside the DOM, if it is a CSS selector or a MetaNode it returns the children that intersect with the CSS selector or the nodes.
$node->is($smartselector); // Returns itself (castable to true) for chaining purposes, or false, depending on whether the node is part of the MetaNode or of the results of the CSS query
$node->has($smartselector = NULL); /* or */ $node->hasChild($smartselector = NULL); // Returns itself or false
$node->hasParent($smartselector = NULL);
$node->hasPrev($smartselector = NULL);
$node->hasNext($smartselector = NULL);

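Setters

  • As in jQuery, they directly modify the node or MetaNode, along with every other instance of the node and the DOM.
  • They are not all done yet; so far I have implemented only four methods:
$node->XXXX = "YYYY"; /* or */ $node->attr('XXXX', 'YYYY'); /* or */ $node->prop('XXXX', 'YYYY'); // Changes an attribute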
$node->addClass($class);
$node->removeClass($class);
$node->setTagName($name); // Changes the tag name, ex span to div;

That's all there is to Nodes and MetaNodes!

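Logger

  • dParse ships with a logger for debugging.
  • The logger is disabled by default, but you can enable it when creating the DOM (it is one of the available args) or later through the logger API.

Methods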

$dom->getLogger(); // Returns the logger object
$logger->isEnabled(); // Tells whether the logger is enabled
$logger->enable($bool); // Enables or disables the logger
$logger->getLogs(); // Returns an array of strings that are the logs
$logger->getLastLog(); // Returns the last entry in the logbook
$logger->clear(); // Clears all the logs
$logger->showLogs(); // Echoes all the logs
$logger->saveLogs($filename); // Writes all the logs to a file
$logger->log($message); // Logs a message if the logger is enabled

Sample logs

Array
(
    [0] => Retrieved document contents in 0.78749895095825 seconds
    [1] => Now parsing document ...
    [2] => The document charset was found and is: UTF-8
    [3] => Found 57 noisy tags.
    [4] => Ignored 0 tags.
    [5] => Found 1225 elements, including 0 recreated tags to fix invalid HTML
    [6] => Document parsed in 0.09472606658936 seconds
    [7] => Memory peak usage: 13 915 152 bytes
    [8] => 
    [9] => Performing CSS query: h3 a[href*=watch]
    [10] => Found 19 nodes in 0.056853046417236 seconds
    [11] => Memory peak usage: 14 411 672 bytes
)
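
Since getLogs() returns plain strings, timings and counts have to be parsed back out if you want them as numbers. The sketch below does that with standard PHP regexes on three entries copied from the sample above; it assumes the wording of the log lines stays exactly as shown, which the logger does not guarantee.

```php
<?php
// Entries copied from the sample log output above.
$logs = array(
    "Performing CSS query: h3 a[href*=watch]",
    "Found 19 nodes in 0.056853046417236 seconds",
    "Memory peak usage: 14 411 672 bytes",
);

$stats = array();
foreach ($logs as $line) {
    if (preg_match('/^Found (\d+) nodes in ([\d.]+) seconds$/', $line, $m)) {
        $stats["nodes"]   = (int) $m[1];
        $stats["seconds"] = (float) $m[2];
    } elseif (preg_match('/^Memory peak usage: ([\d ]+) bytes$/', $line, $m)) {
        // The sample groups digits with spaces; strip them before casting.
        $stats["peak_bytes"] = (int) str_replace(" ", "", $m[1]);
    }
}

print_r($stats);
```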

Practical examples with dParse

Getting the headings of a Wikipedia article

include "dParse.php";
$wiki_root = "https://fr.wikipedia.org/wiki/";
$article = "Batman";
$doc = createDParseDOM($wiki_root.$article, array("strip_whitespaces" => true));
$contents = $doc('#bodyContent')->children('h1, h2, h3, h4, h5, h6')->text();
print_r($contents);

Output

Array
(
    [0] => Sommaire
    [1] => Origines du personnage[modifier | modifier le code]
    [2] => Évolution du personnage[modifier | modifier le code]
    [3] => De 1939 à 1964[modifier | modifier le code]
    [4] => De 1964 à 1986[modifier | modifier le code]
    [5] => Batman moderne[modifier | modifier le code]
    [6] => La renaissance DC[modifier | modifier le code]
    [7] => Description[modifier | modifier le code]
    [8] => Personnalités[modifier | modifier le code]
    [9] => Bruce Wayne[modifier | modifier le code]
    [10] => Batman[modifier | modifier le code]
    [11] => Matches Malone[modifier | modifier le code]
    [12] => Équipement[modifier | modifier le code]
    [13] => Univers[modifier | modifier le code]
    [14] => Lieux[modifier | modifier le code]
    [15] => Gotham City[modifier | modifier le code]
    [16] => Batcave[modifier | modifier le code]
    [17] => L'asile d'Arkham[modifier | modifier le code]
    [18] => Alliés[modifier | modifier le code]
    [19] => Robin[modifier | modifier le code]
    [20] => Alfred Pennyworth[modifier | modifier le code]
    [21] => Lucius Fox[modifier | modifier le code]
    [22] => Le commissaire James Gordon[modifier | modifier le code]
    [23] => Batgirl[modifier | modifier le code]
    [24] => Ace[modifier | modifier le code]
    [25] => Relation avec les autres super-héros[modifier | modifier le code]
    [26] => Les équipes de super-héros[modifier | modifier le code]
    [27] => Relations entre Batman et Superman[modifier | modifier le code]
    [28] => Vie sentimentale[modifier | modifier le code]
    [29] => Dans les comics[modifier | modifier le code]
    [30] => Dans les films[modifier | modifier le code]
    [31] => Ennemis[modifier | modifier le code]
    [32] => Analyses et critiques[modifier | modifier le code]
    [33] => Analyses[modifier | modifier le code]
    [34] => Batman justicier[modifier | modifier le code]
    [35] => Approche psychanalytique[modifier | modifier le code]
    [36] => Critiques[modifier | modifier le code]
    [37] => Séries de comics[modifier | modifier le code]
    [38] => Autres média[modifier | modifier le code]
    [39] => Radio[modifier | modifier le code]
    [40] => Serials[modifier | modifier le code]
    [41] => Série télévisée[modifier | modifier le code]
    [42] => Batman[modifier | modifier le code]
    [43] => Gotham[modifier | modifier le code]
    [44] => Dessins animés[modifier | modifier le code]
    [45] => Longs métrages[modifier | modifier le code]
    [46] => Premier long métrage[modifier | modifier le code]
    [47] => Tétralogie des années 1990[modifier | modifier le code]
    [48] => Trilogie de Christopher Nolan[modifier | modifier le code]
    [49] => DC Cinematic Universe[modifier | modifier le code]
    [50] => Jeux vidéo[modifier | modifier le code]
    [51] => Produits dérivés[modifier | modifier le code]
    [52] => Notes et références[modifier | modifier le code]
    [53] => Notes[modifier | modifier le code]
    [54] => Références bibliographiques[modifier | modifier le code]
    [55] => Autres références[modifier | modifier le code]
    [56] => Ouvrages[modifier | modifier le code]
    [57] => Articles[modifier | modifier le code]
    [58] => Articles connexes[modifier | modifier le code]
    [59] => Voir aussi[modifier | modifier le code]
    [60] => Liens externes[modifier | modifier le code]
)
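
Most headings carry the text of Wikipedia's edit links ("[modifier | modifier le code]") because text() returns everything inside the heading tag. A small post-processing sketch, using only standard PHP and three entries copied from the output above, strips that suffix:

```php
<?php
// Three entries copied from the output above.
$contents = array(
    "Sommaire",
    "Origines du personnage[modifier | modifier le code]",
    "Évolution du personnage[modifier | modifier le code]",
);

$titles = array();
foreach ($contents as $heading) {
    // Drop the edit-link text when it ends the heading.
    $titles[] = preg_replace('/\[modifier \| modifier le code\]$/u', '', $heading);
}

print_r($titles);
```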

Getting video search results and links from YouTube

$youtube_root = "https://www.youtube.com/results?search_query=";
$search = urlencode("funny potato");
$doc = createDParseDOM($youtube_root.$search);
$doc->setWhitespaceStripping(true);
$links = $doc('h3 a[href*=watch]');
$out = array();
foreach ($links as $l)
    $out[] = array("title" => $l->text(), "url" => $l->href);
    
print_r($out);

Output

Array
(
    [0] => Array
        (
            [title] => Funny Ferret Steals a Potato
            [url] => /watch?v=7IXi_ANNMC8
        )

    [1] => Array
        (
            [title] => A Potato Flew Around My Room Before You Came Vine Compilation
            [url] => /watch?v=41uD91e4GqA
        )

    [2] => Array
        (
            [title] => Garry&#39;s Mod | POTATO HIDE AND SEEK | Funny Potato Mod
            [url] => /watch?v=ri0dw7gLn14
        )

    [3] => Array
        (
            [title] => Play Doh Mr Potato Head Make Funny Faces Grow Hair Disney Play-Doh Pixar Toy Story &amp; Cookie Monster
            [url] => /watch?v=u-4o8oTjAF8
        )

    [4] => Array
        (
            [title] => The Potato Song
            [url] => /watch?v=hUhK8sS9gnY
        )

    [5] => Array
        (
            [title] => Pranks Funny : Funny Potato Cannon Prank
            [url] => /watch?v=mkT1tUqEqZw
        )

    [6] => Array
        (
            [title] => A Potato Flew Around My Room Before You Came Vine Compilation 2014
            [url] => /watch?v=L-gSakeOFqM
        )

    [7] => Array
        (
            [title] => Zombie Potatoes (Call of Duty WaW Zombies Custom Maps, Mods, &amp; Funny Moments)
            [url] => /watch?v=WcI5A4J0E0g
        )

    [8] => Array
        (
            [title] => Lord of the Rings funny (potato) edit
            [url] => /watch?v=57wK0JugLac
        )

    [9] => Array
        (
            [title] => DVBBS - Angel (DJ Potato Remix) [w/ Funny Old Man Laughing]
            [url] => /watch?v=MFWBprP1o5o
        )

    [10] => Array
        (
            [title] => GTA 5 Funny Gameplay Moments! #5 - New &quot;Swingset&quot; Glitch, Hot Potato, and More! (GTA Cannon Glitch)
            [url] => /watch?v=Ql8ovzDzx1E
        )

    [11] => Array
        (
            [title] => Potato Suicide
            [url] => /watch?v=La-FSCfQEsk
        )

    [12] => Array
        (
            [title] => CS GO BIGGEST WTF! FUNNY MOMENTS - Road to Global Potato ELITE COUNTER STRIKE MATCH MAKING
            [url] => /watch?v=AtLjUXDK9Vw
        )

    [13] => Array
        (
            [title] => Funny Potato Gun Fail
            [url] => /watch?v=ciGCXWWfjpU
        )

    [14] => Array
        (
            [title] => Call of Duty WaW Zombies Potato Edition! (Custom Map &amp; Funny Moments)
            [url] => /watch?v=fBeFRbmb_rg
        )

    [15] => Array
        (
            [title] => Funny Potato Video
            [url] => /watch?v=BybCDVD3Y-E
        )

    [16] => Array
        (
            [title] => Mr Potato Head, Pirate Island Costume - Create &amp; Play Funny Movies Cartoon iPad, iPhone
            [url] => /watch?v=4JINBHAi0vQ
        )

    [17] => Array
        (
            [title] => Battlefield 4 Random Moments 63 (Soldier Squishing, Dip Dip Potato Chip!)
            [url] => /watch?v=z6aSCVzIFM0
        )

    [18] => Array
        (
            [title] => The Potato Song
            [url] => /watch?v=q7uyKYeGPdE
        )

)
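
In this output the titles still contain HTML entities (&#39;, &amp;, &quot;) and the URLs are relative. The sketch below cleans both up with standard PHP, using two entries copied from the output; the $base value is an assumption taken from the host of the search URL.

```php
<?php
// Two entries copied from the output above.
$out = array(
    array("title" => "Garry&#39;s Mod | POTATO HIDE AND SEEK | Funny Potato Mod", "url" => "/watch?v=ri0dw7gLn14"),
    array("title" => "The Potato Song", "url" => "/watch?v=hUhK8sS9gnY"),
);

// Assumption: relative links resolve against the host used for the search.
$base = "https://www.youtube.com";

$clean = array();
foreach ($out as $video) {
    $clean[] = array(
        // ENT_QUOTES is needed so that &#39; is decoded as well.
        "title" => html_entity_decode($video["title"], ENT_QUOTES, "UTF-8"),
        "url"   => $base . $video["url"],
    );
}

print_r($clean);
```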

Happy parsing!