gm314 / diavazo
PHP 7 HTML parser
0.2.1
2017-05-06 19:15 UTC
Requires
- php: >=7.0.0
- gm314/common: master
This package is not auto-updated.
Last update: 2024-09-29 03:52:38 UTC
README
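Diavazo is a wrapper around \DOMDocument and \DOMElement. It adds some useful functionality for searching within child elements or querying by class. The HTMLDocument class can load a string, a file, or a URL, and provides some basic search methods.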
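For example, the method getElement("p .spanClass b.bClass") searches for elements, classes, and combinations of both. The example finds all <p> elements, all elements with the class spanClass, and all <b class="bClass"> elements.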
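To make the selector semantics above concrete, here is a minimal sketch using only PHP's built-in DOMDocument and DOMXPath. The helper selectorToXPath() is hypothetical and is not part of Diavazo; it only assumes that each whitespace-separated token is an independent "tag", ".class", or "tag.class" match, as the description suggests. How Diavazo implements this internally is not stated in the README.

```php
<?php
// Hypothetical helper (not Diavazo's API): turns a selector string such as
// "p .spanClass b.bClass" into one XPath union expression.
function selectorToXPath(string $selector): string
{
    $parts = [];
    foreach (preg_split('/\s+/', trim($selector), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // "tag.class" -> ["tag", "class"]; ".class" -> ["", "class"]; "tag" -> ["tag"]
        $pieces = explode('.', $token, 2);
        $xpath = '//' . ($pieces[0] !== '' ? $pieces[0] : '*');
        if (isset($pieces[1]) && $pieces[1] !== '') {
            // Common XPath 1.0 idiom for "the class attribute contains this word".
            // Class names containing quotes are not handled in this sketch.
            $xpath .= "[contains(concat(' ', normalize-space(@class), ' '), ' "
                . $pieces[1] . " ')]";
        }
        $parts[] = $xpath;
    }
    return implode(' | ', $parts);
}

$html = '<html><body><p>one</p><span class="spanClass x">two</span>'
    . '<b class="bClass">three</b><b>four</b></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect "tag:text" for every node matched by any of the three tokens.
$matches = [];
foreach ($xpath->query(selectorToXPath('p .spanClass b.bClass')) as $node) {
    $matches[] = $node->nodeName . ':' . $node->textContent;
}
print_r($matches);
```

The plain <b>four</b> element is not matched, because the token b.bClass requires both the tag and the class.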
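The result of these searches is an array of HTMLElement objects. These objects can be queried again; the difference is that the search is then applied only within that element's own child elements.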
Installation
composer require gm314/diavazo
Usage
use Diavazo\HTMLDocument;

$document = new HTMLDocument();

// load file
$document->loadFile("local.html");
$document->loadFile("http://mypage.com/test.html");

// load from string
$document->loadString("<html></html>");
HTMLDocument methods
$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");

// get element by id
$table = $document->getElementById("associateArrayTest");

// get element by tag name
$elementList = $document->getElementByTagName("div");

// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass">
$elementList = $document->getElement("p .spanClass b.bClass");

// xpath query
$title = $document->query("/html/head/title");

// get root (<html>)
$root = $document->getRootElement();
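HTMLElement child element methods
HTML elements are the result of queries such as getElementById. Further search methods can be applied to an element; they search through all of its descendants.
The method getDescendantByName("td th") allows searching for several tags at once.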
$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");
$table = $document->getElementById("table");

// will return the first tr (breadth-first search)
$table->getFirstDescendantByName("tr");

// will return all td and th elements
$tdList = $table->getDescendantByName("td th");

// will find all elements that have the class 'active'
$root = $document->getRootElement();
$elementsWithClass = $root->getDescendantWithClassName("active");

// will find all elements that have the class 'myClass' and are td or th elements
$elementsWithClass = $root->getDescendantWithClassName("myClass", "td th");

// will find all elements having only the class 'testClass'
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass");

// will find all elements having only the class 'testClass' and are td or th elements
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass", "td th");

// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass">
// that are descendants of #myId
$anyElement = $document->getElementById("myId");
$elementList = $anyElement->getElement("p .spanClass b.bClass");
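The comments above mention a breadth-first search and a "strict" class match. The following is a rough sketch of both ideas on plain \DOMElement objects. The functions firstDescendantByName() and hasClass() are hypothetical names chosen for illustration; Diavazo's own implementation may differ.

```php
<?php
// Hypothetical helpers on plain \DOMElement; not Diavazo's actual code.

/** Returns the first descendant with the given tag name in breadth-first order, or null. */
function firstDescendantByName(DOMElement $start, string $tagName)
{
    $queue = [];
    foreach ($start->childNodes as $child) {
        $queue[] = $child;
    }
    while ($queue) {
        $node = array_shift($queue);
        if (!$node instanceof DOMElement) {
            continue; // skip text nodes and comments
        }
        if (strcasecmp($node->tagName, $tagName) === 0) {
            return $node;
        }
        // Children go to the back of the queue, so shallower levels are visited first.
        foreach ($node->childNodes as $child) {
            $queue[] = $child;
        }
    }
    return null;
}

/** Loose match: class is one of the element's classes. Strict match: it is the only one. */
function hasClass(DOMElement $element, string $class, bool $strict = false): bool
{
    $classes = preg_split('/\s+/', trim($element->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
    return $strict ? $classes === [$class] : in_array($class, $classes, true);
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div><div><p id="deep" class="note active">deep</p></div>'
    . '<p id="shallow" class="active">shallow</p></div></body></html>');

$root = $dom->getElementsByTagName('div')->item(0);
$first = firstDescendantByName($root, 'p');
echo $first->getAttribute('id'), "\n"; // prints "shallow"

$deep = $dom->getElementsByTagName('p')->item(0); // first <p> in document order
```

A depth-first search would return the nested <p id="deep"> instead, since it appears earlier in document order. In this sketch, hasClass($deep, 'active') is true while hasClass($deep, 'active', true) is false, because the element also carries the class note.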
HTMLElement attribute methods
$document = new HTMLDocument();
$document->loadFile("myFile.html");
$table = $document->getElementById("myTable");

// returns null if the attribute does not exist, otherwise a string
$table->getAttributeValue("align");
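For comparison, the built-in \DOMElement::getAttribute() returns an empty string for a missing attribute, so "absent" and "empty" cannot be told apart. A null-returning accessor like the one described above could look roughly like this; attributeValueOrNull() is a hypothetical name, not Diavazo's API.

```php
<?php
// Hypothetical helper: returns null for a missing attribute, otherwise its string value.
function attributeValueOrNull(DOMElement $element, string $name)
{
    return $element->hasAttribute($name) ? $element->getAttribute($name) : null;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><table id="myTable" align="center"></table></body></html>');
$table = $dom->getElementsByTagName('table')->item(0);

$align = attributeValueOrNull($table, 'align');   // attribute exists
$border = attributeValueOrNull($table, 'border'); // attribute is missing
var_dump($align, $border);
```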
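Table to array converter
Diavazo can convert a table into an associative array or an index-based array. The associative array uses the cells of the first row as its keys.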
$document = new HTMLDocument();
$document->loadFile("tabletest.html");
$table = $document->getElementById("myTableID");

$arrayConverter = new TableToArrayConverter($table);
$array = $arrayConverter->getAsAssociativeArray();

<table id="myTableID">
    <tr>
        <td>Key1</td>
        <td>Key2</td>
    </tr>
    <tr>
        <td>Value 1</td>
        <td>Value 2</td>
    </tr>
    ...
</table>

will result in:

$array = [
    [
        "Key1" => "Value 1",
        "Key2" => "Value 2"
    ],
    ...
]
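The conversion rule (first row supplies the keys, every later row becomes one associative entry) can be sketched with the built-in DOM classes alone. The function tableToAssociativeArray() below is a hypothetical stand-in written for illustration and is not the TableToArrayConverter implementation; among other simplifications, it does not treat nested tables specially.

```php
<?php
// Hypothetical stand-in for the conversion rule; not Diavazo's converter.
function tableToAssociativeArray(DOMElement $table): array
{
    // Collect the trimmed text of every <td>/<th> cell, row by row.
    $rows = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            if ($cell instanceof DOMElement
                && in_array(strtolower($cell->tagName), ['td', 'th'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }

    // The first row provides the keys; array_shift() yields null for an empty table.
    $keys = array_shift($rows);
    if ($keys === null) {
        return [];
    }

    $result = [];
    foreach ($rows as $cells) {
        $entry = [];
        foreach ($keys as $i => $key) {
            $entry[$key] = $cells[$i] ?? null; // short rows are padded with null
        }
        $result[] = $entry;
    }
    return $result;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><table id="myTableID">'
    . '<tr><td>Key1</td><td>Key2</td></tr>'
    . '<tr><td>Value 1</td><td>Value 2</td></tr>'
    . '</table></body></html>');

$array = tableToAssociativeArray($dom->getElementsByTagName('table')->item(0));
print_r($array);
```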
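Converting a table to an array with an extractor
The following example shows how to register an extractor. The closure is called on the table data cell (<td>) and is expected to return the value that will be added to the array. The example takes the first <a> element and extracts its href attribute.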
$document = $this->getDocument();
$table = $document->getElementById("extractorTest");

$arrayConverter = new TableToArrayConverter($table);
$arrayConverter->registerExtractor("columnName", function (HTMLElement $td) {
    $a = $td->getFirstDescendantByName("a");
    return $a->getAttributeValue("href");
});

$array = $arrayConverter->getAsAssociativeArray();
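As a rough illustration of the extractor idea, the sketch below maps column names to closures and calls the matching closure on each cell, falling back to the cell text otherwise. It uses only the built-in DOM classes. The function tableToArrayWithExtractors() and the sample table are assumptions made for this example and do not reflect how TableToArrayConverter is written.

```php
<?php
// Hypothetical sketch of per-column extractors; not Diavazo's converter.
function tableToArrayWithExtractors(DOMElement $table, array $extractors): array
{
    $keys = null;
    $result = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        // Keep the cell elements themselves so extractors can inspect their children.
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            if ($cell instanceof DOMElement) {
                $cells[] = $cell;
            }
        }
        if ($keys === null) {
            // First row: its cell texts become the column names.
            $keys = array_map(function (DOMElement $c) {
                return trim($c->textContent);
            }, $cells);
            continue;
        }
        $row = [];
        foreach ($keys as $i => $key) {
            if (!isset($cells[$i])) {
                $row[$key] = null;
            } elseif (isset($extractors[$key])) {
                $row[$key] = $extractors[$key]($cells[$i]); // custom value for this column
            } else {
                $row[$key] = trim($cells[$i]->textContent); // default: plain cell text
            }
        }
        $result[] = $row;
    }
    return $result;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><table>'
    . '<tr><td>Name</td><td>Link</td></tr>'
    . '<tr><td>Docs</td><td><a href="/docs">open</a></td></tr>'
    . '<tr><td>None</td><td>no link</td></tr>'
    . '</table></body></html>');

$rows = tableToArrayWithExtractors($dom->getElementsByTagName('table')->item(0), [
    'Link' => function (DOMElement $td) {
        $a = $td->getElementsByTagName('a')->item(0);
        // Guard against cells without a link instead of calling a method on null.
        return $a !== null ? $a->getAttribute('href') : null;
    },
]);
print_r($rows);
```

The null check inside the closure is a design choice for this sketch: a cell without an <a> element yields null rather than an error.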