gm314/diavazo

PHP 7 HTML parser

0.2.1 2017-05-06 19:15 UTC



README

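Diavazo is a wrapper around \DOMDocument and \DOMElement. It adds some useful functionality for searching within child elements and for querying by class. The HTMLDocument class can load a string, a file, or a URL, and it offers a few basic search methods.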

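For example, the method getElement("p .spanClass b.bClass") can search by element, by class, or by a combination of the two. This example finds all <p> elements, all elements with the class spanClass, and all <b class="bClass"> elements.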

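The result of such a search is an array of HTMLElement objects. These objects can be queried in turn; the difference is that the search is limited to the elements nested inside them.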
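To make the selector semantics above concrete, here is a minimal sketch of how a space-separated selector such as "p .spanClass b.bClass" could be mapped onto an XPath union using PHP's built-in DOM extension. The helpers tokenToXPath and selectorToXPath are hypothetical names chosen for illustration; this is not Diavazo's implementation, only one way to obtain the behaviour the README describes.

```php
<?php
// Hypothetical helper (not part of Diavazo): turns one token such as
// "p", ".spanClass" or "b.bClass" into an XPath expression.
function tokenToXPath(string $token): string
{
    // Split "b.bClass" into tag "b" and class "bClass"; the tag may be empty.
    $parts = explode('.', $token, 2);
    $tag = $parts[0] !== '' ? $parts[0] : '*';
    if (!isset($parts[1])) {
        return "//{$tag}";
    }
    $class = $parts[1];
    // Common XPath 1.0 idiom for "the class attribute contains this class name".
    return "//{$tag}[contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')]";
}

// Hypothetical helper: tokens are OR-ed together, matching the README's description.
function selectorToXPath(string $selector): string
{
    $tokens = preg_split('/\s+/', trim($selector));
    return implode(' | ', array_map('tokenToXPath', $tokens));
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><p>one</p><span class="spanClass">two</span>'
    . '<b class="bClass">three</b><b>four</b></body></html>'
);

$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query(selectorToXPath('p .spanClass b.bClass')) as $node) {
    $names[] = $node->nodeName;
}
// $names now holds the tag names of the <p>, the <span class="spanClass">
// and the <b class="bClass">; the plain <b> is not matched.
```

The sketch does not escape quotes in class names, so it should only be fed trusted selectors.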

Installation

composer require gm314/diavazo

Usage

use Diavazo\HTMLDocument;
$document = new HTMLDocument();

// load from a local file
$document->loadFile("local.html");
// load from a URL (loadFile accepts URLs as well)
$document->loadFile("http://mypage.com/test.html");

// load from string
$document->loadString("<html></html>");

HTMLDocument methods

$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");

// get element by id
$table = $document->getElementById("associateArrayTest");

// get element by tag name
$elementList = $document->getElementByTagName("div");

// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass">  
$elementList = $document->getElement("p .spanClass b.bClass");

// xpath query
$title = $document->query("/html/head/title");

// get root (<html>)
$root = $document->getRootElement();

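HTMLElement descendant methods

An HTMLElement is the result of a query such as getElementById. Further search methods can be applied to the element; they search all of its descendants.

The method getDescendantByName("td th") searches for several tags at once.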

$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");

$table = $document->getElementById("table");

// will return the first tr (Breadth-first search)
$table->getFirstDescendantByName("tr");

// will return all td and th elements
$tdList = $table->getDescendantByName("td th");

// will find all elements that have the class 'active'
$root = $document->getRootElement();
$elementsWithClass = $root->getDescendantWithClassName("active");

// will find all elements that have the class 'myClass' and are td or th elements
$elementsWithClass = $root->getDescendantWithClassName("myClass", "td th");

// will find all elements having only the class 'testClass'
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass");

// will find all elements having only the class 'testClass' and are td or th elements
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass", "td th");

// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass"> that are descendants of #myId
$anyElement = $document->getElementById("myId");
$elementList = $anyElement->getElement("p .spanClass b.bClass");
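The comments above mention two behaviours that are easy to misread: the breadth-first order of getFirstDescendantByName and the difference between the plain and the "Strict" class lookups. The following sketch reproduces both on plain DOMElement objects. All function names here (firstDescendantByName, classList, hasClass, hasOnlyClass) are hypothetical; they illustrate the idea and are not taken from Diavazo's source.

```php
<?php
// Illustration only: hypothetical helpers on plain DOMElement, not Diavazo code.

// Breadth-first search: every element at depth d is checked before any at depth d + 1.
function firstDescendantByName(DOMElement $root, string $name): ?DOMElement
{
    $queue = [$root];
    while ($queue) {
        $current = array_shift($queue);
        foreach ($current->childNodes as $child) {
            if (!$child instanceof DOMElement) {
                continue; // skip text nodes and comments
            }
            if ($child->nodeName === $name) {
                return $child;
            }
            $queue[] = $child;
        }
    }
    return null;
}

// Splits the class attribute into its individual class names.
function classList(DOMElement $el): array
{
    return preg_split('/\s+/', $el->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
}

// Non-strict: the element carries the class, possibly among others.
function hasClass(DOMElement $el, string $class): bool
{
    return in_array($class, classList($el), true);
}

// Strict: the element carries this class and no other.
function hasOnlyClass(DOMElement $el, string $class): bool
{
    return classList($el) === [$class];
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<div id="root"><div><p id="deep" class="testClass big">deep</p></div>'
    . '<p id="shallow" class="testClass">shallow</p></div>'
);
$root = $doc->getElementsByTagName('div')->item(0);

// The shallower <p> is found first, although the deeper one comes earlier in the markup.
$first = firstDescendantByName($root, 'p');
echo $first->getAttribute('id'), "\n"; // prints "shallow"
```

A depth-first search would have returned the element with id "deep" instead, which is why the search order matters when a tag can appear at several nesting levels.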

HTMLElement attribute methods

$document = new HTMLDocument();
$document->loadFile("myFile.html");

$table = $document->getElementById("myTable");

// returns the attribute value as a string, or null if the attribute does not exist
$table->getAttributeValue("align");
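For comparison, PHP's own DOMElement::getAttribute returns an empty string when the attribute is missing, so a missing attribute and an empty one look the same. A null-returning accessor like the one described above could be sketched as follows; attributeValueOrNull is a hypothetical name, and the sketch makes no claim about how Diavazo implements getAttributeValue.

```php
<?php
// Hypothetical helper (not Diavazo code): null for a missing attribute,
// the string value otherwise.
function attributeValueOrNull(DOMElement $el, string $name): ?string
{
    return $el->hasAttribute($name) ? $el->getAttribute($name) : null;
}

$doc = new DOMDocument();
$doc->loadHTML('<table id="myTable" align="center"><tr><td>x</td></tr></table>');
$table = $doc->getElementsByTagName('table')->item(0);

$align  = attributeValueOrNull($table, 'align');   // "center"
$border = attributeValueOrNull($table, 'border');  // null: attribute is absent
$raw    = $table->getAttribute('border');          // "": plain DOM cannot tell absent from empty
```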

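Table to array converter

Diavazo can convert a table into an associative array or an index-based array. The associative array uses the cells of the first row as its keys.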

$document = new HTMLDocument();
$document->loadFile("tabletest.html");

$table = $document->getElementById("myTableID");

$arrayConverter = new TableToArrayConverter($table);
$array = $arrayConverter->getAsAssociativeArray();


<table id="myTableID">
    <tr>
        <td>Key1</td>
        <td>Key2</td>
    </tr>
    <tr>
        <td>Value 1</td>
        <td>Value 2</td>
    </tr>
    ...
</table>

A table like the one above will result in:

$array = [
    [
       "Key1" => "Value 1",
       "Key2" => "Value 2"
    ],
    ...
]
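The conversion rule (first row supplies the keys, every later row becomes one associative entry) can be sketched with the built-in DOM extension alone. The function tableToAssociativeArray below is a hypothetical stand-in written for illustration; it is not the code behind TableToArrayConverter and it ignores nested tables.

```php
<?php
// Hypothetical sketch (not Diavazo code): first row = keys, remaining rows = values.
function tableToAssociativeArray(DOMElement $table): array
{
    // Collect the trimmed text of every <td>/<th>, row by row.
    $rows = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            if ($cell instanceof DOMElement && in_array($cell->nodeName, ['td', 'th'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }

    // The first row provides the keys; an empty table yields an empty result.
    $keys = array_shift($rows) ?? [];

    $result = [];
    foreach ($rows as $cells) {
        $row = [];
        foreach ($keys as $i => $key) {
            $row[$key] = $cells[$i] ?? null; // short rows are padded with null
        }
        $result[] = $row;
    }
    return $result;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<table id="myTableID"><tr><td>Key1</td><td>Key2</td></tr>'
    . '<tr><td>Value 1</td><td>Value 2</td></tr></table>'
);
$table = $doc->getElementsByTagName('table')->item(0);
$array = tableToAssociativeArray($table);
// $array is [["Key1" => "Value 1", "Key2" => "Value 2"]]
```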

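Table to array with extractors

The following example shows how to register an extractor. The closure is called on each table data cell (<td>) of the named column and is expected to return the value that will be added to the array. The example below takes the first <a> element in the cell and extracts its href attribute.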

// $document is an already loaded HTMLDocument (obtained here from a test helper)
$document = $this->getDocument();
$table = $document->getElementById("extractorTest");

$arrayConverter = new TableToArrayConverter($table);
$arrayConverter->registerExtractor("columnName", function (HTMLElement $td) {
    $a = $td->getFirstDescendantByName("a");
    return $a->getAttributeValue("href");
});
$array = $arrayConverter->getAsAssociativeArray();
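The extractor mechanism can be pictured as a map from column name to callable: columns with a registered callable receive the cell element, all others fall back to the cell text. The sketch below shows that idea on plain DOM. The names rowCells and convertWithExtractors are hypothetical, and how Diavazo dispatches extractors internally is an assumption here, not something the README states.

```php
<?php
// Hypothetical sketch (not Diavazo code): per-column callables applied to table cells.

// Returns the <td>/<th> elements of one row.
function rowCells(DOMElement $tr): array
{
    $cells = [];
    foreach ($tr->childNodes as $cell) {
        if ($cell instanceof DOMElement && in_array($cell->nodeName, ['td', 'th'], true)) {
            $cells[] = $cell;
        }
    }
    return $cells;
}

// $extractors maps a column name to a callable that receives the cell element.
function convertWithExtractors(DOMElement $table, array $extractors): array
{
    $rows = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $rows[] = rowCells($tr);
    }

    // Header cells become the column names.
    $keys = [];
    foreach (array_shift($rows) ?? [] as $headerCell) {
        $keys[] = trim($headerCell->textContent);
    }

    $result = [];
    foreach ($rows as $cells) {
        $row = [];
        foreach ($keys as $i => $key) {
            $cell = $cells[$i] ?? null;
            if ($cell === null) {
                $row[$key] = null; // row is shorter than the header
            } elseif (isset($extractors[$key])) {
                $row[$key] = $extractors[$key]($cell); // custom extraction
            } else {
                $row[$key] = trim($cell->textContent); // default: cell text
            }
        }
        $result[] = $row;
    }
    return $result;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<table><tr><td>name</td><td>link</td></tr>'
    . '<tr><td>Docs</td><td><a href="/docs">read</a></td></tr></table>'
);
$table = $doc->getElementsByTagName('table')->item(0);

$array = convertWithExtractors($table, [
    'link' => function (DOMElement $td) {
        $a = $td->getElementsByTagName('a')->item(0);
        return $a !== null ? $a->getAttribute('href') : null; // guard against cells without a link
    },
]);
// $array is [["name" => "Docs", "link" => "/docs"]]
```

The null guard in the closure is worth copying into real extractors as well, since a cell without an <a> element would otherwise cause a call on a missing object.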