ordinary9843/html-master

Analyze and crawl the HTML structure of static and dynamic websites

v1.0.0 2024-01-02 09:42 UTC

This package is auto-updated.

Last update: 2024-09-23 09:13:22 UTC


README


Introduction

Analyze and crawl the HTML structure of static and dynamic websites.

Requirements

This library has the following requirements:

  • PHP 7.1+
  • Node.js 12+
  • A browser (the default browser path is /usr/bin/chromium)

Installation

Requirements

apt-get install nodejs
apt-get install chromium # or `chromium-browser`
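
Depending on the distribution, the browser may be installed as `chromium` or `chromium-browser`. As a minimal sketch, the check below looks for either binary before a path is passed to `setExecutablePath()`. The helper `findBrowserPath()` is hypothetical and not part of html-master, and the `/usr/bin` directory is an assumption.

```php
<?php
// Hypothetical helper, not part of html-master: returns the first path in
// $candidates that is an existing, executable file, or null if none is.
function findBrowserPath(array $candidates): ?string
{
    foreach ($candidates as $path) {
        if (is_file($path) && is_executable($path)) {
            return $path;
        }
    }

    return null;
}

// The two binary names mentioned in the install step; the directory is assumed.
$browserPath = findBrowserPath([
    '/usr/bin/chromium',
    '/usr/bin/chromium-browser',
]);

echo $browserPath ?? 'No Chromium binary found at the assumed paths.', PHP_EOL;
```

If a path is found, it can be handed to `setExecutablePath()` as shown in the usage example.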

Install the package with Composer

composer require ordinary9843/html-master

Usage

Example usage

<?php
require './vendor/autoload.php';

use Ordinary9843\HtmlMaster;

$htmlMaster = new HtmlMaster();

// When using this package for the first time, enabling debug mode is recommended.
$htmlMaster->setDebug(true);

// Set the browser path for dynamic mode.
$htmlMaster->setExecutablePath('/usr/bin/chromium');

/**
 * Set the wait time (in seconds) for dynamic mode.
 *
 * If you cannot obtain the HTML of a dynamic (SPA) page, try increasing the
 * wait time so the site's JavaScript-rendered elements can finish rendering.
 */
$htmlMaster->setWaitSeconds(5);

// Set the connection timeout and the request timeout (in seconds) for static mode.
$htmlMaster->setConnectTimeout(5);
$htmlMaster->setTimeout(5);

/**
 * Whether the crawler runs in static or dynamic mode depends on whether the browser path is set correctly.
 * Use `setExecutablePath()` to set the browser path.
 *
 * Output: [
 *  'title' => '',
 *  'description' => '',
 *  'meta' => [
 *    'keywords' => '',
 *    'description' => '',
 *    'viewport' => '',
 *    'author' => '',
 *    'copyright' => '',
 *    'robots' => '',
 *    'og' => [],
 *    'twitter' => []
 *  ],
 *  'icons' => [],
 *  'images' => [],
 *  'css' => [],
 *  'js' => []
 * ]
 */
$result = $htmlMaster->parse('https://github.com/ordinary9843');

/**
 * Get all messages.
 *
 * Output: [
 *  '[INFO] Message.',
 *  '[ERROR] Message.'
 * ]
 */
$messages = $htmlMaster->getMessages();
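
The arrays documented in the comments above can be consumed with plain PHP. The following is a minimal sketch, assuming `parse()` returns the array shape shown and `getMessages()` returns strings prefixed with a level tag such as `[ERROR]`. The helpers `summarizeParseResult()` and `filterMessages()` are hypothetical and not part of html-master, and the sample data stands in for real output.

```php
<?php
// Hypothetical helper, not part of html-master: condenses a parse() result
// of the documented shape into a title, a flag, and a few counts.
function summarizeParseResult(array $result): array
{
    return [
        'title' => $result['title'] ?? '',
        'hasDescription' => ($result['description'] ?? '') !== '',
        'imageCount' => count($result['images'] ?? []),
        'cssCount' => count($result['css'] ?? []),
        'jsCount' => count($result['js'] ?? []),
    ];
}

// Hypothetical helper: keeps only the messages that start with the given
// level tag, for example 'ERROR' matches '[ERROR] Message.'.
function filterMessages(array $messages, string $level): array
{
    $prefix = '[' . $level . ']';

    return array_values(array_filter($messages, function ($message) use ($prefix) {
        return strpos($message, $prefix) === 0;
    }));
}

// Sample data standing in for real parse() and getMessages() output.
$sampleResult = [
    'title' => 'Example',
    'description' => '',
    'meta' => [],
    'icons' => [],
    'images' => ['a.png', 'b.png'],
    'css' => ['style.css'],
    'js' => [],
];
$sampleMessages = ['[INFO] Message.', '[ERROR] Message.'];

print_r(summarizeParseResult($sampleResult));
print_r(filterMessages($sampleMessages, 'ERROR'));
```

Using `??` with defaults keeps the helper tolerant of missing keys, which may be useful if a page lacks some of the listed fields.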

Testing

composer test

License

MIT License