README

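Simple title scraping for legacy (server-rendered) websites using PHP and Goutte. Supported sources: 1. Google Scholar 2. Neliti 3. ResearchGate 4. SpringerOpen.

Headless browsers and modern web pages such as single-page applications (SPAs) are not supported.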

Installation

composer

composer require nagara/hunter-php

or

Clone from GitHub

https://github.com/naagaraa/hunterPHP.git

Code maintainer 🐐

miyukinagara

Background knowledge

Learn Goutte

Learn DomCrawler https://symfony.com.cn/doc/current/components/dom_crawler.html#form-and-link-support
browserKit https://symfony.com.cn/doc/current/components/browser_kit.html
Goutte PHP https://github.com/FriendsOfPHP/Goutte

Basic usage

Load the library

require "vendor/autoload.php";

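How does it work? The program runs the same search you would run on the original site, but through this code the titles from each source can be searched for at once and the results are stored in an array.

⚠️ Only the first page of search results is returned.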

use HunterPHP\Hunter;

$hunter = new Hunter;

echo "<h1>Web Data Extraction for Title Journal or Article at Online Journal</h1>";
echo "<h2>study case Web Data extraction for non Headless Browser</h2>";

// example: get data from SpringerOpen
echo "<h3>springer open journal : data extraction -> keyword apriori</h3>";
dump($hunter->scrap("springeropen", "apriori"));

// example: get data from Google Scholar
echo "<h3>google scholar open journal : data extraction -> keyword AI</h3>";
dump($hunter->scrap("google_scholar", "AI"));

// example: get data from Neliti
echo "<h3>neliti open journal : data extraction -> keyword AI</h3>";
dump($hunter->scrap("neliti", "AI"));

// example: get data from ResearchGate
echo "<h3>research gate open journal : data extraction -> keyword AI</h3>";
dump($hunter->scrap("research_gate", "AI"));
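
To make the "search and store the titles in an array" idea concrete, here is a minimal sketch that uses only PHP's built-in DOM extension. It is not HunterPHP's or Goutte's actual implementation. The function name `extractTitles`, the sample HTML, and the `result-title` class are assumptions invented for illustration. Real sites use their own markup, so the XPath query would differ per source.

```php
<?php
// Illustrative sketch only: NOT the library's real code.
// The HTML snippet and the "result-title" class are made up for this example.

/**
 * Collect the trimmed text of every element matching an XPath query.
 */
function extractTitles(string $html, string $xpath): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is often not perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = [];
    foreach ((new DOMXPath($dom))->query($xpath) as $node) {
        $titles[] = trim($node->textContent);
    }
    return $titles;
}

// Stand-in for a downloaded search-result page.
$html = <<<HTML
<html><body>
  <h3 class="result-title">Apriori algorithm for market basket analysis</h3>
  <h3 class="result-title">  Frequent itemset mining  </h3>
  <p>Not a title</p>
</body></html>
HTML;

$titles = extractTitles($html, '//h3[@class="result-title"]');
print_r($titles);
```

Because the page is parsed as static HTML, anything a site renders later with JavaScript never appears in the result, which is why SPAs are out of scope.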

Another example

<?php

require "vendor/autoload.php";

use HunterPHP\Hunter;

$hunter = new Hunter;

$keyword = "apriori";

$springeropen = $hunter->scrap("springeropen", $keyword);
$google_scholar = $hunter->scrap("google_scholar", $keyword);
$neliti = $hunter->scrap("neliti", $keyword);
$research_gate = $hunter->scrap("research_gate", $keyword);

$html = <<<HTML
<h1>example with table</h1>
HTML;
echo $html;

?>
<style>
    table {
        font-family: arial, sans-serif;
        border-collapse: collapse;
        width: 100%;
        /* margin: auto; */
    }

    td,
    th {
        border: 1px solid #dddddd;
        text-align: left;
        padding: 8px;
    }

    tr:nth-child(even) {
        background-color: #dddddd;
    }
</style>
<table>
    <tr>
        <th>springer open</th>
    </tr>
    <?php foreach ($springeropen as $title) : ?>
        <tr>
            <td><?= htmlspecialchars($title) ?></td>
        </tr>
    <?php endforeach; ?>
</table>
<br><br>
<table>
    <tr>
        <th>google scholar</th>
    </tr>
    <?php foreach ($google_scholar as $title) : ?>
        <tr>
            <td><?= htmlspecialchars($title) ?></td>
        </tr>
    <?php endforeach; ?>
</table>
<br><br>
<table>
    <tr>
        <th>research gate</th>
    </tr>
    <?php foreach ($research_gate as $title) : ?>
        <tr>
            <td><?= htmlspecialchars($title) ?></td>
        </tr>
    <?php endforeach; ?>
</table>
<br><br>
<table>
    <tr>
        <th>neliti</th>
    </tr>
    <?php foreach ($neliti as $title) : ?>
        <tr>
            <td><?= htmlspecialchars($title) ?></td>
        </tr>
    <?php endforeach; ?>
</table>
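
The four tables above share the same shape, so one way to cut the repetition is a small helper that builds a table from a heading and a list of titles, escaping each scraped title on the way out. This is a sketch under assumptions: `renderTitleTable` is a hypothetical function, not part of HunterPHP, and it assumes each source yields a flat array of title strings, as the `foreach` loops above suggest.

```php
<?php
// Hypothetical helper for illustration; not part of HunterPHP.

/**
 * Build a one-column HTML table from a heading and a list of titles.
 */
function renderTitleTable(string $heading, array $titles): string
{
    $rows = '';
    foreach ($titles as $title) {
        // Escape scraped text so markup inside a title cannot break the page.
        $rows .= '<tr><td>'
            . htmlspecialchars((string) $title, ENT_QUOTES, 'UTF-8')
            . "</td></tr>\n";
    }

    return "<table>\n<tr><th>"
        . htmlspecialchars($heading, ENT_QUOTES, 'UTF-8')
        . "</th></tr>\n"
        . $rows
        . "</table>\n";
}

// Sample data standing in for the array returned for one source.
$sample = ['A <b>bold</b> title', 'Plain title'];
$tableHtml = renderTitleTable('springer open', $sample);
echo $tableHtml;
```

With such a helper, each source would need a single call, for example `echo renderTitleTable('neliti', $neliti);`.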

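Further reading

V8 JavaScript engine integration - https://php.ac.cn/manual/en/book.v8js.php
PECL v8js, a V8 JavaScript engine for PHP - https://pecl.php.net/package/v8js
Chromium V8 bug tracker - https://bugs.chromium.org/p/v8/issues/list

Oops

Scraping websites with PHP is difficult because many web applications today are built with Angular, ReactJS, and similar technologies, which in practice call for a headless browser. I am thinking about building another tool with NodeJS and JavaScript, and I am still working out how a PHP engine and a JavaScript engine could communicate, or whether one language can talk to another without going through an API (application programming interface).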
