nagara / hunter-php
This library provides tools for extracting data from websites such as Google Scholar, Neliti, Springer Open, and ResearchGate.
v0.0.3
2021-10-25 05:33 UTC
Requires
- fabpot/goutte: v4.0.1
Requires (Dev)
- symfony/var-dumper: 5.4.x-dev
README
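A simple scraper, built with PHP and Goutte, that extracts titles from older, server-rendered websites: 1. Google Scholar 2. Neliti 3. ResearchGate 4. Springer Open.
It does not use a headless browser, so modern JavaScript-rendered pages such as single-page applications (SPAs) are not supported.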
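To make the limitation above concrete, here is a minimal sketch of why static scraping works on server-rendered pages but not on SPAs. It is not this library's code: it uses PHP's built-in DOM extension instead of Goutte so that it runs without dependencies, and the `extractTitles` helper, the `<h3>` selector, and both HTML snippets are assumptions made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of every <h3> element in the given HTML.
function extractTitles(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = [];
    foreach ((new DOMXPath($dom))->query('//h3') as $node) {
        $titles[] = trim($node->textContent);
    }
    return $titles;
}

// Server-rendered page: the titles are already in the HTML the server sends.
$staticHtml = '<html><body><h3>Apriori algorithm survey</h3>'
    . '<h3>Association rule mining</h3></body></html>';

// SPA shell: the server sends an empty container; titles only exist after JavaScript runs.
$spaHtml = '<html><body><div id="root"></div>'
    . '<script src="/app.js"></script></body></html>';

echo count(extractTitles($staticHtml)), "\n"; // prints 2
echo count(extractTitles($spaHtml)), "\n";    // prints 0
```

A parser that never executes JavaScript only sees what the second snippet contains, which is why such pages need a headless browser.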
Installation
Composer
composer require nagara/hunter-php
or
Clone from GitHub
https://github.com/naagaraa/hunterPHP.git
Maintainer 🐐
miyukinagara
Background knowledge
Learn Goutte
- Learn DomCrawler https://symfony.com.cn/doc/current/components/dom_crawler.html#form-and-link-support
- BrowserKit https://symfony.com.cn/doc/current/components/browser_kit.html
- Goutte PHP https://github.com/FriendsOfPHP/Goutte
Basic usage
Load the library
require "vendor/autoload.php";
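How does it work? The program runs the same search you would perform on the original site, but through this code you can search several sites for titles in one script and collect the results in arrays.
⚠️ Only the first page of search results is returned.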
use HunterPHP\Hunter;

$hunter = new Hunter;

echo "<h1>Web Data Extraction for Title Journal or Article at Online Journal</h1>";
echo "<h2>study case Web Data extraction for non Headless Browser</h2>";

// example: get data from Springer Open
echo "<h3>springer open journal : data extraction -> keyword apriori</h3>";
dump($hunter->scrap("springeropen", "apriori"));

// example: get data from Google Scholar
echo "<h3>google scholar open journal : data extraction -> keyword AI</h3>";
dump($hunter->scrap("google_scholar", "AI"));

// example: get data from Neliti
echo "<h3>neliti open journal : data extraction -> keyword AI</h3>";
dump($hunter->scrap("neliti", "AI"));

// example: get data from ResearchGate
echo "<h3>research gate open journal : data extraction -> keyword AI</h3>";
dump($hunter->scrap("research_gate", "AI"));
Another example
<?php
require "vendor/autoload.php";

use HunterPHP\Hunter;

$hunter = new Hunter;

// run the same keyword search against every supported source
$keyword = "apriori";
$springeropen = $hunter->scrap("springeropen", $keyword);
$google_scholar = $hunter->scrap("google_scholar", $keyword);
$neliti = $hunter->scrap("neliti", $keyword);
$research_gate = $hunter->scrap("research_gate", $keyword);

$html = <<<HTML
<h1>example with table</h1>
HTML;
echo $html;
?>
<style>
    table {
        font-family: arial, sans-serif;
        border-collapse: collapse;
        width: 100%;
        /* margin: auto; */
    }
    td, th {
        border: 1px solid #dddddd;
        text-align: left;
        padding: 8px;
    }
    tr:nth-child(even) {
        background-color: #dddddd;
    }
</style>
<table>
    <tr>
        <th>springer open</th>
    </tr>
    <?php foreach ($springeropen as $title) : ?>
        <tr>
            <td><?= $title ?></td>
        </tr>
    <?php endforeach; ?>
</table>
<br><br>
<table>
    <tr>
        <th>google scholar</th>
    </tr>
    <?php foreach ($google_scholar as $title) : ?>
        <tr>
            <td><?= $title ?></td>
        </tr>
    <?php endforeach; ?>
</table>
<br><br>
<table>
    <tr>
        <th>research gate</th>
    </tr>
    <?php foreach ($research_gate as $title) : ?>
        <tr>
            <td><?= $title ?></td>
        </tr>
    <?php endforeach; ?>
</table>
<br><br>
<table>
    <tr>
        <th>neliti</th>
    </tr>
    <?php foreach ($neliti as $title) : ?>
        <tr>
            <td><?= $title ?></td>
        </tr>
    <?php endforeach; ?>
</table>
Further reading
- V8 JavaScript engine integration https://php.ac.cn/manual/en/book.v8js.php
- PECL v8js, the V8 JavaScript engine for PHP https://pecl.php.net/package/v8js
- Chromium V8 bug tracker https://bugs.chromium.org/p/v8/issues/list
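Oh no
Scraping websites with PHP is hard, because many web applications today are built with Angular, ReactJS, and similar technologies; for those, you arguably need a headless browser. I am considering building another tool with Node.js and JavaScript, and I am still thinking about how a PHP engine and a JavaScript engine could communicate, or whether one language can talk to another without an API (application programming interface).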
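One possible answer to that last question, sketched under assumptions, is to let PHP start the other program as a child process and exchange JSON over its standard input and output, with no HTTP API in between. The `runJsonCommand` helper below is hypothetical and not part of this library; it relies only on PHP's built-in `proc_open` (array commands need PHP 7.4 or later).

```php
<?php
/**
 * Hypothetical helper: send $payload as JSON to a child process on stdin
 * and decode the JSON document the child prints on stdout.
 */
function runJsonCommand(array $command, array $payload): array
{
    $descriptors = [
        0 => ['pipe', 'r'], // child's stdin (we write to it)
        1 => ['pipe', 'w'], // child's stdout (we read from it)
        2 => ['pipe', 'w'], // child's stderr (we read from it)
    ];
    $process = proc_open($command, $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start child process');
    }

    // Suitable for small payloads only: large exchanges would need
    // non-blocking I/O to avoid both sides waiting on a full pipe.
    fwrite($pipes[0], json_encode($payload));
    fclose($pipes[0]);

    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    $exitCode = proc_close($process);
    if ($exitCode !== 0) {
        throw new RuntimeException("Child exited with code $exitCode: $stderr");
    }

    return json_decode($stdout, true, 512, JSON_THROW_ON_ERROR);
}
```

With Node.js installed, a call such as `runJsonCommand(['node', 'render.js'], ['keyword' => 'apriori'])` would hand the keyword to a script that reads JSON from stdin and prints JSON to stdout; `render.js` is an assumed file name used only for illustration.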