deravenedwriter/crawlengine

Crawl Engine is a PHP library that helps automate logging in to password-protected websites and retrieving the information you need from them.

1.0.1 2021-03-05 16:45 UTC



README


CrawlEngine

Crawl Engine is a PHP library that helps automate logging in to password-protected websites and retrieving the information you need from them. It does this with the help of other great libraries such as Guzzle and DomCrawler.

License: MIT

Table of Contents

- Installation
- Initializing the Engine Class
- Initializing the InputDetail Class
- Getting Input Tag Details from a Page Containing a Form
- Resolving Requests with CrawlEngine

Installation

The recommended way to install CrawlEngine is through Composer, as follows:

composer require deravenedwriter/crawlengine

Then make sure your bootstrap file loads the Composer autoloader:

require_once 'vendor/autoload.php';

Initializing the Engine Class

The Engine class performs most of CrawlEngine's functionality, including resolving requests and getting form details from a page. The Engine can be initialized as follows:

<?php
// First we need to add the namespace
use CrawlEngine\Engine;

// Now we initialize the main Engine class
$engine = new Engine();

/**
 * The Engine class accepts one optional integer parameter.
 * It is used as the default timeout (in seconds) for all web requests made with that Engine instance.
 * For example, to set the default timeout to 20 seconds:
 */

$engine = new Engine(20);

// By default, the timeout is 10 seconds.

Initializing the InputDetail Class

An InputDetail instance describes an input tag of a form. A form's input tag could look like this:

<input name='name'  type='text' value='' placeholder='Input Your Name'/>

The InputDetail class is used to pass field values to the Engine class, and it is also what the Engine returns when asked to get the form inputs of a given page. It has several properties, including name, type, value and placeholder.

We can initialize the InputDetail class as follows:

<?php
// First we need to add the namespace
use CrawlEngine\InputDetail;

/**
 * Now We Initialize the Input Detail class
 * In the example below the InputDetail is initialized with a name
 * which refers to the name of the InputDetail in question as follows
 */
$input = new InputDetail('full_name');

/**
 * The name parameter is always compulsory when instantiating any InputDetail object.
 * Other optional parameters include value and inputString:
 * value refers to the value of the input, while inputString refers to the entire HTML string of the input tag.
 * Here is an example:
 */
 
$input = new InputDetail(
    "full_name",
    "John Doe",
    "<input name='name' type='text' value='' placeholder='Input Your Name'/>"
);
 
/**
 * The purpose of the last parameter (inputString) is that all other values can be prefilled automatically,
 * so in this case I could just construct this InputDetail as follows:
 */
$input = new InputDetail("name", "", "<input name='name' type='text' value='Joe' placeholder='Input Your Name'/>");

/**
* So in this case, other values would be generated by the constructor, so:
* $input->name is equal to 'name'
* $input->value is equal to 'Joe'
* $input->type is equal to 'text'
* $input->placeholder is equal to 'Input Your Name'
*/

// You could also echo out the properties of an InputDetail Instance:

echo $input;
// The above would display as follows:
/**
* Input Detail:
* Name: name
* Value: Joe
* Placeholder: Input Your Name
* Type: text
*/

Getting Input Tag Details from a Page Containing a Form

CrawlEngine has a way of visiting a site and analysing its input tags. Say, for example, a site located at https://example.com/login has a login page as shown below:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/login">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>
    </body>
</html>

We can get an array of all the input tags contained in that page as shown:

<?php
// First we need to add the namespaces
use CrawlEngine\InputDetail;
use CrawlEngine\Engine;

$inputs = (new Engine())->getLoginFields('https://example.com/login');

// $inputs would contain an array of all the input tags
// found in the first form element on the webpage of the given uri.
// Each of them is an InputDetail instance,
// so we could display them as shown:

foreach($inputs as $input){
    echo $input;
}
// the above code would output as shown:

/**
* Input Detail:
* Name: email
* Value: 
* Placeholder: Input Your Email
* Type: email
*
* Input Detail:
* Name: password
* Value: 
* Placeholder: Input Your Password
* Type: password
*
* Input Detail:
* Name: _token
* Value: wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH
* Placeholder:
* Type: hidden
*/
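Since each element of the returned array is an InputDetail instance, you can also inspect it programmatically instead of echoing it. The following is a minimal sketch, assuming the example login page above, that only relies on the documented InputDetail properties (name, type, value) to separate the server-prefilled fields from the ones a user would have to supply:

// Continues from the $inputs array returned by getLoginFields() above.
foreach ($inputs as $input) {
    if ($input->type === 'hidden') {
        // Prefilled by the server (e.g. the CSRF token); CrawlEngine carries these over automatically.
        echo "Prefilled field: {$input->name} = {$input->value}\n";
    } else {
        // Fields you would need to provide values for yourself.
        echo "Required field: {$input->name} ({$input->type})\n";
    }
}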

As mentioned earlier, the getLoginFields function only returns the inputs found in the first form element on the page. If the page contains more than one form, for example:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/userlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

        <form method="POST" action="/adminlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

    </body>
</html>

then the function would still only return the inputs of the first form. If you want the values of the second form instead, you need to pass an extra second argument to the getLoginFields function, as shown:

$inputs = (new Engine())->getLoginFields('https://example.com/login', 2);

The code above would get the details of the second form on the page.

Resolving Requests with CrawlEngine

To make a request with Crawl Engine, you need to know a few things about the site being visited. These include the URI of the page containing the login form, the URI the form submits to, and the fields the form requires. For example, suppose a site's login form is located at https://example.com/login and is structured as shown below:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/login">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>
    </body>
</html>

The above is an example of a typical login form. From it we can see that the URI the form submits to is https://example.com/login, and that we need a valid email and password to log in. We can also see that the site generates a CSRF token used to validate the request, which is dynamic. You don't need to worry about this field, as CrawlEngine handles it automatically. You also don't need to worry about any field the server prefills, unless you want to modify it. When CrawlEngine makes a request, it fetches the form page, records all prefilled input values, combines them with the values you provide, and then makes the request. So from the page above we know that we only need to give CrawlEngine a valid email and password to make the request. The main function responsible for resolving requests is the resolveRequest method of the Engine class, and it is used as follows:

<?php
// First we need to add the namespaces
use CrawlEngine\InputDetail;
use CrawlEngine\Engine;

// we then create instances of the inputDetail class to carry values the form needs as follows:
$emailInput = new InputDetail('email', 'johndoe@mymail.com');
$passwordInput = new InputDetail('password','topSecretPassword');

// We could then arrange them in an array like so:
$formFields = [$emailInput, $passwordInput];

// We would then define the Uri where the formPage would be found:
$formPageUri = 'https://example.com/login';
// And the Submit Uri
$submitUri = 'https://example.com/login';
/**
* After logging in, we would need to retrieve some information from some password-protected areas of the site.
* Let's say these areas are located at https://example.com/dashboard and https://example.com/transactions.
* We would also define them as follows:
*/
$contentPagesUri = ['https://example.com/dashboard', 'https://example.com/transactions'];

// after which we can then make our request as shown:
$engine = new Engine();
$crawlers =  $engine->resolveRequest(
                $formPageUri,
                $submitUri,
                $formFields,
                $contentPagesUri
            );

And that's it; CrawlEngine does the rest of the magic. It visits the site, submits the details you provided together with any prefilled values found on the site that you did not override, and then, after logging in just like a normal user would, visits all of the contentPagesUri and returns each whole page as a crawler object. Say, for example, the https://example.com/dashboard page looks like this:

<html>
<head>
    <title>Example Site Dashboard Page</title>
</head>
<body>
    <section class="main-content">
        <div>
            <span id="user-number">200432234233</span>
            <span id="user-address">No.3 Washington Avenue, California</span>
            <span id="user-email">johndoe@mymail.com</span>
        </div>
    </section>

</body>
</html>

The resolveRequest function then returns an array of crawler objects, one for each content page provided. So for our request:

// $crawlers[0] will contain crawler object for https://example.com/dashboard
// $crawlers[1] will contain crawler object for https://example.com/transactions

// so I can then access values from the page as shown:

echo $crawlers[0]->filterXPath('//body/section/div/span')->text(); // this would output: '200432234233'
//or this:
echo $crawlers[0]->filter('body > section > div > span')->text(); // this would also output: '200432234233'

For more information about crawlers and how to access different values on a page, you can check the DomCrawler documentation.
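As a quick illustration, here is a short sketch of a few common DomCrawler calls you might use on the crawler objects returned above; the selectors assume the example dashboard markup shown earlier:

// A few common DomCrawler operations, assuming the dashboard page shown above.
// filter() uses CSS selectors, filterXPath() uses XPath, and each() maps over every matched node.
$dashboard = $crawlers[0];

// Grab a single element by its id:
echo $dashboard->filter('#user-email')->text(); // johndoe@mymail.com

// Collect the text of every span inside the div into an array:
$values = $dashboard->filter('section.main-content div span')->each(function ($node) {
    return $node->text();
});
// $values => ['200432234233', 'No.3 Washington Avenue, California', 'johndoe@mymail.com']

// Read an attribute from a node:
echo $dashboard->filter('section.main-content div span')->first()->attr('id'); // user-number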

By default, CrawlEngine searches the page containing the form for the input fields of the first form. Suppose the login page the crawl engine visits has more than one form, like this:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/userlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

        <form method="POST" action="/adminlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

    </body>
</html>

Then, by default, CrawlEngine refers to the first form, so the CSRF token and other prefilled inputs would come from that first form. If you want to specify that the request is for the second form instead, you can do so by passing an extra argument to the resolveRequest method, as shown below:

$crawlers =  $engine->resolveRequest(
                $formPageUri,
                $submitUri,
                $formFields,
                $contentPagesUri,
                2
            );

The code above tells CrawlEngine that you are not referring to the first form on the page, but the second.
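Putting it together, here is a minimal end-to-end sketch, assuming the two-form admin/user login page and the dashboard markup from the examples above; the credentials and URIs are placeholders, not real values:

<?php
use CrawlEngine\InputDetail;
use CrawlEngine\Engine;

$engine = new Engine(20); // 20-second default timeout

// Credentials for the admin form (the second form on the page); placeholder values.
$formFields = [
    new InputDetail('email', 'admin@mymail.com'),
    new InputDetail('password', 'topSecretPassword'),
];

// The page containing the forms, the URI the admin form submits to,
// and the protected pages we want back as crawler objects:
$crawlers = $engine->resolveRequest(
    'https://example.com/login',      // form page URI
    'https://example.com/adminlogin', // submit URI (the second form's action)
    $formFields,
    ['https://example.com/dashboard'],
    2                                 // use the second form's prefilled values (CSRF token, etc.)
);

echo $crawlers[0]->filter('#user-number')->text();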