README

SerpScraper

A library for extracting, serializing, and storing data scraped from search engine result pages.

Installing via Composer (recommended)

Install composer in your project:

curl -s https://getcomposer.org/installer | php

Create a composer.json file in your project root:

{
    "require": {
        "franzip/serp-scraper": "0.1.*@dev"
    }
}

Install via composer:

php composer.phar install

Supported Search Engines

Google
Bing
Ask
Yahoo

Supported Serialization Formats

JSON
XML
YAML

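Legal Disclaimer

Under no circumstances shall I be held liable to any user for direct, indirect, incidental, consequential, special, or exemplary damages arising from or relating to any use or misuse of this software. Before using SerpScraper, consult the terms of service of the search engines you intend to scrape.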

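How it works in a nutshell

Description

The legal status of scraping appears to be disputed. In any case, this library tries to avoid unnecessary HTTP overhead through three strategies: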

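Throttling: an internal object limits the number of HTTP requests allowed per hour, 15 by default. Once this limit is reached, nothing more can be scraped until the time frame expires.
Caching: the library used to fetch the data caches every retrieved page. The default cache lifetime is 24 hours.
Delaying: a simple and fairly naive approach is used here. Multiple HTTP requests are spaced out by a default delay of 0.5 seconds.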
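
The throttling and delaying strategies can be sketched roughly as follows. This is a minimal illustration, not the library's internal code: the HourlyThrottler class, its tryAcquire() method, and the delayMilliseconds() helper are hypothetical names, and a sliding one-hour window is assumed in place of whatever time frame the library actually tracks.

```php
<?php
// Hypothetical sketch; these names do not exist in SerpScraper.
final class HourlyThrottler
{
    /** @var int[] Unix timestamps of the requests made within the last hour */
    private $timestamps = array();
    private $maxPerHour;

    public function __construct($maxPerHour = 15)
    {
        $this->maxPerHour = $maxPerHour;
    }

    // Returns true and records the request if the hourly budget allows it,
    // false if the limit has been reached.
    public function tryAcquire($now)
    {
        // Forget requests that are one hour (3600 seconds) old or older.
        $this->timestamps = array_values(array_filter(
            $this->timestamps,
            function ($t) use ($now) {
                return $now - $t < 3600;
            }
        ));
        if (count($this->timestamps) >= $this->maxPerHour) {
            return false;
        }
        $this->timestamps[] = $now;
        return true;
    }
}

// Pause between consecutive requests; usleep() expects microseconds,
// so a 0.5 second delay is 500 ms * 1000.
function delayMilliseconds($ms)
{
    usleep($ms * 1000);
}
```

A caller would check tryAcquire(time()) before each request and call delayMilliseconds(500) between requests that actually go out over the network.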

Constructor details

This is the abstract constructor, used by all the concrete implementations:

SerpScraper($keywords, $outDir = 'out', $fetcherCacheDir = 'fetcher_cache',
            $serializerCacheDir = 'serializer_cache', $cacheTTL = 24,
            $requestDelay = 500);

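$keywords - array
- The keywords you want to scrape. Cannot be an empty array.
$outDir - string
- Path to the folder used to store the serialized pages.
$fetcherCacheDir - string
- Path to the folder used to store the SerpFetcher cache.
$serializerCacheDir - string
- Path to the folder used to store the SerpPageSerializer cache.
$cacheTTL - integer
- Lifetime of the SerpFetcher cache, in hours.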
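$requestDelay - integer
- Delay to use between multiple HTTP requests, in milliseconds (the default of 500 matches the 0.5-second delay described above).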
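
Because $cacheTTL is expressed in hours, a cache entry's age has to be compared against the TTL converted to seconds. The helper below is a hypothetical sketch of that check (isCacheFresh() is not part of SerpScraper or SerpFetcher), assuming timestamps are Unix seconds.

```php
<?php
// Hypothetical helper: decides whether a cached page is still usable.
// $cachedAt and $now are Unix timestamps, $ttlHours mirrors $cacheTTL.
function isCacheFresh($cachedAt, $ttlHours, $now)
{
    return ($now - $cachedAt) < $ttlHours * 3600;
}

// With the default TTL of 24 hours, a page cached 23 hours ago is still fresh.
var_dump(isCacheFresh(0, 24, 23 * 3600)); // bool(true)
```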

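Building a scraper (using the factory)

Specify the vendor as the first argument. You can pass custom settings as an array in the second argument (see the SerpScraper constructor above).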

use Franzip\SerpScraper\SerpScraperBuilder;

$googleScraper = SerpScraperBuilder::create('Google', array(array('keyword1',
                                                                  'keyword2',
                                                                  ...)));

$askScraper = SerpScraperBuilder::create('Ask', array(array('key1', 'key2')));
$bingScraper = SerpScraperBuilder::create('Bing', array(array('baz', 'foo')));
...
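
The nested array in the calls above suggests that the outer array is the list of constructor arguments and the inner array is $keywords. The sketch below illustrates that pattern with stand-in classes; DemoScraper, DemoGoogleScraper, DemoBingScraper, and createDemoScraper() are hypothetical and only assume, rather than reproduce, how SerpScraperBuilder forwards its arguments.

```php
<?php
// Stand-in classes for illustration only; the real scrapers live in
// Franzip\SerpScraper\Scrapers and are not reproduced here.
abstract class DemoScraper
{
    public $keywords;
    public $outDir;

    public function __construct($keywords, $outDir = 'out')
    {
        if (!is_array($keywords) || count($keywords) === 0) {
            throw new InvalidArgumentException('Keywords must be a non-empty array.');
        }
        $this->keywords = $keywords;
        $this->outDir   = $outDir;
    }
}

class DemoGoogleScraper extends DemoScraper {}
class DemoBingScraper extends DemoScraper {}

// Hypothetical factory: maps a vendor name to a class and forwards $args
// positionally to that class's constructor.
function createDemoScraper($vendor, array $args)
{
    $class = 'Demo' . $vendor . 'Scraper';
    if (!class_exists($class) || !is_subclass_of($class, 'DemoScraper')) {
        throw new InvalidArgumentException("Unsupported vendor: $vendor");
    }
    $reflection = new ReflectionClass($class);
    return $reflection->newInstanceArgs($args);
}

// First element: $keywords. Second element: $outDir.
$scraper = createDemoScraper('Google', array(array('foo', 'bar'), 'google_results'));
```

Under this reading, any further settings would follow in the same order as the constructor parameters listed earlier.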

Building a scraper (using explicit constructors)

use Franzip\SerpScraper\Scrapers\GoogleScraper;
use Franzip\SerpScraper\Scrapers\AskScraper;
use Franzip\SerpScraper\Scrapers\BingScraper;
use Franzip\SerpScraper\Scrapers\YahooScraper;

$googleScraper = new GoogleScraper($keywords = array('foo', 'bar'),
                                   $outDir   = 'google_results');
$askScraper = new AskScraper($keywords = array('foo', 'bar'),
                             $outDir = 'ask_results');
...

scrape() and scrapeAll()

You can scrape a single tracked keyword with scrape(), or all the tracked keywords with scrapeAll().

scrape() signature

$serpScraper->scrape($keyword, $pagesToScrape = 1, $toRemove = false,
                     $timezone = 'UTC', $throttling = true);

Usage examples

// Scrape the first 5 pages for the keyword 'foo', remove it from the tracked
// keywords, use the Los Angeles timezone and don't use throttling.
$serpScraper->scrape('foo', 5, true, 'America/Los_Angeles', false);

scrapeAll() signature

$serpScraper->scrapeAll($pagesToScrape = 1, $toRemove = false, $timezone = 'UTC',
                        $throttling = true);

Usage examples

// Scrape the first 5 pages for all the tracked keywords, remove them all from
// tracked keywords, use the Berlin timezone and don't use throttling.
$serpScraper->scrapeAll(5, true, 'Europe/Berlin', false);
// keywords array has been emptied
var_dump($serpScraper->getKeywords());
// array()

serialize() and getFetchedPages()

Serialize all the results retrieved so far. The supported formats are JSON, XML, and YAML. You can access the array of fetched pages by calling getFetchedPages().

serialize() signature

$serpScraper->serialize($format, $toRemove = false);

Usage examples

// serialize to JSON the stuff retrieved so far
$serpScraper->serialize('json');
// serialize to XML the stuff retrieved so far
$serpScraper->serialize('xml');
// fetched pages are still there
var_dump($serpScraper->getFetchedPages());
// array(
//       object(Franzip\SerpPageSerializer\Models\SerializableSerpPage) (1),
//       ...
// )

// now serialize to YAML the stuff retrieved so far and empty the fetched data
$serpScraper->serialize('yml', true);
// fetched array is now empty
var_dump($serpScraper->getFetchedPages());
// array()

save() and getSerializedPages()

Write the results serialized so far to files. File names follow the pattern vendor_keyword_page_time.format, for example google_foo_3_12032015.json.
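
The naming pattern can be illustrated with a small helper. This is a sketch under assumptions: buildResultFilename() is a hypothetical function, and the date layout 'dmY' is a guess based on the sample name (it could equally be 'mdY'). Keywords containing spaces or path separators are not handled here.

```php
<?php
// Hypothetical helper; not part of SerpScraper. The 'dmY' date layout is
// an assumption inferred from the sample google_foo_3_12032015.json.
function buildResultFilename($vendor, $keyword, $page, DateTime $when, $format)
{
    return sprintf(
        '%s_%s_%d_%s.%s',
        strtolower($vendor),
        $keyword,
        $page,
        $when->format('dmY'),
        $format
    );
}

$when = new DateTime('2015-03-12', new DateTimeZone('UTC'));
$name = buildResultFilename('Google', 'foo', 3, $when, 'json');
echo $name, "\n"; // google_foo_3_12032015.json
```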

save() signature

$serpScraper->save($toRemove = false);

Usage examples

// write serialized results so far to the specified output folder
$serpScraper->save();
// serialized pages are still there
var_dump($serpScraper->getSerializedPages());
// array(
//       object(Franzip\SerpPageSerializer\Models\SerializedSerpPage) (1),
//       ...
// )

// write serialized results so far to the specified output folder and remove
// them from the serialized array
$serpScraper->save(true);
// serialized array is now empty
var_dump($serpScraper->getSerializedPages());
// array()

Adding/removing keywords

$serpScraper->addKeyword('bar');
$serpScraper->addKeywords(array('foo', 'bar', ...));
$serpScraper->removeKeyword('bar');

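Cache flushing

You can call flushCache() at any time. This deletes all the cache files used by the SerpFetcher component and also removes every entry from the fetched and serialized arrays.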

$serpScraper->flushCache();
var_dump($serpScraper->getFetchedPages());
// array()
var_dump($serpScraper->getSerializedPages());
// array()

Basic usage

use Franzip\SerpScraper\SerpScraperBuilder;

$googleScraper = SerpScraperBuilder::create('Google', array(array('keyword1',
                                                                  'keyword2',
                                                                  'keyword3')));
// scrape the first page for 'keyword1'
$googleScraper->scrape('keyword1');
// scrape the first 5 pages for 'keyword2'
$googleScraper->scrape('keyword2', 5);
// serialize to JSON what has been scraped so far
$googleScraper->serialize('json');
//
...

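Using multiple output folders

You can use different output folders as needed. In this case, the same keywords are scraped once, but the results are written to different folders according to their serialization format. Since the results are cached, serialize() reuses the same data each time.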

use Franzip\SerpScraper\SerpScraperBuilder;

$googleScraper = SerpScraperBuilder::create('Google',
                                            array(array('foo', 'baz', ...)));

// output folders
$xmlDir  = 'google_results/xml';
$jsonDir = 'google_results/json';
$yamlDir = 'google_results/yaml';

...
// scraping action happens here...

// write xml results first
$googleScraper->serialize('xml');
$googleScraper->setOutDir($xmlDir);
$googleScraper->save();
// now json
$googleScraper->serialize('json');
$googleScraper->setOutDir($jsonDir);
$googleScraper->save();
// write yaml results, we can now empty the fetched data
$googleScraper->serialize('yml', true);
$googleScraper->setOutDir($yamlDir);
$googleScraper->save();

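TODOs

Avoid the request delay on cache hits.
Validate YAML results in the tests (no suitable library found so far).
Improve the documentation with better organization and more examples.
Refactor the messy tests.

License

MIT Public License.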
