snippetify/snippet-sniffer

Crawl and scrape web pages to extract snippets

1.2.4 2020-07-01 22:14 UTC

This package is auto-updated.

Last update: 2024-09-29 05:55:42 UTC


README

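Snippet Sniffer lets you extract code snippets from any website.

Features

This library lets you:

  1. Fetch code snippets using a search engine API (Google)
  2. Fetch code snippets from any web page by crawling URL seeds

Usage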

$ composer require snippetify/snippet-sniffer

Snippet Sniffer

use Snippetify\SnippetSniffer\SnippetSniffer;

// Configurations
$config = [
  // Required
  // Search engine api configuration keys
  'provider' => [
    "cx" => "your google Search engine ID",
    "key" => "your google API key",
    'name' => 'provider name (google)',
  ],
  // Optional
  // Useful for adding meta information to each snippet
  'app' => [
    "name" => "your App name",
    'version' => 'your App version',
  ],
  // Optional
  // Useful for logging
  'logger' => [
    "name" => "logger name",
    'file' => 'logger file path',
  ]
];

// Required
// Your query
$query = "your query";

// Optional
// Meta params
$meta = [
  "page" => 1,
  "limit" => 10,
];

// Fetch snippets
// @return Snippetify\SnippetSniffer\Common\Snippet[]
$snippets = SnippetSniffer::create($config)->fetch($query, $meta);
/*
 * Snippet object public attributes [
 *   title: string,
 *   code: string,
 *   description: string,
 *   tags: array, // Array of strings; also contains the snippet language
 *   meta: array
 * ]
 */
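
Because the snippet language is stored among the tags, a common follow-up step is to keep only the snippets for one language. The sketch below is a minimal illustration, not part of the package: `filterSnippetsByTag` is a hypothetical helper, and the snippets are plain stand-in objects that only mimic the public attributes listed above, so the example runs without the library or API keys.

```php
<?php

declare(strict_types=1);

/**
 * Keep only the snippets whose tags contain the given tag (hypothetical helper).
 *
 * @param object[] $snippets Objects exposing a public `tags` array.
 * @return object[]
 */
function filterSnippetsByTag(array $snippets, string $tag): array
{
    $wanted = strtolower($tag);

    return array_values(array_filter(
        $snippets,
        static function (object $snippet) use ($wanted): bool {
            // Compare case-insensitively, since tag casing is not documented.
            return in_array($wanted, array_map('strtolower', $snippet->tags), true);
        }
    ));
}

// Stand-in objects that mimic the documented public attributes (assumption).
$snippets = [
    (object) ['title' => 'Reverse a string', 'code' => 'strrev($s);', 'description' => '', 'tags' => ['php', 'string'], 'meta' => []],
    (object) ['title' => 'Sum a list', 'code' => 'sum(xs)', 'description' => '', 'tags' => ['Python'], 'meta' => []],
];

foreach (filterSnippetsByTag($snippets, 'php') as $snippet) {
    echo $snippet->title, PHP_EOL; // prints "Reverse a string"
}
```

In real use, the array returned by `fetch()` would take the place of the stand-in `$snippets` array.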

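Providers

A provider lets you fetch a seed stack (the URLs to scrape) from a search engine API. Only the Google search engine API is currently supported, but you can create your own.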

use Snippetify\SnippetSniffer\Providers\GoogleProvider;

// Search engine api configuration keys
$config = [
  "cx" => "your google Search engine ID",
  "key" => "your google API key"
];

// Your query
$query = "your query";

// Meta params
$meta = [
  "page" => 1,
  "limit" => 10,
];

// url seeds
// @return GuzzleHttp\Psr7\Uri[]
$urlSeeds = GoogleProvider::create($config)->fetch($query, $meta);

Adding a new provider to the package
  1. Git clone the project
  2. Create your new class in the Snippetify\SnippetSniffer\Providers folder
  3. Every provider must implement Snippetify\SnippetSniffer\Providers\ProviderInterface
  4. See Snippetify\SnippetSniffer\Providers\GoogleProvider for help
  5. Your fetch method must return an array of Psr\Http\Message\UriInterface
  6. Add it to the providers stack in Snippetify\SnippetSniffer\Core.php
  7. Write tests. See Snippetify\SnippetSniffer\Tests\Providers\GoogleProviderTest for help
  8. Send us a pull request

Using your own provider
  1. Your provider must implement Snippetify\SnippetSniffer\Providers\ProviderInterface
  2. See Snippetify\SnippetSniffer\Providers\GoogleProvider for help
  3. Your fetch method must return an array of Psr\Http\Message\UriInterface
  4. Pass your new provider in the configuration parameters or use the addProvider method

use Snippetify\SnippetSniffer\SnippetSniffer;

// Use Configurations
$config = [
  "providers" => [
    "provider_name" => ProviderClass::class,
    "provider_2_name" => Provider2Class::class // You can add as many as you want
  ]
];

// Or use the addProvider method as follows
SnippetSniffer::create(...)
  ->addProvider('provider_name', ProviderClass::class)
  ->addProvider('provider_2_name', Provider2Class::class) // You can add as many as you want
  ...

Scrapers

A scraper lets you scrape an HTML page and extract its snippets.

use GuzzleHttp\Psr7\Uri;
use Snippetify\SnippetSniffer\Scrapers\DefaultScraper;

// Configurations
$config = [
  // Optional
  // Useful for adding meta information to each snippet
  'app' => [
    "name" => "your App name",
    'version' => 'your App version',
  ],
  // Optional
  // Useful for logging
  'logger' => [
    "name" => "logger name",
    'file' => 'logger file path',
  ]
];

// Your url
$urlSeed = "website url to scrape";

// Fetch snippets
// @return Snippetify\SnippetSniffer\Common\Snippet[]
$snippets = (new DefaultScraper($config))->fetch(new Uri($urlSeed));

Adding a new scraper to the package
  1. Git clone the project
  2. Create your new class in the Snippetify\SnippetSniffer\Scrapers folder
  3. Every scraper must implement Snippetify\SnippetSniffer\Scrapers\ScraperInterface
  4. See Snippetify\SnippetSniffer\Scrapers\StackoverflowScraper for help
  5. Your fetch method must return an array of Snippetify\SnippetSniffer\Common\Snippet
  6. Add it to the scrapers stack in Snippetify\SnippetSniffer\Core.php
  7. Write tests. See Snippetify\SnippetSniffer\Tests\Scrapers\StackoverflowScraperTest for help
  8. Send us a pull request

Using your own scraper
  1. Your scraper must implement Snippetify\SnippetSniffer\Scrapers\ScraperInterface
  2. See Snippetify\SnippetSniffer\Scrapers\StackoverflowScraper for help
  3. Your fetch method must return an array of Snippetify\SnippetSniffer\Common\Snippet
  4. Pass your new scraper in the configuration parameters or use the addScraper method

use Snippetify\SnippetSniffer\SnippetSniffer;

// Important: the scraper's name must be the website host without the scheme, e.g. vuejs.org

// Configurations
$config = [
  "scrapers" => [
    "scraper_name" => ScraperClass::class,
    "scraper_2_name" => Scraper2Class::class // You can add as many as you want
  ]
];

// Or use the addScraper method as follows
SnippetSniffer::create(...)
  ->addScraper('scraper_name', ScraperClass::class)
  ->addScraper('scraper_2_name', Scraper2Class::class) // You can add as many as you want
  ...
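
Since a scraper must be registered under the website host without the scheme, it can help to derive that name from a full URL rather than typing it by hand. The following is a minimal sketch using only PHP's built-in `parse_url`; `scraperNameFromUrl` is a hypothetical helper, not part of the package. Whether the package expects a `www.` prefix to be kept or stripped is not stated here, so the sketch leaves the host unchanged apart from lowercasing it.

```php
<?php

declare(strict_types=1);

/**
 * Derive a scraper registration name (the host, without scheme) from a URL.
 * Hypothetical helper; returns null when no host can be extracted.
 */
function scraperNameFromUrl(string $url): ?string
{
    // parse_url() only reports a host when the URL has a scheme or starts with "//".
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = '//' . ltrim($url, '/');
    }

    $host = parse_url($url, PHP_URL_HOST);

    return is_string($host) && $host !== '' ? strtolower($host) : null;
}

echo scraperNameFromUrl('https://vuejs.org/v2/guide/'), PHP_EOL; // vuejs.org
echo scraperNameFromUrl('vuejs.org'), PHP_EOL;                    // vuejs.org
```

The returned string could then be used as the array key in the "scrapers" configuration or as the first argument to addScraper.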

Snippet Crawler

The snippet crawler lets you extract all the snippets of a website by crawling it.

use Snippetify\SnippetSniffer\WebCrawler;

// Optional
$config = [...];

// @return Snippetify\SnippetSniffer\Common\MetaSnippetCollection[]
$snippets = WebCrawler::create($config)->fetch(['your uri']);

Configuration reference

$config = [
  // Required 
  // Search engine api configuration keys
  // https://developers.google.com/custom-search/v1/introduction
  'provider' => [
    "cx" => "your google Search engine ID",
    "key" => "your google API key",
    'name' => 'provider name (google)',
  ],
  // Optional
  // Useful for adding meta information to each snippet
  'app' => [
    "name" => "your App name",
    'version' => 'your App version',
  ],
  // Optional
  // Useful for logging
  'logger' => [
    "name" => "logger name",
    'file' => 'logger file path',
  ],
  // Optional
  // Useful for scraping
  "html_tags" => [
    "snippet" => "pre[class] code, div[class] code, .highlight pre, code[class]", // Tags to fetch snippets
    "index" => "h1, h2, h3, h4, h5, h6, p, li" // Tags to index
  ],
  // Optional
  // Useful for adding new scrapers
  // The name must be the website host without the scheme i.e. not https://foo.com but foo.com
  "scrapers" => [
    "scraper_name" => ScraperClass::class,
    "scraper_2_name" => Scraper2Class::class // You can add as many as you want
  ],
  // Optional
  // Useful for adding new providers
  "providers" => [
    "provider_name" => ProviderClass::class,
    "provider_2_name" => Provider2Class::class // You can add as many as you want
  ],
  // Optional
  // Useful for web crawling
  // Please follow the link below for more information as we use Spatie crawler
  // https://github.com/spatie/crawler
  "crawler" => [
    "langs" => ['en'],
    "profile" => CrawlSubdomainsAndUniqueUri::class,
    "user_agent" => 'your user agent',
    "concurrency" => 10,
    "ignore_robots" => false,
    "maximum_depth" => null,
    "execute_javascript" => false,
    "maximum_crawl_count" => null,
    "parseable_mime_types" => 'text/html',
    "maximum_response_size" => 1024 * 1024 * 3,
    "delay_between_requests" => 250,
  ]
];

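Changelog

Please see CHANGELOG for details on recent changes.

Testing

Before running the tests, you must set the PROVIDER_NAME, PROVIDER_CX, PROVIDER_KEY, CRAWLER_URI, DEFAULT_SCRAPER_URI and STACKOVERFLOW_SCRAPER_URI keys in the phpunit.xml file.

Important: each of these URIs must point to a page that contains at least one snippet, otherwise the tests will fail. The Stackoverflow URI must be a link to a question with an accepted answer, otherwise the tests will fail.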

composer test
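
A quick pre-flight check can catch a missing key before the test suite fails for a less obvious reason. The sketch below is only an illustration: `missingTestKeys` is a hypothetical helper, and it assumes the six keys are exposed to the process as environment variables, which may not match how the package's phpunit.xml actually defines them.

```php
<?php

declare(strict_types=1);

// Key names required by the test suite, as listed in the testing instructions.
const REQUIRED_TEST_KEYS = [
    'PROVIDER_NAME',
    'PROVIDER_CX',
    'PROVIDER_KEY',
    'CRAWLER_URI',
    'DEFAULT_SCRAPER_URI',
    'STACKOVERFLOW_SCRAPER_URI',
];

/**
 * Return the required keys that are absent or blank in the given values.
 *
 * @param array<string, mixed> $values Key/value pairs, e.g. the result of getenv().
 * @return string[]
 */
function missingTestKeys(array $values): array
{
    return array_values(array_filter(
        REQUIRED_TEST_KEYS,
        static function (string $key) use ($values): bool {
            return !isset($values[$key]) || trim((string) $values[$key]) === '';
        }
    ));
}

// Assumption: the keys are available as environment variables.
$missing = missingTestKeys(getenv());
echo ($missing === [] ? 'All test keys are set.' : 'Missing: ' . implode(', ', $missing)), PHP_EOL;
```

Taking the values as an array keeps the helper independent of where the keys come from, so the same check works with `$_ENV`, `$_SERVER`, or a parsed configuration file.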

Contributing

Please see CONTRIBUTING for details.

Credits

  1. Evens Pierre

License

The MIT License (MIT). Please see the License File for more information.