snippetify / snippet-sniffer
Crawl and scrape web pages to extract snippets
1.2.4
2020-07-01 22:14 UTC
Requires
- php: ^7.2
- fabpot/goutte: ^4.0
- guzzlehttp/psr7: ^1.4
- monolog/monolog: ^1.12|^2.0
- snippetify/programming-languages: dev-master
- spatie/crawler: dev-master
Requires (Dev)
- phpunit/phpunit: ^8.0|^9.0
README
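Snippet Sniffer lets you extract code snippets from any website.
Features
This library lets you:
- Fetch code snippets using a search engine API (Google)
- Fetch code snippets from any web page by crawling URL seeds
How to use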
$ composer require snippetify/snippet-sniffer
Snippet Sniffer
use Snippetify\SnippetSniffer\SnippetSniffer;

// Configuration
$config = [
    // Required
    // Search engine API configuration keys
    'provider' => [
        "cx" => "your google Search engine ID",
        "key" => "your google API key",
        'name' => 'provider name (google)',
    ],
    // Optional
    // Useful for adding meta information to each snippet
    'app' => [
        "name" => "your App name",
        'version' => 'your App version',
    ],
    // Optional
    // Useful for logging
    'logger' => [
        "name" => "logger name",
        'file' => 'logger file path',
    ]
];

// Required
// Your query
$query = "your query";

// Optional
// Meta params
$meta = [
    "page" => 1,
    "limit" => 10,
];

// Fetch snippets
// @return Snippetify\SnippetSniffer\Common\Snippet[]
$snippets = SnippetSniffer::create($config)->fetch($query, $meta);

/*
 * Snippet object public attributes:
 * [
 *   title: string,
 *   code: string,
 *   description: string,
 *   tags: array, // Array of strings; also contains the snippet language
 *   meta: array
 * ]
 */
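Each result exposes the public attributes listed in the comment above. As a minimal sketch of consuming them, the hypothetical helper `summarizeSnippet()` below (not part of the library) builds a one-line summary from one result. A plain stand-in object mimics a snippet so the example runs without the package installed.

```php
<?php

/**
 * Hypothetical helper (not part of snippet-sniffer): build a one-line summary
 * from an object exposing the public attributes listed in the README
 * (title, code, description, tags, meta).
 */
function summarizeSnippet(object $snippet): string
{
    // Tags are an array of strings that also contains the snippet language.
    $tags = implode(', ', $snippet->tags);

    // Count code lines, ignoring trailing newlines.
    $lines = substr_count(rtrim($snippet->code), "\n") + 1;

    return sprintf('%s [%s] (%d lines)', $snippet->title, $tags, $lines);
}

// Stand-in for a Snippetify\SnippetSniffer\Common\Snippet instance (assumption:
// the real object exposes the same public attribute names).
$snippet = (object) [
    'title'       => 'Reverse a string',
    'code'        => "<?php\necho strrev('abc');\n",
    'description' => 'Uses strrev.',
    'tags'        => ['php', 'string'],
    'meta'        => [],
];

echo summarizeSnippet($snippet), PHP_EOL;
```

With real results you would loop over `$snippets` and pass each element to the helper instead of the stand-in.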
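Providers
A provider fetches a seed stack (the URLs to scrape) from a search engine API. Only the Google Search API is currently supported, but you can create your own.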
use Snippetify\SnippetSniffer\Providers\GoogleProvider;

// Search engine API configuration keys
$config = [
    "cx" => "your google Search engine ID",
    "key" => "your google API key"
];

// Your query
$query = "your query";

// Meta params
$meta = [
    "page" => 1,
    "limit" => 10,
];

// URL seeds
// @return GuzzleHttp\Psr7\Uri[]
$urlSeeds = GoogleProvider::create($config)->fetch($query, $meta);
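If you write your own provider, you will probably need to translate the `page` and `limit` meta params into whatever paging parameters your search API expects. The sketch below shows one way to do that for an API that, like the Google Custom Search JSON API, takes a 1-based `start` index and a `num` count capped at 10. The function name `toStartAndNum()` is hypothetical, and this README does not say how `GoogleProvider` maps these values internally, so treat the mapping as an assumption.

```php
<?php

/**
 * Hypothetical helper (not part of snippet-sniffer): convert page/limit meta
 * params into a 1-based start index and a result count.
 *
 * Assumption: the target API caps the count at 10 results per request.
 */
function toStartAndNum(array $meta): array
{
    // Fall back to the first page and ten results when a key is missing.
    $page  = max(1, (int) ($meta['page'] ?? 1));
    $limit = min(10, max(1, (int) ($meta['limit'] ?? 10)));

    return [
        'start' => ($page - 1) * $limit + 1, // index of the first result on this page
        'num'   => $limit,                   // number of results to request
    ];
}

print_r(toStartAndNum(['page' => 3, 'limit' => 10]));
```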
Adding a new provider to the package
- Git clone the project
- Create your new class in the Snippetify\SnippetSniffer\Providers folder
- Every provider implements Snippetify\SnippetSniffer\Providers\ProviderInterface
- See Snippetify\SnippetSniffer\Providers\GoogleProvider for guidance
- Your fetch method must return an array of Psr\Http\Message\UriInterface
- Add it to the provider stack in Snippetify\SnippetSniffer\Core.php
- Write tests. See Snippetify\SnippetSniffer\Tests\Providers\GoogleProviderTest for guidance
- Send us a pull request
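Using your own provider
- Your provider must implement Snippetify\SnippetSniffer\Providers\ProviderInterface
- See Snippetify\SnippetSniffer\Providers\GoogleProvider for guidance
- Your fetch method must return an array of Psr\Http\Message\UriInterface
- Pass your new provider in the configuration parameters or use the addProvider method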
use Snippetify\SnippetSniffer\SnippetSniffer;

// Use the configuration
$config = [
    "providers" => [
        "provider_name" => ProviderClass::class,
        "provider_2_name" => Provider2Class::class // You can add as many as you want
    ]
];

// Or use the addProvider method as follows
SnippetSniffer::create(...)
    ->addProvider('provider_name', ProviderClass::class)
    ->addProvider('provider_2_name', Provider2Class::class) // You can add as many as you want
    ...
Scrapers
A scraper scrapes an HTML page and extracts snippets from it.
use GuzzleHttp\Psr7\Uri;
use Snippetify\SnippetSniffer\Scrapers\DefaultScraper;

// Configuration
$config = [
    // Optional
    // Useful for adding meta information to each snippet
    'app' => [
        "name" => "your App name",
        'version' => 'your App version',
    ],
    // Optional
    // Useful for logging
    'logger' => [
        "name" => "logger name",
        'file' => 'logger file path',
    ]
];

// Your URL
$urlSeed = "website url to scrape";

// Fetch snippets
// @return Snippetify\SnippetSniffer\Common\Snippet[]
$snippets = (new DefaultScraper($config))->fetch(new Uri($urlSeed));
Adding a new scraper to the package
- Git clone the project
- Create your new class in the Snippetify\SnippetSniffer\Scrapers folder
- Every scraper implements Snippetify\SnippetSniffer\Scrapers\ScraperInterface
- See Snippetify\SnippetSniffer\Scrapers\StackoverflowScraper for guidance
- Your fetch method must return an array of Snippetify\SnippetSniffer\Common\Snippet
- Add it to the scraper stack in Snippetify\SnippetSniffer\Core.php
- Write tests. See Snippetify\SnippetSniffer\Tests\Scrapers\StackoverflowScraperTest for guidance
- Send us a pull request
Using your own scraper
- Your scraper must implement Snippetify\SnippetSniffer\Scrapers\ScraperInterface
- See Snippetify\SnippetSniffer\Scrapers\StackoverflowScraper for guidance
- Your fetch method must return an array of Snippetify\SnippetSniffer\Common\Snippet
- Pass your new scraper in the configuration parameters or use the addScraper method
use Snippetify\SnippetSniffer\SnippetSniffer;

// Important: the scraper's name must be the website URI without the scheme, e.g. vuejs.org

// Configuration
$config = [
    "scrapers" => [
        "scraper_name" => ScraperClass::class,
        "scraper_2_name" => Scraper2Class::class // You can add as many as you want
    ]
];

// Or use the addScraper method as follows
SnippetSniffer::create(...)
    ->addScraper('scraper_name', ScraperClass::class)
    ->addScraper('scraper_2_name', Scraper2Class::class) // You can add as many as you want
    ...
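Because the scraper name has to be the site address without its scheme, it can help to derive the key from a full URL rather than typing it by hand. The sketch below does this with PHP's built-in `parse_url()`. The function `scraperNameFor()` is hypothetical and not part of the library. Lowercasing the host is also an assumption, since this README does not describe how names are matched.

```php
<?php

/**
 * Hypothetical helper (not part of snippet-sniffer): derive a scraper name,
 * i.e. the host without the scheme, from a URL or a bare host.
 */
function scraperNameFor(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);

    // A bare host such as "vuejs.org" is parsed as a path, so retry it as a
    // scheme-relative URL ("//vuejs.org").
    if (!is_string($host) || $host === '') {
        $host = parse_url('//' . ltrim($url, '/'), PHP_URL_HOST);
    }

    if (!is_string($host) || $host === '') {
        throw new InvalidArgumentException("Cannot extract a host from: {$url}");
    }

    // Assumption: names are compared in lowercase.
    return strtolower($host);
}

echo scraperNameFor('https://vuejs.org/v2/guide/'), PHP_EOL;
```

The result can then be used as the array key under "scrapers" or as the first argument to addScraper.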
Snippet crawler
The snippet crawler extracts all the code snippets from a website by crawling it.
use Snippetify\SnippetSniffer\WebCrawler;

// Optional
$config = [...];

// @return Snippetify\SnippetSniffer\Common\MetaSnippetCollection[]
$snippets = WebCrawler::create($config)->fetch(['your uri']);
Configuration reference
$config = [
    // Required
    // Search engine API configuration keys
    // https://developers.google.com/custom-search/v1/introduction
    'provider' => [
        "cx" => "your google Search engine ID",
        "key" => "your google API key",
        'name' => 'provider name (google)',
    ],
    // Optional
    // Useful for adding meta information to each snippet
    'app' => [
        "name" => "your App name",
        'version' => 'your App version',
    ],
    // Optional
    // Useful for logging
    'logger' => [
        "name" => "logger name",
        'file' => 'logger file path',
    ],
    // Optional
    // Useful for scraping
    "html_tags" => [
        "snippet" => "pre[class] code, div[class] code, .highlight pre, code[class]", // Tags to fetch snippets
        "index" => "h1, h2, h3, h4, h5, h6, p, li" // Tags to index
    ],
    // Optional
    // Useful for adding new scrapers
    // The name must be the website host without the scheme, i.e. not https://foo.com but foo.com
    "scrapers" => [
        "scraper_name" => ScraperClass::class,
        "scraper_2_name" => Scraper2Class::class // You can add as many as you want
    ],
    // Optional
    // Useful for adding new providers
    "providers" => [
        "provider_name" => ProviderClass::class,
        "provider_2_name" => Provider2Class::class // You can add as many as you want
    ],
    // Optional
    // Useful for web crawling
    // Please follow the link below for more information, as we use the Spatie crawler
    // https://github.com/spatie/crawler
    "crawler" => [
        "langs" => ['en'],
        "profile" => CrawlSubdomainsAndUniqueUri::class,
        "user_agent" => 'your user agent',
        "concurrency" => 10,
        "ignore_robots" => false,
        "maximum_depth" => null,
        "execute_javascript" => false,
        "maximum_crawl_count" => null,
        "parseable_mime_types" => 'text/html',
        "maximum_response_size" => 1024 * 1024 * 3,
        "delay_between_requests" => 250,
    ]
];
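Changelog
Please see CHANGELOG for details on what has changed recently.
Testing
Before running the tests, you must set the PROVIDER_NAME, PROVIDER_CX, PROVIDER_KEY, CRAWLER_URI, DEFAULT_SCRAPER_URI and STACKOVERFLOW_SCRAPER_URI keys in the phpunit.xml file.
Important: each of those links must contain at least one snippet, otherwise the tests will fail. The Stackoverflow URI must point to a question with an accepted answer, otherwise the tests will fail.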
composer test
Contributing
Please see CONTRIBUTING for details.
Credits
License
The MIT License (MIT). Please see the License File for more information.