snapsearch / snapsearch-client-php
A PHP HTTP client middleware library for SnapSearch. Search engine optimisation for single page applications.
Requires
- php: >=5.3.3
- ext-curl: *
- symfony/http-foundation: ~2.1
Requires (Dev)
- codeception/codeception: 1.8.3
- stack/builder: ~1.0
- stack/callable-http-kernel: ~1.0@dev
- symfony/http-kernel: ~2.1
Suggests
- stack/builder: Required for elegant middleware construction. Version ~1.0
- symfony/http-kernel: Required for Stack PHP integration. Version ~2.1
This package is not auto-updated.
Last update: 2024-09-14 15:37:25 UTC
README
Snapsearch Client PHP is a PHP based, framework agnostic HTTP client library for SnapSearch (https://snapsearch.io/).
- Conforms to the PSR-0 standard.
- Compatible with Stack PHP and HTTP Kernel frameworks.
- Runs on HHVM. (Check Travis!)
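Snapsearch is a search engine optimisation (SEO) and robot proxy for complex front-end JavaScript & AJAX enabled (potentially realtime) HTML5 web applications.
Simple HTTP clients, such as Google's crawler and Facebook's image extraction robot, cannot execute complex JavaScript applications. Complex JavaScript applications include websites built with AngularJS, EmberJS, KnockoutJS, Dojo, Backbone.js, Ext.js, jQuery, JavascriptMVC, Meteor, SailsJS, Derby, RequireJS and more. Basically, this covers any website that uses JavaScript to bring in content and services asynchronously after the page has loaded, or that uses JavaScript to manipulate the page's content while the user is viewing it, for example with animation.
Snapsearch intercepts any request made by a search engine or robot and sends its own JavaScript enabled robot to extract your page's content and create a cached snapshot. This snapshot is then passed through your web application back to the search engine, robot or browser.
Snapsearch's robot is an automated, load balanced Firefox browser. This Firefox browser is kept up to date with the nightly releases, so we can always serve the latest in HTML5 technology. Our load balancer ensures that your requests are not held up by other users' requests.
For more details on how this works and the benefits of using it, see https://snapsearch.io/
SnapSearch provides similar libraries in other languages: https://github.com/SnapSearch/Snapsearch-Clients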
Installation
Requires PHP 5.3.3 or above and the Curl extension.
Composer
Add this to the "require" section of your composer.json:
"snapsearch/snapsearch-client-php": "~1.2"
Then run composer install or composer update.
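Native
Extract the repository into your library location, then use your own PSR-0 autoloader to autoload the classes inside src/SnapSearchClientPHP/.
You can also use the provided autoloader. First clone this project into your desired location, then write:
require_once('SnapSearch-Client-PHP/src/SnapSearchClientPHP/Bootstrap.php');
\SnapSearchClientPHP\Bootstrap::register();
If you don't want to use an autoloader, just require all the classes inside src/SnapSearchClientPHP/ except Bootstrap.php.
Note that you will have to install the dependencies manually and load them as well. Look in the "require" section of composer.json to find them.
Don't forget to include the resources/ folder, which contains the resources this library needs to run.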
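Usage
SnapSearchClientPHP should be started at the entry point of your application. This could be in a front controller, a bootstrapping process, an IoC container or middleware. For single page applications, the entry point is the code that first serves the initial HTML page.
For full documentation on the API and the API request parameters see: https://snapsearch.io/documentation
Note that you need to blacklist non-HTML resources such as sitemap.xml. This is explained in https://snapsearch.io/documentation#notes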
### Basic Usage
$client = new \SnapSearchClientPHP\Client('email', 'key');
$detector = new \SnapSearchClientPHP\Detector;
$interceptor = new \SnapSearchClientPHP\Interceptor($client, $detector);

//default to "not a robot" so that $response is defined even if an exception is thrown
$response = false;

//exceptions should be ignored in production, but during development you can check it for validation errors
try{
    $response = $interceptor->intercept();
}catch(\SnapSearchClientPHP\SnapSearchException $e){}

if($response){

    //this request is from a robot

    //status code
    header(' ', true, $response['status']); //as of PHP 5.4, you can use http_response_code($response['status']);

    //the complete $response['headers'] is not returned to the search engine due to potential content or transfer encoding issues, except for the potential location header, which is used when there is an HTTP redirect
    if(!empty($response['headers'])){
        foreach($response['headers'] as $header){
            //header names are case insensitive
            if(strtolower($header['name']) == 'location'){
                header($header['name'] . ': ' . $header['value']);
            }
        }
    }

    //content
    echo $response['html'];

}else{

    //this request is not from a robot
    //continue with normal operations...

}
Here is an example $response variable (not all of the variables are available; which ones are returned depends on your request parameters):
$response = [
    'cache'          => true, //or false
    'callbackResult' => '',
    'date'           => 1390382314,
    'headers'        => [
        [
            'name'  => 'Content-Type',
            'value' => 'text/html'
        ]
    ],
    'html'           => '<html></html>',
    'message'        => 'Success/Failed/Validation Errors',
    'pageErrors'     => [
        [
            'error' => 'Error: document.querySelector(...) is null',
            'trace' => [
                [
                    'file'      => 'filename',
                    'function'  => 'anonymous',
                    'line'      => '41',
                    'sourceURL' => 'urltofile'
                ]
            ]
        ]
    ],
    'screenshot'     => 'BASE64 ENCODED IMAGE CONTENT',
    'status'         => 200
];
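Because $response['headers'] is a list of name/value pairs rather than an associative array, looking up a single header takes a small loop. The following is a minimal sketch, assuming the response shape shown above; find_response_header is a hypothetical helper written for illustration and is not part of SnapSearchClientPHP.

```php
<?php
// Hypothetical helper (not part of SnapSearchClientPHP): returns the value of
// the first header whose name matches case-insensitively, or null if absent.
function find_response_header(array $response, $name)
{
    if (empty($response['headers'])) {
        return null;
    }
    foreach ($response['headers'] as $header) {
        if (strtolower($header['name']) === strtolower($name)) {
            return $header['value'];
        }
    }
    return null;
}

// Sample data in the shape documented above.
$sample = array(
    'status'  => 301,
    'headers' => array(
        array('name' => 'Content-Type', 'value' => 'text/html'),
        array('name' => 'location', 'value' => 'http://redirect.com/'),
    ),
    'html'    => '',
);

$location = find_response_header($sample, 'Location');
```

Comparing lowercased names matters because HTTP header names are case insensitive, so a redirect may arrive as Location or location.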
### Advanced Usage
$request_parameters = array(
    //add your API request parameters if you have any...
);

$blacklisted_routes = array(
    //add your black listed routes if you have any...
);

$whitelisted_routes = array(
    //add your white listed routes if you have any...
);

//switch this to boolean true if you wish for SnapSearchClient to check whether the URL leads to a static file
//this is expensive and time consuming, so it's better to use black listed or white listed routes
$check_file_extensions = false;

//get the Symfony\Component\HttpFoundation\Request
$symfony_http_request_object = \Symfony\Component\HttpFoundation\Request::createFromGlobals();

//if you have a custom robots.json you can choose to use that instead, use the absolute path (placeholder path shown)
$robot_json_path = '/absolute/path/to/robots.json';

//if you have a custom extensions.json you can choose that instead, use the absolute path (placeholder path shown)
$extensions_json_path = '/absolute/path/to/extensions.json';

$client = new \SnapSearchClientPHP\Client('email', 'key', $request_parameters);

$detector = new \SnapSearchClientPHP\Detector(
    $blacklisted_routes,
    $whitelisted_routes,
    $check_file_extensions,
    $symfony_http_request_object,
    $robot_json_path,
    $extensions_json_path
);

//robots can be directly accessed and manipulated
$detector->robots['match'][] = 'my_custom_bot_to_be_matched';
$detector->robots['ignore'][] = 'my_ignored_robot';

//extensions can as well, add to 'generic' or 'php'
$detector->extensions['php'][] = 'validextension';

$interceptor = new \SnapSearchClientPHP\Interceptor($client, $detector);

//your custom cache driver
$cache = new YourCustomClientSideCacheDriver;

//the before_intercept callback is called after the Detector has detected a search engine robot
//if this callback returns an array, the array will be used as the $response to $interceptor->intercept();
//use it for client side caching in order to have millisecond responses to search engines
//the after_intercept callback can be used to store the snapshot from SnapSearch as a client side cached resource
//this is of course optional as SnapSearch caches your snapshot as well!
$interceptor->before_intercept(function($url) use ($cache){

    //get cache from redis/filesystem..etc
    //returned value should be an array if successful or boolean false if the cache did not exist
    return $cache->get($url);

})->after_intercept(function($url, $response) use ($cache){

    //the cached time should be less than the cached time you passed to SnapSearch, we recommend half the SnapSearch cachetime
    $time = '12hrs';
    $cache->store($url, $response, $time);

});

//default to "not a robot" so that $response is defined even if an exception is thrown
$response = false;

//exceptions should be ignored in production, but during development you can check it for validation errors
try{
    $response = $interceptor->intercept();
}catch(\SnapSearchClientPHP\SnapSearchException $e){}

if($response){

    //this request is from a robot

    //status code
    header(' ', true, $response['status']); //as of PHP 5.4, you can use http_response_code($response['status']);

    //the complete $response['headers'] is not returned to the search engine due to potential content or transfer encoding issues, except for the potential location header, which is used when there is an HTTP redirect
    if(!empty($response['headers'])){
        foreach($response['headers'] as $header){
            if(strtolower($header['name']) == 'location'){
                header($header['name'] . ': ' . $header['value']);
            }
        }
    }

    //content
    echo $response['html'];

}else{

    //this request is not from a robot
    //continue with normal operations...

}
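The callbacks above only require a cache object with a get($url) method that returns an array or boolean false, and a store($url, $response, $time) method. The following is a minimal sketch of that contract. ArraySnapshotCache is a hypothetical class written for illustration, and it assumes the time argument is an integer number of seconds rather than the '12hrs' placeholder string used above. It keeps entries in memory only, so a real driver would persist them to Redis or the filesystem.

```php
<?php
// Hypothetical in-memory cache driver exposing the get()/store() methods that
// the before_intercept and after_intercept callbacks rely on.
class ArraySnapshotCache
{
    private $entries = array();

    // Returns the cached response array, or false if missing or expired.
    public function get($url)
    {
        if (!isset($this->entries[$url])) {
            return false;
        }
        $entry = $this->entries[$url];
        if ($entry['expires'] <= time()) {
            unset($this->entries[$url]);
            return false;
        }
        return $entry['response'];
    }

    // Stores a response for $ttl seconds (assumption: integer seconds).
    public function store($url, array $response, $ttl)
    {
        $this->entries[$url] = array(
            'response' => $response,
            'expires'  => time() + (int) $ttl,
        );
    }
}

$cache = new ArraySnapshotCache;

// 43200 seconds is 12 hours.
$cache->store(
    'http://example.com/',
    array('status' => 200, 'html' => '<html></html>'),
    43200
);

$hit  = $cache->get('http://example.com/');      // the stored array
$miss = $cache->get('http://example.com/other'); // never stored
```

Returning boolean false on a miss is what lets the interceptor fall through to a fresh SnapSearch request, since only an array return value is used as the response.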
### Stack PHP Usage
Stack PHP is an HTTP kernel middleware framework for PHP, similar to Ruby Rack or Node Connect. The following example uses PHP 5.4 code.
use SnapSearchClientPHP\Client;
use SnapSearchClientPHP\Detector;
use SnapSearchClientPHP\Interceptor;
use Symfony\Component\HttpFoundation\Request;

//replace this placeholder with your HTTP Kernel core controller
$app = $your_http_kernel;

$stack = (new \Stack\Builder)->push(
    '\SnapSearchClientPHP\StackInterceptor',
    //the parentheses around "new Interceptor(...)" are required before chaining method calls
    (new Interceptor(
        new Client('email', 'key'),
        new Detector
    ))->before_intercept(function($url){
        //before interception callback (optional and chainable)
    })->after_intercept(function($url, $response){
        //after interception callback (optional and chainable)
    }),
    function(array $response){

        //this callback is completely optional, it allows you to customise your response
        //the $response array comes from SnapSearch and contains [(string) 'status', (array) 'headers', (string) 'html']
        //remember $response['headers'] is in this format:
        //[
        //    [
        //        'name'  => 'Location',
        //        'value' => 'http://redirect.com/'
        //    ]
        //]
        //it's an array of arrays which contain name and value properties
        //it's recommended to not pass through all of the headers, due to possible encoding problems
        //your server will already output the necessary headers anyway
        //however we are passing through the location header if it exists
        $headers = array_filter($response['headers'], function($header){
            return strtolower($header['name']) == 'location';
        });

        return [
            'status'  => $response['status'],
            'headers' => $headers,
            'html'    => $response['html']
        ];

    },
    function($exception, $request){

        //this is the exception callback and it's completely optional
        //it will only be called if a SnapSearchException is raised
        //which only happens if SnapSearch's servers are temporarily offline
        //if there is an exception, this middleware will simply pass to the next layer
        //if you want to stop and inspect or log the actual exception, this is where you can do it

    }
);

$app = $stack->resolve($app);

$request = Request::createFromGlobals();
$response = $app->handle($request)->send();
$app->terminate($request, $response);

//or just do this if you have Stack\run
//\Stack\run($app);
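The boolean $check_file_extensions in the Detector constructor is for applications that might serve static files. Usually the HTTP server serves static files and those requests never get proxied to the application, which is why this boolean is false by default. However, if your application does serve static files, you can switch it to true to prevent static file routes from being intercepted.
Blacklisting the static file routes may be more efficient or easier. It has the added benefit of letting you exclude routes to binary resources that may not end in a particular file extension, for example streaming audio or video.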
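To see why an extension check cannot cover every binary route, consider the sketch below. It is not the library's implementation: url_has_static_extension and the extension list are hypothetical and only illustrate the idea of inspecting the path of a URL.

```php
<?php
// Hypothetical helper, not the library's implementation: reports whether the
// path component of a URL ends in one of the given file extensions.
function url_has_static_extension($url, array $extensions)
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $extension !== '' && in_array($extension, $extensions, true);
}

// Illustrative list only; the library ships its own extensions.json.
$static_extensions = array('png', 'mp3', 'xml');

// A file with an extension is recognised, regardless of case or query string.
$image = url_has_static_extension('http://example.com/img/logo.PNG?v=2', $static_extensions);

// A streaming route without an extension is not, so it needs a blacklisted route.
$stream = url_has_static_extension('http://example.com/stream/audio', $static_extensions);
```

Here $image is true and $stream is false, which is the case the paragraph above describes: a route such as /stream/audio can only be excluded by blacklisting it.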
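SnapSearchClientPHP can of course be used in other areas, such as enhanced JavaScript scraping, so it does not need to sit at the entry point if you are using it for other purposes. In that case, just use SnapSearchClientPHP\Client to send requests to the SnapSearch API.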
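Reverse Proxies
SnapSearch-Client-PHP uses the Symfony HTTP Foundation Request object as its abstraction of the HTTP request. This gives you considerable flexibility in constructing the HTTP request, especially when you are behind a reverse proxy such as a load balancer. Behind a reverse proxy, some information, such as the request protocol, may not be in the usual place. You can configure the Symfony HTTP Foundation Request object to handle these edge cases and pass your instance to the Detector. For more information see: https://symfony.com.cn/doc/current/components/http_foundation/trusting_proxies.html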
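The sketch below illustrates what "not in the usual place" means for the request protocol. It is framework free and is not Symfony's implementation: resolve_request_scheme is a hypothetical helper, and the proxy address comes from a documentation IP range. In a real application you would configure trusted proxies on the Symfony Request object as described at the link above, rather than reading the headers yourself.

```php
<?php
// Hypothetical helper, not Symfony's implementation: resolves the request
// scheme, trusting X-Forwarded-Proto only when the request arrived from a
// proxy address that is explicitly trusted.
function resolve_request_scheme(array $server, array $trusted_proxies)
{
    $remote = isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';

    if (in_array($remote, $trusted_proxies, true)
        && !empty($server['HTTP_X_FORWARDED_PROTO'])
    ) {
        return strtolower($server['HTTP_X_FORWARDED_PROTO']) === 'https' ? 'https' : 'http';
    }

    // No trusted proxy involved: fall back to the server's own HTTPS flag.
    $https = isset($server['HTTPS']) ? strtolower($server['HTTPS']) : '';
    return ($https !== '' && $https !== 'off') ? 'https' : 'http';
}

$trusted = array('192.0.2.10'); // assumed load balancer address

// The load balancer terminated TLS and forwarded the request over plain HTTP.
$behind_proxy = resolve_request_scheme(
    array('REMOTE_ADDR' => '192.0.2.10', 'HTTP_X_FORWARDED_PROTO' => 'https'),
    $trusted
);

// An untrusted client sending the same header is not believed.
$spoofed = resolve_request_scheme(
    array('REMOTE_ADDR' => '203.0.113.5', 'HTTP_X_FORWARDED_PROTO' => 'https'),
    $trusted
);
```

Restricting the forwarded header to known proxy addresses is the important design choice: without it, any client could claim to be using HTTPS and change the URL that gets snapshotted.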
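Development
Install or update the dependencies with composer:
composer update
Make your changes, synchronise, then create a new tag:
git tag MAJOR.MINOR.PATCH
git push
git push --tags
Packagist is integrated into the Github service hooks, so it will automatically publish the new package.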
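Tests
Unit tests are written using Codeception. Codeception has already been bootstrapped (codecept bootstrap). To run the tests use codecept run, or use codecept run --debug to show debug information. If you change the Codeception configuration files or add extra functions to the helpers, make sure to run codecept build so that the settings take effect.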