snapsearch/snapsearch-client-php

PHP HTTP client middleware library for SnapSearch. Provides search engine optimisation for single page applications.

1.2.2 2015-04-30 11:11 UTC

This package is not auto-updated.

Last update: 2024-09-14 15:37:25 UTC


README


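Snapsearch Client PHP is a PHP-based, framework-agnostic HTTP client library for SnapSearch (https://snapsearch.io/).

  • PSR-0 compliant.
  • Compatible with Stack PHP or HTTP Kernel frameworks.
  • Runs on HHVM. (Check Travis!)

Snapsearch is a search engine optimisation (SEO) and robot proxy for complex front-end JavaScript & AJAX enabled (potentially realtime) HTML5 web applications.

Simple HTTP clients such as Google's crawler and Facebook's image extraction robot cannot execute complex JavaScript applications. Complex JavaScript applications include websites built with AngularJS, EmberJS, KnockoutJS, Dojo, Backbone.js, Ext.js, jQuery, JavascriptMVC, Meteor, SailsJS, Derby, RequireJS and more. In short, this covers any website that uses JavaScript to bring in content and services asynchronously after page load, or to manipulate the page content while the user is viewing it, such as animation.

Snapsearch intercepts any request made by a search engine or robot and sends its own JavaScript-enabled robot to extract your page's content and create a cached snapshot. This snapshot is then passed through your web application back to the search engine, robot or browser.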
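As a rough illustration of the first step, deciding whether a request comes from a robot at all, here is a minimal sketch that compares the User-Agent string against a list of names to match and a list of names to ignore. The function name looks_like_robot and the sample lists are hypothetical. The library's own Detector also takes routes and file extensions into account, so this is a sketch of the idea, not its implementation.

```php
<?php
// Hypothetical helper: returns true when the user agent contains one of the
// $match names and none of the $ignore names (case-insensitive substrings).
function looks_like_robot($user_agent, array $match, array $ignore)
{
    foreach ($ignore as $name) {
        if (stripos($user_agent, $name) !== false) {
            return false;
        }
    }
    foreach ($match as $name) {
        if (stripos($user_agent, $name) !== false) {
            return true;
        }
    }
    return false;
}

// Sample lists, made up for this example.
$match  = array('Googlebot', 'facebookexternalhit');
$ignore = array('my_ignored_robot');

var_dump(looks_like_robot('Mozilla/5.0 (compatible; Googlebot/2.1)', $match, $ignore));       // bool(true)
var_dump(looks_like_robot('Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0', $match, $ignore)); // bool(false)
```

The ignore list is checked first so that an explicitly ignored client is never treated as a robot, even if it also contains a matched name.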

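Snapsearch's robot is an automatically load-balanced Firefox browser. This Firefox browser is kept up to date with the nightly builds, so we always support the latest HTML5 technologies. Our load balancer ensures that your requests are not held up by other users' requests.

For more details on how this works and the benefits of using it, see https://snapsearch.io/

SnapSearch provides similar libraries in other languages: https://github.com/SnapSearch/Snapsearch-Clients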

Installation

Requires PHP 5.3.3 or later and the Curl extension.

Composer

Add this to the "require" section of your composer.json:

"snapsearch/snapsearch-client-php": "~1.2"

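Then run composer install or composer update.

Local

Simply extract the repository into your library location. Then use your own PSR-0 autoloader to autoload the classes inside src/SnapSearchClientPHP/.

You can also use the supplied autoloader. First clone this project into your desired location, then write: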

require_once('SnapSearch-Client-PHP/src/SnapSearchClientPHP/Bootstrap.php');
\SnapSearchClientPHP\Bootstrap::register();

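If you don't want to use an autoloader, simply require all the classes inside src/SnapSearchClientPHP/ except Bootstrap.php.

Note that you must install and load the dependencies manually. Look at the "require" section of the composer.json file to find them.

Don't forget to include the resources/ folder, which contains resources this library needs in order to run.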
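If you go the "own PSR-0 autoloader" route, the following minimal sketch shows the general shape. The helper name psr0_path and the base directory are assumptions for illustration; adjust the directory to wherever you extracted the repository.

```php
<?php
// Hypothetical helper: maps a class name to a file path following PSR-0.
// Namespace separators become directory separators; underscores in the
// class name itself (not in the namespace) also become directory separators.
function psr0_path($class, $base_dir)
{
    $class = ltrim($class, '\\');
    $path  = '';
    $pos   = strrpos($class, '\\');
    if ($pos !== false) {
        $namespace = substr($class, 0, $pos);
        $class     = substr($class, $pos + 1);
        $path      = str_replace('\\', '/', $namespace) . '/';
    }
    $path .= str_replace('_', '/', $class) . '.php';
    return rtrim($base_dir, '/') . '/' . $path;
}

// Register the autoloader. The base directory below is an assumption about
// where the repository lives relative to this file.
spl_autoload_register(function ($class) {
    $file = psr0_path($class, __DIR__ . '/SnapSearch-Client-PHP/src');
    if (is_file($file)) {
        require $file;
    }
});

echo psr0_path('SnapSearchClientPHP\\Client', 'src'), "\n"; // prints src/SnapSearchClientPHP/Client.php
```

This only covers class loading. The dependencies listed in composer.json still need to be loaded separately.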

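Usage

SnapSearchClientPHP should be started at the entry point of your application. This could be inside a front controller, a bootstrapping process, an IoC container or middleware. For single page applications, the entry point is the code that first serves the initial HTML page.

For full documentation of the API and the API request parameters, see: https://snapsearch.io/documentation

Note that you need to blacklist non-HTML resources such as sitemap.xml. This is explained at https://snapsearch.io/documentation#notes.

###Basic Usage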

$client = new \SnapSearchClientPHP\Client('email', 'key');
$detector = new \SnapSearchClientPHP\Detector;
$interceptor = new \SnapSearchClientPHP\Interceptor($client, $detector);

//exceptions should be ignored in production, but during development you can check it for validation errors
$response = false; //stays false if intercept() throws, so the normal path below is taken
try{

    $response = $interceptor->intercept();

}catch(SnapSearchClientPHP\SnapSearchException $e){}

if($response){

    //this request is from a robot

    //status code
    header(' ', true, $response['status']); //as of PHP 5.4, you can use http_response_code($response['status']);
    
    //the complete $response['headers'] is not returned to the search engine due to potential content or transfer encoding issues, except for the potential location header, which is used when there is an HTTP redirect
    if(!empty($response['headers'])){
        foreach($response['headers'] as $header){
            if(strtolower($header['name']) == 'location'){ //header names are case-insensitive
                header($header['name'] . ': ' . $header['value']);
            }
        }
    }

    //content
    echo $response['html'];

}else{

    //this request is not from a robot
    //continue with normal operations...

}

Here is an example $response variable (not all keys are always present; which ones are depends on your request parameters):

$response = [
    'cache'             => true, //or false
    'callbackResult'    => '',
    'date'              => 1390382314,
    'headers'           => [
        [
            'name'  => 'Content-Type',
            'value' => 'text/html'
        ]
    ],
    'html'              => '<html></html>',
    'message'           => 'Success/Failed/Validation Errors',
    'pageErrors'        => [
        [
            'error'   => 'Error: document.querySelector(...) is null',
            'trace'   => [
                [
                    'file'      => 'filename',
                    'function'  => 'anonymous',
                    'line'      => '41',
                    'sourceURL' => 'urltofile'
                ]
            ]
        ]
    ],
    'screenshot'        => 'BASE64 ENCODED IMAGE CONTENT',
    'status'            => 200
];
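During development it can help to turn the pageErrors part of such a response into readable lines. The sketch below assumes the array shape shown above; the helper name summarize_page_errors is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper: turns $response['pageErrors'] into readable lines.
// Missing keys are tolerated because not every response contains them.
function summarize_page_errors(array $response)
{
    $lines  = array();
    $errors = isset($response['pageErrors']) ? $response['pageErrors'] : array();
    foreach ($errors as $page_error) {
        $line = $page_error['error'];
        if (!empty($page_error['trace'])) {
            // Only the first trace frame is reported, to keep each line short.
            $frame = $page_error['trace'][0];
            $line .= ' (' . $frame['file'] . ':' . $frame['line'] . ')';
        }
        $lines[] = $line;
    }
    return $lines;
}

$response = array(
    'status'     => 200,
    'pageErrors' => array(
        array(
            'error' => 'Error: document.querySelector(...) is null',
            'trace' => array(
                array('file' => 'filename', 'function' => 'anonymous', 'line' => '41', 'sourceURL' => 'urltofile'),
            ),
        ),
    ),
);

print_r(summarize_page_errors($response));
```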

###Advanced Usage

$request_parameters = array(
    //add your API request parameters if you have any...
);

$blacklisted_routes = array(
    //add your black listed routes if you have any...
);

$whitelisted_routes = array(
    //add your white listed routes if you have any...
);

$check_file_extensions = //if you wish for SnapSearchClient to check if the URL leads to a static file, switch this on to a boolean true, however this is expensive and time consuming, so it's better to use black listed or white listed routes

$symfony_http_request_object = //get the Symfony\Component\HttpFoundation\Request

$robot_json_path = //if you have a custom robots.json you can choose to use that instead, use the absolute path

$extensions_json_path = //if you have a custom extensions.json you can choose to use that instead, use the absolute path

$client = new \SnapSearchClientPHP\Client('email', 'key', $request_parameters);

$detector = new \SnapSearchClientPHP\Detector(
    $blacklisted_routes, 
    $whitelisted_routes, 
    $check_file_extensions,
    $symfony_http_request_object,
    $robot_json_path,
    $extensions_json_path
);

//robots can be accessed and manipulated directly
$detector->robots['match'][] = 'my_custom_bot_to_be_matched';
$detector->robots['ignore'][] = 'my_ignored_robot';

//extensions can as well, add to 'generic' or 'php'
$detector->extensions['php'][] = 'validextension';

$interceptor = new \SnapSearchClientPHP\Interceptor($client, $detector);

//your custom cache driver
$cache = new YourCustomClientSideCacheDriver;

//the before_intercept callback is called after the Detector has detected a search engine robot
//if this callback returns an array, the array will be used as the $response to $interceptor->intercept();
//use it for client side caching in order to have millisecond responses to search engines
//the after_intercept callback can be used to store the snapshot from SnapSearch as a client side cached resource
//this is of course optional as SnapSearch caches your snapshot as well!
$interceptor->before_intercept(function($url) use ($cache){

    //get cache from redis/filesystem..etc
    //the returned value should be an array if successful, or boolean false if the cache entry did not exist
    return $cache->get($url); 
    
})->after_intercept(function($url, $response) use ($cache){

    //the cache time should be less than the cache time you passed to SnapSearch, we recommend half the SnapSearch cachetime
    $time = '12hrs';
    $cache->store($url, $response, $time);
    
});

//exceptions should be ignored in production, but during development you can check it for validation errors
$response = false; //stays false if intercept() throws, so the normal path below is taken
try{

    $response = $interceptor->intercept();

}catch(SnapSearchClientPHP\SnapSearchException $e){}

if($response){

    //this request is from a robot

    //status code
    header(' ', true, $response['status']); //as of PHP 5.4, you can use http_response_code($response['status']);
    
    //the complete $response['headers'] is not returned to the search engine due to potential content or transfer encoding issues, except for the potential location header, which is used when there is an HTTP redirect
    if(!empty($response['headers'])){
        foreach($response['headers'] as $header){
            if(strtolower($header['name']) == 'location'){
                header($header['name'] . ': ' . $header['value']);
            }
        }
    }
    
    //content
    echo $response['html'];

}else{

    //this request is not from a robot
    //continue with normal operations...

}
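The YourCustomClientSideCacheDriver placeholder above only needs a get and a store method for the two callbacks shown. A minimal file-based sketch might look like the following. The class name FileCacheDriver, the one-file-per-URL layout and the TTL given in seconds (rather than the '12hrs' string used above) are all assumptions made for illustration.

```php
<?php
// Hypothetical cache driver for the before_intercept/after_intercept callbacks.
// get() returns the cached array, or boolean false on a miss or an expired entry.
class FileCacheDriver
{
    private $dir;

    public function __construct($dir)
    {
        $this->dir = rtrim($dir, '/');
        if (!is_dir($this->dir)) {
            mkdir($this->dir, 0777, true);
        }
    }

    // One JSON file per URL, named after a hash of the URL.
    private function path($url)
    {
        return $this->dir . '/' . sha1($url) . '.json';
    }

    public function get($url)
    {
        $file = $this->path($url);
        if (!is_file($file)) {
            return false;
        }
        $entry = json_decode(file_get_contents($file), true);
        if (!is_array($entry) || !isset($entry['expires'], $entry['response'])) {
            return false;
        }
        if ($entry['expires'] < time()) {
            return false;
        }
        return $entry['response'];
    }

    // $ttl is in seconds (an assumption of this sketch).
    public function store($url, array $response, $ttl)
    {
        $entry = array('expires' => time() + $ttl, 'response' => $response);
        file_put_contents($this->path($url), json_encode($entry), LOCK_EX);
    }
}

$cache = new FileCacheDriver(sys_get_temp_dir() . '/snapsearch-cache-example');
$cache->store('http://example.com/', array('status' => 200, 'html' => '<html></html>'), 43200);
var_dump($cache->get('http://example.com/') !== false); // a fresh entry is a hit
var_dump($cache->get('http://example.com/missing'));    // never stored, so false
```

Returning boolean false on a miss matches what the before_intercept comment above asks for, so the interceptor can fall through to a real SnapSearch request.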

###Stack PHP Usage

Stack PHP is an HTTP kernel middleware framework for PHP, similar to Ruby Rack or Node Connect. The example below uses PHP 5.4 code.

$app =  //HTTP Kernel core controller

$stack = (new \Stack\Builder)->push(
    '\SnapSearchClientPHP\StackInterceptor',
    (new Interceptor( //wrapped in parentheses so the method calls can be chained on the new object
        new Client('email', 'key'), 
        new Detector
    ))->before_intercept(function($url){
        //before interception callback (optional and chainable)
    })->after_intercept(function($url, $response){
        //after interception callback (optional and chainable)
    }),
    function(array $response){

        //this callback is completely optional, it allows you to customise your response
        //the $response array comes from SnapSearch and contains [(string) 'status', (array) 'headers', (string) 'html']

        //remember $response['headers'] is in this format:
        //[
        //    [
        //        'name'  => 'Location',
        //        'value' => 'http://redirect.com/'
        //    ]
        //]
        //it's an array of arrays which contain name and value properties

        //it's recommended to not pass through all of the headers, due to possible encoding problems
        //your server will already output the necessary headers anyway
        //however we are passing through the location header if it exists
        $headers = array_filter($response['headers'], function($header){
            if(strtolower($header['name']) == 'location'){
                return true;
            }
            return false;
        });

        return [
            'status'    => $response['status'],
            'headers'   => $headers,
            'html'      => $response['html']
        ];

    },
    function($exception, $request){

        //this is the exception callback and it's completely optional
        //it will only be called if a SnapSearchException is raised
        //which only happens if SnapSearch's servers are temporarily offline
        //if there is an exception, this middleware will simply pass to the next layer
        //if you want to stop and inspect or log the actual exception, this is where you can do it

    }
);

$app = $stack->resolve($app);

$request  = Request::createFromGlobals();
$response = $app->handle($request)->send();
$app->terminate($request, $response);
//or just do this if you have Stack\run
//\Stack\run($app);

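The boolean $check_file_extensions in the Detector constructor is for applications that may serve static files. Normally the HTTP server serves static files and those requests are never proxied to the application, which is why this boolean defaults to false. However, if your application does serve static files, you can switch it to true to prevent static file routes from being intercepted.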
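To make the idea of a file extension check concrete, here is a minimal sketch that treats a URL as a static file when its path ends in an extension outside a list of page-producing extensions. The function name looks_like_static_file and the sample extension list are hypothetical; this is not the Detector's actual code.

```php
<?php
// Hypothetical helper: true when the URL path ends in a file extension that
// is not in the list of extensions treated as producing HTML pages.
function looks_like_static_file($url, array $valid_extensions)
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path component, or a malformed URL
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($extension === '') {
        return false; // no extension, e.g. /blog/post-1
    }
    return !in_array($extension, $valid_extensions, true);
}

$valid = array('html', 'htm', 'php'); // sample list, made up for this example

var_dump(looks_like_static_file('http://example.com/images/logo.png', $valid));  // bool(true)
var_dump(looks_like_static_file('http://example.com/index.php?page=2', $valid)); // bool(false)
var_dump(looks_like_static_file('http://example.com/blog/post-1', $valid));      // bool(false)
```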

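Blacklisting static file routes may be more efficient or easier. It has the added benefit that you can exclude routes to binary resources that do not end in a particular file extension, such as streaming audio or video.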
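A minimal sketch of what route blacklisting amounts to, assuming each route is written as a regular expression matched against the request path. That matching rule is an assumption here, so check the SnapSearch documentation for the exact format the Detector expects. The helper name route_matches and the sample routes are hypothetical.

```php
<?php
// Hypothetical helper: true when $path matches any of the route patterns.
// Each pattern is treated as a regular expression body (an assumption of this
// sketch) and must not contain the '~' delimiter used below.
function route_matches($path, array $route_patterns)
{
    foreach ($route_patterns as $pattern) {
        if (preg_match('~' . $pattern . '~i', $path) === 1) {
            return true;
        }
    }
    return false;
}

// Sample routes, made up for this example.
$blacklisted_routes = array(
    '^/sitemap\.xml$', // non-HTML resource
    '^/stream/',       // streaming audio/video served without a file extension
);

var_dump(route_matches('/stream/live/42', $blacklisted_routes)); // bool(true)
var_dump(route_matches('/about', $blacklisted_routes));          // bool(false)
```

The /stream/ route illustrates the point made above: no file extension check would catch it, but a route pattern does.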

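SnapSearchClientPHP can of course be used in other areas, such as enhanced JavaScript scraping, so it does not need to sit at the entry point if you are using it for other purposes. In that case, just use SnapSearchClientPHP\Client to send requests to the SnapSearch API.

Proxies

SnapSearch-Client-PHP uses the Symfony HTTP Foundation Request object as its abstraction of the HTTP request. This gives you considerable flexibility in constructing the HTTP request, especially when you are behind a reverse proxy such as a load balancer. Behind a reverse proxy, some information (such as the request protocol) may not be in its usual place. You can configure the Symfony HTTP Foundation Request object to handle these edge cases and pass your instance to the Detector. For more information see: https://symfony.com.cn/doc/current/components/http_foundation/trusting_proxies.html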
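To illustrate why the request protocol can end up "not in its usual place", the sketch below works out the original scheme from a $_SERVER-style array and only honours the X-Forwarded-Proto header when the direct peer is a proxy you trust. The function name original_scheme and the sample addresses are hypothetical. In a real setup you would rely on Symfony's trusted proxy configuration instead of hand-rolled code like this.

```php
<?php
// Hypothetical helper: derives the original request scheme from a
// $_SERVER-style array. X-Forwarded-Proto is only honoured when the direct
// peer (REMOTE_ADDR) is one of the proxies you trust.
function original_scheme(array $server, array $trusted_proxies)
{
    $remote = isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';
    if (in_array($remote, $trusted_proxies, true)
        && !empty($server['HTTP_X_FORWARDED_PROTO'])) {
        // The header may hold a comma-separated chain; take the first entry.
        $parts = explode(',', $server['HTTP_X_FORWARDED_PROTO']);
        return strtolower(trim($parts[0]));
    }
    $https = isset($server['HTTPS']) ? strtolower($server['HTTPS']) : '';
    return ($https !== '' && $https !== 'off') ? 'https' : 'http';
}

// Sample request: the load balancer at 10.0.0.1 terminated TLS and forwarded
// plain HTTP to the application.
$server = array('REMOTE_ADDR' => '10.0.0.1', 'HTTP_X_FORWARDED_PROTO' => 'https');

var_dump(original_scheme($server, array('10.0.0.1'))); // string(5) "https"
var_dump(original_scheme($server, array()));           // string(4) "http"
```

The second call shows the reason for the trust check: without it, any client could claim an arbitrary scheme by sending the header itself.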

Development

Install or update the dependencies using composer:

composer update

Make your changes, synchronise, then create a new tag:

git tag MAJOR.MINOR.PATCH
git push
git push --tags

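Packagist is integrated through a GitHub service hook, so the new package is published automatically.

Tests

Unit tests are written with Codeception. Codeception has already been bootstrapped (codecept bootstrap). To run the tests, use codecept run, or codecept run --debug to show debugging information. If you change the Codeception configuration files or add extra functions to the helpers, make sure to run codecept build so that the settings take effect.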