athlon1600 / serpscraper
A PHP-powered interface for querying the most popular search engines
v4.0.1
2024-08-31 21:33 UTC
Requires
- ext-curl: *
- athlon1600/php-captcha-solver: ^2.0
- athlon1600/php-curl-client: ^1.1
Requires (Dev)
- phpstan/phpstan: ^1.12
- phpunit/phpunit: ^8.5.39 || ^9.5 || ^10.0
README
SerpScraper
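The purpose of this library is to provide an easy, hard-to-detect, and captcha-resistant way to extract search results from popular search engines such as Google and Bing.
Installation
The recommended way to install this package is via Composer: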
composer require athlon1600/serpscraper "^4.0"
Extracting Search Results From Google
<?php

require 'vendor/autoload.php'; // Composer autoloader

use SerpScraper\Engine\GoogleSearch;

$page = 1;

$google = new GoogleSearch();

// all available preferences for Google
$google->setPreference('results_per_page', 100);
//$google->setPreference('google_domain', 'google.lt');
//$google->setPreference('date_range', 'hour');

$results = array();

do {

    $response = $google->search("how to scrape google", $page);

    // error field must be empty otherwise query failed
    if (empty($response->error)) {
        $results = array_merge($results, $response->results);
        $page++;
    } else if ($response->error == 'captcha') {
        // read below
        break;
    } else {
        // any other error: stop rather than request the same page again
        break;
    }

} while ($response->has_next_page);
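The paging loop above can be factored into a small reusable helper. What follows is a minimal sketch and not part of SerpScraper: `collectResults` and its `$fetchPage` callable are hypothetical names, and the only thing assumed about a response is the three fields the example above already reads (`error`, `results`, `has_next_page`). A stub stands in for `$google->search()` so the sketch runs without network access.

```php
<?php

/**
 * Collect results page by page until an error occurs, the engine reports
 * no further pages, or $maxPages is reached.
 *
 * $fetchPage is a hypothetical callable: it takes a page number and returns
 * an object with the fields `error`, `results` and `has_next_page`.
 */
function collectResults(callable $fetchPage, int $maxPages = 10): array
{
    $results = [];
    $error = null;

    for ($page = 1; $page <= $maxPages; $page++) {
        $response = $fetchPage($page);

        if (!empty($response->error)) {
            // Stop on the first failure and report it to the caller.
            $error = $response->error;
            break;
        }

        $results = array_merge($results, $response->results);

        if (empty($response->has_next_page)) {
            break;
        }
    }

    return ['results' => $results, 'error' => $error];
}

// Stub standing in for $google->search($query, $page): three pages, two URLs each.
$stub = function (int $page): object {
    return (object) [
        'error' => '',
        'results' => ["https://example.com/p{$page}/a", "https://example.com/p{$page}/b"],
        'has_next_page' => $page < 3,
    ];
};

$outcome = collectResults($stub);
echo count($outcome['results']) . " results collected" . PHP_EOL;
```

With the real library, the callable would simply wrap the search call, for example `function ($page) use ($google) { return $google->search('how to scrape google', $page); }`. Returning the error alongside the results lets the caller decide whether a `'captcha'` error should trigger the solver described in the next section.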
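Solving Google Search captchas automatically
For this to work, you need to register with 2captcha.com and obtain an API key. Using a proxy server is also highly recommended.
Instructions for installing a private proxy server on your own VPS are here: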
https://github.com/Athlon1600/useful#squid
<?php

require 'vendor/autoload.php'; // Composer autoloader

use SerpScraper\Engine\GoogleSearch;
use SerpScraper\GoogleCaptchaSolver;

$google = new GoogleSearch();

// route all requests through a proxy
$browser = $google->getBrowser();
$browser->setProxy('PROXY:IP');

$solver = new GoogleCaptchaSolver($browser);

// keep searching; solve a captcha whenever one appears
while (true) {

    $response = $google->search('famous people born in ' . mt_rand(1500, 2020));

    if ($response->error == 'captcha') {

        echo "Captcha detected!" . PHP_EOL;

        $temp = $solver->solveUsingTwoCaptcha($response, '2CAPTCHA_API_KEY', 90);

        if ($temp->status == 200) {
            echo "Captcha solved successfully!" . PHP_EOL;
        } else {
            echo 'Solving captcha has failed...' . PHP_EOL;
        }

    } else {
        echo "OK. ";
    }

    // pause between queries
    sleep(2);
}
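The example above uses the literals `'2CAPTCHA_API_KEY'` and `'PROXY:IP'` as placeholders. One way to keep the real values out of source control is to read them from environment variables. This is a minimal sketch under assumptions: `requireEnv` is a hypothetical helper and `TWOCAPTCHA_API_KEY` is a made-up variable name; neither comes from SerpScraper.

```php
<?php

/**
 * Read a required setting from the environment and fail loudly if it is missing.
 * Hypothetical helper, not part of SerpScraper.
 */
function requireEnv(string $name): string
{
    $value = getenv($name);

    if ($value === false || $value === '') {
        throw new RuntimeException("Missing environment variable: {$name}");
    }

    return $value;
}

// Demo only: set the variable here so the sketch runs as-is.
// In practice it would be set by the shell or the process manager.
putenv('TWOCAPTCHA_API_KEY=demo-key');

$apiKey = requireEnv('TWOCAPTCHA_API_KEY');
echo "API key loaded (" . strlen($apiKey) . " characters)" . PHP_EOL;
```

The resulting `$apiKey` would then be passed to `solveUsingTwoCaptcha()` in place of the literal, and a proxy address read the same way could be passed to `setProxy()`.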
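Extracting Search Results From Bing
<?php

require 'vendor/autoload.php'; // Composer autoloader

use SerpScraper\Engine\BingSearch;

$bing = new BingSearch();
$results = array();

// fetch up to nine pages of results
for ($page = 1; $page < 10; $page++) {

    $response = $bing->search("search bing using php", $page);

    // only keep results from successful queries
    if ($response->error == false) {
        $results = array_merge($results, $response->results);
    }

    // stop once there are no further pages
    if ($response->has_next_page == false) {
        break;
    }
}

var_dump($results);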