simplon/spider

Retrieves the essential information of a website

0.1.13 2015-11-11 22:06 UTC

Last update: 2024-09-20 23:59:30 UTC


README

         /      \
      \  \  ,,  /  /
       '-.`\()/`.-'
      .--_'(  )'_--.
     / /` /`""`\ `\ \
      |  |  ><  |  |
      \  \      /  /
          '.__.'

      Simplon/Spider

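Intro

What is simplon/spider?

Spider parses a given HTML document and summarises all the essential data:

  • title
  • description
  • keywords
  • all h1 contents
  • open-graph tags
  • twitter tags
  • all images

It returns essentially the same kind of response as Facebook's scraper. Facebook's scraper, however, does not provide all the necessary data: as the two responses below show, it omits keywords, images and twitter tags.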
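
To make the list of extracted fields more concrete, here is a minimal sketch of how such fields could be pulled out of an HTML string with PHP's built-in DOM extension. This is not simplon/spider's actual implementation: the function name `sketchExtractMeta` and the layout of the returned array are assumptions made for illustration only, and the real package may structure its result differently.

```php
<?php
// Hypothetical helper, NOT part of simplon/spider: collects some of the
// fields listed above from an HTML string using PHP's DOM extension.
function sketchExtractMeta($html)
{
    $result = [
        'title'     => null,
        'h1'        => [],
        'meta'      => [], // description, keywords, ...
        'openGraph' => [],
        'twitter'   => [],
        'images'    => [],
    ];

    // Nothing to parse: return the empty structure
    if (trim($html) === '') {
        return $result;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml warnings internal
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('h1') as $h1) {
        $result['h1'][] = trim($h1->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $content  = $meta->getAttribute('content');
        $property = $meta->getAttribute('property');
        $name     = $meta->getAttribute('name');

        if (strpos($property, 'og:') === 0) {
            $result['openGraph'][substr($property, 3)] = $content;
        } elseif (strpos($name, 'twitter:') === 0) {
            $result['twitter'][substr($name, 8)] = $content;
        } elseif ($name !== '') {
            $result['meta'][strtolower($name)] = $content;
        }
    }

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $result['images'][] = $src; // left as found; may be relative
        }
    }

    return $result;
}
```

The sketch leaves image paths untouched and ignores character-set detection, both of which a production scraper would need to handle.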

Facebook scraper response

{
   "og_object":{
      "id":"379786107965",
      "description":"Find the latest breaking news and information on the top stories, weather, business, entertainment, politics, and more. For in-depth coverage, CNN provides special reports, video, audio, photo galleries, and interactive guides",
      "title":"Breaking News, U.S., World, Weather, Entertainment & Video News - CNN.com",
      "type":"website",
      "updated_time":"2015-09-01T13:15:53+0000",
      "url":"http:\/\/www.cnn.com\/"
   },
   "share":{
      "comment_count":0,
      "share_count":1340555
   },
   "id":"http:\/\/cnn.com"
}

Spider response

{
   "title":"Breaking News, U.S., World, Weather, Entertainment & Video News - CNN.com",
   "description":"Find the latest breaking news and information on the top stories, weather, business, entertainment, politics, and more. For in-depth coverage, CNN provides special reports, video, audio, photo galleries, and interactive guides",
   "keywords":"breaking news, news online, U.S. news, world news, developing story, news video, CNN news, weather, business, money, politics, law, technology, entertainment, education, travel, health, special reports, autos, CNN TV",
   "url": "http:\/\/www.cnn.com\/",
   "images":[
      "http://i2.cdn.turner.com/cnnnext/dam/assets/150901143136-budapest-migrant-protest-fists-large-169.jpg",
      "http://i2.cdn.turner.com/cnnnext/dam/assets/110902115913-gates-of-auschwitz-large-169.jpg"
   ],
   "openGraph":{
      "pubdate":"2014-02-24T14:45:54Z",
      "url":"http://www.cnn.com",
      "title":"Breaking News, U.S., World, Weather, Entertainment &amp; Video News - CNN.com",
      "description":"Find the latest breaking news and information on the top stories, weather, business, entertainment, politics, and more. For in-depth coverage, CNN provides special reports, video, audio, photo galleries, and interactive guides",
      "site_name":"CNN",
      "type":"website"
   },
   "twitter":{
      "card":"summary_large_image"
   }
}

Any dependencies?

  • PHP 5.4
  • CURL
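
Whether a given environment meets these two requirements can be checked with a few lines of plain PHP. The sketch below assumes that "PHP 5.4" means 5.4 is the minimum version; the function name `sketchCheckRequirements` is made up for this example and is not part of the package.

```php
<?php
// Hypothetical helper: returns a list of human-readable problems.
// An empty list means both requirements appear to be met.
function sketchCheckRequirements()
{
    $problems = [];

    // Assumption: 5.4 is a minimum version, not an exact one
    if (version_compare(PHP_VERSION, '5.4.0', '<')) {
        $problems[] = 'PHP 5.4 or newer is required, found ' . PHP_VERSION;
    }

    if (!extension_loaded('curl')) {
        $problems[] = 'The cURL extension is not loaded';
    }

    return $problems;
}

// Print any problems found, one per line
foreach (sketchCheckRequirements() as $problem) {
    echo $problem, PHP_EOL;
}
```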

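Install

Installation via Composer is straightforward. If you are not yet familiar with Composer, read up on it first. Then add the package to your composer.json: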

{
    "require": {
        "simplon/spider": "*"
    }
}

Examples

The following examples are straightforward and should need little extra explanation.

Parse by fetching a page

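// load Composer's autoloader (path assumes this script sits in the project root)
require __DIR__ . '/vendor/autoload.php';

use Simplon\Spider\Spider;

// fetch the page and parse it
$data = Spider::fetchParse('http://cnn.com');

echo json_encode($data); // print the result as JSON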

Parse existing HTML

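// load Composer's autoloader (path assumes this script sits in the project root)
require __DIR__ . '/vendor/autoload.php';

use Simplon\Spider\Spider;

// page html
$html = '...';

// parse only, nothing is fetched;
// the URL is needed to rebuild absolute image paths
$data = Spider::parse($html, 'http://cnn.com');

echo json_encode($data); // print the result as JSON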
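
The URL argument matters because image paths in HTML are often relative. How simplon/spider rebuilds them internally is not documented here; the following is only a simplified sketch of the general idea, with a hypothetical function name `sketchAbsoluteUrl`. It covers absolute, protocol-relative, root-relative and document-relative paths, and does not normalise `../` segments.

```php
<?php
// Hypothetical helper, NOT the package's own code: turns an image path
// into an absolute URL using the URL of the page it was found on.
function sketchAbsoluteUrl($src, $pageUrl)
{
    // Already has a scheme (http:, https:, data:, ...): keep as-is
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return $src;
    }

    $base   = parse_url($pageUrl);
    $scheme = isset($base['scheme']) ? $base['scheme'] : 'http';
    $host   = isset($base['host']) ? $base['host'] : '';
    $port   = isset($base['port']) ? ':' . $base['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative: //cdn.example.com/a.jpg
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }

    // Root-relative: /img/a.jpg
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }

    // Document-relative: resolve against the directory of the page path
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $src;
}

echo sketchAbsoluteUrl('/img/a.jpg', 'http://cnn.com/world/index.html'), PHP_EOL;
// prints http://cnn.com/img/a.jpg
```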

Result in both cases

{
   "title":"Breaking News, U.S., World, Weather, Entertainment & Video News - CNN.com",
   "description":"Find the latest breaking news and information on the top stories, weather, business, entertainment, politics, and more. For in-depth coverage, CNN provides special reports, video, audio, photo galleries, and interactive guides",
   "keywords":"breaking news, news online, U.S. news, world news, developing story, news video, CNN news, weather, business, money, politics, law, technology, entertainment, education, travel, health, special reports, autos, CNN TV",
   "url": "http:\/\/www.cnn.com\/",
   "images":[
      "http://i2.cdn.turner.com/cnnnext/dam/assets/150901143136-budapest-migrant-protest-fists-large-169.jpg",
      "http://i2.cdn.turner.com/cnnnext/dam/assets/110902115913-gates-of-auschwitz-large-169.jpg"
   ],
   "openGraph":{
      "pubdate":"2014-02-24T14:45:54Z",
      "url":"http://www.cnn.com",
      "title":"Breaking News, U.S., World, Weather, Entertainment &amp; Video News - CNN.com",
      "description":"Find the latest breaking news and information on the top stories, weather, business, entertainment, politics, and more. For in-depth coverage, CNN provides special reports, video, audio, photo galleries, and interactive guides",
      "site_name":"CNN",
      "type":"website"
   },
   "twitter":{
      "card":"summary_large_image"
   }
}
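
Once the result is available, a typical next step is to pick values out of it. The sketch below assumes the result is (or has been converted to) a PHP array shaped like the JSON above; if the library returns an object instead, `json_decode(json_encode($data), true)` yields such an array. The helper name `sketchPreferredTitle` is hypothetical. It prefers the Open Graph title and decodes HTML entities, since the sample above shows `&amp;` in that field.

```php
<?php
// Hypothetical helper: choose a display title from a result array,
// preferring the Open Graph value over the plain <title>.
function sketchPreferredTitle(array $data)
{
    if (!empty($data['openGraph']['title'])) {
        // Open Graph values may still contain entities such as &amp;
        return html_entity_decode($data['openGraph']['title'], ENT_QUOTES, 'UTF-8');
    }

    return isset($data['title']) ? $data['title'] : null;
}
```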

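License

simplon/spider is freely distributable under the terms of the MIT license.

Copyright (c) 2015 Tino Ehrich (tino@bigpun.me)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.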