embed/embed

PHP library to get information from web pages using oEmbed, OpenGraph, etc.

v4.4.12 2024-07-24 14:08 UTC

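PHP library to get information from any web page (using oEmbed, OpenGraph, Twitter Cards, scraping the HTML, etc.). It is compatible with any web service (YouTube, Vimeo, Flickr, Instagram, etc.) and has adapters for some sites (archive.org, GitHub, Facebook, etc.).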

Requirements

If you need PHP 5.5-7.3 support, use the 3.x version.

Online demo

Run php -S localhost:8888 demo/index.php

Video tutorial

Installation

This package is installable and autoloadable via Composer as embed/embed.

$ composer require embed/embed

Usage

use Embed\Embed;

$embed = new Embed();

//Load any url:
$info = $embed->get('https://www.youtube.com/watch?v=PP1xn5wHtxE');

//Get content info

$info->title; //The page title
$info->description; //The page description
$info->url; //The canonical url
$info->keywords; //The page keywords

$info->image; //The thumbnail or main image

$info->code->html; //The code to embed the image, video, etc
$info->code->width; //The exact width of the embed code (if exists)
$info->code->height; //The exact height of the embed code (if exists)
$info->code->ratio; //The percentage of height / width to emulate the aspect ratio using paddings.

$info->authorName; //The resource author
$info->authorUrl; //The author url

$info->cms; //The cms used
$info->language; //The language of the page
$info->languages; //The alternative languages

$info->providerName; //The provider name of the page (Youtube, Twitter, Instagram, etc)
$info->providerUrl; //The provider url
$info->icon; //The big icon of the site
$info->favicon; //The favicon of the site (an .ico file or a png with up to 32x32px)

$info->publishedTime; //The published time of the resource
$info->license; //The license url of the resource
$info->feeds; //The RSS/Atom feeds
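
The ratio value is mostly useful for responsive embeds. The following is a minimal sketch in plain PHP, assuming the ratio is height / width expressed as a percentage, as the comment above describes. The helper names paddingRatio and responsiveWrapper are hypothetical and not part of Embed.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not part of Embed): height as a percentage of width.
function paddingRatio(int $width, int $height): float
{
    if ($width <= 0) {
        throw new InvalidArgumentException('Width must be positive');
    }

    return round($height / $width * 100, 3);
}

// Hypothetical helper: wrap embed html in a box that keeps the aspect ratio
// through padding-bottom. The inner element still needs absolute positioning.
function responsiveWrapper(string $html, float $ratio): string
{
    return sprintf(
        '<div style="position:relative;height:0;padding-bottom:%s%%">%s</div>',
        $ratio,
        $html
    );
}

echo paddingRatio(640, 360), "\n"; // 56.25
```

In practice you would pass $info->code->ratio and $info->code->html to such a wrapper instead of computing the ratio yourself.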

Parallel multiple requests

use Embed\Embed;

$embed = new Embed();

//Load multiple urls asynchronously:
$infos = $embed->getMulti(
    'https://www.youtube.com/watch?v=PP1xn5wHtxE',
    'https://twitter.com/carlosmeixidefl/status/1230894146220625933',
    'https://en.wikipedia.org/wiki/Tordoia',
);

foreach ($infos as $info) {
    echo $info->title;
}

Document

The document is the object that stores the HTML code of the page. You can use it to extract extra info from the HTML code:

//Get the document object
$document = $info->getDocument();

$document->link('image_src'); //Returns the href of a <link>
$document->getDocument(); //Returns the DOMDocument instance
$html = (string) $document; //Returns the html code

$document->select('.//h1'); //Search

You can use XPath queries to select specific elements. A search always returns an Embed\QueryResult instance:

//Search the A elements
$result = $document->select('.//a');

//Filter the results
$result->filter(fn ($node) => $node->getAttribute('href'));

$id = $result->str('id'); //Return the id of the first result as string
$text = $result->str(); //Return the content of the first result

$ids = $result->strAll('id'); //Return an array with the ids of all results as string
$texts = $result->strAll(); //Return an array with the content of all results as string

$tabindex = $result->int('tabindex'); //Return the tabindex attribute of the first result as integer
$number = $result->int(); //Return the content of the first result as integer

$href = $result->url('href'); //Return the href attribute of the first result as url (converts relative urls to absolutes)
$url = $result->url(); //Return the content of the first result as url

$node = $result->node(); //Return the first node found (DOMElement)
$nodes = $result->nodes(); //Return all nodes found
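
To see what such a query does without fetching a page, here is a rough equivalent using only PHP's built-in DOM extension. It is a sketch of the same XPath search and href filter, not Embed's QueryResult implementation, and the sample HTML is made up for illustration.

```php
<?php
declare(strict_types=1);

// Made-up sample markup for illustration.
$html = '<html><body><h1>Title</h1>'
    . '<a id="first" href="/about">About</a>'
    . '<a id="second">No link</a>'
    . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Same XPath expression as in the select() call above
$xpath = new DOMXPath($dom);
$nodes = iterator_to_array($xpath->query('.//a'));

// Keep only the anchors that carry a non-empty href attribute
$withHref = array_values(array_filter(
    $nodes,
    fn (DOMElement $node) => $node->getAttribute('href') !== ''
));

echo count($nodes), "\n";                     // 2
echo $withHref[0]->getAttribute('id'), "\n";  // first
echo $withHref[0]->textContent, "\n";         // About
```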

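Metas

For convenience, the Metas object stores the values of all <meta> elements in the HTML, so you can get them more easily. The key of each meta comes from its name, property or itemprop attribute, and the value from content.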

//Get the Metas object
$metas = $info->getMetas();

$metas->all(); //Return all values
$metas->get('og:title'); //Return a key value
$metas->str('og:title'); //Return the value as string (remove html tags)
$metas->html('og:description'); //Return the value as html
$metas->int('og:video:width'); //Return the value as integer
$metas->url('og:url'); //Return the value as full url (converts relative urls to absolutes)
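
The key/value rule described above can be sketched with PHP's DOM extension alone. This is an illustration under the stated assumption (key from name, property or itemprop; value from content), not the library's actual code, and the sample HTML is invented.

```php
<?php
declare(strict_types=1);

// Invented sample markup for illustration.
$html = '<html><head>'
    . '<meta name="description" content="A page">'
    . '<meta property="og:title" content="Hello">'
    . '<meta itemprop="image" content="/img.png">'
    . '</head><body></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$metas = [];

foreach ($dom->getElementsByTagName('meta') as $meta) {
    foreach (['name', 'property', 'itemprop'] as $attribute) {
        $key = $meta->getAttribute($attribute);

        // getAttribute() returns an empty string when the attribute is missing
        if ($key !== '') {
            $metas[strtolower($key)][] = $meta->getAttribute('content');
        }
    }
}

echo $metas['og:title'][0], "\n"; // Hello
```

Each key maps to a list because a page may repeat the same meta (several og:image entries, for example).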

OEmbed

In addition to the HTML and metas, this library uses oEmbed endpoints to get additional data. You can get this data as follows:

//Get the oEmbed object
$oembed = $info->getOEmbed();

$oembed->all(); //Return all raw data
$oembed->get('title'); //Return a key value
$oembed->str('title'); //Return the value as string (remove html tags)
$oembed->html('html'); //Return the value as html
$oembed->int('width'); //Return the value as integer
$oembed->url('url'); //Return the value as full url (converts relative urls to absolutes)

Additional oEmbed parameters (like Instagram's hidecaption) can also be provided:

$embed = new Embed();

$info = $embed->get('https://www.instagram.com/p/B_C0wheCa4V/');
$info->setSettings([
    'oembed:query_parameters' => ['hidecaption' => true]
]);
$oembed = $info->getOEmbed();

LinkedData

Another API available by default, used to extract info using the JsonLD schema.

//Get the linkedData object
$ld = $info->getLinkedData();

$ld->all(); //Return all data
$ld->get('name'); //Return a key value
$ld->str('name'); //Return the value as string (remove html tags)
$ld->html('description'); //Return the value as html
$ld->int('width'); //Return the value as integer
$ld->url('url'); //Return the value as full url (converts relative urls to absolutes)
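
For context, JSON-LD data usually lives in script elements of type application/ld+json. The sketch below reads such a block with plain PHP. It only illustrates where the data comes from; it is not how Embed implements getLinkedData(), and the sample HTML is invented.

```php
<?php
declare(strict_types=1);

// Invented sample markup for illustration.
$html = '<html><head><script type="application/ld+json">'
    . '{"@context":"https://schema.org","@type":"Article","name":"Hello","url":"https://example.com/a"}'
    . '</script></head><body></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$data = [];

foreach ($xpath->query('.//script[@type="application/ld+json"]') as $script) {
    $decoded = json_decode($script->textContent, true);

    // Skip blocks that are not valid JSON objects or arrays
    if (is_array($decoded)) {
        $data[] = $decoded;
    }
}

echo $data[0]['name'] ?? '', "\n"; // Hello
```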

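Other APIs

Some sites (like Wikipedia or Archive.org) provide custom APIs for fetching more reliable data. You can get the API object with the getApi() method, but note that not all results have this method. The API object has the same methods as oEmbed: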

//Get the API object
$api = $info->getApi();

$api->all(); //Return all raw data
$api->get('title'); //Return a key value
$api->str('title'); //Return the value as string (remove html tags)
$api->html('html'); //Return the value as html
$api->int('width'); //Return the value as integer
$api->url('url'); //Return the value as full url (converts relative urls to absolutes)

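Extending Embed

Depending on your needs, you may want to extend this library with extra features or change the way it performs some operations.

PSR

Embed uses some PSR standards for maximum interoperability:

  • PSR-7: standard interfaces to represent HTTP requests, responses and URIs
  • PSR-17: standard factories to create PSR-7 objects
  • PSR-18: standard interface to send an HTTP request and return a response

Embed ships with a PSR-18-compatible CURL client, but you need to install a PSR-7 / PSR-17 library. Several popular libraries exist, and Embed can automatically detect 'laminas\diactoros', 'guzzleHttp\psr7', 'slim\psr7', 'nyholm\psr7' and 'sunrise\http' (in this order). To use a different PSR implementation, pass it to the Crawler: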

use Embed\Embed;
use Embed\Http\Crawler;

$client = new CustomHttpClient();
$requestFactory = new CustomRequestFactory();
$uriFactory = new CustomUriFactory();

//The Crawler is responsible for performing http queries
$crawler = new Crawler($client, $requestFactory, $uriFactory);

//Create an embed instance passing the Crawler
$embed = new Embed($crawler);

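Adapters

Some sites have special needs: either because they provide public APIs that allow extracting more info (like Wikipedia or Archive.org), or because the way data is extracted from that particular site needs to change. For all these cases there are adapters: classes that extend the default classes to provide extra functionality.

Before creating an adapter, you need to understand how Embed works: when you execute this code, you get an Extractor: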

//Get the Extractor with all info
$info = $embed->get($url);

//The extractor has document and oembed:
$document = $info->getDocument();
$oembed = $info->getOEmbed();

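The Extractor class has many Detectors. Each detector is responsible for detecting a specific piece of info. For example, there is a detector for the title, and others for the description, image, code, etc.

So an adapter is basically an extractor created specifically for one site. It can also contain custom detectors or APIs. You can see all adapters in the src/Adapters folder.

If you create an adapter, you also need to register it with Embed so that it knows which site to use it for. To do that, there is the ExtractorFactory object, which is responsible for instantiating the right extractor for each site.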

use Embed\Embed;

$embed = new Embed();

$factory = $embed->getExtractorFactory();

//Use this MySite adapter for mysite.com
$factory->addAdapter('mysite.com', MySite::class);

//Remove the adapter for pinterest.com, so it will use the default extractor
$factory->removeAdapter('pinterest.com');

//Change the default extractor
$factory->setDefault(CustomExtractor::class);

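Detectors

Embed comes with several predefined detectors, but you may want to change them or add more. Just create a class extending Embed\Detectors\Detector and register it in the extractor factory. For example: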

use Embed\Embed;
use Embed\Detectors\Detector;

class Robots extends Detector
{
    public function detect(): ?string
    {
        $response = $this->extractor->getResponse();
        $metas = $this->extractor->getMetas();

        //Prefer the X-Robots-Tag header, fall back to the robots meta
        return $response->getHeaderLine('x-robots-tag')
            ?: $metas->str('robots');
    }
}

//Register the detector
$embed = new Embed();
$embed->getExtractorFactory()->addDetector('robots', Robots::class);

//Use it
$info = $embed->get('http://example.com');
$robots = $info->robots;
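
The detector above relies on the short ternary operator to fall back from the response header to the meta value. The standalone sketch below shows that fallback in isolation; pickRobots is a hypothetical stand-in that takes plain values instead of the extractor's response and metas objects.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the detector logic: prefer the header value,
// fall back to the meta value when the header is empty.
function pickRobots(string $headerLine, ?string $metaValue): ?string
{
    // ?: returns the left operand unless it is falsy (an empty string here)
    return $headerLine ?: $metaValue;
}

var_dump(pickRobots('noindex', 'index, follow'));
var_dump(pickRobots('', 'index, follow'));
var_dump(pickRobots('', null));
```

PSR-7's getHeaderLine() returns an empty string when the header is absent, which is why an empty string is used to model the missing-header case.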

Settings

If you need to pass settings to the CurlClient used to perform http queries:

use Embed\Embed;
use Embed\Http\Crawler;
use Embed\Http\CurlClient;

$client = new CurlClient();
$client->setSettings([
    'cookies_path' => $cookies_path,
    'ignored_errors' => [18],
    'max_redirs' => 3,               // see CURLOPT_MAXREDIRS
    'connect_timeout' => 2,          // see CURLOPT_CONNECTTIMEOUT
    'timeout' => 2,                  // see CURLOPT_TIMEOUT
    'ssl_verify_host' => 2,          // see CURLOPT_SSL_VERIFYHOST
    'ssl_verify_peer' => 1,          // see CURLOPT_SSL_VERIFYPEER
    'follow_location' => true,       // see CURLOPT_FOLLOWLOCATION
    'user_agent' => 'Mozilla',       // see CURLOPT_USERAGENT
]);

$embed = new Embed(new Crawler($client));

If you need to pass settings to your detectors, you can add settings to the ExtractorFactory:

use Embed\Embed;

$embed = new Embed();
$embed->setSettings([
    'oembed:query_parameters' => [],  //Extra parameters sent to oembed
    'twitch:parent' => 'example.com', //Required to embed twitch videos as iframe
    'facebook:token' => '1234|5678',  //Required to embed content from Facebook
    'instagram:token' => '1234|5678', //Required to embed content from Instagram
    'twitter:token' => 'asdf',        //Improve the data from twitter
]);
$info = $embed->get($url);

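Note: The built-in detectors do not need settings. This feature is provided only for convenience, in case you create a specific detector that requires settings.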