README

PicoFeed was originally developed for Miniflux, a minimalist and open source news reader.

However, this library can be used in any project. PicoFeed has been tested with a lot of different feeds and is simple to use.

Features

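Simple and fast
Feed parser for Atom 1.0 and RSS (0.91, 0.92, 1.0 and 2.0)
Feed writer for Atom 1.0 and RSS 2.0
Import/export of OPML subscriptions
Content filter: HTML cleanup, removal of tracking pixels and ads
Multiple HTTP client adapters: cURL or Stream Context
Content grabber: downloads the full content from the original website
License: Unlicense http://unlicense.org/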

Requirements

PHP >= 5.3
libxml >= 2.7
XML PHP extensions: DOM and SimpleXML
cURL or Stream Context (allow_url_fopen=On)

Usage

Import an OPML file

require 'vendor/PicoFeed/Import.php';

use PicoFeed\Import;

$opml = file_get_contents('mySubscriptions.opml');
$import = new Import($opml);
$entries = $import->execute();

print_r($entries);
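
To give an idea of what an OPML import involves, here is a minimal sketch that reads feed outlines with SimpleXML. It is not PicoFeed's Import class: the function name parseOpml and the returned keys (borrowed from the export example in this README) are assumptions made for illustration.

```php
<?php
// Hypothetical helper for illustration; this is not PicoFeed's Import class.
function parseOpml($opml)
{
    // Collect XML errors quietly instead of emitting warnings
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($opml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return array();
    }

    $entries = array();

    // Only outlines carrying an xmlUrl attribute describe a feed,
    // the other ones are usually folders
    foreach ($xml->xpath('//outline[@xmlUrl]') as $outline) {
        $title = (string) $outline['title'];

        if ($title === '') {
            $title = (string) $outline['text'];
        }

        $entries[] = array(
            'title' => $title,
            'site_url' => (string) $outline['htmlUrl'],
            'site_feed' => (string) $outline['xmlUrl'],
        );
    }

    return $entries;
}

$opml = '<opml version="1.0"><head><title>Subscriptions</title></head><body>'
    .'<outline text="Blogs">'
    .'<outline text="Petit Codeur" htmlUrl="http://petitcodeur.fr/" xmlUrl="http://petitcodeur.fr/feed.xml"/>'
    .'</outline></body></opml>';

print_r(parseOpml($opml));
```

The folder outline ("Blogs") is skipped because it has no xmlUrl attribute, so only the nested feed is returned.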

Export to OPML

require 'vendor/PicoFeed/Export.php';

use PicoFeed\Export;

$feeds = array(
    array(
        'title' => 'Site title',
        'description' => 'Optional description',
        'site_url' => 'http://petitcodeur.fr/',
        'site_feed' => 'http://petitcodeur.fr/feed.xml'
    )
);

$export = new Export($feeds);
$opml = $export->execute();

echo $opml; // XML content

Download and parse a feed

require 'vendor/PicoFeed/Reader.php';

use PicoFeed\Reader;

$reader = new Reader;

// Try to discover the XML feed automatically
$reader->download('http://petitcodeur.fr/');

$parser = $reader->getParser();

if ($parser !== false) {

    $feed = $parser->execute();

    echo $feed->title;
    echo $feed->url;
    print_r($feed->items);
}

Handle the HTTP cache

require 'vendor/PicoFeed/Reader.php';

use PicoFeed\Reader;

$reader = new Reader;

// Get the last modified information from previous requests
$lastModified = '...';
$etag = '...';

// Download the feed directly
$resource = $reader->download('http://petitcodeur.fr/feed.xml', $lastModified, $etag);

if ($resource->isModified()) {

    $parser = $reader->getParser();

    if ($parser !== false) {

        $feed = $parser->execute();

        echo $feed->title;
        echo $feed->url;
        print_r($feed->items);

        // Save the cache information for the next request
        $lastModified = $resource->getLastModified();
        $etag = $resource->getEtag();
    }
}
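
For background, HTTP caching of this kind relies on conditional requests: the client sends back the values it stored, and the server answers 304 Not Modified when nothing changed. The sketch below shows that mechanism in isolation. The function names are hypothetical and this is not PicoFeed's internal code.

```php
<?php
// Hypothetical helpers for illustration; not part of PicoFeed.

// Build the conditional request headers from the values saved earlier
function buildConditionalHeaders($lastModified, $etag)
{
    $headers = array();

    if ($lastModified !== '') {
        $headers[] = 'If-Modified-Since: '.$lastModified;
    }

    if ($etag !== '') {
        $headers[] = 'If-None-Match: '.$etag;
    }

    return $headers;
}

// A 304 status code means the cached copy is still valid
function isModifiedStatus($statusCode)
{
    return $statusCode !== 304;
}

print_r(buildConditionalHeaders('Tue, 02 Jul 2013 03:26:02 GMT', '"393e79c-cfed-4e07ee78b2680"'));
var_dump(isModifiedStatus(304));
```

On the very first request both values are empty, so no conditional header is sent and the server returns the full feed.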

Change the user agent and connection timeout

$reader->download(
    'http://petitcodeur.fr/',
    'last modified date',
    'etag value',
    10,
    'My RSS reader user agent'
);
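
When the Stream Context adapter is used, settings like these typically end up as PHP HTTP context options. The following sketch only illustrates the standard PHP options involved. The function createFeedContext is hypothetical, and how PicoFeed builds its own context is an assumption not confirmed by this README.

```php
<?php
// Hypothetical helper for illustration; not PicoFeed's client code.
function createFeedContext($timeout, $userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'method' => 'GET',
            'timeout' => $timeout,        // read timeout in seconds
            'user_agent' => $userAgent,   // value of the User-Agent header
            'follow_location' => 1,       // follow HTTP redirections
        ),
    ));
}

$context = createFeedContext(10, 'My RSS reader user agent');

// The context would then be passed to a stream function, for example:
// $body = file_get_contents('http://petitcodeur.fr/', false, $context);
print_r(stream_context_get_options($context));
```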

Use an HTTP proxy

Call the static method proxy() before any other operation

PicoFeed\Client::proxy($hostname, $port);

If your proxy is protected by a login/password

PicoFeed\Client::proxy($hostname, $port, $username, $password);

Generate an RSS 2.0 feed

require_once 'lib/PicoFeed/Writers/Rss20.php';

use PicoFeed\Writers\Rss20;

$writer = new Rss20();
$writer->title = 'My site';
$writer->site_url = 'http://boo/';
$writer->feed_url = 'http://boo/feed.atom';
$writer->author = array(
    'name' => 'Me',
    'url' => 'http://me',
    'email' => 'me@here'
);

$writer->items[] = array(
    'title' => 'My article 1',
    'updated' => strtotime('-2 days'),
    'url' => 'http://foo/bar',
    'summary' => 'Super summary',
    'content' => '<p>content</p>'
);

$writer->items[] = array(
    'title' => 'My article 2',
    'updated' => strtotime('-1 day'),
    'url' => 'http://foo/bar2',
    'summary' => 'Super summary 2',
    'content' => '<p>content 2 &nbsp; &copy; 2015</p>',
    'author' => array(
        'name' => 'Me too',
    )
);

$writer->items[] = array(
    'title' => 'My article 3',
    'url' => 'http://foo/bar3'
);

echo $writer->execute();

Generate an Atom feed

require_once 'lib/PicoFeed/Writers/Atom.php';

use PicoFeed\Writers\Atom;

$writer = new Atom();
$writer->title = 'My site';
$writer->site_url = 'http://boo/';
$writer->feed_url = 'http://boo/feed.atom';
$writer->author = array(
    'name' => 'Me',
    'url' => 'http://me',
    'email' => 'me@here'
);

$writer->items[] = array(
    'title' => 'My article 1',
    'updated' => strtotime('-2 days'),
    'url' => 'http://foo/bar',
    'summary' => 'Super summary',
    'content' => '<p>content</p>'
);

echo $writer->execute();

Get log messages

Call this code to get all the debug output

print_r(PicoFeed\Logging::$messages);

The output will look similar to this

Array
(
    [0] => Fetch URL: http://petitcodeur.fr/feed.xml
    [1] => Etag:
    [2] => Last-Modified:
    [3] => cURL total time: 0.711378
    [4] => cURL dns lookup time: 0.001064
    [5] => cURL connect time: 0.100733
    [6] => cURL speed download: 74825
    [7] => HTTP status code: 200
    [8] => HTTP headers: Set-Cookie => start=R2701971637; path=/; expires=Sat, 06-Jul-2013 05:16:33 GMT
    [9] => HTTP headers: Date => Sat, 06 Jul 2013 03:55:52 GMT
    [10] => HTTP headers: Content-Type => application/xml
    [11] => HTTP headers: Content-Length => 53229
    [12] => HTTP headers: Connection => close
    [13] => HTTP headers: Server => Apache
    [14] => HTTP headers: Last-Modified => Tue, 02 Jul 2013 03:26:02 GMT
    [15] => HTTP headers: ETag => "393e79c-cfed-4e07ee78b2680"
    [16] => HTTP headers: Accept-Ranges => bytes
)

Override the blacklists/whitelists of the content filter

These variables are static arrays; you can extend the existing array or replace it.

For example, to add a new entry to the iframe whitelist

Filter::$iframe_whitelist[] = 'http://www.kickstarter.com';

Or to replace the entire whitelist

Filter::$iframe_whitelist = array('http://www.kickstarter.com');
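
The difference between extending and replacing a static array can be seen with a small stand-in class. DemoFilter below is hypothetical, and the prefix check is only an assumed way such a whitelist could be applied. See the Filter class for the real behaviour.

```php
<?php
// Hypothetical stand-in for illustration; this is not PicoFeed's Filter class.
class DemoFilter
{
    public static $iframe_whitelist = array(
        'http://www.youtube.com',
        'http://player.vimeo.com',
    );

    // Assumed rule: an iframe is kept when its URL starts with a whitelisted prefix
    public static function isAllowedIframe($src)
    {
        foreach (self::$iframe_whitelist as $prefix) {
            if (strpos($src, $prefix) === 0) {
                return true;
            }
        }

        return false;
    }
}

// Extending: the default entries are still accepted
DemoFilter::$iframe_whitelist[] = 'http://www.kickstarter.com';
var_dump(DemoFilter::isAllowedIframe('http://www.youtube.com/embed/abc'));

// Replacing: only the new entry is accepted from now on
DemoFilter::$iframe_whitelist = array('http://www.kickstarter.com');
var_dump(DemoFilter::isAllowedIframe('http://www.youtube.com/embed/abc'));
```

Because the arrays are static, a change made once applies to every later filtering operation in the same PHP process.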

Available variables

// Allow only specified tags and attributes
Filter::$whitelist_tags

// Strip content of these tags
Filter::$blacklist_tags

// Allow only specified URI scheme
Filter::$whitelist_scheme

// List of attributes used for external resources: src and href
Filter::$media_attributes

// Blacklist of external resources
Filter::$media_blacklist

// Required attributes for tags, if the attribute is missing the tag is dropped
Filter::$required_attributes

// Add attribute to specified tags
Filter::$add_attributes

// Integer Attributes
Filter::$integer_attributes

// Iframe allowed source
Filter::$iframe_whitelist

For more details, have a look at the Filter class.

How does the content grabber work?

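First, try the rules (XPath patterns) for the domain name (see PicoFeed\Rules\)
Then, try to find the text content by using common values of the class and id attributes
Finally, if nothing is found, display the feed content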
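
The three steps above can be sketched with PHP's DOM extension. This is a simplified illustration, not the grabber's actual implementation: the function names and the list of guessed attribute values (article, content, post, entry) are assumptions.

```php
<?php
// Hypothetical sketch for illustration; not PicoFeed's grabber code.

// Return the text of the first pattern that matches a non-empty node
function firstMatch(DOMXPath $xpath, array $patterns)
{
    foreach ($patterns as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);

            if ($text !== '') {
                return $text;
            }
        }
    }

    return '';
}

function grabContent($html, array $rulePatterns, $feedContent)
{
    // Real-world HTML is rarely well-formed, so parser warnings are silenced
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Step 1: patterns coming from the rules file of the domain
    $content = firstMatch($xpath, $rulePatterns);

    // Step 2: guess with common id and class values (assumed list)
    if ($content === '') {
        $guesses = array();

        foreach (array('article', 'content', 'post', 'entry') as $name) {
            $guesses[] = '//*[@id="'.$name.'"]';
            $guesses[] = '//*[contains(@class, "'.$name.'")]';
        }

        $content = firstMatch($xpath, $guesses);
    }

    // Step 3: fall back to the content provided by the feed
    return $content !== '' ? $content : $feedContent;
}

$html = '<html><body><div id="content"><p>Full text</p></div></body></html>';
echo grabContent($html, array('//div[@class="story-body"]'), 'Feed summary');
```

Here the rule pattern does not match, so the second step finds the element with the id "content".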

The content downloader uses a fake user agent, currently Google Chrome under Mac OS X.

The best results are obtained with XPath rules files.

There is a PHP script inside PicoFeed to import the Fivefilters rules, but I don't use it because most of those patterns are obsolete.

How to write a grabber rules file?

Add a PHP file to the directory PicoFeed\Rules; the filename must be the domain name

For example, for the BBC website, www.bbc.co.uk.php

<?php
return array(
    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
    'body' => array(
        '//div[@class="story-body"]',
    ),
    'strip' => array(
        '//script',
        '//form',
        '//style',
        '//*[@class="story-date"]',
        '//*[@class="story-header"]',
        '//*[@class="story-related"]',
        '//*[contains(@class, "byline")]',
        '//*[contains(@class, "story-feature")]',
        '//*[@id="video-carousel-container"]',
        '//*[@id="also-related-links"]',
        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
    )
);

Currently, only body, strip and test_url are supported.
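
To check a rule before submitting it, something like the sketch below can be used to see what the body and strip patterns select. The helper applyRule is hypothetical and only approximates what a grabber might do with these two keys.

```php
<?php
// Hypothetical helper for illustration; not part of PicoFeed.
function applyRule(array $rule, $html)
{
    // Real-world HTML is rarely well-formed, so parser warnings are silenced
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Remove every node matched by a "strip" pattern
    $strip = isset($rule['strip']) ? $rule['strip'] : array();

    foreach ($strip as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes === false) {
            continue;
        }

        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Return the HTML of the first node matched by a "body" pattern
    // (passing a node to saveHTML() requires PHP >= 5.3.6)
    foreach ($rule['body'] as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes !== false && $nodes->length > 0) {
            return $dom->saveHTML($nodes->item(0));
        }
    }

    return '';
}

$rule = array(
    'body' => array('//div[@class="story-body"]'),
    'strip' => array('//script', '//*[@class="story-date"]'),
);

$html = '<html><body><div class="story-body"><p>Hello</p>'
    .'<script>track()</script><span class="story-date">today</span>'
    .'</div></body></html>';

echo applyRule($rule, $html);
```

The script and the date are removed first, so the returned fragment only contains the paragraph inside the story-body element.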

Don't forget to send a pull request or a ticket to share your contribution with everybody

How to use the content grabber?

require 'vendor/PicoFeed/Reader.php';

use PicoFeed\Reader;

$reader = new Reader;
$reader->download('http://www.egscomics.com/rss.php');

$parser = $reader->getParser();

if ($parser !== false) {

    $parser->grabber = true; // <= Enable the content grabber
    $feed = $parser->execute();
    // ...
}

When the content grabber is enabled, everything will be slower, because each item requires a new HTTP request and the downloaded HTML is parsed with XML/XPath.

List of content grabber rules

If you want to add new rules, just open a ticket and I will do it.

*.blog.lemonde.fr
*.blog.nytimes.com
*.nytimes.com
*.phoronix.com
*.slate.com
*.theguardian.com
*.wikipedia.org
*.wired.com
*.wsj.com
github.com
golem.de
ing.dk
karriere.jobfinder.dk
lifehacker.com
lists.*
medium.com
pastebin.com
plus.google.com
rue89.com
smallhousebliss.com
spiegel.de
techcrunch.com
version2.dk
www.bbc.co.uk
www.businessweek.com
www.cnn.com
www.egscomics.com
www.forbes.com
www.lemonde.fr
www.lepoint.fr
www.npr.org
www.numerama.com
www.slate.fr
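
Some names in this list contain a wildcard. One way to picture how a URL could be mapped to a rule name is shown below. The function findRuleName and the shell-style matching with fnmatch() are assumptions for illustration; the real lookup may differ.

```php
<?php
// Hypothetical helper for illustration; not PicoFeed's rule loader.
function findRuleName($url, array $ruleNames)
{
    $host = parse_url($url, PHP_URL_HOST);

    if (! is_string($host)) {
        return null;
    }

    $host = strtolower($host);

    // An exact domain name wins over a wildcard
    if (in_array($host, $ruleNames, true)) {
        return $host;
    }

    // Otherwise, take the first wildcard name that matches the host
    foreach ($ruleNames as $name) {
        if (strpos($name, '*') !== false && fnmatch($name, $host)) {
            return $name;
        }
    }

    return null;
}

$names = array('*.nytimes.com', 'lists.*', 'www.bbc.co.uk');

var_dump(findRuleName('http://www.bbc.co.uk/news/world-middle-east-23911833', $names));
var_dump(findRuleName('http://lists.debian.org/debian-user/', $names));
var_dump(findRuleName('http://example.org/', $names));
```

With this kind of matching, a name like *.nytimes.com covers subdomains only and would not match the bare domain nytimes.com.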
