includable-deps / picofeed
Feeds writer/reader (alternative to SimplePie)
1.0.1
2019-04-03 15:34 UTC
Requires
- php: >=5.3.0
This package is not auto-updated.
Last update: 2024-09-12 18:15:23 UTC
README
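PicoFeed was originally developed for Miniflux, a minimalist and open source news reader.
However, this library can be used in any project. PicoFeed is tested with a large number of different feeds and is simple to use.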
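Features
- Simple and fast
- Feed parser for Atom 1.0 and RSS (0.91, 0.92, 1.0 and 2.0)
- Feed writer for Atom 1.0 and RSS 2.0
- Import/export of OPML subscriptions
- Content filter: HTML cleanup, removal of pixel trackers and ads
- Multiple HTTP client adapters: cURL or Stream Context
- Content grabber: downloads the full content from the original website
- License: Unlicense http://unlicense.org/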
Requirements
- PHP >= 5.3
- libxml >= 2.7
- XML PHP extensions: DOM and SimpleXML
- cURL or Stream Context (allow_url_fopen=On)
Usage
Import an OPML file
require 'vendor/PicoFeed/Import.php';
use PicoFeed\Import;
$opml = file_get_contents('mySubscriptions.opml');
$import = new Import($opml);
$entries = $import->execute();
print_r($entries);
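For orientation, an OPML subscription list is a small XML document whose `outline` elements carry the feed attributes. The sketch below does not use PicoFeed: it reads the conventional `text`, `htmlUrl` and `xmlUrl` attributes with SimpleXML to show the kind of data an importer has to extract. The sample document and the helper name `listOpmlFeeds` are illustrative assumptions, and the array shape is not necessarily what `Import::execute()` returns.

```php
<?php
// Hypothetical helper, not part of PicoFeed: list the feeds found in an OPML string.
function listOpmlFeeds($opml)
{
    $feeds = array();
    $xml = simplexml_load_string($opml);

    if ($xml === false) {
        return $feeds; // not well-formed XML
    }

    // Outlines may be nested inside category outlines, so search the whole document.
    foreach ($xml->xpath('//outline[@xmlUrl]') as $outline) {
        $feeds[] = array(
            'title'    => (string) $outline['text'],
            'site_url' => (string) $outline['htmlUrl'],
            'feed_url' => (string) $outline['xmlUrl'],
        );
    }

    return $feeds;
}

$opml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="Tech">
      <outline text="Site title" type="rss"
               htmlUrl="http://petitcodeur.fr/"
               xmlUrl="http://petitcodeur.fr/feed.xml"/>
    </outline>
  </body>
</opml>
XML;

print_r(listOpmlFeeds($opml));
```

The category outline ("Tech") has no `xmlUrl` attribute, so only the nested feed entry is returned.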
Export to OPML
require 'vendor/PicoFeed/Export.php';
use PicoFeed\Export;
$feeds = array(
    array(
        'title' => 'Site title',
        'description' => 'Optional description',
        'site_url' => 'http://petitcodeur.fr/',
        'site_feed' => 'http://petitcodeur.fr/feed.xml'
    )
);
$export = new Export($feeds);
$opml = $export->execute();
echo $opml; // XML content
Download and parse a feed
require 'vendor/PicoFeed/Reader.php';
use PicoFeed\Reader;
$reader = new Reader;
// Try to discover the XML feed automatically
$reader->download('http://petitcodeur.fr/');
$parser = $reader->getParser();
if ($parser !== false) {
    $feed = $parser->execute();
    echo $feed->title;
    echo $feed->url;
    print_r($feed->items);
}
Handle HTTP caching
require 'vendor/PicoFeed/Reader.php';
use PicoFeed\Reader;
$reader = new Reader;
// Load the Last-Modified and ETag values saved from the previous request
$lastModified = '...';
$etag = '...';
// Download the feed directly
$resource = $reader->download('http://petitcodeur.fr/feed.xml', $lastModified, $etag);
if ($resource->isModified()) {
    $parser = $reader->getParser();
    if ($parser !== false) {
        $feed = $parser->execute();
        echo $feed->title;
        echo $feed->url;
        print_r($feed->items);
        // Save the cache information for the next request
        $lastModified = $resource->getLastModified();
        $etag = $resource->getEtag();
    }
}
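The snippet above leaves open where `$lastModified` and `$etag` come from. One option is to keep them in a small JSON file per feed between runs. The following is a minimal sketch under that assumption; the helper names `loadCacheInfo` and `saveCacheInfo` and the file layout are hypothetical and not part of PicoFeed.

```php
<?php
// Hypothetical helpers: persist the HTTP validators of a feed between two runs.

// Return array($lastModified, $etag); both are empty strings on the first run.
function loadCacheInfo($file)
{
    if (is_readable($file)) {
        $data = json_decode(file_get_contents($file), true);

        if (is_array($data)) {
            return array(
                isset($data['last_modified']) ? $data['last_modified'] : '',
                isset($data['etag']) ? $data['etag'] : '',
            );
        }
    }

    return array('', '');
}

// Write both values to the cache file; returns true on success.
function saveCacheInfo($file, $lastModified, $etag)
{
    $json = json_encode(array(
        'last_modified' => $lastModified,
        'etag' => $etag,
    ));

    return file_put_contents($file, $json) !== false;
}

// One cache file per feed URL, stored in the system temp directory.
$cacheFile = sys_get_temp_dir().'/feed_'.md5('http://petitcodeur.fr/feed.xml').'.json';

list($lastModified, $etag) = loadCacheInfo($cacheFile);

// ... call $reader->download($url, $lastModified, $etag) here, then after a
// successful parse store the values returned by the resource:
saveCacheInfo($cacheFile, 'Tue, 02 Jul 2013 03:26:02 GMT', '"393e79c-cfed-4e07ee78b2680"');
```

In real code the two literal values passed to `saveCacheInfo()` would be `$resource->getLastModified()` and `$resource->getEtag()`.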
Change the user agent and the connection timeout
$reader->download(
    'http://petitcodeur.fr/',
    'last modified date',
    'etag value',
    10, // connection timeout
    'My RSS reader user agent'
);
Use an HTTP proxy
Call the static method proxy() before doing anything else
PicoFeed\Client::proxy($hostname, $port);
If your proxy is protected by a login/password
PicoFeed\Client::proxy($hostname, $port, $username, $password);
Generate an RSS 2.0 feed
require_once 'lib/PicoFeed/Writers/Rss20.php';
use PicoFeed\Writers\Rss20;
$writer = new Rss20();
$writer->title = 'My site';
$writer->site_url = 'http://boo/';
$writer->feed_url = 'http://boo/feed.atom';
$writer->author = array(
    'name' => 'Me',
    'url' => 'http://me',
    'email' => 'me@here'
);
$writer->items[] = array(
    'title' => 'My article 1',
    'updated' => strtotime('-2 days'),
    'url' => 'http://foo/bar',
    'summary' => 'Super summary',
    'content' => '<p>content</p>'
);
$writer->items[] = array(
    'title' => 'My article 2',
    'updated' => strtotime('-1 day'),
    'url' => 'http://foo/bar2',
    'summary' => 'Super summary 2',
    'content' => '<p>content 2 © 2015</p>',
    'author' => array(
        'name' => 'Me too',
    )
);
$writer->items[] = array(
    'title' => 'My article 3',
    'url' => 'http://foo/bar3'
);
echo $writer->execute();
Generate an Atom feed
require_once 'lib/PicoFeed/Writers/Atom.php';
use PicoFeed\Writers\Atom;
$writer = new Atom();
$writer->title = 'My site';
$writer->site_url = 'http://boo/';
$writer->feed_url = 'http://boo/feed.atom';
$writer->author = array(
    'name' => 'Me',
    'url' => 'http://me',
    'email' => 'me@here'
);
$writer->items[] = array(
    'title' => 'My article 1',
    'updated' => strtotime('-2 days'),
    'url' => 'http://foo/bar',
    'summary' => 'Super summary',
    'content' => '<p>content</p>'
);
echo $writer->execute();
Get log messages
You can get all the debug output by calling this code
print_r(PicoFeed\Logging::$messages);
The output will look similar to this
Array
(
    [0] => Fetch URL: http://petitcodeur.fr/feed.xml
    [1] => Etag:
    [2] => Last-Modified:
    [3] => cURL total time: 0.711378
    [4] => cURL dns lookup time: 0.001064
    [5] => cURL connect time: 0.100733
    [6] => cURL speed download: 74825
    [7] => HTTP status code: 200
    [8] => HTTP headers: Set-Cookie => start=R2701971637; path=/; expires=Sat, 06-Jul-2013 05:16:33 GMT
    [9] => HTTP headers: Date => Sat, 06 Jul 2013 03:55:52 GMT
    [10] => HTTP headers: Content-Type => application/xml
    [11] => HTTP headers: Content-Length => 53229
    [12] => HTTP headers: Connection => close
    [13] => HTTP headers: Server => Apache
    [14] => HTTP headers: Last-Modified => Tue, 02 Jul 2013 03:26:02 GMT
    [15] => HTTP headers: ETag => "393e79c-cfed-4e07ee78b2680"
    [16] => HTTP headers: Accept-Ranges => bytes
)
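Override the blacklists/whitelists of the content filter
These variables are static arrays, so you can extend them or replace them entirely.
For example, to add a new entry to the iframe whitelist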
Filter::$iframe_whitelist[] = 'http://www.kickstarter.com';
Or to replace the entire whitelist
Filter::$iframe_whitelist = array('http://www.kickstarter.com');
Available variables
// Allow only specified tags and attributes
Filter::$whitelist_tags
// Strip content of these tags
Filter::$blacklist_tags
// Allow only specified URI scheme
Filter::$whitelist_scheme
// List of attributes used for external resources: src and href
Filter::$media_attributes
// Blacklist of external resources
Filter::$media_blacklist
// Required attributes for tags, if the attribute is missing the tag is dropped
Filter::$required_attributes
// Add attribute to specified tags
Filter::$add_attributes
// Integer Attributes
Filter::$integer_attributes
// Iframe allowed source
Filter::$iframe_whitelist
For more details, have a look at the Filter class.
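One plausible way to apply a list such as `$iframe_whitelist` is a prefix comparison against the iframe's `src` attribute. The sketch below illustrates that idea only; it is an assumption about how such a list can be used, not a copy of the `Filter` class, and the function `isIframeAllowed` and the sample URLs are hypothetical.

```php
<?php
// Hypothetical helper: accept an iframe only if its src starts with a whitelisted prefix.
function isIframeAllowed($src, array $whitelist)
{
    foreach ($whitelist as $prefix) {
        if (strpos($src, $prefix) === 0) {
            return true;
        }
    }

    return false;
}

$whitelist = array('http://www.youtube.com', 'http://player.vimeo.com');
$whitelist[] = 'http://www.kickstarter.com'; // extend the list, as in the example above

var_dump(isIframeAllowed('http://www.kickstarter.com/projects/1/widget/video.html', $whitelist)); // bool(true)
var_dump(isIframeAllowed('http://ads.example.org/frame.html', $whitelist)); // bool(false)
```

A plain prefix test also accepts a host such as `http://www.kickstarter.com.evil.example/`, so a stricter check would compare the host returned by `parse_url()` instead.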
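How does the content grabber work?
- First, try the rules (XPath patterns) defined for the domain name (see PicoFeed\Rules\)
- Then, try to find the text content by using common class and id attributes
- Finally, if nothing is found, the feed content is displayed
The content downloader sends a fake user agent, currently that of Google Chrome on Mac OS X.
The best results are obtained with XPath rules files.
There is a PHP script inside PicoFeed to import Fivefilters rules, but I don't use it because most of these patterns are obsolete.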
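The three-step strategy can be sketched with the DOM extension. This is a simplified illustration, not the library's code: the function `extractContent`, the candidate class and id names, and the sample HTML are assumptions chosen for the example.

```php
<?php
// Hypothetical sketch of the fallback chain: domain rules, then common
// class/id attributes, then the content shipped in the feed itself.
function extractContent($html, array $rulePatterns, $feedContent)
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; silence libxml parse warnings.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Step 2 candidates (assumed names, a real list would be longer).
    $common = array(
        '//*[contains(@class, "post-content")]',
        '//*[contains(@class, "entry-content")]',
        '//*[@id="content"]',
    );

    // Step 1 patterns are tried before the generic ones.
    foreach (array_merge($rulePatterns, $common) as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes !== false && $nodes->length > 0) {
            return $dom->saveXML($nodes->item(0));
        }
    }

    // Step 3: nothing matched, keep what the feed provided.
    return $feedContent;
}

$html = '<html><body><div id="menu">Menu</div>'
      .'<div class="entry-content"><p>Full text</p></div></body></html>';

// No domain rule matches here, so the generic "entry-content" candidate is used.
echo extractContent($html, array('//div[@class="story-body"]'), 'Feed summary'), "\n";

// Neither a rule nor a common attribute matches: the feed content is returned.
echo extractContent('<html><body><p>Nothing useful</p></body></html>', array(), 'Feed summary'), "\n";
```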
How to write a grabber rules file?
Add a PHP file to the directory PicoFeed\Rules; the filename must be the domain name.
Example for the BBC website, www.bbc.co.uk.php
<?php
return array(
    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
    'body' => array(
        '//div[@class="story-body"]',
    ),
    'strip' => array(
        '//script',
        '//form',
        '//style',
        '//*[@class="story-date"]',
        '//*[@class="story-header"]',
        '//*[@class="story-related"]',
        '//*[contains(@class, "byline")]',
        '//*[contains(@class, "story-feature")]',
        '//*[@id="video-carousel-container"]',
        '//*[@id="also-related-links"]',
        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
    )
);
Currently, only body, strip and test_url are supported.
Don't forget to send a pull request or open a ticket to share your contribution with everybody.
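To make the roles of `body` and `strip` concrete, the sketch below applies a rule array of this shape to an HTML string with `DOMXPath`: nodes matched by `strip` are removed, then the nodes matched by `body` are serialized. It is an illustration written for this README, not PicoFeed's grabber; `applyRule` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: apply the "strip" and "body" XPath patterns of a rule.
function applyRule($html, array $rule)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // tolerate invalid markup
    $xpath = new DOMXPath($dom);

    // Remove unwanted nodes first so they do not end up in the body.
    $strip = isset($rule['strip']) ? $rule['strip'] : array();

    foreach ($strip as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes === false) {
            continue; // invalid XPath expression
        }

        // Copy to an array so that removal cannot disturb the iteration.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize every node matched by the "body" patterns.
    $content = '';
    $body = isset($rule['body']) ? $rule['body'] : array();

    foreach ($body as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes !== false) {
            foreach ($nodes as $node) {
                $content .= $dom->saveXML($node);
            }
        }
    }

    return $content;
}

$rule = array(
    'body' => array('//div[@class="story-body"]'),
    'strip' => array('//script', '//*[@class="story-date"]'),
);

$html = '<html><body><div class="story-body"><span class="story-date">1 Sept</span>'
      .'<p>Article text</p><script>track();</script></div></body></html>';

echo applyRule($html, $rule), "\n";
```

With the sample markup, the date and the script are dropped and only the article paragraph remains inside the `story-body` container.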
How to use the content grabber?
require 'vendor/PicoFeed/Reader.php';
use PicoFeed\Reader;
$reader = new Reader;
$reader->download('http://www.egscomics.com/rss.php');
$parser = $reader->getParser();
if ($parser !== false) {
    $parser->grabber = true; // <= Enable the content grabber
    $feed = $parser->execute();
    // ...
}
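When the content grabber is enabled, everything is slower: each item requires a new HTTP request, and the downloaded HTML is parsed with XML/XPath.
List of content grabber rules
If you want to add new rules, just open a ticket and I will do it.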
- *.blog.lemonde.fr
- *.blog.nytimes.com
- *.nytimes.com
- *.phoronix.com
- *.slate.com
- *.theguardian.com
- *.wikipedia.org
- *.wired.com
- *.wsj.com
- github.com
- golem.de
- ing.dk
- karriere.jobfinder.dk
- lifehacker.com
- lists.*
- medium.com
- pastebin.com
- plus.google.com
- rue89.com
- smallhousebliss.com
- spiegel.de
- techcrunch.com
- version2.dk
- www.bbc.co.uk
- www.businessweek.com
- www.cnn.com
- www.egscomics.com
- www.forbes.com
- www.lemonde.fr
- www.lepoint.fr
- www.npr.org
- www.numerama.com
- www.slate.fr
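The list mixes exact host names with wildcard entries such as `*.nytimes.com` and `lists.*`. As a rough illustration of how a host name can be matched against such entries, here is a sketch based on PHP's `fnmatch()`. The lookup order and the function `findRule` are assumptions, not necessarily how PicoFeed resolves its rule files.

```php
<?php
// Hypothetical lookup: return the first rule entry matching a host name, or null.
function findRule($hostname, array $rules)
{
    foreach ($rules as $pattern) {
        // fnmatch() understands shell wildcards: "*" matches any run of characters.
        if (fnmatch($pattern, $hostname)) {
            return $pattern;
        }
    }

    return null;
}

$rules = array('*.nytimes.com', 'github.com', 'lists.*', 'www.bbc.co.uk');

var_dump(findRule('www.nytimes.com', $rules));  // matched by "*.nytimes.com"
var_dump(findRule('lists.debian.org', $rules)); // matched by "lists.*"
var_dump(findRule('example.org', $rules));      // no rule, NULL
```

Note that with this kind of matching the bare domain `nytimes.com` is not covered by `*.nytimes.com`, because the pattern requires a dot before the domain.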