2dareis2do / newspaper3k-php-wrapper
PHP wrapper for Newspaper3k article scraping and curation
Requires
- php: >=7.0
- symfony/process: >=6.0
This package is auto-updated.
Last update: 2024-09-03 16:04:05 UTC
README
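A simple PHP wrapper for Newspaper3k/Newspaper4k article scraping and curation.
Now updated to support changing the current working directory, which lets you customise your curation script on a per-job basis.
Customised ArticleScraping.py
Here is an example of a customised ArticleScraping.py that uses the Playwright wrapper.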
#!/usr/bin/python
# -*- coding: utf8 -*-
import json, sys, os
import nltk
import newspaper
from newspaper import Article
from datetime import datetime
import lxml, lxml.html
from playwright.sync_api import sync_playwright

# Silence stdout so library output cannot corrupt the JSON printed at the end.
sys.stdout = open(os.devnull, "w")
url = sys.argv[1]

def accept_cookies_and_fetch_article(url):
    # Using Playwright to dismiss the cookie consent dialog and fetch the article
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # Set headless=False to watch the browser actions
        page = browser.new_page()
        # create a new incognito browser context
        context = browser.new_context()
        # create a new page inside context.
        page = context.new_page()
        page.goto(url)
        # Automating iframe button click
        page.frame_locator("iframe[title=\"SP Consent Message\"]").get_by_label("Essential cookies only").click()
        content = page.content()
        # dispose context once it is no longer needed.
        context.close()
        browser.close()
    # Using Newspaper4k to parse the page content
    article = newspaper.article(url, input_html=content, language='en')
    article.parse()  # Parse the article
    article.nlp()  # Keyword extraction wrapper
    return article

article = accept_cookies_and_fetch_article(url)
# article.download()  # Downloads the link's HTML content
# 1 time download of the sentence tokenizer
# perhaps better to run from command line as we don't need to install each time?
# nltk.download('all')
# nltk.download('punkt')

# Restore stdout before printing the result.
sys.stdout = sys.__stdout__
data = article.__dict__
del data['config']
del data['extractor']
# Convert values that json.dumps cannot serialise.
for i in data:
    if type(data[i]) is set:
        data[i] = list(data[i])
    if type(data[i]) is datetime:
        data[i] = data[i].strftime("%Y-%m-%d %H:%M:%S")
    if type(data[i]) is lxml.html.HtmlElement:
        data[i] = lxml.html.tostring(data[i])
    if type(data[i]) is bytes:
        # Decode rather than str(), which would yield a "b'...'" literal.
        data[i] = data[i].decode("utf-8", errors="replace")
print(json.dumps(data))
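The serialisation loop at the end of the script above can also be written as a small pure function, which is easier to test in isolation. The following is a minimal sketch, not part of the package: the names `to_json_safe` and `serialize` are hypothetical, and lxml elements are left out so that the example needs only the standard library.

```python
import json
from datetime import datetime


def to_json_safe(value):
    """Convert one attribute value into something json.dumps accepts."""
    if isinstance(value, (set, frozenset)):
        # Sets have no JSON equivalent; sort them for stable output.
        return sorted(value, key=str)
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def serialize(attributes, skip=("config", "extractor")):
    """Return a JSON string without mutating the input mapping."""
    cleaned = {
        key: to_json_safe(val)
        for key, val in attributes.items()
        if key not in skip
    }
    return json.dumps(cleaned)


if __name__ == "__main__":
    sample = {
        "title": "Example",
        "keywords": {"news", "php"},
        "publish_date": datetime(2024, 9, 3, 16, 4, 5),
        "config": object(),  # skipped, as in the script above
    }
    print(serialize(sample))
```

Building a new dictionary instead of deleting keys from `article.__dict__` leaves the article object intact, which may matter if the script goes on to use it.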
Using Newspaper3kWrapper
In this simplified example, we simply pass the current working directory to Newspaper3kWrapper.
use Twodareis2do\Scrape\Newspaper3kWrapper;

try {
    // initiate the parser
    $this->parser = new Newspaper3kWrapper();
    // If no $cwd then use default 'ArticleScraping.py'
    if (isset($cwd)) {
        $output = $this->parser->scrape($value, $debug, $cwd);
    }
    else {
        $output = $this->parser->scrape($value, $debug);
    }
    // return any scraped output
    return $output;
}
catch (\Exception $e) {
    // Logs a notice to channel if we get http error response.
    $this->logger->notice('Newspaper Playwright Failed to get (1) URL @url "@error". @code', [
        '@url' => $value,
        '@error' => $e->getMessage(),
        '@code' => $e->getCode(),
    ]);
    // return empty string
    return '';
}
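The effect of the working-directory argument can be illustrated in Python. This is a sketch under an assumption: the README does not show how the wrapper launches the script, so the example simply assumes a child process is started in the given directory and that a relative `ArticleScraping.py` is therefore resolved there. The names `run_scraper` and `demo` are hypothetical, and the stub script stands in for a real scraper.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

SCRIPT_NAME = "ArticleScraping.py"  # file name taken from this README


def run_scraper(url, cwd):
    """Run the ArticleScraping.py found in `cwd` and decode its JSON output."""
    result = subprocess.run(
        [sys.executable, SCRIPT_NAME, url],
        cwd=cwd,              # the relative script path is resolved here
        capture_output=True,
        text=True,
        check=True,           # raise if the script exits with an error
    )
    return json.loads(result.stdout)


def demo():
    """Create a per-job directory with a stub script and run it."""
    stub = (
        "import json, sys\n"
        "print(json.dumps({'url': sys.argv[1], 'title': 'stub'}))\n"
    )
    with tempfile.TemporaryDirectory() as job_dir:
        Path(job_dir, SCRIPT_NAME).write_text(stub, encoding="utf-8")
        return run_scraper("https://example.com/story", cwd=job_dir)


if __name__ == "__main__":
    print(demo())
```

Each job can then ship its own directory containing a differently customised script, with the file name staying the same.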
Alternative article scraping script
You can change the path to ArticleScraping.py by passing in the cwd. Here is an example that uses the Cloudscraper library.
#!/usr/bin/python
# -*- coding: utf8 -*-
import json, sys, os
import nltk
from newspaper import Article
from newspaper import Config
from newspaper.article import ArticleException, ArticleDownloadState
from datetime import datetime
import lxml, lxml.html
import cloudscraper

browser = {
    'browser': 'chrome',
    'platform': 'android',
    'desktop': False
}
# Pass the profile by keyword: the first positional parameter of
# create_scraper is an existing session, not the browser profile.
scraper = cloudscraper.create_scraper(browser=browser)  # returns a CloudScraper instance

# Silence stdout so library output cannot corrupt the JSON printed at the end.
sys.stdout = open(os.devnull, "w")
url = sys.argv[1]

scraped = scraper.get(url).text
article = Article(url)
# Hand the pre-fetched HTML to newspaper. Unlike assigning article.html
# directly, this also sets download_state to SUCCESS.
article.download(input_html=scraped)
ds = article.download_state

if ds == ArticleDownloadState.SUCCESS:
    article.parse()  # Parse the article
    # 1 time download of the sentence tokenizer
    # perhaps better to run from command line as we don't need to install each time?
    # nltk.download('all')
    # nltk.download('punkt')
    article.nlp()  # Keyword extraction wrapper

    # Restore stdout before printing the result.
    sys.stdout = sys.__stdout__
    data = article.__dict__
    del data['config']
    del data['extractor']
    # Convert values that json.dumps cannot serialise.
    for i in data:
        if type(data[i]) is set:
            data[i] = list(data[i])
        if type(data[i]) is datetime:
            data[i] = data[i].strftime("%Y-%m-%d %H:%M:%S")
        if type(data[i]) is lxml.html.HtmlElement:
            data[i] = lxml.html.tostring(data[i])
        if type(data[i]) is bytes:
            # Decode rather than str(), which would yield a "b'...'" literal.
            data[i] = data[i].decode("utf-8", errors="replace")
    print(json.dumps(data))
elif ds == ArticleDownloadState.FAILED_RESPONSE:
    pass
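The script above prints a JSON object only when the download state is SUCCESS, and otherwise prints nothing. Whatever reads its output therefore has to cope with an empty result. A minimal sketch of that check follows; `decode_scraper_output` is a hypothetical helper, not something the package provides.

```python
import json


def decode_scraper_output(stdout):
    """Return the article dict, or None when the script printed nothing usable."""
    text = stdout.strip()
    if not text:
        # Nothing was printed, e.g. the download did not succeed.
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Stray library output or a truncated write.
        return None
    # The scripts in this README always emit a JSON object.
    return data if isinstance(data, dict) else None


if __name__ == "__main__":
    print(decode_scraper_output('{"title": "Example"}'))
    print(decode_scraper_output(""))
```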
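Features
- Multi-threaded article download framework
- News URL identification
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages (English, Chinese, German, Arabic, ...)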
>>> import newspaper
>>> newspaper.languages()
Your available languages are:
input code full name
ar Arabic
be Belarusian
bg Bulgarian
da Danish
de German
el Greek
en English
es Spanish
et Estonian
fa Persian
fi Finnish
fr French
he Hebrew
hi Hindi
hr Croatian
hu Hungarian
id Indonesian
it Italian
ja Japanese
ko Korean
lt Lithuanian
mk Macedonian
nb Norwegian (Bokmål)
nl Dutch
no Norwegian
pl Polish
pt Portuguese
ro Romanian
ru Russian
sl Slovenian
sr Serbian
sv Swedish
sw Swahili
th Thai
tr Turkish
uk Ukrainian
vi Vietnamese
zh Chinese
Get it now
Run ✅ pip3 install newspaper3k ✅
NOT ⛔ pip3 install newspaper ⛔
On python3 you must install newspaper3k, not newspaper. newspaper is our python2 library. Although installing newspaper is simple with pip <http://www.pip-installer.org/>, you may run into fixable issues if you are trying to install on ubuntu.
If you are on Debian / Ubuntu, install using the following:
- Install the pip3 command needed to install the newspaper3k package:
  $ sudo apt-get install python3-pip
- Python development version, needed for Python.h:
  $ sudo apt-get install python3-dev
- lxml requirements:
  $ sudo apt-get install libxml2-dev libxslt-dev
- For PIL to recognize .jpg images:
  $ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev
  NOTE: If you have problems installing libpng12-dev, try installing libpng-dev.
- Download NLP related corpora:
  $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
- Install the distribution via pip:
  $ pip3 install newspaper3k
If you are on OSX, install using the following; you may use either homebrew or macports:
$ brew install libxml2 libxslt
$ brew install libtiff libjpeg webp little-cms2
$ pip3 install newspaper3k
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
Otherwise, install with the following:
NOTE: You will still most likely need to install the following libraries via your package manager.
- PIL: libjpeg-dev zlib1g-dev libpng12-dev
- lxml: libxml2-dev libxslt-dev
- Python development version: python3-dev
$ pip3 install newspaper3k
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
Installation
composer require 2dareis2do/newspaper3k-php-wrapper
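One-time download of the sentence tokenizer
After installing the NLTK package, install the datasets/models that specific functions need in order to work.
In particular, you will need the Punkt sentence tokenizer.
For example:
$ python
This loads the python interpreter:
>>> import nltk
>>> nltk.download('all')
or
>>> nltk.download('punkt')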
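To avoid repeating the download on every run, a script can first check whether the tokenizer is already available. The following is a minimal sketch, assuming NLTK is installed and that the resource lives under `tokenizers/punkt`; `ensure_punkt` is a hypothetical helper name.

```python
def ensure_punkt():
    """Download the Punkt tokenizer only if NLTK cannot already find it.

    Returns True when a download was attempted, False when the data was present.
    """
    import nltk  # imported lazily so this module loads even without NLTK

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find("tokenizers/punkt")
        return False
    except LookupError:
        nltk.download("punkt", quiet=True)
        return True


if __name__ == "__main__":
    print("downloaded" if ensure_punkt() else "already installed")
```

Running this once from the command line, rather than inside ArticleScraping.py, keeps the per-article scraping path free of network calls to the NLTK data server.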
Usage
use Twodareis2do\Scrape\Newspaper3kWrapper;

$parser = new Newspaper3kWrapper();
$parser->scrape('your url');