2dareis2do/newspaper3k-php-wrapper

PHP wrapper for Newspaper3k article scraping and curation

2.0.2 2024-06-27 12:31 UTC

This package is auto-updated.

Last update: 2024-09-03 16:04:05 UTC


README


A simple PHP wrapper for Newspaper3k/4k article scraping and curation.

Now updated to support changing the current working directory, so you can customize the scraping script on a per-job basis.

Custom ArticleScraping.py

Below is an example of a custom ArticleScraping.py that uses the Playwright wrapper.

#!/usr/bin/python
# -*- coding: utf8 -*-

import json, sys, os
import nltk
import newspaper
from newspaper import Article
from datetime import datetime
import lxml, lxml.html
from playwright.sync_api import sync_playwright

sys.stdout = open(os.devnull, "w")  # suppress stdout so stray prints do not corrupt the JSON output

url = sys.argv[1]  # the URL to scrape is passed as the first argument

def accept_cookies_and_fetch_article(url):
    # Using Playwright to handle login and fetch article
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # set headless=False to watch the browser actions

        # create a new incognito browser context and a page inside it
        context = browser.new_context()
        page = context.new_page()

        page.goto(url)

        # Automating iframe button click
        page.frame_locator("iframe[title=\"SP Consent Message\"]").get_by_label("Essential cookies only").click()

        content = page.content()
        # dispose context once it is no longer needed.
        context.close()
        browser.close()

    # Using Newspaper4k to parse the page content
    article = newspaper.article(url, input_html=content, language='en')
    article.parse() # Parse the article
    article.nlp() # Keyword extraction wrapper

    return article

article = accept_cookies_and_fetch_article(url)

# article.download()  # downloads the link's HTML content
# One-time download of the sentence tokenizer; perhaps better to run
# from the command line as we don't need to install each time:
# nltk.download('all')
# nltk.download('punkt')

sys.stdout = sys.__stdout__

data = article.__dict__
del data['config']
del data['extractor']

for i in data:
    if type(data[i]) is set:
        data[i] = list(data[i])
    if type(data[i]) is datetime:
        data[i] = data[i].strftime("%Y-%m-%d %H:%M:%S")
    if type(data[i]) is lxml.html.HtmlElement:
        data[i] = lxml.html.tostring(data[i])
    if type(data[i]) is bytes:
        data[i] = str(data[i])

print(json.dumps(data))
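The type-conversion loop above can be pulled out into a small helper that makes an attribute dict JSON-safe. A stdlib-only sketch (the `json_safe` name is ours, the lxml case is left out here, and bytes are decoded rather than `str()`-wrapped):

```python
import json
from datetime import datetime

def json_safe(data):
    """Return a copy of data with values coerced to JSON-serializable types."""
    out = {}
    for key, value in data.items():
        if isinstance(value, set):
            out[key] = list(value)
        elif isinstance(value, datetime):
            out[key] = value.strftime("%Y-%m-%d %H:%M:%S")
        elif isinstance(value, bytes):
            out[key] = value.decode("utf-8", errors="replace")
        else:
            out[key] = value
    return out

payload = json_safe({
    "keywords": {"news", "python"},
    "publish_date": datetime(2024, 6, 27, 12, 31),
    "html": b"<p>hi</p>",
    "title": "Example",
})
print(json.dumps(payload, sort_keys=True))
```

Factoring the loop out like this keeps the script's tail as a single `print(json.dumps(...))`, which is the only output the PHP wrapper sees.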

Using Newspaper3kWrapper

In this simplified example we simply pass the current working directory to Newspaper3kWrapper.

use Twodareis2do\Scrape\Newspaper3kWrapper;

try {

  // initiate the parser
  $this->parser = new Newspaper3kWrapper();

  // if no $cwd is set, the default 'ArticleScraping.py' is used
  if (isset($cwd)) {
    $output = $this->parser->scrape($value, $debug, $cwd);
  }
  else {
    $output = $this->parser->scrape($value, $debug);
  }

  // return any scraped output
  return $output;

}
catch (\Exception $e) {

  // log a notice to the channel if we get an HTTP error response
  $this->logger->notice('Newspaper Playwright Failed to get (1) URL @url "@error". @code', [
    '@url' => $value,
    '@error' => $e->getMessage(),
    '@code' => $e->getCode(),
  ]);

  // return an empty string
  return '';
}
      
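Under the hood, a wrapper like this shells out to the Python script with the URL as the first argument and JSON-decodes its stdout; passing a different working directory selects a different ArticleScraping.py. A rough stdlib sketch of that round trip (`run_scraper` and its error handling are illustrative, not the wrapper's actual API):

```python
import json
import subprocess
import sys

def run_scraper(script, url, cwd=None):
    """Run a scraping script with the URL as argv[1] and decode its JSON stdout."""
    result = subprocess.run(
        [sys.executable, script, url],
        cwd=cwd,               # per-job working directory, like the $cwd argument
        capture_output=True,
        text=True,
        check=True,            # raise on a non-zero exit, like the catch block above
    )
    return json.loads(result.stdout)
```

Resolving the script relative to `cwd` is what makes the per-job customization work: each job can ship its own ArticleScraping.py in its own directory.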

Alternative article scraping script

The path to ArticleScraping.py can be changed by passing cwd. Below is an example that uses the Cloudscraper library instead.

#!/usr/bin/python
# -*- coding: utf8 -*-

import json, sys, os
import nltk
from newspaper import Article
from newspaper import Config
from newspaper.article import ArticleException, ArticleDownloadState
from datetime import datetime
import lxml, lxml.html
import cloudscraper

browser={
    'browser': 'chrome',
    'platform': 'android',
    'desktop': False
}

scraper = cloudscraper.create_scraper(browser=browser)  # returns a CloudScraper instance

sys.stdout = open(os.devnull, "w")  # suppress stdout so stray prints do not corrupt the JSON output

url = sys.argv[1]  # the URL to scrape is passed as the first argument

scraped = scraper.get(url).text

article = Article('')
# download() with input_html marks the download state as SUCCESS;
# assigning article.html directly would leave it at NOT_STARTED
article.download(input_html=scraped)

ds = article.download_state

if ds == ArticleDownloadState.SUCCESS:
    article.parse()  # parse the article
    # One-time download of the sentence tokenizer; perhaps better to run
    # from the command line as we don't need to install each time:
    # nltk.download('all')
    # nltk.download('punkt')
    article.nlp()  # keyword extraction wrapper

    sys.stdout = sys.__stdout__

    data = article.__dict__
    del data['config']
    del data['extractor']

    for i in data:
        if type(data[i]) is set:
            data[i] = list(data[i])
        if type(data[i]) is datetime:
            data[i] = data[i].strftime("%Y-%m-%d %H:%M:%S")
        if type(data[i]) is lxml.html.HtmlElement:
            data[i] = lxml.html.tostring(data[i])
        if type(data[i]) is bytes:
            data[i] = str(data[i])

    print(json.dumps(data))

elif ds == ArticleDownloadState.FAILED_RESPONSE:
    pass  # download failed; no JSON is printed, so the caller receives empty output
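When the download state comes back as FAILED_RESPONSE, one option before giving up is to retry the fetch with a short backoff. A generic sketch, not part of cloudscraper or newspaper (all names here are ours):

```python
import time

def fetch_with_retry(fetch, attempts=3, delay=1.0, backoff=2.0):
    """Call fetch() up to `attempts` times, multiplying the wait between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # in practice, catch the scraper's exception type
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
                delay *= backoff
    raise last_error

# Hypothetical usage with the scraper above:
# scraped = fetch_with_retry(lambda: scraper.get(url).text)
```

If every attempt fails, the last exception propagates, which the PHP wrapper's catch block would then log.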

Features

  • Multi-threaded article download framework
  • News URL identification
  • Text extraction from HTML
  • Top image extraction from HTML
  • All image extraction from HTML
  • Keyword extraction from text
  • Summary extraction from text
  • Author extraction from text
  • Google trending terms extraction
  • Works in 10+ languages (English, Chinese, German, Arabic, etc.)
    >>> import newspaper
    >>> newspaper.languages()

    Your available languages are:
    input code      full name

      ar              Arabic
      be              Belarusian
      bg              Bulgarian
      da              Danish
      de              German
      el              Greek
      en              English
      es              Spanish
      et              Estonian
      fa              Persian
      fi              Finnish
      fr              French
      he              Hebrew
      hi              Hindi
      hr              Croatian
      hu              Hungarian
      id              Indonesian
      it              Italian
      ja              Japanese
      ko              Korean
      lt              Lithuanian
      mk              Macedonian
      nb              Norwegian (Bokmål)
      nl              Dutch
      no              Norwegian
      pl              Polish
      pt              Portuguese
      ro              Romanian
      ru              Russian
      sl              Slovenian
      sr              Serbian
      sv              Swedish
      sw              Swahili
      th              Thai
      tr              Turkish
      uk              Ukrainian
      vi              Vietnamese
      zh              Chinese

Get it now

Run ✅ pip3 install newspaper3k

NOT ⛔ pip3 install newspaper

On python3 you must install newspaper3k, not newspaper. newspaper is our python2 library. Although installing newspaper is simple with pip <http://www.pip-installer.org/>, you will run into fixable issues if you are trying to install on ubuntu.

If you are on Debian / Ubuntu, install using the following:

  • Install the pip3 command needed to install the newspaper3k package:

    $ sudo apt-get install python3-pip

  • Python development version, needed for Python.h:

    $ sudo apt-get install python-dev

  • lxml requirements:

    $ sudo apt-get install libxml2-dev libxslt-dev

  • For PIL to recognize .jpg images:

    $ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev

NOTE: If you find problems installing libpng12-dev, try installing libpng-dev instead.

If you are on OSX, install using the following; you can use either homebrew or macports:

$ brew install libxml2 libxslt

$ brew install libtiff libjpeg webp little-cms2

$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Otherwise, install with the following:

NOTE: You will most likely still need to install the following libraries via your package manager:

  • PIL: libjpeg-dev zlib1g-dev libpng12-dev
  • lxml: libxml2-dev libxslt-dev
  • Python development version: python-dev

$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Installation

composer require 2dareis2do/newspaper3k-php-wrapper

One-time download of the sentence tokenizer

After installing the NLTK package, install the necessary datasets/models so that specific functions can work.

In particular, you will need the Punkt sentence tokenizer.

e.g.

$ python

to load the python interpreter, then:

>>> import nltk
>>> nltk.download('all')

>>> nltk.download('punkt')
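Since the scripts above keep the nltk.download() calls commented out, another option is to make the download conditional so it only runs when the resource is missing (with NLTK, `nltk.data.find('tokenizers/punkt')` raises LookupError when punkt is absent). A generic sketch of that check-then-fetch pattern, with the helper name our own:

```python
def ensure_resource(probe, fetch):
    """Call fetch() only if probe() raises LookupError (i.e. the resource is missing)."""
    try:
        probe()
        return False  # already present; nothing downloaded
    except LookupError:
        fetch()
        return True   # resource was fetched

# With NLTK this would be used as:
# import nltk
# ensure_resource(lambda: nltk.data.find('tokenizers/punkt'),
#                 lambda: nltk.download('punkt'))
```

This avoids re-downloading the tokenizer on every scraping job while still working on a fresh machine.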

Usage

use Twodareis2do\Scrape\Newspaper3kWrapper;

$parser = new Newspaper3kWrapper();

$parser->scrape('your url');

Read more

Newspaper

nltk

Scrape and summarize news articles using Python