piedweb/seo-pocket-crawler

A web crawler that checks a few SEO basics.

v0.0.7 2021-01-20 16:29 UTC

This package is auto-updated.

Last update: 2024-09-21 00:24:27 UTC


README


CLI Seo Pocket Crawler


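A web crawler that checks a few SEO basics.

Open the collected data in your favorite spreadsheet software, or retrieve it with your favorite language.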
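As an illustration of retrieving the collected data from a language rather than a spreadsheet, here is a minimal Python sketch. It assumes the crawl produced a comma-separated file with a header row (the `data.csv` mentioned further down); the file location and the column names are not documented here, so the helper names `load_rows` and `summarize` are hypothetical and the code deliberately avoids assuming any specific column.

```python
import csv
from pathlib import Path


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row.

    Assumption: the file is comma-separated, UTF-8, with a header line.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize(rows):
    """Return (row count, column names) without assuming any column name."""
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns
```

Calling `summarize(load_rows("path/to/data.csv"))` is a quick way to discover which columns a crawl actually recorded before filtering or sorting on them.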

French documentation is available: https://piedweb.com/seo/crawler

Installation

Via Packagist

$ composer create-project piedweb/seo-pocket-crawler

Usage

Crawler CLI

$ bin/console crawler:go $start

Arguments

  start                            Define where the crawl starts. E.g.: https://piedweb.com
                                   You can specify an id from a previous crawl; other options will then be ignored.
                                   You can use `last` to continue the last crawl (just stopped).

Options

  -l, --limit=LIMIT                Define the depth limit. [default: 5]
  -i, --ignore=IGNORE              Virtual robots.txt to respect (can be a string or a URL).
  -u, --user-agent=USER-AGENT      Define the user-agent used during the crawl. [default: "SEO Pocket Crawler - PiedWeb.com/seo/crawler"]
  -w, --wait=WAIT                  Time to wait between two requests, in microseconds (0.1 s by default). [default: 100000]
  -c, --cache-method=CACHE-METHOD  Cache method used during the crawl. [default: 2]
  -r, --restart=RESTART            Restart a previous crawl. Values: 1 = fresh restart, 2 = restart from cache.
  -h, --help                       Display this help message
  -q, --quiet                      Do not output any message
  -V, --version                    Display this application version
      --ansi                       Force ANSI output
      --no-ansi                    Disable ANSI output
  -n, --no-interaction             Do not ask any interactive question
  -v|vv|vvv, --verbose             Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
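
To make the effect of a virtual robots.txt (the `--ignore` option) more concrete, the sketch below evaluates a robots.txt string against a few URLs with Python's standard `urllib.robotparser`. This only illustrates how such rules are usually interpreted; it is not the crawler's own parser, which may differ in details. The rule set, the `example.com` URLs, and the helper name `build_rules` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set, of the kind that could be passed as a string.
VIRTUAL_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse a robots.txt string into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(VIRTUAL_ROBOTS)
user_agent = "SEO Pocket Crawler"

# True when the rules allow the URL for this user agent, False otherwise.
blog_allowed = rules.can_fetch(user_agent, "https://example.com/blog/")
private_allowed = rules.can_fetch(user_agent, "https://example.com/private/page")
```

With these rules, `blog_allowed` is `True` and `private_allowed` is `False`, since only paths under `/private/` are disallowed.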



Extract all external links from a previous crawl (in 1 second)

$ bin/console crawler:external $id [--host]
    --id
        id from a previous crawl
        You can use `last` to show external links from the last crawl.

    --host -ho
        Flag to output only the hosts.
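
The difference between listing full external links and listing hosts only can be sketched in a few lines of Python. This is an illustration of the idea, not the command's implementation: it assumes a link is "external" when its hostname differs from the crawled site's hostname (so subdomains count as external here), and the function names and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def external_links(links, site_host):
    """Keep links whose hostname is set and differs from site_host."""
    return [
        link for link in links
        if urlsplit(link).hostname not in (None, site_host)
    ]


def hosts_only(links):
    """Reduce a list of absolute links to a sorted list of unique hosts."""
    return sorted({urlsplit(link).hostname for link in links})


links = [
    "https://piedweb.com/a",
    "https://example.org/x",
    "https://example.org/y",
    "/relative",
    "https://github.com/PiedWeb",
]
external = external_links(links, "piedweb.com")
hosts = hosts_only(external)
```

Here `external` keeps the two `example.org` links and the GitHub link, while `hosts` collapses them to `["example.org", "github.com"]`.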

Calculate Page Rank

This updates the previously generated data.csv. You can then explore your website with the PoC pagerank.html (served with npx http-server -c-1 --port 3000).

$ bin/console crawler:pagerank $id
    --id
        id from a previous crawl
        You can use `last` to calculate the page rank from the last crawl.
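
For readers unfamiliar with the metric, the following Python sketch shows the textbook PageRank power iteration on a small internal-link graph. It is not the calculator shipped with this package, whose details are not described here; the damping factor of 0.85, the iteration count, and the sample graph are conventional or made-up values chosen for the example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    graph maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links.
        dangling = sum(rank[page] for page in pages if not graph.get(page))
        new_rank = {
            page: (1 - damping) / n + damping * dangling / n for page in pages
        }
        for source, targets in graph.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: the home page links to two pages that link back.
site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/"]}
ranks = pagerank(site)
```

In this small graph the home page ends up with the highest score because both other pages link to it, and the scores sum to 1.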

Testing

$ composer test

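Todo

  • Better link harvesting and recording (record the context: list, nav, sentence...)
  • Transform the PoC (Page Rank Visualizer)
  • More complete Page Rank calculator (handling 301 redirects, canonical links, nofollow, etc.)

Contributing

Please see the contributing guide.

Credits

License

The MIT License (MIT). Please see the License File for more information.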
