piedweb/crawler


CLI Seo Pocket Crawler

A web crawler to check a few SEO basics.

Use the collected data in your favorite spreadsheet software, or retrieve it with your favorite programming language.

Documentation is available in French: https://piedweb.com/seo/crawler
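
For instance, here is a minimal PHP sketch that reads the crawl output back into memory. It assumes the crawl produced a data.csv file (the file referenced in the Page Rank section below); the exact location and column names depend on your crawl.

<?php
// Minimal sketch: load the crawl results into PHP.
// Assumes a data.csv file produced by a previous crawl; adjust the path to yours.
$handle = fopen('data.csv', 'r');
$header = fgetcsv($handle);                  // first row: column names
$pages = [];
while (($row = fgetcsv($handle)) !== false) {
    $pages[] = array_combine($header, $row); // one crawled URL per row
}
fclose($handle);
// $pages can now be filtered, aggregated or exported as needed.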

Installation

Via Packagist

$ composer create-project piedweb/crawler

Usage

Crawler CLI

$ bin/console crawler:go $start

Arguments

  start                            Define where the crawl starts. Eg: https://piedweb.com
                                   You can specify an id from a previous crawl; other options will then be ignored.
                                   You can use `last` to resume the most recent crawl (e.g. one that was just stopped).

Options

  -l, --limit=LIMIT                Define a depth limit for the crawl [default: 5]
  -i, --ignore=IGNORE              Virtual robots.txt to respect (can be a string or a URL).
  -u, --user-agent=USER-AGENT      Define the user-agent used during the crawl. [default: "SEO Pocket Crawler - PiedWeb.com/seo/crawler"]
  -w, --wait=WAIT                  Time to wait between two requests, in microseconds; the default (100000) is 0.1 s. [default: 100000]
  -c, --cache-method=CACHE-METHOD  Cache method to use during the crawl. [default: 2]
  -r, --restart=RESTART            Restart a previous crawl. Values: 1 = fresh restart, 2 = restart from cache
  -h, --help                       Display this help message
  -q, --quiet                      Do not output any message
  -V, --version                    Display this application version
      --ansi                       Force ANSI output
      --no-ansi                    Disable ANSI output
  -n, --no-interaction             Do not ask any interactive question
  -v|vv|vvv, --verbose             Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
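
For example (a sketch using only the options documented above, with illustrative values), a crawl limited to depth 3 with a 0.5 s pause between requests, then resuming the most recent crawl after an interruption:

$ bin/console crawler:go https://piedweb.com --limit=3 --wait=500000
$ bin/console crawler:go last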



Extract all external links from a previous crawl in one second

$ bin/console crawler:external $id [--host]
    --id
        id from a previous crawl
        You can use `last` to show external links from the last crawl.

    --host -ho
        flag to return only the hosts
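
For example, to list only the hosts of the external links found by the most recent crawl, using the `last` shortcut and the --host flag described above:

$ bin/console crawler:external last --host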

Calculate the Page Rank

This will update the previously generated data.csv. You can then use the PoC pagerank.html (served with npx http-server -c-1 --port 3000) to explore your website.

$ bin/console crawler:pagerank $id
    --id
        id from a previous crawl
        You can use `last` to calculate the Page Rank from the last crawl.
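
For example, assuming a crawl has already been run, compute the Page Rank for the most recent crawl and then serve the PoC visualizer locally, as described above:

$ bin/console crawler:pagerank last
$ npx http-server -c-1 --port 3000

Then open pagerank.html from the started server in your browser.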

Testing

$ composer test

Todo

  • Better link harvesting and recording (record the context: list, nav, sentence...)
  • Transform the PoC (Page Rank visualizer)
  • Complex Page Rank calculator (taking into account 301s, canonicals, nofollow, etc.)

Contributing

Please see the contributing guide.

Credits

License

The MIT License (MIT). Please see the License File for more information.
