piedweb / seo-pocket-crawler
A web crawler that checks a few SEO basics.
v0.0.7
2021-01-20 16:29 UTC
Requires
- php: ~7.3|^8.0
- league/csv: ^9.1
- league/uri: dev-master
- piedweb/curl: ^0.0
- piedweb/url-harvester: ^0.0.27
- symfony/console: ^5.0
- vimeo/psalm: ^4.4
Requires (Dev)
- phpunit/phpunit: >=7.0
- squizlabs/php_codesniffer: ^3.0
README
CLI Seo Pocket Crawler
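A web crawler that checks a few SEO basics.
Use the collected data in your favorite spreadsheet software, or retrieve it from your favorite language.
French documentation is available: https://piedweb.com/seo/crawler
Installation
Via Packagist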
$ composer create-project piedweb/seo-pocket-crawler
Usage
Crawler CLI
$ bin/console crawler:go $start
Arguments
start Where the crawl starts, e.g. https://piedweb.com
You can specify an id from a previous crawl instead. In that case, the other options are ignored.
You can use `last` to continue the last crawl (the one that was just stopped).
Options
-l, --limit=LIMIT Depth limit for the crawl [default: 5]
-i, --ignore=IGNORE Virtual robots.txt to respect (either a string or a URL).
-u, --user-agent=USER-AGENT Define the user-agent used during the crawl. [default: "SEO Pocket Crawler - PiedWeb.com/seo/crawler"]
-w, --wait=WAIT Time to wait between two requests, in microseconds (100000 = 0.1 s). [default: 100000]
-c, --cache-method=CACHE-METHOD Cache method used for the crawl. [default: 2]
-r, --restart=RESTART Restart a previous crawl. Values: 1 = fresh restart, 2 = restart from cache
-h, --help Display this help message
-q, --quiet Do not output any message
-V, --version Display this application version
--ansi Force ANSI output
--no-ansi Disable ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
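The options above can be combined in a single call. As a minimal sketch, the helpers below convert a delay in milliseconds to the microseconds that `--wait` expects, then print the resulting command so it can be reviewed before running. `ms_to_us` and `crawl_cmd` are hypothetical names used for illustration and are not part of the crawler; only the `crawler:go` command and its `--limit` and `--wait` options come from the help text above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert milliseconds to microseconds for --wait.
ms_to_us() {
  echo $(( $1 * 1000 ))
}

# Hypothetical helper: print (not run) a crawl command.
# $1 = start URL, $2 = depth limit, $3 = delay between requests in ms.
crawl_cmd() {
  local start="$1" depth="$2" wait_ms="$3"
  printf 'bin/console crawler:go %s --limit=%s --wait=%s\n' \
    "$start" "$depth" "$(ms_to_us "$wait_ms")"
}

# Crawl 3 levels deep, waiting 0.5 s between requests.
crawl_cmd https://piedweb.com 3 500
```

Once the printed command looks right, it can be pasted into the terminal from the project root.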
Extract all external links from a previous crawl (in 1 second)
$ bin/console crawler:external $id [--host]
--id
id from a previous crawl
You can use `last` to show external links from the last crawl.
--host -ho
Flag to return only the hosts.
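To see which external sites are linked most often, the list of links can be post-processed with standard tools. The sketch below assumes that `crawler:external` prints one absolute URL per line, which is not confirmed by the help text above. `count_hosts` is a hypothetical helper, not part of the crawler.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read absolute URLs (one per line) on stdin and
# print "count host" pairs, most frequently linked host first.
count_hosts() {
  # With "/" as separator, the third field of "https://host/path" is the host.
  awk -F/ 'NF >= 3 { print $3 }' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $1, $2 }'
}
```

If the output format matches the assumption, `bin/console crawler:external last | count_hosts` would give a quick frequency table.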
Calculate Page Rank
This updates the previously generated data.csv. You can then explore your website with the PoC pagerank.html (served with npx http-server -c-1 --port 3000).
$ bin/console crawler:pagerank $id
--id
id from a previous crawl
You can use `last` to calculate the page rank from the last crawl.
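The crawl and the page rank calculation can be chained, since `last` refers to the most recent crawl. The following is a minimal sketch, assuming it runs from the project root. `run` and `crawl_and_rank` are hypothetical helpers, and `DRY_RUN` is a variable introduced here for illustration that only prints the commands.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: crawl a site, then compute the page rank of that crawl.
# $1 = start URL. The second step only runs if the crawl succeeded.
crawl_and_rank() {
  local start="$1"
  run bin/console crawler:go "$start" &&
    run bin/console crawler:pagerank last
}
```

Calling `DRY_RUN=1 crawl_and_rank https://piedweb.com` prints the two commands without executing them. Dropping `DRY_RUN=1` runs them for real.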
Testing
$ composer test
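Todo
- Better link harvesting and recording (record the context: list, nav, sentence...)
- Transform the PoC (Page Rank Visualizer)
- Complex Page Rank Calculator (with 301, canonical, nofollow, etc.)
Contributing
Please see the contributing guide.
Credits
License
The MIT License (MIT). Please see the license file for more information.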