geoffroy-aubry/awk-csv-parser

AWK and Bash code to easily parse CSV files, with possibly embedded commas and quotes.

v1.0.2 2017-11-14 20:17 UTC

This package is not auto-updated.

Last update: 2024-09-28 14:48:04 UTC


README


AWK and Bash code to easily parse CSV files, with possibly embedded commas and quotes.

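Features

  • Parses CSV files using only Bash and Awk.
  • Lets you process CSV data with standard UNIX shell commands.
  • Correctly handles CSV data containing field separators (comma by default) and field enclosures (double quote by default) inside enclosed fields.
  • Processes CSV from stdin as well as from multiple files given as command-line arguments.
  • Accepts any character as field separator and as field enclosure.
  • Rewrites CSV records with a multi-character output field separator, removing the CSV enclosure characters and unescaping escaped enclosures.
  • Records in the same file do not need to have the same number of fields.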
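
To illustrate why embedded separators call for a real parser, here is a minimal sketch in plain awk. It is not part of this project, and the helper name naive_field_count is made up for the illustration. It splits a record on every comma, ignoring enclosures, and therefore miscounts the fields of a quoted record:

```shell
# Hypothetical helper (not part of awk-csv-parser): count fields by
# splitting on every comma, ignoring CSV enclosure rules entirely.
naive_field_count() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# This record has 4 CSV fields, but the quoted name contains a comma.
record='"Hong Kong, Special Administrative Region of China",HK,HKG,344'
naive_field_count "$record"   # prints 5, one more than the real field count
```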

Known limitations

  • Newlines embedded within data fields are not handled yet.
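
Given this limitation, it may be useful to check a file before parsing it. The following is a rough heuristic, assuming the default double-quote enclosure: in a well-formed single-line record, enclosure characters come in pairs, so a physical line with an odd number of them probably starts or ends a field that spans several lines (it could also simply be invalid). The function name odd_quote_lines is hypothetical and not part of this project.

```shell
# Hypothetical pre-check (not part of awk-csv-parser): print the number of
# every input line that contains an odd number of double quotes.
odd_quote_lines() {
    awk '{
        # gsub returns the number of matches; "&" leaves the line unchanged.
        n = gsub(/"/, "&")
        if (n % 2 == 1) print NR
    }'
}

# The field "b\nc" spans lines 1 and 2, so both are reported.
printf 'a,"b\nc",d\ne,f\n' | odd_quote_lines   # prints 1 then 2
```

Lines in the middle of a field spanning three or more lines contain no enclosure and are not reported, so this only flags where such a field begins and ends.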

Links

Other Awk implementations

Requirements

  • Bash v4 (2009) and above
  • GNU Awk 3.1+

Tested on Debian/Ubuntu Linux.
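
The requirements above can be checked from a script. The sketch below is only an illustration: version_at_least is a hypothetical helper, not something shipped with this project, and it compares dotted numeric versions component by component using awk.

```shell
# Hypothetical helper: succeed (exit status 0) when dotted version $1 is
# greater than or equal to dotted version $2. Missing components count as 0.
version_at_least() {
    awk -v actual="$1" -v required="$2" 'BEGIN {
        na = split(actual, a, ".")
        nr = split(required, r, ".")
        n = (na > nr) ? na : nr
        for (i = 1; i <= n; i++) {
            x = a[i] + 0
            y = r[i] + 0
            if (x > y) exit 0
            if (x < y) exit 1
        }
        exit 0
    }'
}

version_at_least 3.1.8 3.1 && echo "version is recent enough"
```

In Bash, for instance, version_at_least "${BASH_VERSINFO[0]}" 4 would test the first requirement. Extracting the version number from the output of gawk --version is left out here.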

Usage

Displayed by:

$ awk-csv-parser.sh --help

Text version:

Description
    AWK and Bash code to easily parse CSV files, with possibly embedded commas and quotes.

Usage
    awk-csv-parser.sh [OPTION]… [<CSV-file>]…

Options
    -e <character>, --enclosure=<character>
        Set the CSV field enclosure. One character only, '"' (double quote) by default.

    -o <string>, --output-separator=<string>
        Set the output field separator. Multiple characters allowed, '|' (pipe) by default.

    -s <character>, --separator=<character>
        Set the CSV field separator. One character only, ',' (comma) by default.

    -h, --help
        Display this help.

    <CSV-file>
        CSV file to parse.

Discussion
    – The last record in the file may or may not have an ending line break.
    – Each line may not contain the same number of fields throughout the file.
    – The last field in the record must not be followed by a field separator.
    – Fields containing field enclosures or field separators must be enclosed in field
      enclosure.
    – A field enclosure appearing inside a field must be escaped by preceding it with
      another field enclosure. Example: "aaa","b""bb","ccc"

Examples
    Parse a CSV and display records without field enclosure, fields pipe-separated:
        awk-csv-parser.sh --output-separator='|' resources/iso_3166-1.csv

    Remove CSV's header before parsing:
        tail -n+2 resources/iso_3166-1.csv | awk-csv-parser.sh

    Keep only first column of multiple files:
        awk-csv-parser.sh a.csv b.csv c.csv | cut -d'|' -f1

    Keep only first column, using multiple UTF-8 characters output separator:
        awk-csv-parser.sh -o '⇒⇒' resources/iso_3166-1.csv | awk -F '⇒⇒' '{print $1}'

    You can directly call the Awk script:
        awk -f csv-parser.awk -v separator=',' -v enclosure='"' --source '{
            csv_parse_record($0, separator, enclosure, csv)
            print csv[2] " ⇒ " csv[0]
        }' resources/iso_3166-1.csv
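
The escaping rule from the Discussion part of the help text (an enclosure inside a field is doubled, and the whole field is wrapped in enclosures) can be sketched in the other direction, to produce input the parser accepts. This is an illustration only, assuming the default double-quote enclosure and a value without newlines; csv_quote is a hypothetical name, not a function of this project.

```shell
# Hypothetical helper: turn a raw single-line value into an enclosed CSV
# field by doubling every double quote and wrapping the result in quotes.
csv_quote() {
    printf '%s\n' "$1" | sed -e 's/"/""/g' -e 's/^/"/' -e 's/$/"/'
}

csv_quote 'b"bb'   # prints "b""bb"
csv_quote 'a,b'    # prints "a,b"
```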

Examples

Excerpt from resources/iso_3166-1.csv (full version):

Country or Area Name,ISO ALPHA-2 Code,ISO ALPHA-3 Code,ISO Numeric Code
Brazil,BR,BRA,076
British Virgin Islands,VG,VGB,092
British Indian Ocean Territory,IO,IOT,086
Brunei Darussalam,BN,BRN,096
Burkina Faso,BF,BFA,854
"Hong Kong, Special Administrative Region of China",HK,HKG,344
"Macao, Special Administrative Region of China",MO,MAC,446
Christmas Island,CX,CXR,162
Cocos (Keeling) Islands,CC,CCK,166
1. Parse a CSV and display records without field enclosure, output fields pipe-separated
$ awk-csv-parser.sh --output-separator='|' resources/iso_3166-1.csv | head -n10
# or:
$ cat resources/iso_3166-1.csv | awk-csv-parser.sh --output-separator='|' | head -n10

Result:

Country or Area Name|ISO ALPHA-2 Code|ISO ALPHA-3 Code|ISO Numeric Code|
Brazil|BR|BRA|076|
British Virgin Islands|VG|VGB|092|
British Indian Ocean Territory|IO|IOT|086|
Brunei Darussalam|BN|BRN|096|
Burkina Faso|BF|BFA|854|
Hong Kong, Special Administrative Region of China|HK|HKG|344|
Macao, Special Administrative Region of China|MO|MAC|446|
Christmas Island|CX|CXR|162|
Cocos (Keeling) Islands|CC|CCK|166|
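
As the result above shows, every output record ends with the output separator. If a downstream tool expects no trailing separator, it can be stripped with sed. The sketch below assumes the default pipe output separator; strip_trailing_sep is a hypothetical helper name, not part of this project.

```shell
# Hypothetical helper: remove one trailing pipe from each input line.
# In a basic regular expression, "|" is a literal character.
strip_trailing_sep() {
    sed 's/|$//'
}

printf 'Brazil|BR|BRA|076|\n' | strip_trailing_sep   # prints Brazil|BR|BRA|076
```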
2. Remove the CSV header, keep only the first column and grep fields containing the separator
$ tail -n+2 resources/iso_3166-1.csv | awk-csv-parser.sh | cut -d'|' -f1 | grep ,

Result:

Hong Kong, Special Administrative Region of China
Macao, Special Administrative Region of China
Congo, Democratic Republic of the
Iran, Islamic Republic of
Korea, Democratic People's Republic of
Korea, Republic of
Micronesia, Federated States of
Taiwan, Republic of China
Tanzania, United Republic of
3. You can directly call the Awk script
$ awk -f csv-parser.awk -v separator=',' -v enclosure='"' --source '{
    csv_parse_record($0, separator, enclosure, csv)
    print csv[2] " ⇒ " csv[0]
}' resources/iso_3166-1.csv | head -n10

Result:

ISO ALPHA-3 Code ⇒ Country or Area Name
BRA ⇒ Brazil
VGB ⇒ British Virgin Islands
IOT ⇒ British Indian Ocean Territory
BRN ⇒ Brunei Darussalam
BFA ⇒ Burkina Faso
HKG ⇒ Hong Kong, Special Administrative Region of China
MAC ⇒ Macao, Special Administrative Region of China
CXR ⇒ Christmas Island
CCK ⇒ Cocos (Keeling) Islands
4. Technical example

Content of tests/resources/ok.csv:

,,
a, b,c , d ,e e
"","a","a,",",a",",,"
"a""b","""","c"""""

Test:

$ awk-csv-parser.sh tests/resources/ok.csv

Result:

|||
a| b|c | d |e e|
|a|a,|,a|,,|
a"b|"|c""|
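
Because records may have different numbers of fields, it can be handy to count the fields of each parsed record. A minimal sketch, assuming the default pipe output separator and data that does not itself contain a pipe; count_fields is a hypothetical helper, not part of this project. Since each field is followed by a separator, the field count is the number of pipe-delimited pieces minus one.

```shell
# Hypothetical helper: print the number of fields of each parsed record.
# Every field is followed by "|", so awk sees one extra empty piece.
count_fields() {
    awk -F'|' '{ print NF - 1 }'
}

printf 'a| b|c | d |e e|\na"b|"|c""|\n' | count_fields   # prints 5 then 3
```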
5. Errors

Content of tests/resources/invalid.csv:

"
"a,
a"
"a"b

Test:

$ awk-csv-parser.sh tests/resources/invalid.csv

Result:

[CSV ERROR: 3] Missing closing quote after '' in following record: '"'
[CSV ERROR: 3] Missing closing quote after 'a,' in following record: '"a,'
[CSV ERROR: 1] Missing opening quote before 'a' in following record: 'a"'
[CSV ERROR: 2] Missing separator after 'a' in following record: '"a"b'
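
When checking a large file, a per-code summary of these messages may be easier to read than the raw list. The sketch below relies only on the message format shown above ("[CSV ERROR: <code>] ..."); tally_csv_errors is a hypothetical helper, and which stream the parser writes these messages to is not assumed here, so redirect as needed before piping.

```shell
# Hypothetical helper: count error messages per error code.
# Output: one "<code> <count>" line per code, sorted by code.
tally_csv_errors() {
    awk '
        /^\[CSV ERROR: [0-9]+\]/ {
            code = $3              # third word, e.g. "3]"
            sub(/\].*/, "", code)  # drop the closing bracket
            count[code]++
        }
        END { for (c in count) print c, count[c] }
    ' | sort
}

printf '%s\n' \
    "[CSV ERROR: 3] Missing closing quote after '' in following record: '\"'" \
    "[CSV ERROR: 1] Missing opening quote before 'a' in following record: 'a\"'" \
    "[CSV ERROR: 3] Missing closing quote after 'a,' in following record: '\"a,'" \
    | tally_csv_errors
```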

Installation

Debian/Ubuntu

  1. Move to the directory where you wish to store the source code.

  2. Clone the repository:

$ git clone https://github.com/geoffroy-aubry/awk-csv-parser.git

  3. You should be on the stable branch. If not, switch your clone to that branch:

$ cd awk-csv-parser && git checkout stable

  4. You can create a symlink to awk-csv-parser.sh:

$ sudo ln -s /path/to/src/awk-csv-parser.sh /usr/local/bin/awk-csv-parser

  5. It's ready for use:

$ awk-csv-parser

OS X

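Because the Mac OS X versions of readlink and sed are both BSD-based and differ slightly from the GNU versions, you need to install the GNU tools:

$ brew install coreutils gnu-sed [--with-default-names]

With the --with-default-names option, the GNU tools replace the Mac OS X ones. Otherwise, the GNU tools are installed with a g prefix, and you need to edit the scripts src/awk-csv-parser.sh and tests/all-tests.sh to replace readlink and sed with greadlink and gsed respectively.

Then follow the Debian/Ubuntu installation procedure.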

Copyrights & licensing

Licensed under the GNU Lesser General Public License v3 (LGPL version 3). See the LICENSE file for details.

Change log

See the CHANGELOG file for details.

Continuous integration

Launch the unit tests:

$ tests/all-tests.sh

Git branching model

The git branching model used for development is the one described and assisted by the twgit tool: https://github.com/Twenga/twgit.