imtigger/unicode-filter

A PHP Unicode string filtering library based on the Unicode blocks defined by the Unicode standard

1.0.0 2019-02-13 10:23 UTC

Last update: 2024-09-14 20:35:41 UTC


README

A PHP Unicode string filtering library based on the Unicode blocks defined in the Unicode 11.0 standard

Usage

Basic usage

UnicodeFilter::whitelist($input, $filters = [], $excepts = [], $replacement = '')

  • Keep only characters in the BASIC_LATIN block
echo UnicodeFilter::whitelist("Hello World! 😃", [
    UnicodeFilter::BASIC_LATIN
]);

// Hello World! 
  • Keep only characters in the BASIC_LATIN block and replace everything else with an underscore "_"
echo UnicodeFilter::whitelist("Hello World! 😃", [
    UnicodeFilter::BASIC_LATIN
], [], "_");

// Hello World! _

UnicodeFilter::blacklist($input, $filters = [], $excepts = [], $replacement = '')

  • Remove only characters in the EMOTICONS block
echo UnicodeFilter::blacklist("Hello World! 😃", [
    UnicodeFilter::EMOTICONS
]);

// Hello World! 
  • Return true if the string was processed (the filter matched at least one character), false otherwise

UnicodeFilter::isWhitelistProcessed($input, $filters = [], $excepts = [])

UnicodeFilter::isBlacklistProcessed($input, $filters = [], $excepts = [])

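  • $filters and $excepts accept arrays containing any of the following:
    • Block names (e.g. UnicodeFilter::BASIC_LATIN)
    • Any single code point given as an integer (e.g. 0x200b, mb_ord("好"))
    • Any code point range given as a [start, end] pair (e.g. [0x2000, 0x200F])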
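
The accepted entry formats can be illustrated with a small standalone sketch. The helper normalizeRanges() below is hypothetical and is not part of the library; it only shows one way integer code points and [start, end] pairs could be reduced to a uniform list of ranges. How block-name constants are represented internally is not shown in this README, so they are left out.

```php
<?php
// Hypothetical helper for illustration only; not the library's own code.
// Converts a mixed list of integer code points and [start, end] pairs
// into a uniform list of [start, end] ranges.
function normalizeRanges(array $entries): array
{
    $ranges = [];
    foreach ($entries as $entry) {
        if (is_int($entry)) {
            // A single code point becomes a one-element range.
            $ranges[] = [$entry, $entry];
        } elseif (is_array($entry) && count($entry) === 2) {
            $pair = array_values($entry);
            if (!is_int($pair[0]) || !is_int($pair[1])) {
                throw new InvalidArgumentException('Range bounds must be integers');
            }
            // Tolerate reversed bounds by sorting them.
            $ranges[] = [min($pair), max($pair)];
        } else {
            throw new InvalidArgumentException('Expected a code point or a [start, end] pair');
        }
    }
    return $ranges;
}

$ranges = normalizeRanges([0x200b, mb_ord('好', 'UTF-8'), [0x2000, 0x200F]]);
foreach ($ranges as [$start, $end]) {
    printf("U+%04X..U+%04X\n", $start, $end);
}
```

Note that 0x200b and 8203 are the same integer to PHP, so hexadecimal and decimal literals are interchangeable here.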

Advanced usage

  • Keep only characters in the BASIC_LATIN block, except the range U+00..U+20, and replace everything else with an underscore "_"
echo UnicodeFilter::whitelist("Hello\nWorld! 😃", [
    UnicodeFilter::BASIC_LATIN
], [
    [0x00, 0x20]
], "_");

// Hello_World! _
  • Keep only (most) English, Chinese, Japanese, and Korean characters
echo UnicodeFilter::whitelist("Hello 您好 こんにちは 안녕하세요 สวัสดีค่ะ", [
    UnicodeFilter::BASIC_LATIN,
    UnicodeFilter::CJK_UNIFIED_IDEOGRAPHS,
    UnicodeFilter::CJK_COMPATIBILITY,
    UnicodeFilter::HIRAGANA,
    UnicodeFilter::KATAKANA,
    UnicodeFilter::HANGUL_SYLLABLES
]);

// Hello 您好 こんにちは 안녕하세요
// (Thai is not included so it's removed) 
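  • Keep only (most) English, Chinese, Japanese, Korean, and Thai characters, plus General Punctuation and one extra 😃 character, but exclude the ranges U+2000..U+200F and U+205F..U+206F (non-printable characters), and replace any other character with an underscore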
echo UnicodeFilter::whitelist("‷Hello×您好×こんにちは×안녕하세요×สวัสดีค่ะ‴ 😃", [
    UnicodeFilter::BASIC_LATIN,
    UnicodeFilter::CJK_UNIFIED_IDEOGRAPHS,
    UnicodeFilter::CJK_COMPATIBILITY,
    UnicodeFilter::HIRAGANA,
    UnicodeFilter::KATAKANA,
    UnicodeFilter::HANGUL_SYLLABLES,
    UnicodeFilter::THAI,
    UnicodeFilter::GENERAL_PUNCTUATION,
    mb_ord('😃')
], [
    [0x2000, 0x200F],
    [0x205F, 0x206F]
], "_");

// ‷Hello_您好_こんにちは_안녕하세요_สวัสดีค่ะ‴ 😃
  • Generate an array of details (code point and block) for each character of the given string

analysis($string)

array(14) {
  [0]=>
  array(3) {
    ["character"]=>
    string(1) "H"
    ["codepoint"]=>
    int(72)
    ["block"]=>
    string(11) "BASIC_LATIN"
  }
  ...
}
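
To make the shape of this result concrete, here is a minimal standalone sketch that produces the same three fields per character. The function analyzeSketch() and the constant BLOCK_TABLE are hypothetical names, and the table deliberately lists only three blocks (with ranges taken from the Unicode standard); the library's own lookup is not shown in this README and may work differently.

```php
<?php
// Hypothetical sketch for illustration; not the library's implementation.
// Only three blocks are listed here, with ranges from the Unicode standard.
const BLOCK_TABLE = [
    'BASIC_LATIN'            => [0x0000, 0x007F],
    'LATIN_1_SUPPLEMENT'     => [0x0080, 0x00FF],
    'CJK_UNIFIED_IDEOGRAPHS' => [0x4E00, 0x9FFF],
];

function analyzeSketch(string $string): array
{
    $result = [];
    // Split a UTF-8 string into individual characters.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        $codepoint = mb_ord($char, 'UTF-8');
        $block = null; // stays null when the block is not in the table
        foreach (BLOCK_TABLE as $name => [$start, $end]) {
            if ($codepoint >= $start && $codepoint <= $end) {
                $block = $name;
                break;
            }
        }
        $result[] = [
            'character' => $char,
            'codepoint' => $codepoint,
            'block'     => $block,
        ];
    }
    return $result;
}

$details = analyzeSketch('H×好');
var_dump($details[0]);
```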
  • Generate the details and result of whitelist/blacklist processing

whitelistInfo($input, $filters = [], $excepts = [], $replacement = '')

blacklistInfo($input, $filters = [], $excepts = [], $replacement = '')

array(6) {
  ["input"]=>
  string(12) "Hello 您好"
  ["output"]=>
  string(6) "Hello "
  ["pattern"]=>
  string(18) "/[^\x{0}-\x{7f}]/u"
  ["isProcessed"]=>
  bool(true)
  ["processedCount"]=>
  int(2)
  ["processedCharacters"]=>
  string(6) "您好"
}
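
The "pattern" field suggests that filtering comes down to a negated PCRE character class applied with the /u modifier. The sketch below reproduces that idea under this assumption; rangesToClass() and whitelistSketch() are hypothetical names, and the handling of $excepts is omitted because the README does not show how it is implemented.

```php
<?php
// Hypothetical helpers for illustration; not the library's actual code.

// Format [start, end] ranges as the body of a PCRE character class,
// e.g. [[0x00, 0x7f]] becomes \x{0}-\x{7f}.
function rangesToClass(array $ranges): string
{
    $parts = [];
    foreach ($ranges as [$start, $end]) {
        $parts[] = $start === $end
            ? sprintf('\x{%x}', $start)
            : sprintf('\x{%x}-\x{%x}', $start, $end);
    }
    return implode('', $parts);
}

// Whitelist sketch: replace every character that is NOT in the allowed ranges.
function whitelistSketch(string $input, array $allowed, string $replacement = ''): array
{
    $pattern = '/[^' . rangesToClass($allowed) . ']/u';
    $output = preg_replace($pattern, $replacement, $input, -1, $count);
    return [
        'input'          => $input,
        'output'         => $output,
        'pattern'        => $pattern,
        'isProcessed'    => $count > 0,
        'processedCount' => $count,
    ];
}

$info = whitelistSketch('Hello 您好', [[0x00, 0x7f]]);
var_dump($info);
```

A blacklist would use the same class without the leading ^, so that only the listed ranges are replaced.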

Debugging functions

  • Print whitelist/blacklist information to the console

dumpWhitelistInfo($input, $filters = [], $excepts = [], $replacement = '')

dumpBlacklistInfo($input, $filters = [], $excepts = [], $replacement = '')

echo UnicodeFilter::dumpWhitelistInfo("Hello 您好", [
    UnicodeFilter::BASIC_LATIN
]);
Output: Hello  (6)
Pattern: /[^\x{0}-\x{7f}]/u
Processed: Yes
Processed Characters: 2
您 (U+60a8) in block CJK_UNIFIED_IDEOGRAPHS
好 (U+597d) in block CJK_UNIFIED_IDEOGRAPHS

dumpString($string)

echo UnicodeFilter::dumpString("Hello×您好×こんにちは");
H (U+48) in block BASIC_LATIN
e (U+65) in block BASIC_LATIN
l (U+6c) in block BASIC_LATIN
l (U+6c) in block BASIC_LATIN
o (U+6f) in block BASIC_LATIN
× (U+d7) in block LATIN_1_SUPPLEMENT
您 (U+60a8) in block CJK_UNIFIED_IDEOGRAPHS
好 (U+597d) in block CJK_UNIFIED_IDEOGRAPHS
× (U+d7) in block LATIN_1_SUPPLEMENT
こ (U+3053) in block HIRAGANA
ん (U+3093) in block HIRAGANA
に (U+306b) in block HIRAGANA
ち (U+3061) in block HIRAGANA
は (U+306f) in block HIRAGANA

dumpFilters($filters = [])

echo UnicodeFilter::dumpFilters([
    UnicodeFilter::BASIC_LATIN,
    UnicodeFilter::LATIN_1_SUPPLEMENT,
    UnicodeFilter::LATIN_EXTENDED_A,
    UnicodeFilter::LATIN_EXTENDED_B
]);
BASIC_LATIN / U+0..U+7f
LATIN_1_SUPPLEMENT / U+80..U+ff
LATIN_EXTENDED_A / U+100..U+17f
LATIN_EXTENDED_B / U+180..U+24f

FAQ

  • Whitelisting is the preferred approach, because there are too many characters to blacklist exhaustively (137,374 characters as of Unicode 11.0)

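  • Some languages, especially Chinese and Southeast Asian languages, have characters spread across multiple blocks. For example, there are the blocks CJK_COMPATIBILITY, CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A, CJK_UNIFIED_IDEOGRAPHS, and CJK_COMPATIBILITY_IDEOGRAPHS. Expect to test several times before your filter includes every block you actually need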
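
The effect of a missing block can be seen in a standalone sketch. The helper keepRanges() is hypothetical and does not use the library; the two ranges are the Unicode standard's ranges for CJK Unified Ideographs and CJK Unified Ideographs Extension A. A character from Extension A is silently dropped when only the main ideograph block is whitelisted.

```php
<?php
// Hypothetical helper for illustration; not part of the library.
// Block ranges as defined by the Unicode standard.
$cjkUnified    = [0x4E00, 0x9FFF]; // CJK Unified Ideographs
$cjkExtensionA = [0x3400, 0x4DBF]; // CJK Unified Ideographs Extension A

// Keep only characters inside the given [start, end] ranges.
function keepRanges(string $input, array $ranges): string
{
    $class = '';
    foreach ($ranges as [$start, $end]) {
        $class .= sprintf('\x{%x}-\x{%x}', $start, $end);
    }
    return preg_replace('/[^' . $class . ']/u', '', $input);
}

// 好 is U+597D (main block); U+3400 is the first character of Extension A.
$input = "好\u{3400}";

$mainOnly   = keepRanges($input, [$cjkUnified]);
$withExtA   = keepRanges($input, [$cjkUnified, $cjkExtensionA]);

var_dump(mb_strlen($mainOnly, 'UTF-8')); // one character survives
var_dump(mb_strlen($withExtA, 'UTF-8')); // both characters survive
```

Running real sample text through a check like this (or through dumpString) is a quick way to discover which blocks a whitelist still lacks.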

References