imtigger / unicode-filter
A PHP Unicode string filtering library based on the Unicode blocks defined by the Unicode standard
Requires
- php: >=7.0.0
- ext-mbstring: *
Requires (Dev)
- phpunit/phpunit: ^6.4
Last update: 2024-09-14 20:35:41 UTC
README
A PHP Unicode string filtering library based on the Unicode blocks defined by the Unicode 11.0 standard
Usage
Basic usage
UnicodeFilter::whitelist($input, $filters = [], $excepts = [], $replacement = '')
- Keep only characters in the BASIC_LATIN block
echo UnicodeFilter::whitelist("Hello World! 😃", [ UnicodeFilter::BASIC_LATIN ]); // Hello World!
- Keep only characters in the BASIC_LATIN block and replace everything else with an underscore "_"
echo UnicodeFilter::whitelist("Hello World! 😃", [ UnicodeFilter::BASIC_LATIN ], [], "_"); // Hello World! _
UnicodeFilter::blacklist($input, $filters = [], $excepts = [], $replacement = '')
- Remove only characters in the EMOTICONS block
echo UnicodeFilter::blacklist("Hello World! 😃", [ UnicodeFilter::EMOTICONS ]); // Hello World!
- Return true if the string is processed (that is, if the filter would change it), false otherwise
UnicodeFilter::isWhitelistProcessed($input, $filters = [], $excepts = [])
UnicodeFilter::isBlacklistProcessed($input, $filters = [], $excepts = [])
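$filters and $excepts accept an array whose entries use any of the following formats:
- A block name (e.g. UnicodeFilter::BASIC_LATIN)
- Any integer code point (e.g. 0x200b, mb_ord("好"))
- Any range of integer code points (e.g. [0x2000, 0x200F])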
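The three accepted entry formats can be reduced to one common shape, a list of inclusive [start, end] ranges. The following is only a minimal sketch of that idea, not the library's actual code: `EXAMPLE_BLOCKS` and `normalizeFilters` are hypothetical names, and representing a block by a string key is an assumption made for illustration.

```php
<?php
// Hypothetical block table for illustration only; the library ships its own constants.
const EXAMPLE_BLOCKS = [
    'BASIC_LATIN'        => [0x0000, 0x007F],
    'LATIN_1_SUPPLEMENT' => [0x0080, 0x00FF],
];

/**
 * Turn a mixed filter list into a list of inclusive [start, end] ranges.
 * Accepts a block name, a single integer code point, or a two-element range.
 */
function normalizeFilters(array $filters): array
{
    $ranges = [];
    foreach ($filters as $filter) {
        if (is_string($filter) && array_key_exists($filter, EXAMPLE_BLOCKS)) {
            // Block name: look up its range in the table.
            $ranges[] = EXAMPLE_BLOCKS[$filter];
        } elseif (is_int($filter)) {
            // Single code point: a range of length one.
            $ranges[] = [$filter, $filter];
        } elseif (is_array($filter) && count($filter) === 2) {
            // Explicit range: keep it as given.
            $ranges[] = [(int) $filter[0], (int) $filter[1]];
        } else {
            throw new InvalidArgumentException('Unsupported filter entry');
        }
    }
    return $ranges;
}

$ranges = normalizeFilters(['BASIC_LATIN', 0x200B, [0x2000, 0x200F]]);
```

Since 0x200b is just an integer literal written in hexadecimal, it is handled the same way as the result of mb_ord().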
Advanced usage
- Keep only characters in the BASIC_LATIN block, but exclude the range U+00..U+20, and replace everything else with an underscore "_"
echo UnicodeFilter::whitelist("Hello\nWorld! 😃", [ UnicodeFilter::BASIC_LATIN ], [ [0x00, 0x20] ], "_"); // Hello_World! _
- Keep only (most) English, Chinese, Japanese, and Korean characters
echo UnicodeFilter::whitelist("Hello 您好 こんにちは 안녕하세요 สวัสดีค่ะ", [ UnicodeFilter::BASIC_LATIN, UnicodeFilter::CJK_UNIFIED_IDEOGRAPHS, UnicodeFilter::CJK_COMPATIBILITY, UnicodeFilter::HIRAGANA, UnicodeFilter::KATAKANA, UnicodeFilter::HANGUL_SYLLABLES ]); // Hello 您好 こんにちは 안녕하세요 // (Thai is not included so it's removed)
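- Keep only (most) English, Chinese, Japanese, Korean, and Thai characters, plus General Punctuation and one additional 😃 character, but exclude the ranges U+2000..U+200F and U+205F..U+206F (non-printable characters), and finally replace any other character with an underscore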
echo UnicodeFilter::whitelist("‷Hello×您好×こんにちは×안녕하세요×สวัสดีค่ะ‴ 😃", [ UnicodeFilter::BASIC_LATIN, UnicodeFilter::CJK_UNIFIED_IDEOGRAPHS, UnicodeFilter::CJK_COMPATIBILITY, UnicodeFilter::HIRAGANA, UnicodeFilter::KATAKANA, UnicodeFilter::HANGUL_SYLLABLES, UnicodeFilter::THAI, UnicodeFilter::GENERAL_PUNCTUATION, mb_ord('😃') ], [ [0x2000, 0x200F], [0x205F, 0x206F] ], "_"); // ‷Hello_您好_こんにちは_안녕하세요_สวัสดีค่ะ‴ 😃
- Generate an array of details (code point and block) for each character of the given string
analysis($string)
array(14) {
[0]=>
array(3) {
["character"]=>
string(1) "H"
["codepoint"]=>
int(72)
["block"]=>
string(11) "BASIC_LATIN"
}
...
}
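To make the shape of this result concrete, here is a rough sketch of how such a per-character analysis could be produced. It is not the library's implementation: `SAMPLE_BLOCKS`, `findBlock`, and `analyzeString` are hypothetical names, the block table is a tiny subset, and the 'UNKNOWN' fallback is an assumption. It relies on mb_ord(), which requires PHP 7.2 or later.

```php
<?php
// Illustrative subset of Unicode block ranges.
const SAMPLE_BLOCKS = [
    'BASIC_LATIN'            => [0x0000, 0x007F],
    'LATIN_1_SUPPLEMENT'     => [0x0080, 0x00FF],
    'HIRAGANA'               => [0x3040, 0x309F],
    'CJK_UNIFIED_IDEOGRAPHS' => [0x4E00, 0x9FFF],
];

/** Return the name of the block containing $codepoint, or 'UNKNOWN'. */
function findBlock(int $codepoint): string
{
    foreach (SAMPLE_BLOCKS as $name => list($start, $end)) {
        if ($codepoint >= $start && $codepoint <= $end) {
            return $name;
        }
    }
    return 'UNKNOWN';
}

/** Build one [character, codepoint, block] entry per character of $string. */
function analyzeString(string $string): array
{
    $result = [];
    // With the /u flag an empty pattern splits between code points, not bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        $codepoint = mb_ord($char, 'UTF-8');
        $result[] = [
            'character' => $char,
            'codepoint' => $codepoint,
            'block'     => findBlock($codepoint),
        ];
    }
    return $result;
}

// "H", U+00D7 (multiplication sign), U+597D
$info = analyzeString("H\u{D7}\u{597D}");
```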
- Generate details and the result of whitelist/blacklist processing
whitelistInfo($input, $filters = [], $excepts = [], $replacement = '')
blacklistInfo($input, $filters = [], $excepts = [], $replacement = '')
array(6) {
["input"]=>
string(12) "Hello 您好"
["output"]=>
string(6) "Hello "
["pattern"]=>
string(18) "/[^\x{0}-\x{7f}]/u"
["isProcessed"]=>
bool(true)
["processedCount"]=>
int(2)
["processedCharacters"]=>
string(6) "您好"
}
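The "pattern" field suggests that a whitelist is applied as a negated PCRE character class. The sketch below shows one way a pattern of that form could be built from inclusive ranges and applied with preg_replace(). It is an assumption-laden illustration rather than the library's code: `rangesToClass` and `whitelistSketch` are hypothetical names, and $excepts handling is left out.

```php
<?php
/** Build a character class body such as "\x{0}-\x{7f}" from inclusive ranges. */
function rangesToClass(array $ranges): string
{
    $parts = [];
    foreach ($ranges as list($start, $end)) {
        // Single quotes keep the backslash literal so PCRE sees \x{...}.
        $parts[] = sprintf('\x{%x}-\x{%x}', $start, $end);
    }
    return implode('', $parts);
}

/** Replace every character outside the allowed ranges and report what happened. */
function whitelistSketch(string $input, array $ranges, string $replacement = ''): array
{
    // Negated class: matches any character NOT covered by the allowed ranges.
    $pattern = '/[^' . rangesToClass($ranges) . ']/u';
    // The fifth argument receives the number of replacements performed.
    $output = preg_replace($pattern, $replacement, $input, -1, $count);
    return [
        'input'          => $input,
        'output'         => $output,
        'pattern'        => $pattern,
        'isProcessed'    => $count > 0,
        'processedCount' => $count,
    ];
}

// "Hello " followed by U+60A8 and U+597D, whitelisting U+0000..U+007F only.
$info = whitelistSketch("Hello \u{60A8}\u{597D}", [[0x00, 0x7F]]);
```

A blacklist could presumably use the same class without the leading "^". Note that preg_replace() interprets "$" and "\" sequences in the replacement string, so a production version would need to escape them.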
Debugging functions
- Print whitelist/blacklist info to the console
dumpWhitelistInfo($input, $filters = [], $excepts = [], $replacement = '')
dumpBlacklistInfo($input, $filters = [], $excepts = [], $replacement = '')
echo UnicodeFilter::dumpWhitelistInfo("Hello 您好", [ UnicodeFilter::BASIC_LATIN ]);
Output: Hello (6)
Pattern: /[^\x{0}-\x{7f}]/u
Processed: Yes
Processed Characters: 2
您 (U+60a8) in block CJK_UNIFIED_IDEOGRAPHS
好 (U+597d) in block CJK_UNIFIED_IDEOGRAPHS
dumpString($string)
echo UnicodeFilter::dumpString("Hello×您好×こんにちは");
H (U+48) in block BASIC_LATIN
e (U+65) in block BASIC_LATIN
l (U+6c) in block BASIC_LATIN
l (U+6c) in block BASIC_LATIN
o (U+6f) in block BASIC_LATIN
× (U+d7) in block LATIN_1_SUPPLEMENT
您 (U+60a8) in block CJK_UNIFIED_IDEOGRAPHS
好 (U+597d) in block CJK_UNIFIED_IDEOGRAPHS
× (U+d7) in block LATIN_1_SUPPLEMENT
こ (U+3053) in block HIRAGANA
ん (U+3093) in block HIRAGANA
に (U+306b) in block HIRAGANA
ち (U+3061) in block HIRAGANA
は (U+306f) in block HIRAGANA
dumpFilters($filters = [])
echo UnicodeFilter::dumpFilters([ UnicodeFilter::BASIC_LATIN, UnicodeFilter::LATIN_1_SUPPLEMENT, UnicodeFilter::LATIN_EXTENDED_A, UnicodeFilter::LATIN_EXTENDED_B ]);
BASIC_LATIN / U+0..U+7f
LATIN_1_SUPPLEMENT / U+80..U+ff
LATIN_EXTENDED_A / U+100..U+17f
LATIN_EXTENDED_B / U+180..U+24f
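FAQ
- Whitelisting is the preferred way to work, because there are simply too many characters (137,374 as of Unicode 11.0) to blacklist reliably.
- Some languages, especially Chinese and Southeast Asian languages, have characters spread across multiple blocks. For example, there are blocks such as CJK_COMPATIBILITY, CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A, CJK_UNIFIED_IDEOGRAPHS, and CJK_COMPATIBILITY_IDEOGRAPHS. Several rounds of testing are therefore needed to include all the blocks you may actually need.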
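One way to approach that testing is to take a sample of real input and count how many characters fall into each block. The sketch below is a self-contained illustration under stated assumptions: `CJK_SAMPLE_BLOCKS` and `blocksUsed` are hypothetical names, the table covers only the four blocks mentioned above, and anything else is lumped under 'OTHER'. With the library installed, dumpString() on the same sample should give a comparable per-character view.

```php
<?php
// Ranges for the four CJK blocks named above; everything else counts as OTHER.
const CJK_SAMPLE_BLOCKS = [
    'CJK_COMPATIBILITY'                  => [0x3300, 0x33FF],
    'CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A' => [0x3400, 0x4DBF],
    'CJK_UNIFIED_IDEOGRAPHS'             => [0x4E00, 0x9FFF],
    'CJK_COMPATIBILITY_IDEOGRAPHS'       => [0xF900, 0xFAFF],
];

/** Count the characters of $sample per block name. */
function blocksUsed(string $sample): array
{
    $counts = [];
    foreach (preg_split('//u', $sample, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $cp = mb_ord($char, 'UTF-8'); // requires PHP 7.2+
        $name = 'OTHER';
        foreach (CJK_SAMPLE_BLOCKS as $block => list($start, $end)) {
            if ($cp >= $start && $cp <= $end) {
                $name = $block;
                break;
            }
        }
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    }
    return $counts;
}

// Code points written as escapes so that editors cannot normalize them:
// U+6F22 and U+5B57 (unified), U+3400 (Extension A), U+F900 (compatibility ideograph).
$counts = blocksUsed("\u{6F22}\u{5B57}\u{3400}\u{F900}");
```

Any block that shows up in the counts for legitimate input is a candidate for the whitelist.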