
Tuesday, April 17, 2018

[Notes] How to Use Unicode Ranges to Filter Special Characters Out of an Article


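I've recently been implementing term extraction and word segmentation for (Chinese) articles. One of the required steps is filtering out symbols and punctuation. Googling this turns up all kinds of examples; commonly seen regular expressions look like this: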
  • !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • ['?:!.,;]*([a-z]+)['?:!.,;]*
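
To make the second pattern less cryptic, here is a minimal sketch of how it behaves in Java. The class and method names (`LowercaseWordExtractor`, `extract`) are made up for illustration; only the regular expression comes from the list above. It captures runs of lowercase ASCII letters and discards the punctuation around them, so it finds nothing in Chinese text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class LowercaseWordExtractor {
    // Optional punctuation, a run of a-z (captured as group 1), optional punctuation.
    private static final Pattern WORD = Pattern.compile("['?:!.,;]*([a-z]+)['?:!.,;]*");

    // Collects every captured lowercase word, in order of appearance.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extract("hello, world!")); // prints [hello, world]
    }
}
```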

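If you don't take the time to understand them, it's easy to get lost in this dazzling pile of symbols. In fact, much of this is already defined in the Java documentation; I had just never studied it properly (facepalm). With the Java Pattern documentation as a reference, it's easy to enumerate all of the symbols: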

POSIX character classes (US-ASCII only)
\p{Lower} A lower-case alphabetic character: [a-z]
\p{Upper} An upper-case alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]
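
As a rough illustration of the classes in this table, the sketch below strips ASCII punctuation and whitespace with `\p{Punct}` and `\p{Space}`. The class name `AsciiSymbolFilter` and the choice to replace matches with an empty string are assumptions for the example. Because these classes are US-ASCII only by default, a full-width comma (U+FF0C) passes through untouched.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class AsciiSymbolFilter {
    // One or more ASCII punctuation or whitespace characters.
    private static final Pattern ASCII_SYMBOLS = Pattern.compile("[\\p{Punct}\\p{Space}]+");

    // Removes every run of ASCII punctuation and whitespace from the text.
    static String strip(String text) {
        return ASCII_SYMBOLS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Hello, world! (test)")); // prints Helloworldtest
        // Chinese text with a full-width comma (U+FF0C): nothing is removed.
        String chinese = "\u4F60\u597D\uFF0C\u4E16\u754C";
        System.out.println(strip(chinese).equals(chinese)); // prints true
    }
}
```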


But the real world is rarely that simple. Beyond the usual ASCII symbols, more and more unusual Unicode symbols now show up in articles, for example:
  • "─" U+2500 Box Drawings Light Horizontal
  • "⋯" U+22EF Midline Horizontal Ellipsis
  • "䶵" U+4DB5 CJK Ideograph Extension A, Last
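
One way to find out which block such a character belongs to is `Character.UnicodeBlock.of`, which ships with the JDK. The sketch below looks up the three code points above and also checks them against `\p{Punct}`; the class and method names are invented for the example. None of the three matches the ASCII-only punctuation class, which is why a range-based filter is needed.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class UnicodeBlockLookup {
    private static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");

    // Returns the Unicode block that contains the given code point.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    // True if the single code point matches the ASCII-only \p{Punct} class.
    static boolean isAsciiPunct(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return ASCII_PUNCT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+2500, U+22EF and U+4DB5 from the list above, plus '!' for comparison.
        int[] samples = {0x2500, 0x22EF, 0x4DB5, '!'};
        for (int cp : samples) {
            System.out.printf("U+%04X block=%s asciiPunct=%b%n", cp, blockOf(cp), isAsciiPunct(cp));
        }
    }
}
```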
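Lately even emoji show up quite often.
Should emoji be filtered out as well? I honestly don't know. What is certain is that many symbols should be filtered, so for convenience I've collected the special characters to filter below: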

Built into Java ( https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html )
// "\\p{Space}" A whitespace character
// "\\p{Punct}" A punctuation character
-------------Unicode Range--------------------
// "\u0080-\u00FF" Latin-1 Supplement https://www.compart.com/en/unicode/block/U+0080
// "\u2000-\u206F" GeneralPunctuation https://www.compart.com/en/unicode/block/U+2000
// "\u2190-\u21FF" Arrows https://www.compart.com/en/unicode/block/U+2190
// "\u2200-\u22FF" MathematicalOperators https://www.compart.com/en/unicode/block/U+2200
// "\u2500-\u257F" BoxDrawing https://www.compart.com/en/unicode/block/U+2500
// "\u25A0-\u25FF" GeometricShapes https://www.compart.com/en/unicode/block/U+25A0
// "\u2600-\u26FF" MiscellaneousSymbols https://www.compart.com/en/unicode/block/U+2600
// "\u2800-\u28FF" BraillePatterns https://www.compart.com/en/unicode/block/U+2800
// "\u2E80-\u2EFF" CjkRadicalsSupplement https://www.compart.com/en/unicode/block/U+2E80
// "\u2F00-\u2FDF" KangxiRadicals https://www.compart.com/en/unicode/block/U+2F00
// "\u2FF0-\u2FFF" IdeographicDescriptionCharacters https://www.compart.com/en/unicode/block/U+2FF0
// "\u3000-\u303F" CjkSymbolsAndPunctuation https://www.compart.com/en/unicode/block/U+3000
// "\u31C0-\u31EF" CjkStrokes https://www.compart.com/en/unicode/block/U+31C0
// "\u3200-\u32FF" EnclosedCjkLettersAndMonths https://www.compart.com/en/unicode/block/U+3200
// "\u3300-\u33FF" CjkCompatibility https://www.compart.com/en/unicode/block/U+3300
// "\u4DC0-\u4DFF" YijingHexagramSymbols (CjkUnifiedIdeographsExtensionA is actually \u3400-\u4DBF) https://www.compart.com/en/unicode/block/U+4D4C
// "\u4E00" CjkIdeographFirst https://www.compart.com/en/unicode/block/U+4E00
// "\u9FD5" CjkIdeographLast https://www.compart.com/en/unicode/U+9FD5
// "\uFF00-\uFFEF" HalfwidthAndFullwidthForms https://www.compart.com/en/unicode/block/U+FF00 (full-width?)
// "\uFE50-\uFE6F" SmallFormVariants https://www.compart.com/en/unicode/block/U+FE50
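
For reference, here is a minimal sketch of how the list above could be combined into a single `java.util.regex.Pattern`. The class name `SpecialCharFilter` and the decision to delete matches outright are assumptions for the example. It also assumes that the two single code points U+4E00 and U+9FD5 mark the bounds of the ideographs to keep, so they are left out of the filter.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class SpecialCharFilter {
    private static final Pattern SPECIAL = Pattern.compile(
        "["
        + "\\p{Space}\\p{Punct}"   // ASCII whitespace and punctuation
        + "\\u0080-\\u00FF"        // Latin-1 Supplement
        + "\\u2000-\\u206F"        // General Punctuation
        + "\\u2190-\\u21FF"        // Arrows
        + "\\u2200-\\u22FF"        // Mathematical Operators
        + "\\u2500-\\u257F"        // Box Drawing
        + "\\u25A0-\\u25FF"        // Geometric Shapes
        + "\\u2600-\\u26FF"        // Miscellaneous Symbols
        + "\\u2800-\\u28FF"        // Braille Patterns
        + "\\u2E80-\\u2EFF"        // CJK Radicals Supplement
        + "\\u2F00-\\u2FDF"        // Kangxi Radicals
        + "\\u2FF0-\\u2FFF"        // Ideographic Description Characters
        + "\\u3000-\\u303F"        // CJK Symbols and Punctuation
        + "\\u31C0-\\u31EF"        // CJK Strokes
        + "\\u3200-\\u32FF"        // Enclosed CJK Letters and Months
        + "\\u3300-\\u33FF"        // CJK Compatibility
        + "\\u4DC0-\\u4DFF"        // range as listed (the Yijing Hexagram Symbols block)
        + "\\uFE50-\\uFE6F"        // Small Form Variants
        + "\\uFF00-\\uFFEF"        // Halfwidth and Fullwidth Forms
        + "]+");

    // Removes every run of characters covered by the ranges above.
    static String strip(String text) {
        return SPECIAL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Chinese text with full-width punctuation, a box-drawing line,
        // a midline ellipsis, a space and an arrow mixed in.
        String raw = "\u4F60\u597D\uFF0C\u4E16\u754C\uFF01\u2500\u22EF test\u2192ok";
        System.out.println(strip(raw)); // only the four ideographs and "testok" remain
    }
}
```

Note that this sketch leaves supplementary-plane emoji such as U+1F600 alone, since none of the listed ranges covers them. Java can also match whole blocks by name, for example `\p{InBoxDrawing}`, which may read better than raw ranges.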


If you find this useful, feel free to use it~
