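I've recently been implementing keyword extraction and word segmentation for (Chinese) articles. One of the tasks is filtering out symbols and punctuation. A Google search turns up all sorts of examples; regular expressions like these are common: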
! "#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['?:!.,;]*([a-z]+)['?:!.,;]*
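If you don't take the time to understand them, these dazzling symbols are just confusing. In fact, much of this is already defined in the Java documentation; I had simply never studied it properly (facepalm). With the Java Pattern documentation at hand, it's easy to enumerate all the symbols: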
POSIX character classes (US-ASCII only)
\p{Lower}
A lower-case alphabetic character: [a-z]
\p{Upper}
An upper-case alphabetic character: [A-Z]
\p{ASCII}
All ASCII: [\x00-\x7F]
\p{Alpha}
An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}
A decimal digit: [0-9]
\p{Alnum}
An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}
Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}
A visible character: [\p{Alnum}\p{Punct}]
\p{Print}
A printable character: [\p{Graph}\x20]
\p{Blank}
A space or a tab: [ \t]
\p{Cntrl}
A control character: [\x00-\x1F\x7F]
\p{XDigit}
A hexadecimal digit: [0-9a-fA-F]
\p{Space}
A whitespace character: [ \t\n\x0B\f\r]
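As a minimal sketch of how these classes can be used, the snippet below replaces each run of ASCII punctuation or whitespace with a single space. The class name `AsciiSymbolFilter` and the method `strip` are made up for illustration. Note that, by default, `\p{Punct}` and `\p{Space}` only cover US-ASCII, so a full-width comma passes through untouched.

```java
import java.util.regex.Pattern;

public class AsciiSymbolFilter {
    // One or more ASCII punctuation or whitespace characters (US-ASCII only by default).
    private static final Pattern ASCII_SYMBOLS = Pattern.compile("[\\p{Punct}\\p{Space}]+");

    /** Replaces each run of ASCII punctuation/whitespace with one space, then trims. */
    public static String strip(String text) {
        return ASCII_SYMBOLS.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "Hello world test"
        System.out.println(strip("Hello, world! (test)"));
    }
}
```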
But the real world is rarely that simple. Beyond plain ASCII symbols, more and more unusual Unicode symbols are showing up in articles, for example:
"─" U+2500 Box Drawings Light Horizontal
"⋯" U+22EF Midline Horizontal Ellipsis
"䶵" U+4DB5 CJK Ideograph Extension A, Last
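To find out which Unicode block a stray character belongs to, one option is `Character.UnicodeBlock.of`. The sketch below (the class `BlockLookup` and the helper `blockOf` are hypothetical names) looks up the three characters above and also checks that the default, ASCII-only `\p{Punct}` does not match any of them. The characters are written as escapes so the source stays pure ASCII.

```java
public class BlockLookup {
    /** Returns the Unicode block of the first code point in s. */
    public static Character.UnicodeBlock blockOf(String s) {
        return Character.UnicodeBlock.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // U+2500, U+22EF and U+4DB5: the three example characters from the text.
        String[] samples = {"\u2500", "\u22EF", "\u4DB5"};
        for (String s : samples) {
            // Print the code point, its block, and whether ASCII-only \p{Punct} matches it.
            System.out.printf("U+%04X %s matchesPunct=%b%n",
                    s.codePointAt(0), blockOf(s), s.matches("\\p{Punct}"));
        }
    }
}
```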
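Lately even emoji show up very often. Should emoji be filtered out? I honestly don't know. What is certain is that many symbols should be, so for convenience I've collected the special characters to filter below: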
Built into Java (https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)
// "\\p{Space}" A whitespace character
// "\\p{Punct}" A punctuation character
-------------Unicode Range--------------------
// "\u0080-\u00FF" Latin-1 Supplement https://www.compart.com/en/unicode/block/U+0080
// "\u2000-\u206F" GeneralPunctuation https://www.compart.com/en/unicode/block/U+2000
// "\u2190-\u21FF" Arrows https://www.compart.com/en/unicode/block/U+2190
// "\u2200-\u22FF" MathematicalOperators https://www.compart.com/en/unicode/block/U+2200
// "\u2500-\u257F" BoxDrawing https://www.compart.com/en/unicode/block/U+2500
// "\u25A0-\u25FF" GeometricShapes https://www.compart.com/en/unicode/block/U+25A0
// "\u2600-\u26FF" MiscellaneousSymbols https://www.compart.com/en/unicode/block/U+2600
// "\u2800-\u28FF" BraillePatterns https://www.compart.com/en/unicode/block/U+2800
// "\u2E80-\u2EFF" CjkRadicalsSupplement https://www.compart.com/en/unicode/block/U+2E80
// "\u2F00-\u2FDF" KangxiRadicals https://www.compart.com/en/unicode/block/U+2F00
// "\u2FF0-\u2FFF" IdeographicDescriptionCharacters https://www.compart.com/en/unicode/block/U+2FF0
// "\u3000-\u303F" CjkSymbolsAndPunctuation https://www.compart.com/en/unicode/block/U+3000
// "\u31C0-\u31EF" CjkStrokes https://www.compart.com/en/unicode/block/U+31C0
// "\u3200-\u32FF" EnclosedCjkLettersAndMonths https://www.compart.com/en/unicode/block/U+3200
// "\u3300-\u33FF" CjkCompatibility https://www.compart.com/en/unicode/block/U+3300
// "\u4DC0-\u4DFF" YijingHexagramSymbols (CjkUnifiedIdeographsExtensionA is actually \u3400-\u4DBF) https://www.compart.com/en/unicode/block/U+4D4C
// "\u4E00" CjkIdeographFirst https://www.compart.com/en/unicode/block/U+4E00
// "\u9FD5" CjkIdeographLast https://www.compart.com/en/unicode/U+9FD5
// "\uFF00-\uFFEF" HalfwidthAndFullwidthForms https://www.compart.com/en/unicode/block/U+FF00 full-width?
// "\uFE50-\uFE6F" SmallFormVariants https://www.compart.com/en/unicode/block/U+FE50
Feel free to use it if you're interested~
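The ranges listed above could be assembled into a single character class roughly like this. It is only a sketch under a few assumptions: the names `SymbolFilter` and `clean` are invented for illustration, each run of symbols is replaced with one space (rather than deleted) so neighbouring words don't get glued together, and U+4E00 / U+9FD5 are treated as markers for the first and last CJK ideograph rather than something to filter.

```java
import java.util.regex.Pattern;

public class SymbolFilter {
    // Character class built from the list above. U+4E00 and U+9FD5 only mark the
    // first/last CJK ideograph in the list, so they are intentionally left out.
    private static final Pattern SYMBOLS = Pattern.compile("["
            + "\\p{Space}\\p{Punct}"  // ASCII whitespace and punctuation
            + "\\u0080-\\u00FF"       // Latin-1 Supplement
            + "\\u2000-\\u206F"       // General Punctuation
            + "\\u2190-\\u21FF"       // Arrows
            + "\\u2200-\\u22FF"       // Mathematical Operators
            + "\\u2500-\\u257F"       // Box Drawing
            + "\\u25A0-\\u25FF"       // Geometric Shapes
            + "\\u2600-\\u26FF"       // Miscellaneous Symbols
            + "\\u2800-\\u28FF"       // Braille Patterns
            + "\\u2E80-\\u2EFF"       // CJK Radicals Supplement
            + "\\u2F00-\\u2FDF"       // Kangxi Radicals
            + "\\u2FF0-\\u2FFF"       // Ideographic Description Characters
            + "\\u3000-\\u303F"       // CJK Symbols and Punctuation
            + "\\u31C0-\\u31EF"       // CJK Strokes
            + "\\u3200-\\u32FF"       // Enclosed CJK Letters and Months
            + "\\u3300-\\u33FF"       // CJK Compatibility
            + "\\u4DC0-\\u4DFF"       // range exactly as given in the list
            + "\\uFF00-\\uFFEF"       // Halfwidth and Fullwidth Forms
            + "\\uFE50-\\uFE6F"       // Small Form Variants
            + "]+");

    /** Replaces each run of filtered characters with one space, then trims. */
    public static String clean(String text) {
        return SYMBOLS.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Seven CJK ideographs mixed with a box-drawing dash (U+2500), a midline
        // ellipsis (U+22EF), a full-width comma and a full-width exclamation mark.
        String sample = "\u4ECA\u5929\u2500\u5F88\u597D\u22EF\uFF0C\u771F\u7684\uFF01";
        System.out.println(clean(sample));
    }
}
```

Two side effects are worth keeping in mind: the Latin-1 Supplement range also strips accented letters such as é, and the Halfwidth and Fullwidth Forms range also strips full-width digits and letters, so trim the list if your text needs those.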