Set of functions used to preprocess French text characters.

External dependencies :

pip install pandas

Configure tabular data display in this notebook :

Character set normalization for French

The config object from frenchtext.core defines the directory where the character normalization tables are located :

!ls {chardatadir}
charset-fr.csv		  latinletters.csv     unicode_categories.csv
charsetstats_norm.csv	  latinnumbers.csv     unicode_families.csv
charsetstats_raw.csv	  latinsymbols.csv     unsupported.stats.csv
combiningdiacritics.csv   normalizedchars.csv  utf8-windows1252-errors.csv
controlchars.csv	  stats		       windows1252-iso8859-errors.csv
cyrillic-greek-chars.csv  unicode_blocks.csv   windows1252-utf8-errors.csv

0. Unicode character properties

charname[source]

charname(char)

charcategory[source]

charcategory(char)

charsubcategory[source]

charsubcategory(char)

charblock[source]

charblock(char)

blockfamily[source]

blockfamily(block)

charname("🙂")
'Slightly Smiling Face'
charcategory("🙂")
'Symbol'
charsubcategory("🙂")
'Other'
charblock("🙂")
'Emoticons'
blockfamily('Emoticons')
'Symbols'
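These helpers are thin wrappers around the Unicode database. As a hedged sketch (the actual implementations may differ, and charblock/blockfamily need the CSV lookup tables from chardatadir, so only the first two are reproduced), they can be approximated with the standard unicodedata module :

```python
import unicodedata

def charname(char):
    # Name of the char in the Unicode database, title-cased like in the CSV
    # tables; unnamed chars fall back to "Char <code>", e.g. "Char 10"
    return unicodedata.name(char, f"Char {ord(char)}").title()

def charcategory(char):
    # Map the two-letter general category (e.g. 'So') to its major class
    categories = {"L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
                  "S": "Symbol", "Z": "Separator", "C": "Other"}
    return categories[unicodedata.category(char)[0]]

print(charname("🙂"))      # Slightly Smiling Face
print(charcategory("🙂"))  # Symbol
```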

1. Explore French dataset characters

French datasets often contain several thousand distinct Unicode characters.

We need to reduce the number of distinct characters fed to our natural language processing applications, for three reasons :

  • chars the user considers visually equivalent will often produce different application behaviors : this is a huge problem for the user experience
  • with so many chars, the designer of the NLP application will not be able to reason about all possible combinations : this could harm the explainability of the system
  • this huge number of distinct characters brings a significant amount of complexity the NLP models will have to deal with

1.1 Character frequency in French datasets

dfcharstats = pd.read_csv(chardatadir / "charsetstats_raw.csv", sep=";")
dfcharstats
Unnamed: 0 Code Char Name Category Subcategory Block CountBusiness CountWikipedia Count
0 0 101 e Latin Small Letter E Letter Lowercase Basic Latin 3.503992e+09 4.595437e+09 8.099428e+09
1 1 115 s Latin Small Letter S Letter Lowercase Basic Latin 1.960554e+09 2.534105e+09 4.494658e+09
2 2 97 a Latin Small Letter A Letter Lowercase Basic Latin 1.865590e+09 2.447239e+09 4.312829e+09
3 3 110 n Latin Small Letter N Letter Lowercase Basic Latin 1.819350e+09 2.388609e+09 4.207959e+09
4 5 105 i Latin Small Letter I Letter Lowercase Basic Latin 1.766427e+09 2.331461e+09 4.097888e+09
... ... ... ... ... ... ... ... ... ... ...
13497 13495 37294 Cjk Unified Ideograph-91Ae Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13498 13496 35824 Cjk Unified Ideograph-8Bf0 Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13499 13497 26634 Cjk Unified Ideograph-680A Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13500 13498 31787 Cjk Unified Ideograph-7C2B Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13501 13501 28378 Cjk Unified Ideograph-6Eda Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00

13502 rows × 10 columns

1.2 Character stats in the Wikipedia dataset

  • 35.6 billion chars
charsCountWikipedia = dfcharstats["CountWikipedia"].sum()
charsCountWikipedia
35682395281.0
  • 13 502 distinct Unicode chars
distinctCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>0])
distinctCharsWikipedia
13502
  • Only 1316 chars more frequent than 1 in 100 million
frequentCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>356])
frequentCharsWikipedia
1316
  • Frequent chars represent 9.7 % of all distinct Unicode chars
pctFreqCharsWikipedia = frequentCharsWikipedia/distinctCharsWikipedia*100
pctFreqCharsWikipedia
9.74670419197156
  • 99.9987 % of Wikipedia chars would be preserved if we only kept the frequent chars
pctPreservedCharsWikipedia = (1-dfcharstats[dfcharstats["CountWikipedia"]<=356]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedCharsWikipedia
99.99871204274157

1.3 Character stats in the Business dataset

  • 27.5 billion chars
charsCountBusiness = dfcharstats["CountBusiness"].sum()
charsCountBusiness
27577304956.0
  • 3 763 distinct Unicode chars
distinctCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>0])
distinctCharsBusiness
3763
  • Only 531 chars more frequent than 1 in 100 million
frequentCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>275])
frequentCharsBusiness
531
  • Frequent chars represent 14.1 % of all distinct Unicode chars
pctFreqCharsBusiness = frequentCharsBusiness/distinctCharsBusiness*100
pctFreqCharsBusiness
14.11108158384268
  • 99.9996 % of Business chars would be preserved if we only kept the frequent chars
pctPreservedCharsBusiness = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountBusiness"].sum()/dfcharstats["CountBusiness"].sum())*100
pctPreservedCharsBusiness
99.9996564385093
  • 99.985 % of Wikipedia chars would be preserved if we only kept the frequent Business chars
pctPreservedBizCharsInWikipedia = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedBizCharsInWikipedia
99.9848317525845

1.4 Character stats after Unicode normalization

After applying the normalization process defined below in this notebook, here are the remaining chars :

dfcharsnorm = pd.read_csv(chardatadir / "charset-fr.csv", sep=";")
dfcharsnorm
FrCode Category SubCategory Code Char CharName CountBusiness
0 0 separator control 0 NaN Reserved - End of string 0
1 1 separator space 32 Space 88494564
2 2 separator space 10 \n Char 10 9588147
3 3 separator space 9 \t Char 9 1522053
4 4 separator punctuation 44 , Comma 286106887
... ... ... ... ... ... ... ...
251 251 emoticon object 9792 Female Sign 515
252 252 emoticon object 127881 🎉 Party Popper 356
253 253 emoticon object 9997 Writing Hand 157
254 254 emoticon object 9993 Envelope 55
255 255 emoticon object 10013 Latin Cross 22

256 rows × 7 columns

Stats for the character families after normalization

The table below shows the number of chars in each category (after normalization) per 100 million characters :

dfblocks = dfcharsnorm.groupby(by=["Category","SubCategory"]).agg({"Char":["count","sum"],"CountBusiness":"sum"})
dfblocks["CountBusiness"] = (dfblocks["CountBusiness"] / charsCountBusiness * 100000000).astype(int)
dfblocks
Char CountBusiness
count sum sum
Category SubCategory
emoticon hand 12 💪👉👍👏🙏🙌👇👊👎👌✌✊ 42
head 28 🙂😉😀😂😁😊🙁😅😍😃😡🤣😄🤔😎😭👹😱😜😋🤩🙄😆😛🤪😢😇🤦 233
object 16 ⚠🔴🔥🏆⚽💡🚨💥⚡♫♂♀🎉✍✉✝ 60
letter digit 10 0123549876 3271115
encoding 3 � 249
greek 2 λπ 2
latin-fr 84 abcdefghijklmnopqrstuvwxyzàâäçèéêëîïôöùûüÿABCD... 91437146
latin-other 25 áãåćčėğıíìńñóòõøšşßúÁÅŠÚŽ 712
other 5 _&@\# 40814
separator control 0 0 0
punctuation 23 ,'.-:/")(?!»«|…;[]}{•¿¡ 4684722
space 3 \n\t 361183
symbol currency 6 €$¤£¥¢ 21099
math 14 =>+<^~×≤÷≥±≠∞√ 50056
shape 15 *✓⇒♥¦→★¯↓❌❐†↑←↔ 7954
sign 3 ©®™ 1754
unit 6 %°§µØ‰ 102213

2. Character normalization pipeline

After a detailed study of all the frequent chars, the goal is to design a normalization pipeline which retains as much information as possible while greatly reducing the number of distinct chars.

We saw before that it is possible to preserve 99.9996% of the original chars while keeping only around 500 distinct chars. By being clever and replacing equivalent chars, we can divide this number by 2 and still retain the same amount of information.

It may then be useful to limit the number of distinct characters after normalization to 255 :

  • if needed, French text chars can then be encoded with a single byte
  • the list of supported chars can be memorized by NLP application developers and users

The normalization pipeline applies the following 14 steps, which are explained and illustrated in the sections below.

  • Fix encoding errors
    • fix windows1252 text read as iso8859-1
    • fix utf8 text read as windows1252
    • fix windows1252 text read as utf8
    • merge Unicode combining chars
    • ignore control chars
  • Remove display attributes
    • replace latin letter symbols
    • replace latin letter ligatures
    • replace latin number symbols
  • Normalize visually equivalent chars
    • replace equivalent chars
    • replace cyrillic and greek chars looking like latin letters
  • Encode infrequent chars while losing a little bit of information
    • replace infrequent latin letters with diacritics
    • replace infrequent chars from other scripts
    • replace infrequent symbols
    • ignore remaining chars with no glyph

2.1 Frequent encoding errors : windows1252 read as iso8859-1

dfencodingwin1252 = pd.read_csv(chardatadir / "windows1252-iso8859-errors.csv", sep=";")
dfencodingwin1252.head(10)
Code Char DecodedCode DecodedChar
0 146 ’ 8217
1 128 € 8364
2 133 8230
3 150 – 8211
4 156 œ 339 œ
5 149 • 8226
6 147 “ 8220
7 148 ” 8221
8 151 — 8212
9 145 ‘ 8216
print(f"{len(dfencodingwin1252)} frequent encoding errors seen in french datasets : a character encoded as windows1252 was incorrectly decoded as iso8859-1")
10 frequent encoding errors seen in french datasets : a character encoded as windows1252 was incorrectly decoded as iso8859-1

Columns :

  • Code/Char : incorrectly decoded control char seen in French text
  • DecodedCode/DecodedChar : properly decoded char which should replace the original control char
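The table above can in fact be derived from the codecs themselves : a windows1252 byte decoded as iso8859-1 keeps its raw byte value as code point, so re-encoding as latin-1 and decoding as windows1252 recovers the intended char. A minimal sketch (fix_win1252_as_iso8859 is an illustrative name, not part of the library) :

```python
def fix_win1252_as_iso8859(text):
    out = []
    for char in text:
        if 128 <= ord(char) <= 159:  # C1 control range: never valid in clean text
            # iso8859-1 maps each byte to the code point of same value,
            # so latin-1 re-encoding recovers the original windows1252 byte
            out.append(char.encode("latin-1").decode("windows-1252", errors="replace"))
        else:
            out.append(char)
    return "".join(out)

print(fix_win1252_as_iso8859("l\x92\x9cuvre"))  # -> "l’œuvre"
```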

2.2 Frequent encoding errors : utf8 read as windows1252

dfencodingutf8 = pd.read_csv(chardatadir / "utf8-windows1252-errors.csv", sep=";")
dfencodingutf8.head(10)
ErrorSubstring DecodedCode DecodedChar
0 € 8364
1 ‚ 8218
2 Æ’ 402 ƒ
3 „ 8222
4 … 8230
5 †8224
6 ‡ 8225
7 ˆ 710 ˆ
8 ‰ 8240
9 Å 352 Š
print(f"{len(dfencodingutf8)} very unlikely substrings produced when text encoded with UTF-8 is decoded by mistake as iso8859-1 or windows1252")
117 very unlikely substrings produced when text encoded with UTF-8 is decoded by mistake as iso8859-1 or windows1252

Columns :

  • ErrorSubstring : unlikely substring of length 2 or 3 characters produced when UTF-8 text is decoded by mistake as windows1252
  • DecodedCode/DecodedChar : properly decoded char which should be used to replace the unlikely substring
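For this direction of mojibake, re-encoding the garbled substring as windows1252 recovers the original UTF-8 bytes, which can then be decoded correctly. A minimal sketch (fix_utf8_as_win1252 is an illustrative name, not part of the library; the CSV table works per substring, whereas this sketch only handles a fully garbled string) :

```python
def fix_utf8_as_win1252(text):
    try:
        # Garbled chars like 'â€™' re-encode to the original UTF-8 bytes
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake: leave the text untouched

print(fix_utf8_as_win1252("lâ€™Å“uvre"))  # -> "l’œuvre"
```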

2.3 Frequent encoding errors : windows1252 read as utf8

dfencodingwin1252utf8 = pd.read_csv(chardatadir / "windows1252-utf8-errors.csv", sep=";")
dfencodingwin1252utf8.head()
Code Char DecodedCodes DecodedChars
0 38971 [233, 160, 187] é »
print(f"{len(dfencodingwin1252utf8)} char very unlikely in french text produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8")
1 char very unlikely in french text produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8

Columns :

  • Char : unlikely char produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8
  • DecodedCodes/DecodedChars : properly decoded substring which should be used to replace the unlikely char
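This is the reverse accident : windows1252 bytes occasionally form a valid UTF-8 multi-byte sequence and decode to a single CJK char. Re-encoding that char as UTF-8 and decoding the bytes as windows1252 recovers the original substring, as a sketch (fix_win1252_as_utf8 is an illustrative name) :

```python
def fix_win1252_as_utf8(char):
    # chr(38971) is the single such char observed in the datasets:
    # its UTF-8 bytes 0xE9 0xA0 0xBB read as windows1252 give 'é', nbsp, '»'
    return char.encode("utf-8").decode("windows-1252", errors="replace")

print(repr(fix_win1252_as_utf8(chr(38971))))  # 'é\xa0»'
```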

2.4 Unicode combining chars

dfcombiningchars = pd.read_csv(chardatadir / "combiningdiacritics.csv", sep=";")
dfcombiningchars.head()
BaseChar Code Char Diacritic CombinedChar
0 A 769 ́ Acute Á
1 E 769 ́ Acute É
2 I 769 ́ Acute Í
3 O 769 ́ Acute Ó
4 U 769 ́ Acute Ú
print(f"{len(dfcombiningchars['Char'].unique())} combining chars {list(dfcombiningchars['Diacritic'].unique())} should be recombined with {len(dfcombiningchars)} base latin characters to produce standard latin characters with diacritics")
12 combining chars ['Acute', 'Grave', 'Circumflex', 'Cedilla', 'Tilde', 'Diaeresis', 'Long Stroke Overlay', 'Macron', 'Caron', 'Dot Below', 'Dot Above', 'Ring Above'] should be recombined with 274 base latin characters to produce standard latin characters with diacritics

Columns :

  • BaseChar : latin char encountered first in the string, which will be modified by the combining char immediately following it
  • Code/Char : combining char immediately following BaseChar, which should be combined with it to produce CombinedChar
  • Diacritic : type of accent / diacritic applied by the combining char
  • CombinedChar : latin char with diacritic produced by the combination of BaseChar and the combining Char following it
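Merging a base char with the combining diacritic that follows it corresponds to Unicode NFC normalization, which the standard library already implements :

```python
import unicodedata

decomposed = "e" + chr(769)          # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(decomposed, len(decomposed))   # é 2
print(composed, len(composed))       # é 1
```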

2.5 Control chars

dfcontrolchars = pd.read_csv(chardatadir / "controlchars.csv", sep=";")
dfcontrolchars.loc[0,"Char"] = chr(0) # chr(0) can't be saved in CSV file
dfcontrolchars
Code Char CharName
0 0 Char 0
1 1  Char 1
2 2  Char 2
3 3  Char 3
4 4  Char 4
... ... ... ...
120 65532 Object Replacement Character
121 127995 🏻 Emoji Modifier Fitzpatrick Type-1-2
122 127996 🏼 Emoji Modifier Fitzpatrick Type-3
123 127997 🏽 Emoji Modifier Fitzpatrick Type-4
124 127998 🏾 Emoji Modifier Fitzpatrick Type-5

125 rows × 3 columns

print(f"{len(dfcontrolchars)} control chars seen in french datasets, which can't be displayed and should be ignored")
125 control chars seen in french datasets, which can't be displayed and should be ignored

Columns :

  • Code : Unicode code point for the character
  • Char : control character
  • CharName : name of the character in the Python Unicode database
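A rough approximation of this step can be built on the Unicode general categories, assuming the standard unicodedata module : most rows of controlchars.csv fall in 'Cc' (control), 'Cf' (format) or 'Co' (private use). The emoji skin tone modifiers and the object replacement character listed above are deliberate additions of the CSV table, not covered by this category rule.

```python
import unicodedata

def ignore_control_chars(text):
    return "".join(c for c in text
                   if c in "\n\t"  # keep the two whitespace control chars
                   or unicodedata.category(c) not in ("Cc", "Cf", "Co"))

print(ignore_control_chars("bel\x07le\u200b !"))  # -> "belle !"
```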

2.6 Latin letter symbols

dflatinsymbols = pd.read_csv(chardatadir / "latinsymbols.csv", sep=";")
dflatinsymbols.head(10)
Code Char CharName NormString Layout
0 8253 Interrobang ?! NaN
1 8265 Exclamation Question Mark !? NaN
2 8448 Account Of a/c NaN
3 8449 Addressed To The Subject a/s NaN
4 8450 Double-Struck Capital C C Double-Struck
5 8451 Degree Celsius °C Unit
6 8453 Care Of c/o NaN
7 8454 Cada Una c/u NaN
8 8457 Degree Fahrenheit °F Unit
9 8458 Script Small G g Script
dflatinsymbols[230:240]
Code Char CharName NormString Layout
230 119908 𝑤 Mathematical Italic Small W w Mathematical Italic
231 119909 𝑥 Mathematical Italic Small X x Mathematical Italic
232 119910 𝑦 Mathematical Italic Small Y y Mathematical Italic
233 119911 𝑧 Mathematical Italic Small Z z Mathematical Italic
234 119912 𝑨 Mathematical Bold Italic Capital A A Mathematical Bold Italic
235 119913 𝑩 Mathematical Bold Italic Capital B B Mathematical Bold Italic
236 119914 𝑪 Mathematical Bold Italic Capital C C Mathematical Bold Italic
237 119915 𝑫 Mathematical Bold Italic Capital D D Mathematical Bold Italic
238 119916 𝑬 Mathematical Bold Italic Capital E E Mathematical Bold Italic
239 119917 𝑭 Mathematical Bold Italic Capital F F Mathematical Bold Italic
print(f"{len(dflatinsymbols)} Unicode symbols which represent latin letters with a specific layout like {list(dflatinsymbols['Layout'].unique())}")
917 Unicode symbols which represent latin letters with a specific layout like [nan, 'Double-Struck', 'Unit', 'Script', 'Black-Letter', 'Turned', 'Rotated', 'Turned Sans-Serif', 'Reversed Sans-Serif', 'Double-Struck Italic', 'Parenthesized', 'Circled', 'Mathematical Bold', 'Mathematical Italic', 'Mathematical Bold Italic', 'Mathematical Script', 'Mathematical Bold Script', 'Mathematical Fraktur', 'Mathematical Double-Struck', 'Mathematical Bold Fraktur', 'Mathematical Sans-Serif', 'Mathematical Sans-Serif Bold', 'Mathematical Sans-Serif Italic', 'Mathematical Sans-Serif Bold Italic', 'Mathematical Monospace', 'Tortoise Shell Bracketed', 'Circled Italic', 'Squared', 'Negative Circled', 'Negative Squared', 'Crossed Negative Squared', 'Regional Indicator']

Columns :

  • Code/Char/CharName : Unicode symbol representing a latin letter with a specific layout
  • NormString : normalized string using only very frequent chars
  • Layout : info about the specific layout applied to the latin char
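Many of these layout variants carry a Unicode compatibility decomposition, so NFKC normalization already maps them back to plain latin letters; the CSV table extends this to symbols NFKC leaves untouched :

```python
import unicodedata

print(unicodedata.normalize("NFKC", "𝑥"))   # Mathematical Italic Small X -> x
print(unicodedata.normalize("NFKC", "ℂ"))   # Double-Struck Capital C     -> C
print(unicodedata.normalize("NFKC", "⒜"))  # Parenthesized Small A       -> (a)
```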

2.7 Latin letters ligatures / Latin letters diacritics

dflatinletters = pd.read_csv(chardatadir / "latinletters.csv", sep=";")
dflatinletters[89:99]
Code Char LetterName IsUpper UpperChar IsLower LowerChar IsDiacritic BaseChar Diacritics IsLigature MultiChars CharName Block Category SubCategory
89 230 æ Ae False Æ True æ False NaN NaN True ae Latin Small Letter Ae Latin-1 Supplement Letter Lowercase
90 231 ç C False Ç True ç True c Cedilla False NaN Latin Small Letter C With Cedilla Latin-1 Supplement Letter Lowercase
91 232 è E False È True è True e Grave False NaN Latin Small Letter E With Grave Latin-1 Supplement Letter Lowercase
92 233 é E False É True é True e Acute False NaN Latin Small Letter E With Acute Latin-1 Supplement Letter Lowercase
93 234 ê E False Ê True ê True e Circumflex False NaN Latin Small Letter E With Circumflex Latin-1 Supplement Letter Lowercase
94 235 ë E False Ë True ë True e Diaeresis False NaN Latin Small Letter E With Diaeresis Latin-1 Supplement Letter Lowercase
95 236 ì I False Ì True ì True i Grave False NaN Latin Small Letter I With Grave Latin-1 Supplement Letter Lowercase
96 237 í I False Í True í True i Acute False NaN Latin Small Letter I With Acute Latin-1 Supplement Letter Lowercase
97 238 î I False Î True î True i Circumflex False NaN Latin Small Letter I With Circumflex Latin-1 Supplement Letter Lowercase
98 239 ï I False Ï True ï True i Diaeresis False NaN Latin Small Letter I With Diaeresis Latin-1 Supplement Letter Lowercase
print(f"{len(dflatinletters)} chars representing latin letters, {len(dflatinletters[dflatinletters['IsUpper']])} upper case and {len(dflatinletters[dflatinletters['IsLower']])} lower case, {len(dflatinletters[dflatinletters['IsDiacritic']])} with diacritics like {list(dflatinletters[dflatinletters['IsDiacritic']]['Diacritics'].unique())[0:20]}, {len(dflatinletters[dflatinletters['IsLigature']])} representing multiple letters in ligature")
1230 chars representing latin letters, 459 upper case and 704 lower case, 1031 with diacritics like ['Grave', 'Acute', 'Circumflex', 'Tilde', 'Diaeresis', 'Ring Above', 'Cedilla', 'Stroke', 'Macron', 'Breve', 'Ogonek', 'Dot Above', 'Caron', 'Dotless', 'Middle Dot', 'Preceded By Apostrophe', 'Double Acute', 'Long', 'Hook', 'Topbar'], 88 representing multiple letters in ligature

Columns :

  • Code/Char/CharName : Unicode character representing one or more latin letters
  • LetterName : name of the latin letter (without case and diacritics qualifiers)
  • IsUpper/UpperChar and IsLower/LowerChar : upper case or lower case equivalent chars
  • IsDiacritic => BaseChar : equivalent char without any diacritic (accents ...), Diacritics : description of all diacritics applied to the char
  • IsLigature => MultiChars : if the char represents multiple latin letters in a single ligature, string representing the equivalent list of letters
  • Block/Category/SubCategory : Unicode classification for each char
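The BaseChar column can largely be rebuilt from the Unicode database : decompose with NFD, then drop the combining marks (category 'Mn'). Ligatures like 'œ' have no canonical decomposition, so the MultiChars column of the table is still needed for them. A sketch, assuming the standard unicodedata module :

```python
import unicodedata

def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks all belong to the 'Mn' (Mark, nonspacing) category
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")

print(strip_diacritics("çàêï"))  # -> "caei"
```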

2.8 Latin numbers and number symbols

dflatinnumbers = pd.read_csv(chardatadir / "latinnumbers.csv", sep=";")
dflatinnumbers[30:40]
Code Char CharName NormString Layout
30 8327 Subscript Seven (7) Subscript
31 8328 Subscript Eight (8) Subscript
32 8329 Subscript Nine (9) Subscript
33 8528 Vulgar Fraction One Seventh 1/7 Vulgar Fraction
34 8529 Vulgar Fraction One Ninth 1/9 Vulgar Fraction
35 8530 Vulgar Fraction One Tenth 1/10 Vulgar Fraction
36 8531 Vulgar Fraction One Third 1/3 Vulgar Fraction
37 8532 Vulgar Fraction Two Thirds 2/3 Vulgar Fraction
38 8533 Vulgar Fraction One Fifth 1/5 Vulgar Fraction
39 8534 Vulgar Fraction Two Fifths 2/5 Vulgar Fraction
dflatinnumbers[200:210]
Code Char CharName NormString Layout
200 12881 Circled Number Twenty One (21) Circled
201 12882 Circled Number Twenty Two (22) Circled
202 12883 Circled Number Twenty Three (23) Circled
203 12884 Circled Number Twenty Four (24) Circled
204 12885 Circled Number Twenty Five (25) Circled
205 12886 Circled Number Twenty Six (26) Circled
206 12887 Circled Number Twenty Seven (27) Circled
207 12888 Circled Number Twenty Eight (28) Circled
208 12889 Circled Number Twenty Nine (29) Circled
209 12890 Circled Number Thirty (30) Circled
print(f"{len(dflatinnumbers)} chars representing latin digits, some with specific layouts like {list(dflatinnumbers['Layout'].unique())[1:]}")
302 chars representing latin digits, some with specific layouts like ['Superscript', 'Vulgar Fraction', 'Subscript', 'Roman Numeral', 'Small Roman Numeral', 'Circled', 'Parenthesized', ' Full Stop', 'Negative Circled', 'Double Circled', 'Dingbat Negative Circled', 'Dingbat Circled Sans-Serif', 'Dingbat Negative Circled Sans-Serif ', 'Circled On Black Square', 'Fullwidth', 'Mathematical Bold', 'Mathematical Double-Struck', 'Mathematical Sans-Serif', 'Mathematical Sans-Serif Bold', 'Mathematical Monospace', 'Full Stop', 'Comma']

Columns :

  • Code/Char/CharName : Unicode char representing one or more latin digits
  • NormString : normalized string representing the equivalent number, plus punctuation if needed
  • Layout : info about the specific layout applied to the latin digits
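Here too NFKC normalization recovers a plain form for many variants, but it does not match the NormString column exactly : vulgar fractions decompose with the fraction slash U+2044 rather than '/', and circled numbers lose their parentheses, so the table remains authoritative for the pipeline output.

```python
import unicodedata

print(unicodedata.normalize("NFKC", "²"))   # Superscript Two           -> 2
print(unicodedata.normalize("NFKC", "①"))  # Circled Digit One         -> 1
print(unicodedata.normalize("NFKC", "½"))   # Vulgar Fraction One Half  -> 1⁄2 (U+2044, not '/')
```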

2.9 Variations on frequent chars to normalize

dfnormchars = pd.read_csv(chardatadir / "normalizedchars.csv", sep=";")
dfnormchars.head()
Code Char CharName NormCode NormChar NormCharName
0 11 Char 11 10 \n Char 10
1 13 \r Char 13 10 \n Char 10
2 182 Pilcrow Sign 10 \n Char 10
3 8232 Line Separator 10 \n Char 10
4 160 No-Break Space 32 Space
dfnormchars[49:59]
Code Char CharName NormCode NormChar NormCharName
49 8209 Non-Breaking Hyphen 45 - Hyphen-Minus
50 8210 Figure Dash 45 - Hyphen-Minus
51 8211 En Dash 45 - Hyphen-Minus
52 8212 Em Dash 45 - Hyphen-Minus
53 8213 Horizontal Bar 45 - Hyphen-Minus
54 8259 Hyphen Bullet 45 - Hyphen-Minus
55 8288 Word Joiner 45 - Hyphen-Minus
56 8315 Superscript Minus 45 - Hyphen-Minus
57 8331 Subscript Minus 45 - Hyphen-Minus
58 8722 Minus Sign 45 - Hyphen-Minus
dfnormchars[25:35]
Code Char CharName NormCode NormChar NormCharName
25 697 ʹ Modifier Letter Prime 39 ' Apostrophe
26 699 ʻ Modifier Letter Turned Comma 39 ' Apostrophe
27 700 ʼ Modifier Letter Apostrophe 39 ' Apostrophe
28 702 ʾ Modifier Letter Right Half Ring 39 ' Apostrophe
29 703 ʿ Modifier Letter Left Half Ring 39 ' Apostrophe
30 712 ˈ Modifier Letter Vertical Line 39 ' Apostrophe
31 714 ˊ Modifier Letter Acute Accent 39 ' Apostrophe
32 715 ˋ Modifier Letter Grave Accent 39 ' Apostrophe
33 729 ˙ Dot Above 39 ' Apostrophe
34 8216 Left Single Quotation Mark 39 ' Apostrophe
print(f"{len(dfnormchars)} alternative chars which are sometimes used as equivalent visual representations for {len(dfnormchars['NormChar'].unique())} other very frequent chars")
171 alternative chars which are sometimes used as equivalent visual representations for 53 other very frequent chars

Columns :

  • Code/Char/CharName : alternative Unicode char often used as a visual equivalent of a more frequent char
  • NormCode/NormChar/NormCharName : more frequent char which should be used to normalize text
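A 1-to-1 replacement table like this maps naturally to str.translate, sketched here with a few rows of normalizedchars.csv hard-coded for illustration :

```python
# Build a translation table from a handful of (Char -> NormChar) rows
normtable = str.maketrans({
    "\u2019": "'",   # Right Single Quotation Mark -> Apostrophe
    "\u2013": "-",   # En Dash -> Hyphen-Minus
    "\u00a0": " ",   # No-Break Space -> Space
})
print("l’exemple – ici".translate(normtable))  # -> "l'exemple - ici"
```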

2.10 Cyrillic and greek chars looking like latin letters

dfcgnormchars = pd.read_csv(chardatadir / "cyrillic-greek-chars.csv", sep=";")
dfcgnormchars[5:15]
Code Char CharName NormCode NormChar NormCharName
5 949 ε Greek Small Letter Epsilon 101 e Latin Small Letter E
6 1077 е Cyrillic Small Letter Ie 101 e Latin Small Letter E
7 1108 є Cyrillic Small Letter Ukrainian Ie 101 e Latin Small Letter E
8 1085 н Cyrillic Small Letter En 104 h Latin Small Letter H
9 953 ι Greek Small Letter Iota 105 i Latin Small Letter I
10 1082 к Cyrillic Small Letter Ka 107 k Latin Small Letter K
11 1084 м Cyrillic Small Letter Em 109 m Latin Small Letter M
12 951 η Greek Small Letter Eta 110 n Latin Small Letter N
13 959 ο Greek Small Letter Omicron 111 o Latin Small Letter O
14 963 σ Greek Small Letter Sigma 111 o Latin Small Letter O
print(f"{len(dfcgnormchars)} cyrillic and greek chars used as equivalent visual representations for {len(dfcgnormchars['NormChar'].unique())} latin letters")
27 cyrillic and greek chars used as equivalent visual representations for 18 latin letters

NOTE : this standardization step is optional.

Even if it sounds strange, all the cyrillic and greek letters in the table above are most often used as equivalents of latin letters in the French datasets.

Columns :

  • Code/Char/CharName : cyrillic or greek char often used as a visual equivalent of a latin letter
  • NormCode/NormChar/NormCharName : more frequent char which should be used to normalize text

2.11 Replace infrequent latin letters with diacritics

supportedchars = dfcharsnorm["Char"].values[1:]
' '.join(supportedchars)
'  \n \t , \' . - : / " ) ( ? ! » « | … ; [ ] } { • ¿ ¡ 0 1 2 3 5 4 9 8 7 6 a b c d e f g h i j k l m n o p q r s t u v w x y z à â ä ç è é ê ë î ï ô ö ù û ü ÿ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z À Â Ä Ç È É Ê Ë Î Ï Ô Ö Ù Û Ü Ÿ _ & @ \\ # á ã å ć č ė ğ ı í ì ń ñ ó ò õ ø š ş ß ú Á Å Š Ú Ž λ π Ã �  % ° § µ Ø ‰ € $ ¤ £ ¥ ¢ = > + < ^ ~ × ≤ ÷ ≥ ± ≠ ∞ √ * ✓ ⇒ ♥ ¦ → ★ ¯ ↓ ❌ ❐ † ↑ ← ↔ © ® ™ 🙂 😉 😀 😂 😁 😊 🙁 😅 😍 😃 😡 🤣 😄 🤔 😎 😭 👹 😱 😜 😋 \U0001f929 🙄 😆 😛 \U0001f92a 😢 😇 🤦 💪 👉 👍 👏 🙏 🙌 👇 👊 👎 👌 ✌ ✊ ⚠ 🔴 🔥 🏆 ⚽ 💡 🚨 💥 ⚡ ♫ ♂ ♀ 🎉 ✍ ✉ ✝'
latinlettersnodiacritics = {}
for rowidx,row in dflatinletters.iterrows():
    if row["IsDiacritic"]:
        latinlettersnodiacritics[row["Char"]] = row["BaseChar"]
for idx,letter in enumerate(latinlettersnodiacritics):
    if not letter in supportedchars:
        print(f"{letter} => {latinlettersnodiacritics[letter]}")
    if idx >= 60 : break
Ì => I
Í => I
Ñ => N
Ò => O
Ó => O
Õ => O
Ý => Y
ý => y
Ā => A
ā => a
Ă => A
ă => a
Ą => A
ą => a

2.12 Replace infrequent chars from other scripts

def replaceotherscripts(charset, chariterator):
    for char in chariterator:
        if char in charset:
            yield char
        else:
            family = blockfamily(charblock(char))
            if not family in ("Symbols","Ignore"):
                resStr = chr(65532) + str(ord(char)) + '_'
                for outchar in resStr:
                    yield outchar
            else:
                yield char
''.join(replaceotherscripts(supportedchars,"Guānhuà (官话/官話)"))
'Gu257_nhuà (23448_35805_/23448_35441_)'

All characters from non-latin scripts are preserved by encoding them with the following sequence :

[object replacement character] + unicode char number + [underscore]

This is necessary to preserve the distinct entity names in unsupported scripts, and enables decoding with full fidelity at a later stage of the pipeline.
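The encoding is trivially reversible. A hypothetical decoder (decodeotherscripts is an illustrative name, not part of the library) could be sketched as :

```python
import re

def decodeotherscripts(text):
    # Match: object replacement char U+FFFC, decimal code point, '_'
    return re.sub(chr(65532) + r"(\d+)_",
                  lambda m: chr(int(m.group(1))), text)

encoded = "Gu\ufffc257_nhuà (\ufffc23448_\ufffc35805_/\ufffc23448_\ufffc35441_)"
print(decodeotherscripts(encoded))  # -> "Guānhuà (官话/官話)"
```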

2.13 Replace infrequent symbols

def replacesymbols(charset, chariterator):
    for index,char in enumerate(chariterator):
        if char in charset:
            yield char
        else:
            family = blockfamily(charblock(char))
            if family == "Symbols":
                resStr ='$' + charname(char).replace(' ','') + '_'
                for outchar in resStr:
                    yield outchar
            else:
                yield char
''.join(replacesymbols(supportedchars,"😀😈😎🙈🙌"))
'😀$SmilingFaceWithHorns_😎$See-No-EvilMonkey_🙌'

All unsupported symbols are preserved by encoding them with the following sequence :

[dollar] + unicode char name + [underscore]

This enables an NLP pipeline to add English words to its vocabulary if some symbols are used frequently in the context of a sentiment analysis task.

2.14 Ignore remaining chars with no glyph

unicodefamilies[unicodefamilies["CharFamily"]=="Ignore"]
UnicodeBlock CharFamily
61 Combining Diacritical Marks Ignore
62 Private Use Area Ignore
63 Supplementary Private Use Area-A Ignore
64 Supplementary Private Use Area-B Ignore
65 Specials Ignore
66 Tags Ignore

3. Text normalization

3.1 Normalization functions

We need to apply several replacement functions in a row, each replacement function building on the replacements already applied by the previous ones.

We can't simply chain replace statements on immutable strings to do this : we would need to allocate a new string for each replacement at each level, and this would put a high load on the garbage collector.

A better solution is to implement our normalization function as a chain of iterators on chars.

Examples :

def ignorechars(chariterator, charset):
    for char in chariterator:
        if not char in charset:
            yield char
            
def replacechars1to1(chariterator, chardict):
    for char in chariterator:
        if char in chardict:
            yield chardict[char]
        else:
            yield char
            
def replacechars1toN(chariterator, chardict):
    for char in chariterator:
        if char in chardict:
            for outchar in chardict[char]:
                yield outchar
        else:
            yield char

To match several chars in an iterator, we have to build a hierarchical dictionary structure.

For example, if we want to implement the following replacements :

ABC => 1
ABD => 2
AC  => 3
BC  => 4

We build the following dictionary structure :

A : { B : { C : 1,
            D : 2 },
      C : 3 }

B : { C : 4 }

The normalization functions are then chained in a chars replacement pipeline.
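The multi-char matching driven by this hierarchical dictionary can be sketched as a greedy walk down the nested dicts (illustrative names, not the library's own; this simple version buffers the whole input, whereas the real pipeline only needs a small lookahead window) :

```python
LEAF = None  # reserved key holding the replacement at the end of a match

def replacecharsNto1(chariterator, chartrie):
    buffer = list(chariterator)
    i = 0
    while i < len(buffer):
        # Walk down the trie from position i, remembering the longest match
        node, j, lastmatch = chartrie, i, None
        while j < len(buffer) and buffer[j] in node:
            node = node[buffer[j]]
            j += 1
            if LEAF in node:
                lastmatch = (j, node[LEAF])
        if lastmatch:
            i, replacement = lastmatch
            yield replacement
        else:
            yield buffer[i]
            i += 1

# The example structure above: ABC => 1, ABD => 2, AC => 3, BC => 4
trie = {"A": {"B": {"C": {LEAF: "1"}, "D": {LEAF: "2"}}, "C": {LEAF: "3"}},
        "B": {"C": {LEAF: "4"}}}
print("".join(replacecharsNto1(iter("xABDyBC"), trie)))  # -> x2y4
```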

3.2 Normalization class with change tracking

class NormChange[source]

NormChange(layer, index, charsInput, charsOutput, removedInfo=None)

NormChange.__init__[source]

NormChange.__init__(layer, index, charsInput, charsOutput, removedInfo=None)

Initialize self. See help(type(self)) for accurate signature.

class NormResult[source]

NormResult(inputText, transformsDescs)

NormResult.describeChanges[source]

NormResult.describeChanges()

NormResult.mapOutputIndexToInput[source]

NormResult.mapOutputIndexToInput(outputIndex)

class TextNormalizer[source]

TextNormalizer()

TextNormalizer.__call__[source]

TextNormalizer.__call__(inputText)

Call self as a function.

%time norm = TextNormalizer()
norm
CPU times: user 2 s, sys: 0 ns, total: 2 s
Wall time: 2.14 s
1 - Fix encoding errors : windows1252 read as iso8859-1
2 - Fix encoding errors : utf8 read as windows1252
3 - Fix encoding errors :  windows1252 read as utf8
4 - Merge Unicode combining chars
5 - Ignore control chars
6 - Replace latin letter symbols
7 - Replace latin letter ligatures
8 - Replace latin number symbols
9 - Normalize equivalent chars
10 - Replace cyrillic and greek chars looking like latin letters
11 - Replace infrequent chars : latin letters with diacritics
12 - Replace infrequent chars : other scripts
13 - Replace infrequent chars : symbols
14 - Replace infrequent chars : chars to ignore
teststring = chr(127995)+"① l`"+chr(156)+"uv"+chr(127)+"re est¨ "+chr(147)+"belle"+chr(148)+"¸ à  ½ € énième ‰ "+chr(133)+" ⁽🇪ffic🇦ce⁾ !"
teststring
'🏻① l`\x9cuv\x7fre est¨ \x93belle\x94¸ à  ½ € énième ‰ \x85 ⁽🇪ffic🇦ce⁾ !'
result = norm(teststring)
result
(1) l'oeuvre est «belle», Ã  1/2 € énième ‰ … (EfficAce) !
print(result.describeChanges())
Fix encoding errors : windows1252 read as iso8859-1
 < 🏻① l` [œ] uvre est¨  [“] belle [”] ¸ à  ½ € énième ‰  […]  ⁽🇪ffic🇦ce⁾ !
 < 🏻① l` [œ] uvre est¨  [“] belle [”] ¸ à  ½ € énième ‰  […]  ⁽🇪ffic🇦ce⁾ !
Fix encoding errors : utf8 read as windows1252
 < 🏻① l`œuvre est¨ “belle”¸ à   [½]   [€]  énième  [‰]  … ⁽🇪ffic🇦ce⁾ !
 < 🏻① l`œuvre est¨ “belle”¸ à   [½_]   [€__]  énième  [‰__]  … ⁽🇪ffic🇦ce⁾ !
Merge Unicode combining chars
 < 🏻① l`œuvre est¨ “belle”¸ à  ½ €  [é] ni [è] me ‰ … ⁽🇪ffic🇦ce⁾ !
 < 🏻① l`œuvre est¨ “belle”¸ à  ½ €  [é_] ni [è_] me ‰ … ⁽🇪ffic🇦ce⁾ !
Ignore control chars
 <  [🏻] ① l`œuv [] re est [¨]  “belle”¸ à  ½ € énième ‰ … ⁽🇪ffic🇦ce⁾ !
 <  [_] ① l`œuv [_] re est [_]  “belle”¸ à  ½ € énième ‰ … ⁽🇪ffic🇦ce⁾ !
Replace latin letter symbols
 < ① l`œuvre est “belle”¸ à  ½ € énième ‰ … ⁽ [🇪] ffic [🇦] ce⁾ !
 < ① l`œuvre est “belle”¸ à  ½ € énième ‰ … ⁽ [E] ffic [A] ce⁾ !
Replace latin letter ligatures
 < ① l` [œ ] uvre est “belle”¸ à  ½ € énième ‰ … ⁽E [ffi  ] cAce⁾ !
 < ① l` [oe] uvre est “belle”¸ à  ½ € énième ‰ … ⁽E [ffi] cAce⁾ !
Replace latin number symbols
 <  [①  ]  l`oeuvre est “belle”¸ à   [½  ]  € énième ‰ … ⁽EfficAce⁾ !
 <  [(1)]  l`oeuvre est “belle”¸ à   [1/2]  € énième ‰ … ⁽EfficAce⁾ !
Normalize equivalent chars
 < (1) l [`] oeuvre est  [“] belle [”]  [¸]  Ã  1/2 € énième ‰ …  [⁽] EfficAce [⁾]   [!] 
 < (1) l ['] oeuvre est  [«] belle [»]  [,]  Ã  1/2 € énième ‰ …  [(] EfficAce [)]   [!] 

result.output[0:12]
"(1) l'oeuvre"
result.input[result.mapOutputIndexToInput(0):result.mapOutputIndexToInput(12)]
'🏻① l`\x9cuv\x7fre'
result.output[3:10]
" l'oeuv"
result.input[result.mapOutputIndexToInput(3):result.mapOutputIndexToInput(10)]
' l`\x9cuv\x7f'
%timeit -n100 norm(teststring)
344 µs ± 89.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
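The output↔input index mapping demonstrated above can be maintained by recording, for every output char, the index of the input char it came from. A minimal sketch of this idea (the `simple_norm` function and its `REPLACEMENTS` table are hypothetical stand-ins for illustration, not the library's actual implementation):

```python
# Minimal sketch of an output->input character index mapping maintained
# during normalization. `simple_norm` and REPLACEMENTS are hypothetical,
# not the library's implementation.
REPLACEMENTS = {"`": "'", "\x9c": "oe", "½": "1/2"}

def simple_norm(text):
    out_chars, out_to_in = [], []
    for i, ch in enumerate(text):
        for c in REPLACEMENTS.get(ch, ch):
            out_chars.append(c)   # a 1->n replacement maps all n output chars
            out_to_in.append(i)   # back to the same input index i
    return "".join(out_chars), out_to_in

output, mapping = simple_norm("l`\x9cuvre")
# output == "l'oeuvre", mapping == [0, 1, 2, 2, 3, 4, 5, 6]
```

With such a mapping, any span of the normalized output can be traced back to the corresponding span of the raw input, which is what `mapOutputIndexToInput` does above.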

3.3 Normalization pipeline stats

The statistics below count the number of chars normalized per 1 million chars in 4 distinct parts of the french datasets : business websites, forums, news, wikipedia.

The first line of the table below shows that :

  • in 1 million chars extracted from forum pages (raw user input), 41.8 chars on average will be encoding errors (windows1252 read as iso8859-1)
  • in 1 million chars extracted from business websites (curated content), only 0.5 chars will be encoding errors
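These per-million frequencies are raw counts scaled by the size of each dataset split. A sketch with made-up counts and a made-up split size (the real totals are not part of this document), chosen to roughly reproduce the forum column below :

```python
import pandas as pd

# Made-up counts and split size for illustration only; the real totals
# behind the published frequencies are not shown in this document.
total_forum_chars = 50_000_000
counts = pd.DataFrame({
    "Transform": ["Fix encoding errors : windows1252 read as iso8859-1",
                  "Ignore control chars"],
    "CountForum": [2_091, 17_453],
})
counts["FreqForum"] = counts["CountForum"] / total_forum_chars * 1_000_000
# FreqForum -> 41.82 and 349.06 chars per million
```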
normstats = pd.read_csv(chardatadir / "stats" / "normalization.total.stats.csv")
normstats[["Transform","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]
Transform FreqBusiness FreqForum FreqPresse FreqWikipedia
0 Fix encoding errors : windows1252 read as iso8... 0.510560 41.818746 0.813485 0.006025
1 Fix encoding errors : utf8 read as windows1252 0.126815 0.058024 0.072456 0.001037
2 Fix encoding errors : windows1252 read as utf8 0.000000 0.000000 0.019315 0.000000
3 Merge Unicode combining chars 2.811983 0.432638 0.568146 0.000140
4 Ignore control chars 6.450737 349.052995 6.454367 4.118586
5 Replace latin letter symbols 0.019360 0.039701 0.297372 0.150550
6 Replace latin letter ligatures 6.603815 6.541480 10.097290 17.204422
7 Replace latin number symbols 2.528338 4.162482 2.560933 0.429792
8 Normalize equivalent chars 814.327384 1248.410777 684.333730 242.391239
9 Replace cyrillic and greek chars looking like ... 0.062432 0.760424 0.491996 7.479907
10 Replace infrequent chars : latin letters with ... 0.063782 0.078384 0.099106 9.124948
11 Replace infrequent chars : other scripts 0.085694 0.468776 1.192548 16.612142
12 Replace infrequent chars : symbols 0.139271 0.159821 0.399064 0.073566
13 Replace infrequent chars : chars to ignore 0.018910 0.044282 0.021320 0.016423
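As a usage example, the stats table can be sorted to see which layer does most of the work. The sketch below rebuilds a few `FreqForum` values copied from the rows above, assuming the same column names :

```python
import pandas as pd

# A few FreqForum values copied from the stats table (chars per million).
normstats = pd.DataFrame({
    "Transform": ["Fix encoding errors : windows1252 read as iso8859-1",
                  "Ignore control chars",
                  "Replace latin letter ligatures",
                  "Normalize equivalent chars"],
    "FreqForum": [41.818746, 349.052995, 6.541480, 1248.410777],
})
ranked = normstats.sort_values("FreqForum", ascending=False)
# "Normalize equivalent chars" dominates: more than 0.1% of all forum chars
```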

Most frequent chars produced by the "Normalize equivalent chars" layer :

replacestats = pd.read_csv(chardatadir / "stats" / "normalization.layer8.stats.csv")
replacestats[["Char","CharName","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]].head(20)
Char CharName FreqBusiness FreqForum FreqPresse FreqWikipedia
0 ' Apostrophe 486.034805 160.264219 376.104982 134.658673
1 Space 310.411117 1082.845985 288.635983 87.877649
2 - Hyphen-Minus 14.431203 2.903761 12.828203 16.223154
3 « Left-Pointing Double Angle Quotation Mark 1.429478 0.680513 3.002426 0.559632
4 » Right-Pointing Double Angle Quotation Mark 1.323524 0.533926 2.461880 0.544134
5 | Vertical Line 0.003452 0.001018 0.005488 0.875894
6 Bullet 0.204104 0.243295 0.189664 0.543237
7 . Full Stop 0.059280 0.078893 0.856230 0.069278
8 " Quotation Mark 0.085093 0.023413 0.011504 0.292385
9 : Colon 0.000150 0.000509 0.000053 0.169047
10 ° Degree Sign 0.148726 0.181199 0.014618 0.078302
11 é Latin Small Letter E With Acute 0.001651 0.006108 0.003166 0.101114
12 Leftwards Arrow 0.000000 0.000000 0.000158 0.047194
13 = Equals Sign 0.004802 0.029012 0.000686 0.041589
14 Rightwards Arrow 0.026113 0.002545 0.034302 0.015862
15 d Latin Small Letter D 0.000000 0.024940 0.000000 0.036405
16 < Less-Than Sign 0.004202 0.142007 0.001267 0.024073
17 , Comma 0.006453 0.101288 0.004538 0.022756
18 Downwards Arrow 0.007504 0.001527 0.011188 0.021888
19 Black Star 0.001351 0.013743 0.022006 0.011686

Frequency of replaced characters from other scripts :

scriptsstats = pd.read_csv(chardatadir / "stats" / "normalization.layer11.stats.csv")
scriptsstats[["CharFamily","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]
CharFamily FreqBusiness FreqForum FreqPresse FreqWikipedia
0 ChineseJapaneseKorean 0.012456 0.177127 0.194677 4.059173
1 Arabic 0.012306 0.026467 0.460280 3.140120
2 Cyrillic 0.024462 0.166438 0.237159 3.118961
3 Greek 0.016058 0.022904 0.031347 2.423996
4 Hebrew 0.000150 0.000000 0.184914 1.132155
5 Other 0.000750 0.029012 0.004063 0.800871
6 Indian 0.000750 0.037665 0.033458 0.737955
7 Phonetic 0.002401 0.001527 0.001636 0.298579
8 Latin 0.013507 0.006108 0.007283 0.269377
9 Math 0.001801 0.000509 0.000528 0.240707
10 LaoThai 0.000000 0.001018 0.033194 0.217867
11 Armenian 0.001051 0.000000 0.004011 0.172382

Detailed stats for the 14 layers of the normalization pipeline :

layersstats = pd.read_csv(chardatadir / "stats" / "normalization.stats.csv")
layer=8
layersstats[layersstats["Layer"]==layer][["Layer","Input","CharName","Output","CountBusiness","CountForum","CountPresse","CountWikipedia"]].head(15)
Layer Input CharName Output CountBusiness CountForum CountPresse CountWikipedia
639 8 Right Single Quotation Mark ' 3232470 311603 6944753 4755813
640 8 No-Break Space 2057917 2127216 4892006 3122348
641 8 Thin Space 8088 116 549846 5363
642 8 En Dash - 80049 4540 172657 189791
643 8 Em Dash - 13928 329 63048 157402
644 8 · Middle Dot - 958 565 4021 202542
645 8 ` Grave Accent ' 1202 999 161302 5167
646 8 Left Double Quotation Mark « 9518 1329 56880 19728
647 8 Right Double Quotation Mark » 8808 1040 46632 19173
648 8 Left Single Quotation Mark ' 3557 952 12041 12981
649 8 Box Drawings Light Vertical | 0 0 0 25990
650 8 Hair Space 69 1 19774 246
651 8 Black Right-Pointing Pointer 570 356 913 17134
652 8 Minus Sign - 336 21 1705 15828
653 8 ´ Acute Accent ' 986 1308 8649 6423