Set of functions used to preprocess French text characters.

External dependencies :

pip install pandas

Configure tabular data display in this notebook :

Character set normalization for French

The config object from frenchtext.core defines the directory where the character normalization tables are located :

!ls {chardatadir}
charset-fr.csv		  latinletters.csv     unicode_categories.csv
charsetstats_norm.csv	  latinnumbers.csv     unicode_families.csv
charsetstats_raw.csv	  latinsymbols.csv     unsupported.stats.csv
combiningdiacritics.csv   normalizedchars.csv  utf8-windows1252-errors.csv
controlchars.csv	  stats		       windows1252-iso8859-errors.csv
cyrillic-greek-chars.csv  unicode_blocks.csv   windows1252-utf8-errors.csv

0. Unicode character properties

charname[source]

charname(char)

charcategory[source]

charcategory(char)

charsubcategory[source]

charsubcategory(char)

charblock[source]

charblock(char)

blockfamily[source]

blockfamily(block)

charname("🙂")
'Slightly Smiling Face'
charcategory("🙂")
'Symbol'
charsubcategory("🙂")
'Other'
charblock("🙂")
'Emoticons'
blockfamily('Emoticons')
'Symbols'
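These helpers are thin wrappers around the Unicode database. As a hedged sketch (the actual implementations may differ, and charblock/blockfamily need the CSV lookup tables from chardatadir, so only the first two are reproduced), they can be approximated with the standard unicodedata module :

```python
import unicodedata

def charname(char):
    # Name of the char in the Unicode database, title-cased like in the CSV
    # tables; unnamed chars fall back to "Char <code>", e.g. "Char 10"
    return unicodedata.name(char, f"Char {ord(char)}").title()

def charcategory(char):
    # Map the two-letter general category (e.g. 'So') to its major class
    categories = {"L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
                  "S": "Symbol", "Z": "Separator", "C": "Other"}
    return categories[unicodedata.category(char)[0]]

print(charname("🙂"))      # Slightly Smiling Face
print(charcategory("🙂"))  # Symbol
```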

1. Explore French dataset characters

French datasets often contain several thousand distinct Unicode characters.

We need to reduce the number of distinct characters fed to our natural language processing applications, for three reasons :

  • chars the user considers visually equivalent will often produce different application behaviors : this is a huge problem for the user experience
  • with so many chars, the designer of the NLP application will not be able to reason about all possible combinations : this could harm the explainability of the system
  • this huge number of distinct characters brings a significant amount of complexity the NLP models will have to deal with

1.1 Character frequency in French datasets

dfcharstats = pd.read_csv(chardatadir / "charsetstats_raw.csv", sep=";")
dfcharstats
Unnamed: 0 Code Char Name Category Subcategory Block CountBusiness CountWikipedia Count
0 0 101 e Latin Small Letter E Letter Lowercase Basic Latin 3.503992e+09 4.595437e+09 8.099428e+09
1 1 115 s Latin Small Letter S Letter Lowercase Basic Latin 1.960554e+09 2.534105e+09 4.494658e+09
2 2 97 a Latin Small Letter A Letter Lowercase Basic Latin 1.865590e+09 2.447239e+09 4.312829e+09
3 3 110 n Latin Small Letter N Letter Lowercase Basic Latin 1.819350e+09 2.388609e+09 4.207959e+09
4 5 105 i Latin Small Letter I Letter Lowercase Basic Latin 1.766427e+09 2.331461e+09 4.097888e+09
... ... ... ... ... ... ... ... ... ... ...
13497 13495 37294 Cjk Unified Ideograph-91Ae Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13498 13496 35824 Cjk Unified Ideograph-8Bf0 Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13499 13497 26634 Cjk Unified Ideograph-680A Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13500 13498 31787 Cjk Unified Ideograph-7C2B Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00
13501 13501 28378 Cjk Unified Ideograph-6Eda Letter Other CJK Unified Ideographs 0.000000e+00 1.000000e+00 1.000000e+00

13502 rows × 10 columns

1.2 Character stats in the Wikipedia dataset

  • 35.6 billion chars
charsCountWikipedia = dfcharstats["CountWikipedia"].sum()
charsCountWikipedia
35682395281.0
  • 13 502 distinct Unicode chars
distinctCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>0])
distinctCharsWikipedia
13502
  • Only 1316 chars more frequent than 1 in 100 million
frequentCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>356])
frequentCharsWikipedia
1316
  • Frequent chars represent 9.7 % of all distinct Unicode chars
pctFreqCharsWikipedia = frequentCharsWikipedia/distinctCharsWikipedia*100
pctFreqCharsWikipedia
9.74670419197156
  • 99.9987 % of Wikipedia chars would be preserved if we only kept the frequent chars
pctPreservedCharsWikipedia = (1-dfcharstats[dfcharstats["CountWikipedia"]<=356]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedCharsWikipedia
99.99871204274157

1.3 Character stats in the Business dataset

  • 27.5 billion chars
charsCountBusiness = dfcharstats["CountBusiness"].sum()
charsCountBusiness
27577304956.0
  • 3 763 distinct Unicode chars
distinctCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>0])
distinctCharsBusiness
3763
  • Only 531 chars more frequent than 1 in 100 million
frequentCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>275])
frequentCharsBusiness
531
  • Frequent chars represent 14.1 % of all distinct Unicode chars
pctFreqCharsBusiness = frequentCharsBusiness/distinctCharsBusiness*100
pctFreqCharsBusiness
14.11108158384268
  • 99.9996 % of Business chars would be preserved if we only kept the frequent chars
pctPreservedCharsBusiness = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountBusiness"].sum()/dfcharstats["CountBusiness"].sum())*100
pctPreservedCharsBusiness
99.9996564385093
  • 99.985 % of Wikipedia chars would be preserved if we only kept the frequent Business chars
pctPreservedBizCharsInWikipedia = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedBizCharsInWikipedia
99.9848317525845

1.4 Character stats after Unicode normalization

After applying the normalization process defined below in this notebook, here are the remaining chars :

dfcharsnorm = pd.read_csv(chardatadir / "charset-fr.csv", sep=";")
dfcharsnorm
FrCode Category SubCategory Code Char CharName CountBusiness
0 0 separator control 0 NaN Reserved - End of string 0
1 1 separator space 32 Space 88494564
2 2 separator space 10 \n Char 10 9588147
3 3 separator space 9 \t Char 9 1522053
4 4 separator punctuation 44 , Comma 286106887
... ... ... ... ... ... ... ...
251 251 emoticon object 9792 Female Sign 515
252 252 emoticon object 127881 🎉 Party Popper 356
253 253 emoticon object 9997 Writing Hand 157
254 254 emoticon object 9993 Envelope 55
255 255 emoticon object 10013 Latin Cross 22

256 rows × 7 columns

Stats for the character families after normalization

The table below shows the number of chars in each category (after normalization) per 100 million characters :

dfblocks = dfcharsnorm.groupby(by=["Category","SubCategory"]).agg({"Char":["count","sum"],"CountBusiness":"sum"})
dfblocks["CountBusiness"] = (dfblocks["CountBusiness"] / charsCountBusiness * 100000000).astype(int)
dfblocks
Char CountBusiness
count sum sum
Category SubCategory
emoticon hand 12 💪👉👍👏🙏🙌👇👊👎👌✌✊ 42
head 28 🙂😉😀😂😁😊🙁😅😍😃😡🤣😄🤔😎😭👹😱😜😋🤩🙄😆😛🤪😢😇🤦 233
object 16 ⚠🔴🔥🏆⚽💡🚨💥⚡♫♂♀🎉✍✉✝ 60
letter digit 10 0123549876 3271115
encoding 3 � 249
greek 2 λπ 2
latin-fr 84 abcdefghijklmnopqrstuvwxyzàâäçèéêëîïôöùûüÿABCD... 91437146
latin-other 25 áãåćčėğıíìńñóòõøšşßúÁÅŠÚŽ 712
other 5 _&@\# 40814
separator control 0 0 0
punctuation 23 ,'.-:/")(?!»«|…;[]}{•¿¡ 4684722
space 3 \n\t 361183
symbol currency 6 €$¤£¥¢ 21099
math 14 =>+<^~×≤÷≥±≠∞√ 50056
shape 15 *✓⇒♥¦→★¯↓❌❐†↑←↔ 7954
sign 3 ©®™ 1754
unit 6 %°§µØ‰ 102213

2. Character normalization pipeline

After a detailed study of all the frequent chars, the goal is to design a normalization pipeline which retains as much information as possible while greatly reducing the number of distinct chars.

We saw before that it is possible to preserve 99.9996% of the original chars while keeping only around 500 distinct chars. By being clever and replacing equivalent chars, we can divide this number by 2 and still retain the same amount of information.

It may then be useful to limit the number of distinct characters after normalization to 255 :

  • if needed, French text chars can then be encoded with a single byte
  • the list of supported chars can be memorized by NLP application developers and users

The normalization pipeline applies the following 14 steps, which are explained and illustrated in the sections below.

  • Fix encoding errors
    • fix windows1252 text read as iso8859-1
    • fix utf8 text read as windows1252
    • fix windows1252 text read as utf8
    • merge Unicode combining chars
    • ignore control chars
  • Remove display attributes
    • replace latin letter symbols
    • replace latin letter ligatures
    • replace latin number symbols
  • Normalize visually equivalent chars
    • replace equivalent chars
    • replace cyrillic and greek chars looking like latin letters
  • Encode infrequent chars while losing a little bit of information
    • replace infrequent latin letters with diacritics
    • replace infrequent chars from other scripts
    • replace infrequent symbols
    • ignore remaining chars with no glyph

2.1 Frequent encoding errors : windows1252 read as iso8859-1

dfencodingwin1252 = pd.read_csv(chardatadir / "windows1252-iso8859-errors.csv", sep=";")
dfencodingwin1252.head(10)
Code Char DecodedCode DecodedChar
0 146 ’ 8217
1 128 € 8364
2 133 8230
3 150 – 8211
4 156 œ 339 œ
5 149 • 8226
6 147 “ 8220
7 148 ” 8221
8 151 — 8212
9 145 ‘ 8216
print(f"{len(dfencodingwin1252)} frequent encoding errors seen in french datasets : a character encoded as windows1252 was incorrectly decoded as iso8859-1")
10 frequent encoding errors seen in french datasets : a character encoded as windows1252 was incorrectly decoded as iso8859-1

Columns :

  • Code/Char : incorrectly decoded control char seen in French text
  • DecodedCode/DecodedChar : properly decoded char which should replace the original control char
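The table above can in fact be derived from the codecs themselves : a windows1252 byte decoded as iso8859-1 keeps its raw byte value as code point, so re-encoding as latin-1 and decoding as windows1252 recovers the intended char. A minimal sketch (fix_win1252_as_iso8859 is an illustrative name, not part of the library) :

```python
def fix_win1252_as_iso8859(text):
    out = []
    for char in text:
        if 128 <= ord(char) <= 159:  # C1 control range: never valid in clean text
            # iso8859-1 maps each byte to the code point of same value,
            # so latin-1 re-encoding recovers the original windows1252 byte
            out.append(char.encode("latin-1").decode("windows-1252", errors="replace"))
        else:
            out.append(char)
    return "".join(out)

print(fix_win1252_as_iso8859("l\x92\x9cuvre"))  # -> "l’œuvre"
```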

2.2 Frequent encoding errors : utf8 read as windows1252

dfencodingutf8 = pd.read_csv(chardatadir / "utf8-windows1252-errors.csv", sep=";")
dfencodingutf8.head(10)
ErrorSubstring DecodedCode DecodedChar
0 € 8364
1 ‚ 8218
2 Æ’ 402 ƒ
3 „ 8222
4 … 8230
5 †8224
6 ‡ 8225
7 ˆ 710 ˆ
8 ‰ 8240
9 Å 352 Š
print(f"{len(dfencodingutf8)} very unlikely substrings produced when text encoded with UTF-8 is decoded by mistake as iso8859-1 or windows1252")
117 very unlikely substrings produced when text encoded with UTF-8 is decoded by mistake as iso8859-1 or windows1252

Columns :

  • ErrorSubstring : unlikely substring of length 2 or 3 characters produced when UTF-8 text is decoded by mistake as windows1252
  • DecodedCode/DecodedChar : properly decoded char which should be used to replace the unlikely substring
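For this direction of mojibake, re-encoding the garbled substring as windows1252 recovers the original UTF-8 bytes, which can then be decoded correctly. A minimal sketch (fix_utf8_as_win1252 is an illustrative name, not part of the library; the CSV table works per substring, whereas this sketch only handles a fully garbled string) :

```python
def fix_utf8_as_win1252(text):
    try:
        # Garbled chars like 'â€™' re-encode to the original UTF-8 bytes
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake: leave the text untouched

print(fix_utf8_as_win1252("lâ€™Å“uvre"))  # -> "l’œuvre"
```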

2.3 Frequent encoding errors : windows1252 read as utf8

dfencodingwin1252utf8 = pd.read_csv(chardatadir / "windows1252-utf8-errors.csv", sep=";")
dfencodingwin1252utf8.head()
Code Char DecodedCodes DecodedChars
0 38971 [233, 160, 187] é »
print(f"{len(dfencodingwin1252utf8)} char very unlikely in french text produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8")
1 char very unlikely in french text produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8

Columns :

  • Char : unlikely char produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8
  • DecodedCodes/DecodedChars : properly decoded substring which should be used to replace the unlikely char
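This is the reverse accident : windows1252 bytes occasionally form a valid UTF-8 multi-byte sequence and decode to a single CJK char. Re-encoding that char as UTF-8 and decoding the bytes as windows1252 recovers the original substring, as a sketch (fix_win1252_as_utf8 is an illustrative name) :

```python
def fix_win1252_as_utf8(char):
    # chr(38971) is the single such char observed in the datasets:
    # its UTF-8 bytes 0xE9 0xA0 0xBB read as windows1252 give 'é', nbsp, '»'
    return char.encode("utf-8").decode("windows-1252", errors="replace")

print(repr(fix_win1252_as_utf8(chr(38971))))  # 'é\xa0»'
```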

2.4 Unicode combining chars

dfcombiningchars = pd.read_csv(chardatadir / "combiningdiacritics.csv", sep=";")
dfcombiningchars.head()
BaseChar Code Char Diacritic CombinedChar
0 A 769 ́ Acute Á
1 E 769 ́ Acute É
2 I 769 ́ Acute Í
3 O 769 ́ Acute Ó
4 U 769 ́ Acute Ú
print(f"{len(dfcombiningchars['Char'].unique())} combining chars {list(dfcombiningchars['Diacritic'].unique())} should be recombined with {len(dfcombiningchars)} base latin characters to produce standard latin characters with diacritics")
12 combining chars ['Acute', 'Grave', 'Circumflex', 'Cedilla', 'Tilde', 'Diaeresis', 'Long Stroke Overlay', 'Macron', 'Caron', 'Dot Below', 'Dot Above', 'Ring Above'] should be recombined with 274 base latin characters to produce standard latin characters with diacritics

Columns :

  • BaseChar : latin char encountered first in the string, which will be modified by the combining char immediately following it
  • Code/Char : combining char immediately following BaseChar, which should be combined with it to produce CombinedChar
  • Diacritic : type of accent / diacritic applied by the combining char
  • CombinedChar : latin char with diacritic produced by the combination of BaseChar and the combining Char following it
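Merging a base char with the combining diacritic that follows it corresponds to Unicode NFC normalization, which the standard library already implements :

```python
import unicodedata

decomposed = "e" + chr(769)          # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(decomposed, len(decomposed))   # é 2
print(composed, len(composed))       # é 1
```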

2.5 Control chars

dfcontrolchars = pd.read_csv(chardatadir / "controlchars.csv", sep=";")
dfcontrolchars.loc[0,"Char"] = chr(0) # chr(0) can't be saved in CSV file
dfcontrolchars
Code Char CharName
0 0 Char 0
1 1  Char 1
2 2  Char 2
3 3  Char 3
4 4  Char 4
... ... ... ...
120 65532 Object Replacement Character
121 127995 🏻 Emoji Modifier Fitzpatrick Type-1-2
122 127996 🏼 Emoji Modifier Fitzpatrick Type-3
123 127997 🏽 Emoji Modifier Fitzpatrick Type-4
124 127998 🏾 Emoji Modifier Fitzpatrick Type-5

125 rows × 3 columns

print(f"{len(dfcontrolchars)} control chars seen in french datasets, which can't be displayed and should be ignored")
125 control chars seen in french datasets, which can't be displayed and should be ignored

Columns :

  • Code : Unicode code point for the character
  • Char : control character
  • CharName : name of the character in the Python Unicode database
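A rough approximation of this step can be built on the Unicode general categories, assuming the standard unicodedata module : most rows of controlchars.csv fall in 'Cc' (control), 'Cf' (format) or 'Co' (private use). The emoji skin tone modifiers and the object replacement character listed above are deliberate additions of the CSV table, not covered by this category rule.

```python
import unicodedata

def ignore_control_chars(text):
    return "".join(c for c in text
                   if c in "\n\t"  # keep the two whitespace control chars
                   or unicodedata.category(c) not in ("Cc", "Cf", "Co"))

print(ignore_control_chars("bel\x07le\u200b !"))  # -> "belle !"
```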

2.6 Latin letter symbols

dflatinsymbols = pd.read_csv(chardatadir / "latinsymbols.csv", sep=";")
dflatinsymbols.head(10)
Code Char CharName NormString Layout
0 8253 Interrobang ?! NaN
1 8265 Exclamation Question Mark !? NaN
2 8448 Account Of a/c NaN
3 8449 Addressed To The Subject a/s NaN
4 8450 Double-Struck Capital C C Double-Struck
5 8451 Degree Celsius °C Unit
6 8453 Care Of c/o NaN
7 8454 Cada Una c/u NaN
8 8457 Degree Fahrenheit °F Unit
9 8458 Script Small G g Script
dflatinsymbols[230:240]
Code Char CharName NormString Layout
230 119908 𝑤 Mathematical Italic Small W w Mathematical Italic
231 119909 𝑥 Mathematical Italic Small X x Mathematical Italic
232 119910 𝑦 Mathematical Italic Small Y y Mathematical Italic
233 119911 𝑧 Mathematical Italic Small Z z Mathematical Italic
234 119912 𝑨 Mathematical Bold Italic Capital A A Mathematical Bold Italic
235 119913 𝑩 Mathematical Bold Italic Capital B B Mathematical Bold Italic
236 119914 𝑪 Mathematical Bold Italic Capital C C Mathematical Bold Italic
237 119915 𝑫 Mathematical Bold Italic Capital D D Mathematical Bold Italic
238 119916 𝑬 Mathematical Bold Italic Capital E E Mathematical Bold Italic
239 119917 𝑭 Mathematical Bold Italic Capital F F Mathematical Bold Italic
print(f"{len(dflatinsymbols)} Unicode symbols which represent latin letters with a specific layout like {list(dflatinsymbols['Layout'].unique())}")
917 Unicode symbols which represent latin letters with a specific layout like [nan, 'Double-Struck', 'Unit', 'Script', 'Black-Letter', 'Turned', 'Rotated', 'Turned Sans-Serif', 'Reversed Sans-Serif', 'Double-Struck Italic', 'Parenthesized', 'Circled', 'Mathematical Bold', 'Mathematical Italic', 'Mathematical Bold Italic', 'Mathematical Script', 'Mathematical Bold Script', 'Mathematical Fraktur', 'Mathematical Double-Struck', 'Mathematical Bold Fraktur', 'Mathematical Sans-Serif', 'Mathematical Sans-Serif Bold', 'Mathematical Sans-Serif Italic', 'Mathematical Sans-Serif Bold Italic', 'Mathematical Monospace', 'Tortoise Shell Bracketed', 'Circled Italic', 'Squared', 'Negative Circled', 'Negative Squared', 'Crossed Negative Squared', 'Regional Indicator']

Columns :

  • Code/Char/CharName : Unicode symbol representing a latin letter with a specific layout
  • NormString : normalized string using only very frequent chars
  • Layout : info about the specific layout applied to the latin char
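Many of these layout variants carry a Unicode compatibility decomposition, so NFKC normalization already maps them back to plain latin letters; the CSV table extends this to symbols NFKC leaves untouched :

```python
import unicodedata

print(unicodedata.normalize("NFKC", "𝑥"))   # Mathematical Italic Small X -> x
print(unicodedata.normalize("NFKC", "ℂ"))   # Double-Struck Capital C     -> C
print(unicodedata.normalize("NFKC", "⒜"))  # Parenthesized Small A       -> (a)
```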

2.7 Latin letters ligatures / Latin letters diacritics

dflatinletters = pd.read_csv(chardatadir / "latinletters.csv", sep=";")
dflatinletters[89:99]
Code Char LetterName IsUpper UpperChar IsLower LowerChar IsDiacritic BaseChar Diacritics IsLigature MultiChars CharName Block Category SubCategory
89 230 æ Ae False Æ True æ False NaN NaN True ae Latin Small Letter Ae Latin-1 Supplement Letter Lowercase
90 231 ç C False Ç True ç True c Cedilla False NaN Latin Small Letter C With Cedilla Latin-1 Supplement Letter Lowercase
91 232 è E False È True è True e Grave False NaN Latin Small Letter E With Grave Latin-1 Supplement Letter Lowercase
92 233 é E False É True é True e Acute False NaN Latin Small Letter E With Acute Latin-1 Supplement Letter Lowercase
93 234 ê E False Ê True ê True e Circumflex False NaN Latin Small Letter E With Circumflex Latin-1 Supplement Letter Lowercase
94 235 ë E False Ë True ë True e Diaeresis False NaN Latin Small Letter E With Diaeresis Latin-1 Supplement Letter Lowercase
95 236 ì I False Ì True ì True i Grave False NaN Latin Small Letter I With Grave Latin-1 Supplement Letter Lowercase
96 237 í I False Í True í True i Acute False NaN Latin Small Letter I With Acute Latin-1 Supplement Letter Lowercase
97 238 î I False Î True î True i Circumflex False NaN Latin Small Letter I With Circumflex Latin-1 Supplement Letter Lowercase
98 239 ï I False Ï True ï True i Diaeresis False NaN Latin Small Letter I With Diaeresis Latin-1 Supplement Letter Lowercase
print(f"{len(dflatinletters)} chars representing latin letters, {len(dflatinletters[dflatinletters['IsUpper']])} upper case and {len(dflatinletters[dflatinletters['IsLower']])} lower case, {len(dflatinletters[dflatinletters['IsDiacritic']])} with diacritics like {list(dflatinletters[dflatinletters['IsDiacritic']]['Diacritics'].unique())[0:20]}, {len(dflatinletters[dflatinletters['IsLigature']])} representing multiple letters in ligature")
1230 chars representing latin letters, 459 upper case and 704 lower case, 1031 with diacritics like ['Grave', 'Acute', 'Circumflex', 'Tilde', 'Diaeresis', 'Ring Above', 'Cedilla', 'Stroke', 'Macron', 'Breve', 'Ogonek', 'Dot Above', 'Caron', 'Dotless', 'Middle Dot', 'Preceded By Apostrophe', 'Double Acute', 'Long', 'Hook', 'Topbar'], 88 representing multiple letters in ligature

Columns :

  • Code/Char/CharName : Unicode character representing one or more latin letters
  • LetterName : name of the latin letter (without case and diacritics qualifiers)
  • IsUpper/UpperChar and IsLower/LowerChar : upper case or lower case equivalent chars
  • IsDiacritic => BaseChar : equivalent char without any diacritic (accents ...), Diacritics : description of all diacritics applied to the char
  • IsLigature => MultiChars : if the char represents multiple latin letters in a single ligature, string representing the equivalent list of letters
  • Block/Category/SubCategory : Unicode classification for each char
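The BaseChar column can largely be rebuilt from the Unicode database : decompose with NFD, then drop the combining marks (category 'Mn'). Ligatures like 'œ' have no canonical decomposition, so the MultiChars column of the table is still needed for them. A sketch, assuming the standard unicodedata module :

```python
import unicodedata

def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks all belong to the 'Mn' (Mark, nonspacing) category
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")

print(strip_diacritics("çàêï"))  # -> "caei"
```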

2.8 Latin numbers and number symbols

dflatinnumbers = pd.read_csv(chardatadir / "latinnumbers.csv", sep=";")
dflatinnumbers[30:40]
Code Char CharName NormString Layout
30 8327 Subscript Seven (7) Subscript
31 8328 Subscript Eight (8) Subscript
32 8329 Subscript Nine (9) Subscript
33 8528 Vulgar Fraction One Seventh 1/7 Vulgar Fraction
34 8529 Vulgar Fraction One Ninth 1/9 Vulgar Fraction
35 8530 Vulgar Fraction One Tenth 1/10 Vulgar Fraction
36 8531 Vulgar Fraction One Third 1/3 Vulgar Fraction
37 8532 Vulgar Fraction Two Thirds 2/3 Vulgar Fraction
38 8533 Vulgar Fraction One Fifth 1/5 Vulgar Fraction
39 8534 Vulgar Fraction Two Fifths 2/5 Vulgar Fraction
dflatinnumbers[200:210]
Code Char CharName NormString Layout
200 12881 Circled Number Twenty One (21) Circled
201 12882 Circled Number Twenty Two (22) Circled
202 12883 Circled Number Twenty Three (23) Circled
203 12884 Circled Number Twenty Four (24) Circled
204 12885 Circled Number Twenty Five (25) Circled
205 12886 Circled Number Twenty Six (26) Circled
206 12887 Circled Number Twenty Seven (27) Circled
207 12888 Circled Number Twenty Eight (28) Circled
208 12889 Circled Number Twenty Nine (29) Circled
209 12890 Circled Number Thirty (30) Circled
print(f"{len(dflatinnumbers)} chars representing latin digits, some with specific layouts like {list(dflatinnumbers['Layout'].unique())[1:]}")
302 chars representing latin digits, some with specific layouts like ['Superscript', 'Vulgar Fraction', 'Subscript', 'Roman Numeral', 'Small Roman Numeral', 'Circled', 'Parenthesized', ' Full Stop', 'Negative Circled', 'Double Circled', 'Dingbat Negative Circled', 'Dingbat Circled Sans-Serif', 'Dingbat Negative Circled Sans-Serif ', 'Circled On Black Square', 'Fullwidth', 'Mathematical Bold', 'Mathematical Double-Struck', 'Mathematical Sans-Serif', 'Mathematical Sans-Serif Bold', 'Mathematical Monospace', 'Full Stop', 'Comma']

Columns :

  • Code/Char/CharName : Unicode char representing one or more latin digits
  • NormString : normalized string representing the equivalent number, plus punctuation if needed
  • Layout : info about the specific layout applied to the latin digits
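Here too NFKC normalization recovers a plain form for many variants, but it does not match the NormString column exactly : vulgar fractions decompose with the fraction slash U+2044 rather than '/', and circled numbers lose their parentheses, so the table remains authoritative for the pipeline output.

```python
import unicodedata

print(unicodedata.normalize("NFKC", "²"))   # Superscript Two           -> 2
print(unicodedata.normalize("NFKC", "①"))  # Circled Digit One         -> 1
print(unicodedata.normalize("NFKC", "½"))   # Vulgar Fraction One Half  -> 1⁄2 (U+2044, not '/')
```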

2.9 Variations on frequent chars to normalize

dfnormchars = pd.read_csv(chardatadir / "normalizedchars.csv", sep=";")
dfnormchars.head()
Code Char CharName NormCode NormChar NormCharName
0 11 Char 11 10 \n Char 10
1 13 \r Char 13 10 \n Char 10
2 182 Pilcrow Sign 10 \n Char 10
3 8232 Line Separator 10 \n Char 10
4 160 No-Break Space 32 Space
dfnormchars[49:59]
Code Char CharName NormCode NormChar NormCharName
49 8209 Non-Breaking Hyphen 45 - Hyphen-Minus
50 8210 Figure Dash 45 - Hyphen-Minus
51 8211 En Dash 45 - Hyphen-Minus
52 8212 Em Dash 45 - Hyphen-Minus
53 8213 Horizontal Bar 45 - Hyphen-Minus
54 8259 Hyphen Bullet 45 - Hyphen-Minus
55 8288 Word Joiner 45 - Hyphen-Minus
56 8315 Superscript Minus 45 - Hyphen-Minus
57 8331 Subscript Minus 45 - Hyphen-Minus
58 8722 Minus Sign 45 - Hyphen-Minus
dfnormchars[25:35]
Code Char CharName NormCode NormChar NormCharName
25 697 ʹ Modifier Letter Prime 39 ' Apostrophe
26 699 ʻ Modifier Letter Turned Comma 39 ' Apostrophe
27 700 ʼ Modifier Letter Apostrophe 39 ' Apostrophe
28 702 ʾ Modifier Letter Right Half Ring 39 ' Apostrophe
29 703 ʿ Modifier Letter Left Half Ring 39 ' Apostrophe
30 712 ˈ Modifier Letter Vertical Line 39 ' Apostrophe
31 714 ˊ Modifier Letter Acute Accent 39 ' Apostrophe
32 715 ˋ Modifier Letter Grave Accent 39 ' Apostrophe
33 729 ˙ Dot Above 39 ' Apostrophe
34 8216 Left Single Quotation Mark 39 ' Apostrophe
print(f"{len(dfnormchars)} alternative chars which are sometimes used as equivalent visual representations for {len(dfnormchars['NormChar'].unique())} other very frequent chars")
171 alternative chars which are sometimes used as equivalent visual representations for 53 other very frequent chars

Columns :

  • Code/Char/CharName : alternative Unicode char often used as a visual equivalent of a more frequent char
  • NormCode/NormChar/NormCharName : more frequent char which should be used to normalize text
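A 1-to-1 replacement table like this maps naturally to str.translate, sketched here with a few rows of normalizedchars.csv hard-coded for illustration :

```python
# Build a translation table from a handful of (Char -> NormChar) rows
normtable = str.maketrans({
    "\u2019": "'",   # Right Single Quotation Mark -> Apostrophe
    "\u2013": "-",   # En Dash -> Hyphen-Minus
    "\u00a0": " ",   # No-Break Space -> Space
})
print("l’exemple – ici".translate(normtable))  # -> "l'exemple - ici"
```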

2.10 Cyrillic and greek chars looking like latin letters

dfcgnormchars = pd.read_csv(chardatadir / "cyrillic-greek-chars.csv", sep=";")
dfcgnormchars[5:15]
Code Char CharName NormCode NormChar NormCharName
5 949 ε Greek Small Letter Epsilon 101 e Latin Small Letter E
6 1077 е Cyrillic Small Letter Ie 101 e Latin Small Letter E
7 1108 є Cyrillic Small Letter Ukrainian Ie 101 e Latin Small Letter E
8 1085 н Cyrillic Small Letter En 104 h Latin Small Letter H
9 953 ι Greek Small Letter Iota 105 i Latin Small Letter I
10 1082 к Cyrillic Small Letter Ka 107 k Latin Small Letter K
11 1084 м Cyrillic Small Letter Em 109 m Latin Small Letter M
12 951 η Greek Small Letter Eta 110 n Latin Small Letter N
13 959 ο Greek Small Letter Omicron 111 o Latin Small Letter O
14 963 σ Greek Small Letter Sigma 111 o Latin Small Letter O
print(f"{len(dfcgnormchars)} cyrillic and greek chars used as equivalent visual representations for {len(dfcgnormchars['NormChar'].unique())} latin letters")
27 cyrillic and greek chars used as equivalent visual representations for 18 latin letters

NOTE : this standardization step is optional.

Even if it sounds strange, all the cyrillic and greek letters in the table above are most often used as equivalents of latin letters in the French datasets.

Columns :

  • Code/Char/CharName : cyrillic or greek char often used as a visual equivalent of a latin letter
  • NormCode/NormChar/NormCharName : more frequent char which should be used to normalize text

2.11 Replace infrequent latin letters with diacritics

supportedchars = dfcharsnorm["Char"].values[1:]
' '.join(supportedchars)
'  \n \t , \' . - : / " ) ( ? ! » « | … ; [ ] } { • ¿ ¡ 0 1 2 3 5 4 9 8 7 6 a b c d e f g h i j k l m n o p q r s t u v w x y z à â ä ç è é ê ë î ï ô ö ù û ü ÿ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z À Â Ä Ç È É Ê Ë Î Ï Ô Ö Ù Û Ü Ÿ _ & @ \\ # á ã å ć č ė ğ ı í ì ń ñ ó ò õ ø š ş ß ú Á Å Š Ú Ž λ π Ã �  % ° § µ Ø ‰ € $ ¤ £ ¥ ¢ = > + < ^ ~ × ≤ ÷ ≥ ± ≠ ∞ √ * ✓ ⇒ ♥ ¦ → ★ ¯ ↓ ❌ ❐ † ↑ ← ↔ © ® ™ 🙂 😉 😀 😂 😁 😊 🙁 😅 😍 😃 😡 🤣 😄 🤔 😎 😭 👹 😱 😜 😋 \U0001f929 🙄 😆 😛 \U0001f92a 😢 😇 🤦 💪 👉 👍 👏 🙏 🙌 👇 👊 👎 👌 ✌ ✊ ⚠ 🔴 🔥 🏆 ⚽ 💡 🚨 💥 ⚡ ♫ ♂ ♀ 🎉 ✍ ✉ ✝'
latinlettersnodiacritics = {}
for rowidx,row in dflatinletters.iterrows():
    if row["IsDiacritic"]:
        latinlettersnodiacritics[row["Char"]] = row["BaseChar"]
for idx,letter in enumerate(latinlettersnodiacritics):
    if not letter in supportedchars:
        print(f"{letter} => {latinlettersnodiacritics[letter]}")
    if idx >= 60 : break
Ì => I
Í => I
Ñ => N
Ò => O
Ó => O
Õ => O
Ý => Y
ý => y
Ā => A
ā => a
Ă => A
ă => a
Ą => A
ą => a

2.12 Replace infrequent chars from other scripts

def replaceotherscripts(charset, chariterator):
    for char in chariterator:
        if char in charset:
            yield char
        else:
            family = blockfamily(charblock(char))
            if not family in ("Symbols","Ignore"):
                resStr = chr(65532) + str(ord(char)) + '_'
                for outchar in resStr:
                    yield outchar
            else:
                yield char
''.join(replaceotherscripts(supportedchars,"Guānhuà (官话/官話)"))
'Gu257_nhuà (23448_35805_/23448_35441_)'

All characters from non-latin scripts are preserved by encoding them with the following sequence :

[object replacement character] + unicode char number + [underscore]

This is necessary to preserve the distinct entity names in unsupported scripts, and enables decoding with full fidelity at a later stage of the pipeline.
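The encoding is trivially reversible. A hypothetical decoder (decodeotherscripts is an illustrative name, not part of the library) could be sketched as :

```python
import re

def decodeotherscripts(text):
    # Match: object replacement char U+FFFC, decimal code point, '_'
    return re.sub(chr(65532) + r"(\d+)_",
                  lambda m: chr(int(m.group(1))), text)

encoded = "Gu\ufffc257_nhuà (\ufffc23448_\ufffc35805_/\ufffc23448_\ufffc35441_)"
print(decodeotherscripts(encoded))  # -> "Guānhuà (官话/官話)"
```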

2.13 Replace infrequent symbols

def replacesymbols(charset, chariterator):
    for index,char in enumerate(chariterator):
        if char in charset:
            yield char
        else:
            family = blockfamily(charblock(char))
            if family == "Symbols":
                resStr ='$' + charname(char).replace(' ','') + '_'
                for outchar in resStr:
                    yield outchar
            else:
                yield char
''.join(replacesymbols(supportedchars,"😀😈😎🙈🙌"))
'😀$SmilingFaceWithHorns_😎$See-No-EvilMonkey_🙌'

All unsupported symbols are preserved by encoding them with the following sequence :

[dollar] + unicode char name + [underscore]

This enables an NLP pipeline to add English words to its vocabulary if some symbols are used frequently in the context of a sentiment analysis task.

2.14 Ignore remaining chars with no glyph

unicodefamilies[unicodefamilies["CharFamily"]=="Ignore"]
UnicodeBlock CharFamily
61 Combining Diacritical Marks Ignore
62 Private Use Area Ignore
63 Supplementary Private Use Area-A Ignore
64 Supplementary Private Use Area-B Ignore
65 Specials Ignore
66 Tags Ignore

3. Text normalization

3.1 Normalization functions

We need to apply several replacement functions in a row, each replacement function building on the replacements already applied by the previous ones.

We can't simply chain replace statements on immutable strings to do this : we would need to allocate a new string for each replacement at each level, and this would put a high load on the garbage collector.

A better solution is to implement our normalization function as a chain of iterators on chars.

Examples :

def ignorechars(chariterator, charset):
    for char in chariterator:
        if not char in charset:
            yield char
            
def replacechars1to1(chariterator, chardict):
    for char in chariterator:
        if char in chardict:
            yield chardict[char]
        else:
            yield char
            
def replacechars1toN(chariterator, chardict):
    for char in chariterator:
        if char in chardict:
            for outchar in chardict[char]:
                yield outchar
        else:
            yield char

To match several chars in an iterator, we have to build a hierarchical dictionary structure.

For example, if we want to implement the following replacements :

ABC => 1
ABD => 2
AC  => 3
BC  => 4

We build the following dictionary structure :

A : { B : { C : 1,
            D : 2 },
      C : 3 }

B : { C : 4 }

The normalization functions are then chained in a chars replacement pipeline.
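The multi-char matching driven by this hierarchical dictionary can be sketched as a greedy walk down the nested dicts (illustrative names, not the library's own; this simple version buffers the whole input, whereas the real pipeline only needs a small lookahead window) :

```python
LEAF = None  # reserved key holding the replacement at the end of a match

def replacecharsNto1(chariterator, chartrie):
    buffer = list(chariterator)
    i = 0
    while i < len(buffer):
        # Walk down the trie from position i, remembering the longest match
        node, j, lastmatch = chartrie, i, None
        while j < len(buffer) and buffer[j] in node:
            node = node[buffer[j]]
            j += 1
            if LEAF in node:
                lastmatch = (j, node[LEAF])
        if lastmatch:
            i, replacement = lastmatch
            yield replacement
        else:
            yield buffer[i]
            i += 1

# The example structure above: ABC => 1, ABD => 2, AC => 3, BC => 4
trie = {"A": {"B": {"C": {LEAF: "1"}, "D": {LEAF: "2"}}, "C": {LEAF: "3"}},
        "B": {"C": {LEAF: "4"}}}
print("".join(replacecharsNto1(iter("xABDyBC"), trie)))  # -> x2y4
```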

3.2 Normalization class with change tracking

class NormChange[source]

NormChange(layer, index, charsInput, charsOutput, removedInfo=None)

NormChange.__init__[source]

NormChange.__init__(layer, index, charsInput, charsOutput, removedInfo=None)

Initialize self. See help(type(self)) for accurate signature.

class NormResult[source]

NormResult(inputText, transformsDescs)

NormResult.describeChanges[source]

NormResult.describeChanges()

NormResult.mapOutputIndexToInput[source]

NormResult.mapOutputIndexToInput(outputIndex)

class TextNormalizer[source]

TextNormalizer()

TextNormalizer.__call__[source]

TextNormalizer.__call__(inputText)

Call self as a function.

%time norm = TextNormalizer()
norm
CPU times: user 2 s, sys: 0 ns, total: 2 s
Wall time: 2.14 s
1 - Fix encoding errors : windows1252 read as iso8859-1
2 - Fix encoding errors : utf8 read as windows1252
3 - Fix encoding errors :  windows1252 read as utf8
4 - Merge Unicode combining chars
5 - Ignore control chars
6 - Replace latin letter symbols
7 - Replace latin letter ligatures
8 - Replace latin number symbols
9 - Normalize equivalent chars
10 - Replace cyrillic and greek chars looking like latin letters
11 - Replace infrequent chars : latin letters with diacritics
12 - Replace infrequent chars : other scripts
13 - Replace infrequent chars : symbols
14 - Replace infrequent chars : chars to ignore
teststring = chr(127995)+"① l`"+chr(156)+"uv"+chr(127)+"re est¨ "+chr(147)+"belle"+chr(148)+"¸ à  ½ € énième ‰ "+chr(133)+" ⁽🇪ffic🇦ce⁾ !"
teststring
'🏻① l`\x9cuv\x7fre est¨ \x93belle\x94¸ à  ½ € énième ‰ \x85 ⁽🇪ffic🇦ce⁾ !'
result = norm(teststring)
result
(1) l'oeuvre est «belle», Ã  1/2 € énième ‰ … (EfficAce) !
print(result.describeChanges())
Fix encoding errors : windows1252 read as iso8859-1
 < 🏻① l` [œ] uvre est¨  [“] belle [”] ¸ à  ½ € énième ‰  […]  ⁽🇪ffic🇦ce⁾ !
 < 🏻① l` [œ] uvre est¨  [“] belle [”] ¸ à  ½ € énième ‰  […]  ⁽🇪ffic🇦ce⁾ !
Fix encoding errors : utf8 read as windows1252
 < 🏻① l`œuvre est¨ “belle”¸ à   [½]   [€]  énième  [‰]  … ⁽🇪ffic🇦ce⁾ !
 < 🏻① l`œuvre est¨ “belle”¸ à   [½_]   [€__]  énième  [‰__]  … ⁽🇪ffic🇦ce⁾ !
Merge Unicode combining chars
 < 🏻① l`œuvre est¨ “belle”¸ à  ½ €  [é] ni [è] me ‰ … ⁽🇪ffic🇦ce⁾ !
 < 🏻① l`œuvre est¨ “belle”¸ à  ½ €  [é_] ni [è_] me ‰ … ⁽🇪ffic🇦ce⁾ !
Ignore control chars
 <  [🏻] ① l`œuv [] re est [¨]  “belle”¸ à  ½ € énième ‰ … ⁽🇪ffic🇦ce⁾ !
 <  [_] ① l`œuv [_] re est [_]  “belle”¸ à  ½ € énième ‰ … ⁽🇪ffic🇦ce⁾ !
Replace latin letter symbols
 < ① l`œuvre est “belle”¸ à  ½ € énième ‰ … ⁽ [🇪] ffic [🇦] ce⁾ !
 < ① l`œuvre est “belle”¸ à  ½ € énième ‰ … ⁽ [E] ffic [A] ce⁾ !
Replace latin letter ligatures
 < ① l` [œ ] uvre est “belle”¸ à  ½ € énième ‰ … ⁽E [ffi  ] cAce⁾ !
 < ① l` [oe] uvre est “belle”¸ à  ½ € énième ‰ … ⁽E [ffi] cAce⁾ !
Replace latin number symbols
 <  [①  ]  l`oeuvre est “belle”¸ à   [½  ]  € énième ‰ … ⁽EfficAce⁾ !
 <  [(1)]  l`oeuvre est “belle”¸ à   [1/2]  € énième ‰ … ⁽EfficAce⁾ !
Normalize equivalent chars
 < (1) l [`] oeuvre est  [“] belle [”]  [¸]  Ã  1/2 € énième ‰ …  [⁽] EfficAce [⁾]   [!] 
 < (1) l ['] oeuvre est  [«] belle [»]  [,]  Ã  1/2 € énième ‰ …  [(] EfficAce [)]   [!] 

result.output[0:12]
"(1) l'oeuvre"
result.input[result.mapOutputIndexToInput(0):result.mapOutputIndexToInput(12)]
'🏻① l`\x9cuv\x7fre'
result.output[3:10]
" l'oeuv"
result.input[result.mapOutputIndexToInput(3):result.mapOutputIndexToInput(10)]
' l`\x9cuv\x7f'
%timeit -n100 norm(teststring)
344 µs ± 89.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
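The output↔input index mapping demonstrated above can be maintained by recording, for every output char, the index of the input char it came from. A minimal sketch of this idea (the `simple_norm` function and its `REPLACEMENTS` table are hypothetical stand-ins for illustration, not the library's actual implementation):

```python
# Minimal sketch of an output->input character index mapping maintained
# during normalization. `simple_norm` and REPLACEMENTS are hypothetical,
# not the library's implementation.
REPLACEMENTS = {"`": "'", "\x9c": "oe", "½": "1/2"}

def simple_norm(text):
    out_chars, out_to_in = [], []
    for i, ch in enumerate(text):
        for c in REPLACEMENTS.get(ch, ch):
            out_chars.append(c)   # a 1->n replacement maps all n output chars
            out_to_in.append(i)   # back to the same input index i
    return "".join(out_chars), out_to_in

output, mapping = simple_norm("l`\x9cuvre")
# output == "l'oeuvre", mapping == [0, 1, 2, 2, 3, 4, 5, 6]
```

With such a mapping, any span of the normalized output can be traced back to the corresponding span of the raw input, which is what `mapOutputIndexToInput` does above.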

3.3 Normalization pipeline stats

The statistics below count the number of chars normalized per 1 million chars in 4 distinct parts of the french datasets : business websites, forums, news, wikipedia.

The first line of the table below shows that :

  • in 1 million chars extracted from forum pages (raw user input), 41.8 chars on average will be encoding errors (windows1252 read as iso8859-1)
  • in 1 million chars extracted from business websites (curated content), only 0.5 chars will be encoding errors
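These per-million frequencies are raw counts scaled by the size of each dataset split. A sketch with made-up counts and a made-up split size (the real totals are not part of this document), chosen to roughly reproduce the forum column below :

```python
import pandas as pd

# Made-up counts and split size for illustration only; the real totals
# behind the published frequencies are not shown in this document.
total_forum_chars = 50_000_000
counts = pd.DataFrame({
    "Transform": ["Fix encoding errors : windows1252 read as iso8859-1",
                  "Ignore control chars"],
    "CountForum": [2_091, 17_453],
})
counts["FreqForum"] = counts["CountForum"] / total_forum_chars * 1_000_000
# FreqForum -> 41.82 and 349.06 chars per million
```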
normstats = pd.read_csv(chardatadir / "stats" / "normalization.total.stats.csv")
normstats[["Transform","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]
Transform FreqBusiness FreqForum FreqPresse FreqWikipedia
0 Fix encoding errors : windows1252 read as iso8... 0.510560 41.818746 0.813485 0.006025
1 Fix encoding errors : utf8 read as windows1252 0.126815 0.058024 0.072456 0.001037
2 Fix encoding errors : windows1252 read as utf8 0.000000 0.000000 0.019315 0.000000
3 Merge Unicode combining chars 2.811983 0.432638 0.568146 0.000140
4 Ignore control chars 6.450737 349.052995 6.454367 4.118586
5 Replace latin letter symbols 0.019360 0.039701 0.297372 0.150550
6 Replace latin letter ligatures 6.603815 6.541480 10.097290 17.204422
7 Replace latin number symbols 2.528338 4.162482 2.560933 0.429792
8 Normalize equivalent chars 814.327384 1248.410777 684.333730 242.391239
9 Replace cyrillic and greek chars looking like ... 0.062432 0.760424 0.491996 7.479907
10 Replace infrequent chars : latin letters with ... 0.063782 0.078384 0.099106 9.124948
11 Replace infrequent chars : other scripts 0.085694 0.468776 1.192548 16.612142
12 Replace infrequent chars : symbols 0.139271 0.159821 0.399064 0.073566
13 Replace infrequent chars : chars to ignore 0.018910 0.044282 0.021320 0.016423
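As a usage example, the stats table can be sorted to see which layer does most of the work. The sketch below rebuilds a few `FreqForum` values copied from the rows above, assuming the same column names :

```python
import pandas as pd

# A few FreqForum values copied from the stats table (chars per million).
normstats = pd.DataFrame({
    "Transform": ["Fix encoding errors : windows1252 read as iso8859-1",
                  "Ignore control chars",
                  "Replace latin letter ligatures",
                  "Normalize equivalent chars"],
    "FreqForum": [41.818746, 349.052995, 6.541480, 1248.410777],
})
ranked = normstats.sort_values("FreqForum", ascending=False)
# "Normalize equivalent chars" dominates: more than 0.1% of all forum chars
```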

Most frequent chars produced by the "Normalize equivalent chars" layer :

replacestats = pd.read_csv(chardatadir / "stats" / "normalization.layer8.stats.csv")
replacestats[["Char","CharName","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]].head(20)
Char CharName FreqBusiness FreqForum FreqPresse FreqWikipedia
0 ' Apostrophe 486.034805 160.264219 376.104982 134.658673
1 Space 310.411117 1082.845985 288.635983 87.877649
2 - Hyphen-Minus 14.431203 2.903761 12.828203 16.223154
3 « Left-Pointing Double Angle Quotation Mark 1.429478 0.680513 3.002426 0.559632
4 » Right-Pointing Double Angle Quotation Mark 1.323524 0.533926 2.461880 0.544134
5 | Vertical Line 0.003452 0.001018 0.005488 0.875894
6 Bullet 0.204104 0.243295 0.189664 0.543237
7 . Full Stop 0.059280 0.078893 0.856230 0.069278
8 " Quotation Mark 0.085093 0.023413 0.011504 0.292385
9 : Colon 0.000150 0.000509 0.000053 0.169047
10 ° Degree Sign 0.148726 0.181199 0.014618 0.078302
11 é Latin Small Letter E With Acute 0.001651 0.006108 0.003166 0.101114
12 Leftwards Arrow 0.000000 0.000000 0.000158 0.047194
13 = Equals Sign 0.004802 0.029012 0.000686 0.041589
14 Rightwards Arrow 0.026113 0.002545 0.034302 0.015862
15 d Latin Small Letter D 0.000000 0.024940 0.000000 0.036405
16 < Less-Than Sign 0.004202 0.142007 0.001267 0.024073
17 , Comma 0.006453 0.101288 0.004538 0.022756
18 Downwards Arrow 0.007504 0.001527 0.011188 0.021888
19 Black Star 0.001351 0.013743 0.022006 0.011686

Frequency of replaced characters from other scripts :

scriptsstats = pd.read_csv(chardatadir / "stats" / "normalization.layer11.stats.csv")
scriptsstats[["CharFamily","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]
CharFamily FreqBusiness FreqForum FreqPresse FreqWikipedia
0 ChineseJapaneseKorean 0.012456 0.177127 0.194677 4.059173
1 Arabic 0.012306 0.026467 0.460280 3.140120
2 Cyrillic 0.024462 0.166438 0.237159 3.118961
3 Greek 0.016058 0.022904 0.031347 2.423996
4 Hebrew 0.000150 0.000000 0.184914 1.132155
5 Other 0.000750 0.029012 0.004063 0.800871
6 Indian 0.000750 0.037665 0.033458 0.737955
7 Phonetic 0.002401 0.001527 0.001636 0.298579
8 Latin 0.013507 0.006108 0.007283 0.269377
9 Math 0.001801 0.000509 0.000528 0.240707
10 LaoThai 0.000000 0.001018 0.033194 0.217867
11 Armenian 0.001051 0.000000 0.004011 0.172382

Detailed stats for the 14 layers of the normalization pipeline :

layersstats = pd.read_csv(chardatadir / "stats" / "normalization.stats.csv")
layer=8
layersstats[layersstats["Layer"]==layer][["Layer","Input","CharName","Output","CountBusiness","CountForum","CountPresse","CountWikipedia"]].head(15)
Layer Input CharName Output CountBusiness CountForum CountPresse CountWikipedia
639 8 Right Single Quotation Mark ' 3232470 311603 6944753 4755813
640 8 No-Break Space 2057917 2127216 4892006 3122348
641 8 Thin Space 8088 116 549846 5363
642 8 En Dash - 80049 4540 172657 189791
643 8 Em Dash - 13928 329 63048 157402
644 8 · Middle Dot - 958 565 4021 202542
645 8 ` Grave Accent ' 1202 999 161302 5167
646 8 Left Double Quotation Mark « 9518 1329 56880 19728
647 8 Right Double Quotation Mark » 8808 1040 46632 19173
648 8 Left Single Quotation Mark ' 3557 952 12041 12981
649 8 Box Drawings Light Vertical | 0 0 0 25990
650 8 Hair Space 69 1 19774 246
651 8 Black Right-Pointing Pointer 570 356 913 17134
652 8 Minus Sign - 336 21 1705 15828
653 8 ´ Acute Accent ' 986 1308 8649 6423