External dependencies:
!pip install pandas
Configure tabular data display in this notebook:
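The exact settings are assumptions; any standard pandas display options would do here:

import pandas as pd

# Assumed display settings: show more rows and wider cells for the char tables below
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_colwidth", 100)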
Character set normalization for French¶
The config object from frenchtext.core defines the directory where the character normalization tables are located:
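For the cell below to run, chardatadir must already be in scope; presumably the notebook begins with something like the following (an assumption about frenchtext's API):

# Assumption: frenchtext.core exposes chardatadir (a pathlib.Path) at module level
from frenchtext.core import *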
!ls {chardatadir}
0. Unicode character properties¶
charname("🙂")
charcategory("🙂")
charsubcategory("🙂")
charblock("🙂")
blockfamily('Emoticons')
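For reference, Python's built-in unicodedata module exposes comparable character properties (it has no block or block-family lookup, which the frenchtext helpers add):

import unicodedata

unicodedata.name("🙂")       # 'SLIGHTLY SMILING FACE'
unicodedata.category("🙂")   # 'So' (Symbol, other)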
1. Explore French dataset characters¶
French datasets often contain several thousand distinct Unicode characters.
We need to reduce the number of distinct characters fed to our natural language processing applications, for three reasons:
- chars the user considers visually equivalent will often produce different application behavior: this is a huge problem for the user experience (see the example after this list)
- with so many chars, the designer of the NLP application will not be able to reason about all possible combinations: this can harm the explainability of the system
- this huge number of distinct characters adds a significant amount of complexity that the NLP models will have to deal with
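For example, the precomposed char 'é' (U+00E9) and the sequence 'e' + combining acute accent render identically but are different strings:

"é" == "e\u0301"          # False: same glyph, different code point sequences
len("é"), len("e\u0301")  # (1, 2)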
1.1 Character frequency in French datasets¶
dfcharstats = pd.read_csv(chardatadir / "charsetstats_raw.csv", sep=";")
dfcharstats
1.2 Character stats in the Wikipedia dataset¶
- 35.6 billion chars
charsCountWikipedia = dfcharstats["CountWikipedia"].sum()
charsCountWikipedia
- 13,502 distinct Unicode chars
distinctCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>0])
distinctCharsWikipedia
- Only 1,316 chars are more frequent than 1 in 100 million (i.e. appear more than 356 times in 35.6 billion chars)
frequentCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>356])
frequentCharsWikipedia
- Frequent chars represent 9.7% of all distinct Unicode chars
pctFreqCharsWikipedia = frequentCharsWikipedia/distinctCharsWikipedia*100
pctFreqCharsWikipedia
- 99.9987% of Wikipedia chars would be preserved if we only kept the frequent chars
pctPreservedCharsWikipedia = (1-dfcharstats[dfcharstats["CountWikipedia"]<=356]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedCharsWikipedia
1.3 Character stats in the Business dataset¶
- 27.5 billion chars
charsCountBusiness = dfcharstats["CountBusiness"].sum()
charsCountBusiness
- 3,763 distinct Unicode chars
distinctCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>0])
distinctCharsBusiness
- Only 531 chars are more frequent than 1 in 100 million (i.e. appear more than 275 times in 27.5 billion chars)
frequentCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>275])
frequentCharsBusiness
- Frequent chars represent 14.1% of all distinct Unicode chars
pctFreqCharsBusiness = frequentCharsBusiness/distinctCharsBusiness*100
pctFreqCharsBusiness
- 99.9996% of Business chars would be preserved if we only kept the frequent chars
pctPreservedCharsBusiness = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountBusiness"].sum()/dfcharstats["CountBusiness"].sum())*100
pctPreservedCharsBusiness
- 99.985% of Wikipedia chars would be preserved if we only kept the frequent Business chars
pctPreservedBizCharsInWikipedia = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedBizCharsInWikipedia
1.4 Character stats after Unicode normalization¶
After applying the normalization process defined below in this notebook, here are the remaining chars:
dfcharsnorm = pd.read_csv(chardatadir / "charset-fr.csv", sep=";")
dfcharsnorm
Stats for the character families after normalization¶
The table below shows, for each character category and subcategory (after normalization), the number of distinct chars, the chars themselves, and their combined frequency per 100 million characters in the Business dataset:
# Group the normalized charset: count distinct chars, concatenate them for display, sum their Business counts
dfblocks = dfcharsnorm.groupby(by=["Category","SubCategory"]).agg({"Char":["count","sum"],"CountBusiness":"sum"})
# Rescale raw counts to occurrences per 100 million characters
dfblocks["CountBusiness"] = (dfblocks["CountBusiness"] / charsCountBusiness * 100000000).astype(int)
dfblocks
2. Character normalization pipeline¶
After a detailed study of all the frequent chars, the goal is to design a normalization pipeline which retains as much information as possible while greatly reducing the number of distinct chars.
We saw above that it is possible to preserve 99.9996% of the original chars while keeping only around 500 distinct chars (531 in the Business dataset). By replacing equivalent chars with a single canonical form, we can halve this number and still retain the same amount of information.
It may then be useful to limit the number of distinct characters after normalization to 255:
- if needed, French text can then be encoded with a single byte per char (see the sketch after this list)
- the list of supported chars can be memorized by NLP application developers and users
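A toy sketch of the single-byte idea; the charset below is a stand-in, the real 255-char inventory would come from charset-fr.csv:

# Hypothetical: assign one byte per normalized char (0 is reserved here for "unknown")
charset = "abcdefghijklmnopqrstuvwxyzéèêàçùâîôû .,'-"  # stand-in for the real inventory
char2byte = {c: i + 1 for i, c in enumerate(charset)}
byte2char = {i: c for c, i in char2byte.items()}

def encode(text: str) -> bytes:
    return bytes(char2byte.get(c, 0) for c in text)

def decode(data: bytes) -> str:
    return "".join(byte2char.get(b, "?") for b in data)

decode(encode("déjà vu"))  # 'déjà vu'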
The normalization pipeline applies the following 14 steps, which are explained and illustrated in the sections below.
- Fix encoding errors
  - fix windows1252 text read as iso8859-1
  - fix utf8 text read as windows1252
  - fix windows1252 text read as utf8
  - merge Unicode combining chars
  - ignore control chars
- Remove display attributes
  - replace Latin letter symbols
  - replace Latin letter ligatures
  - replace Latin number symbols
- Normalize visually equivalent chars
  - replace equivalent chars
  - replace Cyrillic and Greek chars looking like Latin letters
- Encode infrequent chars while losing a little bit of information
  - replace infrequent Latin letters with diacritics
  - replace infrequent chars from other scripts
  - replace infrequent symbols
  - ignore remaining chars with no glyph
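A sketch of a driver that could apply such steps in order (a hypothetical shape, not frenchtext's actual API):

from typing import Callable, Iterable

def normalize_text(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    # Apply each str -> str normalization step in pipeline order
    for step in steps:
        text = step(text)
    return text

# Toy example with two stand-in steps; the real pipeline would plug in the 14 steps above
normalize_text(" Hello\x00 ", [lambda t: t.replace("\x00", ""), str.strip])  # 'Hello'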
2.1 Frequent encoding errors: windows1252 read as iso8859-1¶
dfencodingwin1252 = pd.read_csv(chardatadir / "windows1252-iso8859-errors.csv", sep=";")
dfencodingwin1252.head(10)
print(f"{len(dfencodingwin1252)} frequent encoding errors seen in french datasets : a character encoded as windows1252 was incorrectly decoded as iso8859-1")
Columns:
- Code/Char: incorrectly decoded control char seen in French text
- DecodedCode/DecodedChar: properly decoded char which should replace the original control char
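These errors can be undone with a byte-level round trip: iso8859-1 maps every byte to the code point of the same value, so re-encoding the bad char as latin-1 recovers the original byte, which can then be decoded as windows1252. A minimal sketch:

# '’' (U+2019) encoded as windows1252 is byte 0x92; decoded as iso8859-1 it becomes
# the control char U+0092. Round-tripping through latin-1 undoes the error:
"\x92".encode("latin-1").decode("cp1252")  # '’'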
2.2 Frequent encoding errors: utf8 read as windows1252¶
dfencodingutf8 = pd.read_csv(chardatadir / "utf8-windows1252-errors.csv", sep=";")
dfencodingutf8.head(10)
print(f"{len(dfencodingutf8)} very unlikely substrings produced when text encoded with UTF-8 is decoded by mistake as iso8859-1 or windows1252")
Columns:
- ErrorSubstring: unlikely 2- or 3-character substring produced when UTF-8 text is decoded by mistake as windows1252
- DecodedCode/DecodedChar: properly decoded char which should be used to replace the unlikely substring
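The typical symptom is mojibake such as 'Ã©' where 'é' was expected; the error substring can be repaired by re-encoding it as windows1252 and decoding the bytes as UTF-8. A minimal sketch:

# 'é' encoded as UTF-8 is b'\xc3\xa9'; decoded as windows1252 it becomes 'Ã©'
"Ã©".encode("cp1252").decode("utf-8")  # 'é'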
2.3 Frequent encoding errors: windows1252 read as utf8¶
dfencodingwin1252utf8 = pd.read_csv(chardatadir / "windows1252-utf8-errors.csv", sep=";")
dfencodingwin1252utf8.head()
print(f"{len(dfencodingwin1252utf8)} char very unlikely in french text produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8")
Columns:
- Char: unlikely char produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8
- DecodedCodes/DecodedChars: properly decoded substring which should be used to replace the unlikely char
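Here the round trip runs in the opposite direction: encode the unlikely char back to UTF-8 bytes and decode them as windows1252. For example, the windows1252 bytes of 'Ô»' form a valid UTF-8 sequence for the Armenian letter 'Ի' (an illustration of the mechanics, not necessarily an entry of the table above):

# U+053B encoded as UTF-8 is b'\xd4\xbb', which is 'Ô»' in windows1252
"\u053b".encode("utf-8").decode("cp1252")  # 'Ô»'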
2.4 Unicode combining chars¶
dfcombiningchars = pd.read_csv(chardatadir / "combiningdiacritics.csv", sep=";")
dfcombiningchars.head()
print(f"{len(dfcombiningchars['Char'].unique())} combining chars {list(dfcombiningchars['Diacritic'].unique())} should be recombined with {len(dfcombiningchars)} base latin characters to produce standard latin characters with diacritics")
Columns:
- BaseChar: Latin char encountered first in the string, which will be modified by the combining char immediately following it
- Code/Char: combining char immediately following BaseChar, which should be combined with it to produce CombinedChar
- Diacritic: type of accent/diacritic applied by the combining char
- CombinedChar: Latin char with diacritic produced by the combination of BaseChar and the combining char following it
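Python's standard unicodedata module can perform this recombination through NFC normalization (a sketch: the pipeline relies on its own table, and NFC also applies other canonical compositions):

import unicodedata

unicodedata.name("\u0301")               # 'COMBINING ACUTE ACCENT'
unicodedata.normalize("NFC", "e\u0301")  # 'é': BaseChar and combining char merged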
2.5 Control chars¶
dfcontrolchars = pd.read_csv(chardatadir / "controlchars.csv", sep=";")
dfcontrolchars.loc[0,"Char"] = chr(0) # chr(0) can't be stored in the CSV file, so restore it after loading
dfcontrolchars
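Most control chars carry no textual content and are simply dropped by the "ignore control chars" step. A minimal sketch of such filtering with the standard 'Cc' category (in the actual pipeline the table above drives the decision):

import unicodedata

def drop_control_chars(text: str) -> str:
    # Remove every char in the Unicode 'Cc' (Control) category
    return "".join(c for c in text if unicodedata.category(c) != "Cc")

drop_control_chars("caf\x00é\x1b")  # 'café'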