External dependencies:
pip install numpy
pip install pandas
pip install pyarrow
pip install matplotlib
Configure tabular data display:
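The exact display settings used in the original notebook are not shown here; a typical configuration (illustrative values only) could be:
import pandas as pd
# Illustrative pandas display settings (hypothetical values, adjust as needed)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_colwidth", 150)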
French datasets for Natural Language Processing¶
The config object from frenchtext.core defines the directory where the datasets will be stored if you choose to download them:
config.datasets
You can change the default location if needed:
config["datasets_path"] = "/var/tmp"
1.1 List datasets¶
datasetsdf = list_dataset_files()
datasetsdf
1) The text content of the main French websites in the domains of finance and business (plus Wikipedia) was extracted in September 2019 using nlptextdoc.
This extraction was done as "politely" as possible:
- extract only freely and publicly available content
- respect the robots.txt directives of each website (pages forbidden for indexing, maximum extraction rate)
- detect when websites use tools to prevent indexing (like Datadome) and abort the crawl
IMPORTANT: The original authors of the websites own the copyright on all text blocks in this dataset.
To be able to link each text block to its original author, we track the origin URL of each text block throughout the whole process.
YOU CAN'T REUSE THE TEXT BLOCKS FOR ANY PURPOSE EXCEPT TRAINING A NATURAL LANGUAGE PROCESSING MODEL.
See the new European copyright rules: European Parliament approves new copyright rules for the internet
"The directive aims to make it easier for copyrighted material to be used freely through text and data mining, thereby removing a significant competitive disadvantage that European researchers currently face."
print(f"=> {len(datasetsdf)-23} websites and {datasetsdf['Pages'].sum()} HTML pages")
2) The text blocks were then:
- deduplicated to keep only distinct text blocks for each website (forgetting part of the original document structure),
- tagged (but not filtered) by language (using https://fasttext.cc/docs/en/language-identification.html),
- grouped into categories according to the main theme of the original website,
- split into Pandas dataframes of size < 2 GB.
print(f"=> {len(datasetsdf['Dataset'].unique())} categories: {list(datasetsdf['Dataset'].unique())}")
3) In each dataframe, the text blocks were additionally SHUFFLED IN A RANDOM ORDER to make it very difficult to reconstruct the original articles (a safety measure to help protect the copyright of the authors).
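For illustration, here is a minimal sketch of what steps 2) and 3) could look like with pandas and fastText; this is not the exact pipeline used to build the datasets, and the model file lid.176.bin must be downloaded separately from fasttext.cc.
import fasttext
import pandas as pd
# Hypothetical preparation sketch: deduplicate, tag language, shuffle
langmodel = fasttext.load_model("lid.176.bin")  # fastText language identification model
def prepare_textblocks(df: pd.DataFrame) -> pd.DataFrame:
    # keep only distinct text blocks for each website
    df = df.drop_duplicates(subset=["Website", "Text"])
    # tag (but do not filter) the language of each text block
    labels, _ = langmodel.predict(df["Text"].str.replace("\n", " ").tolist())
    df = df.assign(Lang=[label[0].replace("__label__", "") for label in labels])
    # shuffle the rows in a random order
    return df.sample(frac=1, random_state=42).reset_index(drop=True)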
The results of these preparation steps can be downloaded into the config.datasets directory, as dataframes serialized in the Feather format, in files named according to the 'DatasetFile' column of the table above:
- <DatasetFile>.dataset.feather
print(f"=> {len(datasetsdf['DatasetFile'].unique())} dataset files: {list(datasetsdf['DatasetFile'].unique())}")
The number of words in each text block was computed using the default French tokenizer from spaCy v2.1.
This business-oriented dataset contains 2 billion French words.
print(f"=> Total number of words : {datasetsdf['Words'].sum()}")
The detailed contribution of each website (number of pages and number of French words kept after all filters) to each category can be studied in the datasetsdf table:
Here is a summary of the number of words contributed by each category in millions:
np.floor(datasetsdf[["Dataset","Words"]].groupby(by="Dataset").sum()/1000000)
Detailed documentation for list_dataset_files():
list_dataset_files() returns one row per dataset file (an example of using this table follows the column list below).
Columns:
- Dataset : 10 categories ('Assurance', 'Banque', 'Bourse', 'Comparateur', 'Crédit', 'Forum', 'Institution', 'Presse', 'SiteInfo', 'Wikipedia')
- DatasetFile : 19 dataset file names, which should be passed to read_dataset_file() ('assurance', 'banque', 'bourse', 'comparateur', 'crédit', 'forum', 'institution', 'presse-1', 'presse-2', 'presse-3', 'presse-4', 'presse-5', 'presse-6', 'siteinfo', 'wikipedia-1', 'wikipedia-2', 'wikipedia-3', 'wikipedia-4', 'wikipedia-5')
- Website : [only used during extraction phase] unique id for each extraction job
- Url : base URL used to start crawling the website (with additional parameters when the crawl result was too big and had to be split into several dataset files, e.g. https://dumps.wikimedia.org/frwiki/latest/100000 => '/100000' was added at the end of the real base URL to record the fact that the corresponding dataset file contains the first 100000 pages of the Wikipedia dump)
- Scope : the crawl starts at the base URL, e.g. https://fr.wikipedia.org/wiki/Finance, and is then limited to one of 3 possible scopes: 'domain' = *.wikipedia.org/*, 'subdomain' = fr.wikipedia.org/*, 'path' = fr.wikipedia.org/wiki/Finance/*
- UrlsFile : name of the file where the original download URLs of each text block are tracked (used by the functions read_urls_file() and get_url_from_rowindex())
- Pages : number of pages extracted from this website
- Words : number of tokens extracted from this website (according to the default spaCy tokenizer for French); shows the contribution of each website to the aggregated dataset
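For example, the table returned by list_dataset_files() can be used to see which websites contribute the most words to a given category (a small sketch using the columns documented above; the 'Banque' category is just an example):
# Largest contributors (in words) to the 'Banque' category
banquedf = datasetsdf[datasetsdf["Dataset"] == "Banque"]
banquedf[["Website", "Url", "Pages", "Words"]].sort_values("Words", ascending=False).head(10)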
1.2 Download datasets¶
download_dataset_file("assurance")
download_all_datasets()
!ls -l {config.datasets}
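The same check can be done from Python (a sketch, assuming config.datasets resolves to the local datasets directory):
from pathlib import Path
# List the dataset files already present in the local datasets directory
for file in sorted(Path(config.datasets).glob("*.feather")):
    print(f"{file.name:40} {file.stat().st_size / 1e6:10.1f} MB")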
1.3 Read dataset files¶
datasetdf = read_dataset_file("assurance")
datasetdf
The dataset files contain the following columns:
- Website : unique id of the website (extraction job) from which the text block was extracted
- DocId : unique identifier for each page of the website
=> these two columns can be used to join with the datasets list table and the datasets urls table
- DocEltType/DocEltCmd : Document structure delimiters and document content elements
  - Document Start / Document End : beginning / end of a page
  - Document Title : title of the page
  - Document Url : url of the page
  - Section Start / Section End : chapter or section delimited in the page
  - Section Title : title of the section
  - NestingLevel : sections can be nested at several levels of depth
  - TextBlock Text : paragraph or block of text inside a section
  - List Start / List End : list of elements
  - NavigationList Start / NavigationList End : navigation menu
  - ListItem Text : text of a list element
  - NestingLevel : lists can be nested inside each other
  - Table Start / Table End : beginning / end of a table
  - TableHeader/TableCell Start / TableHeader/TableCell End : beginning / end of a table cell
  - TextBlock Text : paragraph or block of text inside a table cell
  - NestingLevel : tables can be nested inside each other
=> a dataset file contains only content text blocks (Title and Text elements)
- NestingLevel : nesting depth of the element in the page structure (the extraction algorithm nlptextdoc tries very hard to preserve the hierarchical structure of the text in the source web page)
- Text : Unicode text of the document content lines (DocEltCmd = 'Title' or 'Text')
- Lang : language of the text block in the Text column, as detected by fastText
- Words : number of words of the text block in the Text column, after tokenization by spaCy
- Unique : True for the very first occurrence of the text block in the current website (extraction file), False for all subsequent occurrences of this same text block in this extraction file
=> always True because the deduplication filter was already applied when the dataset file was created
ADDITIONAL INFO:
- the text blocks are shuffled in a random order: it is impossible to rebuild the original web pages from the dataset files (by design, to protect the copyright of the authors)
- only unique text blocks from each website are kept
- the origin of each text block is tracked with the (Website, DocId) columns => they can be used to join with the urls files (again by design, to protect the copyright of the authors)
Note: before using a dataset file to train a model, you should filter it on language and number of words.
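For example (the thresholds below are illustrative, not recommendations):
# Keep only French text blocks with at least 7 words before training
traindf = datasetdf[(datasetdf["Lang"] == "fr") & (datasetdf["Words"] >= 7)]
print(f"{len(traindf)} text blocks kept out of {len(datasetdf)}")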
datasetdf["Lang"].value_counts()[:3].plot.barh()
datasetdf["Lang"].value_counts()[2:10].plot.barh()
datasetdf["Words"].value_counts().sort_index()[:20].plot.barh()
datasetdf[datasetdf["Words"]<7]["Words"].count()
datasetdf[(datasetdf["Words"]>=7) & (datasetdf["Words"]<20)]["Words"].count()
datasetdf[(datasetdf["Words"]>=20) & (datasetdf["Words"]<100)]["Words"].count()
datasetdf[datasetdf["Words"]>=100]["Words"].count()
1.4 Read URLs files¶
During each website extraction, a corresponding urls file was created to track the urls and stats of all documents included in the extraction.
A single table containing all urls for all datasets can be read with the read_urls_file() function.
If the urls file wasn't downloaded before, it will be automatically downloaded.
read_urls_file()
read_urls_file() returns one row per page extracted from a given website.
Columns:
- Website : unique id of the website (extraction job) from which the text blocks were extracted
- DocId : unique identifier for each page of the website, same as 'DocId' column in extraction file
- DocUrl : absolute URL (with query string) from which the page contents were extracted
- Words : total number of words in the text blocks of this page
- fr/en/es/de/? : number of words in text blocks for each language
- %fr/%en/%es/%de/%? : % words in text blocks of the page for each language (can be used to filter datasets)
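For example, the language percentages can be used to keep only the text blocks coming from pages that are mostly French (a sketch; the 90% threshold is arbitrary and it assumes the %fr column is numeric):
urlsdf = read_urls_file()
# Keep pages where at least 90% of the words were detected as French,
# then join back to the dataset file via the (Website, DocId) columns
frenchpages = urlsdf[urlsdf["%fr"] >= 90][["Website", "DocId"]]
filtereddf = datasetdf.merge(frenchpages, on=["Website", "DocId"])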
1.5 Utility functions to use dataset files¶
Filter and iterate over the rows of a dataset file¶
rowsiterator = get_rows_from_datasetdf(datasetdf)
show_first_rows(rowsiterator, skip=5)
rowsiterator = get_rows_from_datasetdf(datasetdf, minwords=None, maxwords=5, lang="?")
show_first_rows(rowsiterator,10)
Filter and iterate over the text blocks of a full dataset (across multiple files)¶
textiterator = get_textblocks_from_dataset("Assurance", minwords=None, maxwords=10, lang="fr")
show_first_textblocks(textiterator,skip=2000,count=10)
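A typical use is to stream the filtered text blocks into a plain text file for language model training (a sketch, assuming the iterator yields plain strings; the filter values are illustrative):
# Export all French text blocks of at least 7 words from the 'Assurance' dataset
textiterator = get_textblocks_from_dataset("Assurance", minwords=7, maxwords=None, lang="fr")
with open("assurance-fr.txt", "w", encoding="utf-8") as corpus:
    for textblock in textiterator:
        corpus.write(textblock + "\n")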
Access a specific row - Retrieve the Url from which this text block was extracted¶
get_text_from_rowindex(datasetdf,100)
get_url_from_rowindex(datasetdf,100)
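Since any reuse must credit the original author, a text block can be displayed together with the URL it was extracted from (a small sketch combining the two calls above):
# Show a text block together with its source URL for attribution
rowindex = 100
print(get_text_from_rowindex(datasetdf, rowindex))
print("Source:", get_url_from_rowindex(datasetdf, rowindex))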
Find text blocks with a specific char or substring¶
find_textblocks_with_chars(datasetdf,"rétroviseur",count=20,ctxsize=15)
find_textblocks_with_chars(datasetdf,64257,count=10,wrap=True)