Archives parlementaires (May 2013)
June 10: MVO copied database to back-up machine.
May 8, 2013: Note: This dataset is UNCORRECTED OCR (Optical Character Recognition) output taken from the Stanford GitHub repository. Please treat it as an alpha or "proof-of-concept" database at this time. This build contains all 82 of the expected volumes of text.
Current state information
Volumes 52, 49, 5, 71b, 34, and 25 have suspiciously high rates of the unaccented form (and also many Latin-1 accents).
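A check of this kind can be sketched in Python. This is only a rough illustration, not the procedure actually used on these volumes: the function names `decode_volume` and `accent_rate` are hypothetical, and the sketch assumes each volume is available as raw bytes that are either UTF-8 or Latin-1.

```python
import unicodedata


def decode_volume(raw: bytes):
    """Return (text, encoding). Try UTF-8 first, then fall back to Latin-1.

    Assumption: a volume that fails strict UTF-8 decoding is Latin-1.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"


def accent_rate(text: str) -> float:
    """Fraction of letters that carry a diacritic.

    A letter counts as accented when its NFD form decomposes into a base
    letter plus at least one combining mark (for example, e + acute accent).
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    accented = sum(
        1 for c in letters if len(unicodedata.normalize("NFD", c)) > 1
    )
    return accented / len(letters)
```

Comparing `accent_rate` across volumes would make outliers such as those listed above stand out, and the encoding returned by `decode_volume` would show which files still need conversion from Latin-1.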
Volume 8 images need to be rotated.
I have set up my ingestion system to automatically rename and modify the GitHub repository files. Modifications:
Added speaker identification to the "sp" tag in the TEI.
Added dates and volume dates to internal bibliographic metadata in the TEI.
Added dates to "divs" identified as "session" in "div" level metadata in the TEI
To do: identify "cahiers" as a "div" object type.
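The TEI modifications listed above could look roughly like the following Python sketch. It is not the actual ingestion code. It assumes un-namespaced TEI in which each "sp" contains a "speaker" child, and the attribute names `who` and `date`, the `id` lookup, the function names, and the sample fragment are all invented for illustration.

```python
import xml.etree.ElementTree as ET


def add_speaker_ids(root):
    """Copy each <speaker> child's text into a 'who' attribute on its <sp>.

    Returns the number of <sp> elements annotated.
    """
    count = 0
    for sp in root.iter("sp"):
        speaker = sp.find("speaker")
        if speaker is not None and speaker.text and speaker.text.strip():
            sp.set("who", speaker.text.strip())
            count += 1
    return count


def add_session_dates(root, date_by_id):
    """Set a 'date' attribute on every <div type="session"> whose id is known.

    date_by_id is a hypothetical mapping from div id to a date string.
    """
    count = 0
    for div in root.iter("div"):
        if div.get("type") == "session" and div.get("id") in date_by_id:
            div.set("date", date_by_id[div.get("id")])
            count += 1
    return count


# Invented sample fragment, not taken from the actual volumes.
sample = (
    '<text><body><div type="session" id="s1">'
    "<sp><speaker>M. Exemple</speaker><p>...</p></sp>"
    "</div></body></text>"
)
root = ET.fromstring(sample)
add_speaker_ids(root)
add_session_dates(root, {"s1": "1789-05-05"})
print(ET.tostring(root, encoding="unicode"))
```

Marking "cahiers" as a "div" object type could follow the same pattern once there is a reliable way to recognize those sections.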