Newberry French Revolution Collection
NEW!! We have implemented a new version of the FRC combining PhiloLogic, Ranked Relevance and Topic Model browsing. Please consult our post Modeling Revolutionary Discourse on the ARTFL blog for information and access to this build.
The incredible richness of the Newberry Library’s French Revolution Collection has long been known. The challenge, of course, has been the ability to explore it. In 2016-17, the Newberry Library’s French Revolution Pamphlets put the digitized version of these archives online. ARTFL is proud to introduce it, powered by PhiloLogic4's research capabilities. The build we are releasing to the public should be considered a work in progress, but we believe that we have reached a point where public availability would be of use.
The Newberry Library’s French Revolution Collection (FRC) consists of more than 30,000 pamphlets and more than 23,000 issues of 180 periodicals published between 1780 and 1810. The collection represents the opinions of all the factions that opposed and defended the monarchy during the turbulent period between 1789 and 1799 and also contains innumerable ephemeral publications of the early Republic. The Newberry has released digital copies of more than 35,000 pamphlets totaling approximately 850,000 pages. Not only has the Newberry made the collection available to the public, but it has also released a data feed of the entire collection, consisting of the Library’s exceptional metadata describing each object, the OCR text data, and links to the digital facsimiles accessible from the Internet Archive, encouraging researchers and instructors to incorporate the digital collection in new kinds of scholarship and engagement.
To facilitate experimental work at ARTFL on this unparalleled resource, we have loaded two versions of this collection into PhiloLogic4. These are based on a download of the collection from the GitHub repository in November 2017. The full collection contains 38,377 documents from the 16th century to the end of the 19th century:
NEW!! Search/Browse the full FRC under PhiloLogic4.7.
Search/Browse the full FRC under PhiloLogic4.
We have a second build which attempts to eliminate duplicate documents and is restricted to the period 1787-1799. It contains 25,595 documents (01/02/2020). We have eliminated 6,139 duplicates based on metadata, 5,661 documents outside of our desired date range, and 132 undated documents, with a few important documents included from 19th-century editions.
Search/Browse the selected FRC under PhiloLogic4
It is our pleasure to acknowledge that the Newberry Library has released this extraordinary resource under the Open Data Commons Attribution License, ODC-BY 1.0. We believe that this splendid collection and the Newberry’s release of all of the data will facilitate a generation of ground-breaking work in Revolutionary studies. If you find the collection useful, please do contact the Newberry Library to congratulate them on this wonderful initiative and to let them know how their efforts contribute to your research.
Robert Morrissey
Director, ARTFL Project
Caveat Emptor or Release notes
Our PhiloLogic4 builds of the FRC are based on the OCR data provided by the Newberry and page image links to the collection at the Internet Archive. The OCR is, in general, very solid. All quotations and citations, however, should be made from the PAGE IMAGES and NOT from the OCR’d text. Uncorrected or partially corrected OCR is an excellent way to find instances of words in page images and can be used for many tasks that might be considered “distant reading”, but it is of widely varying quality and should not be relied on for close reading or exhaustive research. The page links in each document are dynamically generated and should get you to the corresponding page image on the Internet Archive. We are not highlighting search terms, as the IA has an excellent document search function which shows all of the results for a volume and highlights the search words on the page image.
We have processed the text data in both versions with a fuzzy match correction process which attempts to find the nearest matches to strings that are not found in a list of valid words drawn from several ARTFL collections. The initial data, for example, has 43,500 instances of the string “constitution”. After processing, the sample has more than 92,000 occurrences. The changes range from fairly obvious corrections, such as conftitution or conflitution, to less frequent but fairly dissimilar swaps: gonllitution or coiiilitution. It is important to note that distance measures can introduce errors, typically by changing strings into (usually) valid words that a human would not select: acllon to wallon or aclïon to alcyon. Words that are broken can result in odd resolutions, such as circonst --> zircons. The correction has not been applied to short strings or to low frequency strings. It has significantly improved text mining and sequence alignment tasks, which is our primary objective, but it introduces some errors.
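To make the idea concrete, here is a minimal sketch of lexicon-based fuzzy correction. It is not ARTFL's actual process: the similarity measure (Python's standard `difflib`), the function names, and the thresholds `min_len`, `min_freq`, and `cutoff` are all assumptions chosen for illustration. It does mirror the two restrictions described above, leaving short strings and low frequency strings untouched.

```python
from collections import Counter
from difflib import get_close_matches


def build_corrections(tokens, lexicon, min_len=6, min_freq=2, cutoff=0.8):
    """Map unknown tokens to their nearest lexicon word (illustrative thresholds)."""
    counts = Counter(tokens)
    vocab = sorted(lexicon)
    corrections = {}
    for token, freq in counts.items():
        # Leave valid words, short strings, and rare strings untouched.
        if token in lexicon or len(token) < min_len or freq < min_freq:
            continue
        match = get_close_matches(token, vocab, n=1, cutoff=cutoff)
        if match:
            corrections[token] = match[0]
    return corrections


def apply_corrections(tokens, corrections):
    """Replace each token that has a correction; keep all others as-is."""
    return [corrections.get(token, token) for token in tokens]


# Toy data: the lexicon and token stream are invented for this example.
lexicon = {"constitution", "nation", "loi"}
tokens = [
    "constitution",
    "conftitution", "conftitution",
    "conflitution", "conflitution",
    "loi", "lqi", "xyzzyq",
]
corrections = build_corrections(tokens, lexicon)
corrected = apply_corrections(tokens, corrections)
```

In this toy run both OCR variants of “constitution” are folded into the valid form, while “lqi” (too short) and “xyzzyq” (too rare, and with no close match) are left alone. The same mechanism explains the errors noted above: any string that happens to sit closer to the wrong valid word will be rewritten to it.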
We produced the selected PhiloLogic4 build to facilitate quantitative work on this collection. The duplicate document selection is currently based on matching metadata. This is a preliminary step. We have missed some duplicates, most notably titles with no authors, and eliminated some documents that are not duplicates. We plan to run our sequence aligners on the entire collection and compare the results against our metadata-based selection. Expect to see some changes to this collection in the future.
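As a rough illustration of metadata-based duplicate selection, the sketch below groups records on a normalized (title, author, year) key and keeps the first record seen for each key. The field names, the normalization rules, and the decision never to merge records without an author are assumptions made for this example, not a description of the ARTFL pipeline.

```python
import re
import unicodedata


def normalize(value):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", str(value or ""))
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def split_duplicates(records):
    """Return (kept, dropped), keeping the first record for each metadata key."""
    seen = set()
    kept, dropped = [], []
    for rec in records:
        author = normalize(rec.get("author"))
        key = (normalize(rec.get("title")), author, normalize(rec.get("year")))
        # Assumption: records with no author are never treated as duplicates.
        if author and key in seen:
            dropped.append(rec)
        else:
            seen.add(key)
            kept.append(rec)
    return kept, dropped


# Toy records invented for this example.
records = [
    {"id": "a", "title": "Qu'est-ce que le Tiers-État?", "author": "Sieyès, Emmanuel Joseph", "year": "1789"},
    {"id": "b", "title": "Qu'est-ce que le tiers etat", "author": "Sieyes, Emmanuel Joseph", "year": "1789"},
    {"id": "c", "title": "Adresse", "author": "", "year": "1790"},
    {"id": "d", "title": "Adresse", "author": "", "year": "1790"},
]
kept, dropped = split_duplicates(records)
```

Here record "b" is dropped as a duplicate of "a", while the two authorless "Adresse" records both survive. A key this strict will also miss true duplicates whose metadata differ slightly, which is where comparing the texts themselves can help.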
We have installed both versions of this collection with selected metadata fields. The complete metadata for every document is available from the table of contents by clicking the Show Header button. We have selected the FIRST author in each document rather than all of the possible “creators” in order to tighten up frequency reports by author. We have also not included all of the subject fields at this time.
We have only indexed higher frequency words: the top 400,000 words by frequency. You will find that some low frequency words/strings which appear in the text cannot be searched for. This saves the computer time and storage space that would otherwise be spent indexing strings which are, for the most part, not words.
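The effect of a frequency cutoff can be sketched in a few lines. This is a simplified illustration, not PhiloLogic4's indexing code: the function names are hypothetical, and the cutoff is set to 2 instead of 400,000 so that the toy data shows the behavior.

```python
from collections import Counter


def build_index_vocabulary(tokens, max_words):
    """Return the set of the max_words most frequent tokens."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(max_words)}


def is_searchable(word, vocabulary):
    """A word can be searched only if it made it into the indexed vocabulary."""
    return word in vocabulary


# Toy token stream invented for this example; "pcuple" stands in for OCR noise.
tokens = ["peuple"] * 3 + ["nation"] * 2 + ["pcuple"]
vocabulary = build_index_vocabulary(tokens, max_words=2)
```

With this cutoff, "peuple" and "nation" are searchable while the one-off string "pcuple" is not, even though it appears in the text.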
EXPERIMENTS
We have implemented a new version of the FRC combining PhiloLogic, Ranked Relevance and Topic Model browsing. Please consult our post Modeling Revolutionary Discourse on the ARTFL blog for information and access to this build.
Sequence Alignment Test: June 22, 2018 parameters.