• ATILF - CNRS
  • UNIVERSITY OF CHICAGO
  • DIVISION OF THE HUMANITIES
  • UNIVERSITY OF CHICAGO LIBRARY

The ARTFL Project

  • What's new
  • Papers & Presentations
  • PhiloLogic
  • Subscription Info
  • About ARTFL
  • Contact us


TextPAIR

TextPAIR (Pairwise Alignment for Intertextual Relations)


TextPAIR is a scalable, high-performance sequence aligner for humanities text analysis, designed to identify "similar passages" in large collections of texts. These may include direct quotations, plagiarism and other forms of borrowing, commonplace expressions and the like. It is a complete rewrite of the original implementation released in 2009. See the official announcement on our blog for details on this new implementation.

While TextPAIR was developed in response to the fairly specific phenomenon of similar passages across literary works, the sequence analysis techniques it employs were developed in widely disparate fields, such as bioinformatics and computer science, with applications ranging from genome sequencing to plagiarism detection. TextPAIR generates a set of overlapping word-sequence shingles for every text in a corpus, then stores and indexes that information so it can be compared against shingles from other texts. For example, the opening declaration from Rousseau's Du Contrat Social,

"L'homme est né libre, et partout il est dans les fers. Tel se croit le maître des autres, qui ne laisse pas d'être plus esclave qu'eux,"

would be rendered in trigram shingles (with lemmatization, accents flattened and function words removed) as:

homme_libre_partout
libre_partout_fer
partout_fer_croire
fer_croire_maitre
croire_maitre_laisser
maitre_laisser_esclave.
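
The rendering above can be approximated with a short Python sketch. This is only an illustration under stated assumptions, not TextPAIR's own code: the stopword list and lemma table are hypothetical toy resources that cover just this sentence, and the function names are invented for the example. A real pipeline would rely on a full lemmatizer and function-word list.

```python
import re
import unicodedata

# Hypothetical toy resources, just large enough for the example sentence.
# Note that "né" becomes "ne" once accents are flattened, so it is filtered
# together with the negation particle.
STOPWORDS = {
    "l", "est", "ne", "et", "il", "dans", "les", "tel", "se", "le", "des",
    "autres", "qui", "pas", "d", "etre", "plus", "qu", "eux",
}
LEMMAS = {"fers": "fer", "croit": "croire", "laisse": "laisser"}


def flatten_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Lowercase, flatten accents, tokenize, drop function words, lemmatize."""
    flat = flatten_accents(text.lower())
    tokens = re.findall(r"[a-z]+", flat)
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOPWORDS]


def shingles(tokens, n=3):
    """Return every overlapping run of n consecutive tokens, joined by '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


passage = (
    "L'homme est né libre, et partout il est dans les fers. "
    "Tel se croit le maître des autres, qui ne laisse pas d'être "
    "plus esclave qu'eux,"
)

for shingle in shingles(preprocess(passage)):
    print(shingle)
```

With these toy resources the sketch prints the same six trigrams listed above. Changing `n` yields longer or shorter shingles.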

Shingles shared across texts can indicate many different types of textual borrowing, from direct citations to more ambiguous and unattributed uses of a passage. Using a simple search form, the user can quickly identify similar passages shared between different texts in one database, or even across databases. Interested parties are encouraged to consult the release site for more documentation.
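
To make the idea of shared shingles concrete, here is a minimal Python sketch of one way to index shingles and list those shared by each pair of texts. It assumes a plain in-memory inverted index; this page does not describe how TextPAIR actually stores or matches shingles, so the approach and every name below are illustrative only. The "pamphlet" and "unrelated" token lists are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def shingles(tokens, n=3):
    """Return every overlapping run of n consecutive tokens, joined by '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(corpus, n=3):
    """Map each shingle to the set of text ids in which it occurs."""
    index = defaultdict(set)
    for text_id, tokens in corpus.items():
        for shingle in shingles(tokens, n):
            index[shingle].add(text_id)
    return index


def shared_shingles(index):
    """Group the shingles that occur in two or more texts by pair of texts."""
    pairs = defaultdict(set)
    for shingle, text_ids in index.items():
        for a, b in combinations(sorted(text_ids), 2):
            pairs[(a, b)].add(shingle)
    return dict(pairs)


# Already preprocessed tokens. Only "rousseau" comes from the example above;
# the other two texts are hypothetical.
corpus = {
    "rousseau": ["homme", "libre", "partout", "fer",
                 "croire", "maitre", "laisser", "esclave"],
    "pamphlet": ["peuple", "savoir", "homme", "libre",
                 "partout", "fer", "briser", "chaine"],
    "unrelated": ["roi", "partir", "chasse", "matin", "foret"],
}

for pair, common in shared_shingles(build_index(corpus)).items():
    print(pair, sorted(common))
```

Here only the "pamphlet" and "rousseau" texts share shingles, namely the two trigrams covering "homme libre partout fer". A run of adjacent shared shingles like this one is the kind of signal that points to a candidate similar passage.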

We have released several databases based on the results of our sequence aligner:

  • Practices and Legacy of 18th Century Culture
  • Frantext and the French Revolutionary Collection
  • Reuses in ARTFL-Frantext
  • Reuses in ARTFL-Frantext and the ARTFL-Encyclopédie

The ARTFL Project
Department of Romance Languages and Literatures
Division of the Humanities
University of Chicago
1115 East 58th Street Chicago, IL 60637
tel: 773-702-8488 | email: artfl[at]artfl[dot]uchicago[dot]edu
