TextPAIR (Pairwise Alignment for Intertextual Relations)
TextPAIR is a scalable, high-performance sequence aligner for humanities text analysis, designed to identify "similar passages" in large collections of texts. These may include direct quotations, plagiarism and other forms of borrowing, commonplace expressions, and the like. It is a complete rewrite of the original implementation released in 2009. See the official announcement on our blog for details on this new implementation.
While TextPAIR was developed in response to the fairly specific phenomenon of similar passages across literary works, the sequence analysis techniques it employs originated in widely disparate fields, such as bioinformatics and computer science, with applications ranging from genome sequencing to plagiarism detection. TextPAIR generates a set of overlapping word sequence shingles for every text in a corpus, then stores and indexes that information so it can be compared against shingles from other texts. For example, the opening declaration from Rousseau's Du Contrat Social,
"L'homme est né libre, et partout il est dans les fers. Tel se croit le maître des autres, qui ne laisse pas d'être plus esclave qu'eux,"
would be rendered in trigram shingles (with lemmatization, accents flattened and function words removed) as:
homme_libre_partout
libre_partout_fer
partout_fer_croire
fer_croire_maitre
croire_maitre_laisser
maitre_laisser_esclave.
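The shingling step can be sketched in a few lines of Python. This is a minimal illustration, not TextPAIR's own code: the `STOPWORDS` set and `LEMMAS` table are hypothetical, hand-built to cover only the Rousseau sentence, and a real pipeline would rely on a full function-word list and a proper lemmatizer.

```python
import re
import unicodedata

# Hypothetical, hand-built resources covering only this example sentence.
# They are assumptions made for illustration, not TextPAIR's actual lists.
STOPWORDS = {"l", "est", "ne", "et", "il", "dans", "les", "tel", "se", "le",
             "des", "autres", "qui", "pas", "d", "etre", "plus", "qu", "eux"}
LEMMAS = {"fers": "fer", "croit": "croire", "laisse": "laisser"}

SENTENCE = ("L'homme est né libre, et partout il est dans les fers. "
            "Tel se croit le maître des autres, qui ne laisse pas "
            "d'être plus esclave qu'eux,")


def flatten_accents(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def shingles(text: str, n: int = 3) -> list[str]:
    """Return overlapping n-word shingles after normalization."""
    # Lowercase, flatten accents, then split on anything that is not a letter.
    tokens = re.findall(r"[a-z]+", flatten_accents(text.lower()))
    # Drop function words, then map the remaining tokens to their lemmas.
    kept = [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOPWORDS]
    # Slide a window of n words over the filtered token list.
    return ["_".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]


for shingle in shingles(SENTENCE):
    print(shingle)
```

Because function words are removed before the window slides, each shingle spans more of the original sentence than three adjacent words would, which is what lets a match survive small differences in wording.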
Shingles shared across texts indicate many different types of textual borrowing, from direct citations to more ambiguous, unattributed uses of a passage. Using a simple search form, the user can quickly identify similar passages shared between different texts in one database, or even across databases. Interested parties are encouraged to consult the release site for more documentation.
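To illustrate how indexed shingles can surface shared passages, here is a small Python sketch that builds an in-memory inverted index and keeps the shingles occurring in more than one text. It is an assumption-laden toy: the function names, the `docs` dictionary, and the `later_text` document are all hypothetical, and TextPAIR's actual storage and matching logic may differ substantially.

```python
from collections import defaultdict


def word_shingles(tokens, n=3):
    """Overlapping n-word shingles from an already-normalized token list."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs, n=3):
    """Map each shingle to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for pos, shingle in enumerate(word_shingles(tokens, n)):
            index[shingle].append((doc_id, pos))
    return index


def shared_shingles(index):
    """Keep only shingles that occur in at least two different documents."""
    return {shingle: hits for shingle, hits in index.items()
            if len({doc_id for doc_id, _ in hits}) > 1}


# Hypothetical, pre-normalized token lists (lemmatized, function words removed).
docs = {
    "rousseau": ["homme", "libre", "partout", "fer",
                 "croire", "maitre", "laisser", "esclave"],
    "later_text": ["auteur", "dire", "homme", "libre",
                   "partout", "fer", "temps"],
}

shared = shared_shingles(build_index(docs))
for shingle, hits in shared.items():
    print(shingle, hits)
```

In this toy data the two shared shingles sit at consecutive positions in both documents, and a run of that kind is the sort of signal an aligner could use to delimit a candidate passage.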
We have released several databases based on the results of our sequence aligner: