Table of Contents
Words from the Director
- Growing the Collection: ARTFL-FRANTEXT and other databases
- Our Search and Retrieval Engine: PhiloLogic3
- Where to now? From Words to Works: Philomine
- Conclusion
As most of our users know, the Project for American and French Research on the Treasury of the French Language (ARTFL) is a collaborative project with the French Centre national de la recherche scientifique (CNRS) laboratory Analyse et Traitement Informatique de la Langue française. With some 330 subscribing institutions across North America, the ARTFL project is now one of the oldest and most successful on-line full-text services serving the scholarly community of research and higher learning. With success have come new challenges and responsibilities. Over the last several years, the ARTFL project has been working to enhance both our collections and our software. This has been a collective effort involving discussions with our users as well as internal research and development efforts. On behalf of the whole ARTFL team, I am happy to announce this new release of both our search and retrieval engine, PhiloLogic, and our main database, ARTFL-FRANTEXT.
Growing the Collection: ARTFL-FRANTEXT and other databases
We have been steadily augmenting our holdings. We have now consolidated them. We have integrated into our main database over 1,000 new works, bringing the number of works to over 2,600 and the total number of words to over 150 million. We have been growing our collections in several ways, among which the most important have been:
I. Collaborative projects. This has consistently been one of our priorities. While we are engaged in many such projects, a few stand out, and I mention them both for their importance and as a means of inviting other researchers and institutions to join us in collaborative work.
- Our collaboration with Maison de Balzac and Le Groupe International de Recherches Balzaciennes at the Université de Paris Diderot-7 has allowed us to incorporate the electronic edition of the famous Furne edition of Balzac’s Comédie humaine. ARTFL has made PhiloLogic available to the Maison de Balzac for its own site (Project Homepage).
- Through our collaboration with the Université de Neuchâtel, we have integrated Mme de Scudéry’s massive Artamène ou le grand Cyrus (Project Homepage). We have entered into agreements with research teams at the Université de Paris IV Sorbonne which will lead to further collection developments.
- Our work with the Montaigne Project and the journal Montaigne Studies here at the University of Chicago has brought us the Villey edition of the Essais with the corresponding digital images from the Bordeaux copy (Project Homepage).
- A new undertaking entitled the Collaborative Initiative for French and North American Libraries (CIFNAL) has begun to take shape and has already joined with the Médiathèque de l’agglomération troyenne to make available more than 100 works of the famous Bibliothèque bleue de Troyes (Project Homepage).
II. In-house data entry projects. This type of data entry has allowed us to attenuate some of the more flagrant lacunae in the original corpus, most notably in the area of women writers where we have added some 140 new texts. Other additions include a full-text version of Bayle’s Dictionnaire historique et critique and a significant number of issues of the Journal de Trévoux.
III. Freely available on-line editions of suitable quality. Among our most important additions of texts falling into this category, one might mention the Mémoires of Saint-Simon, de Tocqueville’s De la démocratie en Amérique, and Rousseau’s Confessions. Any information or suggestions concerning freely available texts that users bring to our attention are most welcome.
While the works integrated into our main database all conform to high standards of correction, we have chosen to make some texts, such as the Bibliothèque bleue or Bayle’s Dictionnaire historique et critique, available in a much rougher state. While we or our collaborative partners are looking for funds to carry out thorough corrections, we believe that the interests of the research community are best served by making this data available in its current state. We welcome any offers of help in correcting these texts.
Lastly, our digital edition of Diderot and D’Alembert’s Encyclopédie continues to develop. We have performed over 400,000 corrections and, with the help of collaborative partners both here and abroad, we regularly add improvements as well as archival material to this heavily used resource. For further information, I invite you to consult the Encyclopédie home page on the ARTFL website.
Our Search and Retrieval Engine: PhiloLogic3
The ARTFL project is unusual in that it involves the close collaboration of a set of scholars trained in the scholarly analysis of text and a technical development team specializing in computational methods for organizing, storing, searching, retrieving and analyzing textual materials of many types and in many languages.
Recently completed under the direction of Mark Olsen, ARTFL’s Associate Director in charge of technical development, this new version of our search and retrieval engine offers several new enhancements and features. We have maintained our basic commitment to the implementation of easy-to-use, intuitively straightforward software that can be quickly mastered by scholars of various disciplines. But at the same time we have tried to increase what might be called the dimensionality of both the query formulations and the results. By this I mean that we attempted on the one hand to give simple means to limit or extend the searches and on the other to view the results from various points of view (date variations, collocations, position in sentence etc.).
New features include:
- Enhanced bibliographic searching. This includes full Boolean capabilities (any combination of AND, OR, NOT, etc.), the ability to search on genres and vastly improved capabilities of moving from the bibliographic references to the text and to information about the text (word-frequencies and the like).
- Onelook Dictionary query. A direct link from the text to our dictionary collection allows users to select any word and press ‘d’ to open a window showing the entry for that word in the dictionary or dictionaries contemporary to the text. An automatic lemmatizer moves from the inflected form in the text (for example, arrivèrent) to the root word (arriver).
- New KWIC reporting. As before you can easily move from a KWIC occurrence to the full page. Now you can go to the bibliographic reference by clicking on ‘bib’. From there you can click on the title for access to the entire work (when not under copyright) or you can click on ‘word count’ to get a list of all word forms occurring in the text in alphabetical order and with their frequency.
- Similarity searching. If you are searching in older texts, spelling often varies. The new similarity button allows you to search for similar forms of the word. So, for example, if you type in Charlemagne, the program informs you that the following similar forms occur in the database and gives frequency of occurrence:
4 carlemaine
1930 charlemagne
6 charlemagnes
110 charlemaigne
4 charlemaine
3 charlemainne ... etc.
- Frequency by Period reports. The user can now generate a table of word or word pattern frequencies (and the frequency per 10,000 words rate) over 100-year periods, or 25-year periods in any given century.
- Refined Search Results. There are several new ways of refining your search results which I will mention briefly.
- The user can now see how often (relative or absolute frequency) a word, a phrase, a group of words, etc. occurs by title, by author, or by date (year, decade, quarter century, half-century, or century).
- PhiloLogic3 quickly generates collocation tables that show which words occur around a given word. In the near future we hope to be able to extend this capability to multiple words.
- The user can now sort the KWIC results for a given request by words to the right or left of the key word or words.
- PhiloLogic3 allows you to sort your results by the position your word(s) occupy in the phrase (Theme-Rheme).
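To make two of the reports above more concrete, here is a minimal Python sketch of a frequency-by-period table (raw count plus rate per 10,000 words) and a simple collocation count. It is purely illustrative and is not PhiloLogic3's own code: the function names, the `(year, text)` input format, the naive whitespace tokenizer, and the default window of five words on either side are all assumptions made for this example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation.
    (Hypothetical, deliberately naive tokenizer: elided forms such as
    "l'homme" stay as a single token.)"""
    words = (w.strip(".,;:!?()\"'«»") for w in text.lower().split())
    return [w for w in words if w]

def frequency_by_period(docs, word, span=25):
    """docs is an iterable of (year, text) pairs (an assumed format).
    Returns {period_start: (hits, hits per 10,000 words)}."""
    hits, totals = Counter(), Counter()
    for year, text in docs:
        tokens = tokenize(text)
        start = year - year % span      # e.g. 1580 -> 1575 when span=25
        hits[start] += tokens.count(word)
        totals[start] += len(tokens)
    return {p: (hits[p], 10000 * hits[p] / totals[p])
            for p in sorted(totals) if totals[p]}

def collocates(docs, word, window=5):
    """Count the words found within `window` tokens on either side
    of each occurrence of `word`."""
    counts = Counter()
    for _, text in docs:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == word:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts
```

Calling `frequency_by_period(docs, "tradition", span=100)` would group the same counts by century, and `collocates(docs, "tradition").most_common(20)` would list the twenty most frequent neighbouring words. A production system would rely on a prebuilt index rather than rescanning every text for each query.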
PhiloLogic3 has been released to the open source community, and a variety of digital text projects located elsewhere now use this software for its speed and facility. We invite you to experiment freely with the new system and to give us your comments.
Where to now? From Words to Works: Philomine
While we will continue to make improvements to PhiloLogic3 – for example, we hope soon to allow users to move, with a simple click of the mouse, from the collocation table to seeing given collocations in context – we nevertheless believe that we are touching the limits of the traditional "single hit" approach to text analysis in the digital humanities.
As denoted by its name, the PhiloLogic system has been designed to support fairly customary notions of textual research informed by the long tradition of philology and historical semantics. Researchers typically examine evolving word and concept use over time through key words, such as tradition, or clusters of terms that make up a topos, theme or commonplace, or patterns of word use. As powerful and useful as concordances, frequency counts, or collocation tables may be, the focus on small sets of words is subject to many limitations. For example, as the size of textual databases increases, even searches for relatively uncommon words can result in tens of thousands of occurrences, far beyond the capacity of the user to digest. The advent of massive textual digitization projects thus presents ARTFL with a whole set of challenges that will require enhancing the retrieval and analysis capabilities of PhiloLogic and opening new lines of inquiry into textual analysis.
To meet these challenges, we have embarked on a development program we have dubbed "From Words to Works." In this program, we are seeking means to leverage machine learning to move from single hit retrieval to large-scale results analysis. New techniques coming out of information and computer science, such as information retrieval, machine learning, text data mining, and document clustering, offer new ways to approach humanities text research that complement, but do not replace, more traditional approaches supported by systems like PhiloLogic.
To support our initial experimentation, the ARTFL Project has developed a set of drop-in extensions to the PhiloLogic system called Philomine. This is an interactive environment designed to allow us to run experiments easily and quickly on a large number of different databases using a variety of feature set selections. It also is designed to allow the user to link back to PhiloLogic text search reports or specific objects.
While the current, experimental implementations need to be refined and remain computationally too expensive to allow for public implementation, over time we hope to develop and release a revised version of PhiloLogic to support more coherently some of the Philomine functions with which we are currently experimenting. When we have tested these new functions, we will begin to make them available to the larger community of scholars as extensions to our PhiloLogic system.
Conclusion
In closing, I would like to emphasize that most of the development efforts we have undertaken, be it in the area of collections or software, would have been impossible without the help and collaboration of the rather extraordinary set of graduate students from a whole range of disciplines -- humanities, social sciences, computer science, mathematics -- who have worked with us over the years. I would like to express my deep gratitude for their contributions as well as for the support of my colleagues in the Department of Romance Languages and Literatures at the University of Chicago.