We’re All MADS Here: Mapping TEI to the Metadata Authority Description Schema

Michael Andrew Gavin1, Jennifer Miller2

1University of South Carolina, United States of America; 2Rice University, United States of America

Guiding Questions:

How should scholarly editors mark up textual references to real people and places? How can editors and cataloguers identify common references across different texts, databases, and description standards?


In 2006, the Personography Working Group was tasked with “investigat[ing] how other existing XML schemes and TEI customizations handle data about people, both to facilitate ‘cross-walking’ between these different schemata and to provide recommendations about how personography is handled in TEI P5.” The report and recommendations were completed later that year. Since 2006, the Library of Congress has adopted its Metadata Authority Description Schema (MADS) as a new XML format for canonical information about persons and places.

The MADS record is designed with three key features: 1) a unique identifier; 2) an unlimited number of variants; and 3) detailed source records and notes. The schema is designed for both flexibility and standardization. For example, here is an excerpt from the existing MADS record for Plato.

The bulk of the MADS record is a compilation of name variants:

<mads:name type="personal" authority="naf">



<mads:variant type="other">

<mads:name type="personal">




<mads:variant type="other">

<mads:name type="personal">




<mads:variant type="other">

<mads:name type="personal">




Notes are collected at the bottom:

<mads:note type="source">Cresson, A. Platon … 1941</mads:note>

<mads:note type="source">Yi, C.H. Pʻŭllatʻo chʻŏrhak yŏnʼgu, 1981: t.p. (Pʻŭllatʻo)</mads:note>

<mads:note type="source">Google, 02-06-02 (Plato, Plato was born in the spring of year 427 BC, in the island of Agina, near Athens. His father was Ariston. His mother was Perictone who belonged to the highly esteemed family of Solon, the early Greek philosopher; he died 347 B.C.)</mads:note>

<mads:note type="source">Encyclopedia Britannica (Plato b. about 428 B.C., d. 347 B.C.)</mads:note>


Finally, the records are controlled by identifiers and classifications:

<mads:identifier type="lccn">n 79139459</mads:identifier>



In our micropaper, we will briefly introduce MADS, with an emphasis on the elements <mads:name>, <mads:variant>, <mads:note>, and <mads:identifier>. We will discuss similarities and differences between MADS and TEI and conclude by providing a conceptual mapping from TEI to MADS.

This talk will be of interest to scholarly editors and digital librarians who hope to maximize search capabilities across texts and databases.
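As a minimal sketch of what a conceptual mapping from TEI to MADS could look like in practice, the following Python fragment converts a TEI `<person>` entry into a MADS-style record. The correspondences used here (the first `<persName>` becomes the `<mads:authority>` name, further `<persName>` elements become `<mads:variant>` entries, `<note>` maps to `<mads:note>`, and `<idno>` maps to `<mads:identifier>`) are illustrative assumptions rather than the mapping proposed in the paper, and the sample TEI entry is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
MADS_NS = "http://www.loc.gov/mads/v2"
ET.register_namespace("mads", MADS_NS)

# Hypothetical TEI personography entry, built from values quoted in the MADS excerpt.
TEI_PERSON = """
<person xmlns="http://www.tei-c.org/ns/1.0" xml:id="plato">
  <persName>Plato</persName>
  <persName>Platon</persName>
  <idno type="lccn">n 79139459</idno>
  <note type="source">Encyclopedia Britannica (Plato b. about 428 B.C., d. 347 B.C.)</note>
</person>
"""


def _q(ns, tag):
    """Build a namespace-qualified tag name in ElementTree's {ns}tag notation."""
    return "{%s}%s" % (ns, tag)


def tei_person_to_mads(person):
    """Map a TEI <person> element to a MADS-style record (assumed mapping)."""
    mads = ET.Element(_q(MADS_NS, "mads"))
    for i, pers_name in enumerate(person.findall(_q(TEI_NS, "persName"))):
        if i == 0:
            # Assumption: the first name form is treated as the authorized heading.
            parent = ET.SubElement(mads, _q(MADS_NS, "authority"))
        else:
            parent = ET.SubElement(mads, _q(MADS_NS, "variant"), {"type": "other"})
        name = ET.SubElement(parent, _q(MADS_NS, "name"), {"type": "personal"})
        part = ET.SubElement(name, _q(MADS_NS, "namePart"))
        part.text = "".join(pers_name.itertext()).strip()
    for note in person.findall(_q(TEI_NS, "note")):
        out = ET.SubElement(mads, _q(MADS_NS, "note"), {"type": note.get("type", "source")})
        out.text = "".join(note.itertext()).strip()
    for idno in person.findall(_q(TEI_NS, "idno")):
        out = ET.SubElement(mads, _q(MADS_NS, "identifier"), {"type": idno.get("type", "local")})
        out.text = "".join(idno.itertext()).strip()
    return mads


record = tei_person_to_mads(ET.fromstring(TEI_PERSON))
print(ET.tostring(record, encoding="unicode"))
```

Keeping the identifier as the join key means two TEI documents that spell a name differently can still resolve to the same record, which is the cross-text search scenario the guiding questions raise.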

Ambiguous Terms: Race and TEI Markup

Theresa Habbestad

As TEI standards continue to evolve, one element that is not entirely accounted for is the markup of race and the intricacies such markup would involve, despite the provision for numerous comparable elements. While this concern may not seem pressing within the context of TEI, aspects of the current academic environment call this issue into question. Most notably, concerns raised about issues of race, gender, and diversity within the Digital Humanities and the existence of digital projects focused on race relying on TEI markup bring this issue to the forefront. In this paper, I will discuss the need for the development of a methodology for marking race. I will begin by articulating the current concerns regarding race, cultural criticism, and the Digital Humanities. I will then discuss the ways in which TEI suggests the inclusion of racial mark-up. I will support this argument with a discussion of current relevant TEI tags (i.e. , , and ) along with the ways in which Critical Race Studies factors into this argument. I will go on to discuss projects that have incorporated racial markup, including work on 19th century African-American newspapers and American Civil Rights era texts as well as my own work on Seeta, a nineteenth-century novel, set in India, by Philip Meadows Taylor. I will discuss the reason for racial markup in each of these projects and the ways in which each accounts for race. Given the nature of this topic, I do not presume to offer a definitive answer for how TEI should address issues of race. Rather, I plan to contribute to an ongoing discussion of ways in which TEI can address this concern among users.

Working Bibliography

Gailey, Amanda. “A Case for Heavy Editing: The Example of Race and Children’s Literature in the Gilded Age.” The American Literature Scholar in the Digital Age. Ed. Amy E. Earhart and Andrew Jewell. Ann Arbor: U. of Michigan P, 2011. 125-144.

Omi, Michael and Howard Winant. “Racial Formation.” Racial Formation in the United States: From the 1960s to the 1990s. New York: Routledge, 1994. 53-77.

Outlaw, Lucius. “Toward a Critical Theory of ‘Race’.” Race and Racism. Boxill, Bernard ed. Oxford: Oxford UP, 2001. 58-82.

Spurr, David. The Rhetoric of Empire: Colonial Discourse in Journalism, Travel Writing, and Imperial Administration. Durham: Duke UP, 1996.

Todorov, Tzvetan. “Race.” On Human Diversity: Nationalism, Racism, and Exoticism in French Thought. Cambridge: Harvard UP, 1998.

Electronic Text Services, from Projects to Portfolio Management

Michelle Dalmau, Randall Floyd, Julie Hardesty

Indiana University, United States of America

Academic libraries have a long history of supporting electronic text projects. Usually these library-based projects follow a version of the Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange (TEI). For those of us grappling with legacy encoded texts, ongoing encoding projects or more likely a combination of both, we are faced with the need to improve our e-text workflow – from encoding to publishing – in ways that cultivate, promote and support all levels of encoded texts from mass digitization initiatives to scholarly encoding.

For many years now, we at Indiana University have been on a quest to integrate our multiple and varied projects in an effort to move away from unnecessarily customized encoding approaches, one-off web development, and silos of content. We are not alone. It seems that since the proliferation of electronic text centers in the nineties, the evolution from individual projects to portfolio management of projects has been slow-going, but the acknowledgement of a service-oriented model has been ever-present: “the approach [to building digital collections] is that of building collections as a by-product of a service center rather than collection building being the primary purpose of the program activity” (Giesecke, McNeil & Minks, 2000, p. 77). As our open source efforts like the Fedora digital object repository, the eXtensible Text Framework (XTF) web delivery system, and homegrown systems (i.e., METS Navigator and Xubmit) have evolved, a services framework is finally emerging in a way that features the TEI in fundamental and prominent ways.

As Unsworth notes, in this age of Google, in which the TEI Consortium “competes … in intellectual and institutional ways with various disciplines and institutional commitments” it is important to convey and operationalize in cost-effective ways the value of the TEI (2011, para. 22). To this end, we are working to establish a mid-level suite of services that sits somewhere between big data initiatives – like those underway as part of the HathiTrust Research Center – and smaller, carefully crafted electronic text research projects (Kowalczyk, Unsworth, Plale, McDonald & Sun, 2012). According to Crane et al., the progression to “fourth generation collections” of services, arguably similar to those being developed by the HathiTrust Research Center, requires at its simplest level access to scanned images and full text (para. 8). Perhaps this mid-level service framework, with an emphasis on basic discovery and display of texts, is akin to the elusive interoperable layer that may serve as a “state or stage in the processing of data and not necessarily … the final state or stage” upon which more advanced interactions and mechanisms can be built while providing immediate access to content (Unsworth, para. 14).

At Indiana University, we have begun “bridging the gap between scholarly editing and mass digitization” following a loose adaptation of Crombez’s “hybrid model for interchange” that relies on three components: the core (e.g., scholarly encoded texts), the cloud (e.g., repositories of uncurated texts), and the interface (e.g., web services for automatic text analysis) (2011). Whereas Crombez’s vision is far more sophisticated algorithmically, the metaphor is generally applicable to our efforts in supporting essential e-text services. For us, the streamlining process emerged rather counterintuitively as part of web development, but has nonetheless provided guidance for how to model encoding approaches as we embark on future projects. Following Crombez’s framework, our equivalent of the core is a set of XSL transformations that can manipulate TEI-encoded texts regardless of their level of encoding. The cloud represents, in our library model, the encoded text repository. And the interface provides plug-and-play functionality in the form of modular building blocks that support browsing, searching, display and mobile access that can then be layered with more advanced forms of discovery or textual processing and manipulation.

For this presentation, we will explore building-block strategies and frameworks that will help us define a streamlined electronic text service model capable of supporting multiple textual markup use cases and levels of encoding that we commonly encounter in libraries. We are particularly interested in determining whether the recently overhauled “Best Practices for TEI in Libraries,” which now include a set of schemas, can serve as a basis for an e-text service model (Hawkins, Dalmau & Bauman, 2011). Finally, we are hoping to cultivate a community of TEI expert users of the California Digital Library’s open source eXtensible Text Framework (XTF) delivery system to ensure ongoing system development proceeds with actual TEI encoding practices and use cases at the forefront. The discussion will center on three recent e-text projects that serve to illustrate the new approaches we have taken to address some of these issues: Victorian Women Writers Project, The Brevier Legislative Reports, and Indiana Authors and Their Books.

This is work in progress to develop a balanced service model that attempts to accommodate production-level and research projects equally well, with an eye toward modular, reusable development and deployment of e-text projects. We believe that such a service model will help to “ensure that the emergent research environment is TEI-friendly” whether in support of non-consumptive research methods, advanced visualizations or just plain old browsing and searching of texts (Unsworth, 2011, para. 24).

Works Cited

Crane, G., Babeu, A., Bamman, D., Breuel, T., Cerrato, L., Deckers, D., … Zeldes, A. (2009). Classics in the Million Book Library. Digital Humanities Quarterly, 3. Retrieved from

Crombez, T. (2011, March 4). Bridging the gap between scholarly editing and mass digitization. Retrieved from

Engle, M. (1998). The social position of electronic text centers. Library Hi Tech, 16(3/4), 15-20.

Giesecke, J., McNeil, B. & Minks, G. (2000). Electronic text centers: Creating research collections on a limited budget, the Nebraska experience. Journal of Library Administration, 31(2), 77-92.

Hawkins, K., Dalmau, M. & Bauman, S. (2011). Best practices for TEI in libraries. Retrieved from

Kowalczyk, S., Unsworth, J., Plale, B., McDonald, R.H., & Sun, Y. (2012). The HathiTrust Research Center: An overview. Retrieved from

Mallory, M. (1994). Directory of Electronic Text Centers. Retrieved from

Unsworth, J. (2011). Computational work with very large text collections. Journal of the Text Encoding Initiative. doi: 10.4000/jtei.215

Library Support for the TEI: Tutorials, Teaching, and Tools

Harriett Elizabeth Green

University of Illinois at Urbana-Champaign, United States of America

The TEI has been at the heart of several research libraries’ pursuit of digital humanities support and research, and the University of Illinois Library is no exception. This paper will examine how the TEI is core to two facets of the University of Illinois Library’s research services: providing digital humanities tools, and teaching workshops for faculty and students on using Oxygen and TEI. This paper will explore how the TEI has been instrumental in library services for digital humanities, what potential the TEI holds for future projects, and the critical ways in which the TEI has been a powerful tool for libraries’ endeavors in digital scholarship.

TEI Boilerplate

John A. Walsh, Grant Leyton Simpson

Indiana University, United States of America



TEI Boilerplate is a lightweight, HTML5-compliant framework for publishing TEI documents. TEI Boilerplate is designed to bridge the gap between the browser-friendly features of HTML and the semantic richness of native TEI documents (Walsh, Simpson, & Moaddeli, 2012).

TEI is an XML-based language for describing and analyzing literary texts and other documents of interest to humanities scholars. Although TEI provides mechanisms for describing the design, presentational, and material features of the source document, projects and individual scholars that use TEI are responsible for developing their own methods, or implementing existing solutions, for converting the TEI to a presentation-ready state for the web or print (Rahtz, 2006). Two potential paths to reach this goal are:

1. Transforming TEI to HTML using XSLT and styling the HTML output with CSS.

2. Styling the TEI directly with CSS by referencing a CSS stylesheet from within the TEI document.

Both of these approaches have advantages and disadvantages. Although HTML is the language of the web and, as such, is well supported by browsers, HTML’s descriptive capabilities are much less expressive than TEI’s. When TEI is transformed to HTML, much of the richness of the TEI is lost or obscured in the resulting HTML. However, the browser understands HTML very well and knows, for example, when to initiate retrieval of a document based on certain user events, such as clicking a link. The second option, CSS-styled TEI, delivers the TEI document directly to the browser. However, while the browser may apply CSS to format and style a TEI document, the browser does not understand the semantics of TEI. For instance, the browser does not understand that TEI’s <ptr> and <ref> elements are linking elements.

TEI Boilerplate bridges the gap between these two approaches by making use of the built-in XSLT (1.0) capabilities of browsers to embed the TEI XML, with minimal modifications, within an HTML5 shell document. Features expected of web documents, such as clickable links and display of linked images, are enabled through selective transformation of a very small number of TEI elements and attributes. Both the HTML5 shell and the embedded TEI are styled using CSS.
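To make the idea of selective transformation concrete, here is a rough Python stand-in. TEI Boilerplate itself performs this step with XSLT 1.0 in the browser, so the function below is only an illustration under that caveat: it wraps each TEI `<ref>` or `<ptr>` that carries a `target` attribute in an XHTML `<a>` element and leaves every other TEI element untouched. The function name `linkify` and the sample paragraph are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def linkify(root):
    """Wrap TEI linking elements in XHTML anchors; keep all other TEI as-is."""
    link_tags = {"{%s}ref" % TEI_NS, "{%s}ptr" % TEI_NS}
    # Snapshot the tree first so that wrapping elements does not disturb iteration.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag in link_tags and child.get("target"):
                anchor = ET.Element("{%s}a" % XHTML_NS, {"href": child.get("target")})
                # The text following the link belongs after the new wrapper.
                anchor.tail, child.tail = child.tail, None
                parent.remove(child)
                anchor.append(child)
                parent.insert(index, anchor)
    return root


sample = ET.fromstring(
    '<p xmlns="http://www.tei-c.org/ns/1.0">See '
    '<ref target="http://example.org/">this page</ref> now.</p>'
)
print(ET.tostring(linkify(sample), encoding="unicode"))
```

Because the original `<ref>` survives inside the anchor, CSS and JavaScript can still address the TEI element by name, which is the property the approach depends on.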

TEI Boilerplate gives HTML/CSS/JavaScript documents direct access to original TEI content, and it gives TEI documents direct access to the substantial capabilities of HTML, CSS, and JavaScript—the dominant document format, styling language, and (client-side) programming language of the web. TEI Boilerplate aims for simplicity and elegance, but it also facilitates complexity and innovation by exposing TEI content directly to the capabilities of JavaScript, the many powerful JavaScript frameworks, and CSS.

In teaching contexts, TEI Boilerplate is a useful solution. Students and scholars new to TEI are rightly excited and enthused as they encode their first documents and experience the expressive power of TEI markup. But too often that initial excitement and enthusiasm dissipate when new TEI users inquire about publishing their documents on the web and are confronted with sometimes overwhelming details about an unfamiliar programming language, XSLT, and other information about server-side processing and transformation engines. With TEI Boilerplate, users can simply add a single line of code to their documents to instruct modern browsers to produce formatted output of their TEI content. And that default TEI Boilerplate output may be further tweaked and customized using CSS.
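The "single line of code" is an `xml-stylesheet` processing instruction placed ahead of the root element. The short Python helper below is a sketch of how that line could be added to existing files in bulk; the helper name `add_stylesheet_pi` and the stylesheet path `content/teibp.xsl` are assumptions for illustration and should be replaced with the location of the TEI Boilerplate XSLT in a given installation.

```python
import re

# Assumed relative path to the TEI Boilerplate stylesheet; adjust per installation.
TEIBP_HREF = "content/teibp.xsl"


def add_stylesheet_pi(tei_xml, href=TEIBP_HREF):
    """Insert an xml-stylesheet processing instruction into a TEI document string."""
    if "<?xml-stylesheet" in tei_xml:
        return tei_xml  # already linked to a stylesheet; leave unchanged
    pi = '<?xml-stylesheet type="text/xsl" href="%s"?>\n' % href
    # The instruction must follow the XML declaration when one is present.
    declaration = re.match(r"\s*<\?xml\s[^?]*\?>\s*", tei_xml)
    if declaration:
        end = declaration.end()
        return tei_xml[:end] + pi + tei_xml[end:]
    return pi + tei_xml


doc = '<?xml version="1.0" encoding="UTF-8"?>\n<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'
print(add_stylesheet_pi(doc))
```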

TEI Boilerplate also provides a simple solution for delivering TEI documents through Omeka, the popular open-source web-publishing platform (Omeka, 2012). No additional software or plugins are necessary. One can simply host the TEI Boilerplate XSLT, JavaScript, and CSS files on a server alongside Omeka and change a parameter in the TEI Boilerplate XSLT file to point to that server.

Theoretical Motivations

The power of TEI lies in the richness and expressiveness of its vocabulary. But much of that richness and expressiveness is lost in the translation to HTML. TEI Boilerplate largely preserves the integrity of the TEI document. Because the TEI document is delivered directly to the browser, that source TEI document—unchanged by any XSLT transformation—can be easily accessed and saved to the desktop or mobile device.

Scholars labor over the intricate encoding of TEI documents, encoding that may represent sophisticated readings and analysis. But with the typical XSLT publishing solution, which transforms the TEI to HTML, much or all of the richness of the TEI content is lost. Furthermore, the presentation of the document is targeted at the HTML surrogate rather than the intricately encoded TEI document. This results in a conceptual disconnect between the design of the document and the original TEI encoding. By exposing the TEI directly to the browser, one may format the TEI directly, applying intentional design to a sophisticated model of a source document. If one wishes to manipulate the document using JavaScript, one may manipulate the TEI directly. Scholars are likely to be intimately familiar with their TEI documents, and their engagement with and understanding of these documents may benefit when the formatting and processing is targeted at the TEI itself, rather than at an HTML surrogate.

TEI Boilerplate respects the integrity of the TEI document, and keeps the TEI document central throughout the publication process. TEI Boilerplate takes advantage of the separation of form and content inherent in XML, XSLT, and CSS frameworks. However, like Liu (2004), Galey (2010), and others, the authors of TEI Boilerplate view that separation with suspicion. TEI Boilerplate attempts to weaken that separation of form and content in the typical TEI-to-web design and delivery model by largely removing the HTML layer, exposing the TEI-encoded text directly to the browser, and providing scholars with more immediate access to the readings, models, and analysis embedded in the TEI-encoded document.


Our proposed paper will provide an overview of the TEI Boilerplate system and explore in more detail the theoretical motivations behind the project.


* Galey, A. (2010). The human presence in digital artifacts. In W. McCarty (Ed.), _Text and genre in reconstruction: effects of digitization on ideas, behaviours, products, and institutions_ (pp. 93-117). Oxford: Open Book.

* Liu, A. (2004). Transcendental data: Towards a cultural history and aesthetics of the new encoded discourse. _Critical Inquiry, 31_, 49-84.

* Omeka (2012). Omeka: Serious web publishing. Retrieved from

* Rahtz, S. (2006). Storage, retrieval, and rendering. In L. Burnard, K. O’Brien O’Keeffe, & J. Unsworth (Eds.), _Electronic textual editing_ (pp. 310-333). New York, NY: Modern Language Association of America.

* Walsh, J., Simpson, G., & Moaddeli, S. (2012). TEI Boilerplate. Retrieved from

Resolving the Durand Conundrum

Lou Burnard

Consultancy, United Kingdom

This paper proposes a minor but significant modification to the TEI ODD language and explores some of its implications. Can we improve on the present compromise whereby TEI content models are expressed in RELAX NG? I argue that a very small set of additional elements would permit the ODD language to cut its ties with existing schema languages, and thus permit it to support exactly and only the subset or intersection of their facilities which makes sense in the TEI context. It would make the ODD language an integrated and independent whole rather than an uneasy hybrid, and pave the way for future developments in the management of structured text beyond the XML paradigm.

Rebooting TEI XPointers

Hugh Cayless

New York University, United States of America

The TEI XPointer schemes have been in existence for several years now, but they have seldom been implemented and used. Recently a new working group on Stand-Off Markup was convened by the TEI Council to discuss what to do with this section of the Guidelines. This paper presents a proposal (currently under development) to reboot the TEI XPointer schemes, giving them a complete specification and sample implementations.

More Hack, More Yack: Juxta, Parallel Segmentation, and Opening the Black Box of One Textual Tool

Tanya Elizabeth Clement1, Dana Wheeles2

1University of Texas at Austin; 2NINES, University of Virginia

The saying adopted by the THATCamp community (“More Hack, Less Yack”) points to a central, problematic issue for critical inquiry in the Digital Humanities: tools developed or “hacked” without the means to provoke critical discussion (“more yack”) can be “black boxes.” Bruno Latour defines the “black box” as a scientific classification for a piece of machinery that is too complex to explain, about which the user needs “to know nothing but its input and output” (3). Certainly, much literary theory discusses the value of “difficult” literature that resists simple evaluative resolutions, and black-boxed tools meant simply to produce access to one-stop textual surrogates can stall critical inquiries into the complexities of Text (in the Barthesian sense). This paper will discuss the “productive unease” (Flanders) or critical inquiry (“yack”) that new developments in Juxta Commons afford for literary scholars interested in interrogating how works, specifically including textual variants that correspond to TEI parallel segmentation, are represented and presented in a digital environment.

Getting critical with the apparatus: how to re-think the TEI encoding of critical editions?

Marjorie Burghart1, Roberto Rosselli del Turco2

1Ecole des Hautes Etudes en Sciences Sociales, France; 2Dipartimento di Scienze del Linguaggio – Università di Torino, Italy

The Critical Apparatus module/chapter of the TEI is both central and controversial: central because it is mandatory for each critical editor to master and use it; controversial because most editors complain about its features. The module has not been revised since TEI P3, and is therefore in need of an overhaul that keeps in mind the expectations and needs of users in a continuously evolving digital environment. The possibilities in terms of processing, displaying or even printing have to be balanced with the need for an encoding scheme manageable “manually” by scholars, without having to rely on third-party tools to handle the tagging process.

In this paper we will discuss the current state of the Critical Apparatus module/chapter, with its strengths and shortcomings. We will also explain the steps that have been taken so far towards a revision of the module, and the difficulties we have met. Finally, we will set out the goal we hope to reach (and possible solutions towards that end): a module that would answer the various and varied needs of a wide range of users, whether they are encoding a print-based apparatus, a born-digital one, or working on automatic collation.

What does it mean to “support TEI” for manuscript transcription? The tool-maker’s perspective.

Ben Brumfield


Of twenty-five manuscript transcription systems cataloged in a crowdsourced tool directory, only six claim to support TEI. What explains such a low rate of adoption for a standard explicitly designed to represent text, including tags specifically tailored for the kind of documentary editing projects these tools address? To what extent do those tools which “support TEI” actually rely on TEI internally to support their application logic? How do tool-makers decide which tags they will present to their users? Finally, how do users react when they are asked to encode documents as part of a crowdsourced transcription project? This presentation will discuss several software systems that support TEI in some form, describing the trade-offs toolmakers face when designing public-facing manuscript transcription software.

Foremost among TEI-enabled transcription systems is University College London’s Bentham Transcription Desk, a tool developed for the TranscribeBentham project. Like Wikisource and Scripto, this open-source system relies on MediaWiki as its underlying editorial framework. Unlike those tools, the TranscribeBentham team have added a TEI Toolbar to facilitate their volunteer users’ ability to indicate Bentham’s additions, deletions, marginalia, and other revisions to his manuscript papers. Careful planning went into the selection of tags supported by the TEI Toolbar and the associated tag renderer: Interviews via email with the participants involved with the project have shed some light on a transcription project’s main concerns.

Melissa Terras explains that it was difficult to imagine conducting a transcription project that only solicited plaintext transcripts. As with other aspects of the TranscribeBentham project, the challenge was to see whether volunteers would be able to negotiate the entire documentary editing pipeline presented by the tool suite, and TEI encoding was a vital piece of this. Richard Davis describes the goal of achieving a “happy medium” between an intuitive WYSIWYG interface and the exposing of a full set of XML tags and the need to establish a kind of “profile” that precisely fits the job at hand. According to Justin Tonra, factors affecting the selection of tags included the determination that a limited subset of tags was adequate to capture the major authorial operations in the manuscript, concern over the difficulty of the mark-up required of volunteers, and a desire to plan for upgrades to future versions of TEI by limiting the tags to those simple ones which were unlikely to be changed much. On the operational side, Tim Causer believes that “the encoding has deterred a good number of people from participating despite it being made as straightforward as possible to add the tags”. Nevertheless, the quality of the transcripts submitted remains quite high, with the exception of occasional missing end tags.

In addition to the Bentham Transcription Desk, the presenter plans to interview developers and project managers involved in work on the Itineranova-Editor at Stadsarchief Leuven/HKI Cologne, TextLab (used for the Melville Electronic Library) at Hofstra University, T-PEN at St. Louis University Center for Digital Theology, Son of Suda On-Line/, and the Islandora TEI Editor from the University of Prince Edward Island. [The final form of this abstract and its bibliography will be dependent on responses from people involved in those projects.]

Constructing Analytic Data Categories for Corpus Analysis from TEI encoded sources

Serge Heiden, Alexei Lavrentiev

ICAR Research Lab – Lyon University and CNRS

TEI tries to stay independent of the technologies that process it. It is nevertheless bound to the XML technology for the moment, which means that TEI source processing is geared toward XML processing today. In Computer Science terms, this means that the TEI source data model is currently often the XML tree of tags encoding texts or corpora. There are good standard technologies to process the XML data model (XSLT to transform documents, XQuery to extract and transform, etc.) often with the final goal of dynamic HTML display in the TEI context.

But XML is not the only data model available to process textual data, and XML technologies are not necessarily the best ones in terms of efficiency (storage size, processing time) and software engineering (access to available software components for a specific task, availability of programmers, maintainability, etc.) given a specific goal for using TEI sources. It should be noted that XML was initially designed by the W3C for HTML data exchange on the web, and not especially for application processing and optimization.

TEI focuses most of all on describing sources correctly and precisely. It is not its priority or goal to help describe their physical appearance in a display technology, as CSS encoding and the corresponding data categories do. But display is only one type of processing; others also apply to textual sources: browsing a database of texts, full text search of words, collocation statistics, etc. And each of these processing types may use its own data categories.

In the same way as the TEI uses modules to organize various types of tags (e.g. core, header, tagdocs) and attribute classes to group similar attributes and to control data types (e.g. att.datable, att.linking), we have begun to design textual analytical categories to help describe what type of data may need to be processed in TEI sources, for what purpose and how. Those categories are designed to be mappable to data types used by any technology to be used to implement the intended processing, not only XML.

Those analytical categories are used in the form of an “import specification” document associated with the TXM platform. The TXM platform is an open-source framework currently under development and designed for the use and analysis of textual corpora with quantitative and qualitative tools [Heiden, 2010]. Being compatible with XML sources, it can process TEI sources if the relevant analytical data categories are described precisely in the sources using the TXM “import specification” (TXM IS) document. Currently, one has to adapt existing TEI related import modules made up of XSLT and Groovy scripts from the TXM IS. Some TEI constructs are already interpreted by default by the platform and more is to come in the future. As the TXM platform evolves, more data categories will be available to be described in the sources and processed.

TXM IS can be seen as a “translation guide” for the source texts into the analytical categories relevant for corpus analysis. It allows questions to be answered in the form of free text description and provides precise technical information (e.g. XPath of elements or attributes in the source document).

A TXM corpus is composed of one or more *text units* associated with various *metadata* that can be used to build subcorpora or partitions and compare the different parts using statistical tools. These are defined in the first section of the TXM IS. In a corpus of literary works the situation is relatively simple: each work is usually represented by a single TEI file and the relevant metadata (title, author, date of creation, genre, etc.) can be found in the TEI header. In other cases, the definition of textual unit may be more complicated. For instance, in the Bouvard & Pécuchet project, each file corresponds to a page of the archive of Flaubert’s preparatory files for the novel that he had not completed at the time of his death. These pages include notes written by various hands, quotations, etc. To analyze this kind of corpus, it is necessary to determine what kind of “textual units” can be built on the basis of those preparatory pages.
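For the simple case of one TEI file per literary work, the step from header metadata to partitions can be sketched in a few lines of Python. This is not TXM code and does not reproduce the TXM IS format: the `METADATA_PATHS` table, the helper names, and the three-document sample corpus are all hypothetical, and the header paths are only one plausible choice.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical stand-in for an import specification: metadata name -> path in the source.
METADATA_PATHS = {
    "title": ".//tei:titleStmt/tei:title",
    "author": ".//tei:titleStmt/tei:author",
}


def read_metadata(tei_xml):
    """Collect the main metadata of one text unit from its TEI header."""
    root = ET.fromstring(tei_xml)
    values = {}
    for name, path in METADATA_PATHS.items():
        node = root.find(path, NS)
        values[name] = "".join(node.itertext()).strip() if node is not None else None
    return values


def partition(text_units, key):
    """Group text unit identifiers by the value of one metadata field."""
    parts = defaultdict(list)
    for unit_id, tei_xml in text_units.items():
        parts[read_metadata(tei_xml)[key]].append(unit_id)
    return dict(parts)


def sample(title, author):
    """Build a minimal TEI document for the demonstration corpus."""
    return (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>'
        "<title>%s</title><author>%s</author></titleStmt></fileDesc></teiHeader>"
        "<text><body><p/></body></text></TEI>" % (title, author)
    )


corpus = {
    "bovary": sample("Madame Bovary", "Gustave Flaubert"),
    "bouvard": sample("Bouvard et Pécuchet", "Gustave Flaubert"),
    "candide": sample("Candide", "Voltaire"),
}
print(partition(corpus, "author"))
```

The page-per-file case described above would need an extra step that first assembles pages into text units before any partition is built.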

In addition to the *main metadata* used to organize the text units inside the corpus, it is possible to take into account some *secondary metadata* that will be used to build references for corpus citations and to run queries on specific parts of textual units (e.g. direct speech of characters or quotations vs. the author’s narration).

Modern electronic editions used as source texts for TXM corpora can hardly be reduced to “raw text” (series of characters and white spaces) without loss of substantial information for corpus analysis. They contain more or less complex editorial markup (notes, corrections, variant readings) and can include text spans in different languages (from several words or phrases in a foreign language to systematic translation of the whole text more or less finely *aligned* with the primary language).

In order to ensure proper functioning of NLP tools and to obtain correct results in counting words, searching collocates, displaying contexts, etc., it is necessary to separate the *main text surface* in a given language from alternative readings, editorial notes, etc.

For instance, the TEI <choice> element is used to encode alternative encodings of a single source text span. It is necessary to specify in the TXM IS which “child” of <choice> will be considered as the default text surface and which one(s) will be filtered out or recorded as a particular property of a word from the main text surface. In TXM IS terms, these are defined as *secondary text surface*. Some elements can be straightforwardly eliminated from full-text search indexes (e.g. teiHeader content and editor’s notes in an edition of an Old French text). These elements are identified as *off text* in the TXM IS.
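The selection of a main text surface can be sketched as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree`, not the TXM import modules (which the text says are XSLT and Groovy scripts). The preference order in `PREFERRED`, the `OFF_TEXT` list, and the sample sentence are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed policy: the first of these children found in a <choice> is kept.
PREFERRED = ("reg", "corr", "expan")
# Assumed "off text" elements, dropped from the main text surface.
OFF_TEXT = ("note", "teiHeader")


def local(tag):
    """Strip the namespace from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def surface(elem):
    """Return the main text surface contained in elem (without its tail)."""
    name = local(elem.tag)
    if name in OFF_TEXT:
        return ""
    if name == "choice":
        children = list(elem)
        for wanted in PREFERRED:
            for child in children:
                if local(child.tag) == wanted:
                    return surface(child)
        # No preferred child: fall back to the first alternative, if any.
        return surface(children[0]) if children else ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(surface(child))
        parts.append(child.tail or "")  # text after the child belongs to the parent
    return "".join(parts)


# Invented sample paragraph.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">Li '
    "<choice><orig>roy</orig><reg>roi</reg></choice> "
    "vint<note>editorial remark</note> a Paris.</p>"
)
print(surface(ET.fromstring(SAMPLE)))
```

In a fuller treatment the discarded alternative (here the `<orig>` reading) would be recorded as a property of the word rather than thrown away, as the paragraph above describes for the secondary text surface.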

In TXM, text units are divided into *smaller structural units* (divisions, paragraphs or verse lines, sentences, words, etc.). All of these can be associated with some metadata. Some structural units can be pre-coded in the source texts, others are automatically added during the import process (typically, word and sentence tags). It should be noted that the only obligatory intra-textual unit is the word (*lexical unit*). The words are usually tagged using appropriate tokenization algorithms during the import process but explicit pre-coding of some or all words and punctuation marks (using TEI <w> and <pc> tags) is also possible with TXM.

If the source text contains word-internal tags (e.g. <ex> for letters added in abbreviation expansions or word-breaking <lb/>), it is particularly important to describe their usage precisely in the TXM IS. The Base de Français Médiéval encoding guidelines require that all words containing internal tags be pre-coded using the <w> tag. In other cases specific rules for processing word-internal tags may need to be worked out and applied as the first step of the import process.

Tagging *sentences* is optional, but it may be very helpful to improve the results of *annotation by NLP tools* and to permit precise queries on collocates. However, the risk of overlap between sentences and pre-coded structural elements is high. More details on the tokenization and sentence tagging in complex TEI-encoded editions of primary sources can be found in [Heiden & Lavrentiev, 2012].

Any structural unit or tagged text span of the source text can be used in queries and in contrasting parts of the corpus.

Although TXM is not primarily designed as software for the visualization of digital *editions*, it allows reading the source text in a convenient paginated form with all possible rendering effects provided by HTML and CSS technologies. Additional features include displaying linguistic annotation of each word, synoptic visualization of two or more text versions or primary source images. All the desired rendering effects and the pagination rules must be specified in the TXM IS. By default the TEI <pb/> tag will be used for pagination and a maximum number of words per page rule will apply if there are no page breaks or if they are situated too far from each other.
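The default pagination rule can be sketched as a small function. This illustrates the rule as described above and is not TXM code: the `PB` sentinel, the function name `paginate`, and the default limit of 500 words are assumptions.

```python
PB = object()  # stands in for a <pb/> milestone in the token stream


def paginate(tokens, max_words=500):
    """Split a token stream into pages.

    A new page starts at every PB marker, and also whenever the current
    page reaches max_words (the fallback when page breaks are missing
    or too far apart).
    """
    pages, current = [], []
    for tok in tokens:
        if tok is PB:
            if current:  # avoid emitting empty pages
                pages.append(current)
                current = []
            continue
        current.append(tok)
        if len(current) == max_words:
            pages.append(current)
            current = []
    if current:
        pages.append(current)
    return pages


print(paginate(["a", "b", PB, "c", "d", "e", "f"], max_words=3))
```

With a limit of three words, the stream above yields one page closed by the page break and two pages closed by the word limit.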

The current TXM IS system has already been proven through successful application to four different TEI encoded corpora for their import into TXM for analysis and display:
– the BFM corpus;
– the BVH Epistemon corpus;
– the Bouvard & Pécuchet corpus;
– the free Frantext corpus.

The TXM IS will evolve as other corpora are imported into TXM for analysis and as TXM itself evolves with the incorporation of new tools.


Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In Proc. of PACLIC24 (p. 389-398). Sendai, Japan.

Heiden, S., Lavrentiev, A. (2012). The TXM Portal Software giving access to Old French Manuscripts Online. In Proc. of LREC 2012, Istanbul, Turkey.

Automatic Linguistic Annotation with TEI-Output

Damir Cavar

Eastern Michigan University, United States of America;

In this paper we describe our attempt to create an interoperable and more generic approach to linguistic markup of TEI-XML encoded natural language corpora. We use NLP components and wrappers that convert their output formats to TEI P5 XML. We have integrated various NLP components, in particular the Stanford CoreNLP tools, in such an automatic annotator, covering tokenization, sentence recognition, lemmatization, part-of-speech tagging, named entity recognition and syntactic parsing. We present the general strategies, the web-service architecture, as well as technical issues and performance evaluations.

The LINGUIST List Corpus: A Large Mailing List Corpus – Management, Annotation and Repository

Damir Cavar, Malgorzata E. Cavar, Helen Aristar Dry, Anthony Aristar


The LINGUIST List corpus (LLC) is a continuously growing mailing list corpus that contains professional linguistic content. We present the processes of corpus creation and maintenance, linguistic analysis and annotation, and storage and retrieval infrastructure for the LLC and other such restricted domain professional mailing lists. We make use of a hybrid storage and data structure that is used in the interface that indexes TEI P5 XML formatted mailings.

Why and how to encode word structures and word-formation formulae of a word family dictionary? A proposal based on <etym>.

Werner Wegstein1, Heike Stadler2

1University of Wuerzburg, Germany; 2Institut für Deutsche Sprache, Mannheim;

In a collaborative research project on ‘Relationships between linguistic and bioinformatic techniques, methods and algorithms: modeling and visualizing variance in language and genomic structures’, funded by a research initiative of the German Ministry for Education and Research on ‘Interaction between the Natural Sciences and the Humanities’, we try to develop a valid TEI encoding structure for Jochen Splett’s monumental word family dictionary for the present-day German language: Jochen Splett, Deutsches Wortfamilienwörterbuch. Analyse der Wortfamilienstrukturen der deutschen Gegenwartssprache, zugleich Grundlegung einer zukünftigen Strukturgeschichte des deutschen Wortschatzes. 18 volumes, Berlin/New York, 2009. It is based on the German Duden dictionary in eight volumes, published between 1992 and 1995, and references the form, sense and intricate word-formation structures of about 160,000 entries, organized in 8264 word family cores, in alphabetical order from A to ZYT-.

In doing so we want to test the usability of word family structures for modeling and visualizing variance (and combinability) in the lexical structure of the German language.

The standard dictionary structures pose no problem. We suggest using the element <etym> to encode information about word structures and word-formation formulae.

Clarin-D: TEI for Egyptologists

Norman Fiedler, Andreas Witt

Institut für Deutsche Sprache, Germany

This proposal identifies strategies for applying TEI-based solutions to highly specialized disciplines in the humanities, choosing Egyptology as an example. Since disciplines like Egyptology are not well positioned with regard to staff and funding, they often struggle with implementing complex technical solutions for their data in addition to performing their primary research activities.

As a member of the German section of the European joint research project CLARIN, which aims to support digital data and tools in the humanities and social sciences, the German Language Research Institute (Institut für Deutsche Sprache, Mannheim, Germany) will thus cooperate with the Heidelberg Institute for Egyptology to bring modern research infrastructures to the field.

Although Egyptian hieroglyphic signs have already been added to the Unicode standard (13000–1342F), which has so far provided a sufficient basis for the relevant hieroglyphic character set and is supposed to be extended in the future, one of the principal technical challenges in Egyptology has not been met yet: reliable machine readability of hieroglyphic text editions.

The next step would be the transformation of Egyptological editions into TEI. These usually consist of original text visualizations – available either as monumental hieroglyphic inscriptions or cursive hieratic handwriting – by means of a standardized computer-generated character set (transliteration) and its interpretation using a phonetic notation system (transcription).

While the transliteration may well be realized by Unicode, the implementation of the transcription into TEI poses some challenges:

1. Most digital transcriptions are encoded in proprietary formats

2. There is no standardized transcription system used in Egyptology

3. The standard has to be flexible enough to be adapted to the several stages of Ancient Egyptian (Old, Middle, New Egyptian, Demotic)

4. A set of annotations has to be created to contain information about shape, position and, most important, meaning (i.e. defective writing, determinatives, etc.) of single signs.

5. Furthermore, those annotations should provide information about spelling. Hence a nomination system has to be developed for every single hieroglyphic character.

In the end, transforming the variety of proprietary formats now in use into one TEI standard would have to be accomplished by Egyptologists themselves. CLARIN is in a unique position to support researchers in achieving this unified standard.

Fully harmonizing the transcription systems, especially in scientific publications, is a task too ambitious to start with. But the use of TEI could be a catalyst in the adoption of a common standard.

Therefore we recommend alignment with the most conventional transcription standards known from the lexicographic corpora of the Berlin Thesaurus Linguae Aegyptiae when implementing a TEI standard. This convention works by paraphrasing complex special characters using simple fonts (X = x; o = a; # = A etc.). Additionally, an apparatus of text-critical symbols has to be established to mark grammatical elements like pronominal suffixes, prefixes, endings or omission and admission respectively.


For the identification of hieroglyphic characters within the annotations, this system may indeed be based upon the existing Unicode denomination, although from an egyptological point of view the much more common Gardiner reference list might be worth considering.

Curation, Exploration, and Collaboration in an Age of Big Data

Martin Mueller

Professor of English and Classics, Northwestern University

If you are an early modern scholar, the most interesting news of the decade from the TEI front will be the progressive release into the public domain of a TEI-encoded version of just about every distinct book published before 1700. This will take place between 2015 and 2020. There is no archive anywhere that is comparable in terms of scale and comprehensiveness. How to explore and curate this resource will be a major challenge requiring new forms of collaboration between scholars, librarians, IT professionals, and the public at large, ranging from ambitious high school students to retirees with time on their hands and the desire to be useful. The tasks include the correction of errors, the creation and coordination of metadata at the level of bibliography, element structure, as well as words and sentences, the development of routines for human/machine interaction that will support collaborative work, and last, but not least, the creation of corpus query tools that will support the exploration of the data and metadata separately and in combination.

TEI for Linguists: progress and perspectives

Piotr Banski2,1, Andreas Witt2

1University of Warsaw, Poland; 2Institut für Deutsche Sprache, Mannheim;

This presentation has as its aim the introduction of the SIG “TEI for Linguists” to those to whom it may be new, to report on its progress, and to invite the conference participants to its meeting, which, following the (fresh) tradition of Würzburg, will primarily consist of micropresentations loosely concerning the very broad topic of applying the TEI to things linguistic, be it linguistic articles that want to nicely encode and then visualize syntactic trees or feature matrices without the overhead of SVG, or full-fledged language resources such as various types of corpora (including treebanks and speech corpora), or various types of lexica (from simple glossaries to either robust end-user oriented dictionaries or equally robust but machine-oriented linguistic databases).

Apart from the report on the completed and ongoing initiatives of the SIG, such as:

* putting together a sizeable bibliography of articles related to the SIG’s field of interest (together with explaining the rationale for it and its guiding principles),

* putting together a special volume of the jTEI devoted to the SIG’s field of interest,

* introducing modifications into the upcoming (June-2012) release of the TEI Guidelines designed to facilitate their application to linguistic markup and explaining *how* they can be used and *why* they should be used (this also relates to making TEI-encoded resources interoperable within the framework proposed by ISO TC37 SC4),

we are also going to introduce the new experimental space for the LingSIG, meant to be both the ground for preparing more serious modifications of the Guidelines as well as to become a virtual home for the SIG members, who currently have no way to semi-officially congregate and cooperate. This new space, hosted at SourceForge, awaits the newest release of the Guidelines to branch off of it.

Being able to present this material as a full-time talk is, we believe, quite essential here: there are too many details (especially where the rationale for, and the application of, the new additions to the Guidelines is concerned) for this to be presentable as a poster, and, naturally, too much material for a micropresentation.

Delivering the Digital Edition: time and space on the screen

Elena Pierazzo

King’s College London, United Kingdom;

The introduction of elements for the encoding of documentary and genetic editions in December 2011 represents the biggest change to the TEI since the introduction of P5 (2007) and conceptually perhaps the biggest change since the invention of the TEI altogether. In fact the concept of text as abstraction and semantic unit has been now joined by the concept of text as ensemble of “geographically” defined traces of ink (or other materials) over a surface. This development has arrived after a journey of (at least) four years that has involved about thirty people between Europe and North America. We can now begin to evaluate this endeavour: was it worth it? Is it working? What is its scholarly significance? And, more to the point: which options does it enable?

In the months following the introduction of <sourceDoc>, <metamark> and the other new elements I have been experimenting with the new encoding system, working also on how to present such encoding in a captivating, engaging way, with the purpose of stretching the new encoding to its limits and investigating ‘in the field’, so to speak, the real possibilities offered to scholars. The results have been very encouraging.

The experiment has been conducted over a handful of pages from a notebook of Proust, namely Cahier 46, with the substantial input of a Proustian scholar, Julie André. The transcription has been conducted zone by zone, encoding also the presumed order in which such zones were progressively written by Proust. The output plots on-click the zones on top of the digital facsimiles, in the order in which they were presumably written or in the order according to which the final version is meant to be read. From a technical point of view, the output makes use of a combination of SVG (mainly), JavaScript, jQuery and HTML. The prototype is freely accessible on the web and the XML, the ODD and the XSLT script are all available for download. An embryonic version of this prototype was presented at the conference Proust, l’œuvre des manuscrits, which was held in Paris, 1-2 March, 2012. The conference was organized by the “Equipe Proust” of ITEM-CNRS (Institut des Textes et Manuscrits modernes), with funding by the ANR Program “CAHIERS-PROUST” (Nathalie Mauriac Dyer, ITEM, dir.). A more advanced version of the prototype has been used as a base for a workshop held during the first Australasian Digital Humanities conference. In all these venues the prototype has received very encouraging feedback.

The introduction of dimensions of space and time into transcription cannot be described in any way other than “revolutionary”, despite the risk of being too rhetorical. From an encoding point of view, the transcription of zones does not have to follow any particular order, apart, that is, from the one that is most convenient for the encoder, and this without attributing any semantic meaning to the order in which the transcription is actually made. The provision of coordinates to each zone makes the handling of any physical (i.e. on the page) overlapping of zones trivial: each zone can be considered a different layer which can be successively plotted onto the surface, creating any desirable configuration (and I salute here the relaxation of the rules that have made it possible to have polygonal and overlapping zones).
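The layering idea can be illustrated with a short sketch that turns zones with coordinates into stacked SVG polygons, ordered either by presumed writing sequence or by reading sequence. This is not the prototype's code (which is described as SVG, JavaScript, jQuery and HTML driven by XSLT). The zone dictionaries, the keys `written` and `read`, and the function name `zones_to_svg` are invented for the example.

```python
from xml.sax.saxutils import quoteattr

# Invented sample: three polygonal zones with two alternative orderings.
ZONES = [
    {"id": "z1", "points": [(0, 0), (100, 0), (100, 40), (0, 40)], "written": 2, "read": 1},
    {"id": "z2", "points": [(0, 30), (60, 30), (60, 90)], "written": 1, "read": 3},
    {"id": "z3", "points": [(50, 50), (120, 50), (120, 80), (50, 80)], "written": 3, "read": 2},
]


def zones_to_svg(zones, order_key, width, height):
    """Emit one <polygon> layer per zone, sorted by the chosen ordering.

    The transcription order of the zones is irrelevant: only the value
    stored under order_key decides the plotting sequence, and overlapping
    zones simply become overlapping layers.
    """
    layers = []
    for zone in sorted(zones, key=lambda z: z[order_key]):
        points = " ".join(f"{x},{y}" for x, y in zone["points"])
        layers.append(f'<polygon id={quoteattr(zone["id"])} points="{points}"/>')
    body = "".join(layers)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        f"{body}</svg>"
    )


svg_written = zones_to_svg(ZONES, "written", 120, 90)
svg_read = zones_to_svg(ZONES, "read", 120, 90)
print(svg_written)
```

The same zone data yields two different layer sequences, one per ordering, which is what allows a display to switch between the genetic and the reading view of a page.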

From a scholarly point of view, the handful of pages that have been encoded and output have demonstrated that not only is this new presentation system able to display the manuscript content in a meaningful way, but the fact that the textual fragments (the zones) are kept in the context in which they were inscribed makes the writing and authoring process intelligible in an intuitive way. The latter achievement was made possible by the introduction of the temporal dimension, thanks to the <changeSet> element which enables scholars to time any element (zone, lines, and even surfaces) within the transcribed document.

This mechanism is then able to capture writing as a process, and not simply as the result of that process, as traditional diplomatic and ultradiplomatic transcription have done so far. The investigation of the writing process is indeed a very promising and fecund area of study which has been only represented so far by erudite discussion and essays and by very few print editions (the Brepols editions of Proust’s Cahiers being one exception).

On balance this experience has been very positive, but it has also outlined a few drawbacks. For instance the handling of uncertainty in the ordering of the changes is very rudimentary. The prototype makes a very idiosyncratic use of the <certainty/> element for that purpose, but this system seems totally inadequate to handle uncertainty of entire sequences or the handling of alternative sequences of changes.

On the output side, more research needs to be done on the way we present manuscripts online: putting pages and even transcribed pages alongside facsimiles is not enough to engage users to read and take advantage of the scholarship offered by digital editions. The prototype represents an attempt to make the user experience more enjoyable and fun, but certainly this is only a very first tim

Encoding Financial Records for Historical Research

Kathryn Tomasek1, Syd Bauman2

1Wheaton College, United States of America; 2Brown University, United States of America;


“[A]ccount books,” business historian Judith McGaw noted in 1985, “are among the most common but least accessible primary sources for historians…”. In the intervening twenty-six years, historians have made use of these rich sources to produce a range of articles and monographs, but few digitization projects have tackled financial records directly.

This paper reports on work in progress conducted by a group of historians, archivists, and technologists in the United States who have taken initial steps towards developing standards for markup and metadata for manuscript financial records using TEI. Drawing on the expertise developed by pioneering projects at the Massachusetts Historical Society, the University of Virginia, and MIT, this group met at Wheaton College in August 2011 to describe the current state of affairs through discussion of problems and case studies, to propose some paths forward, and to identify model projects that can be fostered to test those paths.

This paper will use examples from the Wheaton College Digital History Project to sketch out the problem space and to demonstrate mock-ups of two possible solutions: embedded markup using (among others) the TEI <measure> element, and stand-off markup using a new tagset, a so-called “transactionography”. It also highlights the need for further research and development to suggest methods better suited to expressing some of the more complex semantic values within historical financial records.

Features and Challenges of Financial Records in Historical Research

A genre of primary sources that includes such materials as bills and receipts, ledgers, and cashbooks, financial records are abundant in traditional archives, but current digitization projects either do not record the financial semantics of the records, or if they do, they have yet to develop a common method for fully expressing these semantic values. For example, the Railroads in the Making of Modern America Project at the University of Nebraska includes a few transcriptions of payroll records for railroad employees, and the Bethlehem Digital History Project includes a few extracts from business ledgers. In these cases, the financial information is neither comprehensive nor presented in a manner leverageable by researchers. Editors at the Massachusetts Historical Society, who have encoded the Farm Book of Thomas Jefferson using TEI, are dissatisfied with the limits of the current system. Editors of the Papers of George Washington at the University of Virginia are transcribing financial papers into a relational database.

Historical financial records contain numerous semantic details that go far beyond the standards for double-entry accounting described in John Mair’s influential eighteenth-century textbook, Book-keeping Methodiz’d; or, A methodical treatise of Merchant-accompts, according to the Italian form, 8th ed. (1765). Perhaps the most significant challenge in dealing with historical financial records is their apparent regularity. Manuscript financial records tend to be tabular in format, but they also tend to include information that cannot be represented through simple transcription of tabular layout. Such records in fact include significant variations and idiosyncrasies. In the eighteenth and nineteenth centuries, the impulse to record expenditures, transactions, and other such events related to exchanges of cash, property, and goods and services did not always translate into perfect adherence to Mair’s principles.

Some authors of historical financial records did not even attempt to employ double entry bookkeeping. Even very regular documents like the printed forms used by the Comédie Française include additional types of information such as cast lists on versos that extend beyond the theatre receipts recorded on rectos. Editors at the Massachusetts Historical Society have noted that John Adams included a list of expenditures for travel to and from Philadelphia during the Revolutionary War in the midst of simple prose in one of his diaries. Historical financial documents nevertheless represent efforts to keep records with some attention to structure, in many cases in reference to the principles of double entry bookkeeping.

Still other significant challenges in dealing with historical financial records lie in the question of standardization in representation of amounts of currency, commodities, and other items counted. In the case of currency, for example, standardization would seem desirable to make amounts processable across collections being digitized. But determining which standard to use presents a puzzle. Should references to a certain amount of a given currency be merely regularized, be normalized to a standard contemporaneous currency, be normalized to a standard modern currency, or be normalized to a reference currency?

Another quandary appears in the case of commodities traded. While each project could certainly create its own taxonomy for commodities mentioned in its collections, such a course would severely decrease the utility of the data produced for analysis across projects. Ideally, a controlled vocabulary shared across projects would maximize the value of harvestable data.

Some standardized data has long been available to social science historians. The Minnesota Population Center’s Integrated Public Use Microdata Series provides harmonized data for U.S. and international population studies. The National Historic Geographic Information System provides United States historical geography, as does the Great Britain Historical Geographic Information System for Britain. And while a strong tradition of monographs and journals in economic history has resulted from investigation of archival financial records, no comparable project exists for their digitization.

In our survey of the field, we have yet to find a historically sensitive standardized vocabulary to recommend for financial records. The Harmonized System established by the World Customs Organization, for example, does not seem optimal because it is designed to represent contemporary commodities. While this Harmonized System includes a code for muslin, for instance, it is unlikely to provide one for saltpeter. In addition, the nomenclature tool for this system is behind a paywall and is thus inaccessible to most academics.

Possible Paths

The paper will include markup of sample documents from the Wheaton College Digital History Project, including

• a boarding receipt from 1862 that records a single transaction;

• a laundry list from 1862 that records a single transaction involving numerous items;

• a boarding account from 1862 that records itemized daily charges for meals connected to the boarding receipt described above;

• several pages from a more complex daybook for the period between 1828 and 1859 that include multiple transactions and references to account numbers recorded in a separate ledger.

These files demonstrate both the utility of the TEI <measure> element and the use of the “transactionography” for simple financial documents, and some of the challenges presented by more complex documents produced in the practice of manuscript double entry accounting.

Double entry bookkeeping as developed in Venice during the Renaissance and described by Mair in the eighteenth century represents a specialized vocabulary, a professional jargon that requires data modeling with attention to the special meanings of the terms “debtor” and “creditor” in this language, as well as the relationships between transactions recorded in the daybook and accounts kept in a separate ledger. We are developing a TEI customization for conveying such meanings and their expressions within the journals and ledgers of double entry account books through a “transactionography” that will represent the relationships among such records in abstracted form.

As currently conceived, a “transactionography” models transactions as a sequence of one or more transfers of anything of value from one account to another. Thus, the simple purchase of a candy bar from a convenience store is represented as two <transfer>s: one of a candy bar from the vendor’s stock account to the buyer, and one of $1.25 from the buyer’s cash account to the vendor.
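The transfer model described above can be sketched as a small data structure. This is an illustration of the idea only, not the authors' ODD or tagset: the class names, the field names, the account labels and the `balance` helper are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transfer:
    """One movement of something of value from one account to another."""
    source: str
    target: str
    quantity: Decimal
    unit: str  # hypothetical field: what is being counted


@dataclass(frozen=True)
class Transaction:
    """A sequence of one or more transfers."""
    transfers: tuple

    def balance(self, account, unit):
        """Net change for an account in a given unit (hypothetical helper)."""
        total = Decimal(0)
        for t in self.transfers:
            if t.unit != unit:
                continue
            if t.target == account:
                total += t.quantity
            if t.source == account:
                total -= t.quantity
        return total


# The candy bar purchase from the text, modeled as two transfers.
purchase = Transaction((
    Transfer("vendor:stock", "buyer", Decimal(1), "candy bar"),
    Transfer("buyer:cash", "vendor", Decimal("1.25"), "USD"),
))
print(purchase.balance("buyer:cash", "USD"))
```

Keeping the unit on each transfer leaves open the questions raised earlier about regularizing or normalizing currencies and commodities: amounts are recorded as stated and can be converted in a later step.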

We believe that this model is sufficient to represent double entry bookkeeping, though we have not yet tested it thoroughly. We have a (working) ODD file for a first cut at such a “transactionography”, and we hope to have a more refined version available at the TEI meeting in fall 2012.

Long Term Goals

Standardized digitization of this rich yet currently inaccessible genre of manuscript historical records has the potential to open significant new lines of inquiry about economic, social, and cultural history. With extensive application of standardized markup to such records from diverse places, researchers could compare data about continuities and changes in such factors over time and space. Digital collections created according to such standards could yield a large volume of harvestable data for examination of economic, social, and cultural conditions across national lines.

Aggregated data from digitized collections of financial records could reveal significant new information about the history of daily life across time and space. We now seek to stretch the TEI Guidelines to broaden access to manuscript financial records.

Note: Activities described in this proposal received funding from the National Endowment for the Humanities in the United States. Any views, findings, conclusions, or recommendations expressed in this proposal do not necessarily reflect those of the National Endowment for the Humanities.

Selected Bibliography

Note: We are grateful to John J. McCusker for sharing his bibliography of accounting in early British America, of which excerpts are included here.

Adams Family Papers: An Electronic Archive, Massachusetts Historical Society.

Baxter, William T. “Observations on Money, Barter, and Bookkeeping,” Accounting Historians Journal, XXXI (June 2004), 129-139.

Bethlehem Digital History Project.

Bloom, Robert, and John Solotko. “Barter Accounting in the US during the Late Eighteenth and Early Nineteenth Centuries,” Accounting History, New Series, IX (March 2004), 91-108.

Bloom, Robert, and John Solotko. “Elucidating Needs, Lifestyles, and Community: Researching a Late Eighteenth-Century Account Book from Lexington, Virginia,” Accounting History, New Series, XIII (August 2008), 333-352.

Burnard, Lou and Syd Bauman, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. 2.0.2. 2012-02-02T17:24:24Z.

Comédie Française Register Project, Massachusetts Institute of Technology.

Great Britain Historical Geographic Information System (GBHGIS).

Hollister, Joan, and Sally M. Schultz. “The Elting and Hasbrouck Store Accounts: A Window into Eighteenth-Century Commerce,” Accounting History, New Series, XII (November 2007), 427-440.

Jefferson, Thomas. Farm Book. Massachusetts Historical Society.

Mair, John. Book-keeping Methodiz’d; or, A methodical treatise of Merchant-accompts, according to the Italian form. Wherein the theory of the art is fully explained,… To which is added, a large appendix. … 8th ed. Gale ECCO Print Edition reproduced from the National Library of Scotland. Edinburgh: printed by W. Sands, A. Murray, and J. Cochran, for W. Sands, A. Kincaid & J. Bell, and A. Donaldson, 1765.

McGaw, Judith A. “Accounting for Innovation: Technological Change and Business Practice in the Berkshire County Paper Industry,” Technology and Culture, 26/4 (October 1985), 703-725.

The Minnesota Population Center’s Integrated Public Use Microdata Series (IPUMS) and National Historic Geographic Information System (NHGIS).

Railroads in the Making of Modern America, University of Nebraska.

Schultz, Sally M. and Joan Hollister. “Jean Cottin, Eighteenth-Century Huguenot Merchant,” New York History, LXXXVI (Spring 2005), 133-167.

World Customs Organization. Harmonized System.

Conversation, Translation, Materiality: English Graduate Student Reflections on TEI

Victor Del Hierro, Bailey Kelsey, Kate Ozment, Laura Smith, Christina Sumners
Texas A&M University

In Fall 2012, students in an English graduate seminar focused on recovering writers and texts excluded from the traditional literary canon created a TEI encoding of a poem by a 19th-century woman writer. The students, most of whom had no TEI training prior to this semester, will present their reflections not only on the process of learning TEI, but also on the ways in which that process raised important, often discipline-specific, issues and concerns. The topics to be addressed include what is revealed or obscured by the technology of the book; how TEI functions as a re-embodiment of the material text; and TEI encoding in relation to questions of access and inclusion in digital scholarship.

Romantic Circles and the Problem of Legacy Data; or, How to Make a Silk Purse from an HTML Site

David Rettenmaier, Michael Quilligan

Romantic Circles / University of Maryland, United States of America

The logic behind the founding of the website Romantic Circles in 1995 falls under what we may now refer to as the paradigm of hypertext. It was part of those heady days when freedom from print meant not only freedom from the page limitations inherent in traditional print editions and journals, but also from the so-called linearity of book reading. In the intervening 15-plus years, the breathless rhetoric of that moment has shifted, and digital humanities has developed a much more codified set of practices that emphasize the richness and persistence of data. In large part, Romantic Circles remains a fundamentally static, HTML-driven site, full of the idiosyncrasies that characterize the “sandbox” mindset of early digital scholarship, in which hundreds of editors and contributors have each had a hand in creating digital scholarship and its underlying code. In consequence, as the most recent site managers at RC, we have been presented with a number of unique opportunities and challenges, namely how to re-envision the site as semantically rich and dynamic, while also fostering a richer experience both for casual visitors and for content creators.

A representative case in point is the early Romantic Circles electronic edition of Mary Shelley’s The Last Man, edited by Steven Jones (created circa 1997). Containing more than 1,400 files, the edition is replete with paratexts: a bibliography, maps, image and audio files, a chronology of Mary Shelley’s life, a selection of related writings by Mary Shelley and other authors, and several full-text critical essays. In its current iteration, the edition presents these varying types of content in two ways: 1) through a table of contents and menu system and 2) through basic hypertextual linking within the text of Shelley’s novel. As effective as it has been over its 15-year existence, the edition suffers from a dearth of semantic data. What we’ve realized a new Romantic Circles needs is finer and finer grains of data: in this case, an edition that maintains a similar navigational structure but that also allows users to find content by date, by type (audio file, image, etc.), by author, and by location, and that gives easy access to associated content both within and beyond the edition via taxonomic keywords. But how to accomplish this task, not only for the edition of The Last Man, but across the entire site?

For solutions, we’ve looked not only to Content Management Systems (CMS) such as Drupal, but also to the TEI. Since the vast majority of the approximately 15,000 pages on the site are written in HTML, with its intermixing of presentation and content, TEI markup allows us not only to take presentation out of the equation, but also to develop a content-specific Romantic Circles “house style,” thus standardizing our encoding practices so that legacy problems are minimized for the future. We’re also developing ways to leverage our TEI encoding, as well as our legacy metadata, in a database-driven architecture. At this juncture the TEI, along with our in-house XSLT transformations, becomes crucial: they act as interchange points that allow us to regularize both our current and legacy data for migration into a CMS (Drupal). This process also allows us to use the TEI as a metadata format, drawing out, for example, the people and places mentioned in a given text as taxonomic keywords that can, in turn, be used to associate otherwise disparate pages in the CMS front end. The object of the CMS framework is to increase the granularity of Romantic Circles data. In CMSs like Drupal, the fundamental unit of data is the database field rather than the Web page. Hence, an individual page can be composed of many “nodes” served up by the database environment. This allows for easy customization of the site and facilitates wholesale changes to the site’s menu, header, footer, and sidebar without changing individual pages. While this was doable in static HTML using include files, in this case the navigational elements of the page are completely separated from the content, a separation analogous to the move away from HTML because of its mixing of presentation and content.
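The idea of using TEI as a metadata format can be illustrated with a minimal Python sketch. It is not the Romantic Circles XSLT pipeline, which is not described here; it only assumes that people and places are tagged with the standard TEI elements persName and placeName. The sample document and the function name extract_keywords are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; the prefix "tei" is only a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document, not taken from the Romantic Circles edition.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p><persName>Mary Shelley</persName> published <title>The Last Man</title>
      in <placeName>London</placeName> in 1826.</p>
      <p>The closing chapters follow the narrator to <placeName>Rome</placeName>;
      <persName>Mary
      Shelley</persName> had lived in Italy.</p>
    </body>
  </text>
</TEI>"""


def extract_keywords(tei_xml):
    """Return de-duplicated, sorted names of people and places in a TEI string."""
    root = ET.fromstring(tei_xml)
    keywords = {}
    for label, tag in (("people", "persName"), ("places", "placeName")):
        names = set()
        for element in root.iterfind(f".//tei:{tag}", TEI_NS):
            # Join nested text and collapse line breaks and repeated spaces.
            name = " ".join("".join(element.itertext()).split())
            if name:
                names.add(name)
        keywords[label] = sorted(names)
    return keywords


if __name__ == "__main__":
    print(extract_keywords(SAMPLE_TEI))
```

Keyword lists of this kind could then be loaded into a CMS as taxonomy terms. Matching variant spellings of the same person or place would need an authority file or key attributes, which this sketch does not attempt.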

This presentation will discuss the problems inherent in conforming legacy content to modern encoding standards, and will also detail the ways in which we’ve leveraged the semantics of TEI for a modern rebuild of the site, including the structuring of metadata; the development and use of visualization tools like prosopographies, geotagging, and collation software; and the incorporation of Romantic Circles into aggregator sites like NINES. We will also demonstrate key features of the new database-driven Romantic Circles and its XML foundation.

The Project Gutenberg book archive as a TEI P5 XML text corpus

Malgorzata E. Cavar1, Damir Cavar2

1Institute for Language Information and Technology; 2Eastern Michigan University, United States of America

The Project Gutenberg book archive is an invaluable source of natural language data in the form of books in many languages. Converting the Gutenberg archive to the TEI XML format has been a goal of ours for years. Our objectives are to annotate it with metadata and with linguistic information, and to make the content available not only as a searchable and researchable collection of books, but also as a text corpus for linguistic analysis and as a resource for qualitative and quantitative linguistic models. These goals involve not only the annotation and conversion process, but also the development of new interfaces for various quantitative and qualitative studies of the content. In the following we describe our approach to the corpus creation, the automatic and semi-automatic conversion process, and the initial online interface for the corpus analysis.
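As a rough illustration of what an automatic conversion step might look like, the following Python sketch wraps a plain-text book in a minimal TEI P5 document. It is not the authors' converter: the function name text_to_tei, the placeholder header text, and the rule that blank lines separate paragraphs are all assumptions made for this example. Real Project Gutenberg files also carry license boilerplate, chapter headings, and other structure that this sketch ignores.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI as the default namespace (no prefix on element names).
ET.register_namespace("", TEI_NS)


def _q(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def text_to_tei(title, author, plain_text):
    """Wrap plain text in a minimal TEI document, one <p> per paragraph."""
    tei = ET.Element(_q("TEI"))

    # Minimal header: titleStmt, publicationStmt, and sourceDesc in fileDesc.
    file_desc = ET.SubElement(ET.SubElement(tei, _q("teiHeader")), _q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, _q("titleStmt"))
    ET.SubElement(title_stmt, _q("title")).text = title
    ET.SubElement(title_stmt, _q("author")).text = author
    publication = ET.SubElement(file_desc, _q("publicationStmt"))
    ET.SubElement(publication, _q("p")).text = "Converted automatically from plain text."
    source = ET.SubElement(file_desc, _q("sourceDesc"))
    ET.SubElement(source, _q("p")).text = "Plain-text e-book."

    # Assumption: paragraphs are separated by one or more blank lines.
    body = ET.SubElement(ET.SubElement(tei, _q("text")), _q("body"))
    for block in re.split(r"\n\s*\n", plain_text.strip()):
        paragraph = " ".join(block.split())  # undo hard line wrapping
        if paragraph:
            ET.SubElement(body, _q("p")).text = paragraph

    return ET.tostring(tei, encoding="unicode")


if __name__ == "__main__":
    sample = "First paragraph,\nwrapped over two lines.\n\nSecond paragraph."
    print(text_to_tei("Sample Title", "Sample Author", sample))
```

Linguistic annotation such as tokenization or part-of-speech tagging would be layered on top of a skeleton like this one.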

TEI and the Early Modern OCR Project (eMOP)

Laura C Mandell

Texas A&M University, United States of America

Collectively, scholars in the US, the UK, and around the globe face a problem: rare books and pamphlets from the early modern era that have not yet been made available digitally threaten to become invisible to future scholars. Finding materials in special collections has never depended on metadata alone. Insofar as finding aids and collection catalogs are supplanted by digital databases, much that has a value not reflected in its metadata (books bound with other books, authorship attributions made by readers and librarians) could be lost. With the mountain of digital research materials growing ever larger, to use Vannevar Bush’s metaphor, early modern documents (everything from pamphlets to ballads to multi-volume poetry collections) preserved only by metadata records and page images could fall beneath notice, becoming very difficult for even the most devoted researchers to locate. But we can give early modern texts a higher digital profile. Optical Character Recognition (OCR) software could be used to create machine-readable versions of these texts, making them more findable by making them fully searchable and increasing, as it were, their digital trail. OCR technology is now excellent, but when dealing with the vagaries of early modern printing technology and practices, as well as page images that have been digitized from microfilm, automated transcription can only go so far. Adequately transcribing early modern texts from the image resources already at hand can be accomplished via carefully orchestrated human-machine interaction, which is the purpose of a project being launched at Texas A&M University: the Early Modern OCR Project (eMOP).

OCR engines fall short in many ways with early modern texts, unable to distinguish ligatures, lines, layout, pictures, and blotches on images from printed text. Although OCR engines introduce many more errors than readable text, somehow we trust the mechanical transcription above any attempts at fixing it, including post-processing substitutions that do fix many errors but of course necessarily introduce some as well. For example, we know that hardly any word in the English language contains the combination “fh,” but substituting “sh” for “fh” may turn “the” into “she.” Also, paradoxically, libraries are loath to accept hand-corrected OCR, especially when such corrections have been done by “the crowd.” The Australian Newspaper Digitisation Program provides a welcome exception, and others, such as Transcribe Bentham, are following suit.
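The trade-off in the “fh” example can be made concrete with a short Python sketch. It is not eMOP's post-processing code; the tiny word list and the two function names are invented for illustration, and a real system would need a period-appropriate lexicon.

```python
# Tiny illustrative word list (an assumption, not a real lexicon).
LEXICON = {"the", "she", "ship", "fish", "shall", "offhand"}


def blind_fix(token):
    """Always replace 'fh' with 'sh', treating 'f' as a misread long s."""
    return token.replace("fh", "sh")


def guarded_fix(token, lexicon=LEXICON):
    """Replace 'fh' with 'sh' only when that turns a non-word into a known word."""
    if "fh" not in token or token.lower() in lexicon:
        return token
    candidate = token.replace("fh", "sh")
    return candidate if candidate.lower() in lexicon else token


if __name__ == "__main__":
    for token in ("fhip", "fhe", "offhand", "fhxq"):
        print(token, "->", blind_fix(token), "|", guarded_fix(token))
    # fhip -> ship | ship
    # fhe -> she | she
    # offhand -> ofshand | offhand
    # fhxq -> shxq | fhxq
```

The lexicon check protects the rare real words that contain “fh,” such as “offhand,” and leaves unrecognizable strings alone. It cannot settle the “fhe” case, however: the rule yields “she” whether the page read “she” or “the,” which is why such substitutions necessarily introduce some errors of their own.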

In the eMOP project, our OCR engines and post-processing routines will be built to output TEI-A. In addition, we will make use of TEI tags to record multiple corrections by many agents, from post-processing routines to human beings using various tools that we are building for the purpose of correction. Using TEI to document the automated and human correction of 45 million pages of text allows us to be bold as we work, producing a highly readable and searchable text while preserving the history of each step of our process so that, should future generations of digitizers wish to undo, redo, or alter our process, they can do so. Moreover, per an historic agreement with ProQuest and Gale Cengage Learning, owners of the collections of digital page images that we’ll be using (Early English Books Online, or EEBO, and Eighteenth-Century Collections Online, or ECCO), the TEI-encoded document of corrected text is given to the human beings who correct it. This will allow scholars to create TEI-encoded digital editions. Our project effectively mixes “boutique” and big-data approaches to digital preservation.
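One way such a correction history might be recorded is sketched below in Python. The project's actual tagging system is not specified here, so this is only an assumption: it uses the standard TEI elements choice, sic, and corr, with the resp attribute pointing to the agent responsible for each correction. The agent identifiers and the function name record_corrections are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def record_corrections(ocr_reading, corrections):
    """Build a <choice> holding the raw OCR reading and every proposed fix.

    corrections is a list of (corrected_text, agent_pointer) pairs, kept in
    the order in which the corrections were made.
    """
    choice = ET.Element("choice")
    # The uncorrected OCR output is preserved so the step can be undone.
    ET.SubElement(choice, "sic").text = ocr_reading
    for corrected_text, agent in corrections:
        corr = ET.SubElement(choice, "corr", {"resp": agent})
        corr.text = corrected_text
    return choice


if __name__ == "__main__":
    # Hypothetical agents: an automated routine and a human corrector.
    choice = record_corrections(
        "fhip", [("ship", "#postprocessor"), ("ship", "#volunteer1")]
    )
    print(ET.tostring(choice, encoding="unicode"))
```

Because the original reading and every correction stay in the file, a later editor could accept, reject, or redo any individual step.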

In this talk and demonstration, I illustrate where the project is so far and show the TEI outputs, ideally getting input from the audience to help us better formulate our tagging system.

Teaching the TEI: from training to academic curricula

Elena Pierazzo1, Susan Schreibman2, Marjorie Burghart3, James Cummings4, Florence Clavaud5

1King’s College London, United Kingdom; 2Trinity College Dublin, Ireland; 3Ecole des Hautes Etudes en Sciences Sociales; 4University of Oxford; 5Ecole Nationale des Chartes

TEI has been accused of having a very steep learning curve and, as a matter of fact, most people find their way into TEI by being taught by some “initiated” expert rather than by teaching themselves. It is therefore crucial to devote a great deal of attention to the way TEI is taught around the world, in order to ensure the highest standards and, ultimately, the survival of TEI itself. For most of us this taught initiation takes the form of a workshop, a summer school, or a few days of intensive training focussed on the needs of a specific research project; but, as Digital Humanities curricula at both BA and MA levels develop around the world, an increasing number of people acquire the first rudiments of the TEI via a fully credited course within an academic programme.

The purpose of this panel is to analyse the differences and similarities between teaching the TEI in a dedicated intensive course, often professionally oriented and tailored to the specific needs of a project, and teaching it within an academic course. We will look at overlaps and differences in aims, methods and coverage between the two types of teaching, asking, for instance, which TEI chapters feature in the respective syllabi. We will also investigate the success rate of the different types of teaching: is teaching in a generalised, academic framework more likely to be effective in the long term? Are students of academic courses more or less likely than attendees of focussed training programmes aimed at specific research projects to use the TEI in different circumstances in their future professional lives? To put it differently: does the TEI work better as a tool to solve a problem (a digital edition of a manuscript, a dictionary, a corpus…) or as an analytic, modelling tool? What can be learned from experience of the two types of delivery to enhance teaching in the different frameworks?

The panellists are mainly based in Europe and will relate their experiences in teaching TEI at King’s College London (BA and MA level), Trinity College Dublin (MA), Ecole Nationale des Chartes in Paris (MA), Ecole des Hautes Etudes en Sciences Sociales in Lyon (BA and MA level), and the University of Oxford (Summer School and various workshops and training). The panellists hope that this session will include a larger discussion, in particular with their North American colleagues.