Rudder Tower 601
Friday, November 9, 2012
4:00 PM – 5:30 PM
Chair: Syd Bauman
“Announcing Juxta Commons”
“Do you TEI? A Survey of TEI Practices in Libraries”
“Toward a Can-do-TEI Attitude: Refining the TEI Display plugin for Omeka”
Tanya Elizabeth Clement
Poster Slam Abstracts
Following the papers, presentations and discussions that resulted from the 2009 TEI conference on “Text Encoding in the Era of Mass Digitization,” the launch of the AccessTEI program in 2010, and the recent release of the “Best Practices for TEI in Libraries” in 2011, stewards of text encoding initiatives in academic libraries and cultural heritage organizations need to better understand if and how text encoding approaches and practices have changed as a result of mass digitization by Google and an increased emphasis on streamlined digital library operations. As Unsworth notes, in this age of distant reading and non-consumptive research, in which the TEI Consortium “competes … in intellectual and institutional ways with various disciplines and institutional commitments,” it is important that we understand in quantitative and qualitative ways the role of text encoding as supported by academic libraries, so that we can clearly convey to our partners, colleagues and administrators the value of text encoding following the TEI Guidelines (2011, para. 22).
The TEI Consortium’s SIG on Libraries released a major revision to the “Best Practices for TEI in Libraries” in October 2011. The current version of the “Best Practices” is stored in ODD files and continues to be based on the widely adopted “encoding levels” as described in earlier versions of this document, which span from fully automated reformatting of print content (level 1) to deep encoding to support content analysis and scholarly uses (level 5). A substantially revised TEI Header section supports greater interoperability with library cataloging codes and the MARC format. In addition, the “Best Practices” now include schemas for encoding levels 1–4, derived from the appropriate ODD, providing a mechanism to better ensure conformance and interoperability of digital texts at the various levels.
A little over a year since the release of the “Best Practices,” we have devised a survey to better understand how libraries foster and support text encoding following the TEI Guidelines, and in particular, how the latest version of the “Best Practices,” arguably more library-friendly and robust than its predecessor and with schemas corresponding to the levels of encoding, is used in project planning and encoding workflows. The survey builds upon survey data collected by the TEI-C in 2009 when developing the AccessTEI program. We will unveil the survey at the Digital Library Federation Forum and the TEI Conference, both occurring within days of each other in November 2012. Our poster presentation will detail the methodology established and provide a venue by which we can begin collecting data, including anecdotal data to help us frame our approaches to data analysis.
AccessTEI: A Digitization Benefit for Members of the Text Encoding Initiative. Retrieved from http://www.tei-c.org/AccessTEI/
Hawkins, K., Dalmau, M. & Bauman, S. (2011). Best Practices for TEI in Libraries. Retrieved from http://purl.org/TEI/teiinlibraries
Unsworth, J. (2011). Computational work with very large text collections. Journal of the Text Encoding Initiative. doi: 10.4000/jtei.215
This poster examines the technical background to the Stationers’ Register Online (SRO) project at the University of Oxford. This pilot SRO project received institutional funding from the Lyell Research Fund to transcribe and digitise in full the book-entry Register of the Stationers’ Company. This is arguably the most important primary source for the study of the history of the book in Britain apart from the books themselves. The Stationers’ Register was the primary means through which ownership of texts was asserted, disputed, regulated and monitored from 1577 until 1924, and survives intact in two series now held in Stationers’ Hall and at the National Archives. The SRO project has the support of the Register’s custodians: the Worshipful Company of Stationers.
The pilot SRO project is currently underway. It has commenced by digitising Edward Arber’s 1894 printed transcripts of the earliest volumes of the Register (from 1557 until 1640) and the Eyre and Rivington 1914 transcripts of the registers (from 1640 until 1708); it will also cost the proofing and correction of the resulting transcription against the manuscript originals, and the transcription of the later series from both manuscript and printed sources. The eventual intention is to publish an indexed, searchable database containing a full-text transcription of the Register. It would be possible to link this transcription to images of the original manuscript materials; however, neither the digitisation of the manuscript books, nor such a web interface, is in scope of the pilot project.
As part of the preparation for the digitisation of the earliest volumes of the Register by a keying company, the project created a byte-reduced schema for concise digitisation without loss of intellectual content. The motivating factor for this is that the keying company charges per kilobyte of output; while XML output was desired, there were significant savings to be had from reducing file size. In this case the TEI ODD meta-schema language was used to constrain and rename TEI P5 XML elements and attributes to reduce the size of output files. The resulting SRO schema (called tei_corset) provides a file-size reduction of up to 40% compared to its full TEI equivalent. The TEI ODD mechanisms for renaming elements and attributes are linked to an XSLT stylesheet for reverting any document instances to their expanded TEI P5 XML form. Moreover, this stylesheet also up-converts some aspects of the keyed documents to more sophisticated markup forms. (In particular, this includes the automatic recognition and transformation of dates and roman-numeral fees.) The poster will examine the creation of this schema and the benefits and challenges of up-converting basic presentational markup to richer semantic encoding. Additional technical topics under consideration for this poster will be the management of encoding guidelines for the keying company, and quality assurance of the resulting data that they provide.
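The tei_corset customisation itself is not reproduced in the abstract; as a sketch only, the ODD renaming mechanism described above (an elementSpec carrying an altIdent, reversed later by XSLT) might look like the following, where the module and element choices are illustrative and not drawn from the actual schema:

```xml
<!-- Illustrative ODD sketch of byte-reduction by renaming; the element
     and module selections are hypothetical, not the real tei_corset. -->
<schemaSpec ident="tei_corset_sketch" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header" include="teiHeader fileDesc titleStmt title
                                   publicationStmt sourceDesc"/>
  <moduleRef key="textstructure" include="TEI text body div"/>
  <moduleRef key="core" include="p date num"/>
  <moduleRef key="namesdates" include="persName"/>
  <!-- Keyed files use <pn> in place of <persName>; a companion XSLT
       stylesheet reverses the mapping to full TEI P5. -->
  <elementSpec ident="persName" module="namesdates" mode="change">
    <altIdent>pn</altIdent>
  </elementSpec>
</schemaSpec>
```

Every renamed element saves its length difference twice per occurrence (open and close tag), which is where the reported savings accumulate in densely tagged keyed output.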
Future developments for the SRO project may include linking the Register with other parts of the Company archives and resources elsewhere. Interoperability opportunities include, on the one hand, bibliographical resources such as ESTC, NSTC, Worldcat, EEBO and ECCO; on the other, biographical and topographical resources such as the British Book Trade Index and Michael Turner’s Index of London Book Trades (on which the Bibliographical Society and the Oxford Bibliographical Society are currently working to publish online). The SRO resource could thus conjoin descriptive bibliography with the sociology of the text. It would enable analysis of the pattern of book-registration over the entire term of its existence, across the periods of statutory Licensing and various intellectual property regimes, including that of statutory copyright from its beginnings. Cross-reference against the corpus of surviving books will permit assessment of patterns of registration and ownership of texts alongside queries into possible patterns of production. It will facilitate inquiry into the impact of intellectual property and media regulation regimes on British cultural and intellectual life, focussing and deepening what has been an intense debate within book history in recent years. Lastly, the resource will shed light on the operations of the Stationers’ Company itself, an important institution in the economic, social and cultural fabric of London and its dominions for several centuries.
1. Specification of a platform-agnostic, byte-reduced TEI P5 XML markup schema for data capture, together with associated stylesheets and scripts for data transformation and conversion to a richer TEI P5 XML schema, including rules for capturing and regularising book titles and individuals’ names; converting between Old and New Style calendars; and recording entrance fees, conditions and licenses.
2. Management of the transcription of Arber by a specialist keying company, including quality assurance, into a sparsely marked-up XML format (tei_corset).
3. Specification of a suitable web-database publishing platform capable of processing complex searches, with the ability to associate search-results with images of the manuscript originals and to interoperate with a range of biographical and bibliographical databases.
4. A detailed statistical, palaeographical and analytical survey of the entire run of the original manuscript Registers. The survey will assess the consistency of the entry format over time, identify any idiosyncrasies and tabulate numbers of pages, entries and average character-counts by year. The intention is to extend the transcription and markup schema developed to capture Arber’s transcript to cover both its manuscript originals and those of a later date. The schema has also been extended to cover the Eyre transcripts of the Registers from 1640 to 1708 and their manuscript originals. The resulting dataset will allow searches by Arber or Eyre page numbers, with which bibliographers are familiar, as well as allowing access to both regularised and diplomatic transcriptions.
5. Historical research on the institution of registration, the Stationers’ Company and the book trades; survey of historical and current uses of the Register in bibliographical and other research. Enquiry into complementary research agendas in the history and future of intellectual property law and media regulation.
6. A workshop to be held in Oxford bringing together scholars of the Stationers’ Company and the history of the book trades in Britain with project personnel, scoping possible use-cases for the envisaged database and soliciting feedback on the proposed editorial and technical standards.
This poster will introduce not only the project, but especially the technical development of a TEI ODD customisation to generate a byte-reduced schema and the mechanisms for reverting and up-converting the files. The TEI ODD customisation (tei_corset) and the XSLT stylesheets for reverting back to pure TEI will be made freely available under an open licence.
TXM is a free and open-source, cross-platform, Unicode- and XML-based text/corpus analysis environment and graphical client, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE-standard-compliant web portal (GWT based) with built-in access control.
It offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on the powerful CQP full-text search engine (http://cwb.sourceforge.net) and a range of statistical functions (factorial analysis, classification, co-occurrence analysis, etc.) based on R packages (http://www.r-project.org).
Read a full description at the TEI Tools wiki http://wiki.tei-c.org/index.php/TXM.
Read the scientific background at the Textométrie project web site http://textometrie.ens-lyon.fr/?lang=en
The demo will show:
– several representative imports from various sources (TXT, XML and XML-TEI) into the TXM Windows desktop application (http://sourceforge.net/projects/txm), with NLP tools (e.g. a lemmatizer) applied to texts on the fly
– several tools available in the TXM portal (http://txm.risc.cnrs.fr/demo/?locale=en) applied to those sources: frequency lists, KWIC concordances, text edition browsing, etc.
T-PEN is a web-based transcription tool that permits its users to transcribe from digital surrogates of unpublished manuscripts. T-PEN acts as a service between digital repositories and the scholars who wish to transcribe their collections. Using a line-parsing algorithm, T-PEN identifies the location of each line on the digital image (the detected lines can be adjusted by the user). This permits the user to transcribe line by line.
T-PEN also allows users to encode as they transcribe. Users can create their own XML buttons which appear in the transcription UI, and can be inserted into the transcription line. Users can also elect to link a schema (such as TEI) to a project from which an editable XML button set can be created. This allows the scholar who best knows the text in question to also encode at the same time.
T-PEN also provides access to other transcription tools, such as electronic dictionaries and textbases. The current built-in tools support pre-modern manuscript work, but users can also add their own tools.
T-PEN supports collaborative projects and crowd-sourcing projects (which are designated as public projects in T-PEN). Users can also export their transcriptions in RTF, PDF and XML formats (and there is some basic transformation for XML tags). Finally, T-PEN exposes transcription data as linked data using OAC (Open Annotation Collaboration), an RDF-based technology.
At the poster presentation, I will present a large poster which outlines the basic features of T-PEN and visitors can experiment with T-PEN on an accompanying laptop computer.
Given that the digital humanities propose new conceptualizations and new methodologies for humanities research, there has always been a strong tradition of tool building within the field, aimed at developing software that can express and help carry out such new approaches. Yet the outcomes of much digital humanities tool building, considered strictly in terms of the software produced, have seen mixed uptake and use. The ANGLES Project, a one-year pilot effort, proposes a solution to the adoption gap that has developed between scholars with digital materials and the technical developers designing the applications scholars use for their research. By combining the model of intensive code development (the “code sprint”) with participatory design exercises, testing and feedback from domain experts gathered at internationally recognized disciplinary conferences, ANGLES will develop a web-based editor for working with XML markup through sustained engagement with the large and active community of scholars, teachers and developers who work with the Text Encoding Initiative (TEI) standard. The aim of the project is to contribute not only a working prototype of a lightweight XML editor tuned to the needs of the TEI community but also another model (the “community roadshow”) for tool building in the digital humanities.
“[A]ccount books,” business historian Judith McGaw noted in 1985, “are among the most common but least accessible primary sources for historians….” In the intervening twenty-six years, historians have made use of these rich sources to produce a range of articles and monographs, but few digitization projects have tackled financial records directly.
This poster/demo presents a report on work in progress, conducted by a group of historians, archivists, and technologists in the United States who have taken initial steps towards developing standards for markup and metadata for manuscript financial records from projects that use TEI. Drawing on the expertise developed by pioneering projects at the Massachusetts Historical Society, the University of Virginia, and MIT, this group met at Wheaton College in August 2011 to describe the current state of affairs through discussion of problems and case studies, to propose some paths forward, and to identify model projects that can be fostered to test those paths.
The poster/demo displays examples from the digital collections of the Massachusetts Historical Society and from the Wheaton College Digital History Project to demonstrate embedded markup of historical financial records using the TEI measure element. It also highlights the need for further research and development to suggest elements better suited for expressing some of the more complex semantic values within historical financial records.
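As an illustration only (the sample below is hypothetical, not drawn from either collection’s actual files), a simple account-book line encoded with the TEI measure element might look like this:

```xml
<!-- Hypothetical account-book entry; the attribute values are
     illustrative of the embedded-markup approach described above. -->
<item>To <measure quantity="2" unit="yard" commodity="broadcloth">2 yds.
    broadcloth</measure>,
  <measure commodity="currency" unit="shilling" quantity="16">16s.</measure>
  <measure commodity="currency" unit="pence" quantity="6">6d.</measure>
</item>
```

Capturing quantity, unit, and commodity as attributes makes the sums and goods machine-queryable while leaving the original abbreviated forms intact in the transcription.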
Features and Challenges of Financial Records in Historical Research
A genre of primary sources that includes such materials as bills and receipts, ledgers, and cashbooks, financial records are abundant in traditional archives, but current practices for digitization fail to capture their meaning fully. The inaccessibility of these rich sets of resources stands in the way of their use by scholars, students, and the general public.
Digitized tabular records have long been in use by social science historians. The Minnesota Population Center’s Integrated Public Use Microdata Series provides harmonized data for U.S. and international population studies. The National Historic Geographic Information System provides United States historical geography, as does the Great Britain Historical Geographic Information System for Britain. And while a strong tradition of monographs and journals in economic history has resulted from investigation of archival financial records, no comparable project exists for their digitization.
Like historical census data, manuscript financial records tend to be tabular in format, but they also tend to include information with semantic values that cannot be represented through simple transcription of tabular layout. Such records in fact include significant variations and idiosyncrasies. Eighteenth-century attendance records from the Comédie Française, for example, resemble U.S. manuscript census forms from 1840 onward in their combination of pre-printed forms with hand-written entries. In contrast, daybooks and ledgers from the eighteenth and nineteenth centuries often contain column divisions drawn by hand to organize handwritten double-entry accounts. In many instances from earlier centuries, various bits of financial information are interspersed with prose in diaries and manuscript advice to children.
Though few digital history projects have focused directly on financial records, numerous projects include them. The Railroads in the Making of Modern America Project at the University of Nebraska, for example, includes transcriptions of payroll records for railroad employees. The Bethlehem Digital History Project includes a few extracts from business ledgers. To date, however, the percentage of such resources in the mass of material digitized has not been large enough to precipitate the creation of a TEI module designed to capture their unique semantic value.
A look at semantic values to be found in financial records surfaces numerous details that go far beyond the standards for double-entry accounting described in John Mair’s influential eighteenth-century textbook, Book-keeping Methodiz’d; or, Merchant-accompts by double entry, according to the Italian form, 8th ed. (1765). Perhaps the most significant challenges in dealing with historical financial records lie in their apparent regularity. The impulse to record expenditures, transactions, and other such events related to exchanges of cash, property, and goods and services did not always translate into perfect regularity in practice. In some cases, accountings of expenditures appear in the midst of simple prose in diaries. Other such cases appear, for instance, in a medieval codex that includes a father’s advice to his son about how to handle money and property. Authorial idiosyncrasies abound in such records. Historical accounting documents nevertheless represent efforts to keep structured records, in many cases in reference to the principles of double entry bookkeeping. Establishing some standards for the markup of such records remains a desirable and attainable goal.
The poster/demo will display markup of sample documents from the collections of the Massachusetts Historical Society and from the Wheaton College Digital History Project. These files demonstrate the utility of the TEI measure element for simple financial documents as well as the challenges presented by the more complex documents that historical subjects produced when they attempted to use the practices of double entry accounting.
Long Term Goals
The activities described in this poster/demo build on the natural constituencies that exist among archivists and historians who recognize the value of financial records for historical research, and we now seek to expand the community of scholars, archivists, and technologists who are prepared to stretch the TEI Guidelines to broaden access to manuscript financial records. Standardized digitization of this rich yet currently inaccessible genre of manuscript historical records has the potential to open significant new lines of inquiry about economic, social, and cultural history. Digitization of local financial records could allow researchers to explore the interplay of cultural, social, and economic factors in particular localities.
With extensive application of standardized markup to such records from diverse places, researchers could compare data about continuities and changes in such factors over time and space. Digital collections created according to such standards could yield a large volume of harvestable data for examination of economic, social, and cultural conditions across national lines. Aggregated data from digitized collections of financial records could reveal significant new information about the history of daily life across time and space. We welcome additional participants as we continue to model markup and draft recommendations for guidelines.
Note: Activities described in this proposal received funding from the National Endowment for the Humanities in the United States. Any views, findings, conclusions, or recommendations expressed in this proposal do not necessarily reflect those of the National Endowment for the Humanities.
Note: We are grateful to John J. McCusker for sharing his bibliography of accounting in early British America, of which excerpts are included here.
Adams Family Papers: An Electronic Archive, Massachusetts Historical Society, http://www.masshist.org/digitaladams/aea/.
Baxter, William T. “Observations on Money, Barter, and Bookkeeping,” Accounting Historians Journal, XXXI (June 2004), 129-139.
Bethlehem Digital History Project, http://bdhp.moravian.edu/community_records/business/busact.html.
Bloom, Robert, and John Solotko. “Barter Accounting in the US during the Late Eighteenth and Early Nineteenth Centuries,” Accounting History, New Series, IX (March 2004), 91-108.
Bloom, Robert, and John Solotko. “Elucidating Needs, Lifestyles, and Community: Researching a Late Eighteenth-Century Account Book from Lexington, Virginia,” Accounting History, New Series, XIII (August 2008), 333-352.
Burnard, Lou, and Syd Bauman. TEI Guidelines P5, 2007, http://www.tei-c.org/Guidelines/P5/.
Comédie Française Register Project, Massachusetts Institute of Technology, http://web.mit.edu/hyperstudio/cfr/.
Great Britain Historical Geographic Information System (GBHGIS), http://www.gbhgis.org/.
Hollister, Joan, and Sally M. Schultz. “The Elting and Hasbrouck Store Accounts: A Window into Eighteenth-Century Commerce,” Accounting History, New Series, XII (November 2007), 427-440.
Jefferson, Thomas. Farm Book. Massachusetts Historical Society. http://www.masshist.org/thomasjeffersonpapers/farm/.
Mair, John. Book-keeping Methodiz’d; or, A methodical treatise of Merchant-accompts, according to the Italian form. Wherein the theory of the art is fully explained,… To which is added, a large appendix. … 8th ed. Gale ECCO Print Edition reproduced from the National Library of Scotland. Edinburgh: printed by W. Sands, A. Murray, and J. Cochran, for W. Sands, A. Kincaid & J. Bell, and A. Donaldson, 1765.
McGaw, Judith A. “Accounting for Innovation: Technological Change and Business Practice in the Berkshire County Paper Industry,” Technology and Culture, 26/4 (October 1985), 703-725.
The Minnesota Population Center’s Integrated Public Use Microdata Series (IPUMS) and National Historic Geographic Information System (NHGIS), http://ipums.org/.
Railroads in the Making of Modern America, University of Nebraska, http://railroads.unl.edu/views/item/rrwork.
Schultz, Sally M. and Joan Hollister. “Jean Cottin, Eighteenth-Century Huguenot Merchant,” New York History, LXXXVI (Spring 2005), 133-167.