Data Collection, Tales from the Front Lines

As any DHer will spiritedly confirm, the most laborious part of any substantial DH project is, of course, the front-end data collection. This holds true to an even greater extent when your project model aims to accommodate potentially comprehensive data on a given topic, and offers not simply a tool for the analysis of such data, but also a full complement of data to analyze with the tool, all of which, in our case, concerns literary translation into the English language. Although we were given a considerable head-start courtesy of the University of Rochester’s Three-Percent Translation Database, our own project goals required an expansion of much of their data, which in turn required many hours of arduous data-filling.

I have been focused primarily on the translator set of nodes, and alongside a number of research assistants at UCLA—under the direction of Dawn—was tasked with filling out additional information about the translators of our dataset. This included translator gender, nationality, type (freelance, professional, academic, author), and affiliation (university, press, organization, business).

A number of weeks ago I addressed some of the theoretical concerns elicited by this process, but now that we are nearing the end of the data collection process, some of the more practical questions that we’ve encountered are equally worth remarking upon. Each of these informational expansions gave rise to a different series of problems, and addressing them required both reflecting on the nature of the project and anticipating the probable uses of the data.

For example, identifying the genders of the translators began with some well-meaning (but nevertheless problematic) assumptions, and we almost decided to automate the classification process fully to save time: there are a number of simple tools that will assign a gender to an entry based on name, and others that attempt to do so more accurately by contextualizing names with historical naming data. We chose to classify manually instead, but even then, as we sorted through the data, we unsurprisingly came across a number of misattributions. Further, there were many translators for whom we could not definitively confirm gender, and we eventually left the stereotyped identifications for lack of a better option.

That turned out to be one of the simpler questions, however. As we identified types of translator, we came across many folks who fit into multiple categories, and at different points in their careers. Translators went from self-identified freelancing to editorial positions, overseeing larger translation or publishing endeavors; many translators live abroad, and both “freelance” lecture on translation and “freelance” translate; others worked in university libraries and translated extensively on their own time or when they were in graduate school; and the list goes on. In many of these scenarios, multiple “types” apply to each translator, and there are no easily defensible criteria that would accurately categorize them all.

Nationality was no better. For example, one of the general practices in assigning nationality in biographical statements or translator notes is to consider someone, say, a “Korean-American,” or a “Russian-American,” intending to show a heritage and a current nationality. What do we do about this? Is this identification different from that of someone who self-identifies as, say, a “German” but who was born in Switzerland and grew up in France, but lived the majority of their life in Germany? Are they most accurately Swiss-French-German? This also raised the question of whether we should simply apply the same criteria to each translator (e.g. listing birth country as nationality) or apply a hybrid model, and allow a translator’s self-identified nationality to trump our own estimation of their nationality. Many English translators consider themselves British, while Scottish or Welsh translators predominantly use their particular national identifiers. But others don’t, and if the latter practice is more precise, do we reassign the self-claimed Brits to their native nations?
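To make the choice between the two models a little more concrete, here is a minimal sketch in Python of how a uniform rule (birth country) and a hybrid rule (self-identification first) might sit side by side. The record fields, the function name, and the `prefer_self` switch are all hypothetical illustrations, not the project's actual schema, and the sketch deliberately returns the source of each label so that the basis for the decision stays visible.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NationalityRecord:
    """Hypothetical record; field names are illustrative only."""
    birth_country: Optional[str] = None
    self_identified: Optional[str] = None


def resolve_nationality(record: NationalityRecord,
                        prefer_self: bool = True) -> Tuple[str, str]:
    """Return (label, source) under either the hybrid or the uniform rule."""
    # Hybrid model: a self-identified nationality wins when one is recorded.
    if prefer_self and record.self_identified:
        return record.self_identified, "self-identified"
    # Uniform model (or fallback): use the birth country when it is known.
    if record.birth_country:
        return record.birth_country, "birth country"
    # Last resort: use whatever self-identification exists.
    if record.self_identified:
        return record.self_identified, "self-identified"
    return "unknown", "unresolved"


# The "German" translator born in Switzerland, under each rule.
example = NationalityRecord(birth_country="Switzerland", self_identified="German")
hybrid_label = resolve_nationality(example)
uniform_label = resolve_nationality(example, prefer_self=False)
```

Neither rule settles the question, of course; keeping the source next to the label simply lets a later researcher see which rule produced a given entry and re-run the other one.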

There were also questions about translator affiliation. If the translator is an academic, what university do they research at? What if they’ve changed jobs and translated included works at both? What if this happened more than once? What if they held concurrent appointments? What if they freelanced, were associated with a government translation profession, and then later started their own translation firm?
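One way to think about the type and affiliation questions above is to stop treating either as a single value and instead record a list of dated roles per translator. The following is only a sketch under that assumption: the `Role` and `Translator` classes, their field names, and the example data are invented for illustration and are not how our dataset is actually structured.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Role:
    """One type/affiliation pairing held over a span of years (hypothetical)."""
    kind: str                          # e.g. "freelance", "academic"
    affiliation: Optional[str] = None  # None when no institution applies
    start_year: Optional[int] = None   # None means the start is unknown
    end_year: Optional[int] = None     # None means ongoing or unknown

    def active_in(self, year: int) -> bool:
        starts_ok = self.start_year is None or self.start_year <= year
        ends_ok = self.end_year is None or year <= self.end_year
        return starts_ok and ends_ok


@dataclass
class Translator:
    name: str
    roles: List[Role] = field(default_factory=list)

    def roles_in(self, year: int) -> List[Role]:
        """All roles that overlap the given year, in recorded order."""
        return [r for r in self.roles if r.active_in(year)]


# Invented example: freelancing, then two overlapping academic appointments.
example = Translator(
    name="Example Translator",
    roles=[
        Role("freelance", None, 1998, 2005),
        Role("academic", "University A", 2004, 2010),
        Role("academic", "University B", 2008, None),
    ],
)

# Roles that apply to a translation published in 2009.
roles_2009 = [(r.kind, r.affiliation) for r in example.roles_in(2009)]
```

This does not decide which affiliation "counts" for a given translation, but it keeps job changes and concurrent appointments in the record rather than forcing a single answer at collection time.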

Of course, this is to say nothing of the late-stage collection methods, which don’t exactly fit the reliable, peer-reviewed variety. The first set of information about translators, authors, and translated works was collected using a scraper, which means some of the entries weren’t even verifiable humans, let alone verifiable translators. Translation businesses (and even a law firm) named after people caused some trouble, as did translators with traditional first names as last names or traditional last names as first names (e.g. Smith Alan or Allen Smith). Public networking sites like LinkedIn and Facebook, and even Ratemyprofessor, became invaluable for verifying items that were otherwise very difficult to verify, but I suspect most wouldn’t be willing to bet the farm on the accuracy of a few of the final entries.

But, that’s okay. Because we expect to make this interactive graph database publicly accessible on all levels, we also expect researchers interested in using the database to add to the data as their designs require, and adjust errors as they are realized.

More importantly, what have become evident throughout this data collection process are the very real, however incidental, acts of scholarly censorship and erasure that occur during information gathering. Each of the above questions required us to make compromises between overall accuracy and utilization—it doesn’t matter how accurate the dataset is if it doesn’t contain useful information that will be utilized, and regardless of how useful the information might be, it won’t be utilized if it’s wildly inaccurate. What the above examples should make clear is the difficulty of compiling data without applying a particular agenda, political or otherwise, that in part determines what’s useful or accurate, and thus correspondingly, what should be added or removed.

As we move into the next stage of the project and toward the development of our graph database, these questions will continue to challenge and shape the way we pursue the collection and representation of data.

Translating Networks at SHARP 2016, Paris

Immediately after attending the mid-July 2016 DH conference in Kraków, Poland, Dawn hopped on a plane to Paris for the annual meeting of the Society for the History of Authorship, Reading & Publishing (SHARP). There, she gave a presentation and overview of our Translating Networks project and offered a number of insights into future directions of the project.

To read the paper and view the slides, visit: http://dawnchildress.com/2016/08/13/sharp16/.

One of the more involved and interesting aspects of that future is how we plan to ultimately organize the data and make it accessible, manipulable, and analyzable. Rather than using a traditional series of networks (each with its own lists of nodes and edges) as a visualization technique, we’ll be using Neo4j, an open-source NoSQL graph database platform, to create a graph database.

For those not familiar with the concept (as I wasn’t when I joined the project), the graph database basically reorganizes the information held in the nodes and edges of a traditional network so that the relationships themselves are stored as data, allowing users to locate, manipulate, and then analyze those relationships more easily.
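For readers who think better in code, here is a toy illustration of that idea in plain Python. It is not Neo4j and does not use Neo4j's API or query language; it is a stand-in built on dictionaries, with invented node IDs, labels, relationship types, and data, meant only to show what it looks like when relationships are stored as data and then followed.

```python
from collections import defaultdict

# Nodes: id -> (label, properties). All names and data here are invented.
nodes = {
    "t1": ("Translator", {"name": "Translator One"}),
    "w1": ("Work", {"title": "Work One"}),
    "w2": ("Work", {"title": "Work Two"}),
    "p1": ("Publisher", {"name": "Press One"}),
}

# Relationships are records in their own right: (start, type, end).
relationships = [
    ("t1", "TRANSLATED", "w1"),
    ("t1", "TRANSLATED", "w2"),
    ("p1", "PUBLISHED", "w1"),
]

# Index relationships in both directions so they can be followed either way.
outgoing = defaultdict(list)
incoming = defaultdict(list)
for start, rel_type, end in relationships:
    outgoing[start].append((rel_type, end))
    incoming[end].append((rel_type, start))


def publishers_for_translator(translator_id):
    """Follow TRANSLATED out to works, then PUBLISHED back to publishers."""
    found = set()
    for rel_type, work_id in outgoing[translator_id]:
        if rel_type != "TRANSLATED":
            continue
        for in_type, source_id in incoming[work_id]:
            if in_type == "PUBLISHED":
                found.add(nodes[source_id][1]["name"])
    return sorted(found)


publishers = publishers_for_translator("t1")
```

The question "which publishers has this translator worked with?" becomes a short walk along stored relationships rather than a join across separate node and edge lists, which is roughly the convenience a graph database is meant to offer at scale.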

The SHARP presentation was part of a panel that considered the “Status of Translators,” with presentations from Kenneth Carpenter (Harvard) on Translators and Translations of Economic Literature before 1851 and Anthony Cordingley (U of Sydney) on Translation Archives: The Advent and the Cultural Politics of Collecting. There were a number of common threads throughout the papers, especially related to the representation of translators in the scholarly record, from archival finding aids and catalog records, to other existing datasets for discovering and studying literary history.

The best of all possible (DH) worlds

As our Translating Networks project begins to unfold, and the long-term vision becomes more necessary (and in ways less clear), I’ve become interested—vis-à-vis the thoughts and directions of Dawn and Tom—in musing about and trying to unravel the slew of concerns that plague the development of new, expansive digital resources.

It’s no secret that large-scale digital projects, especially those focused first on public resource creation (e.g. the digitization, collection, and/or presentation of data) and only then on individualized research and argumentation, frequently suffer from a kind of enterprising overzealousness. For all of their optimistic daring, those DH projects that cater to Stephen Ramsay’s “Screwmeneutical Imperative” are no doubt at a disadvantage compared with those that follow more traditional, business-model-esque routes.


New funding, new teammate

Due to a generous grant from The Pennsylvania State University’s Center for Humanities and Information (CHI), Tom and Dawn’s once proof-of-concept project on world literature in English translation and network analysis has evolved into a much larger, collaborative and community-focused research venture. Building on the University of Rochester’s work on their Three-Percent Translation Database, our project has broadened its scope and is working on the creation of a comprehensive and (eventually) publicly accessible database of translation data for English translations of literary works.


Network Analysis of Literary Translation in the U.S.

Tom and I both attended HILT a few weeks ago, taking the week-long Introduction to Network Analysis course with Elijah Meeks. We wanted some quick data to work with, so I pulled the spreadsheets from Three Percent’s Translation Database, a data set of world literature translated into English in the U.S. market.
