Data Collection: Tales from the Front Lines

As any DHer will spiritedly confirm, the most laborious part of any substantial DH project is, of course, the front-end data collection. This holds true to an even greater extent when your project model aims to accommodate potentially comprehensive data on a given topic and offers not simply a tool for the analysis of such data but also a full complement of data to analyze with the tool, all of which, in our case, concerns literary translation into the English language. Although we were given a considerable head start courtesy of the University of Rochester’s Three Percent Translation Database, our own project goals required an expansion of much of their data, which in turn required many hours of arduous data entry.

My work has focused primarily on the translator set of nodes: alongside a number of research assistants at UCLA, under the direction of Dawn, I was tasked with filling out additional information about the translators in our dataset. This included translator gender, nationality, type (freelance, professional, academic, author), and affiliation (university, press, organization, business).
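
For concreteness, here is a minimal sketch of what a single translator record looks like at this stage. The field names mirror the categories above, but the structure and the example values are hypothetical rather than our final schema.

```python
# Hypothetical sketch of the fields collected per translator.
# Example values are placeholders, not entries from the actual dataset.
translator = {
    "name": "Jane Doe",                   # placeholder name
    "gender": "female",
    "nationality": "United Kingdom",
    "type": "freelance",                  # freelance, professional, academic, or author
    "affiliation": "Example University",  # university, press, organization, or business
}
```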

A number of weeks ago I addressed some of the theoretical concerns elicited by this process, but now that we are nearing the end of data collection, some of the more practical questions we’ve encountered are equally worth remarking upon. Each of these informational expansions gave rise to a different series of problems, and addressing them required both reflecting on the nature of the project and anticipating the probable uses of the data.

For example, identifying the genders of the translators began with some well-meaning (but nevertheless problematic) assumptions, and we almost decided to fully automate the classification process to save time: there are a number of simple tools that will assign a gender to an entry based on name alone, and others that attempt to do so more accurately by contextualizing names with historical naming data. We chose to classify manually instead, but even then, as we sorted through the data, we unsurprisingly came across a number of misattributions. Further, there were many translators for whom we could not definitively confirm gender, and we eventually fell back on the stereotyped identifications for lack of a better option.
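
For readers unfamiliar with these tools, here is a deliberately toy sketch of the mechanism we decided not to rely on. The lookup table and threshold are invented for illustration (real tools draw on large historical naming datasets), and the ambiguous-name case shows exactly where the approach breaks down.

```python
# Toy illustration of name-based gender attribution.
# The table below is hypothetical; real tools use large naming corpora.
NAME_FREQUENCIES = {
    "maria": {"female": 0.99, "male": 0.01},
    "john":  {"female": 0.01, "male": 0.99},
    "alex":  {"female": 0.45, "male": 0.55},  # ambiguous names defeat the method
}

def guess_gender(first_name, threshold=0.9):
    """Return a guessed gender, or 'unknown' when the data is missing or ambiguous."""
    freqs = NAME_FREQUENCIES.get(first_name.lower())
    if not freqs:
        return "unknown"
    label, probability = max(freqs.items(), key=lambda item: item[1])
    return label if probability >= threshold else "unknown"
```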

That turned out to be one of the simpler questions, however. As we identified types of translator, we came across many people who fit into multiple categories, and into different ones at different points in their careers. Translators went from self-identified freelancing to editorial positions overseeing larger translation or publishing endeavors; many translators live abroad and both “freelance” lecture on translation and “freelance” translate; others worked in university libraries and translated extensively on their own time or while they were in graduate school; and the list goes on. In many of these scenarios, multiple “types” apply to a single translator, and there is no easily defensible criterion that would accurately categorize them all.
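
One way to sidestep the single-category problem, at least in principle, is to record type as a set of roles rather than a single value. The sketch below is hypothetical (the identifiers and role combinations are invented), not the model we settled on.

```python
# Hypothetical sketch: translator "type" as a set of roles instead of one category.
translator_types = {
    "t-0001": {"freelance", "academic"},      # lecturer who also freelances
    "t-0002": {"freelance", "professional"},  # moved from freelancing to an editorial post
    "t-0003": {"academic", "author"},
}

def has_role(translator_id, role):
    """Check whether a translator has ever held a given role."""
    return role in translator_types.get(translator_id, set())
```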

Nationality was no better. For example, one of the general practices in assigning nationality in biographical statements or translator notes is to consider someone, say, a “Korean-American” or a “Russian-American,” intending to show both a heritage and a current nationality. What do we do about this? Is this identification different from that of someone who self-identifies as, say, “German” but who was born in Switzerland, grew up in France, and lived the majority of their life in Germany? Are they most accurately Swiss-French-German? This also raised the question of whether we should simply apply the same criterion to every translator (e.g., listing birth country as nationality) or adopt a hybrid model and allow a translator’s self-identified nationality to override our own estimation. Many English translators consider themselves British, while Scottish or Welsh translators predominantly use their particular national identifiers. But others don’t, and if the latter practice is more precise, do we reassign the self-claimed Brits to their native nations?
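
The hybrid model we debated amounts to a simple precedence rule: prefer self-identification when we have it, and fall back to birth country otherwise. A hypothetical sketch, with invented records, just to make the rule explicit:

```python
# Hypothetical sketch of the hybrid rule: self-identified nationality first,
# birth country as the fallback. Records and values are illustrative only.
def resolve_nationality(record):
    if record.get("self_identified"):
        return record["self_identified"]
    return record.get("birth_country", "unknown")

print(resolve_nationality({"birth_country": "Switzerland", "self_identified": "German"}))  # German
print(resolve_nationality({"birth_country": "Wales"}))                                     # Wales
```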

There were also questions about translator affiliation. If the translator is an academic, at which university do they work? What if they’ve changed jobs and translated works included in our dataset while at both? What if this happened more than once? What if they held concurrent appointments? What if they freelanced, worked professionally in government translation, and then later started their own translation firm?
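
Career changes like these are easier to capture if affiliation is stored as a dated history rather than a single field. Again, a hypothetical sketch with invented organizations and years, not our working schema:

```python
# Hypothetical sketch: affiliation as a dated history, so job changes and
# concurrent appointments can coexist for one translator.
affiliations = [
    {"translator": "t-0001", "org": "Example University", "kind": "university", "start": 2004, "end": 2011},
    {"translator": "t-0001", "org": "Example Press",      "kind": "press",      "start": 2009, "end": None},
]

def affiliations_in_year(translator_id, year):
    """All affiliations active for a translator in a given year."""
    return [
        a for a in affiliations
        if a["translator"] == translator_id
        and a["start"] <= year
        and (a["end"] is None or year <= a["end"])
    ]
```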

Of course, this is to say nothing of the late-stage collection methods, which don’t exactly fit the reliable, peer-reviewed variety. The first set of information about translators, authors, and translated works was gathered using a scraper, which means some of the entries weren’t even verifiable humans, let alone verifiable translators. Translation businesses (and even a law firm) named after people caused some trouble, as did translators with traditional first names as last names or traditional last names as first names (e.g., Smith Alan or Allen Smith). Public networking sites like LinkedIn and Facebook, and even RateMyProfessors, became invaluable for verifying items that were otherwise very difficult to confirm, but I suspect most wouldn’t be willing to bet the farm on the accuracy of a few of the final entries.

But that’s okay. Because we expect to make this interactive graph database publicly accessible on all levels, we also expect researchers interested in using the database to add to the data as their designs require and to correct errors as they are discovered.

More importantly, this data collection process has made evident the very real, however incidental, acts of scholarly censorship and erasure that occur during information gathering. Each of the above questions required us to make compromises between accuracy and usefulness: it doesn’t matter how accurate the dataset is if it doesn’t contain information that will actually be used, and regardless of how useful the information might be, it won’t be used if it’s wildly inaccurate. What the above examples should make clear is the difficulty of compiling data without applying a particular agenda, political or otherwise, that in part determines what counts as useful or accurate, and thus, correspondingly, what should be added or removed.

As we move into the next stage of the project and toward the development of our graph database, these questions will continue to challenge and shape the way we pursue the collection and representation of data.