Google books are dirty

Today is the first of a series of posts I am anticipating writing as I work through the first semester of HDCC106, the second part of a two-semester class titled “Introduction to Digital Cultures and Creativity.” As I have mentioned in several places, we’ll focus this semester on creating a digital archive of letters from the Civil War. The letters, from the Marion Theresa (M. T.) Biddle papers, the Brooke Family Papers, and the John E. Rastall papers, are currently housed in Special Collections at the University of Maryland. We’ll take the students from soup to nuts, from creating transcriptions and metadata to developing databases and designing interactive interfaces. The semester will culminate in an Omeka-powered (and student-created) site featuring the Civil War letters, which will eventually become part of the University of Maryland Library’s permanent Fedora repository.

The program behind this class, Digital Cultures and Creativity, comprises humanities and computer science students. The class is intended to familiarize students with these technologies and to help them develop some basic skill sets they can use toward possible final projects around the archive. Those projects will include Omeka exhibits and plug-ins as well as associated multimedia projects and possibly a conceptual plan for an Omeka mobile app.

As I mentioned, today was the first discussion (last week’s was introductions and snow), and I focused the conversation on four primary questions we should always ask when facing a digital collection:

  1. What does the collection include?
  2. How were the materials digitized?
  3. Who is the intended audience?
  4. What is the purpose?

We asked these questions in the context of a few links to various Civil War projects as a precursor to figuring out these matters for our own projects. The stickiest part of the discussion, however, seemed to come when I introduced the idea of errors and data ambiguities behind the recently unveiled Google Ngram Viewer. The students lit up over the fact that the OCR data is dirty (look up “iSchool”) and that the results are affected by the corpus including only certain books (look up zip code 02138).

Literacies: data sources and data management.
