I signed up for the Data Scientist Training for Librarians (DST4L) course because I wanted to learn more about Python; to learn to use emerging tools and technologies; to get involved with projects that are cultivating new roles, opening new possibilities, and identifying new challenges within the field of escience librarianship; and to explore my own research questions concerning the relationships among data, power, and social justice. As a radical librarian who has worked in technology, education, health advocacy, racial justice, above- and underground libraries, and movement publishing, I have had many experiences working with ‘homegrown librarians’ in a variety of contexts. Through this work I have come to recognize, again and again, how essential radical ideas and literature (their production, preservation, accessibility, and discoverability) are for social change and liberation movements. Hence, my questions about metadata often center on the critiques of open access (“is it really ‘open’?”), universal access (“what could that look like?”), and safeguarding [print] books (and their ideas) from obsolescence. (See http://caribbean2012.thatcamp.org/11/12/is-this-the-illusion-of-open-access-liberation-and-the-future-of-knowledge-and-the-book-in-the-digital-era/)
Conversations about these questions with the radical women of color publishing collective South End Press, with whom I have been working, led us to create the Metadata Almanac Project. There were many long nights when we found ourselves talking about how the open access movement is not enough: we have to try to move towards universal access. Many ideas flew about what universal access could be, and there was at least one element that we all agreed was essential: ensuring that multiple formats of an item (print, audio, ebook(s), Braille, multilingual, free books, paid books, etc.) are available. But by available, and this is key, we mean available to anyone who needs or wants it.
Entering DST4L, I didn’t have a clue how to conceptualize a comprehensive metadata almanac, let alone how to go in and clean up the data. The collective’s members had started the process by doing a ‘data dump’ from their website. The data were recorded in an unwieldy Excel format; many records from the dump were inaccurate, incomplete, or inconsistent; and the dump had also pulled in HTML tags and other confusing information. I knew I had to aggregate all the potentially useful metadata elements that I had learned about in my cataloging and information organization classes, but Jocelyn and Asha wanted more than simple access points. As publishers/editors, they understood that there are many non-traditional access points that the reader uses to obtain information; folksonomy, or user-generated tags, and user reviews now join the more familiar OCLC numbers, LC subject headings (though at times incorrect), and even Cutter numbers as meaningful descriptors. South End Press, by capturing even the most mundane data in the almanac, is ensuring that the literature, and access to it, survives.
Political literature is particularly vulnerable to marginalization and outright disappearance, and even the most notable literature can go out of print (see This Bridge Called My Back), especially literature that is focused both on liberation and on an experience that is not, for instance, white, middle-class, heteronormative, and able-bodied (see Betty Friedan’s The Feminine Mystique). Amazon and the other big corporate conglomerates are choking collective publishers and independent booksellers and now even pose the threat of being the only platform we have left to offer ebooks (or any kind of book, frankly). The reality posed by Stallman (“free speech not free beer”) could come to fruition if we don’t intervene. This is why the goal of universal access has to be pursued, and not just open access. We know that it is not enough to just open the door: some people might not even get to the door.
We’ve entertained the idea that lots of copies keep stuff safe (LOCKSS) and lots of authors keep stuff alive, such as hope. Thus, the publishing collective decided to put its acquisitions on hold to digitize its entire backlist to ensure its permanence, whether or not the press should survive (it’s no secret that small publishing houses are not faring well in the age of publishing’s consolidation, the disappearance of independent bookstores and libraries [or simply “shelf space”], and the migration of book buyers to Amazon). As I have discovered, it’s easier to make the decision to start publishing the next book in digital format, but digitizing the entire backlist, especially for a press founded over 35 years ago, many years before desktop publishing caught fire, is not an easy task. That’s about 70,000 pages and covers to locate, verify, scan, proof, and correct. And, of course, tag with metadata!
This led us to the metadata almanac. In digitizing the entire backlist, we are also creating a companion metadata almanac that will allow information caretakers to select whatever metadata they need to satisfy their users’ information seeking habits and needs.
The process began with a website data dump that Asha Tall of the South End Press Collective had managed to do with some hacking. Even looking at the resulting spreadsheet was overwhelming (columns A to ZZZ!). Plus, it was incomplete! We had multiple copies of that .csv file in a number of locations. We opened it in Excel, uploaded a copy to Google Docs, and found a cataloger to fill in some of the missing authorities and metadata points. Bless this (volunteer) cataloger’s soul, who dove in and filled in the subject headings, the bios of the authors when missing, etc. But the metadata almanac still had HTML tags and trailing white spaces in it, and there were 247 books! Enter a variation of Moore’s law: the more technology changes, the more the amount of metadata grows. So the collective made the decision to add even more authorities to the almanac. Remember, lots of authorities keep stuff permanent.
The Current Project
1) Metadata Bombs (ongoing) – We are grateful for Rebecca Martin, a radical librarian at Boston University and Community Change, Inc., who helped us to organize a metadata bomb*. A metadata bomb is exactly that – a group of predominantly radical librarians, community activists, and academics gathered to fill out a Google spreadsheet that lists all the items/data points missing from the almanac. We have done two. Both began with lots of really good questions: “are we seeking positive reviews only?” “Does some particular number that only a cataloger understands have importance?” By the way, we are still not done – do let me know if you want to participate in the next one!
2) Data Tools I have been using:
- Python – scripts to look up the OCLC numbers of individual items to add to the metadata almanac.
- OpenRefine – can we say “yay, Tom Morris!”? Admittedly, I did not listen to his talk (though he was teaching us how to use it!), but only because after he showed us an example of using this tool, I immediately began thinking about its implications for the metadata almanac. It handles small datasets pretty well, and in this case I am grateful that it takes out all the HTML tags and trailing white spaces, and that it can show whether any titles were misspelled or otherwise inaccurate. It definitely gave me a lot more ideas and the confidence to handle this project without the dreaded feeling of doing unnecessary work.
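For readers who want to try the same kind of cleanup in Python, here is a minimal sketch of the two chores described above: stripping HTML tags and stray white space from every cell, and grouping titles that are probably the same book typed differently. It is not the script we used, and it is not OpenRefine's own code; the title-matching key is only loosely modeled on the "fingerprint" idea behind OpenRefine's clustering. The column names (title, description) and the sample rows are made up for illustration.

```python
import csv
import html
import io
import re
import unicodedata
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher; fine for simple markup


def clean_cell(value):
    """Remove HTML tags, decode entities, and collapse extra white space."""
    text = TAG_RE.sub(" ", value)
    text = html.unescape(text)
    return " ".join(text.split())


def fingerprint(title):
    """Build a key that ignores case, accents, punctuation, and word order."""
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    words = re.sub(r"[^\w\s]", "", ascii_text.lower()).split()
    return " ".join(sorted(set(words)))


def clean_rows(rows):
    """Clean every cell of every row (rows are dicts, as from csv.DictReader)."""
    return [
        {key: clean_cell(value or "") for key, value in row.items() if key is not None}
        for row in rows
    ]


def title_clusters(rows, column="title"):
    """Return groups of distinct spellings that share the same fingerprint."""
    groups = defaultdict(set)
    for row in rows:
        if row[column]:
            groups[fingerprint(row[column])].add(row[column])
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]


# Hypothetical two-row dump standing in for the real spreadsheet.
SAMPLE = '''title,description
"<b>This Bridge Called My Back</b>  ","<p>An anthology &amp; a landmark.</p> "
"this bridge called my back","Second row, same book"
'''

rows = clean_rows(csv.DictReader(io.StringIO(SAMPLE)))
for cluster in title_clusters(rows):
    print("Possible duplicates:", cluster)
```

The clusters are only suggestions: a person still has to decide which spelling is the right one, which is exactly the kind of question a metadata bomb is good for.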
What kind of implications does a class like DST4L have for moving towards universal access, especially in the context of a small political press whose impact keeps growing as people realize that the radical ideas for preventing societal downfall weren’t so farfetched? What kind of implications do tools like Churnalism and classes like DST4L, which afforded us conversations about how data are used and REUSED, have for a more equitable society and a greater possibility of universal access?
*It was first called a metadatathon; a radical librarian by the name of Susie Husted affectionately named it the “metadata bomb,” and the name stuck.
**I know this is a long blogpost, but there were many folks who assisted and continue to assist us in this process and these conversations. And I really wanted to honor the process, so as soon as I get people’s permission to use their names, I’ll add them to make it even longer. Lots of authors keep it permanent, no?