Day 6: Mining Data

I had high hopes for the applicability of data mining to my current/future project and to my long-term research on the Sacred Heart. I’ll largely discuss my research on the Sacred Heart because I’m familiar with the material, having worked on it for the past decade. I thought it would be useful to have a “safety” to see how well these data mining tools work. Verdict: so far, I’ve not been impressed with Google N-grams, Bookworm, Voyant, or Open Calais. I hesitated to write this, if only because I imagine some of my cohort found at least one of these programs useful. Or so I hope.

I felt frustrated with Google N-grams and Bookworm in particular. I couldn’t find useful sources that relate to my project, so I decided to try them with material related to the Sacred Heart. The results came back from both, and I noticed how skewed they were. No texts published in Mexico between 1730 and 1748? Incorrect. And where were the spikes in the early nineteenth century? My excitement turned to skepticism. What was Google using to gather this data? How was it sorting it? Did accents or diacritical marks make a difference? How can I use data that is skewed, if at all? How do I know when the data isn’t skewed? I felt similarly about Bookworm. These programs definitely seem to have an inherent Anglocentrism, which is not to say that they cannot be improved to correct that in the future. But for now I don’t feel I can use them in any meaningful way.
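On the question of whether accents make a difference: I don't know how Google or Bookworm handle them internally, so the following is only a minimal sketch of the general problem, not a description of either tool. It shows that an accented and an unaccented spelling are different strings to a computer unless something normalizes them first. The function name `strip_diacritics` and the sample search phrases are my own, chosen for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining accent marks removed.

    Caveat: this also flattens letters such as 'ñ' to 'n', which may
    not be what a Spanish-language search actually wants.
    """
    # NFD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical search phrases, with and without the accent.
with_accent = "Sagrado Corazón"
without_accent = "Sagrado Corazon"

# Compared as-is, the two spellings do not match.
print(with_accent == without_accent)

# After stripping diacritics from both, they do match.
print(strip_diacritics(with_accent) == strip_diacritics(without_accent))
```

If a tool skips a step like this (or applies it inconsistently to scanned texts), the two spellings would be counted as separate terms, which could be one reason results look thinner than expected for Spanish-language material.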

Voyant similarly saddened me. What high hopes I had for mining my PDFs! Alas, they were dashed. Instead, I inserted my book manuscript on the Sacred Heart. While the manuscript is not relevant to my project, I was delighted to see how the program mined it and visualized my top word choices. See:


Another graph that Voyant generated for me displays the chapters in which certain keywords appear most often.


Overall, I left today realizing that text mining still needs development in many areas. I also believe that it is not as relevant to many art historians, because text mining, at least as it was defined today and as I’m using the term here, does not extend to mining, say, archival documents or images (if that’s even really possible at this time).

A major point I did take home today (thank you, Lisa Rhody) is to make sure I have tidy data. After our discussion today of structured vs. unstructured data, I began to think of ways to create tidy data for my current book project on the Sacred Heart. Long ago I created a .doc that contained important events, object production dates, publication dates of texts, and more, all arranged chronologically. Today I began placing it in a spreadsheet and making it into tidy data. My goal is to map this data, or at least some of it. While not related to my deathways project, this is immediately relevant to my book manuscript. I might even find that it directly affects some of my ideas.
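For anyone curious what "tidy" might look like for a chronology like mine, here is a minimal sketch. It assumes one row per entry and one column per kind of information. The column names (`year`, `kind`, `title`, `place`) are my own guesses at what would be useful, and the rows are placeholders invented purely for illustration, not real entries from my timeline.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class TimelineEntry:
    """One row of the timeline: a single event, object, or publication."""
    year: int
    kind: str   # assumed categories: "event", "object", or "publication"
    title: str
    place: str  # kept in its own column so the rows can be mapped later


# Placeholder rows, invented for illustration only.
entries = [
    TimelineEntry(1700, "publication", "Placeholder text A", "Placeholder city"),
    TimelineEntry(1650, "object", "Placeholder object B", "Placeholder city"),
    TimelineEntry(1675, "event", "Placeholder event C", "Placeholder town"),
]


def to_csv(rows):
    """Return the rows as CSV text, sorted chronologically by year."""
    buffer = io.StringIO()
    column_names = [f.name for f in fields(TimelineEntry)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r.year):
        writer.writerow(asdict(row))
    return buffer.getvalue()


print(to_csv(entries))
```

The resulting text can be saved as a .csv file and opened in any spreadsheet program. Keeping the place in a separate column, rather than buried in a description, is what should make mapping possible later.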

And just for fun…

My Animoto video (I didn’t post one last week, so I quickly made one for show-and-tell):

It’s nothing fancy, but it gives you an idea of how Animoto looks as well as how it might work for a project.

While Animoto doesn’t appear to have any immediate relevance to my project, I do think it offers students a wonderful way to engage with material.
