Day 6: Delving into Data Mining

Back after a weekend break, ready to stuff more new information into my organic content management system (a.k.a. brain). This week we will see just how much will fit in there!

Robert Fludd, Utriusque cosmi maioris scilicet et minoris […] historia, tomus II (1619), tractatus I, sectio I, liber X, De triplici animae in corpore visione (Wikimedia Commons)

I like that the day often begins with an introduction to a tool that is super easy to use. Today it was Google Ngram Viewer, which lets you compare word frequencies across the books that Google has scanned as part of its Google Books project. I tried “Latin America” and “South America,” and it looks like this:

(Ugh. That’s A LOT of empty space there between the chart and the text, and I can’t seem to fix it. Clearly I do not get all the ins and outs of WordPress.) The tool is limited and even problematic (how do you determine exactly what the data set is? Does Google publish a list of its books?), but I can definitely see using it as a teaching tool, for example to show my students that the term “Latin America” is recent, and that scholarship on the topic has peaked at particular times that correspond with (or follow) political interest in the region. We also looked at Bookworm, which works in a similar way on different repositories of digitized texts.

Then we moved on to slightly more complex tools for data analysis: Voyant and Open Calais. I spent a good bit of time playing with Voyant. Again, I can see the potential, perhaps especially as a teaching tool. I am intrigued by what these tools might reveal about what’s in a text.
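
To make the idea concrete, here is a minimal Python sketch of the most basic thing a text-analysis tool does: count how often each word appears once common words are filtered out. This is not how Voyant itself is implemented; the function name `term_frequencies`, the tiny stop-word list, and the sample sentence are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "in", "from", "of", "to", "across"}

def term_frequencies(text, top_n=5):
    """Return the top_n most frequent words (excluding stop words) with counts."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)

# Invented sample text, standing in for a digitized review.
sample = (
    "The biennial opened in Havana. Critics praised the biennial, "
    "and the biennial drew artists from across Latin America."
)

print(term_frequencies(sample, top_n=3))
```

On a single sentence the result is trivial, but the same counting applied to a large body of text is what produces the word clouds and frequency charts.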

Okay, so I love the idea of using data mining to analyze internationalism vis-à-vis biennials in Latin America, but like most other participants at this institute, I don’t have a lot of digitized text available to analyze. I’d like to analyze biennial reviews, for example, but the ones I have are PDFs and would need to be OCRed (I’m sure I’m not using the terminology correctly!) before I could use digital tools on them, and even then I’m not sure the volume of text would be enough to make it worthwhile. I’ve learned, for example, that a good topic analysis needs a minimum of 1,000 texts. So I’m reserving judgement on data mining as it might apply to my project and looking forward to talking about visualization tomorrow.

(A small mystery that’s bugging me: why does this post show up as having been posted at 2:02 am on Tuesday, July 15, when it’s 10:11 pm on Monday, July 14? It’s not my computer clock, so it must be the clock linked to the host?)
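
One possible explanation is a time zone offset. The Python sketch below rests on two guesses that the post does not confirm: that the host stamps posts in UTC, and that the local clock is four hours behind UTC (as US Eastern Daylight Time is in July). Under those assumptions, the 2:02 am stamp converts to a few minutes before 10:11 pm on Monday.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the blog host records post times in UTC.
server_stamp = datetime(2014, 7, 15, 2, 2, tzinfo=timezone.utc)

# Assumption: the local clock is UTC-4 (e.g. US Eastern Daylight Time).
local_zone = timezone(timedelta(hours=-4), "UTC-4")

# Convert the host's timestamp to the assumed local time zone.
local_stamp = server_stamp.astimezone(local_zone)

print(local_stamp.isoformat())
```

If the guesses hold, the remaining nine minutes would simply be the gap between creating the post and noticing the timestamp.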
