A very deep, dark, treacherous mine.

Tidy data will be my guideline as I begin my data collection, because I do not have my data yet. So I won’t be cleaning house, as there’s no house to clean yet, but I will try to build and keep a clean house as I collect and organize my data.

The biggest question in my mind is still: how will I collect my data in the first place? Most of the books I want to analyze have not been digitized yet. I know it sounds ambitious to make digitization part of my project, but I also see it as a potential benefit: making these works available in digital form to a large public, and allowing them to be known and preserved in a different way from their current analog formats (small editions, many out of print, languishing in public libraries…?).

I don’t want to choose my project based on what data is already available! And I can see my scholarship (and that of many others) as a way of correcting and complementing the Eurocentric biases of digitized collections and platforms. For example, in addition to digitizing the works, I realize I will have to find a good platform to process them. Many of the platforms shown in class won’t fully work because their lexicons will be missing important words. When using Voyant, I had to input many “stop words” in Portuguese (pronouns, etc.) because even its multilingual option didn’t have them.
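To make the stop-word problem concrete, here is a minimal sketch of what supplying my own Portuguese stop words could look like in plain Python, outside Voyant. Everything in it is an assumption for illustration: the tiny `PORTUGUESE_STOP_WORDS` set, the `word_frequencies` helper, and the sample sentence are all made up, and a real list would be far longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (articles, pronouns,
# prepositions). A real list for Portuguese would be much longer.
PORTUGUESE_STOP_WORDS = {
    "a", "o", "as", "os", "um", "uma", "de", "da", "do",
    "e", "em", "na", "no", "que", "se", "eu", "ele", "ela",
}


def word_frequencies(text, stop_words=PORTUGUESE_STOP_WORDS):
    """Count the words in `text`, skipping anything in `stop_words`."""
    # [^\W\d_]+ matches runs of letters, including accented ones like "ã".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample sentence, not a quotation from any of the books.
sample = "A cidade e o bonde: a cidade que eu vi da janela."
counts = word_frequencies(sample)
print(counts.most_common(3))
```

Keeping the stop words in a set I control means I can keep adding to it as I read, instead of depending on whatever a platform happens to ship with.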

I realize that a huge part of my project will be finding ways to build the tools and platforms necessary to begin and run the project in the first place–maybe that’s too much to aim for, but I will at least try.

I must confess I also still have some lingering questions about text mining. My project on the urban world of Brazilian modernists began with old-fashioned manual text mining, which I carried out over two years as an undergraduate (this was an independent research project I came up with at the time, and got funding to do). I read the books and wrote down all the things I was looking for: mentions of urban life broadly defined, from the words “city” and “urban” to specific locations and sites to aspects of modern urban life such as cars, elevators, and machines. I was on the lookout for certain terms a priori, but I discovered most of the terms and themes by reading the books. I would have missed most of them if I had used a previously prepared lexicon (even a perfect lexicon for São Paulo in the 1920s). Many of those “mentions of urban life” were also figurative, and I discovered them by reading whole passages of poems, or by analyzing the plot of a short story or novel. Our discussion of text mining today made me realize that it might not be exactly the tool I thought I needed, not just because of the language and region limitations but mostly because what I was doing in the first place might not be best described as mining.

I must say that the volume of works I analyzed was also manageable. It wasn’t billions or even hundreds of books… Perhaps text mining would allow me to expand my field and include other texts besides literary works and journals (say, newspapers from the time), but then again, those are not digitized either.

