When the concept of data mining and text analysis was first introduced, my mind conjured up a rather sweet little image of tiny robots with miner’s hats. In my (admittedly rather coffee addled) mind, these robots marched dutifully along lines of text chipping away at the important bits. After placing their plunder in a cart, they pushed it along an analytical railroad for processing. The end result was restructured information waiting for a nice human to come along and turn it into a bright, snazzy word cloud.
When the coffee eventually wore off, I was thinking a little less about tiny robots and more about the way machines are allowing us to analyse data far quicker than we can do it ourselves. And not only quicker: by handing over the reins to robots for a little while, we can investigate vast collections of resources for specific trends.
Text analysis is about using machine learning, linguistic and statistical techniques to structure information from different media for investigation. These techniques are often used in business and research to study things like word frequency, pattern recognition and link association. By looking at data this way, a business might discover more about its target audience from a Twitter hashtag, and a researcher might begin to draw conclusions about knowledge behaviour from a large-scale survey.
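To make the word frequency idea concrete, here's a toy sketch (not from any real study, with invented sample text) of how a tiny text-mining robot might tally up the most common words in a passage:

```python
import re
from collections import Counter

def word_frequencies(text, top_n=5):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

# Invented sample text, purely for illustration.
sample = "Tiny robots mine text. Tiny robots push carts. Robots analyse text."
print(word_frequencies(sample))  # 'robots' comes out on top with 3 mentions
```

Real tools do much more (stemming, stop-word removal, statistical weighting), but at heart a word cloud starts with a count like this.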
If you want to make a librarian flinch, there's nothing quite as effective as suggesting that we stop reading. While it may seem a controversial idea to literature enthusiasts, Franco Moretti's intention in putting forward the method he called 'distant reading' was not that we set our traditional analytical methods aside completely, only that we lift our noses from a single book and consider the wider context.
In his article 'Conjectures on World Literature', Moretti suggested that to gain a more overarching understanding of a subject, we should look at the overall make-up of a whole set of documents, gaining insights that might otherwise be overlooked. That is exactly what methods like text analysis can be used for.
Two years ago, a study into the early days of the French Revolution involved historians, statisticians and political scientists using machine learning to analyse the politicians of the time and their speech patterns. Thanks to a form of AI that can process and generate natural language (Natural Language Processing, or NLP), over 40,000 speeches were analysed, with key words and phrases identified and attributed to their speakers. Not only were the researchers able to draw conclusions about how different political parties addressed the assembly, they also discovered that some of the most important conversations happened not in committees, but behind closed doors.
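The study's actual pipeline was far more sophisticated, but the core step of attributing words to speakers can be sketched in a few lines. Everything here is invented for illustration: the mini-corpus, the speaker names and the stop-word list all stand in for the real 40,000-speech dataset:

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-corpus of (speaker, speech) pairs; the text is invented.
speeches = [
    ("Robespierre", "The republic demands virtue, and virtue demands vigilance."),
    ("Danton", "Audacity, more audacity, always audacity!"),
    ("Robespierre", "Vigilance protects the republic from its enemies."),
]

# A toy stop-word list to filter out common filler words.
STOPWORDS = {"the", "and", "more", "always", "from", "its", "demands"}

def keywords_by_speaker(corpus):
    """Tally each speaker's most-used content words."""
    counts = defaultdict(Counter)
    for speaker, speech in corpus:
        words = re.findall(r"[a-z]+", speech.lower())
        counts[speaker].update(w for w in words if w not in STOPWORDS)
    return counts

profiles = keywords_by_speaker(speeches)
print(profiles["Robespierre"].most_common(3))
```

From counts like these, you can start comparing how different speakers, or whole parties, talk about the same subject.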
From counterintelligence gathered from news reports to email spam filters, the practical applications are vast and improving all the time. In the future, our data-mining robotic friends might even be able to gather information in different languages and restructure it all by meaning.
But while this process may seem like a boon to many, it's important to recognise the pitfalls. Restructuring data from a selection of speeches on Brexit may provide insight into how politicians are trying to focus a narrative, but without a well-rounded understanding of the key issues in the debate, the findings may end up being misleading. A word cloud generated from one of my manuscripts might tell me how often I use a certain word, but it can't always account for my particular brand of creative licence. And how often I tend to talk about tiny robots.
Photo Credit: Steve Talkowski, March of the Robots