Digging Deeper (How Machines Are Helping Us Understand a Little More About Ourselves)

When the concept of data mining and text analysis was first introduced to me, my mind conjured up a rather sweet little image of tiny robots in miner’s hats. In my (admittedly rather coffee-addled) mind, these robots marched dutifully along lines of text, chipping away at the important bits. After placing their plunder in a cart, they pushed it along an analytical railroad for processing. The end result was restructured information waiting for a nice human to come along and turn it into a bright, snazzy word cloud.

When the coffee eventually wore off, I found myself thinking a little less about tiny robots and more about the way machines allow us to analyse data far more quickly than we can ourselves. And not only more quickly: by handing over the reins to robots for a little while, we can investigate vast collections of resources for specific trends.

Text analysis is about using machine learning, linguistics and statistics to structure information from different media for investigation. These techniques are often used in business or research to study things like word frequency, pattern recognition and link association. By looking at data this way, a business may discover more about its target audience from a Twitter hashtag, and a researcher might start to draw conclusions about knowledge behaviour from a large-scale survey.
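At its simplest, a word-frequency count of the kind these techniques rely on might look something like this toy sketch in Python (the sample ‘tweets’ are invented for illustration):

```python
from collections import Counter
import re

def word_frequencies(text, top_n=3):
    """Count how often each word appears, a basic text-analysis step."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

# An invented scrap of social-media text standing in for a real corpus.
tweets = "tiny robots mine text. tiny robots love data. data everywhere."
print(word_frequencies(tweets))  # → [('tiny', 2), ('robots', 2), ('data', 2)]
```

Real text-analysis pipelines add stemming, stop-word removal and much more, but the core idea of counting and comparing terms is the same.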

If you want to make a librarian flinch, there’s nothing quite so effective as suggesting that we should stop reading. While it may seem like a controversial idea to literature enthusiasts, Franco Moretti’s intention in putting forward a method he called ‘distant reading’ was not that we set our traditional analytical methods aside completely, only that we take our noses away from a single book and consider the wider context.

In his article Conjectures on World Literature, Moretti suggested that in order to gain a more overarching understanding of a subject, we should look at the overall make-up of a set of documents so we can gain insight that might otherwise be overlooked, which is exactly what methods like text analysis can be used for.

Two years ago, a study into the early days of the French Revolution brought together historians, statisticians and political scientists, who used machine learning to analyse the politicians of the time and their speech patterns. Thanks to techniques for processing human language computationally (Natural Language Processing, or NLP), over 40,000 speeches were analysed, with key words and phrases identified and attributed to their speakers. Not only were the researchers able to draw conclusions about how different political parties spoke to the assembly, they also discovered that some of the more important conversations happened not in committees, but behind closed doors.
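The real study used far richer NLP than this, but the basic bookkeeping of attributing words to their speakers can be sketched with a per-speaker tally. The speakers and words below are invented for illustration, not taken from the study:

```python
from collections import Counter, defaultdict

# A hypothetical miniature corpus of (speaker, speech) pairs standing in
# for the 40,000 revolutionary speeches.
speeches = [
    ("Robespierre", "liberty virtue liberty terror"),
    ("Danton", "audacity audacity always audacity"),
    ("Robespierre", "virtue terror virtue"),
]

# Tally each speaker's words separately.
by_speaker = defaultdict(Counter)
for speaker, speech in speeches:
    by_speaker[speaker].update(speech.split())

# Which word does each speaker lean on most?
for speaker, counts in by_speaker.items():
    print(speaker, counts.most_common(1))
```

From tallies like these, researchers can start comparing how different groups of speakers use language.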

From intelligence gathered from news reports to email spam filters, the practical applications are vast and improving all the time. In the future, our data-mining robotic friends might even be able to gather information in different languages and restructure it all based on meaning.

But while this process may seem like a boon to many, it’s important to recognise the pitfalls. Restructuring data from a selection of speeches on Brexit may provide insight into how politicians are trying to focus a narrative, but without a well-rounded understanding of the key issues in the debate, findings may end up being misleading. A word cloud generated from one of my manuscripts might tell me how often I use a certain word, but it can’t always account for my particular brand of creative licence. And how often I tend to talk about tiny robots.

Photo Credit: Steve Talkowski, March of the Robots

The Semantic Web or: How I Learned to Stop Worrying About our Future AI Overlords.

It might be hard to believe, with Siri sitting pretty in our pockets and Alexa on the mantelpiece, but the idea of an artificial assistant has been in the public consciousness since the 1800s. What started with Mary Shelley’s arguably mechanical Frankenstein’s monster soon became Samuel Butler’s question of whether such evolutions could end with machines becoming our ultimate masters.

But is A.I. really something to worry about? Science fiction’s most memorable villains are artificial in nature, from Skynet’s disdain for humanity, to HAL 9000’s famous “I’m sorry Dave, I’m afraid I can’t do that.” But are we right to feel such trepidation about our mechanical creations? After all, they’re naturally limited by the data we allow them to access.

The dream of true artificial intelligence relies heavily on programming capacity, intelligent algorithms and the ability to learn from patterns in data, and at present our data is largely isolated.

Of course, all that could change should a Semantic Web ever come to fruition. An idea coined by Tim Berners-Lee, the Semantic Web would allow machines to read data as well as we do, a prospect made possible in part thanks to Linked Data.

Using metadata, data models like the Resource Description Framework (RDF) and ontology languages like the Web Ontology Language (OWL) allow data not just to be categorised, but also to form relationships.
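RDF expresses those relationships as subject–predicate–object triples. Here is a toy sketch of the idea in plain Python tuples; real Linked Data would use full URIs and a proper library such as rdflib, and the example facts are simplified for illustration:

```python
# Toy RDF-style triples: (subject, predicate, object).
triples = [
    ("Frankenstein", "writtenBy", "Mary Shelley"),
    ("Mary Shelley", "bornIn", "London"),
    ("Frankenstein", "genre", "Gothic fiction"),
]

def objects_of(subject, predicate):
    """Return all objects linked to a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Because the data forms relationships rather than flat categories, a
# machine can follow links: from a book to its author to her birthplace.
author = objects_of("Frankenstein", "writtenBy")[0]
print(objects_of(author, "bornIn"))  # → ['London']
```

That ability to hop from one linked fact to the next is what lets software traverse data the way we skim from article to article.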

The idea may not sound all that revolutionary; indeed, anyone who’s lost an afternoon to TV Tropes will know the ease with which we can jump from one subject to another. But while the human mind is brilliant at making intuitive leaps, software needs direction. Machines don’t use the same language we do, and there’s more to linking data than a few extra lines of code. RDF databases use the query language SPARQL to retrieve and manipulate data, and then comes the question of XML vs JSON for storage and organisation. Thinking on a local scale isn’t enough; for a truly Semantic Web to be realised, global solutions need to be agreed by everyone.
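To give a flavour of what a SPARQL query does, here is a toy analogue of its basic graph-pattern matching written in plain Python. Tokens starting with ‘?’ act as variables, anything else must match exactly; real SPARQL engines do far more, and the triples are invented for illustration:

```python
# A tiny triple store to query against.
triples = [
    ("Frankenstein", "writtenBy", "Mary Shelley"),
    ("Erewhon", "writtenBy", "Samuel Butler"),
    ("Frankenstein", "genre", "Gothic fiction"),
]

def match(pattern):
    """Return variable bindings for each triple matching the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value      # bind the variable
            elif term != value:
                break                      # constant mismatch: skip triple
        else:
            results.append(binding)
    return results

# Roughly: SELECT ?book ?author WHERE { ?book writtenBy ?author }
print(match(("?book", "writtenBy", "?author")))
```

This returns a binding for each book–author pair, which is the essence of how a SPARQL SELECT query pulls structured answers out of linked triples.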

It’s no surprise that companies like Google, Microsoft and Yahoo have been trying. After all, what could make a web search more efficient than returning not just the results you want, but related information you didn’t realise existed? The launch of schema.org gave webmasters a free mark-up vocabulary with which to link their websites to others via metadata. And while it’s in use by over 10 million websites, progress is proving slow going.
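Schema.org mark-up is often embedded in pages as JSON-LD. The sketch below builds one such description as a Python dictionary; the restaurant itself is invented, though `Restaurant`, `servesCuisine` and `PostalAddress` are genuine schema.org types and properties:

```python
import json

# A sketch of schema.org structured data as JSON-LD: the kind of mark-up
# a webmaster might embed so search engines can link the page's content
# to related data elsewhere.
page_data = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "The Tiny Robot Café",
    "servesCuisine": "French",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "London",
    },
}

print(json.dumps(page_data, indent=2))
```

A search engine that understands this vocabulary can then connect the page to other structured descriptions of restaurants, cuisines and places.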

In fact, the philosopher Luciano Floridi argues that the idea, even at its most modest, is destined to remain largely unrealised. While using metadata to link documents is certainly helpful, the ontologies that use it rely on abstractions we may never fully achieve.

“One may wish to consider a set of restaurants not only in terms of the type of food they offer, but also for their romantic atmosphere, or value for money, or distance from crowded places or foreign languages spoken… the list of potential desiderata is virtually endless, so is the number of levels of abstraction adoptable and no ontology can code every perspective.”

And unfortunately, these aren’t the only challenges standing in the way of a Semantic Web. Never mind the fact that the data we produce isn’t just vast, it can be vague or downright misleading. And we’re often hesitant to release raw data at all. But as Berners-Lee emphasised in his TED talk back in 2009, making this data available and allowing it to be linked is central to the ultimate goal.

So while a Semantic Web might well be the catalyst the machine lifeforms from The Matrix were waiting for… it looks like it might still be a while before we need to worry about which colour pill we’d take.



Butler, S. (1863). Darwin Among the Machines. [To the Editor of The Press, Christchurch, New Zealand, 13 June 1863.] NZETC.

Berners-Lee, T. (2000). Weaving the Web. London: Texere, pp. 157-175.

Floridi, L. (2009). Web 2.0 vs the Semantic Web: A Philosophical Assessment. Episteme, 6(1), 25-37.

Split-Second Info

In my other life, the one where I regularly attempt to corral teenagers into education without making the fact too obvious for fear they’ll scream bloody murder and hightail it, I often find myself wishing mobile phones didn’t exist. Or perhaps just didn’t exist all the time, like when I’m at work. Because what chance do I stand when the alternative is the entire internet?

In order to combat the temptation the kids feel to check their home screen, I often find myself looking for ways to summarise the important points I know we need to cover. I condense, paraphrase and bullet point. I give them the gist of it, hoping that I can grab their attention long enough to help them engage with the information they need.

And as I listened to our second DITA lecture on assimilation, I realised that perhaps the reason I had started delivering workshops the way I do was that I was mirroring the online habits these young people had developed to combat information overload.

According to statistics from Visual Capitalist, in 2018 there were over 2 million Snapchat messages, 481,000 tweets and 187 million emails sent every 60 seconds. With those figures in mind, it’s hardly surprising that Ofcom’s report on digital dependency the same year found the average Briton checked their phone every 12 minutes. And those numbers have only grown since.

With so many notifications to check, posts to read and emails to reply to, we rely more and more on informal snapshots to keep us informed. Why read the entire article when we can get a sense of what’s happening from a headline? Even user-generated content on websites like Reddit often comes with a ‘too long; didn’t read’ (TL;DR) summary to save us time.

As we endeavour to keep ourselves afloat in the sea of a relentlessly updated ‘infosphere’, this penchant for summarising has become an important life skill. And as I watch the teenagers I work with try so hard to keep themselves informed via their mobiles, I can’t help but think about my own internet habits as well.

I wonder just how much information I’ve missed out on in the interests of saving time; how much context has gone over my head. I’m starting to realise how much faith I put not only in faceless algorithms to point me in the direction of things I want to see, but also other users to summarise the information for me when I get there.

The importance of accurate, unbiased headlines is hardly new, but with how quickly we can be exposed to so much data, it’s perhaps more important than ever to step back and take a moment to dig a little deeper.

The attention span myth is a classic example. A couple of years ago it was all I heard about from other professionals: how, thanks to technology, the average human attention span was now no better than that of a goldfish. How could we possibly expect to reintroduce young people to more formal education when they could barely sit through a 30-second video before skipping to the next one?

The idea that our attention spans fell from 12 seconds in the year 2000 to a mere 8 in 2015 originated with a study from Microsoft Canada. And while the BBC debunked it a couple of years later, the damage, you could argue, had already been done.

So perhaps next time I’m scrolling from one article to the next on my Google homepage, I’ll take the time to be a little more selective. Maybe I’ll take my jumps from Twitter to Reddit a little slower from now on. Those posts I want to see may have been pushed from the front page or down my timeline by the time I get there, but just because there’s something else interesting above them doesn’t mean I don’t still want to read them.

Image: Yui Mok/PA