
Can a Machine Rewrite Shakespeare?

With mankind firmly in the 21st century, and more than four decades after HAL in 2001: A Space Odyssey showed humans the way to go, it seems unfathomable that everyday conversation with machine-held information in natural language is still not a reality.

Wouldn’t it be amazing if we could get computers to read Shakespeare’s plays, for instance, and re-write them using different words, and not just by applying synonym replacements or related concepts? Imagine a machine even giving us a completely “new” play in Shakespeare’s writing style. Just think of the impact that reaching this technological level would have on a rapidly exploding data world that hungers for more intelligent semantic filtering.

Why are we not there yet?

What hasn’t been done, and what represents the “holy grail” of semantic analysis on a technical level, is an efficient multi-word to multi-word transformation of text that preserves both grammar and meaning from input to output.

Currently available semantic technologies (those dealing with meaning in language data) are crippled by a serious flaw. It is difficult to generate a semantically and grammatically equivalent body of text from an existing repository of language patterns and word combinations. Additionally, sentence structure and the logical flow of meaning have to fit the physical and rational make-up of the world we live in.

The flaw comes in when we literally have to “show computers our world”. By attempting to “categorise” words or concepts beyond the English left-to-right, Arabic right-to-left or Chinese top-to-bottom reading and writing order, most modern semantic intelligence technologies produce a level of complexity that is unsustainable in terms of permutations.

By laying down logical concept rules, such as “a dog is alive” and “things that are alive replicate”, giving us “a dog replicates”, current technologies hope to create systems that generate and perpetuate rules of logic, and that eventually represent some type of “machine intelligence” on a level with human thinking.
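
As a rough illustration of this kind of rule chaining, here is a minimal Python sketch. The fact and rule representation and the function name forward_chain are assumptions made for this example only; they are not taken from any of the systems mentioned in this post.

```python
# Toy forward-chaining sketch. The (entity, property) encoding and the
# function name are illustrative assumptions, not a real system's API.

def forward_chain(facts, rules):
    """Apply 'anything that <premise> also <conclusion>' rules until no new fact appears.

    facts: set of (entity, property) pairs, e.g. ("dog", "is alive")
    rules: list of (premise_property, conclusion_property) pairs
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for entity, prop in list(derived):
            for premise, conclusion in rules:
                if prop == premise and (entity, conclusion) not in derived:
                    derived.add((entity, conclusion))
                    changed = True
    return derived


facts = {("dog", "is alive")}          # "a dog is alive"
rules = [("is alive", "replicates")]   # "things that are alive replicate"
result = forward_chain(facts, rules)   # now also holds ("dog", "replicates")
```

Every new rule and every new entity multiplies the number of facts such a system has to derive and store, which is where the trouble described next begins.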

Categorisation systems very quickly run into the “permutation problem”. Imagine any sentence of about 8-10 words, e.g. “I really appreciate my mother in the morning”. What would happen if we replaced each word with, let’s say, 10 equivalent words that fit both grammatically and semantically, e.g. “I definitely/positively/demonstratively…” “like/admire/love my mother…”? Taking the original phrase and inserting the replacement words in all possible combinations that still make sense, we get 100 million phrases (10 choices for each of 8 words, or 10^8) that are ALL grammatically and semantically equivalent, and we are still only saying that we feel positive about our mother some time early in the day!

Even the smallest body of text of minimal complexity obviously has trillions upon trillions upon trillions of grammar-semantic equivalents. Logical categorisation systems simply do not have the concept-combination multiplication power to cover the permutation problem. Worldwide efforts since the 1980s around ontological classifications, hierarchical categorisation, entity collections and logic rule-based systems have therefore not succeeded quite as envisaged. We can think of Cyc, OpenCyc, Mindpixel and WordNet, amongst many.
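
The arithmetic can be checked with a few lines of Python. This is only a sketch: the figure of 10 alternatives per word comes from the example above, while the small replacement lists in slots are made up for illustration.

```python
from itertools import product

sentence = "I really appreciate my mother in the morning".split()
alternatives_per_word = 10  # assumption taken from the example in the text

# Each of the 8 word positions is chosen independently, so the counts multiply.
total = alternatives_per_word ** len(sentence)  # 10 ** 8

# A much smaller, hand-made set of replacements (illustrative only) shows
# the same multiplication at a scale that can actually be listed.
slots = [
    ["I"],
    ["really", "definitely", "positively"],
    ["appreciate", "like", "admire", "love"],
    ["my mother in the morning"],
]
variants = [" ".join(words) for words in product(*slots)]  # 1 * 3 * 4 * 1 phrases
```

Even with only three and four alternatives in two positions, the list already holds twelve phrases; every extra position with alternatives multiplies the total again.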

“Permutations” is the villain that everyone hopes will disappear with “just a few more categorisations…”

Alas, it will not.

What is needed is a small compact “semantic engine” that can “see” our world and that will enable trillions of concept permutations to adequately represent the resulting image.

With an abundance of data in a complex and highly unstructured web and without a powerful enough “engine”, we really don’t have much chance of ordering and classifying this data such that all the concepts inside it relate to everything else in a manner that holistically resembles our real human world.

The search is therefore on for a technology that could take a quantum leap into the future. If we can start by enabling machines to “rewrite Shakespeare”, we should be able to develop an innovative, ontology-free, massively scalable algorithmic technology that requires no human intervention and that could act as librarian between humans and data.

The day when humans are able to easily talk to, reason with and “casually converse” with unstructured data will bring a giant leap in human-machine symbiosis and, after far too long a wait, we can perhaps still experience a true Turing “awakening” in our lifetime.

 To see a version of Shakespeare’s Hamlet re-written by a machine, have a look at…

The Bible Written by Hitler!

Authors of books invariably have their own writing styles – wouldn’t it be interesting to see how different texts would read if re-written by different authors?

Combining textual data and language patterns from disparate sources is extremely difficult, because the end result needs to be grammatically and semantically accurate for the aggregated sentences to be meaningful and fully understood.

The phrase “the rhinoceros is feeding on grass in the veld” is semantically and grammatically correct, whereas “feeding veld in grass on is the rhinoceros” is semantically correct (it is physically and logically possible for a rhinoceros in our world to perform this action), but obviously grammatically faulty. If we say “the rhinoceros is flying through the air and catching insects”, the sentence is grammatically sound, but its meaning is not possible in the physically constrained world that we currently live in.

Finding interesting ways to re-write the Bible by combining texts is unusual, but it is also scientifically relevant, because a very real technological application can flow from it.
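
The two kinds of correctness can be treated as separate tests, as the toy Python sketch below tries to show. Everything in it is a hand-written assumption for these three sentences only (the mini-lexicon, the single sentence pattern and the list of plausible actions); it is exactly the sort of manual categorisation that does not scale.

```python
import re

# Hypothetical mini-lexicon: word -> part-of-speech tag (illustration only).
LEXICON = {
    "the": "DET", "rhinoceros": "NOUN", "grass": "NOUN", "veld": "NOUN",
    "air": "NOUN", "insects": "NOUN", "is": "AUX", "feeding": "VERB",
    "flying": "VERB", "catching": "VERB", "on": "PREP", "in": "PREP",
    "through": "PREP", "and": "CONJ",
}

# One hand-written sentence pattern over tags; a real grammar needs far more.
PATTERN = re.compile(r"DET NOUN AUX VERB( PREP( DET)? NOUN)*( CONJ VERB NOUN)?")

# Hand-written "world knowledge": actions a rhinoceros can plausibly perform.
PLAUSIBLE_ACTIONS = {"rhinoceros": {"feeding"}}


def is_grammatical(sentence):
    """Check word order only: does the tag sequence match the pattern?"""
    tags = " ".join(LEXICON[word] for word in sentence.split())
    return PATTERN.fullmatch(tags) is not None


def is_plausible(sentence, subject="rhinoceros"):
    """Check meaning only, ignoring word order: are all the actions possible?"""
    verbs = {word for word in sentence.split() if LEXICON[word] == "VERB"}
    return verbs <= PLAUSIBLE_ACTIONS[subject]


feeding = "the rhinoceros is feeding on grass in the veld"
scrambled = "feeding veld in grass on is the rhinoceros"
flying = "the rhinoceros is flying through the air and catching insects"
```

Under these assumptions the first sentence passes both tests, the scrambled one fails only is_grammatical, and the flying rhinoceros fails only is_plausible.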

At present, keyword-based extraction from datasets is typically the entry point of human interaction with electronic data. Not only has the past decade seen phenomenal growth in data worldwide, but datasets have also grown so diffuse that keyword searches are becoming ineffective at returning adequate and meaningful results.

What if the same technology that enables the Bible to be written by Hitler could also be applied to increase the retrieval power of electronic search?

When looking at a search input query, computers are currently unaware of any unseen words or concepts that are similar in meaning and natural-language grammar structure to this query. If computers were able to “see” an accurate “image” of our world and understand our real world in human terms, then any input query could be rephrased with similar word groups of any size that fit both grammatically and semantically.

For example, national security agencies could massively broaden the number of dataset “hits” for given target words if similar concepts around the target words were investigated at the same time. Danger words such as “suicide attack” could be triggered in searches even if only inoffensive words such as “sacrifice, impact, cause, final” are present in the dataset under analysis.

The implications for targeted advertising, online retail searches and even the simple act of matching Dave, who likes “spaghetti bolognaise”, with Mary, who loves “Italian restaurants”, on a dating site are massive.

Unfortunately, very few semantic technologies currently available can deliver the required multiword to multiword functionality.

What is needed is an elegant “engine” that can create a computer-“readable”, realistic world image straight from random web crawling, with both semantic and grammatical accuracy. Getting computers to understand grammar without using grammar rules or classifications that require human interpretation is surprisingly difficult. It is, however, very attractive, because no grammar ontologies are needed. This technology could work just as easily in Arabic, Mandarin Chinese, Russian or French as it does in English. A small, compact engine that could map multiword groups to synonymous multiword groups quickly and on a massive scale, while allowing for trillions of concept permutations to adequately represent data across the full cybersphere, would represent a major step forward in the development of semantic technologies.
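
To make the idea concrete, here is a deliberately naive Python sketch of concept-based triggering. The RELATED table, the function name concept_hits and the min_overlap threshold are all assumptions invented for this illustration; a real system would have to learn such associations rather than have them typed in by hand.

```python
import re

# Hypothetical hand-made association table: target concept -> cue words.
# The cue words are the ones used in the example in the text.
RELATED = {
    "suicide attack": {"sacrifice", "impact", "cause", "final"},
}


def concept_hits(text, related=RELATED, min_overlap=3):
    """Return the target concepts whose cue words appear often enough in text.

    min_overlap is an arbitrary illustrative threshold, not a tuned value.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = {}
    for target, cues in related.items():
        found = cues & words
        if len(found) >= min_overlap:
            hits[target] = sorted(found)
    return hits


flagged = concept_hits("The final sacrifice for the cause will have a great impact.")
ignored = concept_hits("The final score had no impact on the league table.")
```

The first text contains none of the target words yet is still flagged, while the second shares only two cue words and is left alone. A fixed word list like this is of course only keyword matching one step removed, which is the limitation the multiword approach is meant to overcome.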

The reward would be for current technologies to evolve gradually from keyword search to eventual human-machine conversational interaction – and the seemingly unrelated process of rewriting the Bible therefore subtly leads us into a technological future that could give us the ultimate HAL 9000.

To see what the Bible would look like in Hitler’s hand, have a look at…

George W Bush: Another Einstein?

Einstein’s “relativity theory” is mentally challenging.

An interesting experiment would be to mix George Bush’s language patterns into this complexity. This process will be shown, by analogy, to mirror human-to-machine communication, where juxtaposed datasets can be integrated and information “smoothed” for interpretation at many entry points.

Why is it still not possible for machines to “see” our human world and thus allow computers to converse and communicate with humans in natural language in an “everyday” manner?

With an abundance of unstructured data, and with internet traffic growing faster than the current network will be able to carry by 2013, we almost compulsively expect to extract useful information continuously to enhance personal and business decisions. Additionally, as the accumulation of data in proprietary databases and data repositories increases, it is essential to find more efficient ways of making information retrieval and data usage super-accessible.

A senior manager at a bank wants to obtain information about clients’ aggregated personal circumstances and financial needs from a large repository of “unstructured” data. How does he or she know what to ask in order to identify the most relevant information? A new or extended bond for a client could be on offer if the manager knew of a planned home move by the client, for instance. Study loans could be on offer for children requiring further education, or a larger insurance package suggested if the manager knew that the client’s existing insurance was inadequate. The unstructured repository causes difficulties because the same question can be phrased in many different ways, with different grammatical structures and semantic word combinations. Frustratingly, the retrieval results always differ according to the options selected.

Current semantic technologies use extensive ontologies and categorisation systems, the mere design of which leads to severe “permutation” problems. Even the smallest, most simply structured body of text can generate trillions upon trillions of grammar-semantic equivalents. Given the mathematics, it is not surprising that we still do not have the concept-combination multiplication power to adequately address the permutation problem.

What is needed is a compact and powerful representational matrix that can act as interpreter between human language and data, generating the trillions of concept permutations that adequately represent our real world. A machine that is in this way able to “see” could ensure that all of our human world concept combinations and language patterns relate to everything else logically and realistically in a stored electronic format.

Coming back to George Bush and Einstein’s relativity theory: a semantic “engine” that can easily integrate language patterns from totally disparate textual sources could, in a corresponding manner, enable us to create multiple equivalent searches across the whole universe of search engine and proprietary database queries. Searching for a needle in a haystack using a thousand magnifying glasses is a handy analogy for the trillions of concept permutations that can in this way be provided to meaningfully and adequately represent any mass of unstructured data.
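
A small Python sketch gives a feel for what “multiple equivalent searches” could look like. The phrase table, the sample query and the function name expand_query are hypothetical and written by hand for this example; the whole point of the engine described here is that such multiword equivalents would not have to be listed manually.

```python
# Hypothetical multiword-to-multiword phrase table (hand-written, illustrative).
# Each key maps to a list of equivalents that includes the key itself.
PHRASE_TABLE = {
    "moving house": ["moving house", "planning a home move", "relocating"],
    "needs a loan": ["needs a loan", "is looking for finance"],
}


def expand_query(query, table=PHRASE_TABLE):
    """Return every rewrite of query obtained by swapping known phrases."""
    variants = [query]
    for phrase, alternatives in table.items():
        expanded = []
        for variant in variants:
            if phrase in variant:
                # Swap the whole multiword phrase for each equivalent in turn.
                expanded.extend(variant.replace(phrase, alt) for alt in alternatives)
            else:
                expanded.append(variant)
        variants = expanded
    return variants


queries = expand_query("client moving house who needs a loan")  # 3 * 2 rewrites
```

Each rewrite can then be sent to the same search engine or database, which is the “thousand magnifying glasses” idea at a very small scale.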

The application is definitely underpinned by a need, as most search engines are still three-word, caveman-speak search phrase solutions limited by keywords, at best only interchangeable with synonyms or related words. Effective multi-word to multi-word exchange technology is almost non-existent.

If successful, we might soon be able to expand beyond, and finally say goodbye to, classic keyword search and information retrieval. The immediate goal is the “humanising” of cyber dataspace, with seamless application of a semantic technology that enables mankind to converse and communicate effectively with data across a wide information spectrum.

 To see how George Bush subtly explains Einstein’s relativity, have a look at…