The Algorithm of Books: The Mistake of Applying Mathematical Equations to Literature

The world of publishing is a game of chance. The game is played whether you are a writer vying for publication, a publisher taking a gamble on a writer’s work, a bookstore ordering copies of this writer’s work to put on display, or a reader taking a chance on the published work of a new or at least previously unfamiliar writer. Life, like publishing, is also a game of chance, and some gambles have better odds than others. At times, depending on the nature of the gamble, you can foresee the outcome of a decision or an action of your choosing. But not always.

Part of the beauty and the angst of this thing called life is, in fact, the mystery. The unknown. The chances. Delving into the unknown shapes us. It marks and it mars us in business and in our everyday lives. And publishing, like life, has as much risk as a high-stakes game of poker. In the realm of publishing there is more at stake than the individual. Placed in a precarious position and at risk in the gamble are the jobs of the editors at the publishing house, the reputation of the writer, the success of the bookstore and the wooing of the reader.

Because of all those whose well-being stands on the edge, choosing a work that will be successful in the marketplace is of primary importance. Money makes the world go round. Therefore, there must be a more efficient way to get rid of the “duds” and maximize the success of what is sadly becoming a (somewhat) dwindling market: the book.

What could this solution possibly be? Well, some very smart math people figured out an algorithm to determine what qualities the most popular titles have in common.

There are formulas to determine chemical reactions, formulas to determine area and perimeter, and algorithms to determine the probability of an infinite number of things. But this time, math has gone too far. The world of writing and literature is not a game of numbers, and I very much dislike the fact that they are bringing numbers into this sacred realm. Sure, there are people who crunch the numbers to determine how much of an advance a writer should get, how much a book should cost based on the number of pages it occupies, and whether the book should appear in hardcover or paperback or both. I’m not in denial of the importance of math, numbers or money. But an algorithm to determine the potential success of a book? Seriously?

According to a recent gizmodo.com article titled “Will Your Novel Be A Best Seller? Ask this Super Accurate Algorithm,” this algorithm can determine the potential success of a book with an 84% accuracy rate. Researchers from Stony Brook University have been working to develop a system called “Statistical Stylometry,” which can mathematically examine words and grammar in books. Utilizing Project Gutenberg, these researchers have tested their algorithm on literary works of the past that have found success. If you want to read about the study in a little more depth, see the abstract of the paper, titled “Success with Style: Using Writing Style to Predict the Success of Novels.”

While I began this post with some philosophizing about the risks of life, and while I do in fact respect the ingenuity of these researchers in devising this algorithm and their interest in literature in general, I still take issue with this situation. My complaints are threefold. First, I find the use of math to determine the success of literature to be rather insulting. A mathematical equation to explain the ebbs and flows of language is ridiculous and limiting. Second, from what I’ve read of the study, they do not take into account the changes that take place in popular language over time. And third, they do not take into account the importance of those failed works in the grand scheme of things.

According to the study, “there are potentially many influencing factors, some of which concern the intrinsic content and quality of the book, such as interestingness, novelty, style of writing, and engaging storyline, but external factors such as social context and even luck can play a role” (http://aclweb.org/anthology/D/D13/D13-1181.pdf). This is all true and from the intro, one can assume that their intent is to reduce the rate of rejection for works that are diamonds in the rough, meaning: those works that initially faced numerous rounds of rejection before being picked up and then subsequently becoming best sellers.

So, let’s delve into a couple of the factors that the algorithm takes into consideration and the conclusions the researchers drew from the results. For their sample, the researchers pulled the first 1000 sentences from each text and used various markers, such as part-of-speech tags, to determine the voice and style of the writer. This is a good sample and a good idea. But they don’t seem to take into account the different styles of speech and writing of previous time periods. Cadences and sentence structure vary not only from writer to writer, but also across time periods and locations. They may address this issue further in the study, but the abstract did not have any indicators of how this was handled.

Some characteristics of successful versus unsuccessful works according to the study are as follows: “prepositions, nouns, pronouns, determiners and adjectives are predictive of highly successful books whereas less successful books are characterized by higher percentage of verbs, adverbs, and foreign words” (http://aclweb.org/anthology/D/D13/D13-1181.pdf). Thus, descriptiveness in a work is important. Creating a scene that the reader can taste, touch and feel is a big part of the writing bit. I can get behind this. Additionally, “more successful books use discourse connectives and prepositions more frequently, while less successful books rely more on topical words that could be almost cliché, e.g., “love”, typical locations, and involve more extreme (e.g., “breathless”) and negative words (e.g., “risk”)” (http://aclweb.org/anthology/D/D13/D13-1181.pdf). Thus, according to this algorithm, works that use language that fits into a certain range did not and will not achieve success in the marketplace.
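To make the features quoted above a little more concrete, here is a minimal sketch in Python of the kind of counting involved: take the opening sentences of a text and measure what share of the words falls into each part-of-speech category. This is not the researchers’ code. The tiny `TOY_LEXICON`, the naive sentence splitter and the function names are all illustrative assumptions of mine; a hand-made word list is only a stand-in for real part-of-speech tagging.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {
    "the": "determiner", "a": "determiner",
    "in": "preposition", "on": "preposition", "of": "preposition",
    "she": "pronoun", "he": "pronoun", "it": "pronoun",
    "ran": "verb", "sat": "verb", "was": "verb",
    "quickly": "adverb", "very": "adverb",
    "old": "adjective", "quiet": "adjective",
    "house": "noun", "hill": "noun", "cat": "noun",
}

def first_sentences(text, limit=1000):
    """Naively split after ., ! or ? and keep the first `limit` sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return sentences[:limit]

def tag_shares(text, limit=1000):
    """Return the share of words in each category among the sampled sentences."""
    counts = Counter()
    for sentence in first_sentences(text, limit):
        for word in re.findall(r"[a-z']+", sentence.lower()):
            # Words missing from the toy lexicon are counted as "unknown".
            counts[TOY_LEXICON.get(word, "unknown")] += 1
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}

sample = "The old cat sat on the hill. She ran quickly. It was very quiet in the house."
shares = tag_shares(sample, limit=2)  # only the first two sentences are sampled
for tag, share in sorted(shares.items()):
    print(f"{tag}: {share:.0%}")
```

A classifier would then compare shares like these across many books. Notice that nothing in the counting knows when or where a book was written, which is exactly the gap I worry about.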

But then the study makes a counter-intuitive point. “Successful books tend to bear closer resemblance to informative articles” (http://aclweb.org/anthology/D/D13/D13-1181.pdf). They don’t elaborate on why this may be so. I hypothesize that this could in fact be true according to their study because informative and journalistic articles are associated with a particular grade level of reading. The easier the reading level, the more mass appeal a book will have. This is certainly not an indicator that it is actually worthy of best-seller status, in my opinion. There are numerous poorly written books that became best sellers and that, from a literary perspective (and yes, I am a bit of a literary snob), do not actually deserve the recognition that they receive.

But the surprise best seller is just as important as the dud, or the book that seems to be a dud based on the algorithm. They are important because they say something about life. They say something about what is going on in popular culture, about what is important at this particular juncture in time. If you cut them out with an algorithm, you are losing an important part of literary history.

These bad titles and these unexpected surprises are important, and you can see this in the way that literary studies has been working to change what it preserves and how it educates its students. In the past, literary studies focused on the big authors, the writers who are the quintessential examples of popularity in their times. But then scholars began to turn their attention to the smaller writers, those who may or may not have been prolific. Their stories have value too. They teach us about the time period, they teach us about the everyday writer. They teach us about the different avenues that writing can take. All of these factors are important, and without these works, a whole history would have been lost to us.

What would the world of publishing look like if they adopted this algorithm for their business? I’ve already talked about the slush pile. But, perhaps even more significantly, what happens to the 16% of works that would have been successful, potentially game- and life-changing books that never see the light of day and get trashed with the rest of the slush pile simply because a person saw a glimmer of possibility in the prose and then plugged the text of the work into this algorithm and received a negative result? We could lose the next great American novel simply because a writer took some liberties and played with text and language in a way that simply did not compute.
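For a rough sense of what a figure like 84% accuracy could mean in a slush pile, here is a small back-of-the-envelope sketch in Python. Every number except the 84% is a made-up assumption: the pile of 1,000 manuscripts, the guess that 50 of them would have sold well, and the simplification that the algorithm is right 84% of the time on successful and unsuccessful books alike.

```python
def slush_pile_outcomes(total, would_succeed, accuracy):
    """Split a pile of manuscripts into the four possible outcomes.

    Assumes (for illustration only) that the same accuracy applies to
    books that would succeed and to books that would not.
    """
    would_fail = total - would_succeed
    return {
        "hits_kept": round(would_succeed * accuracy),
        "hits_rejected": round(would_succeed * (1 - accuracy)),
        "duds_rejected": round(would_fail * accuracy),
        "duds_kept": round(would_fail * (1 - accuracy)),
    }

# Hypothetical slush pile: 1,000 manuscripts, 50 of which would have sold well.
outcomes = slush_pile_outcomes(total=1000, would_succeed=50, accuracy=0.84)
for name, count in outcomes.items():
    print(f"{name}: {count}")
```

Under those assumptions, 8 of the 50 would-be successes land in the reject pile, while 152 duds get waved through. The exact split depends entirely on the assumed numbers, but the point stands: some books worth reading would never be seen.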

Computers and mathematical equations do a plethora of amazing things. But an equation to determine the success of a book is just not something I can accept. Literature is about art, about teaching and delighting (to use the words of the ancients), and while the publishing world, for those who work in it, is primarily about making money, it will be a very sorry day for all of us if something like this algorithm ever actually becomes a part of their business practice and is used as anything other than entertainment.

5 thoughts on “The Algorithm of Books: The Mistake of Applying Mathematical Equations to Literature”

1. aliveatnight says:

I stand behind you in this one. So much randomness is involved that you simply cannot do something like this. Thanks for the interesting read!

2. Senthil says:

I am not sure what you are summarizing here. You do set the stage that this is NOT something that can be done. But then go on and on about how certain facts that the algorithm discovered are true. You also acknowledge that we should NOT use the algorithm to “correctly” disregard (not publish) books that have a low sales/rank prediction. It appears you DO in fact recognize that there is a formula, but just detest the use of it. At least that is what I seem to understand and feel free to correct me if I am wrong.

I will have to say, you are in the league of those who thought a computer can never ever ever beat a Chess Grandmaster. But then again, I think you just don’t ever want such a machine invented as you are just not ready for it or feel the world is not ready for it.

Here is how I would encourage you to look at it. When books come up with low ranks, there is a good reason behind it as per the algorithm. Simply put, a book with great content may have a difficult reading style. In fact you go on to explain how certain other facts have NOT been taken into account by the algorithm. Those are excellent observations, and ones that I would think should and can be incorporated into future versions of this program/research. Once those are completed we will have a “method” to write books in a way that most people would appreciate the writing itself. If everyone starts following the approach, then we would take book writing to the next level. Here is an automated “tutor” at work to help you write books that people will appreciate in terms of style. Given that NEW baseline, people can then write their creative or informative books that will then become (or not become) the best seller based entirely on relevance to the topic/audience/time period/etc.

1. I am someone who approaches this not as someone who cares only for the monetary gains to be had from authors and their works.

Instead, I approach this concept from the perspective of a writer and as someone who has a background in literature. And while I respect the numbers and the algorithm that these researchers have developed, I do not think it is a good way for the writing world to select works for publication. There are important nuances to be found in both popular and unpopular works and not every nuance can be calculated into a formula. There are just too many. And writing styles and objectives change all the time. Language is fluid.

And why would anyone want a “method” to write a book? That feels like brainwashing to me. The point of art is to be different and unique, not to follow a fill-in-the-blank, paint-by-numbers method. At least, I don’t think it is.

As far as a computer beating the chess grandmaster, I have no issue with that and certainly believed it to be possible. I never said that the computer couldn’t do the algorithm. I just think that many important things are left out if we follow this route.