A New Reader in Jane Austen Book Club

A random walk in Natural Language Processing (NLP) with Python & spaCy

Photo by davisuko on Unsplash

People join book clubs to make new friends or to gain new perspectives. When computers claim to be capable of reading a Jane Austen novel in seconds, I am curious to find out what the computer actually understands, and what insight it can offer to the reader communities.

spaCy is such a program: an amazing tool for Natural Language Processing (NLP). Research into NLP started as early as the 1950s, when the first automatic machine translation system, developed by Georgetown University and IBM, ran on mainframe computers. Since then, NLP has come such a long way that it now works well on laptops, and it can conveniently serve us as a reading companion.

Let’s read Jane Austen’s Pride and Prejudice together with spaCy. I hope spaCy can meet our expectation of being knowledgeable and insightful. At least I already know that spaCy is very quiet, patient, and always ready to answer our questions.

The digital version of this book is freely available at Gutenberg.org, which offers ePub and Kindle versions for download and HTML for online reading. For working with spaCy, the plain text version is the best choice.

As any Jane Austen fan knows, Austen completed only six novels before she died at the age of 41. As of today, all six novels have been adapted as feature films or TV series multiple times, and Jane Austen reader communities can be found all around the world. Why has Jane Austen become a household name two hundred years after her death?

Jane Austen Book Club | Movie Trailer | Source: IMDB

According to the BBC, Prof John Mullan, Lord Northcliffe Professor of Modern English Literature at University College London, said Austen’s enduring appeal boils down to one thing: her writing.

Let’s analyze one of Jane Austen’s classic novels. The story is always compelling, but this time let’s focus on the actual words Austen used.

Chances are that you have read the book, perhaps more than once. I invite you to explore this book again and see what a computer can tell us. Do not be bothered by the details of coding if you are not a programmer. I will explain and visualize the findings.

spaCy is open source, developed and maintained by explosion.ai. It runs on most computers. Let’s read the book with spaCy. If you know Python and would like to follow along, installing spaCy is easy. A quick start guide can be found on its official site; installing the package and downloading an English language model is usually all it takes.

We need to clean up the text file a little to get rid of the publisher’s notice and index, as those are not Austen’s actual writing. After that, let’s run Python and load spaCy. Reading the entire novel only takes a few seconds.
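A minimal sketch of this clean-up and loading step follows. It assumes the plain text file still carries Project Gutenberg's usual "*** START OF" and "*** END OF" marker lines, and that the small English model en_core_web_sm is installed; the function names and the file name in the usage note are hypothetical. The helper only removes the Gutenberg header and license footer, so a table of contents or index inside the book would still need trimming by hand.

```python
# One-time setup in a terminal (assumed; adjust the model name to taste):
#   pip install spacy
#   python -m spacy download en_core_web_sm
from pathlib import Path

# Assumed marker prefixes used by Project Gutenberg plain text files.
START_MARK = "*** START OF"
END_MARK = "*** END OF"


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END marker lines, if present."""
    start = raw.find(START_MARK)
    if start != -1:
        # Skip past the end of the marker line itself.
        newline = raw.find("\n", start)
        raw = raw[newline + 1:] if newline != -1 else ""
    end = raw.find(END_MARK)
    if end != -1:
        raw = raw[:end]
    return raw.strip()


def load_book(path: str, model: str = "en_core_web_sm"):
    """Parse the cleaned novel with spaCy and return the resulting Doc."""
    import spacy  # imported lazily so the cleaning helper works without spaCy

    nlp = spacy.load(model)
    text = strip_gutenberg_boilerplate(Path(path).read_text(encoding="utf-8-sig"))
    return nlp(text)
```

With a local copy of the novel saved as, say, pride_and_prejudice.txt, calling load_book("pride_and_prejudice.txt") would return the parsed document that the rest of the walk-through refers to as doc.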

Now spaCy has loaded the entire book in memory. Let’s ask our first question to test what spaCy knows.

What is the book about?

spaCy can run a quick query on proper nouns, which will point us to an answer. We already know that all of Jane Austen’s books are about people, so the results should contain the familiar names of her characters, just as we expect.

Let’s clean this up a little, as we do not really want to count words like Mr. or Mrs.
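One way such a query could look is sketched below. It relies on spaCy's token attributes text and pos_, where the tag "PROPN" marks proper nouns; the function name is hypothetical, and doc stands for the parsed novel.

```python
from collections import Counter


def count_proper_nouns(tokens):
    """Count the surface forms of all tokens tagged as proper nouns."""
    return Counter(tok.text for tok in tokens if tok.pos_ == "PROPN")
```

Calling count_proper_nouns(doc).most_common(50) would then list the fifty most frequent names together with their counts.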
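A small filter along these lines would do, assuming the counts are held in a Counter keyed by name. The list of honorifics is an illustrative guess, not the one used for the charts in this story.

```python
from collections import Counter

# Assumed list of honorifics to ignore; extend it as needed.
TITLES = {"mr", "mrs", "miss", "lady", "sir", "colonel"}


def drop_titles(counts, titles=TITLES):
    """Return a new Counter without honorifics such as "Mr." or "Mrs."."""
    return Counter({
        name: n
        for name, n in counts.items()
        # Compare without the trailing period and ignoring case.
        if name.rstrip(".").lower() not in titles
    })
```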

Rebecca Xu from Syracuse University created all the visualizations in this story. Here’s a treemap of the top 50 proper nouns.

Not surprisingly, Elizabeth appears most frequently. Bennet is her last name, but it could also refer to her sisters or parents, so it is hard to attribute. We should at least add Lizzy to her count, which brings the total to 730. The leading female character is mentioned 76% more often than the leading male character, Mr. Darcy.

Studies show that sentence length matters for readability: sentences are easiest to follow when their length stays within a moderate range.
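Folding a nickname into the main name and comparing two characters can be sketched as follows. The helper names are hypothetical, and the numbers in the usage note are toy values rather than the book's actual counts.

```python
from collections import Counter


def merge_aliases(counts, aliases):
    """Add the count of each alias (e.g. "Lizzy") to its canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


def percent_more(a, b):
    """Return by how many percent a exceeds b."""
    return (a - b) / b * 100
```

For example, merge_aliases(Counter({"Elizabeth": 6, "Lizzy": 2, "Darcy": 4}), {"Lizzy": "Elizabeth"}) gives Elizabeth a total of 8, and percent_more(8, 4) reports 100.0.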

With spaCy, it is very easy to find out how Jane Austen did here. When spaCy reads the book, it already keeps track of the sentences and total word counts.

Pride and Prejudice has 156,644 words, written in 7,523 sentences. The average number of words in a sentence is 21.6. Jane Austen seemed to be ahead of her time here, as her sentences were kept close to the ideal length.
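A rough sketch of how such figures can be gathered is shown below, using spaCy's doc.sents iterator and the token flags is_punct and is_space. What counts as a word is an assumption here (punctuation and whitespace tokens are skipped), and the result is sensitive to that choice: dividing the two totals quoted above directly gives roughly 20.8, so the reported average presumably rests on a slightly different tally.

```python
def sentence_stats(sentences):
    """Return (word_count, sentence_count, average words per sentence).

    `sentences` is any iterable of token sequences, such as doc.sents.
    """
    n_words = 0
    n_sents = 0
    for sent in sentences:
        words = [tok for tok in sent if not (tok.is_punct or tok.is_space)]
        if words:  # skip spans that hold only punctuation or whitespace
            n_words += len(words)
            n_sents += 1
    average = n_words / n_sents if n_sents else 0.0
    return n_words, n_sents, average
```

With the parsed novel in doc, sentence_stats(doc.sents) returns the three numbers in one pass.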

Let’s find the top 100 nouns used by Jane Austen.
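A possible way to build this list is sketched below. It counts lemmas rather than surface forms, so that sister and sisters are tallied together; that choice, like the function name, is an assumption. The attributes lemma_, pos_ and is_alpha are standard spaCy token attributes.

```python
from collections import Counter


def top_lemmas(tokens, pos, top_n=100):
    """Return the top_n most frequent lemmas for one part-of-speech tag."""
    counts = Counter(
        tok.lemma_.lower()
        for tok in tokens
        # is_alpha filters out numbers and stray symbols.
        if tok.pos_ == pos and tok.is_alpha
    )
    return counts.most_common(top_n)
```

top_lemmas(doc, "NOUN") yields the noun ranking; passing "ADJ", "VERB" or "ADV" instead gives the other word classes discussed later.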

Word Frequency: Nouns in Pride and Prejudice | Image by Rebecca Xu & Sean Zhai

On the list of Austen’s top 100 nouns, sister ranks highest, which is expected for a novel depicting sisterhood. Man (#3) is listed before friend (#5), family (#6), mother (#9), daughter (#10) and father (#12); this is a story about finding the proper man.

Time ranks second; day (#4), afternoon (#18), morning (#21), moment (#23) and evening (#27) are all time-related.

Letter (#11) is important. Feeling (#15) is used more often than pleasure (#17); happiness (#31) and love (#32) are ahead of marriage (#41).

Let’s analyze adjectives next as they are closely related to nouns.

Word Frequency: Adjectives in Pride and Prejudice | Image by Rebecca Xu & Sean Zhai

Here’s what we get when we pull the list of most common verbs.

Let’s visualize this on a treemap.

Word Frequency: Verbs in Pride and Prejudice | Image by Rebecca Xu & Sean Zhai

Say ranks at the top, and there are also speak (#10), tell (#12), reply (#19), talk (#22), talk (#28) and ask (#34). Jane Austen loves conversations.

Think ranks as high as #3. See (#5), hear (#7), feel (#9) and look (#11) are also pretty high.

Believe (#15), hope (#18) and marry (#23) make the list as well.

After analyzing verbs, let’s take a look at adverbs.

Let’s put top-ranked nouns, verbs, adjectives and adverbs together to get an overall sense of Jane Austen’s vocabulary.

Word Frequency in Pride and Prejudice | Image by Rebecca Xu & Sean Zhai

I put the source code on GitHub for easy reference. The code writes the results to CSV files, which can be used for further analysis and visualization.
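The export step can be sketched with Python's standard csv module. The column names and the function name are assumptions, not necessarily those used in the repository.

```python
import csv


def write_counts_csv(rows, path):
    """Write (word, count) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(rows)
```

Any list of (word, count) pairs, such as the output of Counter.most_common, can be passed in as rows.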

After using spaCy for a while, I genuinely felt like it had become a friend of mine; on a leisurely afternoon, I would like to take spaCy for a random walk and see what it can do for me.

If you like Jane Austen’s books and have a question, please respond to this story. I will see if I can answer your questions with the help of spaCy.

Cheers.
