Scoring Texts by their grammar in Python




While working on my master's thesis, I had a huge text corpus from which I wanted to filter out texts with bad grammar. After digging through some papers, I found what I needed: language-tool-python. Its underlying engine, LanguageTool, may be better known as the spell and grammar checker of OpenOffice, and it is available as open source.

So in this post, we will learn how to use this tool to create a score that indicates how good the grammar of a text is. Of course, this score is not perfect, and we will have a look at the limits of this method and maybe see if there is something better we can use.

How to install language-tool-python

First we need to install language-tool-python. After activating our environment via conda, we run in our terminal:
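The install command itself is missing above; the standard way is via pip from PyPI:

```shell
# Run inside the activated conda environment:
pip install language-tool-python
```

Note that the package needs a Java runtime on the machine, since LanguageTool itself runs on the JVM.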

The language tool can use either a local server, which is automatically downloaded and set up on first use, or a remote server, which LanguageTool offers as a service for enterprises.

Spell and Grammar Checking with Python

To use the language tool, we first import it and create a tool object, which defines whether the local or the remote server is used. By default, a local server is used:

Let's use this tool on a small test sentence to see how many errors it finds. For this we use the check() function of the tool:

By printing the length of matches, we see how many errors were found:

Awesome! Three errors were found, which is what we would expect in this sentence. But did it count correctly and find the errors we expected it to find? Let's print the content of matches:

Perfect. As we can see in this output, all three errors were found, just as we expected. We can also see that language-tool-python does not only find the errors but also recommends alternatives to fix them. As a side note, the tool also has a correct() function, which automatically fixes the grammatical errors it finds. But beware: there is no guarantee that the correction will actually improve the sentence.

As the following example shows:

Here we can see that a man called Peter Inborn named his son Hermes. No wait, Peter's last name is Igborn, and he named his son Hermse! The correction engine did not recognize that these were names and replaced them with words it knew. So there is no deeper intelligence at work here, as we might expect from a deep learning model. This is just a pattern-matching tool, although a very powerful one. This is an important fact to keep in mind! Do not expect a 100% solution; it is a tool that helps us get closer to our target. It is very useful, but not perfect.

This is also important for interpreting the score we are about to create. We want a metric that gives us an idea of the grammatical quality of a text. What does grammatical quality mean? It means that the better the text is, the higher the score should be. We would expect no errors in a grammatically perfect text, and a lower score the more errors a sentence contains. So let's put this into simple math:

score = 1 - \frac{c_e}{c_w}

In this score, we count how many errors (c_e) occur relative to the word count (c_w) of the sentence. For a sentence with 12 words and 3 errors, the score is 0.75, so we can say: 1/4 of the words in the sentence contain grammatical errors. Again, keep in mind that "grammatical error" here means that language-tool-python classified a word as one; as we saw above, this can be error-prone, since unknown words are classified as errors.
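The formula translates directly into a small helper (a sketch; the function name is mine):

```python
def grammar_score(error_count: int, word_count: int) -> float:
    """score = 1 - c_e / c_w"""
    return 1 - error_count / word_count

# 3 errors in a 12-word sentence -> 0.75
print(grammar_score(3, 12))
```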

With this metric, we are able to compare and rank texts by their grammatical correctness. For this, I wrote this function:

This function takes a list of texts as input and uses tqdm() to show a nice progress bar. It iterates through the list and splits each text into sentences with this method:
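The splitting method itself is not shown in the post; a simple regex-based sketch (the original may well have used a library such as nltk instead) could look like this:

```python
import re

def split_into_sentences(text):
    # Split after sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(split_into_sentences("First sentence. Second one! A third?"))
# -> ['First sentence.', 'Second one!', 'A third?']
```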

It uses the tool to check each sentence for grammatical errors. Afterwards, two scores are calculated: one for the number of errors in the sentence and one that only stores whether the sentence is error-free or not. All texts are iterated this way, and in the end, the function calculates the mean and variance over the texts (to make a general statement about the quality of the data source) and outputs a list of scores for each metric.
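The original function is not preserved, so here is a reconstruction from the description above. It is a sketch: the names are mine, the checker is passed in as a parameter (anything with a check() method works, e.g. a language_tool_python.LanguageTool instance), and tqdm is optional:

```python
import re
import statistics

try:
    from tqdm import tqdm  # nice progress bar, as in the original
except ImportError:        # fall back to a plain loop if tqdm is missing
    def tqdm(iterable, **kwargs):
        return iterable

def split_into_sentences(text):
    # Naive sentence splitter; the original method may have differed.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def score_texts(texts, tool):
    """Return two score lists: error-ratio scores and error-free-sentence scores."""
    error_scores = []    # per text: mean of (1 - errors/words) over its sentences
    perfect_scores = []  # per text: fraction of completely error-free sentences
    for text in tqdm(texts):
        ratio_scores, error_free = [], []
        for sentence in split_into_sentences(text):
            words = sentence.split()
            if not words:
                continue
            errors = len(tool.check(sentence))
            ratio_scores.append(1 - min(errors, len(words)) / len(words))
            error_free.append(1.0 if errors == 0 else 0.0)
        error_scores.append(statistics.mean(ratio_scores) if ratio_scores else 0.0)
        perfect_scores.append(statistics.mean(error_free) if error_free else 0.0)
    # A general statement on the quality of the data source:
    print("mean:", statistics.mean(error_scores),
          "variance:", statistics.pvariance(error_scores))
    return error_scores, perfect_scores
```

Because the checker is injected, the function can be tested with a stub and later run with the real LanguageTool object.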

As language-tool-python uses a Java backend, the scoring can unfortunately be pretty slow. Using the GPU is also not supported. This is definitely something that could be improved.

This way you can compare the texts in your corpus and filter out those of poor quality. If you want to know how to do this, let me know!

Thank you for reading. Was this topic interesting for you? Tell me in the comments!
