While working on my master thesis, I had a huge text corpus from which I wanted to filter out texts with bad grammar. After digging through some papers, I found what I needed: language-tool-python. It is a Python wrapper around LanguageTool, which might be better known as the open-source spell and grammar checker used in OpenOffice.
So in this post, we will learn how to use this tool to create a score that indicates how good the grammar of a text is. Of course, this score is not perfect, so we will also look at the limits of this method and see whether there is something better we can use.
How to install language-tool-python
First we need to install language-tool-python. After activating our environment (e.g. via conda), we run in our terminal:
```shell
pip install language-tool-python
```
The language tool can be used either with a local server, which is automatically downloaded and set up during this installation, or with a remote server, which LanguageTool offers as a hosted service.
Spell and Grammar Checking with Python
To use the language tool, we first import it and create a tool, which defines which server is used, the local or the remote one. By default, a local server is used:
Let's use this tool on a small test sentence to see how many errors we find in it. For this we use the check() function of tool:
By printing the length of matches we see how many errors were found:
Awesome! Three errors were found, which is what we would expect in this sentence. But did it count correctly and find the errors we expected it to find? Let's print the content of matches:
Perfect. As we can see in this output, all three errors are found, as we would have expected. We can also see that the python-language-tool does not only find the errors but also recommends alternatives to fix them. Just as a side note, tool also has a correct() function, which automatically fixes found grammatical errors. But beware: there is no guarantee that the correction will really improve the sentence.
Like the following example shows:
Here we can see that a man called Peter Inborn called his son Hermes. No wait, Peter's last name is Igborn and he called his son Hermse! The correction engine did not recognize that these were names and corrected them to words it knew. So we can see that there is no deeper intelligence present, as we might expect from a deep learning model. It is just a simple matching tool, although a very powerful one. This is an important fact to keep in mind! Do not expect a 100% solution; it is a tool that helps us come closer to our target. It is very useful, but not perfect.
This is also important for interpreting the score we are now about to create. We want a metric that gives us an idea of the grammatical quality of a text. What does grammatical quality mean? It means that the better the text quality is, the higher the score should be. We would expect no errors in a grammatically perfect text, so it should get the highest score, while a sentence full of errors should get a lower one. So let's put this into simple math:
score = 1 - ce / cw
In this score, we count how many errors (ce) occur in relation to the word count (cw) of the sentence. This means that a sentence with 12 words and 3 errors gets a score of 0.75, and we can state that 1/4 of the words of the sentence contain grammatical errors. Again, keep in mind that the term grammatical error here means that language-tool-python classified a word as one. As we saw above, this can be error-prone, since unknown words are classified as errors.
With this metric, we are able to compare and rank texts by their grammatical correctness. For this, I wrote this function:
This function takes a list of texts as input. It uses tqdm() to show a nice progress bar. It iterates through the list and splits each text into sentences with this method:
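The splitting method is not embedded on this page; a simple sketch of such a text_to_sentences helper (the real implementation may differ, e.g. it could use a proper sentence tokenizer):

```python
import re

def text_to_sentences(text: str) -> list:
    # Naively split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]
```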
It uses the tool, iterates through each sentence and checks it for grammatical errors. Afterwards two scores are calculated: one for the number of errors in the sentence and one that only stores whether the sentence is error-free or not. All texts are iterated this way, and in the end the function calculates the mean and variance over the texts (to make a general statement about the quality of the data source) and outputs a list of scores for each score type.
As language-tool-python uses a Java backend, the scoring can unfortunately be pretty slow. Using the GPU is also not supported. This is definitely something that could be improved.
This way you can compare the texts in your text corpus and filter out those that are of poor quality. If you want to know how to do this, let me know!
Thank you for reading. Was this topic interesting for you? Let me know in the comments!
Hi, great post… did you ever find a solution to this problem?
https://stackoverflow.com/questions/63284471/tensorflow-use-model-inside-another-model-as-layer
Hi, thank you for your feedback!
I did not look further into this topic, as it was not crucial for my tasks.
I think you can use two approaches:
1. Use a Lambda layer and call predict on the model you want to embed into the other model. Possibly you can even apply backprop if you set training=True on this model, but I am not sure about that.
2. Take the layers of the model and embed them layer by layer inside your model.
However, there is one answer to the question which I could not check so far, as I am very busy at the moment. Maybe you want to test that answer and give feedback?
Can I get the full version of this code?
I am getting a few errors which I'm not able to figure out.
You can see all available code in this post. Which error do you get?
I am getting an error which says “helpers is not defined”
I see, thanks a lot!
I updated the gist. This only concerns the calls of text_to_sentences.