Charming the word snake – Terminology work and language checking with Python
Have you ever extracted technical terms from hundreds of documents? Have you checked dozens of topics for violations of writing rules? And have you felt that these tasks could be automated, but enterprise-grade software was too expensive?
In just a few days, you can implement powerful terminology tools, using free open-source software. In this tutorial, we will show you how.
Part 1 – From the terminological basis to the tools
We will use the words “concept” and “term”. If you have worked with terminology already, you will know the triangle of meaning. Concept and term are connected to an object:
With object, we mean any material or immaterial matter, event, or circumstance. This could be a mouse like the one in the picture.
With concept, we mean a notion or an idea that represents the object in our mind. When we hear the word “mouse”, we can think of two things: the tiny cheese-loving animal or a computer mouse.
With term, we mean the word we use for the object in a particular language, for example, for the animal: Maus (German) or mouse (English).
It is important to make this distinction because one term (“mouse”, for example) can belong to multiple concepts. Also, an object can be represented by multiple terms (“USB stick”, “flash drive”, “USB flash memory”).
Stages of terminology projects
Terminology projects usually follow a process, starting with the planning of the project and ending with publication and ongoing maintenance.
In this tutorial, we will skip the planning phase and begin right away with creating a first draft of a terminology list. After the extraction, when we know more about the terminology issues and technical questions, we can catch up on the planning. This would include selecting tools and identifying expert colleagues, for example, from product management, technical departments, and quality control.
Here, we’ll concentrate on phase 2 “Extraction” and the beginning of phase 3 “Standardization”.
Useful metadata for standardization
A terminology list usually consists of two levels: term and concept. Within these levels there are different categories, such as ID, definition, or source.
Our work begins at term level. The words we marked red in Figure 3 are the categories for our do-it-yourself approach, the automatic term extraction. An extraction on term level gives us a first list of terms including more information, such as part of speech and context information.
The next important step is assigning each term to a concept. If we want to decide which term is allowed and which term is forbidden, we must share a common idea of their meaning, that is, the concept the terms represent. Therefore, we add a definition on concept level. We then assign one allowed term and possibly one or more forbidden terms to the concept we have defined.
Tip: Finding definitions in open-source dictionaries can save time and provide a basis for discussions.
The following list is an example of what a terminology list in MS Excel could look like. We decided to have one column per data category. We have defined allowed values and formatting, for example, colors for the status. We also use unique IDs for the different concepts.
With our DIY approach, you can generate a part of your terminology list automatically, like term, context, and grammatical information such as part of speech.
Other data categories such as the subject field, status information, and comments must be completed manually by someone who knows the subject fields and the terminology.
These are the tools we use for our project:
- Python is our programming language. It is suitable for professional use and easy to read and learn.
- Python is an interpreted language. This means that you can simply write a script and immediately run it. You do not have to worry about compilers and such.
- spaCy is an open-source software library for advanced natural-language processing (NLP), written in the programming languages Python and Cython.
- There are lots of libraries available that make almost every common task rather simple. Some of those libraries you will see in this tutorial.
Part 2 – Extracting terminology
To show you how the automated term extraction works, we want to build a COVID-19 terminology list from the corresponding Wikipedia article.
To keep it simple, we work with one-word terms from the first paragraph of the article. This will show you how spaCy works and may inspire you to start writing your own NLP scripts. At the end of this tutorial, you’ll find a list of useful references.
This is the first paragraph of the Wikipedia article on the coronavirus pandemic:
The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), first identified in December 2019 in Wuhan, China. The World Health Organization declared the outbreak a Public Health Emergency of International Concern in January 2020 and a pandemic in March 2020. As of 27 November 2020, more than 61.1 million cases have been confirmed, with more than 1.43 million deaths attributed to COVID-19.
Step 1: We save the text as a simple string in a variable and convert the text into lowercase. This makes the process a lot easier.
Step 2: We import the spaCy library. Now all the NLP commands that spaCy provides are available to us.
The import can take a while. This small line of text loads a full model of the English language. The language model enables the behavior that we show you in this tutorial.
Step 3: We feed the raw text into the model. This returns a document object, which contains the features that we need to extract the terminology and add the metadata.
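Steps 1 to 3 could look roughly like the following sketch. The model name `en_core_web_sm` is an assumption (the tutorial does not say which English model it uses), and the helper name `make_doc` is made up for illustration. Because spaCy and the model must be installed separately, the spaCy calls sit inside a function and are only executed when you call it.

```python
# Step 1: store the raw text (shortened here) and convert it to lowercase.
RAW_TEXT = (
    "The COVID-19 pandemic, also known as the coronavirus pandemic, "
    "is an ongoing pandemic of coronavirus disease 2019 (COVID-19)."
)
text = RAW_TEXT.lower()


def make_doc(raw_text, model_name="en_core_web_sm"):
    """Steps 2 and 3: load a language model and feed the text into it.

    Assumption: spaCy and the named model are installed. The model name
    is illustrative and not taken from the tutorial.
    """
    import spacy  # imported here so the rest of the sketch runs without spaCy

    nlp = spacy.load(model_name)  # loading the model can take a while
    return nlp(raw_text)          # returns the document object


print(text)
```

With spaCy and the model installed, `doc = make_doc(text)` gives you the document object used in the following steps.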
Let’s explore some of the basic natural language processing (NLP) concepts:
- A token is a small segment of a sentence, such as a word or a punctuation mark.
- A lemma is the base form of a token. For example, if “has” is the token, the lemma is “have”.
- The part of speech (POS) is a tag that we can assign to the token and that tells us, for example, if a word is a verb or a noun.
Now, let’s look at the sentence “The dog has a wet nose.” If we feed our language model with that sentence, we see three different concepts. The following table shows the tokens, the lemmas, and the POS tags of the sentence:
The word “has” is the token, its basic form (lemma) is “have”, and the part-of-speech (POS) is an auxiliary verb.
Filtering for relevant terms
Now we’ll build the one-word term list. Let us assume that the only words relevant for us are nouns, proper nouns, and verbs.
Nouns and proper nouns usually correspond to objects. Verbs usually refer to activities. This is all highly relevant for technical documentation.
Now, we create a function, into which we can pass a token. The function tests whether the token is a noun, proper noun, or a verb.
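A minimal sketch of such a filter function is shown below. spaCy tokens expose their part of speech in the `pos_` attribute. To keep the sketch runnable without a language model, the two sample tokens are hand-written stand-ins, and the name `is_relevant` is our own choice.

```python
from types import SimpleNamespace

# Universal POS tags for nouns, proper nouns, and verbs.
RELEVANT_POS = {"NOUN", "PROPN", "VERB"}


def is_relevant(token):
    """Return True if the token is a noun, a proper noun, or a verb."""
    return token.pos_ in RELEVANT_POS


# Hand-written stand-ins for spaCy tokens (illustrative values only).
dog = SimpleNamespace(text="dog", lemma_="dog", pos_="NOUN")
wet = SimpleNamespace(text="wet", lemma_="wet", pos_="ADJ")

print(is_relevant(dog), is_relevant(wet))
```

Note that auxiliary verbs such as “has” carry the tag AUX and are therefore not treated as relevant verbs in this sketch.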
Building the initial list
To build the initial list, we go through every token in the document and decide if it is relevant or not. If it is, we add its lemma and its POS tag to our list of one-word terms.
We use lemmas because we are only interested in the base form of words. We add the POS tag to distinguish between words that can be both a noun and a verb, for example, “cause”.
Now we’ll remove all the duplicates from our list.
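Building the list and removing duplicates can be sketched as follows. The function name `build_term_list` and the sample tokens are assumptions for illustration. With spaCy, you would pass the document object instead of the hand-written list.

```python
from types import SimpleNamespace

RELEVANT_POS = {"NOUN", "PROPN", "VERB"}


def build_term_list(tokens):
    """Collect (lemma, POS) pairs of relevant tokens without duplicates."""
    pairs = [(t.lemma_, t.pos_) for t in tokens if t.pos_ in RELEVANT_POS]
    # dict.fromkeys keeps the first occurrence of each pair and its position.
    return list(dict.fromkeys(pairs))


# Hand-written stand-ins for spaCy tokens (illustrative values only).
tokens = [
    SimpleNamespace(lemma_="pandemic", pos_="NOUN"),
    SimpleNamespace(lemma_="cause", pos_="VERB"),
    SimpleNamespace(lemma_="by", pos_="ADP"),
    SimpleNamespace(lemma_="pandemic", pos_="NOUN"),
]
terms = build_term_list(tokens)
print(terms)
```

Storing the POS tag next to the lemma means that “cause” as a noun and “cause” as a verb would remain two separate entries.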
Checking the results
The word “coronavirus” appears twice in the table we get, once as a noun and once as a proper noun. This is a hint that the proper noun is part of a compound noun. In this case, it might be “coronavirus pandemic” or “coronavirus disease”.
Creating a list of sentences and their lemmas
But wouldn’t it be nice to get some context for the terms? The most obvious idea is to pick a term and look for a sentence in the original document that contains that term.
We now add a sample sentence from the original text to every term in our list.
We will use spaCy’s built-in sentencizer to iterate through all the sentences in the document. To do that, we need a list of all the sentences and their respective terms.
To build a list of all the sentences in the text, including the lemmas that the sentences contain, we proceed as follows:
For every term in our list, we go through all the sentences until we find one that contains that term. Once we have found such a sentence, we add it to the entry.
The first entry we got is a list with the corresponding part-of-speech tags:
Now we can use this data structure to go through every term in our term list and check which sentence contains that term.
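One way to sketch this lookup is shown below. The first function relies on spaCy’s `doc.sents`, which yields the sentences of a document. The second function works on plain Python data, so we can demonstrate it with a small hand-written index. Both function names and the sample sentences are assumptions for illustration.

```python
def sentences_with_lemmas(doc):
    """Return one (sentence text, set of lemmas) pair per sentence."""
    return [(sent.text, {token.lemma_ for token in sent}) for sent in doc.sents]


def find_sample_sentence(lemma, sentence_index):
    """Return the first sentence whose lemma set contains the term."""
    for sentence_text, lemmas in sentence_index:
        if lemma in lemmas:
            return sentence_text
    return None  # no sentence contains the term


# Hand-written index standing in for sentences_with_lemmas(doc).
index = [
    ("the outbreak was declared a pandemic.",
     {"the", "outbreak", "be", "declare", "a", "pandemic", "."}),
    ("many deaths were attributed to covid-19.",
     {"many", "death", "be", "attribute", "to", "covid-19", "."}),
]
print(find_sample_sentence("attribute", index))
```

Comparing lemmas instead of surface forms is what lets the term “attribute” match a sentence that only contains “attributed”.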
Checking the results again
We look at the term “attribute” with the POS tag VERB. Below it is the sample sentence containing “attributed”, the past participle of “attribute”. So now we have a sample sentence for every single term in our term list.
To add definitions, we leave the term level and transition to the concept level. Definitions of the terms would be helpful, but we obviously cannot extract definitions from the text. That’s why we need a third-party source.
In this case, we will use the Free Wordset Dictionary, which is a large dictionary in JSON format. You can use any other source if it can be accessed programmatically. Keep an eye on buzzwords like REST-API, structured data, and the like.
The program goes through the terms and looks them up in the dictionary. A large part of the code, however, deals with the situation when we cannot find the term in the dictionary.
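The lookup, including the fallback for missing terms, might be sketched like this. The tiny inline dictionary stands in for a real dictionary file, and its key layout (`meanings`, `def`) is an assumption about the JSON structure, so check the actual data before relying on it. The function name `lookup_definition` is also our own.

```python
import json

# Inline stand-in for a dictionary file. The key layout is an assumption.
DICTIONARY_JSON = """
{
  "china": {"meanings": [{"def": "high quality porcelain"}]},
  "attribute": {"meanings": [{"def": "to attribute or credit to"}]}
}
"""
dictionary = json.loads(DICTIONARY_JSON)


def lookup_definition(term, dictionary, default=""):
    """Return the first definition of a term, or a default if there is none."""
    entry = dictionary.get(term)
    if not entry:
        return default  # term is not in the dictionary
    meanings = entry.get("meanings") or []
    if not meanings:
        return default  # entry exists but has no meanings
    return meanings[0].get("def", default)


print(lookup_definition("china", dictionary))
print(lookup_definition("sars-cov-2", dictionary, default="(no definition found)"))
```

Returning a default value instead of raising an error keeps the term in the list, so that a human can add the definition later.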
Let’s look at the results again.
For the term “attribute”, we get the definition “to attribute or credit to”. But something peculiar happened here, and this describes the difference between term and concept.
The word “china” refers to the country, but the term is defined as “high quality porcelain”. This demonstrates the problem with homonyms and shows that the computer has certain limits. As soon as we talk about meaning, humans with their personal expertise are required.
Writing the list to a CSV file
Because we do not want to do all the analysis work and discussion in the Python output, we write the results into a CSV file. This is rather easy with the built-in libraries. Afterwards, we can import the data into other applications, for example, MS Excel.
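Writing the entries with Python’s built-in `csv` module could look like the following sketch. The column names and the sample row are assumptions for illustration. The sketch writes into an in-memory buffer so that it has no side effects.

```python
import csv
import io

FIELDNAMES = ["term", "pos", "context", "definition"]


def write_terms(rows, file_obj):
    """Write the term entries as CSV, starting with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {
        "term": "china",
        "pos": "PROPN",
        "context": "first identified in december 2019 in wuhan, china.",
        "definition": "high quality porcelain",
    },
]

# For a real file, replace the buffer with
# open("terms.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_terms(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes fields that contain commas, such as the context sentence, so the file can be imported into a spreadsheet without the columns shifting.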
There are several ways to extend the extraction:
- We can count the frequency of terms.
- We can add two-word terms. This is a little bit tougher than it sounds. First, we need a list of all the bigrams in the text. Then we need to calculate how often the two words appear next to each other in comparison to their individual frequency. Unfortunately, spaCy does not offer any built-in functions for that task.
- We can evaluate the relations between terms by looking for collocations. For example, we can count how often each term appears together in the same sentence with any of the other lemmas. This could give us an idea which terms are related.
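The first two ideas can be sketched with `collections.Counter` from the standard library. The lemma sequence is hand-written, and the score is only one simple heuristic chosen for illustration, not an established association measure.

```python
from collections import Counter

# Hand-written lemma sequence standing in for the lemmas of a document.
lemmas = ["coronavirus", "pandemic", "coronavirus", "disease",
          "coronavirus", "pandemic", "outbreak"]

# Frequency of the individual lemmas.
term_counts = Counter(lemmas)
# zip pairs every lemma with its right-hand neighbor to form bigrams.
bigram_counts = Counter(zip(lemmas, lemmas[1:]))


def bigram_score(bigram):
    """Share of the rarer word's occurrences that fall inside the bigram."""
    first, second = bigram
    return bigram_counts[bigram] / min(term_counts[first], term_counts[second])


print(term_counts.most_common(2))
print(bigram_score(("coronavirus", "pandemic")))
```

A score close to 1 suggests that the rarer word hardly ever occurs outside the pair. A real implementation should also avoid forming bigrams across sentence boundaries, which this sketch ignores.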
Let’s summarize what we have done in part 2.
- We started with the preparation of our text.
- We extracted one-word terms and part-of-speech tags.
- We added sample sentences for context.
- On the concept level, we were able to find suggestions for definitions in open-source dictionaries.
- We wrote the results into a CSV file which we use as the basis for terminology work, for example, defining concepts, discussing allowed and forbidden status.
Part 3 – Checking writing rules
We all know examples of poor text quality: Sentences are too long or too complicated. The language is not concise or there are spelling errors.
Writing rules help to improve text quality. To check them, the first thing we need is a sample text. As an example, we will use this sample text in German:
There are two reasons for using German now:
- We want to show you that spaCy can deal with languages other than English.
- Implementing writing rules may require intimate knowledge of the language and we just know German better than English.
Again, we will load a language model. This time, of course, a German one.
Finding forbidden terms
Finding forbidden terms is maybe the most obvious and useful use case, putting the terminology we just extracted to use. First you need a list of forbidden terms.
If you have ever tried finding forbidden terms by using a search function, you know that this is hard. The reason is that a term can take many forms. With spaCy's lemmatization, it is easy to check for forbidden terms because you do not have to worry about all the different forms a word can take. Just go through every token in the text and compare its lemma to the list of forbidden terms. If you find a forbidden term, print it out, maybe with some context.
In our case, we found all three words in the sample text. Here you can see the lemmatization in action. Our list contains the verb “erfolgen”, but the text contains one of its many forms, “erfolgt”.
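The check itself can be sketched in a few lines. Only “erfolgen” is taken from the article. The sample sentence, its hand-written lemmas, and the function name `find_forbidden` are assumptions so that the sketch runs without a German language model.

```python
from types import SimpleNamespace

# Only "erfolgen" is named in the article; extend the set as needed.
FORBIDDEN_LEMMAS = {"erfolgen"}


def find_forbidden(tokens, forbidden=FORBIDDEN_LEMMAS):
    """Return (surface form, lemma) for every token with a forbidden lemma."""
    return [(t.text, t.lemma_) for t in tokens if t.lemma_ in forbidden]


# Hand-written stand-ins for the tokens of "Die Lieferung erfolgt morgen."
tokens = [
    SimpleNamespace(text="Die", lemma_="der"),
    SimpleNamespace(text="Lieferung", lemma_="Lieferung"),
    SimpleNamespace(text="erfolgt", lemma_="erfolgen"),
    SimpleNamespace(text="morgen", lemma_="morgen"),
]
print(find_forbidden(tokens))
```

With spaCy, you would pass the document object of the German text instead of the hand-written list.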
Finding long words
Long words are a major source of comprehension problems. You can find various definitions of what constitutes a long word. In German, it’s not all that useful to look at the number of characters in a word. There are many short syllables in German which are made up of many characters. That’s why it may be better sometimes to count syllables instead of characters.
While counting characters is easy, counting syllables is not. However, for German we can think of a simple heuristic. Every syllable contains at least one vowel. That’s why we start by counting all the vowels in a given word. But in German, vowels often appear next to each other, forming diphthongs. So, we count all diphthongs, two vowels used as one. Then, we subtract the number of diphthongs from the number of vowels, giving us a good estimate of the number of syllables in the word.
Next, we go through every token and check whether it fits our requirement for long words. In this case, a word is long if it has more than three syllables or more than ten characters.
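The syllable heuristic and the long-word check can be sketched as follows. The list of vowel pairs is an assumption and certainly incomplete, and the function names are our own.

```python
import re

VOWELS = "aeiouäöü"
# Vowel pairs treated as one sound. This list is an assumption.
DIPHTHONG_PATTERN = re.compile(r"au|ei|eu|äu|ai|ie")


def estimate_syllables(word):
    """Estimate syllables as number of vowels minus number of diphthongs."""
    lowered = word.lower()
    vowel_count = sum(1 for char in lowered if char in VOWELS)
    # findall returns non-overlapping matches, scanning from left to right.
    diphthong_count = len(DIPHTHONG_PATTERN.findall(lowered))
    return vowel_count - diphthong_count


def is_long_word(word, max_syllables=3, max_chars=10):
    """A word is long if it exceeds either the syllable or character limit."""
    return estimate_syllables(word) > max_syllables or len(word) > max_chars


for word in ["Haus", "Liebe", "Dokumentation"]:
    print(word, estimate_syllables(word), is_long_word(word))
```

The estimate is rough: it does not handle loanwords or vowel pairs that are split across two syllables, but it is usually good enough to flag suspicious words for a human to review.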
Here we have a word monster: 20 syllables and 63 characters long. This is the actual name of a German law.
Not only words can be too long; sentences can be too long as well. There are at least three reasons why we should avoid long sentences:
- Long sentences are hard to read.
- With long sentences, grammatical references often become ambiguous.
- The longer the sentence, the easier it is to make mistakes.
In the literature we can find various limits for sentence length, mostly based on word count. For this demonstration let’s say that a sentence is long if it contains more than 15 words.
Using spaCy’s sentencizer, we can iterate through all sentences and count the number of tokens that are not punctuation marks or symbols.
If we run the program against the sample text, it returns two long sentences: one with 34 words and one with 17 words.
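The word count per sentence can be sketched like this. spaCy tokens provide the `is_punct` attribute, and symbols carry the POS tag SYM. The sample tokens are hand-written stand-ins, and the helper `tok` exists only to build them.

```python
from types import SimpleNamespace

MAX_WORDS = 15


def count_words(sentence_tokens):
    """Count the tokens that are neither punctuation marks nor symbols."""
    return sum(1 for t in sentence_tokens if not t.is_punct and t.pos_ != "SYM")


def is_long_sentence(sentence_tokens, max_words=MAX_WORDS):
    """A sentence is long if it contains more words than the limit."""
    return count_words(sentence_tokens) > max_words


def tok(text, pos, is_punct=False):
    """Build a hand-written stand-in for a spaCy token (illustration only)."""
    return SimpleNamespace(text=text, pos_=pos, is_punct=is_punct)


sentence = [tok("Das", "DET"), tok("Gerät", "NOUN"),
            tok("startet", "VERB"), tok(".", "PUNCT", is_punct=True)]
print(count_words(sentence), is_long_sentence(sentence))
```

With spaCy, you would call `is_long_sentence(sent)` for every `sent` in `doc.sents`.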
Dependent clauses, enumeration…: What commas can tell you
A surprising insight is that the number of commas in a sentence can be revealing. Multiple commas in a sentence hint at a potential for optimization. If we write a program that finds all the sentences in our sample text that contain more than one comma, two sentences are returned. The first is a long sentence containing a dependent clause inserted in the middle. The second sentence contains an enumeration which should be turned into a list.
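A comma check needs nothing more than string counting, as the following sketch shows. The three German sentences are hand-written examples, not the article’s sample text, and the function name is our own.

```python
def sentences_with_many_commas(sentences, max_commas=1):
    """Return the sentences that contain more commas than allowed."""
    return [s for s in sentences if s.count(",") > max_commas]


# Hand-written examples. With spaCy: [sent.text for sent in doc.sents].
sentences = [
    "Das Gerät startet.",
    "Das Gerät, das neu ist, startet.",
    "Sie benötigen Schrauben, Muttern, Dübel und einen Hammer.",
]
flagged = sentences_with_many_commas(sentences)
print(flagged)
```

The second example contains an inserted dependent clause and the third an enumeration, which are the two patterns described above.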
Finding nominalizations… or at least the worst ones
Nominalizations are not wrong per se. However, they can make a text static, bureaucratic and passive. Nominalizations often hint at different stylistic problems.
Finding nominalizations is not easy; in German, some are very well hidden.
But if a word is a noun and ends with “-ung”, “-keit”, “-heit”, or “-tion”, we can be pretty sure that it is a nominalization. We can turn that into a program and run it against our sample text. This returns one sentence containing three nominalizations, and this sentence is indeed static and bureaucratic.
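The suffix rule can be sketched as a small function. The sample tokens are hand-written stand-ins for spaCy tokens, and the function name is an assumption.

```python
from types import SimpleNamespace

NOMINAL_SUFFIXES = ("ung", "keit", "heit", "tion")


def is_nominalization(token):
    """Return True for nouns that end with a typical nominalization suffix."""
    # str.endswith accepts a tuple and checks all suffixes at once.
    return token.pos_ == "NOUN" and token.text.lower().endswith(NOMINAL_SUFFIXES)


# Hand-written stand-ins for spaCy tokens (illustrative values only).
tokens = [
    SimpleNamespace(text="Prüfung", pos_="NOUN"),
    SimpleNamespace(text="jung", pos_="ADJ"),
    SimpleNamespace(text="Gerät", pos_="NOUN"),
]
print([t.text for t in tokens if is_nominalization(t)])
```

The POS check prevents false hits such as the adjective “jung”. Checking the lemma instead of the surface form would also catch plural forms, at the cost of depending on the quality of the lemmatizer.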
There are several more ideas we could pursue.
For example, we could try to find passive sentences. We could look for sentences that contain forms of the auxiliary verbs “have” or “be” together with participles.
Another idea is checking for erroneous use of “dass” and “das”, although this may prove to be harder than it sounds.
Apart from checking writing rules, we could also try to measure the results of our efforts. For example, we could calculate readability indices using basically the same techniques or estimate reading time.
Let’s summarize what we did in part 3:
- First, we prepared the source text.
- We used the terminology list to find forbidden terms.
- We looked for long words by counting syllables and characters.
- We looked for complicated sentences by counting words and commas.
- We looked for nominalizations using POS tags and suffixes.
We created this tutorial based on the talk “Charming the word snake – terminology work and language checking with python” by Maximilian Rosin and Esther Strauch (parson AG) held at the tcworld conference 2020.
You can watch the recording here:
- Ann-Cathrin Mackenthun: Terminology management on a budget (https://www.parson-europe.com/en/knowledge-base/terminology-management-on-a-budget)
- Al Sweigart: Automate the Boring Stuff with Python (https://automatetheboringstuff.com/)
- Python 3: The programming language (https://www.python.org/)
- Anaconda: Package manager for data science (https://www.anaconda.com/)
- Jupyter Notebook: Creating documents with live code (https://jupyter.org/)
- spaCy: NLP library (https://www.spacy.io)
- tabulate: Formatting output as tables (https://github.com/astanin/python-tabulate)
- json: Reading and writing JSON files (https://docs.python.org/3/library/json.html)
- csv: Reading and writing CSV files (https://docs.python.org/3/library/csv.html)