CohLitheSP: A new technique to study quality in Specialized Translation

The aim of this work is to analyse the benefits of introducing a program written in Python to assess quality in terms of the appropriateness of texts as regards reading. To this end, a didactic experience was implemented in two final dissertations on the university course of Translation and Interpreting at the University of Murcia, in the academic years 2018-2019 and 2019-2020. The steps followed were: 1. Evaluation metrics for both MT outputs compared to the reference translation (the student's translation); 2. Definition of a tool used to calculate the easibility of the text; 3. Determination of the weights; 4. Calculation of the amplification constant for each specific corpus; 5. Calculation of marks of easibility of the texts; 6. External evaluation following an adapted version of House's model (2015). In accordance with the above, the following research questions are proposed: is human translation or MT better, and is it possible to create a rubric based on significant grounds to calculate an approximate mark for quality in translation?
Keywords: Computer-based studies, English for Specific Purposes, Linguistics, Literary translation, Quality in Translation processes, Scientific-technical translation.


I. INTRODUCTION
House (1981:127) starts most of her works with questions such as "What is a good translation?". In fact, it should be "one of the most important questions to be asked in connection with translation". For Halliday (2001:14) "it is notoriously difficult to say why, or even whether, something is a good translation". Translation quality should be mentioned here in association with the goals of MT, and new 'interactive' and/or 'adaptive' interfaces have been proposed for post-editing (Green, 2015; Vashee, 2017). Therefore, in this case, human translation and MT are inextricably linked. Some recent studies mention that MT is almost 'human-like' or that it 'gets closer to that of average human translators' (Wu et al., 2016) and also that MT quality is "at human parity when compared to professional human translators" (Hassan et al., 2018). Ahrenberg (2017:1) states that the aim of MT is 'overcoming language barriers', whereas human translation is aimed at producing 'texts that satisfy the linguistic norms of a target culture and are adapted to the assumed knowledge of its readers'. In order to do that, MT [...] One of the most required standards when comparing translations, as mentioned before, is quality. Mateo (2014), referring to Nord (1997), defines it as "appropriateness of a translated text to fulfil a communicative purpose". Following Mateo et al. (2017), the results of this quality should be 'Very good', 'Satisfactory', or 'Unacceptable'.
Nevertheless, there are authors who claim that it is almost impossible to surpass the perfection of human translation (Melby with T. Warner, 1995; Giammarresi and Lapalme, 2016). MT has gone through three stages, 'from early dictionary-matched machine translation to corpus-based statistical computer-aided translation, and then to neural machine translation with artificial intelligence as its core technology in recent years' (Zhaorong, 2018). Papineni et al. (2002) focus mainly on 'developing metrics whose ratings correlate well with human ratings or rankings' (Ahrenberg, 2017:2).
House (2017:2) defines translation as 'the result of a linguistic-textual operation in which a text in one language is re-contextualized in another language'. For her, there are some interaction factors which should be taken into consideration (House, 2017:2-3):
• the structural characteristics and the limitations of the two languages (source and target language);
• the extra-linguistic world;
• the source text with its features;
• the linguistic-stylistic-aesthetic norms of the target language;
• the target language rules;
• intertextuality in the target text;
• traditions, principles, etc., in the target language;
• the translation company's instructions given to the translator;
• the translator's workplace conditions;
• the translator's knowledge and expertise;
• the translation receptors' knowledge and expertise.
House (2017:5) also insists on the cognitive aspects of translation and, specifically, on the process of translation in the translator's mind, a matter studied over the last 30 years but certainly recently updated (cf. [...]). Equivalence is another key point in translation, with authors such as Jakobson (1966) and Nida (1964) writing on 'different kinds of equivalence', and Catford (1965); House (1977, 1997); Neubert (1970, 1985); Pym (1995); and see [...] "The translated text is well anchored in the target culture and, in transposing the original, the translator will often be confronted with culture-bound expressions or situations", and for Ahikary (2020) this means that "the equivalence is one of the most important aspects or goals of translation; translator has to focus on searching for the best equivalent terms between two different languages or dialects".
The present experiment is based on the study of human translation and MT quality in two final dissertations on the university course of Translation and Interpreting at the University of Murcia. Accordingly, the following research questions are proposed: is human translation or MT better, and is it possible to create a rubric based on significant grounds to calculate an approximate mark for quality in translation?

Contextualization and sample
This didactic experience was implemented in two final dissertations on the university course of Translation and Interpreting at the University of Murcia, in the academic years 2018-2019 and 2019-2020.

Development of the experiment
To carry out this work, different types of materials were used. First, a suitable text in English that had never been translated before was needed. The first final dissertation was a translation of a collection of texts dealing with Quantum Physics, Technology, Medicine, Environment and Geology, with a length of 600 words each. These scientific-technical texts were taken from scientific publications and specialized magazines. The second one is an extract from Red Dirt (2016), a literary text from the narrative genre, whose main feature is the use of colloquial language, full of phraseological units and insults, with a length of 2,500 words. For the MT, two different tools were used: Matecat and Wordfast Anywhere.

Evaluation metrics for both MT
At this point it is important to reiterate that we are comparing a reference translation with a machine translation within the context of the underlying idea that "the closer a machine translation is to a professional human translation, the better it is" (Papineni, Roukos, Ward & Zhu 2002: 311-318).
The first evaluation metrics introduced here are Precision and Recall. First, we must count the number of words in both the machine translation and the reference translation. Precision is calculated by dividing the number of words common to both translations by the number of words in the machine translation. Recall is calculated by dividing the number of shared words by the number of words in the reference translation. The higher the scores, the better the system, so the best system is the one with the highest scores.
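The two ratios just described can be illustrated with a minimal sketch. It is not the authors' programme: the function names `tokenize` and `precision_recall` are hypothetical, tokenization is reduced to lowercasing and splitting on whitespace, and a shared word is assumed to count at most as many times as it appears in both texts.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def precision_recall(machine, reference):
    """Return (precision, recall) based on the words shared by both texts."""
    mt_words = tokenize(machine)
    ref_words = tokenize(reference)
    # Multiset intersection: each word counts as often as it occurs in both.
    common = sum((Counter(mt_words) & Counter(ref_words)).values())
    precision = common / len(mt_words) if mt_words else 0.0
    recall = common / len(ref_words) if ref_words else 0.0
    return precision, recall


# Illustrative sentences, not taken from the corpus of this study.
p, r = precision_recall("the cat sat on a mat", "the cat is on the mat")
print(f"precision={p:.3f} recall={r:.3f}")
```

Both sentences have six words and share four of them, so both ratios are 4/6 in this example.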
WER (Word Error Rate) is another metric we are implementing. In this method, differences such as substitutions, insertions and deletions are taken into account. This metric is based on Levenshtein distance calculated at word level. In this case, the lower the WER result, the better.
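A word-level Levenshtein distance of this kind can be sketched as follows. The function name `word_error_rate` is hypothetical, and dividing the distance by the length of the reference translation is the usual convention for WER, assumed here rather than taken from the authors' code.

```python
def word_error_rate(machine, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp = machine.split()
    ref = reference.split()
    if not ref:
        # Assumption: an empty reference gives 0 only for an empty hypothesis.
        return 0.0 if not hyp else 1.0
    # previous[j] holds the distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


# Illustrative sentences, not taken from the corpus of this study.
print(word_error_rate("the cat sat on a mat", "the cat is on the mat"))
```

In the example two words are substituted in a six-word reference, so the result is 2/6.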
The most commonly used metric is BLEU (Bilingual Evaluation Understudy). This method counts how many n-grams overlap between the machine translation and the reference translation, on the idea that the larger the number of overlapping n-grams, the better the machine translation is. The BLEU score should be as near to 1 as possible for a machine translation to be considered good. In its standard formulation (Papineni et al., 2002), BLEU is calculated as:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where $p_n$ is the modified n-gram precision, $w_n$ are the weights (usually $1/N$), $c$ is the length of the machine translation and $r$ the length of the reference translation.

In order to obtain the results, a programme written in Python was used to implement the WER, BLEU, Precision and Recall functions from the information dumped in a file. The file contained a header, followed by different text segments corresponding to the original, a reference translation and several translations to be compared. The code calculated each function by combining the reference with each translation and generated another file in table format that could be sent directly to a spreadsheet. When performing translation tasks, three different machine translations were offered. The average is calculated for each suggestion offered by the machine, taking into account the above metrics. We can go a step further and consider the students' translations as the reference translation and compare them to the MT. Then, when calculating the above-mentioned evaluation metrics (WER, BLEU, Precision and Recall), the results are refined. Following this, a mark can be calculated as 1 - W: when W = 0 there are no mistakes and the maximum mark is obtained.

[...] and the target language. One project comprises one or several texts to be translated, and each project has a translation memory. Matecat provides, by default, a connection with Google Translate as a machine translation system, and a connection with MyMemory as a public translation memory.
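Returning to the BLEU metric described earlier in this section, the following is a minimal sketch for a single machine translation and a single reference. It follows the standard formulation by Papineni et al. (2002) with uniform weights and no smoothing; the names `ngram_counts` and `bleu` and the `max_n` parameter are hypothetical, and the authors' programme may differ in its details.

```python
import math
from collections import Counter


def ngram_counts(words, n):
    """Count the n-grams of a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bleu(machine, reference, max_n=4):
    """BLEU for one hypothesis and one reference (uniform weights, unsmoothed)."""
    hyp = machine.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram counts at most as often as in the reference.
        clipped = sum((hyp_counts & ref_counts).values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision gives BLEU = 0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision_sum / max_n)


# Illustrative sentences; max_n=2 because they are too short for 4-grams.
print(bleu("the cat sat on a mat", "the cat is on the mat", max_n=2))
```

Short segments often share no 3-grams or 4-grams with the reference, which is why unsmoothed BLEU is normally computed over whole texts rather than single sentences.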
It is important to mention that MyMemory is an open, freely available translation memory that includes the translation memories of the European institutions and the United Nations, as well as data automatically extracted from multilingual websites. The first operation to be carried out is the analysis of the project. By clicking Analyze, Matecat produces a preliminary analysis report showing how many words need to be translated. In this report, the total number of words of the source text is displayed under Total Word Count. Then the post-editing starts and some translation suggestions are shown. The translator has to decide how to adjust the translation and click Translated when the work is done.
Matecat also offers the concordance function to look up words and phrases in the active translation memories.
Once the post-editing is finished in the last segment, we can download the translated text and the translation memory. The Editing Log allows the translator to view adjustments made to the MT suggestions in the whole process. Finally, the average Post-Editing Effort (PEE) can be observed. It is important to mention that Matecat counts words according to industry standards, so "words or phrases with a 100% Translation Memory match are given a weighting of 30% and words or phrases with a partial TM match are given a weighting of 60%" (Matecat, 2014).
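The weighting quoted above can be illustrated with a small sketch. The function name `weighted_word_count` and its parameters are hypothetical, and it is assumed here (the quotation does not say so) that words without any translation memory match count in full.

```python
def weighted_word_count(new_words, partial_match_words, full_match_words,
                        partial_weight=0.6, full_weight=0.3):
    """Weighted word count using the percentages quoted from Matecat (2014).

    Assumption: words with no translation memory match are weighted at 100%.
    """
    return (new_words
            + partial_weight * partial_match_words
            + full_weight * full_match_words)


# Made-up figures: 100 new words, 50 partial matches, 50 full (100%) matches.
print(weighted_word_count(100, 50, 50))
```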

Wordfast Anywhere
For the second final dissertation, on the literary text, the CAT tool used was Wordfast Anywhere, a translation memory tool from the company Wordfast. The procedure to use it is as follows: the text is divided into segments that are translated and stored, creating glossaries and translations which will appear in future translations depending on the degree of coincidence of the words. It is necessary to create an account with an e-mail address to access a protected area, which acts as a cloud, where the translation memories, the glossaries and the files of the project are stored. It can be accessed from any web browser and offers the option of MT. This is the free version of Wordfast, the second most widely used translation memory tool in the world, after SDL Trados.

Definition of the tool used to calculate easibility of the text
To analyse the appropriateness of the texts as regards reading, a program in Python has been developed.
The first operation carried out by this program is sequencing the words of the text to recover the number of paragraphs, [...] To apply the aforementioned metrics, the following are needed:
• A reference text conforming to a valid corpus,
• A glossary of technical or specific terms, which helps to identify which words are specific within a corpus. These terms will include neither measurement units nor stop words (prepositions, determiners, etc.), and
• A set of connectors that make it possible to know when, in a sentence, something is being inferred from something previously said.
The selected metrics and their changes are:
• PCNARL. Narrativity. It is calculated by determining which words of the text to be evaluated are already recognized in the reference text.
• PCSYNL. Readability. It determines the simplicity of the text in its language. In the case of Spanish, a Flesch-based formula is used, which relies on the number of sentences, syllables and words. For the English language, it only needs to be replaced with Flesch-Kincaid, whose formula is based on a similar calculation.
• PCREFL. Referential Cohesion. In this version, the same referential cohesion as in Coh-Metrix is calculated; but instead of considering all nouns, it is only applied to the technical or specific terms recognized in the glossary.
• PCDCL. Deep Cohesion. It determines the incidence of the connectors over the recognized sentences.
• PCCNCL. Concreteness. In this version, instead of calculating the concreteness over the whole corpus of the language, the incidence of the terms of the glossary is determined from the words of the reference text recognized within the text to be evaluated.
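For the readability metric (PCSYNL) in the list above, the two standard English formulas of the Flesch family can be sketched as follows. The function names are hypothetical, the counts of words, sentences and syllables are taken as inputs (syllable counting is language-specific and is left out), and the constants are those of the published English formulas, not necessarily the ones used in Coh-Lithe-SP for Spanish.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease for English: higher values mean easier texts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level for English: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Made-up counts: 100 words, 5 sentences, 150 syllables.
print(flesch_reading_ease(100, 5, 150))
print(flesch_kincaid_grade(100, 5, 150))
```

Both formulas combine the same two ratios, words per sentence and syllables per word, which is why switching from one language-specific variant to another mainly means changing the constants.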
This reduction in programming cost also requires compromise mechanisms to recognize whether a word belongs to a large set, in such a way that the closest word is returned within some margin of tolerance.

Fig.1: Results of the programme
In order to do that, a structure (a decision tree) has been created to order words in such a way that we know instantly whether a word is included in the structure or not. In this version we are interested not only in the lexemes of Spanish but also in their inflected forms, bearing in mind that we have worked neither with an extensive dictionary of the Spanish language nor with the rules determining its lexemes, that is, whether a word is masculine or feminine, singular or plural. Furthermore, if it is a verb, its tense (present, past, future, conditional, etc.) should be recognized. The algorithm 'stresses' each word, so to speak, by repeating its characters several times, the first characters more times than the last ones. In this way, when calculating the edits (Levenshtein distance), errors will have less weight at the end of the word (morphemes) and more weight at the beginning (root).
By using this mechanism with a tolerance of 25% (words whose Levenshtein ratio is not below 75% are accepted), an approximation closely related to a lemmatization process is obtained.
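The stressing mechanism can be illustrated with a minimal sketch. The repetition scheme (first character three times, second twice, the rest once), the definition of the ratio as one minus the distance divided by the longer stressed length, and the names `stress`, `similarity` and `is_match` are all assumptions made for illustration; the authors' programme may use a different scheme.

```python
def levenshtein(a, b):
    """Character-level Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def stress(word, max_repeat=3):
    """Repeat early characters more often than late ones (assumed scheme)."""
    return "".join(ch * max(1, max_repeat - i) for i, ch in enumerate(word))


def similarity(a, b):
    """Ratio in [0, 1] computed on the stressed forms of both words."""
    sa, sb = stress(a), stress(b)
    longest = max(len(sa), len(sb))
    return 1.0 if longest == 0 else 1 - levenshtein(sa, sb) / longest


def is_match(a, b, threshold=0.75):
    """Accept words whose ratio is not below the 75% threshold."""
    return similarity(a, b) >= threshold


print(is_match("canto", "canta"))  # difference at the end of the word
print(is_match("canto", "santo"))  # difference at the beginning of the word
```

With this scheme a change in the ending ("canto" and "canta") costs one edit out of eight stressed characters and is accepted, whereas a change in the first letter ("canto" and "santo") costs three edits and is rejected, although both pairs differ by a single character in their plain forms.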
For the calculation of narrativity, as well as of concreteness, it is necessary to use these techniques to be able to generate two decision trees.
The following ideas have been considered to separate the text into sentences:
1. A sentence is formed by more than ONE word.
2. After a dot a sentence begins in upper case.
3. The sentences that do not comply with 1 and 2 will be separated by "; -¿? ¡!.:"
4. If a sentence complies with 1 or 2, it will be added to the next sentence.
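One possible reading of rules 1, 2 and 4 can be sketched as follows. This is only an interpretation: the function name `split_sentences` is hypothetical, the secondary separators of rule 3 are not handled, and it is assumed that a fragment of a single word is joined to the following sentence.

```python
import re


def split_sentences(text):
    """Split after a dot followed by an upper-case letter (rule 2) and join
    one-word fragments to the next sentence (assumed reading of rules 1 and 4)."""
    parts = re.split(r"(?<=\.)\s+(?=[A-ZÁÉÍÓÚÜÑ])", text.strip())
    sentences = []
    carry = ""
    for part in parts:
        candidate = (carry + " " + part).strip()
        if len(candidate.split()) > 1:  # rule 1: more than one word
            sentences.append(candidate)
            carry = ""
        else:
            carry = candidate  # too short: keep it for the next sentence
    if carry:
        # A trailing one-word fragment is attached to the previous sentence.
        if sentences:
            sentences[-1] += " " + carry
        else:
            sentences.append(carry)
    return sentences


# Illustrative Spanish text, not taken from the corpus of this study.
print(split_sentences("Hola. El gato duerme. Sí. Come mucho."))
```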
After applying this simplified version of Coh-Metrix to the texts produced in Spanish, it is possible to see how, after being evaluated separately with a mark from 0 to 10, they seem to describe a similar curve. As can be seen in the above figures, different types of written texts for different technical corpora show minor differences in marks, but with a pattern suggesting that the measurements are not random. In addition, the texts used as references, representing a corpus without errors, have a mark below 10, so students could never get that mark. Therefore, not only must each Coh-Lithe metric be weighted in a way that favours the distinction among students' abilities, but the results must also be amplified so that the reference texts obtain that mark (10). For this reason, the following sections explain how to calculate the weighting of each metric and the constant used to amplify the mark.

Determination of the weights
By analysing the different students' texts, it is interesting to point out that the metrics that should weigh more are those where students have the most divergent marks and those where students have better marks. Therefore, after multiplying the mean and the standard deviation of each metric and normalizing the results, the following weights have been generated:
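The weighting rule described above (mean multiplied by standard deviation, then normalized) can be sketched as follows. The function name `metric_weights`, the use of the population standard deviation and the sample marks are assumptions for illustration only; they are not the data of this study.

```python
from statistics import mean, pstdev


def metric_weights(marks_by_metric):
    """Weight of each metric: mean * standard deviation, normalized to sum 1."""
    raw = {name: mean(marks) * pstdev(marks)
           for name, marks in marks_by_metric.items()}
    total = sum(raw.values())
    if total == 0:
        # Assumption: if no metric discriminates, fall back to equal weights.
        return {name: 1 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


# Made-up student marks for two of the five metrics.
weights = metric_weights({"PCNARL": [4, 6], "PCSYNL": [2, 8]})
print(weights)
```

In this example both metrics have the same mean, so the metric with the wider spread of marks receives the larger weight.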

Calculation of the amplification constant for each specific corpus
Below, the results of evaluating the reference texts can be seen.

Fig.5: Results of evaluating reference texts
As we can observe, with the exception of Narrativity, the maximum mark is not achieved in every parameter, so, first, the weights for each case are applied and, then, a rule of three with the maximum mark (10). The result is the constant by which all texts using this reference document are multiplied. For example, if the amplification constant for the texts using the technological corpus as reference is needed, this calculation is applied to the coefficients obtained from the programme. Under these weights, the marks of the six reference texts have been studied, and an amplification of 1.39 has been found.
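The rule of three described above can be sketched as follows: the weighted mark of the reference text is computed first, and the constant is the factor that takes it to the maximum mark. The function name `amplification_constant`, the equal weights and the reference marks are made up for illustration; they are not the values obtained in this study.

```python
def amplification_constant(reference_marks, weights, top_mark=10.0):
    """Factor that raises the weighted mark of a reference text to top_mark."""
    weighted_mark = sum(weights[name] * reference_marks[name] for name in weights)
    return top_mark / weighted_mark


# Made-up weights and reference marks on a 0-10 scale.
weights = {"PCNARL": 0.2, "PCSYNL": 0.2, "PCREFL": 0.2, "PCDCL": 0.2, "PCCNCL": 0.2}
reference = {"PCNARL": 10, "PCSYNL": 7, "PCREFL": 7, "PCDCL": 6, "PCCNCL": 6}
print(round(amplification_constant(reference, weights), 2))
```

With these illustrative values the weighted reference mark is 7.2, so the constant is 10 / 7.2.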

Fig.6: Marks of reference texts
For that reason, if we do not want to calculate a specific amplifier for each corpus, it seems reasonably accurate to multiply by 1.39, regardless of the reference text.

Calculation of marks of easibility of texts
Regarding the calculation of the marks of the texts, the amplification constant must be applied to the sum of each metric divided by its maximum and multiplied by its weight. The same calculation is applied, for example, to the technology texts.

[...] This means that the translation is not marked pragmatically by its source text, so it could have been created independently; therefore they are 'pragmatically of equal concern for source and target language addressees' (2015: 56). Meanwhile, an overt translation has to cope with the cultural assumptions of the target language to be able to translate the text appropriately. In the final dissertations that we are analysing, the first one is a covert translation, and the second one an overt translation.
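The mark calculation described at the beginning of this section can be sketched as follows. The function name `text_mark` is hypothetical, the weights and scores are made up, and two assumptions are made that the text does not state: the maximum of each metric is its highest attainable value, and the weighted sum is rescaled to a 0-10 mark through the `top_mark` parameter.

```python
def text_mark(scores, maxima, weights, amplification=1.39, top_mark=10.0):
    """Amplified mark: sum of weight * (metric / maximum), scaled to top_mark."""
    weighted = sum(weights[name] * scores[name] / maxima[name] for name in weights)
    # Assumption: the normalized sum (0 to 1) is rescaled to the 0-10 range.
    return amplification * top_mark * weighted


# Made-up weights, maxima and scores for one student text.
names = ["PCNARL", "PCSYNL", "PCREFL", "PCDCL", "PCCNCL"]
weights = {name: 0.2 for name in names}
maxima = {name: 10 for name in names}
scores = dict(zip(names, [8, 6, 5, 5, 4]))
print(round(text_mark(scores, maxima, weights), 2))
```

The sketch does not cap the result, so a text that scores above its reference in every metric could exceed the maximum mark; whether and how such cases are limited is left open here.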

External evaluation
House (2015:63) states clearly that translation is 'the replacement of a text in the source language by a semantically and pragmatically equivalent text in the target language'; therefore, it must be equivalent. House agrees with Halliday's assumption (1989:11) that the text and the context of the situation should be separated. In addition, the concepts of Field, Mode and Tenor from Halliday are also used (House 2015: 64). The Mode refers to both the channel (in this case, the text is written to be read) and the degree to which potential or real participation is allowed for between writer and reader. The Field refers to the content, the subject matter. The Tenor is the nature of the participants, the addresser and the addressees, and whether the author's personal (emotional and intellectual) stance helps to transmit the message. However, in her work, House incorporates the idea of Genre: 'It connects texts with the 'macro-context' of the linguistic and cultural community in which the text is embedded' (2015:64). The following Figure [...]
The cultural filter is another important concept introduced by House. As defined by the author (2015:68), it 'is a means of capturing socio-cultural differences in expectation norms and stylistic conventions between the source and target linguistic-cultural communities'.
Therefore, to compare both texts (source and target language) it is necessary to bear the cultural concept in mind.
To apply House's considerations, the authors of this work created a questionnaire with 10 questions, which was given to a 4th-year class of a university course of Translation and Interpreting whose students had already finished a subject on Specialized Translation and had the knowledge to analyse and evaluate translations of this type.

Fig.8: Questionnaire
This questionnaire was administered through Google Forms, as an online class activity, after the students had successfully finished the subject on Specialized Translation. The questionnaire for the second final dissertation, on a literary text, has only 9 questions, since we considered two of them as one.

Evaluation metrics
The scientific-technical texts had the following results:

Evaluation metrics
An analysis of the evaluation metrics in 3.1 shows us the following results. Regarding the final dissertation on scientific-technical texts: • MT Suggestion 1 is the best one in the 5 texts (Matecat can offer up to 3 MT suggestions), with marks of 5.8, 6.1, 6.4, 6 and 7.7, which is an excellent result.
Considering the final dissertation on a literary text: • The MT suggestion for the literary text had a mark of 4.1, which is not so negative if we consider that it is an overt translation and the human translator had to adapt precisely to the target culture, so more changes were made to the MT suggestion when post-editing it for the final dissertation.

Evaluation of the texts
As can be seen, MT gets better results than the reference translations (students' translations). In fact, considering the value these questionnaires have, they can be contrasted with the previous results. To understand the value of the questionnaires, we first observe the questions, bearing in mind that each student could rate each one with different degrees of relevance.
By adding the evaluations made by students in the previous questions, the following results are obtained for one of the first 5 texts, of scientific-technical content:
It is observed that the students' evaluation approximately coincides with the Coh-Lithe-SP evaluation. In addition, a similar evaluation is achieved for the literary text compared with that of Coh-Lithe-SP.

V. CONCLUSION
In this work, a new and different tool has been presented which adds a supplementary challenge for students: the possibility of improving the readability of their own translations from English into Spanish.
Given the facts, the technique explained above works properly, as supported by two results: on the one hand, texts from different typologies, including MT texts, get good or bad marks in the same metrics; on the other hand, the tests also show that, after refining the final mark, the result approximates the students' evaluation.
Moreover, it is important to stress the ease of programming, which does not require large corpora, despite the fact that the technique derives from systems that demand an enormous additional programming effort. This last feature is complemented by the fact that it can easily be adapted to work in any language.
The procedure used to test the new tool, implemented with the external evaluation questionnaire, should also be highlighted. This questionnaire updates and implements House's and Halliday's considerations by testing the new tool with regard to the pragmatic and cultural aspects of both source and target texts.

VI. SOFTWARE
The programme written in Python used to calculate the statistics, with comments in English, can be found at the following address: https://archive.org/details/coh-lithe-sp-012