Reliability check

Care has been taken to reduce the effect of parsing decisions in borderline cases, by parsing all language versions of a sentence the same way whenever they’re judged to contain the same information. Also, the statistical analysis uses a model tailored to minimize the effects of any differences in parsing decisions between sentences, as explained in section 4.2.1 on the procedure applied for the statistical analysis. Data pages showing all segmented texts and values recorded for each sentence can be found in annex II.

For each of the 110 sample sentences in the reliability check, the original English version and one translated or interpreted version were analyzed again. The check involved gathering new values for one independent variable – sentence complexity – and three dependent variables – reordering, nesting changes (changes in single, double and triple nestings) and changes in semantic relations. Summary statistics on the agreement between values recorded in both analyses are presented below. No data on changes in triple nestings are shown, since there were no triple nestings in the English or other language version of any sentence in the sample checked.

Let’s start with percentage rates for data agreement in the total sample of sentences in the data check. Summary statistics for those rates are shown in table 18.

Table 18
Percentage rates for data agreement in total sample

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)


              Number of   Same value in both analyses    Agreement rate (%)
              sentences   C    R    N1 / N2    S         C    R    N1 / N2    S
Total sample  110         101  109  107 / 110  106       92   99   97 / 100   96
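Each agreement rate above is simply the share of sentences that received the same value in both analyses. A minimal sketch of the computation, with invented values standing in for the real data in annex II:

    # Percentage agreement: share of sentences given the same value in both
    # analyses. The lists are invented placeholders, not the study's data.
    first_analysis  = [3, 1, 0, 2, 4, 1, 2, 0, 5, 3]
    second_analysis = [3, 1, 0, 2, 5, 1, 2, 0, 5, 3]

    matches = sum(a == b for a, b in zip(first_analysis, second_analysis))
    rate = 100 * matches / len(first_analysis)
    print(f"{matches}/{len(first_analysis)} values agree = {rate:.0f}%")  # 9/10 = 90%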


Another useful measure of the consistency of data recorded twice is Cohen’s kappa coefficient (κ). Kappa values are considered more robust than percentage rates like those in table 18, as they take into account the possibility of agreement occurring by chance. A kappa of 1 indicates perfect agreement, a kappa of 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance. To measure the consistency of data for numerical variables with a small range of values, such as those recorded in this study, weighted kappas are generally used.
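To illustrate, a weighted kappa for one variable can be computed with scikit-learn’s cohen_kappa_score. The sketch below uses invented ratings; linear weighting is assumed here for illustration and is not necessarily the scheme used in this study.

    from sklearn.metrics import cohen_kappa_score

    # Invented ratings for one variable, recorded in the two analyses.
    first_analysis  = [3, 1, 0, 2, 4, 1, 2, 0, 5, 3]
    second_analysis = [3, 1, 0, 2, 5, 1, 2, 0, 5, 3]

    # weights="linear" penalizes disagreements in proportion to their size,
    # which suits ordinal variables with a small range of values.
    kappa = cohen_kappa_score(first_analysis, second_analysis, weights="linear")
    print(f"weighted kappa = {kappa:.2f}")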

Weighted kappas for data agreement were first calculated for the total sample of sentences. Summary statistics for those calculations are shown in table 19.

Table 19
Weighted kappas for data agreement in total sample

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Total sample                     C       R       N1 / N2        S
Weighted kappa                   0.97    1.00    0.98 / 1.00    0.97
95% asymptotic CI, lower bound   0.95    0.99    0.96 / 1.00    0.93
95% asymptotic CI, upper bound   0.99    1.00    1.00 / 1.00    1.00
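The confidence intervals in table 19 are asymptotic. One way to obtain a weighted kappa together with such an interval is statsmodels’ cohens_kappa, which works on a square contingency table of counts; the table below is invented for illustration, and linear weighting is again an assumption of the example.

    import numpy as np
    from statsmodels.stats.inter_rater import cohens_kappa

    # Invented contingency table for one variable: rows = values from the
    # first analysis, columns = values from the new analysis.
    table = np.array([
        [20,  2,  0],
        [ 1, 45,  3],
        [ 0,  2, 37],
    ])

    # wt="linear" requests linear weights; the result carries an asymptotic
    # confidence interval in kappa_low and kappa_upp (95% by default).
    res = cohens_kappa(table, wt="linear")
    print(f"weighted kappa = {res.kappa:.2f}, "
          f"95% CI = [{res.kappa_low:.2f}, {res.kappa_upp:.2f}]")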


The headline values from tables 18 and 19 – the agreement rates and the weighted kappas – are visualized in charts 11 and 12.

Chart 11. Percentage rates for data agreement in total sample



Chart 12. Weighted kappas for data agreement in total sample



Landis & Koch (1977) suggest the following scale for interpreting weighted kappa values:

Kappa        Interpretation
< 0.00       Poor
0.00-0.20    Slight
0.21-0.40    Fair
0.41-0.60    Moderate
0.61-0.80    Substantial
0.81-1.00    Almost perfect
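The scale translates directly into a small helper function; a sketch (the function name is illustrative):

    def interpret_kappa(kappa: float) -> str:
        """Map a kappa value to its Landis & Koch (1977) interpretation."""
        if kappa < 0.00:
            return "Poor"
        if kappa <= 0.20:
            return "Slight"
        if kappa <= 0.40:
            return "Fair"
        if kappa <= 0.60:
            return "Moderate"
        if kappa <= 0.80:
            return "Substantial"
        return "Almost perfect"

    print(interpret_kappa(0.97))  # Almost perfect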

According to the above scale, there was almost perfect agreement between the first analysis and the new one in data recorded for the total sample of sentences. There was slightly less agreement in data for sentence complexity than for the other variables.


In addition to these overall results, we can look for areas of greater inconsistency than others by comparing data recorded for sentences in different modes, in different target languages and with different degrees of complexity. For smaller sample sizes like these, percentage rates are considered more robust than kappa values.

Let’s look first at percentage rates for data agreement by mode. Summary statistics for those rates are shown in table 20.

Table 20
Data agreement by mode

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Mode                          Number of   Same value in both analyses   Agreement rate (%)
                              sentences   C    R    N1 / N2   S         C    R    N1 / N2    S
Legal translation             48          44   47   45 / 48   45        92   98   94 / 100   94
Subtitle translation          40          37   40   40 / 40   40        93   100  100 / 100  100
Simultaneous interpretation   22          20   22   22 / 22   21        91   100  100 / 100  95
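Each rate in table 20 is the same proportion as before, computed within one mode. A minimal pandas sketch with invented records (the column names are illustrative, not taken from the study’s data pages):

    import pandas as pd

    # One row per checked sentence: its mode and whether the two analyses
    # agreed on a given variable. All values here are invented.
    df = pd.DataFrame({
        "mode":  ["legal", "legal", "subtitle", "subtitle", "simultaneous"],
        "agree": [True, False, True, True, True],
    })

    # Per-mode agreement rate: share of matching values within each mode.
    rates = (100 * df.groupby("mode")["agree"].mean()).round()
    print(rates)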


The agreement rates in table 20 indicate that there was almost perfect agreement between the first analysis and the new one in data recorded for each mode. There was slightly less agreement in data for legal translation than for the other two modes. These results are visualized in chart 13.

Chart 13. Data agreement by mode


Let’s look next at percentage rates for data agreement by target language. Summary statistics for those rates are shown in table 21.

Table 21
Data agreement by target language

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Target language   Number of   Same value in both analyses   Agreement rate (%)
                  sentences   C    R    N1 / N2   S         C    R    N1 / N2    S
Russian           22          20   21   21 / 22   21        91   95   95 / 100   95
Hungarian         22          20   22   22 / 22   21        91   100  100 / 100  95
Turkish           22          20   22   21 / 22   21        91   100  95 / 100   95
Mandarin          22          21   22   21 / 22   21        95   100  95 / 100   95
Japanese          22          20   22   22 / 22   22        91   100  100 / 100  100


The agreement rates in table 21 indicate that there was almost perfect agreement between the first analysis and the new one in data recorded for each target language. There were no major differences in data for different target languages. These results are visualized in chart 14.

Chart 14. Data agreement by target language


Finally, let’s look at percentage rates for data agreement by degree of sentence complexity, as measured by the number of functionally subordinate or reported propositions in the original English version of a sentence. Sentence complexity as recorded in the first analysis was used as an input variable for this part of the check. That input variable consisted of three representative ranges of subordinate or reported proposition counts per sentence (0-2, 3-5 and 6+), as sketched below. Agreement between the first analysis and the new one in the exact value recorded for sentence complexity was one of the four output variables, as in the other parts described above.
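The binning itself is mechanical; a sketch with invented proposition counts (pd.cut’s default right-closed intervals reproduce the three ranges):

    import pandas as pd

    # Invented counts of subordinate or reported propositions per sentence,
    # as recorded in the first analysis.
    complexity = pd.Series([0, 2, 3, 7, 1, 5, 6, 4])

    # Bin into the input ranges for this part of the check: 0-2, 3-5, 6+.
    ranges = pd.cut(complexity, bins=[-1, 2, 5, float("inf")],
                    labels=["0-2", "3-5", "6+"])
    print(ranges.value_counts().sort_index())

Summary statistics for this part of the check are shown in table 22.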

Table 22
Data agreement by degree of sentence complexity

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Degree of complexity   Number of   Same value in both analyses   Agreement rate (%)
                       sentences   C    R    N1 / N2   S         C    R    N1 / N2    S
0-2 subordinations     57          57   57   57 / 57   57        100  100  100 / 100  100
3-5 subordinations     31          30   31   30 / 31   30        97   100  97 / 100   97
6+ subordinations      22          18   21   20 / 22   19        82   95   91 / 100   86


The agreement rates in table 22 indicate that there was almost perfect agreement between the first analysis and the new one in data recorded for sentences with different degrees of complexity. The rates of data agreement decreased somewhat as sentence complexity increased. These results are visualized in chart 15.

Chart 15. Data agreement by degree of sentence complexity

The reliability check presented in this section was carried out on a random 10% sample of sentences in the corpus, using one randomly selected target language of the five for each sentence. Of the three independent and three dependent variables recorded for each sentence, five could vary between the two analyses: sentence complexity, reordering, changes in single nestings, changes in double nestings, and changes in semantic relations. With values for those five variables recorded twice each in 110 sample sentences, the data check covered a total of 1,100 values (5 variables × 110 sentences × 2 analyses), representing 2% of the overall data. If we discount the data for changes in double nestings because of the low numbers involved, a total of 880 significant values were recorded and compared.

Both in percentage rates and in weighted kappas, there was almost perfect agreement between the first analysis and the new one in values recorded for variables across the board. Breaking the results down, there were: (a) slightly more differences in data for sentence complexity than for the three indicators of difficulty, (b) slightly more differences in data for legal translation than for the other two modes, and (c) more differences in data for complex sentences than for simple ones. There were no major differences in data recorded for different target languages.

These results show that a random representative sample of the data in this study is almost perfectly replicable, as recorded independently by the author after an interval of several months. Although the sample represents just 2% of the total data, the large number of values checked and the very high levels of agreement found suggest that similar levels apply to the corpus overall. This confirms that the findings of the study can be regarded as robust.

The reliability check also involved an analysis of the issues behind various parsing decisions. Those parsing issues are presented in section 2 of annex I. They reveal many gray areas where distinctions were borderline. It might be possible to clear up some of those gray areas with a more rigorous version of the semantic parsing method developed for this study. Others may be intrinsically gray, because human language doesn’t always allow for clear-cut distinctions – like whether a proposition is functionally independent or subordinate, which of two linked propositions is more salient, or whether a speaker or writer identifies with the content of a subordinate clause.

Still, those gray areas are unlikely to affect the findings of the study significantly. That assessment is supported by the very high level of consistency found in the data check. It’s further strengthened – as explained in section 4.2.1 on the procedure applied for the statistical analysis – by a statistical model that isolates any parsing inconsistencies that do occur as extraneous effects not involving the independent variables being tested.