Reliability check

Care has been taken to reduce the effect of parsing decisions in borderline cases, by parsing all language versions of a sentence the same way whenever they’re judged to contain the same information. Also, the statistical analysis uses a model tailored to minimize the effects of any differences in parsing decisions between sentences, as explained in section 4.2.1 on the procedure applied for the statistical analysis. Data pages showing all segmented texts and values recorded for each sentence can be found in annex II.

For each of the 110 sample sentences in the reliability check, the original English version and one translated or interpreted version were analyzed again. The check involved gathering new values for one independent variable – sentence complexity – and three dependent variables – reordering, nesting changes (changes in single, double and triple nestings) and changes in semantic relations. Summary statistics on the agreement between values recorded in both analyses are presented below. No data on changes in triple nestings are shown, since there were no triple nestings in the English or other language version of any sentence in the sample checked.

Let’s start with percentage rates for data agreement in the total sample of sentences in the data check. Summary statistics for those rates are shown in table 18.

Table 18
Percentage rates for data agreement in total sample

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)


              Number of   Same value in both analyses    Agreement rate (%)
              sentences   C    R    N1 / N2    S         C    R    N1 / N2    S
Total sample  110         101  109  107 / 110  106       92   99   97 / 100   96
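Each agreement rate above is simply the share of sentences that received the same value in both analyses. A minimal sketch of the computation, with invented values standing in for the real data in annex II:

    # Percentage agreement: share of sentences given the same value in both
    # analyses. The lists are invented placeholders, not the study's data.
    first_analysis  = [3, 1, 0, 2, 4, 1, 2, 0, 5, 3]
    second_analysis = [3, 1, 0, 2, 5, 1, 2, 0, 5, 3]

    matches = sum(a == b for a, b in zip(first_analysis, second_analysis))
    rate = 100 * matches / len(first_analysis)
    print(f"{matches}/{len(first_analysis)} values agree = {rate:.0f}%")  # 9/10 = 90%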


Another useful measure of the consistency of data recorded twice is Cohen’s kappa coefficient (κ). Kappa values are considered more robust than percentage rates like those in table 18, as they take into account the possibility of agreement occurring by chance. A kappa of 1 indicates perfect agreement, a kappa of 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance. To measure the consistency of data for numerical variables with a small range of values, such as those recorded in this study, weighted kappas are generally used.
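To illustrate, a weighted kappa for one variable can be computed with scikit-learn’s cohen_kappa_score. The sketch below uses invented ratings; linear weighting is assumed here for illustration and is not necessarily the scheme used in this study.

    from sklearn.metrics import cohen_kappa_score

    # Invented ratings for one variable, recorded in the two analyses.
    first_analysis  = [3, 1, 0, 2, 4, 1, 2, 0, 5, 3]
    second_analysis = [3, 1, 0, 2, 5, 1, 2, 0, 5, 3]

    # weights="linear" penalizes disagreements in proportion to their size,
    # which suits ordinal variables with a small range of values.
    kappa = cohen_kappa_score(first_analysis, second_analysis, weights="linear")
    print(f"weighted kappa = {kappa:.2f}")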

Weighted kappas for data agreement were first calculated for the total sample of sentences. Summary statistics for those calculations are shown in table 19.

Table 19
Weighted kappas for data agreement in total sample

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Total sample                     C       R       N1 / N2        S
Weighted kappa                   0.97    1.00    0.98 / 1.00    0.97
95% asymptotic CI, lower bound   0.95    0.99    0.96 / 1.00    0.93
95% asymptotic CI, upper bound   0.99    1.00    1.00 / 1.00    1.00
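The confidence intervals in table 19 are asymptotic. One way to obtain a weighted kappa together with such an interval is statsmodels’ cohens_kappa, which works on a square contingency table of counts; the table below is invented for illustration, and linear weighting is again an assumption of the example.

    import numpy as np
    from statsmodels.stats.inter_rater import cohens_kappa

    # Invented contingency table for one variable: rows = values from the
    # first analysis, columns = values from the new analysis.
    table = np.array([
        [20,  2,  0],
        [ 1, 45,  3],
        [ 0,  2, 37],
    ])

    # wt="linear" requests linear weights; the result carries an asymptotic
    # confidence interval in kappa_low and kappa_upp (95% by default).
    res = cohens_kappa(table, wt="linear")
    print(f"weighted kappa = {res.kappa:.2f}, "
          f"95% CI = [{res.kappa_low:.2f}, {res.kappa_upp:.2f}]")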


The headline values from tables 18 and 19 – the agreement rates and the weighted kappas – are visualized in charts 11 and 12.

Chart 11. Percentage rates for data agreement in total sample



Chart 12. Weighted kappas for data agreement in total sample



Landis & Koch (1977) suggest the following scale for interpreting weighted kappa values:

Kappa        Interpretation
< 0.00       Poor
0.00-0.20    Slight
0.21-0.40    Fair
0.41-0.60    Moderate
0.61-0.80    Substantial
0.81-1.00    Almost perfect
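The scale translates directly into a small helper function; a sketch (the function name is illustrative):

    def interpret_kappa(kappa: float) -> str:
        """Map a kappa value to its Landis & Koch (1977) interpretation."""
        if kappa < 0.00:
            return "Poor"
        if kappa <= 0.20:
            return "Slight"
        if kappa <= 0.40:
            return "Fair"
        if kappa <= 0.60:
            return "Moderate"
        if kappa <= 0.80:
            return "Substantial"
        return "Almost perfect"

    print(interpret_kappa(0.97))  # Almost perfect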

According to the above scale, there was almost perfect agreement between the first analysis and the new one in data recorded for the total sample of sentences. There was slightly less agreement in data for sentence complexity than for the other variables.


In addition to these overall results, we can look for areas of greater inconsistency than others by comparing data recorded for sentences in different modes, in different target languages and with different degrees of complexity. For smaller sample sizes like these, percentage rates are considered more robust than kappa values.

Let’s look first at percentage rates for data agreement by mode. Summary statistics for those rates are shown in table 20.

Table 20
Data agreement by mode

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Mode                          Number of   Same value in both analyses   Agreement rate (%)
                              sentences   C    R    N1 / N2   S         C    R    N1 / N2    S
Legal translation             48          44   47   45 / 48   45        92   98   94 / 100   94
Subtitle translation          40          37   40   40 / 40   40        93   100  100 / 100  100
Simultaneous interpretation   22          20   22   22 / 22   21        91   100  100 / 100  95
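Each rate in table 20 is the same proportion as before, computed within one mode. A minimal pandas sketch with invented records (the column names are illustrative, not taken from the study’s data pages):

    import pandas as pd

    # One row per checked sentence: its mode and whether the two analyses
    # agreed on a given variable. All values here are invented.
    df = pd.DataFrame({
        "mode":  ["legal", "legal", "subtitle", "subtitle", "simultaneous"],
        "agree": [True, False, True, True, True],
    })

    # Per-mode agreement rate: share of matching values within each mode.
    rates = (100 * df.groupby("mode")["agree"].mean()).round()
    print(rates)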


The agreement rates in table 20 indicate that there was almost perfect agreement between the first analysis and the new one in data recorded for each mode. There was slightly less agreement in data for legal translation than for the other two modes. These results are visualized in chart 13.

Chart 13. Data agreement by mode


Let’s look next at percentage rates for data agreement by target language. Summary statistics for those rates are shown in table 21.

Table 21
Data agreement by target language

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Target language   Number of   Same value in both analyses   Agreement rate (%)
                  sentences   C    R    N1 / N2   S         C    R    N1 / N2    S
Russian           22          20   21   21 / 22   21        91   95   95 / 100   95
Hungarian         22          20   22   22 / 22   21        91   100  100 / 100  95
Turkish           22          20   22   21 / 22   21        91   100  95 / 100   95
Mandarin          22          21   22   21 / 22   21        95   100  95 / 100   95
Japanese          22          20   22   22 / 22   22        91   100  100 / 100  100


The agreement rates in table 21 indicate that there was almost perfect agreement between the first analysis and the new one in data recorded for each target language. There were no major differences in data for different target languages. These results are visualized in chart 14.

Chart 14. Data agreement by target language


Finally, let’s look at percentage rates for data agreement by degree of sentence complexity, as measured by the number of functionally subordinate or reported propositions in the original English version of a sentence. Sentence complexity as recorded in the first analysis was used as an input variable for this part of the check. That input variable consisted of three representative ranges of subordinate or reported proposition counts per sentence (0-2, 3-5 and 6+), as sketched below. Agreement between the first analysis and the new one in the exact value recorded for sentence complexity was one of the four output variables, as in the other parts described above.
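The binning itself is mechanical; a sketch with invented proposition counts (pd.cut’s default right-closed intervals reproduce the three ranges):

    import pandas as pd

    # Invented counts of subordinate or reported propositions per sentence,
    # as recorded in the first analysis.
    complexity = pd.Series([0, 2, 3, 7, 1, 5, 6, 4])

    # Bin into the input ranges for this part of the check: 0-2, 3-5, 6+.
    ranges = pd.cut(complexity, bins=[-1, 2, 5, float("inf")],
                    labels=["0-2", "3-5", "6+"])
    print(ranges.value_counts().sort_index())

Summary statistics for this part of the check are shown in table 22.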

Table 22
Data agreement by degree of sentence complexity

(C = Complexity, R = Reordering, N1/N2 = Changes in single/double nestings, S = Changes in semantic relations)



Degree of complexity   Number of   Same value in both analyses   Agreement rate (%)
                       sentences   C    R    N1 / N2   S         C    R    N1 / N2    S
0-2 subordinations     57          57   57   57 / 57   57        100  100  100 / 100  100
3-5 subordinations     31          30   31   30 / 31   30        97   100  97 / 100   97
6+ subordinations      22          18   21   20 / 22   19        82   95   91 / 100   86


The agreement rates in table 22 indicate that there was almost perfect agreement between the first analysis and the new one in data recorded for sentences with different degrees of complexity. The rates of data agreement decreased somewhat as sentence complexity increased. These results are visualized in chart 15.

Chart 15. Data agreement by degree of sentence complexity

The reliability check presented in this section was carried out on a random 10% sample of sentences in the corpus, using one randomly selected target language of the five for each sentence. Of the three independent and three dependent variables recorded for each sentence, five could vary between the two analyses: sentence complexity, reordering, changes in single nestings, changes in double nestings, and changes in semantic relations. With values for those five variables recorded twice each in 110 sample sentences, the data check covered a total of 1,100 values (5 variables × 110 sentences × 2 analyses), representing 2% of the overall data. If we discount the data for changes in double nestings because of the low numbers involved, a total of 880 significant values were recorded and compared.

Both in percentage rates and in weighted kappas, there was almost perfect agreement between the first analysis and the new one in values recorded for variables across the board. Breaking the results down, there were: (a) slightly more differences in data for sentence complexity than for the three indicators of difficulty, (b) slightly more differences in data for legal translation than for the other two modes, and (c) more differences in data for complex sentences than for simple ones. There were no major differences in data recorded for different target languages.

These results show that a random representative sample of the data in this study is almost perfectly replicable, as recorded independently by the author after an interval of several months. Although the sample represents just 2% of the total data, the large number of values checked and the very high levels of agreement found suggest that similar levels apply to the corpus overall. This confirms that the findings of the study can be regarded as robust.

The reliability check also involved an analysis of the issues behind various parsing decisions. Those parsing issues are presented in section 2 of annex I. They reveal many gray areas where distinctions were borderline. It might be possible to clear up some of those gray areas with a more rigorous version of the semantic parsing method developed for this study. Others may be intrinsically gray, because human language doesn’t always allow for clear-cut distinctions – like whether a proposition is functionally independent or subordinate, which of two linked propositions is more salient, or whether a speaker or writer identifies with the content of a subordinate clause.

Still, those gray areas are unlikely to affect the findings of the study significantly. That assessment is supported by the very high level of consistency found in the data check. It’s further strengthened – as explained in section 4.2.1 on the procedure applied for the statistical analysis – by a statistical model that isolates any parsing inconsistencies that do occur as extraneous effects not involving the independent variables being tested.