Annex I – Semantic parsing
4. Parsing guidelines
Below is an overview of the guidelines followed in parsing the original English version and the translated or interpreted versions of each sentence in the corpus for this study.
4.1 Segmenting the original English version of a sentence
1. Any syntactic phrase with a predicate and at least one argument or adjunct is segmented as a proposition. One proposition can be part of another. Each proposition is enclosed in square brackets, followed by a superscript number. That number indicates the linear position of the proposition in the sentence.
[We should always help others.]1
[Helping others]1 [is a reward in itself.]2
2. A coordinating link between propositions is included in the segment for the proposition that follows it. A subordinating link is included in the segment for the subordinate proposition.
[This tool slices]1 [and dices.]2
[This tool is useful]1 [for peeling potatoes.]2
3. Multiple arguments are included in the same proposition.
[We’re having turkey, stuffing and sweet potatoes.]1
4. Multiple predicates are segmented in separate propositions. Arguments, adjuncts and links shared by more than one proposition are included in the proposition they’re syntactically contiguous with and implied in the other one.
[at the time of signature]1 [or ratification of the treaty]2
5. A proposition that syntactically splits the predicate of another proposition from any part of its arguments counts as a nesting. A proposition that just splits an adjunct or a link from the rest of another proposition doesn’t count as a nesting. A split proposition has both parts enclosed in brackets. The part with the predicate gets a normal black number. The number for the other part is gray and doesn’t count in the numbering order.
[The cat]2 [the dog was chasing]1 [ran up the tree.]2
A proposition that splits the predicate and arguments of another proposition also counts as a nesting if either of the surrounding parts is a proposition in its own right. In that case, each proposition is numbered separately.
[Respecting people’s privacy]1 [when handling personal data]2 [is a key concern.]3
6. A modifier with at least one argument or adjunct is segmented as a separate proposition.
[The]2 [rapidly changing]1 [climate is a huge challenge.]2
[protection of data]1 [concerning people’s private lives]2
Modifiers with no arguments or adjuncts are included in the same proposition as the noun they modify.
[protection of people’s private data]1
7. A shortened relative clause is segmented as a separate proposition.
[The books]2 [on the table]1 [are mine.]2 ( = [The books]2 [which are on the table]1 [are mine.]2 )
but: [Leave the books on the table.]1
8. A semantically weak predicate that can be omitted in rephrasing isn’t segmented separately.
[if there’s anything you need]1 ( ≈ [if you need anything]1 )
[if such an act occurs as a result of using force]1 ( ≈ [if such an act results from force]1 )
9. An adjunct providing typical information – beneficiary, accompaniment, result, instrument, location, goal, time (except for descriptions of events), manner and measure (Larson 1984: 219-223) – is included in the same proposition as its predicate.
[I wrote this poem for you.]1
[I went to the movies with my boyfriend.]1
[I painted the room pink.]1
[We usually speak on FaceTime.]1
[Let’s meet in the park.]1
[We did it for fun.]1
[Let’s meet tomorrow afternoon.]1
[We can meet in person.]1
[I’m totally convinced.]1
An adjunct providing information other than typical information as defined above can generally be rephrased as a clause and is segmented as a separate proposition. In English and many other languages, such a proposition tends to be separated from its parent by a pause in speaking or a comma in writing.
[Because of these concerns,]1 [experts are advising caution.]2
[Many are willing to try,]1 [despite these concerns.]2
The same goes for an adjunct with the description of an event.
[After the meeting,]1 [we’ll go home.]2 ( ≈ [After we meet / After the meeting ends,]1 [we’ll go home.]2 )
10. A phrase which is an adjunct to an entire proposition (not just its predicate) is segmented as a separate proposition if it has an argument.
[In addition to these concerns,]1 [there’s a mistrust of anything new.]2
Otherwise, it’s included in the rest of the proposition it’s an adjunct to.
[Clearly, you need a vacation.]1
11. A phrase headed by a noun in apposition with another noun is segmented as a separate modifying proposition.
[We’re joined by Ms Nakamura,]1 [Deputy Director for Marketing.]2
The same goes for a phrase with a head like “including” or “such as” which modifies a noun.
[Members discussed various issues,]1 [including rules, fees and penalties.]2
12. A comment clause (a main clause used as a formulaic expression of the speaker or writer’s attitude to an assertion made in a subordinate clause) isn’t segmented as a separate proposition.
[I think it’s going to rain.]1 ( ≈ [It’ll probably rain.]1 )
13. A process nominal (a nominal describing a process, event or situation) is segmented as the predicate of a separate proposition if it has any arguments or adjuncts.
[Bavaria is known]1 [for car manufacturing.]2 ( ≈ [the manufacturing of cars]2 )
[We need to mitigate]1 [climate change.]2 ( ≈ [the changing of the climate]2 )
Though a phrase like “climate change” is a set term with an established equivalent in other languages, it has argument structure, so it’s segmented here as a separate proposition, as explained before.
In contrast, a result nominal (a nominal describing the result of an action or something created by an action) is treated as a simple noun, even if it’s modified.
[We have common responsibilities.]
The distinction between process nominals and result nominals was discussed in section 1.7 of this annex.
14. For two predicates in an asymmetric relation with shared arguments, there’s a continuum from nearly total propositional integration to increasing autonomy. As a guide, if each predicate can be modified separately, they’re segmented in separate propositions. Otherwise, they’re included in the same proposition. There are some borderline cases.
According to this test, both predicates are included in the same proposition in these sentences:
[He has to leave.] ( ✓[He always has to leave.] *[He has] [always to leave.] )
[He lets me stay.] ( ✓[He always lets me stay.] *[He lets me] [always stay.] )
[She wants to help.] ( ✓[She always wants to help.] *[She wants] [always to help.] )
[He forced me to leave.] ( ✓[He always forced me to leave.] ?[He forced me] [always to leave.] )
[She tries to help.] ( ✓[She always tries to help.] ?[She tries] [always to help.] )
And each predicate is segmented in a separate proposition in these sentences:
[He convinced me] [to stay.] ( ✓[He always convinced me] [to stay.] ✓[He convinced me] [always to stay.] )
[I promised] [to help her.] ( ✓[I always promised] [to help her.] ✓[I promised] [always to help her.] )
[I’ll help you] [look great.] ( ✓[I’ll always help you] [look great.] ✓[I’ll help you] [always look great.] )
[I said] [I’d help her.] ( ✓[I always said] [I’d help her.] ✓[I said] [I’d always help her.] )
[We reaffirm] [our faith.] ( ✓[We always reaffirm] [our faith.] ✓[We reaffirm] [our constant faith.] )
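The bracket-and-number notation used in the examples above can also be read mechanically. The following is only a minimal sketch, assuming the plain-text form shown here (a closing bracket immediately followed by the proposition number); the names `parse_segments` and `group_by_proposition` are hypothetical and aren't part of the method itself. It can't distinguish black numbers from gray ones, since that distinction is lost in plain text.

```python
import re
from collections import defaultdict

# One bracketed segment followed by its proposition number, e.g. "[and dices.]2".
SEGMENT = re.compile(r"\[([^\[\]]*)\](\d+)")


def parse_segments(annotated: str) -> list[tuple[int, str]]:
    """Return (number, text) pairs in the order the segments appear."""
    return [(int(num), text) for text, num in SEGMENT.findall(annotated)]


def group_by_proposition(annotated: str) -> dict[int, list[str]]:
    """Collect the parts of each proposition; a split proposition has two parts."""
    parts: dict[int, list[str]] = defaultdict(list)
    for number, text in parse_segments(annotated):
        parts[number].append(text)
    return dict(parts)


sentence = "[The cat]2 [the dog was chasing]1 [ran up the tree.]2"
print(parse_segments(sentence))
print(group_by_proposition(sentence))
```

In this example, proposition 2 comes back with two parts, which is how a split proposition shows up in the parsed output.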
4.2 Segmenting a translated or interpreted version of a sentence
The purpose of this parsing method is cross-linguistic comparison. So each proposition in a translated or interpreted version of a sentence has the same number as the corresponding proposition in the original English version.
Also, all translated or interpreted versions of a sentence are parsed as similarly as possible to the original English version. This means:
1. A phrase in a translated or interpreted version of a sentence isn’t segmented as a separate proposition if it doesn’t correspond to a separate proposition in the original English version, even if it would be segmented separately if the parsing method were applied directly to the translated or interpreted version.
2. A phrase in a translated or interpreted version of a sentence is segmented as a separate proposition if it corresponds to a separate proposition in the original English version, even if it wouldn’t be segmented separately if the parsing method were applied directly to the translated or interpreted version.
3. If a proposition in a translated or interpreted version of a sentence doesn’t include or imply a predicate or an argument included in the corresponding proposition in the original English version, or if it includes a predicate or an argument not included or implied in the original version, the segment number of the proposition in the translated or interpreted version is followed by a “Δ” (“delta” for “change”).
English: [Such an act is a violation of this Treaty.]1
→ other language: [Such an act is punishable.]1Δ
4. If a proposition in a translated or interpreted version of a sentence doesn’t correspond to any part of the original English version, it’s marked by an “X” instead of a segment number.
English: [Such an act is punishable.]1
→ other language: [Such an act is punishable]1 [as determined by the Court.]X
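The marking rules in points 3 and 4 can be summarized as a small decision function. This is a rough sketch under a strong simplifying assumption: the content of each proposition is reduced to a set of predicate and argument labels, and "included or implied" is reduced to set membership. In the study itself these are judgments made by the annotator. The function name `segment_marker` and the labels in the example are hypothetical.

```python
from typing import Optional


def segment_marker(original: Optional[set[str]], translated: set[str]) -> str:
    """Return "", "Δ" or "X" for a proposition in a translated version.

    `original` holds the predicate and argument labels of the corresponding
    English proposition, or None if the proposition doesn't correspond to
    any part of the original (point 4).
    """
    if original is None:
        return "X"   # point 4: no counterpart in the original
    if original != translated:
        return "Δ"   # point 3: something is missing or has been added
    return ""        # same predicate and arguments: plain segment number


english = {"such an act", "be a violation", "this Treaty"}
other = {"such an act", "be punishable"}
print(segment_marker(english, other))   # Δ
print(segment_marker(None, {"determine", "the Court"}))   # X
```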
4.3 Number lines
1. Each proposition segmented and numbered as above is represented by its number in a line below the sentence.
2. The number for each subordinate proposition is labeled “arg,” “mod” or “adj,” according to its semantic role (argument, modifier or adjunct) in relation to its parent, followed by the number of its parent.
arg2      arg2 adj3 mod4
1    2    3    4    5
The number line above indicates that propositions 1 and 3 are arguments of proposition 2, proposition 4 is an adjunct to proposition 3, and proposition 5 modifies a noun in proposition 4.
3. A proposition containing reported speech or thought is labeled “rep”, followed by the number of the proposition (if any) where its perspective is established.
1 2 3 4 5
The number line above indicates that proposition 2 contains reported speech or thought, and that its perspective is established in proposition 1.
A statement or thought which the speaker or writer identifies with isn’t treated as reported speech or thought, as explained in section 1.6 of this annex.
4. For a split proposition, where the part containing the predicate has a normal black number in the segmented text and the other part has a gray number, only the normal black number is copied onto the number line.
5. The number for a nested proposition (a proposition that syntactically splits the predicate of another proposition from any part of its arguments) is enclosed in curly brackets.
1 2 3 {4} 5
The number line above indicates that proposition 4 is nested.
6. If the label over a number in the number line for a translated or interpreted version of a sentence doesn’t match the label above the same number in the number line for the original version, that number is followed by a “Δ” (“delta” for “change”) in the number line for the translated or interpreted version.
1 2 3Δ 4 5
The number line above indicates that proposition 3 is attached with a different relation or to a different parent in a translated or interpreted version of a sentence compared to the original version.
7. If a proposition is marked with a “Δ” or an “X” in the segmented text for a translated or interpreted version of a sentence (see points 3 and 4 of section 4.2 above), its number is followed by a “Δ” in the number line for that version.
1 2 3Δ 4 5
The number line above indicates that the predicate and arguments of proposition 3 of a translated or interpreted version of a sentence don’t match those of the original version, or that proposition 3 of the translated or interpreted version doesn’t correspond to any part of the original version.
8. If a proposition in the original version of a sentence has no corresponding proposition in a translated or interpreted version, its number appears at the end of the number line for that version, but in gray, followed by a “Δ”.
1 2 4 5 3Δ
The number line above indicates that proposition 3 of the original version of a sentence is missing in a translated or interpreted version.
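Points 5 to 8 above can be sketched as a function that builds the number line for a translated or interpreted version from the number line of the original. This is only an illustration. The class `Prop`, the function `number_line` and the label strings such as "arg2" are assumptions made for the example, and gray numbers can't be shown in plain text. How a nested number with a changed label is displayed isn't specified above, so the "{4}Δ" form produced here is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prop:
    number: Optional[int]          # None stands for a proposition marked "X"
    label: Optional[str] = None    # e.g. "arg2", "adj3", "mod4"; None if unlabeled
    nested: bool = False           # point 5: number shown in curly brackets
    content_changed: bool = False  # point 7: marked "Δ" in the segmented text


def number_line(original: list[Prop], version: list[Prop]) -> str:
    """Render the number line of a version, given the original's propositions."""
    labels = {p.number: p.label for p in original}
    tokens = []
    for p in version:
        if p.number is None:
            tokens.append("XΔ")    # point 7: added proposition
            continue
        token = "{%d}" % p.number if p.nested else str(p.number)
        if p.content_changed or p.label != labels.get(p.number):
            token += "Δ"           # points 6 and 7
        tokens.append(token)
    present = {p.number for p in version}
    # Point 8: propositions missing from the version go at the end with "Δ".
    tokens += [f"{p.number}Δ" for p in original if p.number not in present]
    return " ".join(tokens)


original = [Prop(1, "arg2"), Prop(2), Prop(3, "arg2"), Prop(4, "adj3"), Prop(5, "mod4")]
relabeled = [Prop(1, "arg2"), Prop(2), Prop(3, "adj2"), Prop(4, "adj3"), Prop(5, "mod4")]
without_3 = [p for p in original if p.number != 3]
print(number_line(original, relabeled))   # 1 2 3Δ 4 5
print(number_line(original, without_3))   # 1 2 4 5 3Δ
```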
4.4 Sentence division
Annex II contains a display page with a data table for the original English version and the five translated or interpreted versions of each sentence in our corpus. The English version of each sentence is divided and punctuated as it appears in the original text or transcript. The translated versions of sentences in the legal texts and subtitled talks are divided and punctuated as published online. The interpreted versions of sentences in the speech are divided and punctuated as transcribed by expert interpreters and interpreter trainers for the Russian, Mandarin and Japanese versions, and by me for the Hungarian and Turkish versions. Transcribers were guided only by the recordings of interpretation, without referring to the original English version of the speech. They followed the conventions of intelligent verbatim transcription, providing an accurate written representation of each recording, while editing out fillers and corrections.
The translations and transcriptions have some differences in sentence division from the original English texts and transcripts. Longer sentences in the original English version of a legal text or subtitled talk were sometimes split into two or more shorter sentences in translation. This may have been due to individual stylistic choice by the translator. In subtitle translation, it may sometimes have been due to the need for subtitles to match the video image or to space restrictions. Guidelines for translating TED talks instruct translators to keep subtitles to at most two lines of 42 characters each, and reading speed to at most 21 characters per second.
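To give a sense of how tight these subtitle limits are, the two constraints mentioned above can be expressed as a simple check. This is a sketch only. The function name `subtitle_problems` is hypothetical, and the way characters are counted (spaces included, line breaks excluded) is an assumption, not something taken from the guidelines.

```python
MAX_LINES = 2
MAX_CHARS_PER_LINE = 42
MAX_CHARS_PER_SECOND = 21


def subtitle_problems(lines: list[str], duration_seconds: float) -> list[str]:
    """List the ways a subtitle exceeds the limits; an empty list means it fits."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    problems = []
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines (limit {MAX_LINES})")
    for index, line in enumerate(lines, start=1):
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append(f"line {index}: {len(line)} characters (limit {MAX_CHARS_PER_LINE})")
    # Assumption: reading speed counts every character on every line.
    speed = sum(len(line) for line in lines) / duration_seconds
    if speed > MAX_CHARS_PER_SECOND:
        problems.append(f"{speed:.1f} characters per second (limit {MAX_CHARS_PER_SECOND})")
    return problems


# A full two-line subtitle has 84 characters, so it needs at least 4 seconds on screen.
print(subtitle_problems(["a" * 42, "b" * 42], 4.0))
print(subtitle_problems(["a" * 42, "b" * 42], 3.0))
```

Under these assumptions, a long English sentence that stays on screen for only a few seconds can't be carried over as a single subtitle, which is one reason a translator might split it.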
Similarly, longer sentences in the original English version of the interpreted speech as transcribed and published were sometimes split into two or more shorter sentences in a transcription of recorded interpretation. This may have been due to individual stylistic choice by the transcriber. It may also reflect syntactic and prosodic breaks made by the interpreter to reduce cognitive load. Sometimes two or more sentences in the original English transcription of the speech were combined into one longer sentence in a transcription of recorded interpretation. And there were borderline cases, where a subjective decision may have been made as to whether to transcribe a certain passage as one or more written sentences.
In general, such changes in sentence division don’t affect the values of variables recorded here for statistical analysis. This is the case when an extra sentence division in one language appears between propositions which are functionally independent in the other language too, as shown in figure 104.
English: [I’m here]1 [because I want]2 [to help.]3
1 2 3
Other language: [I’m here.]1 [I want]2 [to help.]3
1 2 3
Figure 104
Change in sentence division – no effect on variables recorded
The English version in figure 104 is written as a single sentence with an overt link between propositions 1 and 2. The other language version is written as two separate sentences with no overt link between those propositions. But in both versions, propositions 1 and 2 are functionally independent, and proposition 3 is an argument of proposition 2. Whether functionally independent propositions are written in the same sentence or in different sentences has no impact on the method used here for showing linear and hierarchical relations between propositions and recording counts for our dependent variables. Neither does the presence or absence of an overt link between propositions.
Sometimes a change in sentence division results from the appearance in translation or interpretation of a proposition which doesn’t correspond to any part of the original version. This happens when information not contained in the original is added, or when information implied in the original is made explicit. Sometimes the opposite happens: a change in sentence division results from the absence in translation or interpretation of a proposition contained in the original version. This happens when information contained in the original is omitted, or when information explicitly stated in the original is made implicit. Any such change is reflected in the number lines for the original English version and the other language version. And it’s recorded as a change in semantic relations, as shown in figures 105 and 106.
English: [I’m here]1 [to help.]2
1 2
Other language: [I’m here]1 [to help]2 [tidy up.]X
1 2 XΔ
Figure 105
Addition of proposition – recorded as change in semantic relations
English: [I’m here]1 [to help]2 [tidy up.]3
1 2 3
Other language: [I’m here]1 [to help.]2
1 2 3Δ
Figure 106
Removal of proposition – recorded as change in semantic relations
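The number lines in figures 104 to 106 can be compared mechanically by looking for the “Δ” marks. The sketch below assumes the plain-text token forms used in this annex ("3", "{4}", "3Δ", "XΔ"). The function name `changed_propositions` is hypothetical, and the sketch doesn't check that curly brackets are balanced.

```python
import re

# A number or "X", optionally in curly brackets, optionally followed by "Δ".
TOKEN = re.compile(r"\{?(\d+|X)\}?(Δ?)")


def changed_propositions(number_line: str) -> list[str]:
    """Return the numbers (or "X") of the propositions marked with "Δ"."""
    changed = []
    for token in number_line.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised token: {token!r}")
        if match.group(2):
            changed.append(match.group(1))
    return changed


print(changed_propositions("1 2 3"))    # figure 104: []
print(changed_propositions("1 2 XΔ"))   # figure 105: ['X']
print(changed_propositions("1 2 3Δ"))   # figure 106: ['3']
```

An empty result corresponds to the case in figure 104, where a change in sentence division has no effect on the variables recorded.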
4.5 Segment size
The method for parsing sentences into propositions described in sections 1-3 of this annex is applied to the original English version of each sentence in the corpus. For ease of comparison, the other language versions of that sentence are divided into segments corresponding in information content to the propositions in the original English version. That means the segments in a translated or interpreted version of a sentence may not correspond exactly to propositions as they would be identified if the parsing method were applied directly to that language version.
The reason for doing this, rather than applying the parsing method to each language version of a sentence directly, is to avoid changing the number of segments when the translation or interpretation contains the same information as the original. An example is shown in (156).
(156) English: inalienable rights
Other language: rights which can never be taken away
In the English version of the phrase in (156), the modifier “inalienable” has no arguments or adjuncts. So our parsing method wouldn’t segment it as a separate proposition. Accordingly, the relative clause “which can never be taken away” in the other language version wouldn’t be segmented here as a separate proposition either, even though a relative clause would be treated as a separate proposition if the parsing method were applied directly to that version. Care has been taken to ensure that the segments in each translated or interpreted version of a sentence match the segments in the original English version when the information content is judged to be the same, as in (156). This minimizes the impact of minor phrasing differences between languages on the values for variables in our data.
Sometimes a segmenting mismatch between the original English version and another language version of a sentence does affect the number of segments recorded. Examples are shown in (157) and (158).
(157) English: my older brother
Other language: my brother, [who’s older than me]
(158) English: Paris, [which is in France]
Other language: Paris, France
In the English version in (157), the modifier “older” has no arguments or adjuncts. In the other language version in (158), the same is true of the modifier “France.” So our parsing method wouldn’t treat the single words “older” or “France” in those versions as separate propositions. But in (157), “older” corresponds in the other language version to a functionally independent proposition which makes an assertion. And in (158), “France” corresponds in the English version to a functionally independent proposition which makes an assertion. Because of this change in semantic status, the relative clauses in those versions would be segmented here as separate propositions.
Sometimes segmenting decisions are borderline, as in (159).
(159) English: [There are many people(] [)who love jazz.]
Other language: [Many people love jazz.]
The English version in (159) could be treated either as one or as two propositions. The choice would depend on whether the semantically weak predicate “there are” is taken as the predicate of a functionally separate proposition. In a case like this, if the same information content was expressed in a single proposition in most other language versions, the method used in this study would treat the English version as a single proposition too.
Our analysis doesn’t measure the length or the content of propositions, just the linear and hierarchical relations between them. Some text chunks segmented as separate propositions in English can be quite short, as in (160).
(160) [climate change] goals
The treatment of nominal predicates, like “change” in “climate change,” was discussed in section 1.7 of this annex.
The equivalent of a phrase like (160) in another language may be longer, and may literally mean something like (161).
(161) goals [relating to dealing with the effects of climate change]
If the parsing method used in this study were applied directly to the longer phrase in (161), that phrase could be segmented as several separate propositions, as in (162).
(162) goals [relating] [to dealing with the effects] [of climate change]
Again, the parsing method used in this study is applied to the original English version of each sentence. If it were applied directly to the other language versions, one consequence could be higher counts for changes in semantic relations in some languages, because some of the other languages considered can be wordier than English. Also, the Turkish, Mandarin and Japanese versions of sentences in our corpus already show much higher rates of nesting than the Russian and Hungarian versions. Applying our parsing method directly to the translated or interpreted versions of sentences could well lead to even higher nesting rates for languages like Turkish, Mandarin and Japanese, which often have structures with multiple nestings.
A short English phrase like (160) may correspond to longer phrases in other languages. The components of those phrases may be sequentially ordered as shown in (163) and (164).
(163) countries’ [relating] [to dealing with [climate change] effects] goals (Mandarin)
(164) countries’ [climate change] [effects-with-dealing-to] [related] goals (Turkish, Japanese)
Despite such differences, if the original English version of a sentence is segmented as one proposition – or two or three – and if the equivalents of those propositions in other language versions are judged to have the same information content and functional status as in English, those versions are divided into the same number of segments as the English version. This makes the various language versions of each sentence easier to compare, minimizing the impact of minor phrasing differences between languages on the values of variables in our data. This represents a deliberately conservative choice to refrain from recording some information, in order to avoid any suggestion that the counts for indicators of difficulty in structurally different language pairs – which are already comparatively high – have been inflated by the inclusion of irrelevant data.