Version 0.68 (Jan. 1, 2004)
HASIDA Kôiti
This document is under frequent revision.
The latest version is available as
http://i-content.org/GDA/tagset.html.
Also take a look at the Linguistic DS,
which is a linguistic-content description scheme
based on the GDA tag set, to be published soon as part of MPEG-7 MDS
Amendment 1.
Any comments are welcome.
This document is properly displayed by Netscape Navigator 6.0, Mozilla 1.0, Opera 6.0 or later, set up to enable stylesheets. Tags, attribute names, and attribute values are in bold face in the text. Technical terms, tags, attribute names, and attribute values are in red bold font where they are introduced. * (asterisk) means that the term is used not widely but locally in this and related documents. Examples are in a green typewriter font. Annotation is normally done at a certain fixed level of minuteness, but the levels of minuteness are indeterminate in the examples below.
This document discusses the GDA (Global Document Annotation) tag set, providing the rationale for the tags and examples of their usage. It may serve as a tagging manual for human annotators with ample knowledge of both theoretical and computational linguistics, but the actual tagging manual for most annotators must be written on the basis of this document.
The GDA tag set aims at making the propositional content (and the dialogue structure) of linguistic data automatically recognizable. It is being developed so as to be easy to embed into the TEI, EAGLES, and HTML tag sets, so the meanings of the GDA tags should be maximally consistent with those three tag sets. Some tags are imported from them, but when such a tag is defined in two or more of them, the meaning in HTML is preferred to that in TEI and EAGLES, because GDA tags are expected to be very often embedded in HTML files.
This document is released for the sake of public survey and evaluation. Empirical evaluation is necessary on both how useful the tags described below are for practical applications and how consistently people can annotate documents with those tags. We would like to improve the tag set by taking results of such evaluations into account, before announcing it for public, extensive use.
To optimize the benefit per cost of tagging, we try to design as simple a tag set as possible which captures enough content for practical applications. The propositional content of an utterance might be unlimitedly complex due to the complexity of the context. An appropriate degree of complexity of the tag set could be identified, however, because the present technology concerning natural language can effectively process only limited sorts of information. For instance, tagging for metonymy may not be very useful. The tag set should go along with the contemporary state of the art. We can refine the tag set when more detailed tags become useful as technology advances.
Here we do not restrict ourselves to any single NLP/AI application, but try to address as many aspects of language as seem useful in any of translation, retrieval, summarization, question answering, case-based reasoning, presentation, and so forth. Users interested in only some of these applications may want to use subsets of the tag set. In this connection, the GDA tags are almost entirely optional, as application technologies do not normally require exhaustive tagging. In fact, many relatively simple untagged sentences can be analyzed correctly by the current technology. So we have tried to design the GDA tag set in such a way that more minute annotation entails more information; in particular, if you do not annotate then you do not commit yourself to any specific interpretation.
The GDA tag set is not specific to any particular language, though the example passages below are mostly in English. The usage of the tags is subject to some customization for particular languages, but we want to use the same vocabulary for the sake of coordination across different languages. Of course different tagging manuals are necessary for different languages. However, we hope to design the tag set so that it is easy for you to write such a manual once you have understood the idea behind the tag set.
The tag set is not a linguistic theory. It encodes the propositional content and the dialogue structure of a document, remaining somewhat neutral among linguistic theories. Encoding the propositional content and dialogue structure on the one hand and capturing linguistic generalizations on the other are different issues. In particular, we will sacrifice syntactic generalization very often, because syntax is not our primary concern but is used as a partial aid for encoding semantics and pragmatics. This could be justified because people probably have better intuition about semantics and pragmatics than about syntax. Of course linguistic theories are very helpful in designing the tag set, but what is important is that the tag set can represent the semantic and pragmatic structure of a wide range of documents, not that it captures linguistic generalizations. Needless to say, we will attempt to capture as much linguistic generalization as possible as long as we do not sacrifice clarity and ease of encoding semantic and pragmatic structures.
In principle, null annotation entails no information in the GDA tag set. This is to allow partial annotation. In particular, nothing is meant by the absence of a tag or an attribute. For instance, lack of specification of the scope of a plural noun phrase (as an alleged quantifier) does not mean that the noun phrase has no scope. The GDA tag set sometimes allows you to convey some information by lack of annotation, but the tag set is designed in such a way that you should be aware that you are meaning something with null annotation in such cases.
The rest of this document consists of two parts.
The first is Section 2, which may be regarded as a user manual.
This section contains subsections each of which addresses
how to annotate a particular type of linguistic construction
such as dependency, anaphora, and scoping.
The second part is Section 3 and thereafter, which may be thought
of as a reference manual, discussing how to use each tag and attribute.
2. Annotation of Semantic Structures
The GDA tag set is designed so that the GDA-annotation reduces the ambiguity in mapping a document to a sort of entity-relation graph (or semantic network) representing the underlying semantic structure. The tag set does not directly encode such graphs, though it should be straightforward to encode them with RDF or related tag sets such as DAML.
A semantic network consists of nodes and links, which may be labelled with concept identifiers. For instance, sentence `Tom met a girl' may have the semantic structure shown in Figure 1.
Figure 1: A semantic structure of `Tom met a girl.'
Operator identifiers are concept identifiers representing the (syntactic, semantic, or other) functions of function words (auxiliary, preposition, postposition, article, etc.) and function morphemes (tense marker, number marker, etc.). In Figure 1, the gray rectangles are operator identifiers. Many operator identifiers, such as agt (agent), sg (singular), and past, are defined natively in the GDA tag set, and thus are of the simplest form without the `ont:' prefix. An operator identifier is either a relation identifier or a unary operator identifier. A relation identifier represents a binary relation and is often associated with a dependency. For instance, the dependency of the subject on the verb may carry an AGENT relation, which is indicated by relation identifier agt. A unary operator identifier represents the function of an auxiliary, an article, some inflectional morpheme, or the like. Both relation identifiers and unary operator identifiers may lack explicit linguistic markings. For instance, no explicit word or morpheme designates the agt relationship between `met' and `Tom' in `Tom met a girl.'
The disks in Figure 1 are entities such as objects and events. As mentioned above, the dashed arrows represent membership to or containment in concepts. For instance, node m1 is an instance of concept eng:meet, thus representing a meeting event. Similarly, node g2 represents a single girl. The labelled solid arrows are instances of (primitive) binary relations designated by the labelling relation identifiers. For instance, the arrow from node m1 to Tom means that Tom is the agent of meeting event m1.
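As an informal illustration, the network described above can be held as a list of labelled triples. The Python sketch below is not part of the GDA tag set: the triple layout and the `inst` label for the dashed membership arrows are assumptions, and the `obj` arrow from m1 to g2 is inferred from the relation identifiers used in this section rather than read off the figure.

```python
# Hypothetical encoding of Figure 1 as (source, label, target) triples.
INST = "inst"  # assumed label for the dashed "instance of" arrows

triples = [
    ("m1", INST, "eng:meet"),  # m1 is a meeting event
    ("m1", INST, "past"),      # unary operator: past tense
    ("g2", INST, "eng:girl"),  # g2 is a girl
    ("g2", INST, "sg"),        # unary operator: singular
    ("m1", "agt", "Tom"),      # Tom is the agent of m1
    ("m1", "obj", "g2"),       # g2 is the object of m1 (inferred)
]


def concepts_of(node):
    """Concepts (lexical or unary operators) that the node is an instance of."""
    return sorted(t for s, label, t in triples if s == node and label == INST)


def related(node, relation):
    """Targets reached from the node via the given relation identifier."""
    return [t for s, label, t in triples if s == node and label == relation]
```

For example, `related("m1", "agt")` yields the node for Tom, mirroring the reading that Tom is the agent of the meeting event.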
The semantic structure in Figure 1 can be addressed as follows.
<su> <persnamep opr="agt">Tom </persnamep> <v sem="past.eng:meet">met </v> <np opr="obj"> <adp sem="sg">a </adp> <n sem="eng:girl">girl</n> </np>. </su>
The <su> (sentential unit) tag encloses a sentence, which is a linguistic constituent with no syntactic dependency on any other part of the linguistic data. <su> and the tags which may appear in sentences are called intrasentential tags, and elements with those tags are called intrasentential elements. All the intrasentential elements except for operator elements (<fo>, <bfo>, <bo>, and <fbo> elements) and <segs> elements are syntactic constituents. <su>, <ij> (interjection), and the intrasentential tags (such as <np>) with suffix `p' (for `phrase') are called phrasal tags, and elements with phrasal tags are called phrasal elements. A phrasal element cannot be the head of a larger constituent. That is, no element can syntactically depend on a phrasal element, unless explicitly specified by the dep (dependency) attribute. Since <persnamep> is a phrasal tag, for instance, `met' cannot depend on `Tom' in the current example. Other intrasentential tags are called head tags, and elements with them are called head elements. A head element can be the head of a larger constituent. Dependencies must hold among children of the same parent element, except when specified by the dep attribute. So in the above example:
The semantic structure of each GDA element consists of two parts which are called the head semantics and the operator, partially encoded by the sem and opr attributes, respectively. The value of each of these attributes in the above example is a concept string, which is a sequence of one or more concept identifiers connected via dots (`.'). In general, the value of sem and opr attribute may be one or more concept strings separated by blanks. The only nonatomic concept string in the above example is past.eng:meet. The order among the concept identifiers in a concept string should be consistent with the directions of the solid arrows in the corresponding part of the semantic structure. For instance, the following annotation is possible, where past and ont1:buy#2 must precede obj and obj must precede sg and deu:Blume, but there is no restriction on the order between past and ont1:buy#2 or that between sg and deu:Blume.
<v sem="past.ont1:buy#2.obj.sg.deu:Blume">bought a flower </v>
Compare this with the corresponding semantic structure below:
Figure 2: The directed path represented by
past.ont1:buy#2.obj.sg.deu:Blume.
In general, a concept string represents a directed path as in Figure 2, which is in general a linear sequence of disk nodes connected via solid arrows in the same direction, where the nodes are instances of lexical concepts and unary operators, and the arrows are instances of binary relations, and the order among the concept identifiers in the concept string is compatible with the order along the directed path.
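The surface syntax of sem and opr values described above (blank-separated concept strings, each a dot-separated sequence of concept identifiers) can be read mechanically. The following Python sketch is only an illustration; the function names are hypothetical, and it assumes that concept identifiers themselves never contain a dot or a blank.

```python
def parse_concept_value(value):
    """Split a sem/opr value into concept strings and their identifiers.

    Blanks separate concept strings; dots separate the concept
    identifiers inside one concept string (assumed dot-free themselves).
    """
    return [concept_string.split(".") for concept_string in value.split()]


def is_native(identifier):
    """Assumption: an identifier without a `prefix:` part is GDA-native."""
    return ":" not in identifier


# The directed path of Figure 2, in order along the solid arrows.
path = parse_concept_value("past.ont1:buy#2.obj.sg.deu:Blume")[0]
```

Because the value is split on blanks first and on dots second, `a.b c` is read as the two concept strings `a.b` and `c`.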
In general, the head semantics and the operator of a GDA element are directed paths, as shown in Figure 3:
Figure 3: Two parts of semantic structure of a GDA element.
The initial node of the head semantics of GDA element X is equal to the final node of the operator of X; it is called the self node of X and written self(X). The initial node of the operator of X is called the governor node of X and written gov(X).
There are four types of syntactic constructions: dependency structure, coordinate structure, apposition, and repair. Dependency structures and appositive structures partially overlap. The type of the syntactic construction at the top of an element is specified by the syn attribute.
If GDA element X depends on another GDA element Y or X is the head of Y, then gov(X) = self(Y). In Figure 1, for instance, gov(`Tom') = self(`met') = m1, where the solid arrow labelled with agt is the operator of `Tom.' The operator of X may be null, in which case self(X) = gov(X) = self(Y). In fact, self(`people') = gov(`people') = self(`for') in `for people'.
The opr attribute encodes a relationship in which the current element stands with respect to the element that it depends on, as in:
<v>go <adp opr="fin">to Paris </adp></v>
The opr attribute with a relation identifier as its value may be attached to a function word, as in:
<v>go <ad sem="fin">to </ad>Paris </v>
Relation identifiers can be combined to make compound relation identifiers. There are two types of combination. If a and b are relation identifiers, then so are a.b and a b. The operator `.' has precedence over the blank operator. That is, a.b c is the combination of a.b and c through blank.
a.b represents the composition of a and b as binary relations. That is, x and z stand in relation a.b, if and only if there exists y such that x and y stand in relation a and y and z stand in relation b.
<su>Tom stayed there <adp opr="tmx.fin">until noon</adp>.</su>
<v>eat it <adp opr="obj.int">from the head</adp></v>.
sleep <np opr="tmx.msr">ten hours </np>
swim <np opr="spx.msr">two miles </np>
Note that `until noon' in the first example has opr="tmx.fin" but not opr="tim.fin". Since tim is equivalent to tmx.sup, tim.fin is equivalent to tmx.sup.fin, which is the final point of a superset of the duration of the event. The sentence entails that Tom was there at noon, but tim.fin fails to entail it.
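The composition a.b can be illustrated with relations held as sets of pairs. This Python sketch is a toy model, not part of the tag set: the function name `compose` and the node names (`stay1`, `interval1`, `noon`) are made up for illustration.

```python
def compose(a, b):
    """Return a.b: all (x, z) with (x, y) in a and (y, z) in b for some y."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}


# Toy data: tmx links an event to its duration, fin links a duration
# to its final point (both sets are invented for this example).
tmx = {("stay1", "interval1")}
fin = {("interval1", "noon")}

tmx_fin = compose(tmx, fin)
```

Here `tmx_fin` relates the staying event directly to noon, which is the reading intended by opr="tmx.fin".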
Multiple concept strings in the value of a sem or opr attribute are conjunctive: they represent the intersection of the denotations of the concept strings. Here we generally regard the denotation of a concept string as a binary relation, even when the concept string is a unary operator. So the conjunction of concept strings denotes the intersection of the binary relations denoted by the concept strings. (A binary relation is a subset of the Cartesian product of two sets. So the conjunction of several binary relations is the intersection of those subsets.) The value of a sem or opr attribute is multiple concept strings when the meaning in question cannot be captured by a single category, such as below.
I came <adp opr="res pur">so that I met him</adp>.
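The conjunctive reading of blank-separated concept strings amounts to set intersection. The sketch below models this in Python under the same toy assumption that a relation is a set of pairs; `conjoin` and the event names are hypothetical.

```python
def conjoin(*relations):
    """Intersect binary relations given as sets of (x, y) pairs."""
    if not relations:
        raise ValueError("at least one relation is required")
    result = set(relations[0])
    for relation in relations[1:]:
        result &= set(relation)
    return result


# Invented pairs: come1 results in meet1 and tired1, and was done for
# the purpose of meet1 and greet1.
res = {("come1", "meet1"), ("come1", "tired1")}
pur = {("come1", "meet1"), ("come1", "greet1")}

res_pur = conjoin(res, pur)
```

Only the pair that is both a result and a purpose survives, which is what opr="res pur" is meant to express.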
In `Kim likes Mary better than Betty,' for instance, we must specify whether `Betty' is compared with `Kim' or `Mary.' Similarly, in `Kim blamed Mary together with Betty,' we want to mark whether Kim and Betty blamed Mary or Kim blamed Mary and Betty. To implement this in general, we use extended relation identifiers of the form a-b, where a and b are relation identifiers but not extended relation identifiers. a is a relation identifier such as cmp and bsc, and b is a relation identifier indicating which element is in parallel with the current element, as in what follows:
Kim likes Mary better <adp opr="cmp-sbj">than Betty</adp>.
= Kim likes Mary better than Betty likes Mary.
Bill helped Tom <adp opr="ql-sbj">as his son</adp>.
= Being Tom's son, Bill helped Tom.
Tom found Sue <adp opr="pst-sbj">before Kim</adp>.
= Tom found Sue before Kim found Sue.
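An extended relation identifier a-b can be taken apart at its single hyphen. The Python sketch below is a minimal illustration; the function name is hypothetical, and it assumes that plain relation identifiers never contain a hyphen.

```python
def split_extended(identifier):
    """Split `a-b` into (a, b); a plain identifier yields (identifier, None).

    Since neither part may itself be extended, more than one hyphen
    (or an empty part) is rejected.
    """
    head, sep, parallel = identifier.partition("-")
    if not sep:
        return identifier, None
    if not head or not parallel or "-" in parallel:
        raise ValueError(f"not a well-formed extended identifier: {identifier!r}")
    return head, parallel
```

For opr="cmp-sbj", the first part names the relation (comparison) and the second names the element in parallel with the current one (the subject).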
The relation identifiers are used as attributes as well, which we will call relational attributes. While the opr attribute appears in the depending element (the satellite in the case of rhetorical relations), the relational attribute appears in the governing element (the nucleus) and points to the depending element. That is, the value of the relational attribute is the referential index of the element, if any, which semantically or pragmatically depends on the element containing this relational attribute. Of course the attribute name indicates the type of the dependency.
<np id="t1">Tom </np><v agt="t1">came</v>.
promise <np id="y1">me </np> <vp sbj="y1">to be admitted </vp>
Coreference is encoded by the relational attribute eq.
<np id="j0">John </np>beats <adp eq="j0">his </adp> wife.
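Resolving such pointers is a matter of looking up id values. The following Python sketch uses the standard `xml.etree.ElementTree` module to pair each element carrying eq with the element it points at; the function name and the enclosing <su> wrapper around the example are assumptions, and special values such as mcn are simply skipped.

```python
import xml.etree.ElementTree as ET


def coreference_links(xml_text):
    """Return (pointing text, pointed-at text) pairs for every eq attribute."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    links = []
    for el in root.iter():
        target = el.get("eq")
        if target in by_id:  # skips missing eq and values like "mcn"
            source_text = "".join(el.itertext()).strip()
            target_text = "".join(by_id[target].itertext()).strip()
            links.append((source_text, target_text))
    return links


sample = '<su><np id="j0">John </np>beats <adp eq="j0">his </adp> wife.</su>'
links = coreference_links(sample)
```

The same lookup would serve for other relational attributes such as agt or sbj by reading a different attribute name.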
Most of the syntactic constructions are dependency structures. The dep attribute can specify dependencies across element boundaries. The other, ordinary dependencies hold within elements. When an intrasentential element lacks the dep attribute, its syntactic relationship with the surrounding context is specified by the syn attribute of its parent element. syn="f" entails forward dependencies, whereas syn="b" backward dependencies.
The complexity of tagging is reduced by syn="f" and syn="b". For instance,
<adp syn="b">in order to talk to one of them </adp>
is equivalent to the following:
<adp> in <seg> order <seg> to <seg> talk <seg> to <seg> one <seg> of them </seg> </seg> </seg> </seg> </seg> </seg> </adp>
Thus syn="b" (may be syn="f" in the following example) allows us to dispense with tags embedding each other. On the other hand, a structure in which many constituents depend on one constituent can be easily treated with phrasal tags for the dependants:
<su syn="b"> <np>I </np> went <np>there </np> <np>yesterday </np> <adp>by foot </adp> <adp>with you </adp> <adp>after lunch</adp>. </su>
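The equivalence between syn="b" and explicitly nested <seg> elements can be spelled out mechanically. The Python sketch below is an illustration only: the function name is hypothetical, and the caller is assumed to supply the chunks (here `of them` is kept as one chunk, as in the expanded example above).

```python
def nest_backward(tag, chunks):
    """Spell out syn="b": every chunk after the first opens a nested <seg>."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    inner = " <seg> ".join(chunks)
    closing = " </seg>" * (len(chunks) - 1)
    return f"<{tag}> {inner}{closing} </{tag}>"


expanded = nest_backward(
    "adp", ["in", "order", "to", "talk", "to", "one", "of them"]
)
```

With seven chunks the result contains six nested <seg> elements, which is the amount of markup that a single syn="b" saves.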
Nonstandard dependencies including crossing dependencies and parentheticals are marked up by the dep attribute. Here is an example of crossing dependency:
<su> <np>I </np> saw <np id="m0">a man </np> <np>yesterday </np> <vp dep="m0">who I don't know </vp> </su>
Although `a man' is a phrasal element, it is a head due to the dependency enforced by the dep attribute. Note that the following tagging is wrong, because it entails that `who I don't know' depends on `saw'.
<su> <np>I </np> saw <np mod="w0">a man </np> <np>yesterday </np> <vp id="w0">who I don't know </vp> </su>
Parentheticals with normal dependencies can be annotated normally:
<su syn="b"> <np> Admission, <adp>even of a regular customer</adp>, </np> is prohibited. </su>
Parentheticals without dependencies can be annotated with dep="nil":
<su> <np>That dog</np>, <adp dep="nil">or <vp><np>it </np>may be <np>a cat</np></vp></adp>, is scary. </su>
<su> may be used to indicate that the parenthetical element does not depend on anything around it:
<su> <np>That dog</np>, <su>or <vp><np>it </np>may be <np>a cat</np></vp></su>, is scary. </su>
Parentheticals with inward dependencies can be annotated with the dep attribute:
<su dep="S" syn="b" opr="cnt"> <np>You </np> should, <su id="S">I suppose</su>, do <np>it </np> by yourself. </su>
Local syntactic constructions other than dependency are coordination, apposition, repair, error, and idiosyncratic structures, specified by syn="c", syn="a", syn="r", syn="e", and syn="i", respectively. These structures consist of peer terms and operators. In `A and B,' for instance, `A' and `B' are peer terms and `and' is an operator. <fo> (forward affixing operator), <bo> (backward affixing operator), and <io> (infix operator) elements are operators when they are children of coordinate, apposition, repair, error, and idiosyncratic structures.
<np syn="c"> Tom<io> and</io>Mary </np>came
<np syn="c"> Tom<bo> instead of</bo>Mary </np>came
syn="c" specifies a coordination, which may have a scope. In general, scoping is encoded by the sce (scoping element) attribute, which points at the scoping element (the coordinate structure in this case). In the following, the whole sentence is the scope of `London and Paris.'
<su sce="LP">I lived in <np id="LP" syn="c">London and Paris</np>. </su>
A collective coordination, which lacks a scope, is encoded with sce="self".
<np syn="c" sce="self">London and Paris </np> are different.
Appositive structure involving gapping is specified by syn="a" and annotated similarly to coordinate structure.
<su syn="a"> <vp> <np>I </np> gave <np id="it">it </np> <adp id="mary">to Mary</adp> </vp>, <bo>that is</bo>, <vp> <np sb="it">the present </np> <adp sb="mary">to my wife</adp> </vp>. </su>
In apposition, two elements corresponding via sb are regarded as coreferential. So `it' corefers with `the present' and `to Mary' corefers with `to my wife' here.
Again here is an explicit alternative:
<su syn="a"> <vp> <np>I </np> gave <np id="it">it </np> <adp id="mary">to Mary</adp> </vp>, <bo>that is</bo>, <vp> <np ed=":">I </np> <v ed=":">gave </v> <np sb="it">the present </np> <adp sb="mary">to my wife</adp> </vp>. </su>
Repair involving gapping is specified by syn="r".
<su syn="r"> <vp> I gave <np id="boy">the boy </np> <adp id="dog">to the dog</adp> </vp>, <io>oh excuse me</io>, <vp> <np sb="boy">the dog </np> <adp sb="dog">to the boy </adp> </vp> </su>
<np id="X">a man <vp><np opr="obj" eq="X">who </np> <persnamep>Mary </persnamep>loves </vp></np>
<np id="X">a woman <vp><adp>for <np eq="X">who</np>'s sake </adp> <persnamep>Tom </persnamep>works </vp></np>
<np id="X">a car <vp><np eq="X" id="W" opr="uba">which </np> Tom believes that Mary <v obj="W">bought </v></vp></np>
A relative clause is a phrase governed by a noun semantically related with some parts of it. Just like a topicalized sentence, a relative clause with a WH complementizer is regarded as a constituent (typically a <vp> element) whose head is the main verb of the clause; the WH complementizer depends on the main verb.
If the relative clause lacks a WH complementizer, then the relation between the noun governing the relative clause and the pronoun (a gap or a resumptive pronoun) coreferring with the governing noun can be encoded by a relational attribute, as below.
<np id="P"> <n>people </n> <vp syn="b"><np>I </np>want to <v obj="P">meet </v></vp> </np>
<np id="BAR" syn="f">the <n>bar </n> <vp obj="BAR"><np>I </np>love </vp></np>
<np id="X" syn="f">the <n>man </n><vp>I think <vp aen="X">is crazy </vp></vp></np>
<np id="P" syn="b"><adp>the </adp><n>person </n>to <vp obj="P">blame </vp></np>
The relation between the relative pronoun, if any, and the head noun governing the relative clause can be encoded by an eq attribute. The relation between the gap (or resumptive pronoun) in the relative clause and the WH complementizer (`which' and `for whom' in the examples below) can be encoded by relational attribute if necessary, too.
opr="uba" means that `whom' does not semantically depend on `want'.
<np> <n>people </n> <vp syn="b"> <np id="X" opr="uba" eq="mcn">whom </np> <np>I</np> want to <v obj="X">meet </v> </vp> </np>
mcn cannot be used if `work' is an element.
<np syn="b"> the <n>painter </n> <vp> <np><ad eq="mcn">whose </ad>work </np> surprised <np>me </np> </vp> </np>
plg cancels uba, so that `work' is associated with `whom' via ben.
<np> the <n>man </n> <vp> <adp id="X" opr="uba.ben">for <np eq="mcn">whom </np></adp> <np>I </np> can <v plg="X">work </v> </vp> </np>
Parasitic gaps are not distinguished from normal gaps.
<np> the <n id="B">book </n> <vp syn="b"> <n id="X" opr="uba" eq="B">which </n> <np>I</np> have <v obj="X">lost </v> <adp>before <v obj="X">reading </v></adp> </vp> </np>
Infinitival relative clause:
the <n>tool </n> <vp syn="b"> <adp opr="mns">by <np eq="mcn">which </np></adp> to <v>open </v> it </vp>
If the noun governing the relative clause is the minimal noun containing the gap or the resumptive pronoun, then the value of the relational attribute may be mcn (minimal container noun), as follows.
<np> <n>people </n> <vp syn="b"><np>I</np> want to <v obj="mcn">meet </v></vp> </np>
<np syn="f">the <n>bar </n><vp obj="mcn"><np>I </np>love </vp></np>
<np syn="f">the <n>man </n><vp>I think <vp aen="mcn">is crazy </vp></vp></np>
<np syn="b"><adp>the </adp><n>person </n>to <vp obj="mcn">blame </vp></np>
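One plausible way to resolve mcn mechanically is to walk up from the element carrying it to the nearest enclosing noun element. The Python sketch below is a simplification under stated assumptions: it treats <n> and <np> as the only noun tags, takes the first <n> child of the container as its head noun, and uses hypothetical names throughout; it is not a definition taken from the tag set.

```python
import xml.etree.ElementTree as ET

NOUN_TAGS = {"n", "np"}  # assumed set of noun tags for this sketch


def resolve_mcn(xml_text):
    """Return (attribute name, head noun text) for each attribute set to mcn."""
    root = ET.fromstring(xml_text)
    parent = {child: p for p in root.iter() for child in p}
    results = []
    for el in root.iter():
        for name, value in el.attrib.items():
            if value != "mcn":
                continue
            # Climb to the nearest ancestor that is a noun element.
            node = parent.get(el)
            while node is not None and node.tag not in NOUN_TAGS:
                node = parent.get(node)
            if node is None:
                continue
            head = node.find("n")  # direct <n> child, if any
            text = head.text if head is not None else "".join(node.itertext())
            results.append((name, (text or "").strip()))
    return results


people = ('<np> <n>people </n> <vp syn="b"><np>I</np> want to '
          '<v obj="mcn">meet </v></vp> </np>')
bar = '<np syn="f">the <n>bar </n><vp obj="mcn"><np>I </np>love </vp></np>'
```

On the two examples, the obj pointer resolves to `people' and `bar' respectively, matching the explicit id-based versions given earlier.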
Topicalization and other long-distance dependencies are interpreted by relational attributes. Topicalization is regarded as essentially the same structure as a relative clause with a WH complementizer. Relation identifier uba means that the element has no direct semantic relationship with what it syntactically depends on. So extraposed elements tend to have uba. plg cancels uba.
<su> <adp opr="uba.gol" id="X">To her</adp>, <v> I think <vp plg="X">he sent a bouquet</vp> </v>. </su>
<su> <np opr="uba" id="X">This </np> is <ajp>easy <vp>for me to <vp obj="X">reach </vp></vp></ajp> </su>
<su> <np opr="uba" id="WH">What </np> <v>do you like to <vp obj="WH">eat</vp></v>? </su>
Combination of a relative clause and another long-distance dependency:
the <n>piano </n> <vp> <np eq="mcn" id="WH" opr="uba">which </np> <np id="S" opr="uba">this sonata </np> is easy <vp>to make her <v obj="S">play </v><adp arg="WH">with </adp></vp> </vp>
Gapping in coordination, apposition, and repair can be encoded with the sb (substitution) attribute. When sb appears in a peer term, the parts of the peer terms which neither point via sb nor are pointed at by sb are copied to the corresponding loci in the other peer terms.
<su syn="c"> <vp syn="b"> <adp id="P">Perhaps </adp> <np opr="ctl.obj"><nump id="n">nearly 270 </nump>people </np> were <v id="k" sit="E">killed </v> </vp> and <vp syn="b" nc="P"> <nump sb="n">1,400 </nump> <vp syn="b" sb="k">reported injured in <np id="E">the earthquake </np></vp> </vp> </su>
This means `perhaps nearly 270 people were killed in the earthquake, and 1,400 people were reported injured in the earthquake' where the underlined parts are copies due to the sb attribute. sb is an adaptation of = in Penn TreeBank. The sit="E" addresses a zero anaphora which means that the killing event took place in the earthquake.
Another, less sophisticated, solution to gapping is to interpret gaps more overtly:
<su syn="c"> <vp syn="b"> <adp>Perhaps </adp> <np><nump>nearly 270 </nump>people </np> were killed <adp ed=":">in the earthquake </adp> </vp> and <vp syn="b"> <np><nump>1,400 </nump><n ed=":">people </n></np> <v ed=":">were </v> reported injured in the earthquake </vp> </su>
Below is another example usage of sb.
<su syn="c"> <vp syn="b"> <np syn="b"> Passengers going to <placename id="S">Shinjuku </placename> </np>, please go to <np id="T" syn="b">Track No. 1</np> </vp>, and <vp syn="b"> <placename sb="S">Ueno</placename>, <np syn="b" sb="T">Track No. 3</np> </vp>. </su>
Here is a corresponding overt solution using ed:
<su syn="c"> <vp syn="b"> <np syn="b"> Passengers going to <placename>Shinjuku</placename> </np>, please go to <np syn="b">Track No. 1</np> </vp>, and <vp syn="b"> <np syn="b"> <bo syn="b" ed=":">passengers going to </bo> <placename>Ueno</placename>, </np> <bo syn="b" ed=":">please go to </bo> <np syn="b">Track No. 3</np> </vp>. </su>
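The copying behaviour of sb can be sketched on a flattened view of the first peer term. The Python below is a deliberately simplified illustration with hypothetical names: the first peer term is given as a list of (id, text) chunks, and the second peer term only supplies replacements for the chunks it points at via sb; nesting and attributes such as nc are ignored.

```python
def fill_gaps(first_peer, substitutions):
    """Rebuild a gapped peer term from the first peer term.

    first_peer    -- list of (id or None, text) chunks in surface order
    substitutions -- maps an id to the text that replaces it via sb
    Chunks not pointed at by sb are copied unchanged.
    """
    parts = []
    for ident, text in first_peer:
        parts.append(substitutions.get(ident, text))
    return "".join(parts)


first_peer = [
    (None, "Passengers going to "),
    ("S", "Shinjuku"),
    (None, ", please go to "),
    ("T", "Track No. 1"),
]
second = fill_gaps(first_peer, {"S": "Ueno", "T": "Track No. 3"})
```

The result corresponds to the overt version of the second peer term, with the copied material restored around the substituted chunks.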
Scopes of quantifiers, negation, modal operators, conditional operators, coordination, plurals, and so forth are encoded by the sce (scoping element) attribute.
The sce value of an element A is the id value of another element B such that A is a scope of B. Here A must command B; an element commands another element when the former contains the latter or contains an element which either points at the latter element via a relational attribute or is pointed at by the latter element via the dep attribute. For instance, the following annotation entails the interpretation that each of three collectors bought one and the same painting, so that this painting has been bought three times as far as the sentence entails.
<su sce="c3"> <np id="c3">Three collectors </np> have bought <np sce="top">a painting</np>. </su>
An optionally scope-introducing element, such as `three collectors' and `Tom and Mary,' actually introduces a scope only when pointed at via the sce attribute. In the above example, if sce="c3" were absent, the interpretation would be that the three collectors cooperatively bought a painting, so that the painting was bought once.
The scopes of elements such as `every man' and `Tom or Mary,' which always introduce scopes, are assumed to be the minimal dominating <vp> or <np> elements. For instance,
<su><np syn="c">Tom or Mary </np>came. </su>
means that Tom came or Mary came, where the scope of `Tom or Mary' is the entire sentence.
For instance, the following means that each of three collectors bought a painting, entailing that three probably distinct paintings were bought.
<su sce="c3"><np id="c3">Three collectors </np> bought <np sce="c3">a painting</np>.</su>
Here the referent of `a painting' is in the scope introduced by `three collectors.' Since there are three instantiations of this scope, corresponding to the three collectors, there are three possibly distinct paintings each of which was bought in one of those instantiations.
For another example, the de dicto reading of `Jane wants to marry a doctor,' which entails no specific doctor, is marked up as follows:
Jane <v id="w1">wants </v>to marry <np sce="w1">a doctor</np>.
Here the doctor is situated in the scope introduced by the modal operator `wants.' Being the head of the complement of `want,' `marry' is forced to be situated in the scope of `wants.' So the sce attribute need not be specified for `marry.' As for the other elements, absence of the sce attribute entails no specific default reading. To entail the de re reading involving a specific doctor, `a doctor' must have an sce attribute pointing to an ancestor <q> or <quote> or <dv> element or the whole document (represented by top).
Similarly, the reading of `every man loves a woman' in which `a woman' is in the outermost situation (that is, one woman is loved by all the men) is encoded by the following annotation:
Every man loves <np sce="top">a woman </np>
The other reading, in which the referent of `a woman' is in the scope of `every man' (different men may love different women), is encoded by:
<np id="e0">Every man </np>loves <np sce="e0">a woman </np>
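The two readings differ only in what the sce attribute points at, which can be listed mechanically. The following Python sketch, using the standard `xml.etree.ElementTree` module, reports the target of each sce attribute; the function name and the enclosing <su> wrappers are assumptions added for well-formedness, and no interpretation beyond the pointer itself is attempted.

```python
import xml.etree.ElementTree as ET

SPECIAL = {"top", "self"}  # sce values that are not id references


def scope_links(xml_text):
    """Map the text of each element carrying sce to what its sce points at."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    links = {}
    for el in root.iter():
        sce = el.get("sce")
        if sce is None:
            continue
        text = "".join(el.itertext()).strip()
        if sce in SPECIAL:
            links[text] = sce
        elif sce in by_id:
            links[text] = "".join(by_id[sce].itertext()).strip()
    return links


wide = '<su>Every man loves <np sce="top">a woman </np></su>'
narrow = '<su><np id="e0">Every man </np>loves <np sce="e0">a woman </np></su>'
```

For `wide`, the pointer of `a woman' resolves to top; for `narrow`, it resolves to `Every man'.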
under construction
3. General Attributes
The following attributes are globally applicable to all the GDA tags except <anchor/> and <alt/>. These attributes are all optional. lang, next and prev are straightforward imports from TEI.
<ajp dtp="mt">Long </ajp>is a short word.
<su dtp="so">Bang! </su>
That's great <ij dtp="vi">:-)</ij>.
from <bo dtp="op">(inside of) </bo>the car
That's a <n dtp="em">*common sense*</n>.
<q id="q1" who="j1" next="q2">`<su id="s1" next="s2">If it rains,</su>'</q> <np id="j1">John </np> said, <q id="q2" prev="q1">`<su id="s2" prev="s1"> I won't come.</su>'</q>
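The next and prev attributes chain the fragments of a discontinuous element, so the full text can be recovered by following next from the first fragment. The Python sketch below uses the standard `xml.etree.ElementTree` module; the function name and the enclosing <doc> wrapper are hypothetical additions so that the example parses as one XML document.

```python
import xml.etree.ElementTree as ET


def join_fragments(xml_text, start_id):
    """Follow the next chain from start_id and join the fragment texts."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    parts, seen, current = [], set(), start_id
    while current in by_id and current not in seen:  # guard against cycles
        seen.add(current)
        element = by_id[current]
        parts.append("".join(element.itertext()).strip())
        current = element.get("next")
    return " ".join(parts)


sample = """<doc><q id="q1" who="j1" next="q2">`<su id="s1" next="s2">If it rains,</su>'</q> <np id="j1">John </np> said, <q id="q2" prev="q1">`<su id="s2" prev="s1"> I won't come.</su>'</q></doc>"""
sentence = join_fragments(sample, "s1")
```

Starting from s1, the chain reaches s2 and then stops, because s2 has no next attribute.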
The syn attribute below can appear in every tag except <anchor/> and <alt/>.
<np><aj>American </aj><n>stock </n><n>holder </n></np>
Here, for instance, the three child elements may have arbitrary dependency relations under the above conditions. Two readings are possible which respect English syntax: stock holder who is American and holder of American stock.
The first of the examples below means that I have boiled several eggs, whereas the second means that I have several boiled eggs.
<su><np>I </np>have <vp><v>boiled </v><np>eggs</np></vp>.</su>
<su><np>I </np>have <np><vp>boiled </vp><n>eggs</n></np>.</su>
The following Russian example shows how to use the latter half of condition 2 above. Note that an empty head copular element must be assumed to be there because the existing children are phrasal elements and a symbol.
<su lang="rus" syn="f"><np>Eta </np> <np>dom</np>.</su>
Word-for-word gloss: this house. `This is a house.'
In an element whose syn value is one of the following (c, a, r, e, and i), each child element must be either a peer term or an operator element (an <fo>, <bfo>, <bo>, <fbo>, or <io> element). So the following annotation is wrong, because `came here' is a peer term but `came' and `here' are not.
Kim <v syn="c"><v opr="pre">came </v> <adp>here </adp>and left</v>.
Below are correct annotations.
Kim <v syn="c"><v opr="pre">came here </v>and left</v>.
Kim ate <np syn="c">bread, <bo>though </bo>not egg</np>.
<np id="sk" syn="c"><bo>Instead of </bo>Sue, Kim </np>came.
A coordination may be collective or distributive. For instance, the example below means either that Tom and Mary got married to each other (collective reading), or that Tom got married to somebody and Mary got married to somebody else (distributive reading).
<su><np syn="c">Tom and Mary </np>got married. </su>
The following annotation means that `two hours and a half' is two hours plus half an hour.
<np syn="c"><n>two hours </n>and <n>a half </n></np>
When the peer terms which are children of an element with syn="c" do not have relation identifiers, the entire element refers to the sum (aggregation) of the referents of the peer terms.
When the peer terms which are children of an element with syn="c" have relation identifiers, the child elements semantically depend on the governor of the entire element.
<n>route </n> <adp syn="c"> <adp opr="int">from London </adp> <adp opr="fin">to Paris </adp> </adp>
Compare:
<v> go <adp opr="src">from London </adp> <adp opr="gol">to Paris </adp> </v>
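The two rules above for syn="c" can be sketched as a small check. This is only an illustration in Python, not part of the GDA specification. It assumes that relation identifiers on peer terms are carried by the opr attribute, as in the examples here, and the helper name coordination_reading is hypothetical.

```python
import xml.etree.ElementTree as ET

# Operator tags as listed in this document.
OPERATOR_TAGS = {"bo", "bfo", "fo", "fbo", "io"}

def coordination_reading(element):
    """Classify a syn="c" element (hypothetical helper).

    Returns "dependents" if any peer term carries a relation identifier
    (assumed here to be the opr attribute), otherwise "sum".
    """
    if element.get("syn") != "c":
        raise ValueError('expected an element with syn="c"')
    # Child elements that are not operator elements are treated as peer terms.
    peers = [child for child in element if child.tag not in OPERATOR_TAGS]
    if any(child.get("opr") for child in peers):
        return "dependents"
    return "sum"

route = ET.fromstring(
    '<adp syn="c"><adp opr="int">from London </adp>'
    '<adp opr="fin">to Paris </adp></adp>'
)
duration = ET.fromstring(
    '<np syn="c"><n>two hours </n>and <n>a half </n></np>'
)
print(coordination_reading(route))     # dependents
print(coordination_reading(duration))  # sum
```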
Dates and times are not marked with syn="c".
<timep><time>two </time> <time>thirty </time></timep>
<su syn="a"> <vp>I introduced <np id="M">Mary </np><adp id="S">to Sue</adp></vp>, that is, <vp><np sb="M">my girlfriend </np><adp sb="S">to my wife</adp></vp>. </su>
<su> <vp> I gave <persname id="M">Mary </persname> <adp id="D">to the dog</adp> </vp>, <io>oh I'm sorry</io>, <vp> <np sb="M">the dog </np> <adp sb="D">to Mary </adp> </vp>. </su>
The following attributes are applicable to all the intrasentential tags (tags for intrasentential contents).
`Tom <v id="cm1">came</v>.' `Who <v cp="cm1"/>?'
<su id="TLHW"> <np id="T">Tom </np> loves <np id="W"><adp eq="T">his </adp>wife</np>. </su> <su cp="TLHW" strict="W">So does <np sb="T">Bill</np>.</su>
<su><np syn="c" sce="self">Tom and Mary </np>got married. </su>
I will <vp ed=":">do that</vp>.
<n orth="enough">enuff </n>
<orgname expan="Electrotechnical Laboratory" abbr="acronym">ETL </orgname>
<n pron="meetee">MITI </n>
Hereafter the addressor means not only the agent of a speech, but also the author of a written passage, the thinker of a thought, the performer of a sign language or a gesture, and so on, where the speech, the passage, the thought, etc. appear as tagged elements in the document. Similarly, the addressee means the recipient of them intended by the addressor.
A referential index is a name. A referential index usually refers to the element which has it as the id value. There are special referential indices which are not the id value of any element. We will call them deictic indices. Different occurrences of the same deictic index may refer to different things in one GDA file. For instance, fwd refers to the element or text subsequent to the element containing it, so that two occurrences of fwd in two distinct elements must refer to different things. The deictic indices are p0 (generic people, such as `your' in `Love your neighbor.'), p1 (first person (addressor) singular, or `I/my/me'), p1p (first person plural, or `we/our/us'), p1i (first person plural including second person), p1x (first person plural excluding second person), p2 (second person (addressee) singular), p2p (second person plural), nil (nothing), top (entire discourse), self (the element itself), fwd (forward), bwd (backward), and mcn (minimal container noun).
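The distinction between ordinary and deictic indices can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation. The function name resolve_index and the <doc> wrapper element are hypothetical, and the sketch does not try to compute what a deictic index refers to, since that depends on where it occurs.

```python
import xml.etree.ElementTree as ET

# Deictic indices enumerated in this document.
DEICTIC_INDICES = {
    "p0", "p1", "p1p", "p1i", "p1x", "p2", "p2p",
    "nil", "top", "self", "fwd", "bwd", "mcn",
}

def resolve_index(root, index):
    """Return ("deictic", None), ("element", el), or ("unresolved", None)."""
    if index in DEICTIC_INDICES:
        # The referent depends on the place of occurrence; not resolved here.
        return ("deictic", None)
    for el in root.iter():
        if el.get("id") == index:
            return ("element", el)
    return ("unresolved", None)

# The <doc> wrapper only gives the fragment a single root element.
doc = ET.fromstring(
    '<doc><su id="TLHW"><np id="T">Tom </np>loves '
    '<np id="W"><adp eq="T">his </adp>wife</np>. </su></doc>'
)
kind, element = resolve_index(doc, "T")
print(kind, element.tag)         # element np
print(resolve_index(doc, "p1"))  # ('deictic', None)
```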
Tags defined in the GDA tag set follow.
Press the <q>`YES' </q>button.
The following tags represent structures in sentences and are called intrasentential tags. Elements with those tags are called intrasentential elements. Among them, phrasal tags are <su>, <ij>, and tags whose names end with `p' (except <p> above, which represents a paragraph), such as <np> and <vp>. Elements with phrasal tags are called phrasal elements. Phrasal elements represent maximal projections, which cannot be heads of larger constituents. No element can syntactically depend on them, except when stipulated by the dep attribute. Head tags are the other intrasentential tags, such as <n> and <v>. Elements with head tags are called head elements. They can be heads in dependency structures without being specified by the dep attribute.
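The naming rule for phrasal tags stated above can be written as a one-line test. The Python sketch below is illustrative only. The function name is_phrasal_tag is hypothetical, and the function assumes it is given a tag name without angle brackets.

```python
def is_phrasal_tag(tag):
    """True for su, ij, and names ending in 'p' other than p itself."""
    return tag in ("su", "ij") or (tag.endswith("p") and tag != "p")

for tag in ("su", "np", "vp", "adp", "n", "v", "p"):
    print(tag, "phrasal" if is_phrasal_tag(tag) else "not phrasal")
```

Among intrasentential tags, any tag for which this returns False is a head tag under the definition above. <p> is excluded because it marks a paragraph and is not an intrasentential tag.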
I <seg>saw a girl </seg>with a telescope.
<su>I <v>want <vp>to <vp>sleep </vp></vp></v>.</su>
<np><ad>the </ad><n>man </n></np>
<name>Mr. <persname>Brown </persname></name>
<num type="int" value="21">twenty one </num> <num type="percentage" value="10">10% </num>
<num type="ordinal" value="2">second </num>
<num type="fraction" value="1/3">one third </num>
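As an illustration of how an application might read the value attribute of the <num> examples above, here is a hedged Python sketch. The function name num_value is hypothetical, and treating a percentage value as a number of hundredths is an assumption made for this example, not something the tag set states.

```python
from fractions import Fraction
import xml.etree.ElementTree as ET

def num_value(element):
    """Convert the value attribute of a <num> element to a Python number."""
    kind = element.get("type")
    raw = element.get("value")
    if kind in ("int", "ordinal"):
        return int(raw)
    if kind == "percentage":
        # Assumption: value="10" stands for 10 percent, i.e. 10/100.
        return Fraction(int(raw), 100)
    if kind == "fraction":
        return Fraction(raw)  # accepts strings such as "1/3"
    raise ValueError("unsupported num type: %r" % kind)

examples = [
    '<num type="int" value="21">twenty one </num>',
    '<num type="percentage" value="10">10% </num>',
    '<num type="ordinal" value="2">second </num>',
    '<num type="fraction" value="1/3">one third </num>',
]
for source in examples:
    print(num_value(ET.fromstring(source)))
```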
<np>Incompleteness Theorem <bibref>(Goedel, 1931) </bibref></np>
<bo>, <bfo>, <fo>, <fbo>, and <io> are called operator tags, and the elements they enclose are operator elements. Operator elements are operators of coordination, apposition, repair, or error, when their parent elements' syn values are c, a, r, or e.
work <bo sem="pur">in order to </bo> live
A <bfo> element is an operator of coordination, apposition, repair, or error, when its parent element's syn value is c, a, r, or e.
<v syn="c">weaken, <bo>rather than </bo>strengthen</v>, <np>the control </np>
<su> <np syn="c"> <bfo>Not only </bfo> <persname>Tom </persname> <bfo>but also </bfo> <persname>Mary </persname> </np> came. </su>
<su> <np syn="c"> <persname>Tom </persname> <io>and </io> <persname>Mary </persname> </np> got married. </su>
<p> and the tags thereafter are called intradivisional tags. <su> and the tags thereafter are called intrasentential tags. The following table shows which tags (on the left) can contain which tags (on the right) as children (not as descendants in general).
The following tags are used to encode ambiguities. In GDA, these tags are usually not manually handled, but instead automatically processed by computers. The elements of these tags are all empty, and can appear anywhere in the document. These tags except <anchor/> are called link tags. Link elements (elements with link tags) are children of other elements only when they are referred to via the dtrs attribute.
<anchor id="n0"/>The idea <anchor id="n1"/> occurred to me <anchor id="s0"/>that I should go <anchor id="s1"/>.
<su dtrs="va0">I <anchor id="a0"/>saw <anchor id="a1"/> the girl <anchor id="a2"/>with a telescope <anchor id="a3"/>.</su> <alt id="va0" content="a0 a3" targets="vp1 vp2"/> <v id="vp1" dtrs="np1"/> <v id="vp2" dtrs="vp3 pp1"/> <n id="np1" content="a1 a3" dtrs="pp1"/> <v id="vp3" content="a0 a2"/> <ad id="pp1" content="a2 a3"/>
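In the example above, each link element names a pair of anchors in its content attribute. Assuming that such a pair delimits the stretch of text between the two anchors (an interpretation drawn from the example, not a definition given here), the covered text can be recovered as in the following Python sketch. The helper names flatten_with_anchors and span_text are hypothetical.

```python
import xml.etree.ElementTree as ET

def flatten_with_anchors(root):
    """Return the plain text under root and the offset of each <anchor/>."""
    parts = []
    offsets = {}
    length = 0

    def add(text):
        nonlocal length
        if text:
            parts.append(text)
            length += len(text)

    def walk(el):
        if el.tag == "anchor":
            offsets[el.get("id")] = length
        add(el.text)
        for child in el:
            walk(child)
            add(child.tail)

    walk(root)
    return "".join(parts), offsets

def span_text(root, content):
    """Text between the two anchors named in a content value such as 'a1 a3'."""
    text, offsets = flatten_with_anchors(root)
    start_id, end_id = content.split()
    return text[offsets[start_id]:offsets[end_id]]

su = ET.fromstring(
    '<su dtrs="va0">I <anchor id="a0"/>saw <anchor id="a1"/> the girl '
    '<anchor id="a2"/>with a telescope <anchor id="a3"/>.</su>'
)
print(" ".join(span_text(su, "a1 a3").split()))  # the girl with a telescope
print(" ".join(span_text(su, "a0 a2").split()))  # saw the girl
```

Under this reading, the element with id="np1" covers `the girl with a telescope' and the element with id="vp3" covers `saw the girl', which correspond to the two attachment readings offered by the <alt/> element.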
Relation identifiers represent primitive binary relations between dependent elements and governing (depended) elements, and include grammatical functions, thematic roles, and rhetorical relations. The distinction among these three types of relations is often vague. For instance, LOCATION counts as both a grammatical function and a thematic role. Although CAUSE is usually regarded as a rhetorical relation, it can also serve as a thematic role of phrases such as `due to lack of money.' This is why we conflate grammatical functions, thematic roles, and rhetorical relations. Among the values introduced below, cau, cnc, cnd, and so on, serve as both rhetorical relations and thematic roles.
One purpose of relation identifiers is to associate complement elements (subjects, objects, indirect objects, and so forth) with the corresponding arguments of verbs, adjectives, etc. To fulfill this, we employ a rather standard approach: the association is specified by marking elements with grammatical functions such as SUBJECT and OBJECT (sbj and obj below, respectively), provided that we have a dictionary containing the argument structures of verbs and so on. In many languages, there is usually no need to explicitly mark up complements such as subjects, objects, and indirect objects, because their grammatical functions are obvious from the surface forms and hence their thematic roles can be inferred from the dictionary. When the verb has multiple argument structures, such as with `Tom opens the door' (where `Tom' is the agent) and `The key opens the door' (where `the key' is the instrument), we can either mark up the subject noun phrases with the thematic role or mark up the verb in terms of the argument structure. Also, by using grammatical functions we do not have to worry about whether the subject of buy should be AGENT or RECIPIENT, for example.
The other purpose of relation identifiers is to resolve ambiguities of both the thematic roles of adjunct elements, which are typically prepositions and postpositions, and the rhetorical relations which are not explicitly marked. To do so, we simply mark up the elements in question with thematic roles and rhetorical relations. However, an exhaustive listing of thematic roles and rhetorical relations appears impossible, as is widely recognized. We are not yet sure about how many thematic roles and rhetorical relations are sufficient for engineering applications such as machine translation, but as mentioned before, the appropriate granularity of classification will be determined by the current level of technology.
The relation identifiers are enumerated below in several clusters.
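The dictionary-based inference described above can be pictured with a toy lexicon. The Python sketch below is purely illustrative. The ARGUMENT_STRUCTURES table, its entries, and both function names are hypothetical and do not reflect any actual GDA dictionary format.

```python
# Hypothetical mini-dictionary: each verb lists its argument structures,
# each mapping a grammatical function (sbj) to a thematic role.
ARGUMENT_STRUCTURES = {
    "open": [
        {"sbj": "AGENT"},       # `Tom opens the door'
        {"sbj": "INSTRUMENT"},  # `The key opens the door'
    ],
    "come": [
        {"sbj": "AGENT"},       # `Tom came'
    ],
}

def candidate_roles(verb, function):
    """Thematic roles the dictionary allows for a grammatical function."""
    frames = ARGUMENT_STRUCTURES.get(verb, [])
    return sorted({frame[function] for frame in frames if function in frame})

def needs_explicit_markup(verb, function):
    """True when the dictionary alone cannot decide the thematic role."""
    return len(candidate_roles(verb, function)) > 1

print(candidate_roles("open", "sbj"))        # ['AGENT', 'INSTRUMENT']
print(needs_explicit_markup("open", "sbj"))  # True
print(needs_explicit_markup("come", "sbj"))  # False
```

Only the ambiguous case calls for marking up the subject or the verb by hand. In the unambiguous case the role follows from the grammatical function and the dictionary entry.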
a friend <ad sem="arg">of </ad> mine
which <ad sem="arg">of </ad> them
He promised <np opr="ctl.exp">me </np><vp opr="cnt">to be permitted </vp>
<np opr="xpl">It </np> is easy to do.
<np opr="xpl">There </np> lived a man in this house.
<su> <adp id="K" opr="uba.ben">For Kim</adp>, <v syn="b"><np>I </np>have <v plg="K">worked </v>hard</v>. </su>
<vp syn="ba"><v ppa="SLOWLY">walk </v> talking <adp id="SLOWLY">slowly </adp></vp>
<su><np opr="agt">Tom </np>came. </su>
Mary beats <np opr="obj">her husband</np>.
Tom is gone, <adp opr="res">so that I'm alone</adp>.
Sue built <np opr="res">a house</np>.
Kim turned the car <adp opr="res">to garbage</adp>.
get it <adp opr="src">from him </adp>
give <np opr="gol">him </np> the book
Tom had a date <ad sem="agt.rpt">with </ad> Mary.
mix sugar <ad sem="obj.rpt">with </ad> salt
<np opr="agt.rcp">Tom and Mary </np> got married.
mix <np opr="obj.rcp">sugar and salt </np>
similarity <ad sem="obj.rcp">of </ad> the two
I gave <np opr="gol">him </np> a book <adp sem="ben">for <np>her </np>sake </adp>
a present <adp opr="gol ben">for you</adp>
It seems <adp opr="exp">to me </adp> that he left.
You came <adp opr="agt.jnt">with her</adp>.
a daughter <adp opr="pos">of mine </adp>
<ss>Tom is gone. <su opr="ela">He escaped. </su></ss>
expensive cars <adp opr="eg">such as Mercedes </adp>
<np>plan <vp opr="cnt">to visit Tokyo </vp></np>
ask <np opr="exp">her </np> <adp opr="cnt">for a date </adp>
the <n>fact <adp opr="cnt">that you're here </adp></n>
persuade him <vp opr="cnt">to go </vp>
<ss><su opr="cau">Tom came. </su> <su>Mary was surprised. </su></ss>
I went home <adp opr="cau">because I was sleepy </adp>
He died <adp opr="cau">of cancer</adp>.
I went there <vp opr="pur">to see her</vp>.
I'll come <adp opr="cnd">if you're there </adp>
<ss><su opr="cnc">Tom came. </su><su>Mary wasn't surprised. </su></ss>
<ss>Tom came. <su opr="cntrst">However, Bill left. </su></ss>
I was asleep <adp opr="tmx">during his talk</adp>.
I was born <ad opr="tim">in </ad> 1958.
She came <ad opr="pre">after </ad> he arrived.
eating <ad sem="coc">while </ad> driving
live <ad opr="loc">in </ad> Tokyo
hang <ad sem="ilc">on </ad> a bar
cut <ad sem="ilc">at </ad> the center
exit <ad sem="via">from </ad> the window
go <ad sem="via">through </ad> the tunnel
climb <np opr="via">a mountain </np>
pass <ad sem="via">by </ad> a bridge
I walked <adp opr="dir">to the north</adp>.
keep yourself <ad sem="opp">from </ad> the evil
independent <ad sem="opp">of </ad> her parents
stay here <adp opr="tmx.int">from tomorrow </adp>
keep drinking <adp opr="tmx.fin">until next morning </adp>
play a role <adp opr="sit">in a ceremony </adp>
Dog is <np opr="in">animal</np>.
Guernica is <np opr="in">his work</np>.
This is <np opr="in">beer</np>.
a <ajp opr="in">female </ajp> doctor
New York <ad opr="app">or </ad> Big Apple
go out <adp opr="ccm">with a book in hand </adp>
<np opr="agt.met">The most <aj obj.eq="mcn">expensive </aj>ham sandwich </np> is gone.
When more concrete relation identifiers are applicable, they should be used instead of met. For instance, part-whole relations should be encoded by sub and sup, and possession by pos.
resemble <np opr="bsc">his father </np>
different <ad opr="bsc">from </ad> the promise
say the same thing as <np opr="bsc-agt">I </np>
as tall <adp opr="cmp">as Bill </adp>
taller <adp opr="cmp">than Bill </adp>
dance <ad opr="sim">like </ad> a butterfly
judge <adp opr="bas">according to the law </adp>
These shoes are too small <adp opr="cev">for me </adp>
<su> <np opr="who">(Tom) </np> <q><su><np eq="p1">I </np>am hungry. </su></q> </su>
<q who="TOM"><su><np eq="p1">I </np>'m hungry.</su></q>
<bo sem="who">According to </bo> the police, the criminal escaped.
When who is used as a relational attribute and it points to an upper context (e.g., who="top"), the utterance is a note by the addressor of that context (in the case of who="top", the author of the entire document).
<q>`<su><adp who="top">(Tom is)</adp>crazy. </su>'</q>
<q>`<su>Th <su who="top">(laughter)</su>at's funny. </su>'</q>
<np id="tom">Tom </np>said. <q who="tom" whm="p1">`Come.'</q>
survive <ad sem="mns">by </ad> eating grasses
paint <ad sem="mns">with </ad> a brush
made <ad sem="mat">of </ad> wood
weigh <np opr="msr"><nump opr="msr">two </nump>kilograms </np>
<np>two cups <ad sem="mob">of </ad>tea </np>
I scolded her <ad sem="ql">as </ad>her father.
talk <ad sem="sbm">about </ad>it.
work three days <np opr="uni">a week </np>
Not only noun phrases but also verb phrases, sentences, and so on refer to objects, events, states of affairs, and so on. Here we introduce attributes to classify such references.
<np opr="gn">Dinosaurs </np>are extinct.
dance like <np opr="gn">a butterfly </np>
Give me <np opr="sg">your fish</np>.
Give me <np opr="pt">water</np>.
<np opr="plgn">These cars </np>are expensive. (Several models of cars are entailed here.)
In a generic reading, the predication concerns (default properties of) the whole kind referred to by the noun phrase in question. An accidental universal quantification, such as with `I know (all) the Emperors of Japan,' does not qualify as a generic reading. We do not distinguish the two types of generic reading: those such as with `Chickens evolved from dinosaurs' and those such as with `Chickens lay eggs.' This distinction is captured by classifying the predicates.
individual vs. stage reading?
Most languages have grammaticized tense marking, but Chinese, for instance, lacks it, so tense tagging will be of great benefit for Chinese. Perhaps no language lacks grammaticized aspect marking, but aspect tagging could be useful in some cases.
Brutus <v opr="past">murders </v> Caesar.
He <v opr="pres">could do it</v>.
Aspects are interpreted with the following unary operator identifiers:
under construction
7. Others
stl politeness
under construction