Scientific report: "Syntagmatic and Paradigmatic Representations of Term Variation"


Content

Syntagmatic and Paradigmatic Representations of Term Variation

Christian Jacquemin
LIMSI-CNRS
BP 133
91403 ORSAY Cedex
FRANCE
jacquemin@limsi.fr

Abstract

A two-tier model for the description of morphological, syntactic and semantic variations of multi-word terms is presented. It is applied to term normalization of French and English corpora in the medical and agricultural domains. Five different sources of morphological and semantic knowledge are exploited (MULTEXT, CELEX, AGROVOC, WordNet 1.6, and Microsoft Word97 thesaurus).

1 Introduction

In the classical approach to text retrieval, terms are assigned to queries and documents. The terms are generated by a process called automatic indexing. Then, given a query, the similarity between the query and the documents is computed and a ranked list of documents is produced as output of the system for information access (Salton and McGill, 1983).

The similarity between queries and documents depends on the terms they have in common. The same concept can be formulated in many different ways, known as variants, which should be conflated in order to avoid missing relevant documents. For this purpose, this paper proposes a novel model of term variation that integrates linguistic knowledge and performs accurate term normalization. It relies on previous or ongoing linguistic studies on this topic (Sparck Jones and Tait, 1984; Jacquemin et al., 1997; Hamon et al., 1998).

Terms are described in a two-tier framework composed of a paradigmatic level and a syntagmatic level that account for the three linguistic dimensions of term variability (morphology, syntax, and semantics). Term variants are extracted from tagged corpora through FASTR¹, a unification-based transformational parser described in (Jacquemin et al., 1997).

Four experiments are performed on the French and the English languages and a measure of precision is provided for each of them.
Two experiments are made on a French corpus [AGRIC] composed of 1.2 × 10^6 words of scientific abstracts in the agricultural domain and two on an English corpus [MEDIC] composed of 1.3 × 10^6 words of scientific abstracts in the medical domain. The two experiments in the French language are [AGRIC] + Word97 and [AGRIC] + AGROVOC. In the former, synonymy links are extracted from the Microsoft Word97 thesaurus; in the latter, semantic classes are extracted from the AGROVOC thesaurus, a thesaurus specialized in the agricultural domain (AGROVOC, 1995). In both experiments, morphological data are produced by a stemming algorithm applied to the MULTEXT lexical database (MULTEXT, 1998). The two experiments on the English language are [MEDIC] + WordNet 1.6 and [MEDIC] + Word97; they correspond to two different sources of semantic knowledge. In both cases, the morphological data are extracted from CELEX (CELEX, 1998).

¹ FASTR can be downloaded from www.limsi.fr/Individu/jacquemi/FASTR.

2 Term Variation: Representation and Exploitation

Terms and variations are represented in two parallel frameworks illustrated by Figure 1. A term is described by a unique pair composed of a structure (at the syntagmatic level) and a set of lexical items (at the paradigmatic level). A variation is represented by a pair of such pairs: one of them is the source term (or normalized term) and the other one is the target term (or variant).

The syntagmatic description of a term is a context-free rule; it is complemented with lexical information embedded in a feature structure denoted by constraints between paths and values. For instance, the term speed measurement is represented by:

\[
\left\{
\begin{array}{ll}
\text{Syntagm:} & \{ N_0 \rightarrow N_2\ N_1 \} \\
\text{Paradigm:} & \{ \langle N_1\ \mathit{lemma} \rangle = \textit{measurement},\ \langle N_2\ \mathit{lemma} \rangle = \textit{speed} \}
\end{array}
\right\} \tag{1}
\]

This term is a noun phrase composed of a head noun N1 and a modifier N2; the lemmas are given by the constraints at the paradigmatic level.
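The two-level description of Formula (1) can be sketched as a small data structure. This is only an illustrative sketch: the class name `Term`, its fields `rule` and `constraints`, and the helper methods are hypothetical and are not part of FASTR.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """Hypothetical container for a term: a context-free rule plus path/value constraints."""
    rule: tuple          # syntagmatic level: (left-hand side, right-hand side symbols)
    constraints: dict    # paradigmatic level: (symbol, feature) -> value

    def lemma(self, symbol):
        # Look up the <symbol lemma> constraint.
        return self.constraints[(symbol, "lemma")]

    def surface(self):
        # Read the lemmas in the order given by the right-hand side of the rule.
        _, rhs = self.rule
        return " ".join(self.lemma(symbol) for symbol in rhs)


# Formula (1): N0 -> N2 N1, with <N1 lemma> = measurement and <N2 lemma> = speed.
speed_measurement = Term(
    rule=("N0", ("N2", "N1")),
    constraints={("N1", "lemma"): "measurement", ("N2", "lemma"): "speed"},
)
```

Keeping the rule and the lexical constraints separate mirrors the split between the syntagmatic and the paradigmatic level: the same rule can be shared by every Noun-Noun term, while only the constraints change.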
This framework is similar to the unification-based representation of context-free grammars of (Shieber, 1992).

[Figure 1: Two level description of terms and variations. A normalized term (e.g. speed measurement) and a variant are linked by a transformation at the syntagmatic level and by morphological and semantic links at the paradigmatic level.]

At the syntagmatic level, variations are represented by a source and a target structure. At the paradigmatic level, the lexical elements of variations are not instantiated in order to ensure higher generality. Instead, links between lexical elements are provided. They denote morphological and/or semantic relations between lexical items in the source and target structures of the variation. For example, the variation that associates a Noun-Noun term such as the preceding term speed_N2 measurement_N1 with a verbal form of the head word and a synonym of the argument such as measuring_V1 maximal_A shortening_N velocity_N2' is given by:

\[
\left\{
\begin{array}{ll}
\text{Syntagm:} & (N_0 \rightarrow N_2\ N_1) \Rightarrow (V_0 \rightarrow V_1\ (\mathit{Prep}?\ \mathit{Det}?\ (A|N|\mathit{Part})^{*})\ N'_2) \\
\text{Paradigm:} & \langle N_1\ \mathit{root} \rangle = \langle V_1\ \mathit{root} \rangle,\ \langle N_2\ \mathit{sem} \rangle = \langle N'_2\ \mathit{sem} \rangle
\end{array}
\right\} \tag{2}
\]

If this variation is instantiated with the term given in (1), it recognizes the lexico-syntactic structure

\[
V_1\ (\mathit{Prep}?\ \mathit{Det}?\ (A|N|\mathit{Part})^{*})\ N'_2 \tag{3}
\]

in which V1 and measurement are morphologically related, and N2' and speed are semantically related. The target structure is under-specified in order to describe several possible instantiations with a single expression and is therefore called a candidate variation. In this example, a regular expression is used to under-specify the structure²; another solution would be to use quasi-trees with extended dependencies (Vijay-Shanker, 1992).

² A stands for adjective, N for noun, Prep for preposition, V for verb, Det for determiner, Part for participle, and Adv for adverb.

3 Paradigmatic relations

[Figure 2: Paradigmatic links between lemmas (morphological family and semantic family).]

As illustrated by Figure 2 and Formula (2), there are two types of paradigmatic relations between lemmas involved in the definition of term variations: morphological and semantic relations. The morphological family of a lemma $l$ is denoted by the set $F_M(l)$ and its semantic family by the set $F_{SL}(l)$ or $F_{SC}(l)$.

Roughly speaking, two words are morphologically related if and only if they share the same root. In the preceding example, to measure and measurement are in the same morphological family because their common root is to measure. Let $\mathcal{L}$ be the set of lemmas; morphological roots define a binary relation $M$ from $\mathcal{L}$ to $\mathcal{L}$ that associates each lemma with its root(s): $M \in \mathcal{L} \leftrightarrow \mathcal{L}$. $M$ is not a function because compound lemmas have more than one root. The morphological family $F_M(l)$ of a lemma $l$ is the set of lemmas (including $l$) which share a common root with $l$:

\[
\begin{array}{l}
F_M \in \mathcal{L} \rightarrow \mathbb{P}(\mathcal{L}) \\
\forall l \in \mathcal{L},\ F_M(l) = \{ l' \in \mathcal{L} \mid \exists r \in \mathcal{L},\ (l, r) \in M \wedge (l', r) \in M \} = M^{-1}(M(\{l\}))
\end{array} \tag{4}
\]

($\mathbb{P}(\mathcal{L})$ is the power-set of $\mathcal{L}$, the set of its subsets.)

There are principally two types of semantic relations: direct links through a binary relation $S_L \in \mathcal{L} \leftrightarrow \mathcal{L}$ or classes $C \in \mathbb{P}(\mathbb{P}(\mathcal{L}))$.
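Formula (4) can be illustrated with a short sketch that computes $M^{-1}(M(\{l\}))$ over a relation stored as (lemma, root) pairs. The relation `M` below is a hypothetical toy excerpt, not real CELEX data, and the function name `morphological_family` is an assumption. Adding the lemma itself when it has no entry in `M` is also an assumption made so that the result always includes $l$.

```python
from collections import defaultdict

# Hypothetical toy excerpt of the relation M as (lemma, root) pairs.
M = {
    ("measurement", "measure"),
    ("measurable", "measure"),
    ("measure", "measure"),
    ("tape-measure", "tape"),      # a compound lemma has more than one root
    ("tape-measure", "measure"),
    ("tape", "tape"),
    ("speed", "speed"),
}


def morphological_family(lemma, relation):
    """Return the lemmas sharing at least one root with `lemma` (Formula (4))."""
    roots_of = defaultdict(set)    # M: lemma -> its roots
    lemmas_of = defaultdict(set)   # inverse of M: root -> lemmas built on it
    for l, r in relation:
        roots_of[l].add(r)
        lemmas_of[r].add(l)
    family = {lemma}               # assumption: keep the lemma even if M has no entry for it
    for root in roots_of[lemma]:
        family |= lemmas_of[root]
    return family
```

With this toy relation, the family of measurement contains measure, measurable and tape-measure, while the family of tape-measure additionally contains tape because the compound has two roots.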
In the case of semantic links, the semantic family $F_{SL}(l)$ of a lemma $l$ is the set of lemmas (including $l$) which are linked to $l$:

\[
\begin{array}{l}
F_{SL} \in \mathcal{L} \rightarrow \mathbb{P}(\mathcal{L}) \\
\forall l \in \mathcal{L},\ F_{SL}(l) = \{ l' \in \mathcal{L} \mid (l, l') \in S_L \} \cup \{l\} = S_L(\{l\}) \cup \{l\}
\end{array} \tag{5}
\]

In the case of semantic classes, the semantic family $F_{SC}(l)$ of a lemma $l$ is the union of all the classes to which it belongs:

\[
\forall l \in \mathcal{L},\ F_{SC}(l) = \bigcup_{(c \in C) \wedge (l \in c)} c \;\cup\; \{l\} \tag{6}
\]

Links and classes are equivalent; the choice of either model depends on the type of available semantic data. In the experiments reported here, direct links are used to represent data extracted from the word processor Microsoft Word97 because they are provided as lists of synonyms associated with each lemma. Conversely, the synsets extracted from WordNet 1.6 (Fellbaum, 1998) are classes of disambiguated lemmas and, therefore, correspond to the second technique.

With respect to the definitions of semantic and morphological families given in this section, the candidate variant (3) is such that $V_1 \in F_M(\textit{measurement})$ and $N'_2 \in F_{SL}(\textit{speed})$ or $N'_2 \in F_{SC}(\textit{speed})$.

4 Morphological and Semantic Families

In the experiments on the English corpora, the CELEX database is used to calculate morphological families. As for semantic families, either WordNet 1.6 or the thesaurus of Microsoft Word97 is used.

Morphological Links from CELEX

In the CELEX morphological database (CELEX, 1998), each lemma is associated with a morphological structure that contains one or more root lemmas. These roots are used to calculate morphological families according to Formula (4). For example, the morphological family $F_M(\textit{measurement}_N)$ of the lemmas with measure_V as root word is {commensurable_A, commensurably_Adv, countermeasure_N, immeasurable_A, immeasurably_Adv, incommensurable_A, measurable_A, measurably_Adv, measure_N, measureless_A, measurement_N, mensurable_A, tape-measure_N, yard-measure_N, measure_V}.
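The two models of semantic families in Formulas (5) and (6) can be sketched side by side. The data below (`SL` as a set of directed pairs, `C` as a list of classes) are hypothetical toy excerpts, not actual Word97 or WordNet content, and the function names are assumptions.

```python
# Hypothetical toy data: direct links (source, target) and classes of lemmas.
SL = {("speed", "velocity"), ("speed", "swiftness"), ("haste", "hurry")}
C = [
    frozenset({"speed", "velocity"}),
    frozenset({"speed", "amphetamine"}),
    frozenset({"haste", "hurry"}),
]


def family_from_links(lemma, links):
    """Formula (5): lemmas directly linked to `lemma`, plus `lemma` itself."""
    return {target for source, target in links if source == lemma} | {lemma}


def family_from_classes(lemma, classes):
    """Formula (6): union of every class that contains `lemma`, plus `lemma` itself."""
    family = {lemma}
    for cls in classes:
        if lemma in cls:
            family |= cls
    return family
```

Because the lemma is not disambiguated, the class-based family merges classes that correspond to different senses, which is how an item such as amphetamine can end up next to velocity in the same family.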
Semantic Classes from WordNet

Two sources of semantic knowledge are used for the English language: the WordNet 1.6 thesaurus and the thesaurus of the word processor Microsoft Word97. In the WordNet thesaurus, disambiguated words are grouped into sets of synonyms, called synsets, that can be used for a class-based approach to semantic relations. For example, each of the five disambiguated meanings of the polysemous noun speed belongs to a different synset. In our approach, words are not disambiguated and, therefore, the semantic family of speed is calculated as the union of the synsets in which one of its senses is included. Through Formula (6), the semantic family of speed based on WordNet is: $F_{SC}(\textit{speed}_N)$ = {speed_N, speeding_N, hurrying_N, hastening_N, swiftness_N, fastness_N, velocity_N, amphetamine_N}.

Semantic Links from Microsoft Word97

To assist document editing, the word processor Microsoft Word97 has a command that returns the synonyms of a selected word. We have used this facility to build lists of synonyms. For example, $F_{SL}(\textit{speed}_N)$ = {speed_N, swiftness_N, velocity_N, quickness_N, rapidity_N, acceleration_N, alacrity_N, celerity_N} (Formula (5)). Eight other synonyms of the word speed are provided by Word97, but they are not included in this semantic family because they are not categorized as nouns in CELEX.

5 Variations

The linguistic transformations for the English language presented in this section are somewhat simplified for the sake of conciseness. First, we focus on binary terms, which represent 91.3% of the occurrences of multi-word terms in the English corpus [MEDIC]. Then, simplifications in the combinations of types of variations are motivated by corpus explorations in order to focus on the most productive families of variations.

The 3 Dimensions of Linguistic Variations

There are as many types of morphological relations as pairs of syntactic categories of content words.
Since the syntactic categories of content words are noun (N), verb (V), adjective (A), and adverb (Adv), there are potentially sixteen different pairs of morphological links. (Associations of identical categories must be taken into consideration. For example, Noun-Noun associations correspond to morphological links between substantive nouns such as agent/process: promoter/promotion.) Morphological relations are further divided into simple relations if they associate two words in the same position and crossed relations if they associate a head word and an argument. Combining categories and positions, there are, in all, 64 different types of morphological relations.

In (Hamon et al., 1998), three types of semantic relations are studied: a link between the two head words, a link between the two arguments, or two parallel links between heads and arguments. These authors report that double links are rare and that their quality is low: they represent only 5% of the semantic variations on a French corpus and they are extracted with a precision of only 9%. We will therefore focus on single semantic links. Since we are only concerned with synonyms, only two types of semantic links are studied: synonymous heads or synonymous arguments.

The last dimension of term variability is the structural transformation at the syntagmatic level. The source structure of the variation must match a term structure. There are basically two structures of binary terms: X1 N2 compounds in which X1 is a noun, an adjective or a participle, and N1 Prep N2 terms. According to (Jacquemin et al., 1997), there are three types of syntactic variations in French: coordinations (Coor), insertions of modifiers (Modif), and compounding/decompounding (Comp). Each of these syntactic variations is further subdivided into finer categories.

[...] the morphological link: a source and a target syntactic category and the syntactic positions of the source and target lemmas.
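The counts of sixteen category pairs and 64 relation types given above can be checked by enumeration. The sketch below rests on one reading of the text, stated here as an assumption: each of the source and target lemmas is either the head or the argument, which gives two simple and two crossed position pairs per category pair.

```python
from itertools import product

CATEGORIES = ("N", "V", "A", "Adv")    # content-word categories named in the text
POSITIONS = ("head", "argument")       # assumed positions inside a binary term

# Ordered (source category, target category) pairs, identical categories included.
category_pairs = list(product(CATEGORIES, repeat=2))

# Ordered (source position, target position) pairs.
position_pairs = list(product(POSITIONS, repeat=2))

relation_types = [
    {
        "source": (src_cat, src_pos),
        "target": (tgt_cat, tgt_pos),
        # Same position on both sides is a simple relation, otherwise a crossed one.
        "kind": "simple" if src_pos == tgt_pos else "crossed",
    }
    for (src_cat, tgt_cat), (src_pos, tgt_pos) in product(category_pairs, position_pairs)
]

print(len(category_pairs), len(relation_types))  # prints: 16 64
```

Under this reading, half of the 64 types are simple relations and half are crossed relations.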
The Sem column indicates whether the variation involves a semantic link and the position of the lemmas concerned by the link (both lemmas must have an identical position). The Pattern column gives the target syntactic structure as a function of the source structure, which is either X1 N2, A1 N2, or N1 N2. For example, Variation #42 transforms an Adjective-Noun term A1 N2 into $N_1\ ((\mathit{CC}\ \mathit{Det}?)?\ \mathit{Prep}\ \mathit{Det}?\ (A|N|\mathit{Part})^{0-3})\ N'_2$, where N1 is a noun in the morphological family of A1 (noted $F_M(A_1)_N$) and N2' is semantically related with N2 (noted $F_S(N_2)$). This variation recognizes malignancy in orbital tumours as a variant of malignant tumor because malignancy and malignant are morphologically related, tumour and tumor are semantically related, and malignancy_N in_Prep orbital_A tumours_N matches the target pattern. Variation #56 is a more elaborate version of variation (2) given in Section 2.

Multi-dimensional Linguistic Variations

Sample Syntactico-semantic Variants from [MEDIC]

The first 36 variations in Table 1 do not contain any morphological link. They are built as follows. Firstly, the different structures of noun phrases are used as target structures. Twelve structures are proposed: head coordination (#1), argument coordination (#4), enumeration with conjunction (#7), enumeration without conjunction (#10), etc. Then each transformation is enriched with additional semantic links between the head words or between the argument words. Semantic links between argument words are found in variations #(3n + 2)…
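The way Variation #42 combines a syntactic pattern with paradigmatic constraints can be sketched with a regular expression over part-of-speech tags. This is a simplified illustration and not the FASTR implementation: the function name, the token format (lemma, tag), the toy families, and the reading of the repetition bound as "0 to 3" are all assumptions.

```python
import re

# Target pattern N1 ((CC Det?)? Prep Det? (A|N|Part){0,3}) N2' over space-terminated tags.
TARGET_42 = re.compile(r"N (?:CC (?:Det )?)?Prep (?:Det )?(?:(?:A|N|Part) ){0,3}N ")

# Hypothetical toy families; real ones would come from CELEX and a thesaurus.
MORPH = {"malignant": {"malignant", "malignancy"}}
SEM = {"tumor": {"tumor", "tumour"}}


def is_variant_42(tokens, source_adj, source_noun):
    """Check a lemmatized, tagged token sequence against the sketch of Variation #42."""
    tags = "".join(tag + " " for _, tag in tokens)
    if not TARGET_42.fullmatch(tags):
        return False
    first_lemma, last_lemma = tokens[0][0], tokens[-1][0]
    # Paradigmatic constraints: N1 in F_M(A1) and N2' in F_S(N2).
    return (first_lemma in MORPH.get(source_adj, {source_adj})
            and last_lemma in SEM.get(source_noun, {source_noun}))


candidate = [("malignancy", "N"), ("in", "Prep"), ("orbital", "A"), ("tumour", "N")]
```

Calling `is_variant_42(candidate, "malignant", "tumor")` accepts the candidate, whereas a sequence without a preposition, or one whose first noun lies outside the morphological family of the adjective, is rejected.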