Terminological variation, a means of identifying research topics from texts

Fidelia IBEKWE-SANJUAN
CRISTAL-GRESEC, Stendhal University, Grenoble, France and Dept. of Information & Communication, IUT du Havre - B.P. 4006 - 76610 Le Havre, France
E-mail: fidelia@iut.univ-lehavre.fr

Abstract

After extracting terms from a corpus of titles and abstracts in English, syntactic variation relations are identified amongst them in order to detect research topics. Three types of syntactic variations were studied: permutation, expansion and substitution. These syntactic variations yield other relations of a formal and conceptual nature. Based on a distinction between the variation relations according to the grammatical function affected in a term (head or modifier), term variants are first clustered into connected components, which are in turn clustered into classes. These classes relate two or more components through variations involving a change of head word, thus of topic. The graph obtained reveals the global organisation of research topics in the corpus. A clustering method has been built to compute such classes of research topics.

Introduction

The importance of terms in various natural language tasks such as automatic indexing, computer-aided translation, information retrieval and technology watch need no longer be proved. Terms are meaningful textual units used for naming concepts or objects in a given field. Past studies have focused on building term extraction tools: TERMINO (David S. & Plante P. 1991), LEXTER (Bourigault D. 1994), ACABIT (Daille 1994), FASTR (Jacquemin 1995), TERMS (Katz S.M. & Justeson T.S. 1995). Here, term extraction and the identification of syntactic variation relations are considered for topic detection. Variations are changes affecting the structure and the form of a term, producing another textual unit close to the initial one, e.g. dna amplification and amplification fingerprinting of dna.
Variations can point to terminological evolution and thus to that of the underlying concept. Topic is used in its grammatical sense, i.e. the head word in a noun phrase. In the above term, fingerprinting is the topic (head word) and dna amplification its properties (modifiers). However, a topic cannot appear by chance in specialised literature, so this grammatical definition needs to be backed up by empirical evidence such as the recurrence of terms sharing the same head word.

We constituted a test corpus of scientific abstracts and titles in English from the field of plant biotechnology, making up about 29000 words. These texts covered publications made over 13 years (1981-1993).

We focused on three syntactic variation types occurring frequently amongst terms: permutation, substitution and expansion (§2). Tzoukermann E., Klavans J. and Jacquemin C. (1997) extracted morpho-syntactic term variants for NLP tasks such as automatic indexing. They accounted for a wide spectrum of variation-producing phenomena, like the morpho-syntactic variation involving derivation in tree cutting and trees have been cut down (examples taken from Tzoukermann et al. 1997). We focused for the moment on terms appearing as noun phrases (NP). Although term variants can appear as verb phrases (VP), we believe that NP variants reflect more terminological stability, and thus a real shift in topic (root hair → root hair deformation), than their VP counterparts (root hair → the root hair appears deformed). Also, our application, research topic identification, being quite sensitive, requires a careful selection of term variant types depending on their interpretability. This is to avoid creating relations between terms which could mislead the end-user, typically a technological watcher, in his task. For instance, how do we interpret the relation between concept class and class concept?
Also, our aim is not to extract syntactic variants per se but to identify them in order to establish meaningful relations between them.

1 Extracting terms from texts

1.1 Morpho-syntactic features

Term extraction is based on the morpho-syntactic features of terms. The morphological composition of NP terms allows for a limited number of categories, mostly nouns, adjectives and some prepositions. Terms can appear under two syntactic structures: compound (the specific alfalfa nodulation) or syntagmatic (the specific nodulation of alfalfa). Since terms are used for naming concepts and objects in a given knowledge field, they tend to be relatively short textual units, usually of 2 to 4 words, though longer terms occur (endogenous duck hepatitis B virus). In this study, we fixed a word limit of 7, not counting determiners and prepositions.

Based on these three features (morphological make-up, syntactic structure and length), clauses are processed in order to extract complex terms rather than atomic ones. The motivation behind this approach is that complex terms reveal the association of concepts, hence they are more relevant for the application we are considering. A fine-grained term extraction strategy would isolate the concepts and thus lose the information given by their associations in the corpus. For this reason, we could not consider the use of an existing term extraction tool and thus had to carry out a manual simulation of the term extraction phase. NP splitting rules take into account the lexical nature of the constituent words and their raising properties (i.e. derived nouns as opposed to non-derived ones). Furthermore, following the empirical approach successfully implemented by Bourigault (1994), we split complex NPs only after a search has been performed in the corpus for occurrences of their sub-segments in unambiguous situations, i.e. when the sub-segments are not included in a larger segment.
This favours the extraction of pre-conceived textual units possibly corresponding to domain terms. However, morpho-syntactic features alone cannot verify the terminological status of the units extracted, since they can also select non-terms (see Smadja 1993). For instance, root nodulation is a term in the plant biotechnology field whereas book review, also found in the corpus, is not. Thus in the first stage, the terms extracted are only plausible candidates which need to be filtered in order to eliminate the most unlikely ones. This filtering takes advantage of lexical information accessible at our level of analysis to fine-tune the statistical occurrence criterion which, used alone, inevitably leads to a massive elimination.

1.2 Splitting complex noun phrases

An NP is deemed complex if its morpho-syntactic features do not conform to those specified for terms, e.g. oxygen control of nitrogen fixation gene expression in bradyrhizobium japonicum, a title found in our corpus. Its corresponding syntactic context is:

NP1_of_NP2_prep1_NP3

where NP is a recognised noun phrase and prep1 refers to the class of prepositions not containing of and often found in the morphological composition of terms (for, by, in, from, with). Normally, exploiting syntactic information on the raising properties of the head noun (control) and following the distributional approach, the above segment will be split thus:

NP1 NP2 → NP3

But this splitting is only performed if no sub-segment of the initial one occurred alone in the corpus. This search yielded nitrogen fixation gene expression and bradyrhizobium japonicum, which both occurred more than 6 times in the corpus. Their existence confirms the relevance of our splitting rule, which would have yielded the same result: oxygen control; nitrogen fixation gene expression; bradyrhizobium japonicum.

Altogether, 4463 candidate terms were extracted from our corpus and subjected to a filtering process which combined lexical and statistical criteria.
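The splitting step of section 1.2 can be illustrated with a minimal Python sketch. It is not the procedure used in the study, which was a manual simulation: it simply cuts a complex NP at the prepositions listed in the text and then checks which segments are attested on their own in the corpus. The function names `split_at_prepositions` and `attested_segments`, the `standalone_counts` dictionary and its counts are all hypothetical, and the raising properties of the head noun are ignored.

```python
# Hypothetical sketch of the splitting step in section 1.2.
# Prepositions named in the text: "of" plus the class (for, by, in, from, with).
PREPOSITIONS = {"of", "for", "by", "in", "from", "with"}


def split_at_prepositions(noun_phrase):
    """Cut a complex NP into the segments found between prepositions."""
    segments, current = [], []
    for word in noun_phrase.lower().split():
        if word in PREPOSITIONS:
            if current:
                segments.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        segments.append(" ".join(current))
    return segments


def attested_segments(segments, standalone_counts, min_count=1):
    """Keep the segments that also occur on their own in the corpus."""
    return [s for s in segments if standalone_counts.get(s, 0) >= min_count]


title = ("oxygen control of nitrogen fixation gene expression "
         "in bradyrhizobium japonicum")
# Illustrative counts only: the paper says "more than 6 times", not exact figures.
standalone_counts = {
    "nitrogen fixation gene expression": 7,
    "bradyrhizobium japonicum": 7,
}
segments = split_at_prepositions(title)
confirmed = attested_segments(segments, standalone_counts)
print(segments)
print(confirmed)
```

On the title quoted above, the sketch produces the same three segments as the paper, and the corpus check confirms two of them.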
The lexical criterion consisted in eliminating terms that contained a determiner other than the that remained after the splitting phase. Only this determiner can occur in a term as it has the capacity, out of context, to refer to a concept or object in a knowledge field, e.g. the use of the variant the low-line instead of the full term low fertility droughtmaster line (which apparently refers to a breed, or line, of cattle). The statistical criterion consisted in eliminating terms starting with the and appearing only once. These two criteria enabled us to eliminate 30% (1304) of the candidates and to retain 70% (3159), which we consider to be likely terminological units. We are aware that this filtering procedure remains approximate and cannot eliminate bad candidates like book review, whose morphological and lexical make-up correspond to those of terms. But we also observe that such bad candidates are naturally filtered out in later stages, as they rarely possess variants and thus will not appear as research topics (see §4).

2 Identifying syntactic variants

Given the two syntactic structures under which a term can appear, compound or syntagmatic, we first pre-processed the terms by transforming those in a syntagmatic structure into their compound version. This transformation is based on the following noun phrase formation rule for English:

D A M1 h p m M2 → D A m M2 M1 h

where D, A and M are respectively strings of determiners, adjectives and words whose place can be empty, h is a head noun, m is a word and p is a preposition. Thus, the compound version of the specific nodulation of alfalfa will give the specific alfalfa nodulation. This transformation does not modify the original structure under which a term occurred in the corpus. It only serves to furnish input data to the syntactic variation identification programs. This transformation, which is equivalent to permutation (§2.1), is the linguistic relation which, once accounted for, reveals the formal nature of the other types of syntactic variations. Also, it enables us to detect variants in the two syntactic structures, thus accounting for syntactic variants such as defined in Tzoukermann et al. (1997). In what follows, t1 and t2 are terms.

2.1 Permutation (Perm)

It marks the transformation of a term from a syntagmatic structure to a compound one:

t1 = A N M1 h p m M2
t2 = A m M2 N M1 h

where t1 is really found in the corpus and N is a string of words that is either empty or a noun. 37 terms were concerned by this relation. Some examples are given in Table 1.

2.2 Substitution (Sub)

It marks the replacing of a component word in t1 by another word in t2, in terms of equal length. Only one word can be replaced, and at the same position, to ensure the interpretability of the relation. We distinguished between modifier and head substitution.

• Modifier substitution (M-Sub): t2 is a substitution of t1 if and only if: t1 = M1 m M2 h and t2 = M1 m' M2 h with m' ≠ m
• Head substitution (H-Sub): t2 is a substitution of t1 if and only if: t1 = M m h and t2 = M m h' with h' ≠ h

Tzoukermann et al. (1997) considered chemical treatment against disease and disease treatment as substitution variants whereas, in our study, after transformation, they would be a case of left-expansion (L-Exp). Examples of head and modifier substitutions are given in Table 2. 1543 terms shared substitution relations: 1084 in the modifier substitution and 872 in the head substitution. The same term can occur in both categories.

2.3 Expansion (Exp)

Expansion is the generic name designating three elementary operations of word adjunction in an existing term. Word adjunction can occur in three positions: left, right or within. Thus we have left expansion, right expansion and insertion respectively.
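As an illustration of the compound transformation and of the substitution relations defined in sections 2 and 2.2, here is a minimal Python sketch. It is an assumption-laden illustration, not the author's programs: the input is assumed to be already tagged with a hypothetical tagset (`DET`, `ADJ`, `NOUN`, `PREP`), only terms with a single preposition are transformed, and the names `to_compound` and `substitution_type` are invented for the example.

```python
# Hypothetical tags; the paper does not specify a tagset.
PREFIX_TAGS = {"DET", "ADJ"}


def to_compound(tagged_term):
    """Apply D A M1 h p m M2 -> D A m M2 M1 h to a list of (word, tag) pairs.

    Terms without exactly one preposition are returned unchanged.
    """
    words = [word for word, _ in tagged_term]
    prep_positions = [i for i, (_, tag) in enumerate(tagged_term) if tag == "PREP"]
    if len(prep_positions) != 1:
        return " ".join(words)
    p = prep_positions[0]
    left, right = tagged_term[:p], tagged_term[p + 1:]
    # The leading determiners and adjectives (D A) stay in front;
    # the head noun is never counted as part of that prefix.
    k = 0
    while k < len(left) - 1 and left[k][1] in PREFIX_TAGS:
        k += 1
    reordered = left[:k] + right + left[k:]
    return " ".join(word for word, _ in reordered)


def substitution_type(t1, t2):
    """Return 'H-Sub', 'M-Sub' or None for two terms in compound form."""
    w1, w2 = t1.split(), t2.split()
    if len(w1) != len(w2) or len(w1) < 2:
        return None
    differing = [i for i, (a, b) in enumerate(zip(w1, w2)) if a != b]
    if len(differing) != 1:
        return None  # exactly one word may change, at the same position
    return "H-Sub" if differing[0] == len(w1) - 1 else "M-Sub"


term = [("the", "DET"), ("specific", "ADJ"), ("nodulation", "NOUN"),
        ("of", "PREP"), ("alfalfa", "NOUN")]
print(to_compound(term))
print(substitution_type("nodule development regulation", "nodule development arrest"))
print(substitution_type("alfalfa root hair", "curled root hair"))
```

Because the substitution test compares positions word by word, it gives the same answer whichever of the two terms is passed first.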
• Left expansion (L-Exp): t2 is a left-expansion of t1 if and only if: t1 = M h and t2 = M' m' M h
• Right expansion (R-Exp): t2 is a right-expansion of t1 if and only if: t1 = M h and t2 = M h M' h'
• Insertion (Ins): t2 is an insertion of t1 if and only if: t1 = M1 m M2 h and t2 = M1 m m' M' M2 h

Examples of each sub-type of expansion are given in Table 3. Some terms combine the two types of expansion, left and right (noted LR-Exp), for example root of bragg → root exudate of soyabean cultivar bragg. These complex expansion variants were also identified. A total of 1014 terms were involved in the expansion variation relations.

Altogether, 82% (2593 out of 3159) of the terms were involved in the three types of syntactic variations studied, showing the importance of the phenomena amongst terms.

Syntagmatic structure                       Compound structure
accession of azolla-anabaena                azolla-anabaena accession
avirulent strain of pseudomonas syringae    avirulent pseudomonas syringae strain
curling of root hair                        root hair curling / root-hair curling
excision of nodule                          nodule excision
the specific nodulation of alfalfa          the specific alfalfa nodulation

Table 1. Examples of permutation variants identified in the corpus.

Modifier substitution variants      Head substitution variants
alfalfa root hair                   nodule development regulation
curled root hair                    nodule development arrest
lucerne root hair                   nodule development consequence

characteristic dna fingerprinting   infection thread development
conventional dna fingerprinting     infection thread formation
complex dna fingerprinting          infection thread initiation

enzymatic amplification of dna      nodulation of soybean mutant
amplification of genomic dna        isolation of soybean mutant
                                    property of soybean mutant

Table 2. Some head and modifier substitution variants identified in the corpus.

Left expansion
self-licking → refractory self-licking
               stereotypic self-licking
nitrogenase activity → nitrogenase activity of cv. bragg
                       nitrogenase activity of nitrate
                       nitrogenase activity of nts382
                       nitrogenase activity of soyabean

Right expansion
blue light → blue light-induced expression
             blue light induction
             blue light induction experiment
immigrant of eastern countries → immigrant children of eastern countries (this example is fictitious)

Insertion
conserved domain → conserved central domain
                   conserved protein domain
fast staining of dna → fast silver staining of dna

Table 3. Examples of expansion variants identified in the corpus.

The programs identifying syntactic variants were written in the Awk language and implemented on a Sun Sparc workstation.

Syntactic variations possess formal properties such as symmetry and antisymmetry. Permutation and substitution engender a symmetrical relation between terms, e.g. genomic dna ↔ template dna. Expansion engenders an antisymmetrical or order relation between terms, for instance nitrogen fixation
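The expansion relations of section 2.3 can be sketched in Python in the same spirit. This is only an illustration under stated assumptions: both terms are taken to be already in compound form (so nitrogenase activity of cv. bragg is written cv. bragg nitrogenase activity, applying the transformation rule of section 2), the function name `expansion_type` is hypothetical, and when several readings are possible the first matching test wins.

```python
def expansion_type(t1, t2):
    """Return 'L-Exp', 'R-Exp', 'LR-Exp', 'Ins' or None.

    t1 is the shorter term, t2 the candidate expansion, both in compound form.
    """
    w1, w2 = t1.split(), t2.split()
    n1, n2 = len(w1), len(w2)
    if n1 == 0 or n2 <= n1:
        return None  # an expansion must add at least one word
    if w2[-n1:] == w1:
        return "L-Exp"   # words added to the left: t2 = M' m' M h
    if w2[:n1] == w1:
        return "R-Exp"   # words added to the right: t2 = M h M' h'
    # t1 found intact inside t2 with words on both sides
    for start in range(1, n2 - n1):
        if w2[start:start + n1] == w1:
            return "LR-Exp"
    # Words inserted inside t1: t2 = M1 m m' M' M2 h
    for i in range(1, n1):
        if w2[:i] == w1[:i] and w2[n2 - (n1 - i):] == w1[i:]:
            return "Ins"
    return None


print(expansion_type("nitrogenase activity", "cv. bragg nitrogenase activity"))
print(expansion_type("blue light", "blue light induction"))
print(expansion_type("conserved domain", "conserved central domain"))
print(expansion_type("bragg root", "soyabean cultivar bragg root exudate"))
print(expansion_type("blue light induction", "blue light"))
```

The last call returns `None` because the test only succeeds from the shorter term to the longer one, which is one way of seeing why expansion is an order relation rather than a symmetrical one.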