Scientific paper: "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales"


Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

Bo Pang and Lillian Lee
Department of Computer Science, Cornell University

Abstract

We address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). This task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". We first evaluate human performance at the task. Then, we apply a meta-algorithm, based on a metric-labeling formulation of the problem, that alters a given k-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. We show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of SVMs when we employ a novel similarity measure appropriate to the problem.

1 Introduction

There has recently been a dramatic surge of interest in sentiment analysis, as more and more people become aware of the scientific challenges posed and the scope of new applications enabled by the processing of subjective language. (The papers collected by Qu, Shanahan, and Wiebe (2004) form a representative sample of research in the area.) Most prior work on the specific problem of categorizing expressly opinionated text has focused on the binary distinction of positive vs. negative (Turney, 2002; Pang, Lee, and Vaithyanathan, 2002; Dave, Lawrence, and Pennock, 2003; Yu and Hatzivassiloglou, 2003). But it is often helpful to have more information than this binary distinction provides, especially if one is ranking items by recommendation or comparing several reviewers' opinions: example applications include collaborative filtering and deciding which conference submissions to accept.

Therefore, in this paper we consider generalizing to finer-grained scales: rather than just determine whether a review is "thumbs up" or not, we attempt to infer the author's implied numerical rating, such as "three stars" or "four stars". Note that this differs from identifying opinion strength (Wilson, Wiebe, and Hwa, 2004): rants and raves have the same strength but represent opposite evaluations, and referee forms often allow one to indicate that one is very confident (high strength) that a conference submission is mediocre (middling rating). Also, our task differs from ranking not only because one can be given a single item to classify (as opposed to a set of items to be ordered relative to one another), but because there are settings in which classification is harder than ranking, and vice versa.

One can apply standard k-ary classifiers or regression to this rating-inference problem; independent work by Koppel and Schler (2005) considers such methods.
But an alternative approach that explicitly incorporates information about item similarities together with label similarity information (for instance, "one star" is closer to "two stars" than to "four stars") is to think of the task as one of metric labeling (Kleinberg and Tardos, 2002), where label relations are encoded via a distance metric. This observation yields a meta-algorithm, applicable to both semi-supervised (via graph-theoretic techniques) and supervised settings, that alters a given k-ary classifier's output so that similar items tend to be assigned similar labels.

In what follows, we first demonstrate that humans can discern relatively small differences in (hidden) evaluation scores, indicating that rating inference is indeed a meaningful task. We then present three types of algorithms — one-vs-all, regression, and metric labeling — that can be distinguished by how explicitly they attempt to leverage similarity between items and between labels. Next, we consider what item similarity measure to apply, proposing one based on the positive-sentence percentage. Incorporating this new measure within the metric-labeling framework is shown to often provide significant improvements over the other algorithms.

We hope that some of the insights derived here might apply to other scales for text classification that have been considered, such as clause-level opinion strength (Wilson, Wiebe, and Hwa, 2004); affect types like disgust (Subasic and Huettner, 2001; Liu, Lieberman, and Selker, 2003); reading level (Collins-Thompson and Callan, 2004); and urgency or criticality (Horvitz, Jacobs, and Hovel, 1999).

2 Problem validation and formulation

We first ran a small pilot study on human subjects in order to establish a rough idea of what a reasonable classification granularity is: if even people cannot accurately infer labels with respect to a five-star scheme with half stars, say, then we cannot expect a learning algorithm to do so. Indeed, some potential obstacles to accurate rating inference include lack of calibration (e.g., what an understated author intends as high praise may seem lukewarm), author inconsistency at assigning fine-grained ratings, and ratings not entirely supported by the text.[1]

Table 1: Human accuracy at determining relative positivity. Rating differences are given in "notches". Parentheses enclose the number of pairs attempted.

  Rating diff.          Pooled    Subject 1    Subject 2
  3 or more             100%      100% (35)    100% (15)
  2 (e.g., 1 star)       83%       77% (30)    100% (11)
  1 (e.g., 1/2 star)     69%       65% (57)     90% (10)
  0                      55%       47% (15)     80% (5)

For data, we first collected Internet movie reviews in English from four authors, removing explicit rating indicators from each document's text automatically. Now, while the obvious experiment would be to ask subjects to guess the rating that a review represents, doing so would force us to specify a fixed rating-scale granularity in advance. Instead, we examined people's ability to discern relative differences, because by varying the rating differences represented by the test instances, we can evaluate multiple granularities in a single experiment. Specifically, at intervals over a number of weeks, we, the authors (a non-native and a native speaker of English), examined pairs of reviews, attempting to determine whether the first review in each pair was (1) more positive than, (2) less positive than, or (3) as positive as the second. The texts in any particular review pair were taken from the same author to factor out the effects of cross-author divergence.
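A minimal sketch, under assumed bookkeeping, of how such pairwise judgments can be scored to produce per-difference accuracies like those in Table 1. The tuple format, the "notch" units, and the function name are illustrative assumptions, not the study's actual materials.

```python
from collections import defaultdict

def score_judgments(judgments):
    """judgments: iterable of (rating_a, rating_b, guess), where ratings are in
    notches and guess is 'first', 'second', or 'same' (which review is more
    positive). Returns accuracy per absolute rating difference."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for a, b, guess in judgments:
        diff = abs(a - b)
        truth = "same" if a == b else ("first" if a > b else "second")
        total[diff] += 1
        correct[diff] += int(guess == truth)
    return {d: correct[d] / total[d] for d in total}

# Example: a one-notch pair judged correctly, a zero-notch pair judged incorrectly.
print(score_judgments([(3, 2, "first"), (2, 2, "first")]))  # {1: 1.0, 0: 0.0}
```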
As Table 1 shows, both subjects performed perfectly when the rating separation was at least 3 "notches" in the original scale (we define a notch as a half star in a four- or five-star scheme and 10 points in a 100-point scheme). Interestingly, although human performance drops as rating difference decreases, even at a one-notch separation, both subjects handily outperformed the random-choice baseline of 33%. However, there was large variation in accuracy between subjects.[2]

Because of this variation, we defined two different classification regimes. From the evidence above, a three-class task (categories 0, 1, and 2 — essentially "negative", "middling", and "positive", respectively) seems like one that most people would do quite well at (but we should not assume 100% human accuracy: according to our one-notch results, people may misclassify borderline cases like 2.5 stars). Our study also suggests that people could do at least fairly well at distinguishing full stars in a zero- to four-star scheme. However, when we began to construct five-category datasets for each of our four authors (see below), we found that in each case, either the most negative or the most positive class (but not both) contained only about 5% of the documents. To make the classes more balanced, we folded these minority classes into the adjacent class, thus arriving at a four-class problem (categories 0-3, increasing in positivity). Note that the four-class problem seems to offer more possibilities for leveraging class relationship information than the three-class setting, since it involves more class pairs. Also, even the two-category version of the rating-inference problem for movie reviews has proven quite challenging for many automated classification techniques (Pang, Lee, and Vaithyanathan, 2002; Turney, 2002).

We applied the above two labeling schemes to a scale dataset[3] containing four corpora of movie reviews. All reviews were automatically preprocessed to remove both explicit rating indicators and objective sentences; the motivation for the latter step is that it has previously aided positive vs. negative classification (Pang and Lee, 2004). All of the 1770, 902, 1307, or 1027 documents in a given corpus were written by the same author. This decision facilitates interpretation of the results, since it factors out the effects of different choices of methods for calibrating authors' scales.[4]
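As an illustration of the class-construction step described above (folding a sparse extreme class into its neighbor to move from five categories to four), here is a small sketch; the 5% threshold and the helper name are assumptions drawn from the description, not the authors' code.

```python
from collections import Counter

def fold_sparse_extreme(labels, min_fraction=0.05):
    """Merge the most negative or most positive class into its neighbor if it
    holds fewer than min_fraction of the documents (illustrative sketch)."""
    counts = Counter(labels)
    n = len(labels)
    lo, hi = min(counts), max(counts)
    if counts[lo] / n < min_fraction:
        labels = [lo + 1 if y == lo else y for y in labels]
    elif counts[hi] / n < min_fraction:
        labels = [hi - 1 if y == hi else y for y in labels]
    # Re-index so the surviving classes are 0..k-1, increasing in positivity.
    remap = {y: i for i, y in enumerate(sorted(set(labels)))}
    return [remap[y] for y in labels]

# Example: class 4 holds only 4% of the documents, so it is folded into class 3.
folded = fold_sparse_extreme([0]*20 + [1]*30 + [2]*30 + [3]*16 + [4]*4)
print(Counter(folded))  # Counter({1: 30, 2: 30, 0: 20, 3: 20})
```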
We point out that it is possible to gather author-specific information in some practical applications: for instance, systems that use selected authors (e.g., the Rotten Tomatoes movie-review website — where, we note, not all authors provide explicit ratings) could require that someone submit rating-labeled samples of newly admitted authors' work. Moreover, our results at least partially generalize to mixed-author situations (see Section 5.2).

[1] For example, the critic Dennis Schwartz writes that "sometimes the review itself [indicates] the letter grade should have been higher or lower, as the review might fail to take into consideration my overall impression of the film — which I hope to capture in the grade" (http://www.sover.net/~ozus/cinema.htm).
[2] One contributing factor may be that the subjects viewed disjoint document sets, since we wanted to maximize experimental coverage of the types of document pairs within each difference class. We thus cannot report inter-annotator agreement, but since our goal is to recover a reviewer's "true" recommendation, reader-author agreement is more relevant. While another factor might be degree of English fluency, in an informal experiment (six subjects viewing the same three pairs), native English speakers made the only two errors.
[3] Available at http://www.cs.cornell.edu/People/pabo/movie-review-data as scale dataset v1.0.
[4] From the Rotten Tomatoes website's FAQ: "star systems are not consistent between critics. For critics like Roger Ebert and James Berardinelli, 2.5 stars or lower out of 4 stars is always negative. For other critics, 2.5 stars can either be positive or negative. Even though Eric Lurio uses a 5 star system, his grading is very relaxed. So, 2 stars can be positive." Thus, calibration may sometimes require strong familiarity with the authors involved, as anyone who has ever needed to reconcile conflicting referee reports probably knows.

3 Algorithms

Recall that the problem we are considering is multi-category classification in which the labels can be naturally mapped to a metric space (e.g., points on a line); for simplicity, we assume the distance metric d(ℓ, ℓ') = |ℓ − ℓ'| throughout. In this section, we present three approaches to this problem in order of increasingly explicit use of pairwise similarity information between items and between labels. In order to make comparisons between these methods meaningful, we base all three of them on Support Vector Machines (SVMs) as implemented in Joachims' (1999) SVMlight package.

3.1 One-vs-all

The standard SVM formulation applies only to binary classification. One-vs-all (OVA) (Rifkin and Klautau, 2004) is a common extension to the k-ary case. Training consists of building, for each label ℓ, an SVM binary classifier distinguishing label ℓ from "not-ℓ". We consider the final output to be a label preference function π_ova(x, ℓ), defined as the signed distance of (test) item x to the ℓ side of the ℓ vs. not-ℓ decision plane.

Clearly, OVA makes no explicit use of pairwise label or item relationships. However, it can perform well if each class exhibits sufficiently distinct language; see Section 4 for more discussion.

3.2 Regression

Alternatively, we can take a regression perspective by assuming that the labels come from a discretization of a continuous function g mapping from the feature space to a metric space. If we choose g from a family of sufficiently "gradual" functions, then similar items necessarily receive similar labels. In particular, we consider linear, ε-insensitive SVM regression (Vapnik, 1995; Smola and Schölkopf, 1998); the idea is to find the hyperplane that best fits the training data, but where training points whose labels are within distance ε of the hyperplane incur no loss. Then, for (test) instance x, the label preference function π_reg(x, ℓ) is the negative of the distance between ℓ and the value predicted for x by the fitted hyperplane function.

Wilson, Wiebe, and Hwa (2004) used SVM regression to classify clause-level strength of opinion, reporting that it provided lower accuracy than other methods. However, independently of our work, Koppel and Schler (2005) found that applying linear regression to classify documents (in a different corpus than ours) with respect to a three-point rating scale provided greater accuracy than OVA SVMs and other algorithms.
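A rough sketch of the two baseline label preference functions just described, using scikit-learn's linear SVM classes as a stand-in for the SVMlight package mentioned above; the feature matrices, parameter values, and function names are assumptions for illustration rather than the authors' setup.

```python
import numpy as np
from sklearn.svm import LinearSVC, LinearSVR

def ova_preferences(X_train, y_train, X_test, labels):
    """pi_ova(x, l): signed decision value (proportional to the distance) of
    each test item x with respect to the l vs. not-l separating plane.
    y_train is a 1-D integer array; labels lists the k possible ratings."""
    prefs = np.zeros((X_test.shape[0], len(labels)))
    for j, label in enumerate(labels):
        clf = LinearSVC().fit(X_train, (y_train == label).astype(int))
        prefs[:, j] = clf.decision_function(X_test)
    return prefs

def regression_preferences(X_train, y_train, X_test, labels, eps=0.1):
    """pi_reg(x, l): negative distance between label l and the value predicted
    for x by a linear, epsilon-insensitive regression fit."""
    reg = LinearSVR(epsilon=eps).fit(X_train, y_train)
    pred = reg.predict(X_test)
    return -np.abs(pred[:, None] - np.asarray(labels, dtype=float)[None, :])
```

Either matrix of scores can serve as the initial preference function that the metric-labeling step in Section 3.3 then adjusts.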
3.3 Metric labeling

Regression implicitly encodes the "similar items, similar labels" heuristic, in that one can restrict consideration to "gradual" functions. But we can also think of our task as a metric labeling problem (Kleinberg and Tardos, 2002), a special case of the maximum a posteriori estimation problem for Markov random fields, to explicitly encode our desideratum. Suppose we have an initial label preference function π(x, ℓ), perhaps computed via one of the two methods described above. Also, let d be a distance metric on labels, and let nn_k(x) denote the k nearest neighbors of item x according to some item-similarity function sim. Then, it is quite natural to pose our problem as finding a mapping of instances to labels that trades off the initial label preferences against a penalty for assigning dissimilar labels to similar items.
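To indicate where this formulation leads (the excerpt breaks off at this point), a sketch of the resulting objective is given below, following the general form of metric labeling in Kleinberg and Tardos (2002); the trade-off weight alpha and the monotonically increasing function f are assumptions supplied for illustration, while pi, d, nn_k, and sim are as defined above.

```latex
\min_{\{\ell_x\}} \; \sum_{x \in \text{test}}
  \Bigl[ -\pi(x, \ell_x)
         \;+\; \alpha \sum_{y \in \mathrm{nn}_k(x)}
               f\bigl(d(\ell_x, \ell_y)\bigr)\,\mathrm{sim}(x, y) \Bigr]
```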