Challenges in Machine Learning and Data Mining


Nội dung

Tu Bao Ho, JAIST (co-chair of the PAKDD Steering Committee, member of the ACML Steering Committee)
Based on materials from:
1. "9 challenges in ML" (Caruana & Joachims)
2. "10 challenging problems in DM" (Yang & Wu), International Journal of Information Technology & Decision Making (2006)

Machine learning and data mining

- Machine learning: to build computer systems that learn as well as humans do (the science of learning from data). ICML since 1982 (the 23rd ICML was held in 2006), ECML since 1989, ECML/PKDD since 2001, ACML starting in November 2009.
- Data mining: to find new and useful knowledge from large datasets (data engineering). ACM SIGKDD since 1995, PKDD and PAKDD since 1997, IEEE ICDM and SIAM DM since 2000, etc.
- Note: what is the difference between statistics, machine learning and data mining?

What is machine learning?

- The goal of machine learning is to build computer systems that can adapt and learn from their experience (Tom Dietterich).
- A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E (Tom Mitchell's book, p. 2).
- ML problems can be formulated as:
  Given: (x1, y1), (x2, y2), ..., (xn, yn), where xi is the description of an object, phenomenon, etc., and yi is some property of xi (if yi is not available, the learning is unsupervised).
  Find: a function f such that f(xi) = yi.
- Learning amounts to finding a hypothesis f in a huge hypothesis space F, narrowing the search with constraints (bias).

Overview of ML challenges

1. Generative vs. discriminative learning
2. Learning from non-vectorial data
3. Beyond classification and regression
4. Distributed data mining
5. Machine learning bottlenecks
6. Intelligible models
7. Combining learning methods
8. Unsupervised learning comes of age
9. More informed information access

1. Generative vs. discriminative methods

Training classifiers involves estimating f: X -> Y, or P(Y|X). Examples: P(apple | red ∧ round), P(noun | "cá") (cá: fish, or to bet).

Generative classifiers:
- Assume some functional form for P(X|Y) and P(Y).
- Estimate the parameters of P(X|Y) and P(Y) directly from the training data, and use Bayes rule to calculate P(Y|X = xi): P(Y|X) = P(X|Y) P(Y) / P(X), e.g. P(apple | red ∧ round) = P(red ∧ round | apple) P(apple) / P(red ∧ round).
- Examples: HMM, Markov random fields, Bayesian networks, Gaussians, Naïve Bayes, etc.

Discriminative classifiers:
- Assume some functional form for P(Y|X).
- Estimate the parameters of P(Y|X) directly from the training data.
- Examples: SVM, logistic regression, traditional neural networks, nearest neighbors, boosting, MEMM, conditional random fields, etc.

(A small code sketch contrasting the two families follows below.)
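As a concrete illustration of the two families, the following minimal scikit-learn sketch (not part of the original slides; the synthetic dataset and the particular model choices are illustrative assumptions) trains a generative classifier, Gaussian Naïve Bayes, which models P(X|Y) and P(Y) and applies Bayes rule, next to a discriminative one, logistic regression, which models P(Y|X) directly:

```python
# Hypothetical sketch: a generative classifier (Naive Bayes) vs. a
# discriminative classifier (logistic regression) on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # models P(X|Y), P(Y)
from sklearn.linear_model import LogisticRegression   # models P(Y|X) directly

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)
discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Naive Bayes accuracy:        ", generative.score(X_test, y_test))
print("Logistic regression accuracy:", discriminative.score(X_test, y_test))
```

Which of the two wins depends on the data; the point is only that both end up estimating P(Y|X), by different routes.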
Generative vs. discriminative learning

Generative approach:
- Tries to build models of the underlying patterns.
- Can be learned, adapted and generalized from small data.

Discriminative approach:
- Tries to minimize a utility function (e.g. classification error) rather than to model, represent or "understand" the pattern explicitly; a detector may find 99.99% of the faces in real images without "knowing" that a face has two eyes.
- Often needs large training data, say 100,000 labeled examples, and can hardly be generalized.

Objective: determine which is better for what, and why.
Current:
- Discriminative learning (ANN, DT, KNN, SVM) is typically more accurate: better with larger data, faster to train.
- Generative learning (graphical models, HMM) is typically more flexible: handles more complex problems and gives more flexible predictions.
Key challenges:
- Making generative methods computationally more feasible.
- Understanding when to prefer discriminative vs. generative learning.

2. Kernel methods

Objective: learning from non-vectorial input data.
Current:
- Most learning algorithms work on flat, fixed-length feature vectors.
- Each new data type requires a new learning algorithm.
- It is difficult to handle strings, gene/protein sequences, natural language parse trees, graph structures, pictures, plots, ...
Key challenges:
- One data interface for multiple learning methods.
- One learning method for multiple data types.
- Research is already in progress.

Kernel methods: the basic ideas
- A map φ: X -> F embeds the input space X into a feature space F; e.g. φ: R^2 -> R^3, φ(x1, x2) = (x1, x2, x1^2 + x2^2).
- The kernel function k: X × X -> R computes inner products in the feature space: k(xi, xj) = φ(xi)·φ(xj).
- A kernel-based algorithm works only on the kernel matrix K_{n×n} = {k(xi, xj)}; all computation is done on the kernel matrix.

Kernel methods: math background
- Built on linear algebra, probability/statistics, functional analysis and optimization.
- Mercer theorem: any positive definite function can be written as an inner product in some feature space.
- Kernel trick: use the kernel matrix (Gram matrix) instead of inner products in the feature space (see the sketch below).
- Representer theorem: every minimizer of min_{f in H} { C(f, {xi, yi}) + Ω(||f||_H) } admits a representation of the form f(·) = Σ_{i=1..m} αi K(·, xi).

Nguyen, C.H., Ho, T.B. (2008). An Efficient Kernel Matrix Evaluation Measure, Pattern Recognition, 41(11), 3366-3372.
Nguyen, D.D., Ho, T.B. (2006). A Bottom-up Method for Simplifying Support Vector Solutions, IEEE Transactions on Neural Networks, 17(3).
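The feature map φ(x1, x2) = (x1, x2, x1^2 + x2^2) above makes the kernel trick easy to check numerically. The NumPy sketch below (an illustrative example, not from the slides) verifies that the kernel k(x, z) = x·z + ||x||^2 ||z||^2 equals φ(x)·φ(z), and builds the Gram matrix a kernel-based algorithm would consume:

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

def k(x, z):
    """Kernel giving phi(x).phi(z) without ever forming phi explicitly."""
    return x @ z + (x @ x) * (z @ z)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # five points in the input space R^2

# Kernel trick: both computations give the same number.
assert np.allclose(k(X[0], X[1]), phi(X[0]) @ phi(X[1]))

# Gram (kernel) matrix K[i, j] = k(x_i, x_j); this is all a kernel algorithm needs.
K = np.array([[k(xi, xj) for xj in X] for xi in X])
print(K.shape)   # (5, 5)
```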
3. Beyond classification and regression

Objective: learning to predict complex objects.
Current:
- Most machine learning focuses on classification and regression.
- Discriminative methods often outperform generative methods.
- Generative methods are used for learning complex objects (e.g. language parsing, protein sequence alignment, information extraction).
Key challenges:
- Extend discriminative methods (ANN, DT, KNN, SVM, ...) to more general settings; examples include ranking functions (e.g. Google's top ten, ROC), natural language parsing, finite-state models, and learning to rank.
- Find ways to directly optimize the desired performance criteria (e.g. ranking performance rather than error rate).

What is structured prediction? (Daume)
- Structured prediction is a fundamental machine learning task involving classification or regression in which the output variables are mutually dependent or constrained.

What is structured prediction? (Lafferty)
- Text, sound, event logs, biological data, handwriting, gene networks and linked data structures such as the Web can be viewed as graphs connecting basic data elements.
- Important tasks involving structured data require computing a labeling of the nodes or edges of the underlying graph. For example, POS tagging of natural language text can be seen as labeling the nodes that represent successive words with linguistic labels.
- A good labeling depends not just on individual nodes but also on the contents and labels of nearby nodes (the preceding and following words), so the labels are not independent. Such dependencies and constraints reflect the sequential, spatial or combinatorial structure of the problem domain, and capturing these interactions is often as important for prediction as capturing input-output dependencies.

Structured prediction (SP): the machine learning task of generating outputs with complex internal structure.
- Example: handwriting recognition maps a sequence of letter images x to a word y (the handwritten word "brace" in the slide's figure).
- Sequential structure: (a) an unstructured model predicts each label y_t from x_t independently; (b) a linear-structured model links y_{t-1}, y_t, y_{t+1} in a chain over the inputs x_{t-1}, x_t, x_{t+1}.

Labeling sequence data problem
- X is a random variable over data sequences; Y is a random variable over label sequences, whose labels range over a finite label alphabet A.
- Problem: learn how to assign labels from a closed set to a data sequence. Example: X = "Thinking is being", Y = noun verb noun (a small decoding sketch follows below).
- Applications: POS tagging and phrase typing (NLP), named entity recognition (IE), modeling protein sequences (computational biology), image segmentation and object recognition (pattern recognition), etc.
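To make the sequence labeling setting concrete, here is a small NumPy sketch of Viterbi decoding in a linear-chain (HMM-style) model for the "Thinking is being" example; all probabilities are made-up assumptions chosen only for illustration:

```python
import numpy as np

# Toy linear-chain model for labeling a 3-word sentence with {noun, verb}.
# All numbers are invented for illustration.
words  = ["Thinking", "is", "being"]
labels = ["noun", "verb"]

log_start = np.log([0.6, 0.4])             # P(y_1 = label)
log_trans = np.log([[0.3, 0.7],            # P(y_t = col | y_{t-1} = row)
                    [0.8, 0.2]])
log_emit  = np.log([[0.7, 0.3],            # P(word_t | y_t = label), one row per word
                    [0.2, 0.8],
                    [0.6, 0.4]])

# Viterbi decoding: the most likely label sequence for the observed words.
T, L = len(words), len(labels)
delta = np.zeros((T, L))                   # best log-score ending in each label
back  = np.zeros((T, L), dtype=int)        # backpointers
delta[0] = log_start + log_emit[0]
for t in range(1, T):
    scores = delta[t - 1][:, None] + log_trans + log_emit[t]   # (prev, cur)
    back[t]  = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0)

path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print(list(zip(words, [labels[i] for i in path])))   # -> noun, verb, noun
```

Conditional random fields replace these fixed tables with learned, feature-based scores, but the decoding step has the same shape.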
4. Distributed learning

Objective: DM/ML with distributed data.
Current:
- Most ML algorithms assume random access to all of the data.
- Data often comes from decentralized sources (e.g. sensor networks, multiple organizations, learning across firewalls or different security systems).
- Many projects are infeasible (e.g. an organization is not allowed to share its data).
Key challenges:
- Develop methods for distributing data while preserving privacy.
- Develop methods for distributed learning without distributing the data.

Le, N.T., Ho, T.B., Ho, B.H. (2010). Sequence-dependent histone variant positioning signatures, BMC Genomics, Vol. 11 (S4).
Le, N.T., Ho, T.B., Tran, D.H. (2009). Characterizing nucleosome dynamics from genomic and epigenetic information using rule induction learning, BMC Genomics, 10 (Suppl. 3).

5. Full auto: ML for the masses

Objective: make ML easier to apply to real problems.
Current:
- ML applications require detailed knowledge about the algorithms.
- Preparing and preprocessing the data takes at least 75% of the effort.
Key challenges:
- Automatic selection of the machine learning method.
- Tools for preprocessing the data: reformatting, sampling, filling in missing values, outlier detection.
- Robust performance estimation and model selection.

Ho, T.B., Nguyen, T.D., Shimodaira, H., Kimura, M. (2003). A Knowledge Discovery System with Support for Model Selection and Visualization, Applied Intelligence, Kluwer Academic Publishers, Vol. 19, Issue 1-2, 125-141.

6. Interpretable models

Objective: make learning results more understandable.
Current:
- Methods often achieve good prediction accuracy, but the prediction rule appears complex and is difficult to verify.
- The result is a lack of trust in the rule and a lack of insight.
Key challenges:
- Machine learning methods that are understandable and still generate accurate rules.
- Methods for generating explanations.
- Model verification for user acceptance.

7. Ensemble methods

Objective: combining learning methods automatically.
Current:
- We do not have a single DM/ML method that "does it all".
- Results indicate that combining models yields large improvements in performance.
- The focus so far has been on boosting and bagging.
Key challenges:
- Develop methods that combine the best of different learning algorithms.
- Searching for good combinations might be more efficient than designing one "global" learning algorithm.
- A theoretical explanation of why and when ensemble methods help.

8. Unsupervised learning

Objective: improve the state of the art in unsupervised learning.
Current:
- The research focus in the 1990s was supervised learning.
- Much progress has been made on supervised learning methods such as neural networks, support vector machines, boosting, etc.
- Unsupervised learning needs to "catch up".
Key challenges:
- More robust and stable methods for clustering.
- Include user feedback in unsupervised learning (e.g. clustering with constraints, semi-supervised learning, transduction).
- Automatic distance metric learning.
- Clustering as an interactive data analysis process.

9. Information access

Objective: more informed information access.
Current:
- Bag-of-words representations.
- Retrieval functions exploit document structure and link structure.
- Information retrieval is a process without memory.
Key challenges:
- Develop methods for exploiting usage data.
- Learn from the query history of a user or user group.
- Preserve privacy while mining access patterns.
- Exploit common access patterns and find "groups" of users.
- Web Expert: an agent that learns the web (beyond Google).
- Topic modelling (a small sketch follows below).
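The bag-of-words and topic modelling items can be illustrated with a minimal scikit-learn sketch (not from the slides; the four-document corpus is invented and far too small for real use):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus; real query logs or document collections are far larger.
docs = [
    "machine learning models for protein sequence data",
    "kernel methods and support vector machines for sequences",
    "web search query logs and user access patterns",
    "ranking web pages from click and link structure",
]

vec = CountVectorizer(stop_words="english")          # bag-of-words counts
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]                  # five highest-weight terms
    print(f"topic {i}:", ", ".join(terms[j] for j in top))
```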
What is data mining?

"Data-driven discovery of models and patterns from massive observational data sets." It draws on statistics and inference, languages and representations, data management, and applications.

Overview of DM challenges (ICDM'05)

1. Developing a unifying theory of data mining
2. Scaling up for high-dimensional data and high-speed streams
3. Mining sequence data and time series data
4. Mining complex knowledge from complex data
5. Data mining in a network setting
6. Distributed data mining and mining multi-agent data
7. Data mining for biological and environmental problems
8. Data-mining-process related problems
9. Security, privacy and data integrity
10. Dealing with non-static, unbalanced and cost-sensitive data

1. Developing a unifying theory of DM

- The current state of the art in data mining research is too ad hoc: techniques are designed for individual problems and there is no unifying theory. Unifying research is needed.
- Long-standing theoretical issues remain, such as exploration vs. explanation. Example: the VC (Vapnik-Chervonenkis) dimension. In statistical learning theory, or sometimes computational learning theory, the VC dimension is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter; the VC dimension of a perceptron is 3.
- How can spurious correlations be avoided?
- Can knowledge discovery uncover hidden causes, similar to the discovery of Newton's laws?

2. Scaling up for high-dimensional data and high-speed streams

- Scaling up is needed for ultra-high-dimensional classification problems (millions or billions of features, e.g. biological data) and for ultra-high-speed data streams.
- Streams need deep research: they are a continuous, online process. For example, how to monitor network packets for intruders? How to handle concept drift and environment drift?
(Excerpt from Jian Pei's tutorial, http://www.cs.sfu.ca/~jpei/)

3. Mining sequence data and time series data

- How to efficiently and accurately cluster, classify and predict trends?
- Time series data used for prediction are contaminated by noise: how to make accurate short-term and long-term predictions? Signal processing techniques introduce lags in the filtered data, which reduces accuracy.
- The keys are source selection, domain knowledge in the rules, and optimization methods.
(Example in the slides: real time series data obtained from wireless sensors in the Hong Kong UST CS department hallway.)

4. Mining complex knowledge from complex data

- Mining graphs and data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other and are not of a single type; the goal is to mine the rich structure of relations among objects, e.g. interlinked Web pages, social networks, metabolic networks in the cell.
- Integration of data mining and knowledge inference. The biggest gap: systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user.
- More research is needed on the interestingness of knowledge.

5. Data mining in a network setting

- Community and social networks: linked data between emails, Web pages, blogs, citations, sequences and people; static and dynamic structural behavior.
- Mining in and for computer networks: detect anomalies such as sudden traffic spikes due to DoS (denial of service) attacks; need to handle 10-gigabit Ethernet links in order to (a) detect, (b) trace back and (c) drop the offending packets.
(Figures in the slides: an example of packet streams, data courtesy of NCSA, UIUC; picture from Matthew Pirretti's slides, Penn State.)

6. Distributed data mining and mining multi-agent data

- Need to correlate the data seen at the various probes (such as in a sensor network).
- Adversary data mining: the data may be deliberately manipulated to sabotage the miner (e.g. to make it produce false negatives); game theory may be needed to help (illustrated in the slides by a small two-player H/T game between the miner and an adversary with outcomes (-1,1) and (1,-1)).

7. Data mining for biological and environmental problems

- New problems raise new questions, and large-scale problems especially so.
- Biological data mining, such as HIV vaccine design: DNA, chemical properties, 3D structures and functional properties need to be fused (genomics: about 25,000 genes; proteomics: about 2,000,000 proteins; metabolomics: about 3,000 metabolites).
- Environmental data mining; mining for solving the energy crisis.

Nguyen, T.P., Ho, T.B. (2008). An Integrative Domain-Based Approach to Predicting PPI, Journal of Bioinformatics and Computational Biology, Vol. 6, Issue 6.
Tran, D.H., Satou, K., Ho, T.B. (2008). Finding microRNA regulatory modules in human genome using rule induction, BMC Bioinformatics, Vol. 9.

8. Data-mining-process related problems

- How to automate the mining process? This concerns the composition of data mining operations, data cleaning with logging capabilities, and visualization and mining automation.
- A methodology is needed to help users avoid many data mining mistakes. What is a canonical set of data mining operations?
- Data mining is one step in the KDD process, consisting of methods that produce useful patterns or models from the data (in many cases KDD is simply viewed as data mining).
- KDD is inherently interactive and iterative: (1) understand the domain and define the problems, (2) collect and preprocess the data (maybe 70-90% of the effort and cost in KDD), (3) mine the data to extract patterns or models, (4) interpret and evaluate the discovered knowledge, (5) put the results to practical use.

9. Security, privacy and data integrity

- How to ensure the users' privacy while their data are being mined?
- How to do data mining for the protection of security and privacy?
- Knowledge integrity assessment.
- Perturbation methods: add a random number to each value before it leaves the client, e.g. Alice's age 30 becomes 65 (30 + 35); the miner then reconstructs the distribution of each attribute (age, salary, ...) from the randomized values before running the classification algorithm (a small numeric sketch follows below).
- Secure multi-party computation (SMC) methods.

Luong, T.D., Ho, T.B. (2010). Privacy Preserving Frequency Mining in 2-Part Fully Distributed Setting, IEICE Trans. Information Systems, Vol. E93-D, No. 10, 2701-2708, October 2010.
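The additive-perturbation idea can be sketched in a few lines of NumPy (an illustration only; the age distribution and the uniform noise on [-35, 35] are assumptions, and real reconstruction methods estimate the full distribution rather than just its moments):

```python
import numpy as np

rng = np.random.default_rng(0)

ages = rng.integers(18, 80, size=10_000)          # true (private) ages, never shared
noise = rng.uniform(-35, 35, size=ages.size)      # perturbation with a known distribution
published = ages + noise                          # only this leaves each client

# The miner never sees `ages`, yet can estimate aggregate statistics because the
# noise has known mean 0 and known variance 70**2 / 12.
est_mean = published.mean()
est_var = published.var() - 70.0 ** 2 / 12.0      # Var(age) ~ Var(published) - Var(noise)

print("true mean/variance:     ", ages.mean(), ages.var())
print("estimated mean/variance:", est_mean, est_var)
```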
10. Dealing with non-static, unbalanced and cost-sensitive data

- The UCI datasets are small and not highly unbalanced; real-world data are large (on the order of 10^5 features) and often fewer than 1% of the examples belong to the useful (positive) class.
- There is much information on costs and benefits, but no overall model of profit and loss.
- Data may evolve, with a bias introduced by sampling pressure.
- Example in the slides: a medical record with several tests (blood test, temperature 39°C, cardiogram, ...) and missing values, where each test incurs a cost, the data are extremely unbalanced, and the data change with time.

Comparing the challenges

Machine learning challenges:
1. Generative vs. discriminative learning
2. Learning from non-vectorial data
3. Beyond classification and regression
4. Distributed data mining
5. Machine learning bottlenecks
6. Intelligible models
7. Combining learning methods
8. Unsupervised learning comes of age
9. More informed information access

Data mining challenges:
1. Unifying theory of data mining
2. Scaling up for high-dimensional data / high-speed streams
3. Sequence data and time series data
4. Complex knowledge from complex data
5. Data mining in a network setting
6. Distributed DM and mining multi-agent data
7. Biological and environmental problems
8. Data-mining-process related problems
9. Security, privacy and data integrity
10. Non-static, unbalanced and cost-sensitive data

Some papers can be found at http://www.jaist.ac.jp/~bao/K417-2008

What is the structure of water? Water in biological systems

- The structure of water is one of the 100 outstanding unsolved problems in science.
- Water and proteins: protein folding and protein function depend on hydration water; hydration water molecules are trapped at the surface of a protein on the picosecond timescale (milli 10^-3, micro 10^-6, nano 10^-9, pico 10^-12, femto 10^-15, atto 10^-18 seconds).