NIPS 2006 Workshop on Causality and Feature Selection


Workshop at the NIPS*2006 conference, Whistler, Canada, on December 8, 2006.



Determining and exploiting causal relationships is central to human reasoning and decision making. Yet most of machine learning does not attempt to uncover causal relationships in data, as this is unnecessary for making good predictions. For instance, in medical diagnosis, the abundance of a protein in serum may be used as a predictor of disease. It is not relevant to know whether the protein is a cause of the disease (resulting from a gene mutation) or a consequence (an antibody responding to inflammation). If one is interested only in a diagnosis, the abundance of the protein suffices as an indicator of disease. So in the end, machine learning might not be much concerned with causality. This statement would be true if all machine learning problems were about minimizing prediction error only. However, more and more applications require assessing the results of given actions. Such assessment is essential in many domains, including epidemiology, medicine, ecology, economics, sociology and business.

Predictive models based simply on event correlations do not model mechanisms. They allow us to make predictions in a stationary environment (no change in the distribution of the variables), but not to predict the consequences of given actions. For instance, smoking and coughing are both predictive of respiratory disease: one is a cause and the other a symptom. Acting on the cause can change the disease state; acting on the symptom cannot. It is therefore extremely important to distinguish between causes and symptoms in order to predict the consequences of actions.
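The distinction can be made concrete with a small simulation. The sketch below (with illustrative, made-up probabilities, not from any study) encodes the mechanism smoking -> disease -> cough, and shows that forcing the cause changes the disease rate while forcing the symptom leaves it untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_smoking=None, do_cough=None):
    # Ground-truth mechanism: smoking -> disease -> cough.
    # Passing do_smoking / do_cough clamps that variable, mimicking an intervention.
    smoking = rng.random(n) < 0.3 if do_smoking is None else np.full(n, do_smoking)
    disease = rng.random(n) < np.where(smoking, 0.4, 0.05)
    cough = rng.random(n) < np.where(disease, 0.8, 0.1) if do_cough is None else np.full(n, do_cough)
    return disease

# Intervening on the cause changes disease prevalence...
print(simulate(do_smoking=True).mean())   # ≈ 0.40
print(simulate(do_smoking=False).mean())  # ≈ 0.05
# ...while intervening on the symptom leaves it unchanged.
print(simulate(do_cough=True).mean())     # ≈ 0.155 (baseline rate)
print(simulate(do_cough=False).mean())    # ≈ 0.155 (baseline rate)
```

Both variables are nonetheless good predictors of disease in observational data, which is precisely why correlation-based models cannot tell them apart.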

In the past few years, much effort in machine learning has been devoted to feature selection, the art of uncovering significant dependencies between the input variables and a desired outcome. State-of-the-art feature selection methods can select relevant features among millions of distractors, with fewer than a hundred examples. In contrast, causal models usually deal with just a few variables and quasi-perfect knowledge of the variable distribution, which implies an abundance of training examples. Recent developments in feature selection and causal inference can therefore cross-fertilize both fields.

See more information on NIPS 2006

Targeted group of participants:

This workshop is especially targeted at researchers interested in understanding the outcome of uncontrolled or partially controlled experiments. Such experiments occur in many areas (medicine, biology, sociology, economy, finance and marketing, etc.) and are typical of studies where the subjects cannot be isolated from their environment.

Related workshops

NIPS 2006 Workshop on Learning when test and training inputs have different distributions , December 9, organized by Joaquin Quiñonero Candela (Carlos III University of Madrid and Technical University of Berlin), Neil Lawrence (University of Sheffield), Anton Schwaighofer (Fraunhofer FIRST) and Masashi Sugiyama (Tokyo Institute of Technology).

Preliminary investigations on feature selection and causality have revealed strong ties with the problem of making learned models robust to distributional changes. The audience of this workshop might therefore also be interested in the "different input distributions" workshop, and vice versa.
Past relevant workshops:

Call for Abstracts

We invite the submission of extended abstracts (1 to 4 pages). A selection of the submitted abstracts will be accepted as oral or poster presentations, and the authors of selected contributions will be invited to publish an extended version in the workshop proceedings.


...the call for abstracts is now closed...

Discussion Forum

We plan to have an informal discussion session and dinner on the evening of December 8th, with participants of our workshop and the "learning when test and training inputs have different distributions" workshop. Please contact Isabelle Guyon if you would like to be on the dinner mailing list (we need to find a large enough restaurant).


7.30-7.45am Welcome and introduction
Andre Elisseeff, IBM Zurich Research Lab, Switzerland
7.45-8.45am An Introduction to Causal Modeling and Discovery Using Graphical Models (ppt slides)
Gregory F. Cooper, University of Pittsburgh, USA

This tutorial will describe fundamental approaches for representing causal relationships with graphical models, particularly causal Bayesian networks. The tutorial also will describe several representative algorithms for discovering causal relationships from observational and/or experimental data. The strengths and limitations of the representation and algorithms will be discussed.
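As a minimal illustration of the tutorial's topic (the probability tables below are made up for the example, not taken from the talk), the following sketch contrasts observing a variable with intervening on it in a three-node causal Bayesian network where Z confounds X and Y:

```python
# Toy causal Bayesian network: Z -> X, Z -> Y, X -> Y (Z confounds X and Y).
# All variables are binary; the CPTs are illustrative.
P_z = {0: 0.5, 1: 0.5}
P_x_given_z = {0: 0.9, 1: 0.2}           # P(X=1 | Z=z)
P_y_given_xz = {(0, 0): 0.1, (0, 1): 0.4,  # P(Y=1 | X=x, Z=z)
                (1, 0): 0.3, (1, 1): 0.8}

def joint(x, y, z):
    """Joint probability factorized along the causal graph."""
    px = P_x_given_z[z] if x else 1 - P_x_given_z[z]
    py = P_y_given_xz[(x, z)] if y else 1 - P_y_given_xz[(x, z)]
    return P_z[z] * px * py

# Observational conditioning: P(Y=1 | X=1) by summing the joint.
num = sum(joint(1, 1, z) for z in (0, 1))
den = sum(joint(1, y, z) for y in (0, 1) for z in (0, 1))
p_obs = num / den

# Intervention: do(X=1) cuts the Z -> X edge, so Z keeps its prior.
p_do = sum(P_z[z] * P_y_given_xz[(1, z)] for z in (0, 1))

print(p_obs, p_do)  # observational ≈ 0.39, interventional = 0.55
```

The two quantities differ because conditioning on X=1 also gives information about the confounder Z, whereas the intervention does not; this gap is exactly what purely predictive models cannot capture.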
8.45-9.00am Poster highlights: D, F, A, E, I, G, H
9.00-9.30am Break (Poster and Coffee)
9.30-9.55am Contrasting Feature Selection for Causal Inference with Feature Selection for Classification
Frederick Eberhardt (joint work with David Danks and Peter Spirtes), Carnegie Mellon University, USA

We will contrast the properties that make a set of features useful for the purpose of classification with the properties that make a set of features useful for several different kinds of causal inference. We will also explain how these differences make feature selection for causal inference harder to scale up to large numbers of variables than feature selection for classification, and describe several approaches that have recently been taken to extend feature selection algorithms for causal inference to larger numbers of features.
9.55-10.20am Inferring causal directions by evaluating the complexity of conditional distributions (slides)
X. Sun, D. Janzing, B. Schoelkopf, Max Planck Institute, Tuebingen, Germany

We propose a new approach to inferring the causal structure that has generated the observed statistical dependences among n random variables. The idea is that the factorization of the joint measure of cause and effect into P(cause)P(effect|cause) typically leads to simpler conditionals than non-causal factorizations. To evaluate the complexity of the conditionals we have tried two methods. First, we have compared them to those which maximize the conditional entropy subject to the observed first and second moments, since we consider the latter the simplest conditionals. Second, we have fitted the data with conditional probability measures that are exponentials of functions in an RKHS and defined the complexity by a Hilbert-space semi-norm. Such a complexity measure has several properties that are useful for our purpose. We describe some encouraging results with both methods applied to real-world data. Moreover, we have combined constraint-based approaches to causal discovery (i.e., methods using only information on conditional statistical dependences) with our method in order to distinguish between causal hypotheses which are equivalent with respect to the imposed independences. Furthermore, we compare the performance to Bayesian approaches to causal inference.
10.20-10.30am Poster highlights: B, C, J
10.30am Adjourn
4.00-4.45pm Feature Selection and Causal Discovery (ppt slides)
Isabelle Guyon, Clopinet, USA, André Elisseeff, IBM Research, Switzerland, and Constantin Aliferis, Vanderbilt University, USA

This presentation introduces the problems posed by feature selection and learning causal dependencies. What is feature selection? Why is it hard? What works best in practice? How to make progress using causality? Can causal discovery benefit from feature selection? We argue that machine learning researchers should abandon the usual motto of predictive modeling: “we don’t care about causality”. Feature selection may benefit from introducing a notion of causality:
  • To be able to predict the consequence of given actions.
  • To add robustness to the predictions if the input distribution changes.
  • To get more compact and robust feature sets.
Reciprocally, causal discovery from observational data may benefit from advances in feature selection algorithms. Causal discovery is not entirely solved by experiments, because Randomized Controlled Trials (RCTs) may be:
  • Unethical (e.g. an RCT on the effects of smoking)
  • Costly and time consuming
  • Impossible (e.g. astronomy)
Using already collected observational data may help plan future experiments by spotting the most relevant "causal features", which are the most promising targets of manipulation.
4.45-5.10pm Using SVM Weight-Based Methods to Identify Causally Relevant and Non-Causally Relevant Variables (ppt slides)
Alexander Statnikov, Douglas Hardin and Constantin Aliferis, Vanderbilt University, USA

Variable selection is often used to derive insights into the causal structure of the data-generating process. For example, in biology and medicine, biomarkers are sought to better understand the factors that cause disease, determine its progression, and identify the members of the relevant molecular pathways. We conducted a simulation experiment to study SVM weight-based ranking and variable selection methods using two network structures that are often encountered in biological systems and are expected to occur in many other settings as well. We attempted to recover both causally and non-causally relevant variables using SVM weight-based methods under a variety of experimental settings (data-generating network, noise level, sample size, and SVM penalty parameter). Our experiments show that SVMs can sometimes produce excellent classifiers that assign higher weights to irrelevant variables than to the relevant ones. Likewise, the application of the recursive variable selection technique SVM-RFE does not remedy this problem. More importantly, we found that when it comes to identifying causally relevant variables, SVM weight-based methods can fail by assigning higher weights to, or selecting (in the context of SVM-RFE), variables that are relevant but non-causally so. Furthermore, even irrelevant variables can receive higher weights or be selected more often than the causally relevant ones. These results are corroborated by a theoretical analysis as well as recent research employing high-fidelity re-simulation in biological and medical domains. Thus, the totality of empirical evidence so far suggests that causal interpretation of current state-of-the-art SVM variable selection results must be undertaken with great caution by practitioners. We show that this problem is not linked to the specific variable selection techniques studied, but rather that the maximum-margin inductive bias, as typically employed by SVM-based methods, is locally causally inconsistent.
New SVM methods may be needed to address this issue and this is an exciting and challenging area of research.
5.10-5.35pm Discovery of linear acyclic models in the presence of latent classes using ICA mixtures (pdf, slides)
Shohei Shimizu (1,2), and Aapo Hyvärinen (1)
1. Helsinki Institute for Information Technology, Finland
2. The Institute of Statistical Mathematics, Tokyo, Japan

Causal discovery is the task of finding plausible causal relationships from statistical data. Such methods rely on various assumptions about the data generating process to identify it from uncontrolled observations. We have recently proposed a causal discovery method based on independent component analysis (ICA) called LiNGAM, showing how to completely identify the data generating process under the assumptions of linearity, non-gaussianity, and no hidden variables. In this paper, after briefly recapitulating this approach, we extend the framework to cases where latent (hidden) classes are present. The model identification can be accomplished using ICA mixtures. Experiments confirm the performance of the proposed method.
See the report and JMLR paper for more details.
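The core LiNGAM idea can be sketched in the two-variable case (a toy illustration of the principle, not the full LiNGAM algorithm; the data and dependence measure below are assumptions for the example): with a linear relation and non-Gaussian disturbances, only regression in the true causal direction yields residuals independent of the regressor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Ground truth: x -> y, linear, with non-Gaussian (uniform) disturbances.
x = rng.uniform(-1, 1, n)
y = 2.0 * x + rng.uniform(-1, 1, n)

def dependence_score(cause, effect):
    """Regress effect on cause (OLS), then measure a higher-order
    dependence between residual and regressor. OLS forces the plain
    correlation to zero either way, so we use a cubic statistic; it
    vanishes only when residual and regressor are truly independent."""
    b = np.cov(cause, effect)[0, 1] / np.var(cause)
    resid = effect - b * cause
    return abs(np.corrcoef(cause**3, resid)[0, 1])

forward = dependence_score(x, y)   # true direction: near 0
backward = dependence_score(y, x)  # wrong direction: clearly larger
print(forward, backward)
```

With Gaussian disturbances both scores would vanish and the direction would be unidentifiable, which is why non-Gaussianity is the key assumption in LiNGAM.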
5.35-6.00pm Break (Poster and Coffee)
6.00-6.25pm LassoOrderSearch: Learning Directed Graphical Model Structure using L1-Penalized Regression and Order Search. (slides, short paper)
Mark Schmidt and Kevin Murphy, University of British Columbia, Canada

To speed up the search for the best DAG structure given data, we propose to use the regularization paths from L1-penalized regression to rapidly find the best set of parents given a node ordering. This reduces the complexity of evaluating a node ordering from O(N d^3 d^K) to O(N d^5) time (O(N d^4) in the linear-Gaussian case), where N is the number of data cases, d is the number of nodes, and K is the maximum number of parents (fan-in). Not only is this approach much faster than previous approaches, but our linear (instead of exponential) dependence on the number of parents allows us to tractably fit much more complex models. We provide experimental comparisons with several other heuristic search techniques illustrating the effectiveness of this approach.
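The key step, selecting each node's parents from its predecessors in a fixed ordering via L1-penalized regression, can be sketched as follows (a toy coordinate-descent lasso on simulated linear-Gaussian data; the graph, penalty value and threshold are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Ground-truth linear-Gaussian DAG over the ordering (x0, x1, x2):
# x0 and x1 are independent sources; x2 has parents {x0, x1}.
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = 1.5 * x0 - 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

def lasso(A, y, lam, iters=200):
    """Plain coordinate-descent lasso (soft-thresholding updates)."""
    w = np.zeros(A.shape[1])
    col_sq = (A * A).sum(axis=0)
    for _ in range(iters):
        for j in range(A.shape[1]):
            r = y - A @ w + A[:, j] * w[j]   # partial residual
            rho = A[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

# Given the ordering, candidate parents of node i are nodes 0..i-1;
# the lasso's nonzero coefficients select the parent set.
parents = {0: []}
for i in (1, 2):
    w = lasso(X[:, :i], X[:, i], lam=300.0)
    parents[i] = [j for j in range(i) if abs(w[j]) > 1e-3]
print(parents)  # expected: {0: [], 1: [], 2: [0, 1]}
```

Because each node's regression is solved independently along a regularization path, the cost grows only linearly with the number of parents, which is the source of the speed-up claimed above.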
6.25-7.00pm Discussion
7.00pm Adjourn


A Application and Comparative Evaluation of Causal and Non-Causal Feature Selection Algorithms for Biomarker Discovery in High-Throughput Biomedical Datasets
C.F. Aliferis (1,2), A. Statnikov (1), I. Tsamardinos (1,4), E. Kokkotou (5) and P.P. Massion (1,3)
1. Vanderbilt University School of Medicine, USA
2. Vanderbilt Ingram Comprehensive Cancer Center, USA
3. University of Crete, Greece
4. Beth Israel Deaconess Medical Center and Harvard Medical School, USA

B Bayesian structure learning using dynamic programming and MCMC (slides, short paper, more info)
Daniel Eaton and Kevin Murphy, University of British Columbia, Canada.

We show how to significantly speed up MCMC sampling of DAG structures by using a powerful non-local proposal based on Koivisto’s dynamic programming (DP) algorithm, which computes the exact marginal posterior edge probabilities by analytically summing over orders. Furthermore, we show how sampling in DAG space can avoid subtle biases that are introduced by approaches that work only with orders, such as Koivisto’s DP algorithm and MCMC order samplers.
C Supervised Feature Selection via Dependence Estimation (short paper, more info)
Le Song (1), Alex Smola (2), Arthur Gretton (3) and Karsten Borgwardt (4).
1. NICTA and University of Sydney, Australia
2. SML, National ICT Australia
3. MPI for Biological Cybernetics, Germany
4. Ludwig-Maximilians-University, Germany

We introduce a framework of feature filtering for supervised learning. It employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between data and labels. The key idea is that good features should maximize such dependence. Feature selection for various supervised learning problems (including binary, multiclass and regression problems) can be unified under this framework, and the solution is approximated using a backward-elimination algorithm. In particular, for binary problems, HSIC is related to criteria such as Pearson's correlation, the signal-to-noise ratio, Maximum Mean Discrepancy and Kernel-Target Alignment. We conducted experiments on various real-world data sets, which demonstrate the usefulness of this framework.
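A minimal numerical sketch of HSIC-based feature scoring (using the standard biased empirical HSIC estimator with Gaussian kernels; the bandwidth and toy data are assumptions for the example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n-1)^2, where K and L
    are Gaussian Gram matrices of x and y and H centers them."""
    def gram(v):
        d = v[:, None] - v[None, :]
        return np.exp(-d**2 / (2 * sigma**2))
    K, L = gram(x), gram(y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

y = rng.choice([-1.0, 1.0], size=n)      # binary labels
relevant = y + 0.3 * rng.normal(size=n)  # feature that tracks the label
irrelevant = rng.normal(size=n)          # pure noise

print(hsic(relevant, y), hsic(irrelevant, y))
```

The relevant feature scores far higher than the noise feature; the backward-elimination algorithm described above would repeatedly drop the feature whose removal decreases the HSIC of the remaining set the least.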
D From Perturbation Data to a Causal Functional Pathway Representation. (short paper)
Nir Yosef (1), Alon Kaufman (2) and Eytan Ruppin (3)
1. School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
2. Center of Neural Computation, Hebrew University, Jerusalem, Israel
3. School of Medicine, Tel-Aviv University, Tel-Aviv, Israel
E Information Flows in Causal Networks (pdf)
Nihat Ay (1,2) and Daniel Polani (3)
1. Max Planck Institute for Mathematics in the Sciences Leipzig, Germany
2. Santa Fe Institute Santa Fe, USA
3. University of Hertfordshire, Hatfield, UK

We introduce a notion of causal independence based on virtual interventions, a fundamental concept of the theory of causal networks. Causal independence allows one to define a measure for the strength of a causal effect. We call this measure information flow. A theoretical result, examples and comparisons with conventional mutual information measures are given.
F Causal Discovery Algorithms based on Y Structures (ppt slides,short paper)
Subramani Mani (1) and Gregory F. Cooper (2)
1. Vanderbilt University, USA
2. University of Pittsburgh, USA
G Principled selection of impure measures for consistent learning of linear latent variable models (pdf)
Ricardo Silva, Gatsby Computational Neuroscience Unit, University College London.

In previous work, we have developed a principled way of learning the causal structure of linear latent variable models (Silva et al., 2006). However, we have considered the case for models with pure measures only. Pure measures are observed variables that measure no more than one latent variable. This paper presents theoretical extensions that justify the selection of some types of impure measures, allowing us to discover hidden variables that could not be identified in the previous case.
H Experimental Learning of Causal Models with Latent Variables (extended abstracts)
Sam Maes (1), Stijn Meganck (2) and Philippe Leray (1)
1. INSA Rouen, France
2. Vrije Universiteit, Brussel, Belgium

This article discusses graphical models that can handle latent variables without explicitly modeling them quantitatively. Several paradigms exist for such problem domains, two of which are semi-Markovian causal models and maximal ancestral graphs. Applying these techniques to a problem domain typically involves several steps: structure learning from observational and experimental data, parameter learning, probabilistic inference, and quantitative causal inference. A problem is that research in each of the existing approaches focuses on only one or a few of the steps involved in modeling a problem with latent variables. In other work we have investigated the complete process from observational and experimental data to different types of efficient inference. The goal of this article is to focus on learning the structure of causal models in the presence of latent variables from a combination of observational and experimental data.
I Evaluation of Local Causal Discovery Algorithm using Simulated Gene Network of Malignant Mesotheliomas in Mice (short paper)
Changwon Yoo, Erik M. Brilz, Mark Pershouse and Elizabeth Putnam, University of Montana Missoula, USA

Recent advances in cancer research have provided new insights into the molecular mechanisms underlying tumor progression. Thus, it is not surprising that developing methods to discover causal gene regulation pathways from high-throughput data, such as DNA microarrays, is becoming an increasingly important problem in cancer research. To this end, it is desirable to compare experiments in which the system is subjected to complete interventions on some genes, e.g., gene knock-outs, with experiments in which the system is not intervened upon. However, it is expensive and sometimes difficult (if not impossible) to conduct wet-lab experiments with complete interventions on genes in animal models, e.g., a mouse model. It would therefore be helpful to discover promising causal relationships among genes from observational data alone, in order to identify promising genes to perturb in the system that can later be verified in wet laboratories. This paper describes the method and evaluation of a causal analysis algorithm, the Equivalence Local Implicit latent variable scoring Method (EquLIM), on data generated from a gene network simulator that implements the process of malignant mesotheliomas in mice. We first apply EquLIM to a small amount of simulated data with no interventions (≤ 100 cases) and compare the results to a previous implementation of EquLIM. The implementation described in this paper showed better prediction results in terms of positive predictive value and area under the receiver operating characteristic curve.
J Local Factor Analysis with Automatic Model Selection and Data Smoothing Based Regularization (short paper)
Lei Shi and Lei Xu, Chinese University of Hong Kong Shatin, Hong Kong

Local factor analysis (LFA) is regarded as an efficient approach to local feature extraction and dimensionality reduction. We further investigate an automatic BYY harmony data-smoothing LFA (LFA-HDS) from the Bayesian Ying-Yang (BYY) harmony learning point of view. At the level of regularization, a data-smoothing-based regularization technique is adapted to this automatic LFA-HDS learning for problems with small sample sizes, while at the level of model selection, the proposed automatic LFA-HDS algorithm performs parameter learning with automatic determination of both the number of components and the number of factors in each component. A comparative study has been conducted on simulated data sets and several real-world data sets. The algorithm has been compared not only with a recent approach called Incremental Mixture of Factor Analysers (IMoFA) but also with the conventional two-stage implementation of maximum likelihood (ML) plus model selection, namely, using the EM algorithm for parameter learning on a series of candidate models and selecting the best candidate by AIC, CAIC, BIC, or cross-validation (CV). Experiments have shown that IMoFA, ML-BIC and ML-CV outperform ML-AIC and ML-CAIC. Interestingly, data-smoothing BYY harmony learning obtains results comparable to IMoFA and ML-BIC but at much lower computational cost.
K Applications of causality to risk management, information quality and retail
Andre Elisseeff, Ulf Holm Nielsen and Jean-Philippe Pellet, IBM Zurich Research Lab, Switzerland

In this presentation, we will show three examples where causality adds new functionality that cannot be implemented by predictive modeling or classical data mining/machine learning approaches. We will first go through the use of causality to assess the risk and compliance level of a pharmaceutical manufacturing site. We will show why a causal analysis is necessary and discuss the methodology of building a causal model from scratch using expert opinion only. We will then describe the need for causality when linking the data collection process with the level of data quality. The latter usually refers to the syntactic quality of the data (i.e., formatting, no missing values, no misspellings). Causality allows us to go a step further by identifying potential flaws in the data collection process. Finally, we will discuss the benefits of applying the concept of causality when analyzing retail or marketing data, with an emphasis on time series.