Workshop at the NIPS*2006 conference, Whistler, Canada, on December 8, 2006. 
Determining and exploiting causal relationships is central to human reasoning and decision making. Yet most of machine learning does not attempt to uncover causal relationships in data, since this is unnecessary for making good predictions. For instance, in medical diagnosis, the abundance of a protein in serum may be used as a predictor of disease. It is not relevant to know whether the protein is a cause of the disease (resulting from a gene mutation) or a consequence (an antibody responding to inflammation): for diagnosis, the abundance of the protein is enough to indicate disease. So, in the end, machine learning might not be much concerned with causality. This statement would be true if all machine learning problems were about minimizing prediction error only. However, more and more applications require assessing the results of given actions. Such assessment is essential in many domains, including epidemiology, medicine, ecology, economics, sociology and business.
Predictive models based simply on event correlations do not model mechanisms. They allow us to make predictions in a stationary environment (no change in the distribution of any of the variables), but they do not allow us to predict the consequences of given actions. For instance, smoking and coughing are both predictive of respiratory disease: one is a cause and the other a symptom. Acting on the cause can change the disease state, but acting on the symptom cannot. It is therefore extremely important to distinguish between causes and symptoms in order to predict the consequences of actions.
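This distinction can be sketched in a toy structural model (the variable names and all probabilities below are made up for illustration): smoking influences disease, and disease produces coughing. Forcing the symptom off leaves the disease rate unchanged, while forcing the cause off lowers it.

```python
import random

random.seed(0)

def sample(do_smoke=None, do_cough=None):
    # Illustrative structural model: smoking -> disease -> cough.
    # Setting do_smoke / do_cough simulates an intervention on that variable.
    smoke = do_smoke if do_smoke is not None else (random.random() < 0.3)
    disease = random.random() < (0.4 if smoke else 0.05)
    cough = do_cough if do_cough is not None else (disease and random.random() < 0.9)
    return smoke, disease, cough

def disease_rate(n=100000, **interventions):
    # Monte-Carlo estimate of the disease rate under a given intervention
    return sum(sample(**interventions)[1] for _ in range(n)) / n

baseline = disease_rate()
no_smoke = disease_rate(do_smoke=False)   # acting on the cause
no_cough = disease_rate(do_cough=False)   # acting on the symptom
# suppressing the symptom leaves the disease rate unchanged;
# suppressing the cause lowers it
```

Under this (purely hypothetical) model, both smoking and coughing are correlated with disease, yet only the intervention on the cause changes the outcome.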
In the past few years, much effort in machine learning has been devoted to feature selection, the art of uncovering significant dependencies between the input variables and a desired outcome. State-of-the-art feature selection methods can select relevant features among millions of distracters with fewer than a hundred examples. In contrast, causal models usually deal with just a few variables and quasi-perfect knowledge of the variable distribution, which implies an abundance of training examples. Recent developments in feature selection and causal inference can therefore cross-fertilize.
This workshop is especially targeted at researchers interested in understanding the outcome of uncontrolled or partially controlled experiments. Such experiments occur in many areas (medicine, biology, sociology, economy, finance and marketing, etc.) and are typical of studies where the subjects cannot be isolated from their environment.
Morning  
7.30-7.45am 
Welcome and introduction Andre Elisseeff, IBM Zurich Research Lab, Switzerland 
7.45-8.45am 
An Introduction to Causal Modeling and Discovery Using Graphical Models (ppt slides) Gregory F. Cooper, University of Pittsburgh, USA This tutorial will describe fundamental approaches for representing causal relationships with graphical models, particularly causal Bayesian networks. The tutorial will also describe several representative algorithms for discovering causal relationships from observational and/or experimental data. The strengths and limitations of the representation and algorithms will be discussed. 
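To give a flavor of what a causal Bayesian network buys you over a plain probabilistic one, here is a toy sketch (the chain Cloudy → Rain → WetGrass and all conditional probability tables are invented for illustration). Intervening on a variable corresponds to the truncated factorization: the intervened variable's own conditional probability term is deleted from the joint.

```python
# Toy causal Bayesian network: Cloudy -> Rain -> WetGrass (illustrative CPTs)
P_c = {True: 0.5, False: 0.5}
P_r = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}  # P(r | c)
P_w = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}  # P(w | r)

def joint(c, r, w, do_w=None):
    # Truncated factorization: intervening on W removes its CPT factor
    p = P_c[c] * P_r[c][r]
    if do_w is None:
        p *= P_w[r][w]
    elif w != do_w:
        p = 0.0
    return p

def prob_rain(w_val, do=False):
    # P(Rain | W = w_val) under observation (do=False) or intervention (do=True)
    kwargs = {"do_w": w_val} if do else {}
    num = sum(joint(c, True, w_val, **kwargs) for c in (True, False))
    den = sum(joint(c, r, w_val, **kwargs)
              for c in (True, False) for r in (True, False))
    return num / den
```

Observing wet grass raises the probability of rain, whereas *making* the grass wet tells us nothing about rain: `prob_rain(True)` exceeds the prior, while `prob_rain(True, do=True)` equals the marginal P(Rain) = 0.45 in this toy model.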
8.45-9.00am  Poster highlights: D, F, A, E, I, G, H 
9.00-9.30am  Break (Poster and Coffee) 
9.30-9.55am 
Contrasting Feature Selection for Causal Inference with Feature
Selection for Classification Frederick Eberhardt (joint work with David Danks and Peter Spirtes), Carnegie Mellon University, USA We will contrast the properties that make a set of features useful for the purpose of classification with the properties that make a set of features useful for several different kinds of causal inference. We will also explain how these differences make feature selection for causal inference harder to scale up to large numbers of variables than feature selection for classification, and describe several approaches that have recently been taken to extend feature selection algorithms for causal inference to larger numbers of features. 
9.55-10.20am 
Inferring causal directions by evaluating the complexity of
conditional distributions (slides) X. Sun, D. Janzing, B. Schoelkopf, Max Planck Institute, Tuebingen, Germany We propose a new approach to infer the causal structure that has generated the observed statistical dependences among n random variables. The idea is that the factorization of the joint measure of cause and effect into P(cause)P(effect|cause) typically leads to simpler conditionals than non-causal factorizations. To evaluate the complexity of the conditionals we have tried two methods. First, we have compared them to those which maximize the conditional entropy subject to the observed first and second moments, since we consider the latter the simplest conditionals. Second, we have fitted the data with conditional probability measures that are exponents of functions in an RKHS and defined the complexity by a Hilbert-space seminorm. Such a complexity measure has several properties that are useful for our purpose. We describe some encouraging results with both methods applied to real-world data. Moreover, we have combined constraint-based approaches to causal discovery (i.e., methods using only information on conditional statistical dependences) with our method in order to distinguish between causal hypotheses which are equivalent with respect to the imposed independences. Furthermore, we compare the performance to Bayesian approaches to causal inference. 
10.20-10.30am  Poster highlights: B, C, J 
10.30am  Adjourn 
Afternoon  
4.00-4.45pm 
Feature Selection and Causal Discovery (ppt slides) Isabelle Guyon, Clopinet, USA, André Elisseeff, IBM Research, Switzerland, and Constantin Aliferis, Vanderbilt University, USA This presentation introduces the problems posed by feature selection and by learning causal dependencies. What is feature selection? Why is it hard? What works best in practice? How can we make progress using causality? Can causal discovery benefit from feature selection? We argue that machine learning researchers should abandon the usual motto of predictive modeling, “we don’t care about causality”: feature selection may benefit from introducing a notion of causality.

4.45-5.10pm 
Using SVM Weight-Based Methods to Identify Causally Relevant and
Non-Causally Relevant Variables (ppt slides) Alexander Statnikov, Douglas Hardin and Constantin Aliferis, Vanderbilt University, USA Variable selection is often used to derive insights into the causal structure of the data-generating process. For example, in biology and medicine, biomarkers are sought to better understand the factors that cause disease, determine its progression, and identify the members of the relevant molecular pathways. We conducted a simulation experiment to study SVM weight-based ranking and variable selection methods using two network structures that are often encountered in biological systems and are expected to occur in many other settings as well. We attempted to recover both causally and non-causally relevant variables using SVM weight-based methods under a variety of experimental settings (data-generating network, noise level, sample size, and SVM penalty parameter). Our experiments show that SVMs can sometimes produce excellent classifiers that assign higher weights to irrelevant variables than to relevant ones. Applying the recursive variable selection technique SVM-RFE does not remedy this problem. More importantly, we found that when it comes to identifying causally relevant variables, SVM weight-based methods can fail by assigning higher weight to, or selecting (in the context of SVM-RFE), variables that are relevant but non-causally so. Furthermore, even irrelevant variables can receive higher weights or be selected more often than the causally relevant ones. These results are corroborated by a theoretical analysis as well as recent research employing high-fidelity resimulation in biological and medical domains. Thus, the totality of empirical evidence so far suggests that practitioners must exercise great caution when interpreting current state-of-the-art SVM variable selection results causally. 
We show that this problem is not linked to the specific variable selection techniques studied, but rather that the maximum-margin inductive bias, as typically employed by SVM-based methods, is locally causally inconsistent. New SVM methods may be needed to address this issue, and this is an exciting and challenging area of research. 
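The underlying phenomenon is not specific to SVMs and can be sketched with any strength-of-association ranking (the network, coefficients and names below are hypothetical, not taken from the talk): when a near-deterministic symptom S is a consequence of the target Y, while the true cause C influences Y only weakly, S outranks C under a correlation-based criterion.

```python
import random
import statistics

random.seed(1)

def corr(xs, ys):
    # Pearson correlation coefficient
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

n = 20000
cause = [random.gauss(0, 1) for _ in range(n)]
# weak causal link: the target depends on the cause plus substantial noise
target = [0.3 * c + random.gauss(0, 1) for c in cause]
# near-deterministic symptom: a downstream consequence of the target
symptom = [t + random.gauss(0, 0.1) for t in target]

r_cause = abs(corr(cause, target))
r_symptom = abs(corr(symptom, target))
# the symptom outranks the cause under the correlation criterion
```

A predictive ranking is behaving correctly here (the symptom really is the better predictor); the point is that this ranking cannot be read causally.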
5.10-5.35pm 
Discovery of linear acyclic models in the presence
of latent classes using ICA mixtures (pdf, slides) Shohei Shimizu (1,2) and Aapo Hyvärinen (1) 1. Helsinki Institute for Information Technology, Finland 2. The Institute of Statistical Mathematics, Tokyo, Japan Causal discovery is the task of finding plausible causal relationships from statistical data. Such methods rely on various assumptions about the data-generating process to identify it from uncontrolled observations. We have recently proposed a causal discovery method based on independent component analysis (ICA) called LiNGAM, showing how to completely identify the data-generating process under the assumptions of linearity, non-Gaussianity, and no hidden variables. In this paper, after briefly recapitulating this approach, we extend the framework to cases where latent (hidden) classes are present. The model identification can be accomplished using ICA mixtures. Experiments confirm the performance of the proposed method. See the report and JMLR paper for more details. 
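The key identifiability idea exploited by LiNGAM can be illustrated for a single cause-effect pair (this toy sketch is not the LiNGAM algorithm itself, and the dependence score used is a crude stand-in for a proper independence test): with non-Gaussian noise, the regression residual is independent of the regressor only in the true causal direction.

```python
import random

random.seed(0)

def fit_residuals(xs, ys):
    # least-squares slope of y on x (both are zero-mean by construction)
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return [y - b * x for x, y in zip(xs, ys)]

def magnitude_dependence(xs, ys):
    # crude dependence score: |correlation between |x| and |y||
    # (zero for independent variables, nonzero when magnitudes are coupled)
    ax, ay = [abs(x) for x in xs], [abs(y) for y in ys]
    mx, my = sum(ax) / len(ax), sum(ay) / len(ay)
    num = sum((a - mx) * (b - my) for a, b in zip(ax, ay))
    dx = sum((a - mx) ** 2 for a in ax) ** 0.5
    dy = sum((b - my) ** 2 for b in ay) ** 0.5
    return abs(num / (dx * dy))

n = 50000
x = [random.uniform(-1, 1) for _ in range(n)]        # non-Gaussian cause
y = [0.8 * xi + random.uniform(-1, 1) for xi in x]   # linear effect

dep_forward = magnitude_dependence(x, fit_residuals(x, y))   # fit x -> y
dep_backward = magnitude_dependence(y, fit_residuals(y, x))  # fit y -> x
# the residual is (nearly) independent of the regressor only in the
# true causal direction, so dep_forward is much smaller than dep_backward
```

With Gaussian noise both scores would vanish, which is exactly why non-Gaussianity is essential to the method.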
5.35-6.00pm  Break (Poster and Coffee) 
6.00-6.25pm 
LassoOrderSearch: Learning Directed Graphical Model Structure using
L1-Penalized Regression and Order Search (slides, short paper) Mark Schmidt and Kevin Murphy, University of British Columbia, Canada To speed up the search for the best DAG structure given data, we propose to use the regularization paths from L1-penalized regression to rapidly find the best set of parents given a node ordering. This reduces the complexity of evaluating a node ordering from O(N d^3 d^K) to O(N d^5) time (O(N d^4) in the linear-Gaussian case), where N is the number of data cases, d is the number of nodes, and K is the maximum number of parents (fan-in). Not only is this approach much faster than previous approaches, but our linear (instead of exponential) dependence on the number of parents allows us to tractably fit much more complex models. We provide experimental comparisons with several other heuristic search techniques, illustrating the effectiveness of this approach. 
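The core subroutine of this style of method, selecting a node's parents from its predecessors in an ordering via L1-penalized regression, can be sketched as follows (the coordinate-descent lasso, the toy chain network and the penalty value are illustrative choices, not the authors' implementation):

```python
import random

random.seed(0)

def soft(rho, lam):
    # soft-thresholding operator used in coordinate-descent lasso
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(X, y, lam, iters=200):
    # minimise 0.5 * ||y - Xw||^2 + lam * ||w||_1 by coordinate descent
    d = len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            # partial residual excluding feature j
            r = [yi - sum(w[k] * row[k] for k in range(d) if k != j)
                 for yi, row in zip(y, X)]
            rho = sum(row[j] * ri for row, ri in zip(X, r))
            z = sum(row[j] ** 2 for row in X)
            w[j] = soft(rho, lam) / z
    return w

# toy chain x1 -> x2 -> x3, with the node ordering assumed known
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 1) for a in x1]
x3 = [b + random.gauss(0, 0.5) for b in x2]

# select the parents of x3 among its predecessors {x1, x2}:
# the L1 penalty drives the coefficient of the non-parent x1 exactly to zero
w = lasso([list(row) for row in zip(x1, x2)], x3, lam=150.0)
parents = [name for name, wj in zip(["x1", "x2"], w) if abs(wj) > 1e-9]
```

In the full method this parent-selection step is run for every node and wrapped in a search over orderings; the sketch shows only the inner regression.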
6.25-7.00pm  Discussion 
7.00pm  Adjourn 
A 
Application and Comparative Evaluation of Causal and Non-Causal
Feature Selection Algorithms for Biomarker Discovery in
High-Throughput Biomedical Datasets
C.F. Aliferis (1,2), A. Statnikov (1), I. Tsamardinos (1,3), E. Kokkotou (4) and P.P. Massion (1,2) 1. Vanderbilt University School of Medicine, USA 2. Vanderbilt Ingram Comprehensive Cancer Center, USA 3. University of Crete, Greece 4. Beth Israel Deaconess Medical Center and Harvard Medical School, USA 
B 
Bayesian structure learning using dynamic programming and MCMC (slides, short paper, more info) Daniel Eaton and Kevin Murphy, University of British Columbia, Canada. We show how to significantly speed up MCMC sampling of DAG structures by using a powerful nonlocal proposal based on Koivisto’s dynamic programming (DP) algorithm, which computes the exact marginal posterior edge probabilities by analytically summing over orders. Furthermore, we show how sampling in DAG space can avoid subtle biases that are introduced by approaches that work only with orders, such as Koivisto’s DP algorithm and MCMC order samplers. 
C 
Supervised Feature Selection via Dependence
Estimation (short paper, more info) Le Song (1), Alex Smola (2), Arthur Gretton (3) and Karsten Borgwardt (4) 1. NICTA and University of Sydney, Australia 2. SML, National ICT Australia 3. MPI for Biological Cybernetics, Germany 4. Ludwig-Maximilians-University, Germany We introduce a framework of feature filtering for supervised learning that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the data and the labels. The key idea is that good features should maximize such dependence. Feature selection for various supervised learning problems (including binary, multiclass and regression problems) can be unified under this framework, and the solution is approximated using a backward-elimination algorithm. In particular, for binary problems, HSIC is related to criteria such as Pearson's correlation, the signal-to-noise ratio, Maximum Mean Discrepancy and Kernel Target Alignment. We conducted experiments on various real-world datasets, which demonstrate the usefulness of this framework. 
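A minimal sketch of the biased empirical HSIC estimator, (1/n²) tr(KHLH) with Gaussian kernels, applied as a filter criterion (the toy data and kernel bandwidth are illustrative assumptions; a real implementation would vectorize this):

```python
import math
import random

random.seed(0)

def gram(xs, sigma=1.0):
    # Gaussian-kernel Gram matrix for scalar inputs
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs]
            for a in xs]

def center(K):
    # K -> HKH with the centering matrix H = I - (1/n) 11^T
    n = len(K)
    row = [sum(r) / n for r in K]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + tot for j in range(n)]
            for i in range(n)]

def hsic(xs, ys):
    # biased empirical HSIC: (1/n^2) * trace(K H L H)
    n = len(xs)
    Kc = center(gram(xs))
    L = gram(ys)
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2

n = 200
labels = [random.choice([-1.0, 1.0]) for _ in range(n)]
relevant = [y + random.gauss(0, 0.5) for y in labels]  # depends on the label
noise = [random.gauss(0, 1) for _ in range(n)]         # independent distracter

h_relevant = hsic(relevant, labels)
h_noise = hsic(noise, labels)
# the relevant feature scores far higher than the distracter
```

Backward elimination under this framework would repeatedly drop the feature whose removal least decreases the HSIC score of the remaining set.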
D 
From Perturbation Data to a Causal Functional Pathway Representation (short paper) Nir Yosef (1), Alon Kaufman (2) and Eytan Ruppin (3) 1. School of Computer Science, Tel Aviv University, Tel Aviv, Israel 2. Center of Neural Computation, Hebrew University, Jerusalem, Israel 3. School of Medicine, Tel Aviv University, Tel Aviv, Israel 
E  Information Flows in Causal Networks (pdf) Nihat Ay (1,2) and Daniel Polani (3) 1. Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany 2. Santa Fe Institute, Santa Fe, USA 3. University of Hertfordshire, Hatfield, UK We introduce a notion of causal independence based on virtual interventions, a fundamental concept of the theory of causal networks. Causal independence allows one to define a measure for the strength of a causal effect. We call this measure information flow. A theoretical result, examples and comparisons with conventional mutual information measures are given. 
F 
Causal Discovery Algorithms based on Y Structures (ppt slides,short paper) Subramani Mani (1) and Gregory F. Cooper (2) 1. Vanderbilt University, USA 2. University of Pittsburgh, USA 
G 
Principled selection of impure measures for consistent learning of
linear latent variable models (pdf) Ricardo Silva, Gatsby Computational Neuroscience Unit, University College London In previous work, we have developed a principled way of learning the causal structure of linear latent variable models (Silva et al., 2006). However, we have considered only the case of models with pure measures. Pure measures are observed variables that measure no more than one latent variable. This paper presents theoretical extensions that justify the selection of some types of impure measures, allowing us to discover hidden variables that could not be identified in the previous case. 
H 
Experimental Learning of Causal Models with
Latent Variables (extended abstract) Sam Maes (1), Stijn Meganck (2) and Philippe Leray (1) 1. INSA Rouen, France 2. Vrije Universiteit Brussel, Belgium This article discusses graphical models that can handle latent variables without explicitly modeling them quantitatively. Several paradigms exist for such problem domains; two of them are semi-Markovian causal models and maximal ancestral graphs. Applying these techniques to a problem domain consists of several steps, typically: structure learning from observational and experimental data, parameter learning, probabilistic inference, and quantitative causal inference. A problem is that research in each of the existing approaches focuses on only one or a few of the steps involved in modeling a problem with latent variables. In other work we have investigated the entire process, from observational and experimental data up to different types of efficient inference. The goal of this article is to focus on learning the structure of causal models in the presence of latent variables from a combination of observational and experimental data. 
I 
Evaluation of Local Causal Discovery Algorithm using Simulated Gene
Network of Malignant Mesotheliomas in Mice (short paper)
Changwon Yoo, Erik M. Brilz, Mark Pershouse and Elizabeth Putnam, University of Montana, Missoula, USA Recent advances in cancer research have provided new insights into the molecular mechanisms underlying tumor progression. It is thus not surprising that developing methods to discover causal gene regulation pathways from high-throughput data, such as DNA microarrays, is becoming an increasingly important problem in cancer research. To this end, it is desirable to compare experiments in which some genes undergo complete interventions, e.g., gene knockouts, with experiments of the system under no interventions. However, it is expensive and sometimes difficult (if not impossible) to conduct wet-lab experiments with complete interventions of genes in animal models, e.g., a mouse model. It would therefore be helpful to discover promising causal relationships among genes from observational data alone, in order to identify promising genes to perturb in the system that can later be verified in wet laboratories. This paper describes the method and evaluation of a causal analysis algorithm, the Equivalence Local Implicit latent variable scoring Method (EquLIM), on data generated from a gene network simulator that implements the process of malignant mesotheliomas in mice. We first apply EquLIM to a small amount of simulated data with no interventions (≤ 100 cases) and compare the results to a previous implementation of EquLIM. The implementation described in this paper showed better prediction results in terms of positive predictive value and area under the receiver operating characteristic curve. 
J 
Local Factor Analysis with Automatic Model
Selection and Data Smoothing Based Regularization (short paper)
Lei Shi and Lei Xu, Chinese University of Hong Kong, Shatin, Hong Kong Local factor analysis (LFA) is regarded as an efficient approach to local feature extraction and dimensionality reduction. We further investigate an automatic BYY harmony data-smoothing LFA (LFA-HDS) from the Bayesian Ying-Yang (BYY) harmony learning point of view. On the level of regularization, a data-smoothing-based regularization technique is adapted into this automatic LFA-HDS learning for problems with small sample sizes, while on the level of model selection, the proposed automatic LFA-HDS algorithm performs parameter learning with automatic determination of both the number of components and the number of factors in each component. A comparative study has been conducted on simulated data sets and several real-world data sets. The algorithm is compared not only with a recent approach called Incremental Mixture of Factor Analysers (IMoFA) but also with the conventional two-stage implementation of maximum likelihood (ML) plus model selection, namely using the EM algorithm for parameter learning on a series of candidate models and selecting the best candidate by AIC, CAIC, BIC, or cross-validation (CV). Experiments have shown that IMoFA, ML-BIC and ML-CV outperform ML-AIC and ML-CAIC. Interestingly, the data-smoothing BYY harmony learning obtains results comparable to IMoFA and ML-BIC but at much lower computational cost. 
K 
Applications of causality to risk management, information quality
and retail Andre Elisseeff, Ulf Holm Nielsen and Jean-Philippe Pellet, IBM Zurich Research Lab, Switzerland In this presentation, we will show three examples where causality adds functionality that cannot be implemented by predictive modeling or classical data mining/machine learning approaches. We will first go through the use of causality to assess the risk and compliance level of a pharmaceutical manufacturing site; we will show why a causal analysis is necessary and discuss the methodology of building a causal model from scratch using expert opinion only. We will then describe the need for causality when linking the data collection process to the level of data quality. The latter usually refers to the quality of the syntax of the data (i.e., formatting, no missing values, no misspellings); causality allows us to go a bit further by identifying potential flaws in the data collection process itself. Finally, we will mention the benefits of applying the concept of causality when analyzing retail or marketing data, with an emphasis on time series. 