Sandrine Dudoit
Copyright (c) 2008 All rights reserved.
http://works.bepress.com/sandrine_dudoit
Recent documents in Sandrine Dudoit
Mon, 17 Nov 2008 07:01:53 PST

Loss-based estimation with evolutionary algorithms and cross-validation
http://works.bepress.com/sandrine_dudoit/51
Mon, 26 Nov 2007 12:23:40 PST
Many statistical inference methods rely upon selection procedures to estimate a parameter of the joint distribution of explanatory and outcome data, such as the regression function. Within the general framework for loss-based estimation of Dudoit and van der Laan, this project proposes an evolutionary algorithm (EA) as a procedure for risk optimization. We also analyze the size of the parameter space for polynomial regression under interaction constraints along with constraints on either the polynomial or variable degree.
David Shilane
Loss-Based Estimation with Cross-Validation

A deletion/substitution/addition algorithm for classification neural networks, with applications to biomedical data
http://works.bepress.com/sandrine_dudoit/50
Fri, 16 Nov 2007 22:33:25 PST
Neural networks are a popular machine learning tool, particularly in applications such as protein structure prediction; however, overfitting can pose an obstacle to their effective use. Due to the large number of parameters in a typical neural network, one may obtain a network fit that perfectly predicts the learning data, yet fails to generalize to other data sets. One way of reducing the size of the parameter space is to alter the network topology so that some edges are removed; however, it is often not immediately apparent which edges should be eliminated. We propose a data-adaptive method of selecting an optimal network architecture using a deletion/substitution/addition algorithm.
Results of this approach to classification are presented on simulated data and the breast cancer data of Wolberg and Mangasarian (1990).
Blythe Durbin
Loss-Based Estimation with Cross-Validation

Resampling-based empirical Bayes multiple testing procedures for controlling generalized tail probability and expected value error rates: Focus on the false discovery rate and simulation stud
http://works.bepress.com/sandrine_dudoit/49
Fri, 16 Nov 2007 22:17:44 PST
This article proposes resampling-based empirical Bayes multiple testing procedures for controlling a broad class of Type I error rates, defined as generalized tail probability (gTP) error rates, $gTP(q,g) = \Pr(g(V_n,S_n) > q)$, and generalized expected value (gEV) error rates, $gEV(g) = \mathrm{E}[g(V_n,S_n)]$, for arbitrary functions $g(V_n,S_n)$ of the numbers of false positives $V_n$ and true positives $S_n$. Of particular interest are error rates based on the proportion $g(V_n,S_n) = V_n/(V_n + S_n)$ of Type I errors among the rejected hypotheses, such as the false discovery rate (FDR), $FDR = \mathrm{E}[V_n/(V_n+S_n)]$. The proposed procedures offer several advantages over existing methods. They provide Type I error control for general data generating distributions, with arbitrary dependence structures among variables. Gains in power are achieved by deriving rejection regions based on guessed sets of true null hypotheses and null test statistics randomly sampled from joint distributions that account for the dependence structure of the data. The Type I error and power properties of an FDR-controlling version of the resampling-based empirical Bayes approach are investigated and compared to those of widely used FDR-controlling linear step-up procedures in a simulation study.
The Type I error and power trade-off achieved by the empirical Bayes procedures under a variety of testing scenarios allows this approach to be competitive with or outperform the Storey and Tibshirani [2003] linear step-up procedure, as an alternative to the classical Benjamini and Hochberg [1995] procedure.
Sandrine Dudoit
Multiple Hypothesis Testing

Prognosis of stage II colon cancer by non-neoplastic mucosa gene expression profiling
http://works.bepress.com/sandrine_dudoit/48
Fri, 16 Nov 2007 22:11:08 PST
We assessed the feasibility of building a prognosis predictor (PP), based on non-neoplastic mucosa microarray gene expression measures, for stage II colon cancer patients. Non-neoplastic colonic mucosa mRNA samples from 24 patients (10 with a metachronous metastasis, 14 with no recurrence) were profiled using the Affymetrix HGU133A GeneChip. Patients were randomly divided 1000 times into a training set (TS) of size 16 and a validation set (VS) of size 8. For each TS/VS split, a 70-gene PP, identified on the TS by selecting the 70 most differentially expressed genes and applying diagonal linear discriminant analysis, was used to predict the prognoses of VS patients. Mean prognosis prediction performances of the 70-gene PP were 81.8% for accuracy, 73.0% for sensitivity and 87.1% for specificity. Informative genes suggested branching signal-transduction pathways with possible extensive networks between individual pathways. They also included genes coding for proteins involved in immune surveillance. In conclusion, our study suggests that one can build an accurate PP for stage II colon cancer patients, based on non-neoplastic mucosa microarray gene expression measures.
A. Barrier
Microarray Data Analysis

Oracle inequalities for multi-fold cross validation
http://works.bepress.com/sandrine_dudoit/47
Fri, 16 Nov 2007 21:59:28 PST
We consider choosing an estimator or model from a given class by cross validation consisting of holding a nonnegligible fraction of the observations out as a test set. We derive bounds that show that the risk of the resulting procedure is (up to a constant) smaller than the risk of an oracle plus an error which typically grows logarithmically with the number of estimators in the class. We extend the results to penalized cross validation in order to control unbounded loss functions. Applications include regression with squared and absolute deviation loss and classification under Tsybakov's condition.
Aad W. van der Vaart
Loss-Based Estimation with Cross-Validation

The cross-validated adaptive epsilon-net estimator
http://works.bepress.com/sandrine_dudoit/46
Fri, 16 Nov 2007 21:55:38 PST
Suppose that we observe a sample of independent and identically distributed realizations of a random variable, and a parameter of interest can be defined as the minimizer, over a suitably defined parameter set, of the expectation of a (loss) function of a candidate parameter value and the random variable. Examples include squared error loss in regression and negative log-density loss in density estimation. Minimizing the empirical risk (i.e., the empirical mean of the loss function) over the entire parameter set may result in ill-defined or too variable estimators of the parameter of interest. In this article, we propose a cross-validated ε-net estimation method, which uses a collection of submodels and a collection of ε-nets over each submodel. For each submodel s and each resolution level ε, the minimizer of the empirical risk over the corresponding ε-net is a candidate estimator. Next we select from these estimators (i.e.,
select the pair (s,ε)) by multi-fold cross-validation. We derive a finite sample inequality that shows that the resulting estimator is as good as an oracle estimator that uses the best submodel and resolution level for the unknown true parameter. We also address the implementation of the estimation procedure, and in the context of a linear regression model we present results of a preliminary simulation study comparing the cross-validated ε-net estimator to the cross-validated L1-penalized least squares estimator (LASSO) and the least angle regression estimator (LARS).
Mark J. van der Laan
Loss-Based Estimation with Cross-Validation

Colon cancer prognosis prediction by gene expression profiling
http://works.bepress.com/sandrine_dudoit/45
Sun, 26 Nov 2006 17:31:15 PST
This study assessed the feasibility of building a prognosis predictor, based on microarray gene expression measures, in stage II and III colon cancer patients. Tumour (T) and non-neoplastic mucosa (NM) mRNA samples from 18 patients (nine with a recurrence, nine with no recurrence) were profiled using the Affymetrix HGU133A GeneChip. The k-nearest neighbour method was used for prognosis prediction using T and NM gene expression measures. Six-fold cross-validation was applied to select the number of neighbours and the number of informative genes to include in the predictors. Based on this information, one T-based and one NM-based predictor were proposed and their accuracies were estimated by double cross-validation. In six-fold cross-validation, the lowest numbers of informative genes giving the lowest numbers of false predictions (two out of 18) were 30 and 70 with the T and NM gene expression measures, respectively. A 30-gene T-based predictor and a 70-gene NM-based predictor were then built, with estimated accuracies of 78% and 83%, respectively.
This study suggests that one can build an accurate prognosis predictor for stage II and III colon cancer patients, based on gene expression measures, and one can use either tumour or non-neoplastic mucosa for this purpose.
Alain Barrier
Microarray Data Analysis

Ischemic preconditioning modulates the expression of several genes, leading to the overproduction of IL-1Ra, iNOS, and Bcl-2 in a human model of liver ischemia-reperfusion
http://works.bepress.com/sandrine_dudoit/44
Sun, 26 Nov 2006 17:22:46 PST
Ischemia triggers an inflammatory response that precipitates cell death during reperfusion. Several studies have shown that tissues are protected by ischemic preconditioning (IP) consisting of 10 min of ischemia followed by 10 min of reperfusion just before ischemia. The molecular basis of this protective effect is poorly understood. We used cDNA arrays (20K) to compare global gene expression in liver biopsies from living human liver donors who underwent IP (n=7) or not (n=7) just before liver devascularization. Microarray data were analyzed using a paired t test with a type I error rate fixed at $\alpha = 2.5 \times 10^{-6}$ (Bonferroni correction). We found that 60 genes were differentially expressed (36 over- and 24 underexpressed in the preconditioning group). After IP, the most significantly overexpressed gene was IL-1Ra. This was confirmed by immunoblotting. Differentially expressed genes included genes involved in apoptosis (NOD2, ephrin-A1, and calpain) and in carbohydrate metabolism. A significant increase in the amount of the anti-apoptotic protein Bcl-2 in preconditioned livers but no change in the cleavage of procaspase-3, -8, and -9 was observed. We also observed an increase in the amount of the inducible nitric oxide synthase.
Therefore, the benefits of IP may be associated with the overproduction of IL-1Ra, Bcl-2, and NO countering the proinflammatory and proapoptotic effects generated during ischemia-reperfusion.
Barrier, A., Olaya, N., Chiappini, F., Roser, F., Scatton, O., Artus, C., Franc, B., Dudoit, S., Flahault, A., Debuire, B., Azoulay, D., Lemoine, A. Ischemic preconditioning modulates the expression of several genes, leading to the overproduction of IL-1Ra, iNOS, and Bcl-2 in a human model of liver ischemia-reperfusion.
Alain Barrier
Microarray Data Analysis

Asymptotics of cross-validated risk estimation in estimator selection and performance assessment
http://works.bepress.com/sandrine_dudoit/43
Sat, 25 Nov 2006 10:06:15 PST
Risk estimation is an important statistical question for the purposes of selecting a good estimator (i.e., model selection) and assessing its performance (i.e., estimating generalization error). This article introduces a general framework for cross-validation and derives distributional properties of cross-validated risk estimators in the context of estimator selection and performance assessment. Arbitrary classes of estimators are considered, including density estimators and predictors for both continuous and polychotomous outcomes. Results are provided for general full data loss functions (e.g., absolute and squared error, indicator, negative log density). A broad definition of cross-validation is used in order to cover leave-one-out cross-validation, V-fold cross-validation, Monte Carlo cross-validation, and bootstrap procedures. For estimator selection, finite sample risk bounds are derived and applied to establish the asymptotic optimality of cross-validation, in the sense that a selector based on a cross-validated risk estimator performs asymptotically as well as an optimal oracle selector based on the risk under the true, unknown data generating distribution.
The asymptotic results are derived under the assumption that the size of the validation sets converges to infinity and hence do not cover leave-one-out cross-validation. For performance assessment, cross-validated risk estimators are shown to be consistent and asymptotically linear for the risk under the true data generating distribution and confidence intervals are derived for this unknown risk. Unlike previously published results, the theorems derived in this and our related articles apply to general data generating distributions, loss functions (i.e., parameters), estimators, and cross-validation procedures.
Sandrine Dudoit
Loss-Based Estimation with Cross-Validation

Gene expression profiling of nonneoplastic mucosa may predict clinical outcome of colon cancer patients
http://works.bepress.com/sandrine_dudoit/42
Sat, 25 Nov 2006 10:00:13 PST
PURPOSE This study assessed the feasibility of building a prognosis predictor, based on microarray gene expression measures, in Stage II and III colon cancer patients. METHODS Tumor and nonneoplastic mucosa mRNA samples from 12 colon cancer patients were profiled using the Affymetrix HGU133A GeneChip. Six of 12 patients experienced a metachronous metastasis, whereas the 6 others remained disease-free for more than five years. Three datasets were constituted, including, respectively, the gene expression measures in tumor samples (T), in adjacent nonneoplastic mucosa samples (A), and the log-ratio of the gene expression measures (L). The step-down procedure of Westfall and Young and the k-nearest neighbor class prediction method were applied to T, A, and L. Leave-one-out cross-validation was used to estimate the generalization error of predictors based on different numbers of genes and neighbors.
RESULTS The most frequent results were one false prediction with the A-based predictors (95 percent) and two false predictions with the T- and L-based predictors (65 and 60 percent, respectively). A-based predictors were more stable (i.e., less sensitive to changes of parameters, such as numbers of genes and neighbors) than T- and L-based predictors. Informative genes in A-based predictors included genes involved in the oxidative and phosphorylative mitochondrial metabolism and genes involved in cell-signaling pathways and their receptors. CONCLUSIONS This study suggests that one can build a prognosis predictor for Stage II and III colon cancer patients, based on microarray gene expression measures, and suggests the potential usefulness of nonneoplastic mucosa for this purpose.
Alain Barrier
Microarray Data Analysis
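
The leave-one-out cross-validation of k-nearest neighbor predictors mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`knn_predict`, `loocv_errors`), the Euclidean distance, and the toy two-feature data are all assumptions made for illustration, and the gene selection step (which would normally be repeated inside each cross-validation fold) is omitted.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k):
    """Return the majority label among the k training points nearest to query."""
    # Rank training points by Euclidean distance to the query (an assumed metric).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_errors(xs, ys, k):
    """Count false predictions when each sample is held out exactly once."""
    errors = 0
    for i in range(len(xs)):
        # Train on every sample except the i-th, then predict the held-out one.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) != ys[i]:
            errors += 1
    return errors


# Made-up toy data: 12 samples, 2 features, two well-separated classes.
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1),
      (2.0, 2.1), (2.2, 2.0), (2.1, 2.3), (2.3, 2.2), (2.2, 2.4), (2.4, 2.1)]
ys = ["recurrence"] * 6 + ["disease-free"] * 6

# Compare the number of false predictions across several neighborhood sizes.
for k in (1, 3, 5):
    print(k, loocv_errors(xs, ys, k))
```

Using odd values of k with two classes avoids tied votes; with real expression data one would typically also vary the number of selected genes, as the abstract describes.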