Mark J. van der Laan
Copyright (c) 2008 All rights reserved.
http://works.bepress.com/mark_van_der_laan
Recent documents in Mark J. van der Laan
Mon, 07 Jan 2008 13:53:08 PST

Why Prefer Double Robust Estimates? Illustration with Causal Point Treatment Studies
http://works.bepress.com/mark_van_der_laan/181
Thu, 16 Nov 2006 02:39:37 PST
In point treatment marginal structural models with treatment A, outcome Y and covariates W, causal parameters can be estimated under the assumption of no unobserved confounders. Three estimates can be used: the G-computation, Inverse Probability of Treatment Weighted (IPTW) or Double Robust (DR) estimates. The properties of the IPTW and DR estimates are known under an assumption on the treatment mechanism that we name the "Experimental Treatment Assignment" (ETA) assumption. We show that the DR estimating function is unbiased when the ETA assumption is violated, provided the model used to regress Y on A and W is correctly specified. The practical consequence is that the IPTW estimate is biased at finite sample size when the ETA assumption is approximately or theoretically violated, whereas the finite sample bias for the DR estimate is negligible if the model used to regress Y on A and W is correctly specified. This result also implies that estimating point treatment causal parameters using a DR estimating function is more robust than using the G-computation formula. We conclude with a methodology to construct DR estimates for a general data structure and prove that such DR estimates are more robust than their corresponding maximum likelihood estimates.
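The robustness claim in this abstract can be checked on a toy example. The Python sketch below is not taken from the paper: it assumes the standard augmented-IPTW form of the DR estimating function for the simpler parameter psi = E[Y(1)] rather than a general marginal structural model, and the distribution, the names `q_true`, `g_true`, `g_eta`, `g_bounded` and all numbers are made up for illustration. It computes exact expectations of the IPTW and DR estimating functions under a known distribution of (W, A, Y), so no sampling is involved.

```python
# Exact (population-level) expectations of IPTW and DR estimating functions
# for psi = E[Y(1)]. Everything here is a hypothetical toy setup.

P_W = {0: 0.5, 1: 0.5}                 # distribution of a binary covariate W

def q_true(a, w):
    """True outcome regression E[Y | A=a, W=w]."""
    return 1.0 + 2.0 * a + 3.0 * w

def g_true(w):
    """True treatment mechanism P(A=1 | W=w); ETA holds (bounded away from 0)."""
    return 0.2 if w == 0 else 0.8

def g_wrong(w):
    """Misspecified treatment mechanism: ignores W."""
    return 0.5

def q_wrong(a, w):
    """Misspecified outcome regression."""
    return 0.0

def iptw_mean(g, g0=g_true):
    """E[ A * Y / g(W) ] when the true mechanism is g0."""
    return sum(p * g0(w) * q_true(1, w) / g(w) for w, p in P_W.items())

def dr_mean(q, g, g0=g_true):
    """E[ A / g(W) * (Y - q(1, W)) + q(1, W) ] when the true mechanism is g0."""
    return sum(
        p * (g0(w) / g(w) * (q_true(1, w) - q(1, w)) + q(1, w))
        for w, p in P_W.items()
    )

psi = sum(p * q_true(1, w) for w, p in P_W.items())   # true value: 4.5

print(iptw_mean(g_true))           # correct g: equals psi
print(iptw_mean(g_wrong))          # wrong g: biased (5.4)
print(dr_mean(q_true, g_wrong))    # correct Q, wrong g: equals psi
print(dr_mean(q_wrong, g_true))    # wrong Q, correct g: equals psi

# ETA violated: nobody with W=1 is ever treated. A working model g_bounded
# keeps the weights finite; DR with a correct Q remains unbiased, IPTW does not.
def g_eta(w):
    return 0.2 if w == 0 else 0.0

def g_bounded(w):
    return 0.2 if w == 0 else 0.05

print(dr_mean(q_true, g_bounded, g0=g_eta))   # equals psi
print(iptw_mean(g_bounded, g0=g_eta))         # 1.5, far from psi
```

In this toy setting the DR expectation equals psi whenever either working model is correct, and it stays at psi under the ETA violation as long as the outcome regression is correct, which mirrors the statement of the abstract.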
Romain Neugebauer Statistical Models Statistical Theory and Methods Survival Analysis

Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples
http://works.bepress.com/mark_van_der_laan/180
Thu, 16 Nov 2006 02:39:35 PST
In Part I of this article we propose a general cross-validation criterion for selecting among a collection of estimators of a particular parameter of interest based on n i.i.d. observations. It is assumed that the parameter of interest minimizes the expectation (w.r.t. the distribution of the observed data structure) of a particular loss function of a candidate parameter value and the observed data structure, possibly indexed by a nuisance parameter. The proposed cross-validation criterion is defined as the empirical mean over the validation sample of the loss function at the parameter estimate based on the training sample, averaged over random splits of the observed sample. The cross-validation selector is then the estimator which minimizes this cross-validation criterion. We illustrate that this general methodology covers, in particular, the selection problems in the current literature, but results in a wide range of new selection methods. We prove a finite sample oracle inequality, and asymptotic optimality of the cross-validated selector under general conditions. The asymptotic optimality states that the cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true data generating distribution). Our general framework allows, in particular, the situation in which the observed data structure is a censored version of the full data structure of interest, and where the parameter of interest is a parameter of the full data structure distribution.
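The cross-validation criterion described in Part I can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the loss has no nuisance parameter, the data are uncensored, and the function name `cv_select`, the candidate estimators and the data are hypothetical choices that do not come from the article.

```python
import random
import statistics

def cv_select(data, estimators, loss, n_splits=10, valid_frac=0.2, seed=0):
    """Pick the candidate estimator with the smallest cross-validated risk.

    For each random split, every candidate is fitted on the training sample and
    its loss is averaged over the validation sample; these averages are then
    averaged over the splits.
    """
    rng = random.Random(seed)
    n = len(data)
    n_valid = max(1, int(round(valid_frac * n)))
    risks = {name: 0.0 for name in estimators}
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        valid = [data[i] for i in idx[:n_valid]]
        train = [data[i] for i in idx[n_valid:]]
        for name, fit in estimators.items():
            theta = fit(train)                       # estimate from training sample
            split_risk = sum(loss(theta, o) for o in valid) / len(valid)
            risks[name] += split_risk / n_splits
    best = min(risks, key=risks.get)
    return best, risks

# Hypothetical example: estimating a mean under squared error loss.
data = [9.0, 10.5, 11.0, 9.5, 10.0, 10.2, 9.8, 10.7, 9.3, 10.1]
estimators = {
    "mean": lambda train: sum(train) / len(train),
    "median": statistics.median,
    "zero": lambda train: 0.0,                       # deliberately poor candidate
}
squared_error = lambda theta, o: (o - theta) ** 2

best, risks = cv_select(data, estimators, squared_error)
print(best, risks)
```

The selector only needs a loss function and a way to fit each candidate on a training sample, which is what lets the same recipe be reused for other parameters by swapping the loss.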
As examples of the parameter of the full data distribution we consider a density of (a part of) the full data structure, a conditional expectation of an outcome, given explanatory variables, a marginal survival function of a failure time, and multivariate conditional expectation of an outcome vector, given covariates. In Part II of this article we show that the general estimating function methodology for censored data structures as provided in van der Laan and Robins (2002) yields the desired loss functions for the selection among estimators of a full-data distribution parameter of interest based on censored data. The corresponding cross-validation selector generalizes any of the existing selection methods in regression and density estimation (including model selection) to the censored data case. Under general conditions, our optimality results now show that the corresponding cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true full data distribution). In Part III of this article we propose a general estimator which is defined as follows. For a collection of subspaces and the complete parameter space, one defines an epsilon-net (i.e., a finite set of points whose epsilon-spheres cover the complete parameter space). For each epsilon and subspace one then defines a corresponding minimum cross-validated empirical risk estimator as the minimizer of cross-validated risk over the subspace-specific epsilon-net. In the special case that the loss function has no nuisance parameter, which thus covers the classical regression and density estimation cases, this epsilon and subspace specific minimum risk estimator reduces to the minimizer of the empirical risk over the corresponding epsilon-net. Finally, one selects epsilon and the subspace with the cross-validation selector. We refer to the resulting estimator as the cross-validated adaptive epsilon-net estimator.
We prove an oracle inequality for this estimator which implies that the estimator is minimax adaptive in the sense that it achieves the minimax optimal rate of convergence for the smallest of the guessed subspaces containing the true parameter value.
Mark J. van der Laan Computation Statistical Models Statistical Theory and Methods Survival Analysis

Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples
http://works.bepress.com/mark_van_der_laan/179
Thu, 16 Nov 2006 02:39:33 PST
In Part I of this article we propose a general cross-validation criterion for selecting among a collection of estimators of a particular parameter of interest based on n i.i.d. observations. It is assumed that the parameter of interest minimizes the expectation (w.r.t. the distribution of the observed data structure) of a particular loss function of a candidate parameter value and the observed data structure, possibly indexed by a nuisance parameter. The proposed cross-validation criterion is defined as the empirical mean over the validation sample of the loss function at the parameter estimate based on the training sample, averaged over random splits of the observed sample. The cross-validation selector is then the estimator which minimizes this cross-validation criterion. We illustrate that this general methodology covers, in particular, the selection problems in the current literature, but results in a wide range of new selection methods. We prove a finite sample oracle inequality, and asymptotic optimality of the cross-validated selector under general conditions.
The asymptotic optimality states that the cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true data generating distribution). Our general framework allows, in particular, the situation in which the observed data structure is a censored version of the full data structure of interest, and where the parameter of interest is a parameter of the full data structure distribution. As examples of the parameter of the full data distribution we consider a density of (a part of) the full data structure, a conditional expectation of an outcome, given explanatory variables, a marginal survival function of a failure time, and multivariate conditional expectation of an outcome vector, given covariates. In Part II of this article we show that the general estimating function methodology for censored data structures as provided in van der Laan and Robins (2002) yields the desired loss functions for the selection among estimators of a full-data distribution parameter of interest based on censored data. The corresponding cross-validation selector generalizes any of the existing selection methods in regression and density estimation (including model selection) to the censored data case. Under general conditions, our optimality results now show that the corresponding cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true full data distribution). In Part III of this article we propose a general estimator which is defined as follows. For a collection of subspaces and the complete parameter space, one defines an epsilon-net (i.e., a finite set of points whose epsilon-spheres cover the complete parameter space). For each epsilon and subspace one then defines a corresponding minimum cross-validated empirical risk estimator as the minimizer of cross-validated risk over the subspace-specific epsilon-net.
In the special case that the loss function has no nuisance parameter, which thus covers the classical regression and density estimation cases, this epsilon and subspace specific minimum risk estimator reduces to the minimizer of the empirical risk over the corresponding epsilon-net. Finally, one selects epsilon and the subspace with the cross-validation selector. We refer to the resulting estimator as the cross-validated adaptive epsilon-net estimator. We prove an oracle inequality for this estimator which implies that the estimator is minimax adaptive in the sense that it achieves the minimax optimal rate of convergence for the smallest of the guessed subspaces containing the true parameter value.
Mark J. van der Laan

Loss-Based Estimation with Cross-Validation Tree-based Multivariate Regression and Density Estimation with Right-Censored Data
http://works.bepress.com/mark_van_der_laan/178
Thu, 16 Nov 2006 02:39:31 PST
We propose a unified strategy for estimator construction, selection, and performance assessment in the presence of censoring. This approach is entirely driven by the choice of a loss function for the full (uncensored) data structure and can be stated in terms of the following three main steps. (1) Define the parameter of interest as the minimizer of the expected loss, or risk, for a full data loss function chosen to represent the desired measure of performance. Map the full data loss function into an observed (censored) data loss function having the same expected value and leading to an efficient estimator of this risk. (2) Construct candidate estimators based on the loss function for the observed data. (3) Apply cross-validation to estimate risk based on the observed data loss function and to select an optimal estimator among the candidates.
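The mapping in Step 1 can be illustrated with inverse probability of censoring weighting (IPCW), one simple way to obtain an observed data loss with the same expected value as the full data loss. The Python sketch below is an assumption-laden toy, not the article's construction: it uses a known censoring distribution that is independent of the failure time, made-up discrete distributions `P_T` and `P_C`, and plain IPCW rather than the efficient mapping the abstract refers to.

```python
from itertools import product

# Hypothetical discrete distributions; censoring C is independent of failure time T.
P_T = {1: 0.5, 2: 0.3, 3: 0.2}
P_C = {1: 0.2, 2: 0.3, 3: 0.5}

def censor_surv(t):
    """G(t) = P(C >= t), assumed known here."""
    return sum(p for c, p in P_C.items() if c >= t)

def full_loss(theta, t):
    """Full data loss: squared error for predicting T by theta."""
    return (t - theta) ** 2

def observed_loss(theta, t_obs, delta):
    """IPCW loss: uncensored observations are up-weighted by 1 / G(T)."""
    return delta * full_loss(theta, t_obs) / censor_surv(t_obs)

def full_risk(theta):
    """Expected full data loss, E[L(T, theta)]."""
    return sum(p * full_loss(theta, t) for t, p in P_T.items())

def observed_risk(theta):
    """Expected IPCW loss over the observed data (min(T, C), 1{T <= C})."""
    total = 0.0
    for (t, pt), (c, pc) in product(P_T.items(), P_C.items()):
        t_obs, delta = min(t, c), int(t <= c)
        total += pt * pc * observed_loss(theta, t_obs, delta)
    return total

for theta in (0.0, 1.7, 3.0):
    print(theta, full_risk(theta), observed_risk(theta))   # the two risks agree
```

Because the two risks coincide for every candidate value, minimizing or cross-validating the observed data loss targets the same parameter as the full data loss, which is the point of Steps 2 and 3.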
A number of common estimation procedures follow this approach in the full data situation, but depart from it when faced with the obstacle of evaluating the loss function for censored observations. Here, we argue that one can, and should, also adhere to this estimation road map in censored data situations. Tree-based methods, where the candidate estimators in Step 2 are generated by recursive binary partitioning of a suitably defined covariate space, provide a striking example of the chasm between estimation procedures for full data and censored data (e.g., regression trees as in CART for uncensored data and adaptations to censored data). Common approaches for regression trees bypass the risk estimation problem for censored outcomes by altering the node splitting and tree pruning criteria in manners that are specific to right-censored data. This article describes an application of our unified methodology to tree-based estimation with censored data. The approach encompasses univariate prediction, multivariate prediction, and density estimation, simply by defining a suitable loss function for each of these problems. The proposed method for tree-based estimation with censoring is evaluated using simulation studies and CGH copy number and survival data from breast cancer patients.
Annette M. Molinaro

Loss-Based Estimation with Cross-Validation Tree-based Multivariate Regression and Density Estimation with Right-Censored Data
http://works.bepress.com/mark_van_der_laan/177
Thu, 16 Nov 2006 02:39:29 PST
We propose a unified strategy for estimator construction, selection, and performance assessment in the presence of censoring. This approach is entirely driven by the choice of a loss function for the full (uncensored) data structure and can be stated in terms of the following three main steps.
(1) Define the parameter of interest as the minimizer of the expected loss, or risk, for a full data loss function chosen to represent the desired measure of performance. Map the full data loss function into an observed (censored) data loss function having the same expected value and leading to an efficient estimator of this risk. (2) Construct candidate estimators based on the loss function for the observed data. (3) Apply cross-validation to estimate risk based on the observed data loss function and to select an optimal estimator among the candidates. A number of common estimation procedures follow this approach in the full data situation, but depart from it when faced with the obstacle of evaluating the loss function for censored observations. Here, we argue that one can, and should, also adhere to this estimation road map in censored data situations. Tree-based methods, where the candidate estimators in Step 2 are generated by recursive binary partitioning of a suitably defined covariate space, provide a striking example of the chasm between estimation procedures for full data and censored data (e.g., regression trees as in CART for uncensored data and adaptations to censored data). Common approaches for regression trees bypass the risk estimation problem for censored outcomes by altering the node splitting and tree pruning criteria in manners that are specific to right-censored data. This article describes an application of our unified methodology to tree-based estimation with censored data. The approach encompasses univariate prediction, multivariate prediction, and density estimation, simply by defining a suitable loss function for each of these problems. The proposed method for tree-based estimation with censoring is evaluated using simulation studies and CGH copy number and survival data from breast cancer patients.
Annette M.
Molinaro Human Genetics Multivariate Analysis Statistical Models Statistical Theory and Methods Survival Analysis

The Two-Interval Line-Segment Problem
http://works.bepress.com/mark_van_der_laan/176
Thu, 16 Nov 2006 02:39:27 PST
In this paper, the NPMLE in the one-dimensional line segment problem is defined and studied, where line segments on the real line through two non-overlapping intervals are observed. The self-consistency equations for the NPMLE are defined and a quick algorithm for solving them is provided. Supnorm weak convergence to a Gaussian process and efficiency of the NPMLE are proved. The problem has a strong geological application in the study of the lifespan of species.
Mark J. van der Laan

The NPMLE in the Uniform Doubly Censored Current Status Data Model
http://works.bepress.com/mark_van_der_laan/175
Thu, 16 Nov 2006 02:39:26 PST
In biostatistical applications interest often focuses on the estimation of the distribution of time T between two consecutive events. If the initial event time is observed and the subsequent event time is only known to be larger or smaller than an observed point in time, then the data is described by the well understood singly censored current status model, also known as interval censored data, case I. Jewell, Malani and Vittinghoff (1994) extended this current status model by allowing the initial time to be unobserved, but with its distribution over an observed interval [A,B] known to be uniform; the data is referred to as doubly censored current status data. These authors used this model to handle applications in AIDS partner studies focusing on the nonparametric maximum likelihood estimate (NPMLE) of the distribution function, G, of T. The model is a submodel of the current status model, but G is essentially the derivative of the distribution function of interest, F, in the current status model.
In this paper we establish that the NPMLE of G is uniformly consistent and that the resulting estimators for square root n estimable parameters are efficient. We propose an iterative weighted Pool-Adjacent-Violator-Algorithm to compute the NPMLE of G. The rate of convergence of the NPMLE of F is also established.
Mark J. van der Laan Survival Analysis

The NPMLE in the Uniform Doubly Censored Current Status Data Model
http://works.bepress.com/mark_van_der_laan/174
Thu, 16 Nov 2006 02:39:24 PST
In biostatistical applications interest often focuses on the estimation of the distribution of time T between two consecutive events. If the initial event time is observed and the subsequent event time is only known to be larger or smaller than an observed point in time, then the data is described by the well understood singly censored current status model, also known as interval censored data, case I. Jewell, Malani and Vittinghoff (1994) extended this current status model by allowing the initial time to be unobserved, but with its distribution over an observed interval [A,B] known to be uniform; the data is referred to as doubly censored current status data. These authors used this model to handle applications in AIDS partner studies focusing on the nonparametric maximum likelihood estimate (NPMLE) of the distribution function, G, of T. The model is a submodel of the current status model, but G is essentially the derivative of the distribution function of interest, F, in the current status model. In this paper we establish that the NPMLE of G is uniformly consistent and that the resulting estimators for square root n estimable parameters are efficient. We propose an iterative weighted Pool-Adjacent-Violator-Algorithm to compute the NPMLE of G. The rate of convergence of the NPMLE of F is also established.
Mark J.
van der Laan Statistical Theory and Methods

The Nonparametric Maximum Likelihood Estimator in a Class of Doubly Censored Current Status Data Models with Application to Partner Studies
http://works.bepress.com/mark_van_der_laan/173
Thu, 16 Nov 2006 02:39:22 PST
The California Partners' Study is an ongoing investigation of heterosexual HIV transmission in partners of infected index cases (Padian, et al., 1987; Shiboski & Jewell, 1990). In addition to the HIV-status of the partner at the recruiting time one also observes the initial time of the partnership and a lower bound for the infection time of the index case. Following Jewell, Malani & Vittinghoff (1994) we assume that the infection time of the index case is uniformly distributed over the interval determined by the lower bound and the recruiting time, but no further assumptions are made. We consider an NPMLE of the distribution of the time T the partner is exposed to an infected index partner until infection. We show that the model is a doubly censored current status data model as introduced in Jewell, Malani & Vittinghoff (1994) with a special known distribution of the origin. We provide a modified iterative Weighted Pool Adjacent Violator Algorithm for computation of the NPMLE. It is shown that the NPMLE converges. In addition, we propose confidence intervals for smooth functionals of the distribution of T. Simulations show good performance of the algorithm, confidence intervals and provide a practical comparison of this NPMLE with the NPMLE if all partnerships are already in existence at the infection time of the index case as used in Shiboski & Jewell (1990). We apply our methodology to the California Partners' Study. We discuss the implications of our results for doubly censored current status data models with other known distributions of the origin.
Mark J.
van der Laan Statistical Theory and Methods

The Cross-Validated Adaptive Epsilon-Net Estimator
http://works.bepress.com/mark_van_der_laan/172
Thu, 16 Nov 2006 02:39:19 PST
Suppose that we observe a sample of independent and identically distributed realizations of a random variable. Assume that the parameter of interest can be defined as the minimizer, over a suitably defined parameter space, of the expectation (with respect to the distribution of the random variable) of a particular (loss) function of a candidate parameter value and the random variable. Examples of commonly used loss functions are the squared error loss function in regression and the negative log-density loss function in density estimation. Minimizing the empirical risk (i.e., the empirical mean of the loss function) over the entire parameter space typically results in ill-defined or too variable estimators of the parameter of interest (i.e., the risk minimizer for the true data generating distribution). In this article, we propose a cross-validated epsilon-net estimation methodology that covers a broad class of estimation problems, including multivariate outcome prediction and multivariate density estimation. An epsilon-net sieve of a subspace of the parameter space is defined as a collection of finite sets of points, the epsilon-nets indexed by epsilon, which approximate the subspace up to a resolution of epsilon. Given a collection of subspaces of the parameter space, one constructs an epsilon-net sieve for each of the subspaces. For each choice of subspace and each value of the resolution epsilon, one defines a candidate estimator as the minimizer of the empirical risk over the corresponding epsilon-net. The cross-validated epsilon-net estimator is then defined as the candidate estimator corresponding to the choice of subspace and epsilon-value minimizing the cross-validated empirical risk.
We derive a finite sample inequality which proves that the proposed estimator achieves the adaptive optimal minimax rate of convergence, where the adaptivity is achieved by considering epsilon-net sieves for various subspaces. We also address the implementation of the cross-validated epsilon-net estimation procedure. In the context of a linear regression model, we present results of a preliminary simulation study comparing the cross-validated epsilon-net estimator to the cross-validated L^1-penalized least squares estimator (LASSO) and the least angle regression estimator (LARS). Finally, we discuss generalizations of the proposed estimation methodology to censored data structures.
Mark J. van der Laan Statistical Theory and Methods Survival Analysis
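The construction of the cross-validated epsilon-net estimator can be sketched for a very small case. The Python code below is a toy under strong assumptions that are not in the abstract: a single one-dimensional subspace (a regression slope in [-4, 4]), squared error loss, deterministic V-fold splits, and hypothetical names (`make_net`, `erm_over_net`, `cv_risk`, `cv_epsilon_net`) and data. It is not the implementation discussed in the article.

```python
import math

def make_net(bound, eps):
    """Epsilon-net of [-bound, bound]: a grid whose points are eps apart,
    so every point of the interval lies within eps of some grid point."""
    k = math.ceil(bound / eps)
    return [i * eps for i in range(-k, k + 1)]

def empirical_risk(beta, data):
    """Mean squared error of the no-intercept linear fit y = beta * x."""
    return sum((y - beta * x) ** 2 for x, y in data) / len(data)

def erm_over_net(net, data):
    """Candidate estimator: empirical risk minimizer over a finite net."""
    return min(net, key=lambda beta: empirical_risk(beta, data))

def cv_risk(eps, data, bound, n_folds):
    """V-fold cross-validated risk of the candidate estimator for this eps."""
    net = make_net(bound, eps)
    total = 0.0
    for v in range(n_folds):
        train = [d for i, d in enumerate(data) if i % n_folds != v]
        valid = [d for i, d in enumerate(data) if i % n_folds == v]
        total += empirical_risk(erm_over_net(net, train), valid)
    return total / n_folds

def cv_epsilon_net(data, bound=4.0, eps_grid=(2.0, 0.5, 0.1), n_folds=5):
    """Select eps by cross-validation, then refit over the selected net."""
    best_eps = min(eps_grid, key=lambda e: cv_risk(e, data, bound, n_folds))
    return best_eps, erm_over_net(make_net(bound, best_eps), data)

# Hypothetical data: slope 1.3 plus a small deterministic perturbation.
data = [(0.5 * (i + 1), 1.3 * 0.5 * (i + 1) + 0.01 * (-1) ** i) for i in range(10)]

best_eps, beta_hat = cv_epsilon_net(data)
print(best_eps, beta_hat)
```

With several subspaces, the same selection step would run over pairs of subspace and epsilon instead of over epsilon alone; the finite net is what keeps each candidate a well-defined minimizer.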