New Statistical Techniques
We have been developing three new, interrelated statistical techniques. The first has been published as a Ph.D. thesis; we are now preparing journal papers and a web interface for it. The second has been published, and we are now completing a web interface for it. The third is an offshoot of both, and we will pursue it next.
TRK “Worst-Case” Regression
In Trotter’s 2011 Ph.D. thesis, we develop a new, very generally applicable statistic (Trotter, Reichart & Konz, in preparation, hereafter TRK) for fitting model distributions to data in two dimensions, where the data have both intrinsic uncertainties (i.e., error bars) and extrinsic scatter (i.e., sample variance) that is greater than can be attributed to the intrinsic uncertainties alone. A model distribution is described by the convolution of a model curve y_c(x; theta_m), where {theta_m} is a set of M model parameters that describe the shape of the curve, with a (possibly asymmetric) 2D probability distribution that characterizes the extrinsic scatter, or “slop”, in the data in both the x and y dimensions. The task in fitting a model distribution to a data set is to find the values of the parameters {theta_m}, and of the parameters that characterize the slop, that maximize the joint probability, or likelihood function, of the model distribution and the data set. First, we derive a very general probabilistic formulation of the joint probability for the case where both the intrinsic uncertainties and the slop are normally distributed. We then demonstrate that there is a choice of rotated coordinate systems over which the joint probability integrals may be evaluated, and that different choices of this rotated coordinate system yield fundamentally different statistics, with different properties and different predictions for any given model parameterization and data set.
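As a minimal illustration of the kind of likelihood being maximized, the sketch below (with synthetic data and parameter names m, b, and slop that are our own, purely for illustration) fits a straight line to data with y error bars plus an extra Gaussian slop term added in quadrature. This is only the simplest, one-dimensional special case; the full TRK formulation treats intrinsic uncertainties and slop in both dimensions, via the rotated-coordinate integrals described above.

```python
# Minimal sketch: maximize a Gaussian likelihood in which extrinsic scatter
# ("slop") is added in quadrature to each point's intrinsic y error bar.
# This is only the simplest, 1D special case of the general formulation;
# the full TRK statistic treats uncertainties and slop in both x and y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Illustrative synthetic data: a line, plus error bars, plus extra scatter.
x = np.linspace(0.0, 10.0, 50)
sigma_y = np.full_like(x, 0.3)                       # intrinsic error bars
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma_y) + rng.normal(0.0, 0.8, x.size)

def neg_log_likelihood(params):
    m, b, log_slop = params
    slop = np.exp(log_slop)                          # keep the slop positive
    var = sigma_y**2 + slop**2                       # convolved variance
    resid = y - (m * x + b)
    return 0.5 * np.sum(resid**2 / var + np.log(2.0 * np.pi * var))

fit = minimize(neg_log_likelihood, x0=[1.0, 0.0, 0.0], method="Nelder-Mead")
m_fit, b_fit, slop_fit = fit.x[0], fit.x[1], np.exp(fit.x[2])
print(f"m = {m_fit:.3f}, b = {b_fit:.3f}, slop = {slop_fit:.3f}")
```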
We show that one choice of rotated coordinates yields the traditional statistic advocated by D’Agostini (2005, hereafter D05), and another the statistic advocated by Reichart (2001, hereafter R01). The D05 statistic is non-invertible, i.e., it yields different model fits depending on whether one chooses to fit a model distribution to y vs. x or to x vs. y. The R01 statistic is invertible but, as we demonstrate, suffers from a fatal flaw: it does not reduce to chi^2 in the 1D limiting cases of zero extrinsic scatter and zero intrinsic uncertainty in either the x or y dimension. We introduce a new statistic, TRK, that is invertible and that reduces to chi^2 in these 1D limiting cases. We demonstrate that the best-fit model distribution predicted by the TRK statistic, in the case of normally distributed intrinsic uncertainties and extrinsic scatter, is geometrically equivalent to the distribution that minimizes the sum of the squares of the radial distances of each data point centroid from the model curve, measured in units of the 1-sigma radius of the 2D convolved intrinsic and extrinsic Gaussian error ellipse.
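The non-invertibility described above is easy to see numerically, even in the simplest special case. In the toy sketch below, ordinary least squares stands in for the full D05-style likelihood (a simplification made purely for brevity): fitting y vs. x, and fitting x vs. y and then inverting, give noticeably different lines.

```python
# Toy illustration of non-invertibility: fitting y vs. x, and fitting x vs. y
# and then inverting, give different lines. Ordinary least squares stands in
# for the full D05-style likelihood, purely to keep the sketch short.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
y = 2.0 * x + rng.normal(0.0, 1.5, 200)              # substantial scatter

m_yx, b_yx = np.polyfit(x, y, 1)                     # fit y = m*x + b
m_xy, b_xy = np.polyfit(y, x, 1)                     # fit x = m'*y + b'
m_inv, b_inv = 1.0 / m_xy, -b_xy / m_xy              # invert to y = m*x + b

print(f"y-vs-x fit:          slope = {m_yx:.3f}")
print(f"inverted x-vs-y fit: slope = {m_inv:.3f}")   # noticeably steeper
```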
However, unlike the D05 statistic, the TRK statistic is not scalable, i.e., it yields different best-fit model distributions depending on the choice of basis for each coordinate axis. We demonstrate that, in the limit of data that are entirely dominated by extrinsic scatter (i.e., zero intrinsic error bars), the predictions of TRK at the extremes of scaling the x and y axes are equivalent to the predictions of D05 under an inversion of the x and y axes. However, D05 is limited to this binary choice of inversion or non-inversion, while TRK is free to explore a continuous range of scales between the two limiting cases. Moreover, when the data are dominated by intrinsic rather than extrinsic uncertainty, the range of scales that result in physically meaningful model distributions is significantly smaller, while the predictions of D05 under inversion are the same whether the data are dominated by intrinsic or extrinsic uncertainty. In the limit of data that are entirely dominated by intrinsic uncertainty (i.e., zero slop), this range of scales reduces to a single physically meaningful scale. We discuss these behaviors in the context of linear model distributions fit to various randomized data sets. We define a new correlation coefficient, R_{TRK}^2, analogous to the well-known Pearson correlation coefficient R_{xy}^2, and describe an algorithm for arriving at an “optimum scale” for a TRK fit to a given data set, which yields linear fits that fall approximately midway between the inverted and non-inverted D05 fits. Lastly, we generalize this algorithm to non-linear fits, and to fits to data with asymmetric intrinsic or extrinsic uncertainties.
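The scale dependence, and the limiting behavior at extreme scales, can also be illustrated with a toy calculation. In the sketch below, orthogonal (perpendicular-distance) regression stands in for a statistic that treats the two axes symmetrically; this is only an analogy, not the TRK statistic or its optimum-scale algorithm. As the y axis is rescaled from very compressed to very stretched, the symmetric fit sweeps from the y-vs-x least-squares fit to the inverted x-vs-y fit, while ordinary least squares is unaffected by the rescaling.

```python
# Toy illustration of scale dependence: a statistic that treats the two axes
# symmetrically (here, orthogonal regression, used only as a stand-in) gives
# different answers as one axis is rescaled, sweeping between the y-vs-x
# least-squares fit and the inverted x-vs-y fit, while ordinary least squares
# is unaffected (its slope simply rescales by the same factor).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 500)
y = 2.0 * x + rng.normal(0.0, 1.5, 500)

def ols_slope(u, v):
    return np.polyfit(u, v, 1)[0]

def orthogonal_slope(u, v):
    # Minimizes summed squared perpendicular distances (Deming fit, delta = 1).
    suu, svv = np.var(u), np.var(v)
    suv = np.cov(u, v, bias=True)[0, 1]
    return (svv - suu + np.sqrt((svv - suu)**2 + 4.0 * suv**2)) / (2.0 * suv)

print(f"y-vs-x slope = {ols_slope(x, y):.3f}, "
      f"inverted x-vs-y slope = {1.0 / ols_slope(y, x):.3f}")
for scale in (0.1, 1.0, 10.0):
    ys = scale * y                                   # rescale the y axis
    print(f"scale {scale:5.1f}: OLS slope/scale = {ols_slope(x, ys) / scale:.3f}, "
          f"symmetric slope/scale = {orthogonal_slope(x, ys) / scale:.3f}")
```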
Robust Chauvenet Outlier Rejection
Introduction, from Maples et al. (2018)
Consider a sample of outlying and non-outlying measurements, where the non-outlying measurements are drawn from a given statistical distribution, due to a given physical process, and the outlying measurements are sample contaminants, drawn from a different statistical distribution, due to a different, or additional, physical process, or due to non-statistical errors in measurement. Furthermore, the statistical distribution from which the outlying measurements are drawn is often unknown. Whether (1) combining this sample of measurements into a single value, or (2) fitting a parameterized model to these data, outliers can result in incorrect inferences.
There are a great many methods for identifying and either down-weighting (see §2.1) or outright rejecting outliers. The most ubiquitous, particularly in astronomy, is sigma clipping. Here, measurements are identified as outlying and rejected if they are more than a certain number of standard deviations from the mean, assuming that the sample is otherwise distributed normally. Sigma clipping, for example, is a staple of aperture photometry, where it is used to reject signal above the noise (e.g., other sources, Airy rings, diffraction spikes, cosmic rays, and hot pixels), as well as overly negative deviations (e.g., cold pixels), when measuring the background level in a surrounding annulus.
Sigma clipping, however, is crude in a number of ways, the first being where to set the rejection threshold. For example, if working with ≈100 data points, 2-sigma deviations from the mean are expected but 4-sigma deviations are not, so one might choose to set the threshold between 2 and 4. However, if working with ≈10^4 points, 3-sigma deviations are expected but 5-sigma deviations are not, in which case a greater threshold should be applied.
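These expectations are easy to verify: for a Gaussian sample of N measurements, the expected number at least z standard deviations from the mean is N times the two-sided tail probability, as in the short sketch below.

```python
# Expected number of measurements at least z standard deviations from the
# mean, for a Gaussian sample of size N: N * P(>|z|), where P(>|z|) is the
# two-sided tail probability.
from scipy.stats import norm

for N in (100, 10_000):
    for z in (2, 3, 4, 5):
        expected = N * 2.0 * norm.sf(z)
        print(f"N = {N:6d}, z = {z}: expected count = {expected:8.3f}")
```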
Chauvenet rejection is one of the oldest, and also the most straightforward, improvements to sigma clipping, in that it quantifies this rejection threshold, and does so very simply (Chauvenet 1863). Chauvenet’s criterion for rejecting a measurement is

N P(>|z|) < 0.5,    (1)

where N is the number of measurements in the sample, and P(>|z|) is the cumulative probability of the measurement being more than z standard deviations from the mean, assuming a Gaussian distribution. We apply Chauvenet’s criterion iteratively, rejecting only one measurement at a time for increased stability, but consider the case of (bulk) rejecting all measurements that meet Chauvenet’s criterion each iteration in §5. In either case, after each iteration (1) we lower N by the number of points that we rejected, and (2) we re-estimate the mean and standard deviation, which are used to compute each measurement’s z value, from the remaining, non-rejected measurements.
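A minimal sketch of this iterative, one-at-a-time procedure is given below (the function name and the synthetic data are our own, for illustration only); the bulk-rejection variant of §5 would instead reject every measurement meeting the criterion on each pass.

```python
# Minimal sketch of iterative (one-at-a-time) traditional Chauvenet rejection:
# reject the worst surviving measurement while N * P(>|z|) < 0.5, re-estimating
# the mean and standard deviation from the survivors after each rejection.
import numpy as np
from scipy.stats import norm

def chauvenet_reject(data):
    data = np.asarray(data, dtype=float)
    keep = np.ones(data.size, dtype=bool)
    while keep.sum() > 2:
        mu = data[keep].mean()
        sigma = data[keep].std(ddof=1)
        z = np.abs(data - mu) / sigma
        worst = np.argmax(np.where(keep, z, -np.inf))
        # Chauvenet's criterion, with P(>|z|) the two-sided tail probability.
        if keep.sum() * 2.0 * norm.sf(z[worst]) < 0.5:
            keep[worst] = False              # reject one measurement, iterate
        else:
            break
    return keep

rng = np.random.default_rng(3)
sample = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(10.0, 1.0, 5)])
kept = chauvenet_reject(sample)
print(f"rejected {np.count_nonzero(~kept)} of {sample.size} measurements")
print(f"mean of survivors = {sample[kept].mean():.3f}")
```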
However, both traditional Chauvenet rejection and its more general, less well-defined version, sigma clipping, suffer from the fact that neither the mean nor the standard deviation is a “robust” quantity: both are easily contaminated by the very outliers they are being used to reject. In §2, we consider increasingly robust (but decreasingly precise; see below) replacements for the mean and standard deviation, settling on three measures of central tendency (§2.1) and four measures of sample deviation (§2.2). We calibrate seven pairings of these, using uncontaminated data, in §2.3.
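For illustration only, the sketch below compares the mean and standard deviation to one simple robust pairing, the median and a 68.3rd-percentile absolute deviation from the median, on clean and contaminated samples; the measures, pairings, and calibrations actually adopted are those developed in §2.

```python
# Illustration of robust replacements for the mean and standard deviation:
# the median as the measure of central tendency, and the 68.3rd percentile of
# the absolute deviations from the median as the measure of sample deviation
# (which matches the standard deviation for a pure Gaussian). These are simple
# stand-ins for the calibrated measures developed in Section 2.
import numpy as np

def robust_mu(data):
    return np.median(data)

def robust_sigma(data):
    return np.percentile(np.abs(data - np.median(data)), 68.27)

rng = np.random.default_rng(4)
clean = rng.normal(0.0, 1.0, 1000)
contaminated = np.concatenate([clean, rng.normal(20.0, 1.0, 300)])

for label, sample in (("clean", clean), ("contaminated", contaminated)):
    print(f"{label:12s}: mean = {sample.mean():6.2f}, "
          f"std = {sample.std(ddof=1):6.2f}, "
          f"median = {robust_mu(sample):6.2f}, "
          f"robust sigma = {robust_sigma(sample):6.2f}")
```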
In §3, we evaluate these increasingly robust improvements to traditional Chauvenet rejection against different contaminant types: In §3.1, we consider the case of two-sided contaminants, meaning that outliers are as likely to be high as they are to be low; and in §3.2, we consider the (more challenging) case of one-sided contaminants, where all or almost all of the outliers are high (or low; we also consider in-between cases here). In §3.3, we consider the case of rejecting outliers from mildly non-Gaussian distributions.
In §3, we show that these increasingly robust improvements to traditional Chauvenet rejection do indeed result in increasingly accurate measurements, and they do so in the face of increasingly high contaminant fractions and contaminant strengths. But at the same time, these measurements are decreasingly precise. However, in §4, we show that one can make measurements that are both very accurate and very precise, by applying these techniques in sequence, with more-robust techniques applied before more-precise techniques.
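The sketch below illustrates the ordering idea in compact form, with a robust pass (median plus percentile-based deviation) applied before a precise pass (mean plus standard deviation); it is a simplified stand-in for, not a reproduction of, the sequence developed in §4.

```python
# Compact sketch of the robust-before-precise ordering: a rejection pass using
# the median and a 68.3rd-percentile deviation first, then a second pass using
# the mean and standard deviation on the survivors. This is a simplified
# stand-in for, not a reproduction of, the sequence developed in Section 4.
import numpy as np
from scipy.stats import norm

def reject(data, keep, center, deviation):
    """One-at-a-time, Chauvenet-style rejection with pluggable estimators."""
    while keep.sum() > 2:
        mu, sigma = center(data[keep]), deviation(data[keep])
        z = np.abs(data - mu) / sigma
        worst = np.argmax(np.where(keep, z, -np.inf))
        if keep.sum() * 2.0 * norm.sf(z[worst]) < 0.5:
            keep[worst] = False
        else:
            break
    return keep

def robust_dev(d):
    return np.percentile(np.abs(d - np.median(d)), 68.27)

def precise_dev(d):
    return np.std(d, ddof=1)

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(0.0, 1.0, 80), rng.normal(15.0, 3.0, 20)])
keep = np.ones(data.size, dtype=bool)

keep = reject(data, keep, np.median, robust_dev)     # robust pass first
keep = reject(data, keep, np.mean, precise_dev)      # then the precise pass
print(f"kept {keep.sum()} of {data.size}; "
      f"mean = {data[keep].mean():.3f}, std = {data[keep].std(ddof=1):.3f}")
```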
In §5, we evaluate the effectiveness of bulk rejection, which can be significantly less demanding computationally. In §6, we consider the case of weighted data. In §7, we exercise both of these techniques with an astronomical example.
In §8, we show how RCR can be applied to model fitting, which first requires a generalization of this traditionally non-robust process. In §9, we compare RCR to Peirce rejection (Peirce 1852; Gould 1855), which is perhaps the next-most commonly used outlier-rejection technique (Ross 2003). Peirce rejection is a non-iterative alternative to traditional Chauvenet rejection that can also be applied to model fitting, and that has a reputation for being superior to traditional Chauvenet rejection.
Summary, from Maples et al. (2018)
The most fundamental act in science is measurement. By combining multiple measurements, one can better constrain a quantity’s true value, and its uncertainty. However, measurements, and consequently samples of measurements, can be contaminated. Here, we have introduced, and thoroughly tested, an approach that, while not perfect, is very effective at identifying which measurements in a sample are contaminated, even if they constitute most of the sample, and especially if the contaminants are strong (making contaminated measurements easier to identify).
In particular, we have considered:
Both symmetrically (§3.1) and asymmetrically (§3.2) distributed contamination of both symmetrically (§3.1, §3.2, §3.3.2) and asymmetrically (§3.3.1) distributed uncontaminated measurements, and have developed robust outlier rejection techniques for all combinations of these cases.
The tradeoff between these techniques’ accuracy and precision, and have found that by applying them in sequence, from more robust to more precise, both can be achieved (§4).
The practical cases of bulk rejection (§5), weighted data (§6), and model fitting (§8), and have generalized the RCR algorithm accordingly.
Finally, we have developed a simple web interface so that anyone can use the RCR algorithm. Users upload a data set and select from the above scenarios; the interface returns the data set with outliers flagged, and with mu_1 and sigma_1 (the recovered measures of central tendency and of sample deviation) robustly measured. Source code is available there as well.
Parameter-Space Regression
In Maples et al. (2018), we introduced a new way of thinking about parametric regression: Given N weighted measurements and an M-parameter model, we calculated separate model solutions for all N!/[(N-M)!M!] combinations of measurements. Each model solution consisted of (1) a value for each of the model’s M parameters, and (2) a weight for each of these values, calculated from standard propagation of uncertainties. We then defined ways of taking (1) the weighted median and (2) the weighted mode of these parameter values, for each parameter, to get best-fit models that are significantly more robust (i.e., less sensitive to outliers) than what one gets by performing least-squares regression on the measurements.
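The sketch below illustrates this for the simplest case, a two-parameter straight line with y uncertainties: every pair of measurements determines a slope and an intercept, each value is weighted by standard propagation of the measurement uncertainties through the two-point solution, and the weighted median of each parameter gives a fit that shrugs off a strong outlier that badly biases least squares. The data and the particular weighted-median implementation are illustrative only.

```python
# Minimal sketch of parameter-space regression for a two-parameter line:
# every pair of measurements yields a (slope, intercept) solution, weighted by
# standard propagation of the y uncertainties through the two-point solution;
# the weighted median of each parameter then gives a robust fit. The weighting
# and the weighted-median implementation here are illustrative simplifications.
import numpy as np
from itertools import combinations

def weighted_median(values, weights):
    order = np.argsort(values)
    values = np.asarray(values)[order]
    weights = np.asarray(weights)[order]
    cum = np.cumsum(weights)
    return values[np.searchsorted(cum, 0.5 * cum[-1])]

rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 20)
sigma_y = np.full_like(x, 0.5)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma_y)
y[3] += 25.0                                         # one strong outlier

slopes, intercepts, w_m, w_b = [], [], [], []
for i, j in combinations(range(x.size), 2):          # all N!/[(N-2)!2!] pairs
    dx = x[j] - x[i]
    m = (y[j] - y[i]) / dx
    b = (y[i] * x[j] - y[j] * x[i]) / dx
    var_m = (sigma_y[i]**2 + sigma_y[j]**2) / dx**2              # propagated
    var_b = (x[j]**2 * sigma_y[i]**2 + x[i]**2 * sigma_y[j]**2) / dx**2
    slopes.append(m)
    intercepts.append(b)
    w_m.append(1.0 / var_m)
    w_b.append(1.0 / var_b)

m_robust = weighted_median(slopes, w_m)
b_robust = weighted_median(intercepts, w_b)
m_ls, b_ls = np.polyfit(x, y, 1)                     # ordinary least squares
print(f"weighted-median fit: m = {m_robust:.3f}, b = {b_robust:.3f}")
print(f"least-squares fit:   m = {m_ls:.3f}, b = {b_ls:.3f}")
```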
This is a new way of thinking about regression – as something that takes place in a model’s parameter space, instead of in the measurement space. In fact, in Konz et al., in preparation, we show that least-squares regression, performed on measurements, is mathematically equivalent to taking the weighted mean of these parameter values, for each parameter. In this sense, least-squares regression is but one, limited, option, corresponding to the simplest thing one might try in parameter space.
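For the simplest case, the slope of a straight line fit to equally weighted measurements, this equivalence can be checked directly: the least-squares slope is exactly the weighted mean of the two-point slopes, with each pair weighted by (x_i - x_j)^2. The quick numerical check below verifies that simple case; the general weighted, multi-parameter statement is the subject of Konz et al., in preparation.

```python
# Numerical check for the simplest case (straight-line slope, equal weights):
# the least-squares slope equals the weighted mean of the two-point slopes,
# with each pair of measurements weighted by (x_i - x_j)**2.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 10.0, 15)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, 15)

pair_slopes, pair_weights = [], []
for i, j in combinations(range(x.size), 2):
    pair_slopes.append((y[i] - y[j]) / (x[i] - x[j]))
    pair_weights.append((x[i] - x[j])**2)

weighted_mean_slope = np.average(pair_slopes, weights=pair_weights)
least_squares_slope = np.polyfit(x, y, 1)[0]
print(f"weighted mean of pairwise slopes = {weighted_mean_slope:.6f}")
print(f"least-squares slope              = {least_squares_slope:.6f}")
```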
In Maples et al. (2018), we have already presented two more-robust examples of things one might try in parameter space, but many other possibilities exist. We briefly discuss what some of these possibilities might be, and in particular how the options presented in Maples et al. (2018) could be improved upon in the future.
We also discuss how parameter uncertainties can be trivially measured in this new formalism, and how prior information can be incorporated just as easily.
Finally, in Maples et al. (2018), we considered only uncertainties in the dependent variable. Here, we show how uncertainties in both dependent and independent variables can be simultaneously, and again, trivially, incorporated in this new formalism. We also address how (1) asymmetric and (2) systematic uncertainties in each of these variables can be incorporated.