Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
examples - we need models for selecting a small subset of useful features from high-dimensional data, where the useful features are both rare and weak, this being crucial for e.g. supervised classfication of sparse high- dimensional data. A preceding step is to detect the presence of useful features, signal detection. This problem is related to testing a very large number of hypotheses, where the proportion of false null hypotheses is assumed to be very small. However, reliable signal detection will only be possible in certain areas of the two-dimensional sparsity-strength parameter space, the phase space.
In this report, we focus on two families of distributions, N and χ2. In the former case, features are supposed to be independent and normally distributed. In the latter, in search for a more sophisticated model, we suppose that features depend in blocks, whose empirical separation strength asymptotically follows the non-central χ2ν-distribution.
Our search for informative features explores Tukey's higher criticism (HC), which is a second-level significance testing procedure, for comparing the fraction of observed signi cances to the expected fraction under the global null.
Throughout the phase space we investgate the estimated error rate,
Err = (#Falsely rejected H0+ #Falsely rejected H1)/#Simulations,
where H0: absence of informative signals, and H1: presence of informative signals, in both the N-case and the χ2ν-case, for ν= 2; 10; 30. In particular, we find, using a feature vector of the approximately same size as in genomic applications, that the analytically derived detection boundary is too optimistic in the sense that close to it, signal detection is still failing, and we need to move far from the boundary into the success region to ensure reliable detection. We demonstrate that Err grows fast and irregularly as we approach the detection boundary from the success region.
In the χ2ν-case, ν > 2, no analytical detection boundary has been derived, but we show that the empirical success region there is smaller than in the N-case, especially as ν increases.
2012. , 28 p.