In mulitple testing experiments, the p values from a regular t-test will introduce a lot of false positive. In R, the function p.adjust provides serveral ways to adjust the p-values.
However, the speed of Benjamini–Hochberg procedure in p.adjust is not optimized. In this case, I rewrote it in Rcpp, which is basically c++. The algorithm is explained here.
The cpp function of FDR control:
library(Rcpp)
cppFunction(' NumericVector FDRcontrol(NumericVector p, double alpha){
// the input should be ascending order p value //
int size = p.size();
double threshold1, threshold2;
NumericVector passed(size);
int i = 0;
while ( i < size){
threshold1 = alpha * (i+1) / size;
threshold2 = alpha * (i+2) / size;
if (threshold1 > p[i]){
if (threshold2 < p[i+1]){
for (int j = 0; j <= i; j++){
passed[j] = 1;
}
}
}
i++;
}
return(passed);
}')
Now, start benchmarking,
library(dplyr)
library(rbenchmark)
library(data.table)
set.seed(100) #Again, reporoducible
df <- data.table(p = rnorm(100000,-0.5:0.5)) %>%
filter(p < 1 & p > -1) %>%
mutate(p = abs(p)) %>%
tbl_df
df
## Source: local data frame [62,512 x 1]
##
## p
## 1 0.6315312
## 2 0.5789171
## 3 0.3830287
## 4 0.8186301
## 5 0.1401379
## 6 0.4101139
## 7 0.5962745
## 8 0.7016340
## 9 0.3766205
## 10 0.4706833
## .. ...
# The goal is to put a 1 if FDR < alpha
# here set FDR threshold as 0.5, which is half the time is random
benchmark(df %>% arrange(p) %>% mutate(fdr = FDRcontrol(p,0.5)),
df %>% mutate(padj = p.adjust(p,method='BH'),FDR=ifelse(padj<0.5,1,0)))
## test
## 1 df %>% arrange(p) %>% mutate(fdr = FDRcontrol(p, 0.5))
## 2 df %>% mutate(padj = p.adjust(p, method = "BH"), FDR = ifelse(padj < 0.5, 1, 0))
## replications elapsed relative user.self sys.self user.child sys.child
## 1 100 3.272 1.000 3.267 0.004 0 0
## 2 100 7.304 2.232 7.159 0.142 0 0
df1 = df %>%
arrange(p) %>%
mutate(fdr = FDRcontrol(p,0.5)) %>%
mutate(padj = p.adjust(p,method='BH'),FDR=ifelse(padj<0.5,1,0))
df1
## Source: local data frame [62,512 x 4]
##
## p fdr padj FDR
## 1 6.717138e-06 1 0.4199017 1
## 2 2.243902e-05 0 0.5541057 0
## 3 2.659197e-05 0 0.5541057 0
## 4 3.975755e-05 0 0.5898058 0
## 5 4.863623e-05 0 0.5898058 0
## 6 7.776861e-05 0 0.5898058 0
## 7 7.790700e-05 0 0.5898058 0
## 8 8.057130e-05 0 0.5898058 0
## 9 8.491573e-05 0 0.5898058 0
## 10 1.065339e-04 0 0.6312590 0
## .. ... ... ... ...
Although the Rcpp function is faster, it does not have the flexibility to choose a thershold for FDR after adjusting the p values.
And also, Rcpp is a good way to increase the speed of R.