Liquid biopsy using Blood
Detection and localization of surgically resectable cancers with a multi-analyte blood test
Objectives
In a medical test, we want to maximize sensitivity (few false negatives) and specificity (few false positives).
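As a quick reminder of the two quantities, here is a minimal sketch in Python. The function name and the toy labels are made up for illustration (1 = cancer, 0 = healthy); nothing here comes from the paper.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = cancer, 0 = healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of true cancers that are detected
    specificity = tn / (tn + fp)  # fraction of healthy people correctly cleared
    return sensitivity, specificity

# Invented example: 4 cancer patients (one missed), 6 healthy people (one false alarm).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

Each false negative lowers sensitivity, and each false positive lowers specificity.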
Procedure
We searched for the minimum number of short amplicons that would allow us to detect at least one driver gene mutation in each of the eight tumor types evaluated. Once this plateau was reached, raising the number of amplicons would not increase sensitivity substantially but would decrease specificity (increase the probability of false-positive results). For instance, this could happen if some mutations tend to occur together in cancer patients but can occur separately in non-cancer patients: including both as indicators of cancer does not increase the probability of detecting cancer when it is there, but it does increase the probability of false positives.
Note that when discussing which biomarkers to use, they seem to assume a diagnostic function of the form "if any of the markers is present, predict cancer". I think they trained an ML model later, which can take more complicated dependencies into account. But why not take these dependencies into account when choosing the markers?
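To make the plateau argument concrete, here is a toy calculation under the "any marker present means cancer" rule. The two markers A and B and all probabilities are invented for illustration; they are not values from the paper.

```python
def positive_rate(joint, markers):
    """Probability that the OR rule fires, i.e. at least one chosen marker is present.

    joint maps a tuple of marker states, e.g. (a, b), to its probability.
    markers lists the tuple positions that the test actually assays.
    """
    return sum(p for states, p in joint.items() if any(states[m] for m in markers))

# Hypothetical joint distributions over two markers (A, B).
# In cancer, B only ever appears together with A, so B adds no new detections.
cancer = {(1, 1): 0.5, (1, 0): 0.1, (0, 1): 0.0, (0, 0): 0.4}
# In healthy people, A and B each occur independently with probability 0.01.
healthy = {(1, 1): 0.0001, (1, 0): 0.0099, (0, 1): 0.0099, (0, 0): 0.9801}

sens_a = positive_rate(cancer, [0])       # sensitivity using A alone
sens_ab = positive_rate(cancer, [0, 1])   # sensitivity using A or B
fpr_a = positive_rate(healthy, [0])       # false-positive rate using A alone
fpr_ab = positive_rate(healthy, [0, 1])   # false-positive rate using A or B

print(f"sensitivity: A only {sens_a:.4f}, A or B {sens_ab:.4f}")
print(f"false-positive rate: A only {fpr_a:.4f}, A or B {fpr_ab:.4f}")
```

In this toy setup, adding B leaves sensitivity at 0.6 while the false-positive rate roughly doubles, from 0.01 to about 0.02. A model that looks at the joint pattern of markers could avoid this, which is the point of the question above.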
Experiment
We used CancerSEEK to study 1005 patients who had been diagnosed with stage I to III cancers of the ovary, liver, stomach, pancreas, esophagus, colorectum, lung, or breast. The healthy control cohort consisted of 812 individuals of median age 55 (range 17 to 88) with no known history of cancer, high-grade dysplasia, auto-immune disease, or chronic kidney disease.
Statistical method
Binary classification (detecting whether cancer is present or not). The presence of a mutation in an assayed gene or an elevation in the level of any of these proteins would classify a patient as positive. It was therefore imperative to use rigorous statistical methods to ensure the accuracy of the test. We used log ratios to evaluate mutations and incorporated them into a Logistic regression algorithm that took into account both mutation data and protein biomarker levels to score CancerSEEK test results (supplementary materials).
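To remind myself what "incorporating both kinds of features into a logistic regression" looks like, here is a minimal sketch with plain gradient descent. It is not the paper's pipeline: the two features (a mutation score and a single protein level), the data values, and the training settings are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit weights and bias by plain gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    """Predicted probability of cancer for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Invented features per sample: (mutation score, protein level), arbitrary units.
X = [(0.1, 0.2), (0.0, 0.5), (0.3, 0.1), (0.2, 0.4),   # healthy
     (2.0, 0.3), (0.1, 2.5), (1.5, 1.8), (2.2, 2.0)]   # cancer
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(X, y)
p_high = predict_proba(w, b, (2.0, 2.0))  # both features elevated
p_low = predict_proba(w, b, (0.1, 0.1))   # neither feature elevated
print(w, b, p_high, p_low)
```

The output is a continuous score rather than a yes/no from a single marker, so one threshold on the score trades sensitivity against specificity for all markers at once.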
The mean sensitivities and specificities were determined by 10 iterations of 10-fold cross-validation. The receiver operating characteristic (ROC) curve for the entire cohort of cancer patients and controls in one representative iteration is shown in Fig. 2A.
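A sketch of what "10 iterations of 10-fold cross-validation" amounts to, assuming plain (unstratified) random folds and a toy threshold classifier on synthetic one-dimensional scores. The data, the classifier, and the helper names are all hypothetical; the paper's actual model and fold construction are not reproduced here.

```python
import random
from statistics import mean

def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, train_idx, test_idx); each repeat reshuffles and partitions all indices."""
    rng = random.Random(seed)
    for r in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[k::n_splits] for k in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, f in enumerate(folds) if j != k for i in f]
            yield r, train, test

# Synthetic data: 100 "cancer" scores centered at 2, 100 "healthy" scores centered at 0.
data_rng = random.Random(1)
labels = [1] * 100 + [0] * 100
scores = [data_rng.gauss(2.0 if y else 0.0, 1.0) for y in labels]

def fit_threshold(train):
    """Toy 'training': put the threshold midway between the two class means."""
    pos = [scores[i] for i in train if labels[i] == 1]
    neg = [scores[i] for i in train if labels[i] == 0]
    return (mean(pos) + mean(neg)) / 2

counts = {}  # repeat -> [tp, fn, tn, fp], pooled over that repeat's 10 test folds
for r, train, test in repeated_kfold(len(labels)):
    thr = fit_threshold(train)
    c = counts.setdefault(r, [0, 0, 0, 0])
    for i in test:
        pred = scores[i] > thr
        if labels[i] == 1:
            c[0 if pred else 1] += 1
        else:
            c[3 if pred else 2] += 1

# Average the per-repeat sensitivity and specificity over the 10 repeats.
mean_sens = mean(tp / (tp + fn) for tp, fn, tn, fp in counts.values())
mean_spec = mean(tn / (tn + fp) for tp, fn, tn, fp in counts.values())
print(mean_sens, mean_spec)
```

Every sample is scored exactly once per repeat by a model that did not see it during fitting, and repeating with different shuffles averages out the dependence on one particular split.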
What are log ratios? My first guess was the log-likelihood ratio, with the ROC curve then probably obtained by sweeping the threshold at which the ratio counts the marker as "detected" or "undetected". Actually not quite. From "Log ratios and omega scores": For each mutation, the log ratio of these two p-values (which are just cumulative distributions), p_C / p_N, was then calculated, and the minimum and maximum of these log ratios across the six wells were eliminated so that the results would be less sensitive to outliers. We considered the log ratio of the p-values rather than the standard log-likelihood ratio because the relatively low number of data points available did not allow a robust estimation of the densities of the MAF distributions (particularly for p_C).
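The trimming step in the quote above can be sketched as follows. This covers only the per-well log ratio and the removal of the smallest and largest value; how the paper then combines the remaining values into an omega score is not reproduced. The function name and the p-values are invented.

```python
import math

def trimmed_log_ratios(p_cancer, p_normal):
    """Per-well log(p_C / p_N), sorted, with the minimum and the maximum dropped."""
    if len(p_cancer) != len(p_normal) or len(p_cancer) < 3:
        raise ValueError("need the same number of wells (at least 3) for both p-values")
    ratios = sorted(math.log(pc / pn) for pc, pn in zip(p_cancer, p_normal))
    return ratios[1:-1]

# Invented p-values for one mutation across six wells.
p_c = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
p_n = [0.5, 0.25, 0.125, 0.5, 0.0001, 0.9]   # well 5 is an extreme outlier

kept = trimmed_log_ratios(p_c, p_n)
print(kept)
```

With six wells, four log ratios survive, and a single extreme well (here the one with p_N = 0.0001) can no longer dominate whatever summary is computed next.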
The median sensitivity of CancerSEEK among the eight cancer types evaluated was 70% (P < 10^-96, one-sided binomial test) and ranged from 98% in ovarian cancers to 33% in breast cancers (Fig. 2C). At this sensitivity, the specificity was >99%.
Is the binomial test technically justified if the predictor has been obtained from the data? Why not just use cross-validation (or maybe PAC-Bayes)?
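For reference, a one-sided binomial test is just an upper-tail probability. The sketch below uses hypothetical counts and an assumed null detection rate of 1% (roughly what a test with 99% specificity and no real signal would produce); I do not know which null hypothesis the paper actually used.

```python
import math

def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the one-sided p-value for k or more successes."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical numbers, not taken from the paper:
# 70 detections among 100 cancer patients, null detection rate 1%.
p_value = binomial_upper_tail(70, 100, 0.01)
print(p_value)
```

The test treats each patient as an independent draw with a fixed success probability, which is why it matters whether the classifier was tuned on the same patients it is being evaluated on. For much larger n, the sum is better done in log space to avoid overflow.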
The importance rankings of the ctDNA and protein features used in CancerSEEK are provided in table S9, and a principal component analysis displaying the clustering of individuals with and without cancer is shown in fig. S3.
MAF normalization. All mutations that did not have >1 supermutant in at least one well were excluded from the analysis. ?
The cancer training set included only those in which the same mutation was present in the plasma and in the corresponding primary tumor, with an MAF > 5% in the tumor. ??
Multiclass classification (detecting which cancer is present). Given that driver gene mutations are usually not tissue-specific, the vast majority of the localization information was derived from protein markers.
Other cancer biomarkers—such as metabolites, mRNA transcripts, miRNAs, or methylated DNA sequences—could be similarly combined to increase sensitivity and localization of cancer site.
MAF: mutant allele frequency
GRAIL