How can a ZERO from an analytics product get a great ROI and improve innovation for your enterprise?
Recently we released a new version of our analytics product, ReSurfX::vysen 2.0, initially showcasing the power of our Adaptive Hypersurface Technology (AHT)™ using RNAseq- and microarray-based gene expression analysis.
Here we demonstrate the power of vysen with benchmark results that have enormous impact for analytics and outcomes. These results form one basis for vysen being the most accurate product in the market.
We analyzed Sequencing Quality Control Consortium (SEQC) benchmark data from two sites (Mayo Clinic and Beijing Genomics Institute), each of which had carried out RNA sequencing (RNAseq) on two different reference RNA samples with 5 replicates each. We found that ReSurfX::vysen:
- Identified ‘ZERO’ false positive differentially expressed genes (DEGs) from over 10 million calls IN 180 COMPARISONS (EVERY within-sample 3-replicate comparison), EACH INVOLVING 58,051 GENES, demonstrating remarkable false positive control.
- Is MORE SENSITIVE AND REPRODUCIBLE THAN ANY OTHER APPROACH, as shown through direct comparison and by comparing ALL POSSIBLE between-sample THREE-REPLICATE COMBINATIONS. This involved 200 vysen comparisons of 58,051 genes each, for over 11 million DEG calls.
(this version was edited* on May 09, 2017 – the originally released version has been archived).
Here we share the details behind this unprecedented accuracy. [Learn more about ReSurfX::vysen here!]
We took the part of the SEQC (Sequencing Quality Control Consortium) dataset that made the most sense for real-life applications, i.e., comparing gene expression in two samples: the Universal Human Reference RNA (UHRR, from 10 pooled cancer cell lines, Agilent Technologies, Inc.) and the Human Brain Reference RNA (HBRR, from multiple brain regions of 23 donors, Life Technologies, Inc.), or, as the consortium labelled the samples, A vs B. The analytics and outcomes presented below can be reproduced with the trial version of our product currently in the market (soon to be incorporated into the paid product as well).
We took RNAseq data generated on the Illumina® platform from two sites in the study: Site1 (BGI) and Site2 (Mayo). The consortium generated data at multiple sites to compare reproducibility and study variability. Each site sequenced FIVE technical replicates of each of the two samples.
First, some salient and unique features of product output:
- ReSurfX::vysen gives you a 0 vs 1 call for No Change vs Change, respectively, between the conditions compared (removing the need for subjective p-value or FDR thresholding).
- vysen has an automatable outlier detection system that will be available in the product in the future.
- Interestingly, the product performs extremely well even with fewer replicates (a rarity among the analytic approaches available).
Summary of our analytics and the power of our product and technology:
- Incredibly, when we took all possible within-sample three-replicate comparisons from both sites for both samples, ReSurfX::vysen produced ‘ZERO’ differentially expressed genes IN EVERY ONE OF THE 180 COMPARISONS, EACH INVOLVING 58,051 GENES. No analytic result has been reported to date with this level of false positive control.
- Given the enormous implication of the result above, we went on to study all between-sample three-replicate combinations within each site (e.g., Site1-SampleA vs Site1-SampleB). We could reproducibly identify nearly the same genes, down to a sensitivity level of 1.25-fold, in three-replicate comparisons between the two samples at both sites. In principle, if there are no false positives in our analysis of within-sample data (as shown in the previous point), and if the same results are reproducibly found across all these comparisons, then it is reasonable to claim that the DEGs identified here represent true positives.
a. We compared ALL POSSIBLE between-sample THREE REPLICATE COMBINATIONS from each of the sites (total 200 comparisons).
b. We show that the number of differentially expressed genes (DEGs) found AND the overlap among them are INCREDIBLY consistent. Further, the number of DEGs identified is significantly larger than has been reported by any other study, indicating that ReSurfX::vysen identifies DEGs without compromising on the differentials despite the level of false positive reduction shown above. THIS SHOWS that vysen is INDEED MORE SENSITIVE THAN ANY OTHER APPROACH AVAILABLE, EVEN AFTER EXTENSIVE DEVELOPMENT BY THE FIELD.
- These results are consistent with our claim and value proposition – we can get more accurate results than analytics available in the market and sector with FEWER replicates/subjects!
- Even with additional filters applied to conform to the practical significance of real-life datasets and outcomes, this level of accuracy and reproducibility has never been reported. Thus, Adaptive Hypersurface Technology (AHT), in this case demonstrated for comparative gene-expression analytics, should be an IMMENSELY POWERFUL addition to the analytics toolbox for reaping value in an increasingly data-centric world.
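The comparison counts quoted above are consistent with a simple enumeration of three-replicate subsets. The following is a minimal sketch of that arithmetic only, not of the vysen workflow. It assumes (the post does not spell this out) that a within-sample comparison pairs two different three-replicate subsets of the same five replicates, and that a between-sample comparison pairs any three-replicate subset of Sample A with any three-replicate subset of Sample B at the same site. All variable names are illustrative.

```python
from itertools import combinations

REPLICATES = range(1, 6)   # five technical replicates per sample per site
GENES = 58_051             # gene count quoted in the text
SITES = ("BGI", "Mayo")
SAMPLES = ("A", "B")

# Every way of choosing three of the five replicates: C(5, 3) = 10 subsets.
triplets = list(combinations(REPLICATES, 3))

# Within-sample (assumption): unordered pairs of different triplets drawn from
# the same sample at the same site. Any two such triplets share a replicate.
within_per_group = list(combinations(triplets, 2))
within_total = len(within_per_group) * len(SITES) * len(SAMPLES)

# Between-sample (assumption): any Sample A triplet against any Sample B
# triplet, both taken from the same site.
between_per_site = [(a, b) for a in triplets for b in triplets]
between_total = len(between_per_site) * len(SITES)

print("within-sample comparisons:", within_total)
print("within-sample gene calls:", within_total * GENES)
print("between-sample comparisons:", between_total)
print("between-sample gene calls:", between_total * GENES)
```

Under these assumptions the sketch gives 180 within-sample and 200 between-sample comparisons, matching the "over 10 million" and "over 11 million" call totals quoted in the text.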
NOTE: the analytics, technology, and product development did not use this data as a training set at any point (so there is no inherent or specific overfitting of the technology and product to the data presented here).
Outline of analysis details: [WE INVITE YOU TO COMPARE YOUR RESULT AND SHARE IT WITH THE WORLD AS A COMMENT HERE]
SEQC data (GSE47792; Nat Biotechnol. 2014 Sep;32(9):903-14) representing the 5 replicates of the Universal Human Reference RNA (UHRR, from 10 pooled cancer cell lines, Agilent Technologies, Inc.) and the Human Brain Reference RNA (HBRR, from multiple brain regions of 23 donors, Life Technologies, Inc.) were used.
The data were downloaded as SRA files from the NCBI/GEO database and converted to BAM files (all samples collected as forming a replicate were combined) using NCBI/Samtools (Li, H. et al., Bioinformatics. 2009 25(16):2078-9) and HISAT2 (Kim D, Langmead B and Salzberg SL., Nature Methods 2015 12(4):357–360). The resulting reads in BAM format were processed using vysen: reads spanning each gene ID were counted using the Ensembl (Yates et al., Nucleic Acids Res. 2016 44 Database issue:D710-6) Hs_GRCh38_87 genome annotation and normalized to suit the AHT workflow before comparative analytics.
Quantitative support:
Fig. 1: Overlap of the Sample A vs Sample B comparisons between the BGI and Mayo sites. Five replicates each of Sample A vs Sample B, from Site1 (BGI) vs Site2 (Mayo).
Note: in all outputs shown below, no external thresholding (such as fold change, p-value, or FDR) was applied; instead, the ReSurfX::vysen call of 1 or 0 (DGE or not, where DGE denotes a differentially expressed gene) was used. vysen also gives a confidence value (which is meaningful within a comparison and in conjunction with the DGE call).
Fig. 2: Number of DGEs found in each of the 180 possible within-sample 3-replicate comparisons across the two samples and the two sites. [Total 180 comparisons of 58,051 genes].
Remarkably, ReSurfX::vysen detected no DGEs in any of the 180 comparisons.
Fig. 3: Remarkable consistency among all possible 3-replicate comparisons between Sample A and Sample B in data generated at the Mayo and BGI sites. [100 comparisons for each site]
The graph shows, for each gene, the number of comparisons that called it a DGE (Count), out of a maximum possible 100. Note: 58,051 genes in all cases. A Count of 100 means that all possible comparisons called the gene a DGE; 0 means that all possible comparisons called the gene a non-DGE (not differentially expressed between the two samples).
The table shows the calculated totals of true and false calls. Click here to see compiled, detailed data. For the sake of estimating error, every gene with a total call Count above 50 (of 100 comparisons) is considered a true positive call, every gene with a Count below 50 (of 100 comparisons) is considered a true negative, and a gene called DGE exactly 50 times is counted both as a false positive and as a true positive. 45,749 of 58,051 genes were called identically (as DGEs or as non-DGEs) in all 100 comparisons from data generated at the Mayo site, and 45,476 of 58,051 genes were called identically in all 100 comparisons from data generated at the BGI site.
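As an illustration of the error-estimation rule described above, here is a minimal sketch of how per-gene Counts could be turned into totals of true and false calls. It is one possible reading, not the method used to build the table. The function name `tally_calls` is hypothetical. Treating the minority calls of a gene as errors (false negatives for a majority-DGE gene, false positives for a majority-non-DGE gene) is an assumption, as is adding the 50 DGE calls of a tied gene to both the true positive and false positive totals.

```python
def tally_calls(counts, n_comparisons=100):
    """Tally calls from per-gene DGE Counts using a majority rule.

    counts: for each gene, the number of comparisons that called it a DGE.
    Assumption: the minority calls of a gene are tallied as errors.
    """
    half = n_comparisons / 2
    totals = {"TP": 0, "FP": 0, "TN": 0, "FN": 0, "ties": 0, "unanimous": 0}
    for c in counts:
        if c in (0, n_comparisons):
            totals["unanimous"] += 1          # same call in every comparison
        if c > half:                          # majority call: DGE
            totals["TP"] += c
            totals["FN"] += n_comparisons - c
        elif c < half:                        # majority call: non-DGE
            totals["TN"] += n_comparisons - c
            totals["FP"] += c
        else:                                 # exact tie (assumption): DGE calls
            totals["ties"] += 1               # go to both the TP and FP totals
            totals["TP"] += c
            totals["FP"] += c
    return totals


# Made-up Counts for five genes, for illustration only (not SEQC results).
example = tally_calls([100, 0, 97, 3, 50])
print(example)
```

The `unanimous` tally corresponds to the genes "called identically in all 100 comparisons" reported above for each site.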
BACK TO YOUR ROI:
- ReSurfX::vysen finds differentials highly accurately and reproducibly, improving the value of your data analytics efforts, your use of manpower and other resources, your time to market with a correct product, and the outcomes of your business or research interest.
- ReSurfX::vysen has previously been used at a customer site, where our solution and technology reliably and reproducibly found no differentials in comparative analytics when none were really present, as confirmed by a variety of other analytics at that time.
REITERATING OTHER VALUE PROPOSITIONS OF ReSurfX::vysen:
- Extremely high accuracy, which by itself leads to novel insight by bringing highly accurate data into your workflow and knowledge-discerning process.
- Highly reproducible as above on a variety of datasets – enabling automation of knowledge extraction from large volumes of data.
- Data-source agnostic, so AHT is applicable to a large variety of problems, analyzed alone or in combination, using the same principles and technological modules we have developed.
- Solutions end-to-end in your enterprise – data generation – discovery – clinical trials – outcomes/value-based product sale and patient outcomes – quantitative support for evidence based medicine – inherent personalization in outcomes prediction – Advance Outcomes Alert System™ (AOAS™).
*the edit on May 09, 2017 involved a change in data in a step prior to what a customer would provide to ReSurfX::vysen
**Fig. 1 was changed on 07/17/17 -- the previous version can be seen here.