Overcoming the Curse of Dimensionality with Combinatorics

Large volumes of data (Big Data) usually suffer from a problem termed the ‘curse of dimensionality’ (CoD). Many statistical practitioners scoff at, or are at least very wary of, solutions that claim to perform well in data analytics and to deliver decisions/outcomes that overcome the CoD. It is therefore particularly refreshing to see that, in a recent blog for MassBio, the author (Loralyn Mears) considers that the CoD can be overcome, even though she does so in the context of wondering how recent efforts in Pharma to consolidate data as ‘data lake(s)’ and to use ‘stream computing’ are going to overcome it.

The curse of dimensionality is experienced in Big Data for many reasons, including:

  • the ‘multiple testing problem’: too many evaluations on one dataset, which increases the chance of at least one erroneous finding, and
  • ‘insufficient statistical power’: each of the two or more treatments or populations being compared has fewer members per group than is needed for what is considered reliable statistical analysis.
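
To make the multiple testing problem concrete, here is a minimal sketch in Python. It assumes that the tests are independent and that each one is run at the same significance level (0.05 here); real datasets rarely satisfy the independence assumption exactly. The function names are illustrative only and are not part of AHT or of any ReSurfX product.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes the tests are independent and each uses significance level `alpha`.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false positive rate near `alpha`.

    This is the standard Bonferroni correction, shown only as a familiar
    statistical baseline.
    """
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        print(
            f"{m:>5} tests: "
            f"chance of >=1 false positive = {familywise_error_rate(m):.3f}, "
            f"Bonferroni per-test threshold = {bonferroni_threshold(m):.6f}"
        )
```

Under these assumptions, ten tests already carry roughly a 40% chance of at least one false positive, and one hundred tests push that above 99%. The Bonferroni threshold shrinks in step, which is one way the second bullet (insufficient power) bites: the stricter the threshold, the more samples each group needs.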

 

Remember that the p-value, a central concept in statistics, is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true (for example, when evaluating whether each patient falls into a group having a certain disease or not). It is not the probability that the hypothesis itself is true, nor a guarantee of a particular level of error. Using the p-value correctly is a minefield: guidance has been issued to reduce its misuse, and the problem has also been covered in popular media (e.g., the John Oliver talk show, a Vox article).

The market offerings of ReSurfX leverage our invention, the Adaptive Hypersurface Technology (AHT), which overcomes the CoD effectively. The earliest public record of the motivating concept is its release by Suresh Gopalan in 2004, followed by a provisional patent application in 2005 (over 15 patents have been issued as of early 2020). As analytics, solutions based on AHT belong to the Machine Learning (ML) category (as opposed to statistics), and powerful outcomes often arise from packaging AHT with other tools as Artificial Intelligence (AI). For example, one solution, explicitly named ‘Advance Outcomes Alert System (AOAS)’ in 2012, has a multitude of applications, including predicting the direction of clinical trials, predicting patient outcomes and enabling value-based delivery of high-cost medical care products.

More recently, in an exclusive white paper series, ‘Improving Outcomes Through Enhanced Data Analytics and Artificial Intelligence’, we have been releasing results from extensive comparisons of our data-source-agnostic, AHT-based solutions with widely used ones in the sector and with an approach used in another commercial product (NextBio, part of the BaseSpace product of Illumina, Inc.). Besides being informative to the community at large, the results presented there use powerful examples to highlight often-overlooked problems, as well as the superior performance of AHT-based approaches, which is robust across data with a wide range of error properties.

These solutions are part of our enterprise-grade, cloud-based analytics platform, ReSurfX::vysen. Naturally, vysen leverages stream computing and other modern IT advancements to deliver these advantages, offering solutions for ‘Better Decisions and Outcomes’ that improve ROI and innovation through accuracy, personalization and novel insights.

Another interesting part of the blog by Loralyn Mears is that she brings up another important aspect, ‘combinatorics’, which provides immense opportunities as well as difficulties for effective Big Data utilization. In simpler terms, she points out that the sector has been good at using a few (about three) variables combinatorially, i.e., in different combinations, to achieve or predict an outcome of interest, but that as more parameters of interest come into play the problem becomes enormous: the ‘combinatorial explosion problem’.
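
A few lines of Python give a feel for the scale of this explosion. This is only a counting sketch: the variable counts are hypothetical, the function names are illustrative, and nothing here represents how AHT or ReSurfX::vysen selects parameters.

```python
from math import comb


def subsets_of_size(num_variables: int, subset_size: int) -> int:
    """Number of distinct ways to pick `subset_size` variables out of `num_variables`."""
    return comb(num_variables, subset_size)


def all_nonempty_subsets(num_variables: int) -> int:
    """Number of distinct non-empty variable combinations of any size."""
    return 2 ** num_variables - 1


if __name__ == "__main__":
    # Hypothetical dataset widths, chosen only to show the growth rate.
    for n in (10, 100, 1000):
        print(
            f"{n:>5} variables: "
            f"{subsets_of_size(n, 3):,} combinations of three, "
            f"about {float(all_nonempty_subsets(n)):.2e} combinations of any size"
        )
```

Even when restricted to combinations of three, moving from 10 to 100 candidate variables raises the count from 120 to 161,700, and allowing combinations of any size doubles the count with every variable added.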

The concept of AHT that we have been using, and offering to the market as solutions and products, put simplistically, identifies a ‘number of parameters’ suited to evaluating an outcome accurately, with combinatorics inherently embedded in it. Interestingly, the input parameters are not pre-determined and then reused for every evaluation, meaning that a combinatorial set of inputs can be used to determine an output or outcome. In other words, the set of parameters used by AHT can differ for each subject or other goal being evaluated (such as a gene’s expression in the current version of ReSurfX::vysen). This inherently adds robustness, accuracy and personalization, and it is a secret behind AHT performing so effectively in the face of unknown error properties in large datasets.

We took the MassBio blog’s use of ‘curse of dimensionality’ and ‘combinatorics’ in one place (though in the opposite context) as an opportunity to highlight some special aspects of AHT and the advantages and value it brings to your enterprise. We are proud to be ahead of the pack in offering you powerful solutions that overcome those two problems in utilizing Big Data, and that let you leverage the opportunities our solutions open up.

Through this LINK, you can reach the author of this blog (Suresh Gopalan) and the rest of the ReSurfX team to put our solutions and products to use in your setting.

CHECK OUT OUR OTHER BLOGS AND WHITE PAPERS HERE.
