AI for Enterprises. Part 1: Where are we in tackling the popular adage GIGO? [Extended version]

The recent AI wave, driven by a tremendous ability to utilize data, helps us gain more knowledge from data, make valuable predictions, create novel applications (e.g., driverless cars) and improve nearly all sectors of business as well as our everyday life. This great stride is brought about by a confluence of progress in IT and other technologies that enable the generation and collection of data, its storage, handling and compute, together with concomitant efforts to improve sophisticated approaches for processing that data. In this post we explore the status of our ability to tackle the popular adage "Garbage In, Garbage Out" (GIGO) so as to maximize the opportunities and enable progress. This is the first of a two-part post on adopting AI at scale. Centered on the theme of GIGO, we will also explore early successes of (true and pseudo) AI-based solutions, implications for subsequent progress, preparing businesses to reap value from these solutions, and setting appropriate strategies for digital and AI transformation. We take this opportunity to highlight advantages of the approach the outcomes intelligence company ReSurfX is taking with excellent success.

This is a longer version of the post with the same title at Massachusetts Technology Leadership Council (MassTLC), published on June 20, 2022, that was also spotlighted in the MassTLC Newsletter of June 22, 2022.

In the last decade we have made such tremendous progress in digital technologies, facilitating far more advanced and newer uses of data, that nearly every data-based application is now referred to as Artificial Intelligence (AI). The popular adage "Garbage In, Garbage Out" (GIGO) needs no elaboration for practitioners or business leaders. When we talk about AI or digital advances, both in enterprises and in society, a major factor in play is our ability to tackle GIGO. While we have reached this tipping point of AI only now, many of the tools we use today rely on more or less the same conceptual principles that were tried before and dismissed as unsuitable for these goals as little as a decade or two ago. In this post we will explore early successes of (true and pseudo) AI-based solutions, implications for subsequent progress, and preparing businesses to reap value from digital and AI solutions.

Consistent with the theme indicated in the title, we will focus on where we are in handling GIGO and some reasons why things are as they are, and scan the emerging landscape through my lens as the leader of the outcomes intelligence company ReSurfX (with due thanks to our other team members). For perspective, ReSurfX is an outcomes intelligence company that improves innovation and ROI for enterprises from their data-intensive initiatives by leveraging a novel machine learning (ML) approach, 'Adaptive Hypersurface Technology' (AHT). Given the use of a novel approach as the mothership, we do many times more validation than most in the industry, evaluating AHT-based solutions both on their own and against alternative ML and analytical approaches available in the market. AHT-based solutions incorporated into our SaaS product and tested at scale yield both complementary insights and enhanced outcomes that are robust and valuable.

In using data at scale (Big Data), our thematic GIGO statement plays a role both in the input data and in the solutions that utilize them and feed processed data into each subsequent stage of an application. The former relates to input data quality; the latter to the ability of processing pipelines to handle problems that pertain to data quality and other properties often associated with Big Data. The larger the workflow, whether (i) in terms of the number of operations the data are subjected to or (ii) in long product/application development cycles such as drug development in Pharma, where downstream activities are often far more expensive than early stages, the more the effect of errors propagates along the pipeline (and over years when the business has long product development cycles) with compounding effect.
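To illustrate the compounding, here is a minimal back-of-the-envelope sketch (not ReSurfX code; the stage names and per-stage error rates are hypothetical) of how modest error rates at each step combine across a multi-stage pipeline:

```python
# Minimal sketch: how modest per-stage error rates compound along a pipeline.
# The stage names and error rates below are hypothetical, for illustration only.
stages = {
    "generation": 0.02,
    "collection": 0.03,
    "cleaning": 0.02,
    "assembly": 0.04,
    "modeling": 0.05,
}

# Probability that a record survives every stage without being corrupted,
# assuming errors introduced at each stage are independent.
clean_fraction = 1.0
for name, error_rate in stages.items():
    clean_fraction *= (1.0 - error_rate)
    print(f"after {name:<11} cumulative error ~ {1.0 - clean_fraction:.1%}")
```

Even with no stage exceeding 5% error, the cumulative error after five stages approaches 15%, and the longer the pipeline (or the more years a development cycle spans), the worse this gets.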

When referring to Big Data applications here, we mean both those that utilize enormous volumes of data from one or a few sources and those that apply to smaller data volumes by combining a variety of sources, including near real-time data from the edge (such as smartphones, wearables, IoT, etc.).

When referring to data quality there are also other factors, such as societal bias embedded in the data, that at this point need other kinds of conscious effort, involving people from multiple specialties, beyond data-driven or computation-based approaches. The remedy for these may not be amenable to automation for some time to come, or the reliability will be sporadic and dangerous. Some facets of this class of problem are discussed in the blog posts "How does AI, Waymo self-driving taxi launch and Google indicate room for another search engine?" and "How Bad Data Is Undermining Big Data Analytics."

Market definition of AI, ‘riding the wave’ and classes of early successes

Contrary to what is often misstated, AI is not synonymous with machine learning; the same applies to using machine learning as a synonym for neural networks or deep learning. AI is a combination of data, analytics and machine learning with other technologies and tools, and, at least at this stage, it is highly dependent on specialized knowledge to improve what we get out of the automated systems. Though there are excellent genuine and novel cases (e.g., autonomous vehicles), a significant proportion of solutions are pseudo-AI at this time. The latter are the classic 'riding the wave' and rely on the 'Tinkerbell effect': they are either (i) aimed at small niches, riding on the buzzword AI, and will fall apart in a short period, or (ii) useful solutions that might have struggled to gain market acceptance in the past but are now accepted by aligning with the term AI.

We know that sophisticated and well understood solutions that were considered not practically useful just over a decade ago give valuable new insights when applied to large volumes of data (Big Data), and are being utilized by many organizations in every industry spanning logistics, autonomy, finance, medicine, sentiment and behavior prediction, etc. These successes, while shedding light on the immense advantages we can derive, also imply that our current advantages often depend on the value of the far fewer correct insights, despite the significantly larger number of insights that are incorrect. This highlights the amount of resources and effort we are wasting to get those insights.

Nevertheless, in toto these solutions, technologies and applications have proven to confer extremely valuable business advances and to enable societal progress. We can attribute a significant proportion of early successes to two different factors: (i) Sheer scale: in many cases our current ability to apply those solutions (dismissed a decade or so ago) at enormous scale in terms of data (Big Data) increases the predictive power and consequently the chances of uncovering insights otherwise not possible and getting a blockbuster. (ii) Prior deductions needing additional support: this class of success comes from practical, knowledge-based intuitions or deductions that were waiting in the wings with insufficient proof to justify expenditure; we pick those insights from among the enormous number of wrong ones.

Even a Glass Far Less Than Half Full Provides Immense Business Value; However, Leaders Beware

The value of these early successes referred to above, even from the small proportion of correct insights, demonstrates enormous promise and consequently provides the impetus and resources to advance the development and applications of AI. Thus the glass is far from half full.

Novel applications and continued innovations in data and digital technology are expected to evolve and mature for at least a decade or two. However, the immense potential for yet-uncovered value also demonstrates that we can achieve enormously improved outcomes and better utilization of resources by paying attention to these problems, or, conversely, it reflects the enormous amount of unproductive resource being expended at this time. If business and technical leaders from other disciplines do not keep themselves adequately informed of these problems, the result will be a build-up of the wrong capabilities, infrastructure and commitments to solutions that will not be useful beyond the short term. Hence there are significant efforts to educate business leaders on data, associated technology and AI to help with their digital transformation strategies.

However, the value of the progress happens to be enormous, indicating that even though the sophistication we operate with at any point is far from what we want or could achieve, and despite other significant shortcomings, we are able to reap huge rewards and make incredible societal progress. Yes, glass-half-full is a great positive attitude that has historically gotten us to remarkable progress.

My favorite example to highlight the point of 'immense value despite huge assumptions or unknowns' and 'progress even with significant known limitations' in the applications of innovation referred to above is aircraft and spacecraft. In aircraft design and development, nearly all theories that guide practical aspects of design make approximations, such as treating air as an incompressible fluid; despite that, we have had incredible successes. If you have visited a museum and seen pieces of the spacecraft that carried men to the moon, or that we keep sending into outer orbits, you can see that the parts are not necessarily sophisticated.

However, the example above should not be used as motivation to be content with the status quo (doing the best we can) in this case, as that will backfire in the near future. The returns from the current level of practical usability will soon become far from effective. In addition, the capabilities built will be obsolete or poorly suited to emerging needs; with rapid and iterative progress, the information derived from them will become confusing. Refocusing will then take enormous effort, in addition to the time lost and the need for many resources afresh.

The GIGO Monster Is a Formidable Challenger in Extracting Insights from Data Every Step of the Way

GIGO is a major factor limiting the extent of value extraction and the reliability of knowledge extracted from data. GIGO effects arise at every major step of data utilization, including the quality of data generation, collection, cleaning, assembly, exchange, input for processing, intermediate outputs from processing steps, and evaluation approaches. Several failed attempts by Google in the past to utilize its immense prowess with data for healthcare-related applications, and IBM's famous Watson, now a failed effort in healthcare, can both be significantly attributed to data quality (GIGO effects). When referring to input data quality there are also other, deeper factors, such as societal bias embedded in the data, that are difficult to detect or automate away.

Despite the enormous efforts and the incredible sophistication in data and digital solutions that led to this 'AI wave', they all surprisingly suffer from classical (i.e., well known) problems that significantly reduce knowledge extraction capability and the reliability of predictions. These include: (i) even specialized solutions often do not perform uniformly across datasets of the same kind from different origins or different times; (ii) difficulty with, and misuse of, metrics of accuracy, such as the well-known problems with misuse of probability or p-values, which have even spilled into the mass media (e.g., John Oliver exposes how the media turns scientific studies into "morning show gossip" – relevant YouTube video); and (iii) solutions developed are often highly specialized for their input data, and buyers are not yet sophisticated enough to evaluate them and recognize the success scenarios and limitations applicable to their use (e.g., effectively testing for the classical case of overfitting). These classes of problem can be considered a specialized form of GIGO manifestation where the error does not stem from errors in the data but from data properties at scale not conforming to the assumptions used in the processing and insight extraction approaches, the need to evaluate practices suited to using these innovations, etc. An aspect of this is discussed in the ReSurfX blog "Overcoming the Curse of Dimensionality with Combinatorics".
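As a minimal illustration of points (i) and (iii) above, the sketch below (hypothetical data and an off-the-shelf model, not a ReSurfX method) trains on data from one 'origin' and scores it on the same kind of data from a second origin with a slightly shifted distribution; the gap between the two scores is a simple signal of overfitting to the source dataset:

```python
# Minimal sketch (hypothetical data, not a ReSurfX method): a model tuned on one
# data source can look accurate there yet degrade on the same kind of data from
# another origin -- a simple way to surface overfitting / non-uniform performance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_source(n, shift):
    """Simulate one data origin; `shift` mimics batch/site differences."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 20))
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 2 * shift).astype(int)
    return X, y

X_a, y_a = make_source(500, shift=0.0)   # origin A: used for training
X_b, y_b = make_source(500, shift=0.4)   # origin B: same kind of data, different origin

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_a, y_a)

print("accuracy on origin A (training source):", accuracy_score(y_a, model.predict(X_a)))
print("accuracy on origin B (unseen origin):  ", accuracy_score(y_b, model.predict(X_b)))
```

A buyer who only ever sees the first number has no way to recognize the limitation the second number exposes.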

As noted earlier, our thematic GIGO plays a role both in the quality of the input data and in the solutions that utilize them and feed processed data from each stage of an application into the next. The former relates to input data quality; the latter to the quality of processing pipelines in handling problems that pertain to data quality in general and to properties that are often specific to, or exacerbated in, Big Data.

Emphasizing and reinforcing the above: when validating solutions we build at ReSurfX, we find that even for technologies and applications where we have invested an enormous amount of brainpower and resources, there is often over 30% error in the information derived from data (a form of processed prior knowledge used as truth) that is currently used to train models and develop AI solutions.

Slaying the GIGO Monster for a Better Tomorrow

In the rush to reap the low-hanging fruit in our capitalistic pursuits and the proverbial rat race, even in the research community, GIGO, the monster of a problem, is often overlooked as an area for innovation, or as an opportunity to extend innovations to tackle the fundamental problems that significantly reduce the value derived from the AI wave.

One major cause of data quality and other performance-related GIGO problems that limit what digital sophistication can deliver often boils down to the simple fact that error properties are unknown in Big Data (i.e., they vary even across a single dataset, and in ways that are not predictable), thus limiting the ability to model errors and in turn limiting the scalability of solutions for reliable value extraction. In recent times several efforts have gotten underway to tackle this challenge. The effects of this data property and of the aforementioned classical problems manifest as the classical GIGO adage. However, problems like inherent bias are far less amenable to such data-driven or computation-based automated approaches, at least for some time to come, and will also involve painstaking efforts by people from multiple specialties.
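A toy illustration of non-uniform error properties (all numbers are hypothetical): below, the noise level drifts across the dataset, so a single global error estimate understates uncertainty in one region and overstates it in another, which is exactly what breaks models that assume a fixed error structure.

```python
# Toy sketch of non-uniform (heteroscedastic) errors: a single global noise
# estimate misrepresents uncertainty in different regions of the same dataset.
# All numbers are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=5000)
noise_sd = 0.2 + 0.3 * x               # error magnitude drifts across the dataset
y = 2.0 * x + rng.normal(0, noise_sd)  # observed signal with varying noise

residuals = y - 2.0 * x                # residuals around the true trend
global_sd = residuals.std()            # one-size-fits-all error estimate

print(f"global error estimate : {global_sd:.2f}")
print(f"error where x < 2     : {residuals[x < 2].std():.2f}")  # global estimate too pessimistic here
print(f"error where x > 8     : {residuals[x > 8].std():.2f}")  # global estimate too optimistic here
```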

We at ReSurfX posited that dramatic improvements in accuracy and novel insights can only happen through innovation outside the mainstream framework, given that error properties are often non-uniform in Big Data and that most analytic shortcomings result from model assumptions that are not robust enough to handle that [CIO Review, 2017].

ReSurfX, as an outcomes intelligence technology company, improves enterprises' innovation and ROI through accurate and robust novel insights and advance prediction of the direction of outcomes from their data-intensive activities. ReSurfX does this by leveraging a novel, data-source-agnostic machine learning approach we developed, the 'Adaptive Hypersurface Technology' (AHT), which significantly overcomes GIGO among other problems that affect most ML and AI solutions. We provide functionalities based on AHT through an enterprise SaaS platform, ReSurfX::vysen. The remarkable predictive power of the System Response based Triggers and Outcomes Predictor (SyRTOP) solution in ReSurfX::vysen, leveraging AHT, is evident in its accuracy, robustness, novelty of insights and ability to predict outcomes far in advance. We are developing the last of those capabilities as an Advance Outcome Alert System. For example, we have shown that SyRTOP can predict, from the system response using a single reporter variable, adverse drug interactions that were identified, and recommendations effected, by the FDA (US Food and Drug Administration) and the AHA (American Heart Association) based on post-market surveillance of drugs (i.e., continued monitoring in the large population of users after approval). In addition, ReSurfX::vysen can provide accurate knowledge repositories (KRs) of proprietary customer data, which can improve the other predictors and workflows customers use by supplying highly accurate ReSurfX::vysen-processed data as input. More details on these functionalities, on features of the ReSurfX::vysen delivery platform, and on solutions being developed by leveraging AHT are in the post "ReSurfX in 2021 – Best-in-class Outcome Predictors, Innovation Catalysts and ROI Multipliers" and other blog posts on the ReSurfX website.

A version of this post, ending with a summary at this point, is on the Massachusetts Technology Leadership Council (MassTLC) website.

The Beachhead Solutions in ReSurfX::vysen: Reasons for Their Choice, Their Power and Value, and Broad Extensibility

The above-mentioned single reporter variable for system response in the beachhead product and solutions in the market is gene expression. ReSurfX chose gene expression data as a beachhead application and demonstrated that SyRTOP applied to drug treatment response data can predict subtle and impactful medical, biological and chemical drug interactions. The insights could also predict drug interactions with highly impactful validations, such as those uncovered by the US FDA and the American Heart Association from post-market surveillance data, for which they had issued guidance not to co-prescribe those drugs. Our at-scale value evaluation and results have shown that these advantages and improvements of ReSurfX solutions powered by AHT are the norm and not the exception. Such powerful prior knowledge supporting insights uncovered by SyRTOP also supports the value of the previously unknown insights uncovered.

A few important aspects deserve highlighting here:

  1. Even with gene expression as the system reporter, the application space is very large when chosen with the right business- and expense-based needs in mind. What was proven for drug interactions has many other applications, including disease progression, early or pre-disease detection, effectiveness of treatments, near real-time choice of drugs and treatments, and more.
  2. The number of applications directly adjacent to the use case outlined (similar in data and outcomes) and of extensions of it is large, and these are poised to add innovation and outcome-improvement value.
  3. AHT is data-source agnostic, and the addition of more reporter variables would only increase the power of the solution; that the predictor has this much power even with a single system-status reporter makes it all the more remarkable.

However, the choice of gene expression as the beachhead had several other important reasons, including:

  1. The data generation platforms from which input is currently enabled represent two extremes of the data properties used in most real-world prediction solutions for decisions and outcomes involving multiple system-state reporters.
  2. Analysis of data from these platforms has been researched extensively by the research community for over two decades; that we still find over 30% error in the knowledge repositories despite this makes the improvement we deliver all the more significant.
  3. The data types represented by the beachhead are amenable to both numerical validation and biomedical validation, such as analysis of the information content associated with data use.

ReSurfX's efforts in at-scale validation of the product and its functionalities (i.e., technology de-risking) and in comparative evaluation against other available approaches used all of these advantages. Results of at-scale validations using all these approaches also demonstrated significant improvement by the AHT-based solution in ReSurfX::vysen over the alternative approaches available, including contemporary machine learning solutions.

The data we have released show that component functionalities used individually improve enterprises' existing workflows, thus catering to the need for such data/digital products to be integrative. Important needs of enterprises (and characteristics of any integrative software application) include the ability to continue to utilize the solutions they currently use while gaining novel as well as additive value: being disruptive without disrupting their established approaches. Other important characteristics and features of ReSurfX::vysen and some details of the product expansion plan are outlined in "ReSurfX in 2021 – Best-in-class Outcome Predictors, Innovation Catalysts and ROI Multipliers" and in other information throughout the ReSurfX website.

In a forum with many medical professionals and other innovation and investment folks, we heard that the current working knowledge in the field of medicine is that the standard error rate is 15%. Though it is difficult to assess at this time what percentage of that error is amenable to improvement, there are many other facets (including those stated earlier) that we think the above estimate does not include and that can be improved in the current AI wave. Be it for actionable insights and prediction of outcomes, as we do at ReSurfX, or for other processing steps that are part of many analytics and ML (hence AI) efforts, all are dependent on the prior data that we call Knowledge Repositories. Validating ReSurfX solutions along the way of our development, prioritization and articulation of value to customers has shown that the current KRs that our ML and AI efforts rely on often have over 30% error, which does not even take into consideration the kinds of errors and reasons included in the 15% figure mentioned above. This is likely one of the key reasons for the failure of Watson and the other data solutions mentioned earlier, and it is a determinant of the effectiveness of all new solutions. The approach we took at ReSurfX to tackle this problem, providing productized solutions to our customers in the form of custom repositories they can use for their workflow needs rather than AHT-based solutions designed to predict directly from data, reinforces the power of thinking outside the mainstream framework when designing solutions.

On top of that, there are significant at-scale efforts in which all past data are being digitized (e.g., electronic medical records, EMRs), and huge NLP efforts (among many others) are ongoing, including applications of advances like GPT-3 and other OpenAI models tailored to extract medically relevant keywords for use with other knowledge-gleaning parts of AI solutions. These are bound to have even more variations and inaccuracies, owing both to fragmented data and to processes that historically were not well standardized.
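As a rough, hypothetical illustration of why error in a knowledge repository matters: if the labels used as 'truth' carry roughly 30% error, any model scored against that repository appears capped at around 70% accuracy, no matter how good it actually is, and comparisons between solutions become unreliable. A minimal sketch (synthetic data, not a ReSurfX method):

```python
# Hypothetical sketch: knowledge-repository labels with ~30% error cap the
# apparent accuracy of a model evaluated against them, even when the model
# itself tracks the underlying ground truth well.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X = rng.normal(size=(4000, 10))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)            # unknown ground truth
flip = rng.random(y_true.size) < 0.30                    # 30% of KR entries are wrong
y_kr = np.where(flip, 1 - y_true, y_true)                # noisy "knowledge repository" labels

model = LogisticRegression().fit(X[:2000], y_kr[:2000])  # trained and evaluated via the KR
pred = model.predict(X[2000:])

print("apparent accuracy vs noisy KR labels:", accuracy_score(y_kr[2000:], pred))
print("true accuracy vs ground truth       :", accuracy_score(y_true[2000:], pred))
```

The gap between the two numbers is invisible in practice, because the ground truth is exactly what the repository was supposed to supply.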

Although the data-source-agnostic mothership AHT makes the applications of solutions developed with it sector-agnostic, the primary focus of ReSurfX is on Pharma, Biotech and Healthcare enterprises and on healthcare economics, i.e., the healthcare sector.

In summary, we highlighted the remarkable strides in data and digital solutions in the last couple of decades, brought about by "the Big Data wave that transformed into practically applicable solutions as the AI wave," which is likely to continue adding value to nearly all facets of society for a long time to come. We also noted that these advances are mired, to a significant and surprising degree, by the classical problem of GIGO, indicated how much more value these advances can confer on commercial and social needs by tackling it, and discussed how businesses can strategize their digital transformation by being educated on these issues. We outlined some causes of, and approaches to overcome, this problem to look out for in developing your technology (including digital and AI) strategies, including the premise behind, and the novel ML approach of, AHT developed by ReSurfX and leveraged in powerful market solutions that continue to expand.

ReSurfX solutions, including the SaaS product ReSurfX::vysen, help maximize the value of your enterprise by improving innovation and ROI through powerful, novel and integrative offerings that leverage AHT and are inherently built to tackle problems including those discussed in this post. These ReSurfX solutions address problems that fall at the intersection of many disciplines by leveraging AHT, evolving advances, insights from other sectors and market needs, which are the key to embedding modern advances into digital and AI strategy effectively.

**First sentence was edited for structure on 06/22/22.
**Links to external contents edited on 06/25/22.
** The words "[Extended version]" were added to the title on 06/30/22.
