Unlocking the Potential of Data for AI

Fabio Urbina - AIML Solutions Architect, Sashank Yakkali - Data Engineer, Atul Kakrana - Head Of Data Science | 12th November 5mins

Data Readiness

Your organization has decided to embark on a new journey and implement AI and ML solutions. However, you notice many blockers related to the state of your data: it is scattered across personal accounts, instrument-attached computers, inaccessible cloud drives, LIMS systems, and other devices. Worse, the context of the data is unclear: it exists in different versions and file types, and it lacks information on how it was generated. Even if the data is accessible, another crucial question remains: will my data even be useful for generating valuable AI/ML solutions? Without knowledge of what data exists or what state it’s in, it is impossible to propose, let alone evaluate, value-driven AI/ML solutions.

These questions lead back to a single initial thought: where do we start? Despite the importance of data in driving AI value, many organizations find themselves struggling with data readiness. Most biopharmaceutical organizations are not ready for AI deployment. Figure 1 shows common issues with data usability, including data silos (Figure 1A), data quality (Figure 1B), whether the data is meaningful for downstream use cases (Figure 1C), and data management (Figure 1D). Enabling Data Readiness at your organization is the critical first step for developing key strategic AI applications.


Figure 1. Common problems encountered when evaluating a Data Readiness strategy at biopharmaceutical organizations.

Achieving Data Readiness

The potential of an AI/ML solution to generate value depends heavily on the data that powers it. Every dollar spent on data generation is an investment in future downstream applications, but without Data Readiness, the return on investment is unclear and at risk. The fundamental concepts of Data Readiness are data quality and observability, data integration and availability, and data lineage and metadata management.

Data Quality and Observability

To determine if AI/ML can generate value for your organization, the first question that needs to be answered is “Is my data useful or not?” AI is only as good as the data it is trained with. Enabling Data Quality and Observability is required to answer this critical question, giving an indicator of the data’s value and relevance. Data Quality is a complex concept, with multiple factors including accuracy, consistency, and reliability of data. Some common attributes of Data Quality include the following:

Completeness: How much missing or incomplete data is there?
Accuracy: How error-free is the data?
Timeliness: How outdated is the data?
Comprehensiveness: How well does the data cover the question being asked?
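As a rough illustration of how two of these attributes could be turned into numbers, the sketch below scores completeness and timeliness for a handful of records. Everything in it is an assumption made for the example: the record fields (`sample_id`, `result`, `measured_on`), the 180-day freshness window, and the function names are hypothetical, and real scoring rules would depend on the data and the question being asked.

```python
from datetime import date

# Hypothetical assay records; None marks a missing value.
records = [
    {"sample_id": "S1", "result": 4.2, "measured_on": date(2024, 1, 10)},
    {"sample_id": "S2", "result": None, "measured_on": date(2024, 3, 5)},
    {"sample_id": "S3", "result": 3.9, "measured_on": None},
    {"sample_id": "S4", "result": 5.1, "measured_on": date(2023, 6, 1)},
]


def completeness(rows, fields):
    """Fraction of (row, field) cells that are populated."""
    total = len(rows) * len(fields)
    filled = sum(1 for row in rows for f in fields if row.get(f) is not None)
    return filled / total if total else 0.0


def timeliness(rows, date_field, as_of, max_age_days):
    """Fraction of dated rows no older than max_age_days on the as_of date."""
    dated = [row[date_field] for row in rows if row.get(date_field) is not None]
    if not dated:
        return 0.0
    fresh = sum(1 for d in dated if (as_of - d).days <= max_age_days)
    return fresh / len(dated)


completeness_score = completeness(records, ["sample_id", "result", "measured_on"])
timeliness_score = timeliness(records, "measured_on", date(2024, 4, 1), 180)
print(f"completeness={completeness_score:.2f}, timeliness={timeliness_score:.2f}")
```

Scores like these are only a starting point: accuracy and comprehensiveness usually need domain knowledge or reference data and cannot be read off the records alone.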

Data Quality metrics give a general view of the value and usability of data for downstream data science or AI/ML applications. In addition to answering “Is this data useful?”, Data Quality can help uncover and monitor biases that may exist in the data, preventing decision-making headaches down the road. Data Observability is defined as the ability to gain comprehensive insights into the behavior, performance, and quality of data systems, processes, and workflows across the data production ecosystem. It indicates which downstream applications can be developed and provides timely alerts when data problems occur. In the absence of Data Observability, data issues can go undetected for months, leading to inaccurate models and lost value.
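One simple way to picture the alerting side of observability is a check that runs over each incoming batch and flags threshold breaches. The sketch below is a minimal, hypothetical example: the batch summaries, the `row_count` and `null_rate` fields, and the thresholds are all assumptions for illustration, not a description of any particular monitoring tool.

```python
def check_batches(batches, min_rows, max_null_rate):
    """Return human-readable alerts for batches that breach the thresholds."""
    alerts = []
    for batch in batches:
        if batch["row_count"] < min_rows:
            alerts.append(
                f"{batch['name']}: row count {batch['row_count']} below {min_rows}"
            )
        if batch["null_rate"] > max_null_rate:
            alerts.append(
                f"{batch['name']}: null rate {batch['null_rate']:.0%} "
                f"above {max_null_rate:.0%}"
            )
    return alerts


# Hypothetical per-batch summaries produced by an ingestion job.
batches = [
    {"name": "plate_reader_mon", "row_count": 960, "null_rate": 0.02},
    {"name": "plate_reader_tue", "row_count": 12, "null_rate": 0.02},
    {"name": "plate_reader_wed", "row_count": 960, "null_rate": 0.25},
]

alerts = check_batches(batches, min_rows=100, max_null_rate=0.10)
for alert in alerts:
    print(alert)
```

Running a check like this on every batch, rather than once before model training, is what turns a quality metric into an observability signal.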


Figure 2. Several core concepts are required to support Data Readiness.

Data Integration and Data Availability

Data must be available in the desired format for downstream AI/ML applications. This requires data to be extracted and integrated into a common data model or a form amenable to AI/ML applications. Timeliness of data availability is also a core concept of Data Readiness. Data is being generated constantly, and the time-to-value of this new data depends on how soon it is integrated into current AI/ML solutions. Data that takes months to integrate would not be considered “available”, and AI applications that are constantly out of date would be less valuable. Put another way: the sooner your data is integrated into an AI/ML solution, the sooner it generates value.

Data Lineage and Metadata Management

Data Lineage and Metadata Management tie closely to Data Observability. Data Lineage provides a record of the data’s history, from its source to the current version. This includes tracking every time the data is changed. Data Observability and Metadata Management together enable data reusability, thus empowering organizations to pursue secondary applications and other potential AI solutions.
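To make "tracking every time data is changed" concrete, the sketch below keeps an append-only log in which each event stores a fingerprint of the data after a processing step and a pointer to the previous fingerprint. It is only an illustration under stated assumptions: the `LineageLog` class, the step names, and the source file name are hypothetical, and production lineage systems capture far more (who, when, which code version).

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """Append-only history of a dataset, from its source to the current version."""

    source: str
    events: list = field(default_factory=list)

    def record(self, step, data):
        """Append one event with a SHA-256 fingerprint of the data after the step."""
        payload = json.dumps(data, sort_keys=True).encode("utf-8")
        self.events.append(
            {
                "step": step,
                "sha256": hashlib.sha256(payload).hexdigest(),
                # Link to the previous version so the chain can be walked back.
                "parent": self.events[-1]["sha256"] if self.events else None,
            }
        )


# Hypothetical raw export in which numeric results arrive as strings.
raw = [{"sample_id": "S1", "result": "4.2"}]
log = LineageLog(source="instrument_export.csv")
log.record("ingest", raw)

cleaned = [
    {"sample_id": row["sample_id"], "result": float(row["result"])} for row in raw
]
log.record("cast_result_to_float", cleaned)

for event in log.events:
    print(event["step"], event["sha256"][:12], event["parent"])
```

Because each event points at its parent, any version of the data can be traced back step by step to the original source.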

Data Readiness is synonymous with AI/ML Readiness

Data is the key component for the quality, consistency, and value of AI-driven solutions. With a data-centric model of AI/ML solutions, Data Readiness has emerged as the key factor in generating value from AI. We have discussed here the key components of Data Readiness, which we strongly encourage organizations to adopt in order to be ready for future AI and ML applications and solution development. Data Readiness enables data to meet the FAIR principles of Findability, Accessibility, Interoperability, and Reusability. A Data Ready organization will find itself in an advantageous position to adopt the AI technologies of the future. We at Zifo are dedicated to unlocking the value of AI/ML, and it starts with Data Readiness.

This is the first article of a three-part series. It introduces the concept of Data Readiness. The next articles will focus on Data Observability and Data Products.

If you would like to know more about our experience, please contact info@zifornd.com.