More Data, More Problems:
Data generation has never been a problem in the life sciences industry. Labs have been producing enormous amounts of data for many years, and the volume is only increasing thanks to recent technological advances in multi-omics and image-based analytical techniques.
Our data requirements have always been intensive and are only getting more complicated thanks to fields such as omics (genomics, transcriptomics, proteomics, metabolomics, etc.), biomarker analysis and clinical statistics, to name a few.
The time and complexity involved in managing data are growing faster than our capacity to generate it. Scientists often face seemingly insurmountable challenges in accessing, cleaning and managing data, when they should be spending their time analysing it and drawing inferences from it.
The most common problems faced by scientists are:
Availability and accessibility:
The availability and accessibility of data for business and exploratory needs are perhaps the foremost problems faced by scientists working in data-siloed organizations. Data is often stored in native or custom formats specific to each instrument, or to a scientific application built for a single business process. Scientists often hit roadblocks when searching for related data from their own processes, or end up chasing “data owners” to get access to relevant data.
Lack of Data Standards:
Data exchange is often seen as necessary for driving innovation, both between and within organizations, but it is frequently impeded by the lack of data standards among instruments, labs, business units and entire organizations. The need for data standards was clearly understood during the Ebola outbreak of 2014, when data sharing helped scientists trace the origins of the virus and control the epidemic. The need is reiterated in the current COVID-19 pandemic, with the WHO defining data sharing and reporting protocols. Despite these protocols, the COVID-19 pandemic has clearly highlighted how a lack of data sharing standards in real-world scenarios exposes the shortcomings of any data infrastructure and, in turn, of the health infrastructure.
Ownership and usage:
Most organizations do not have clearly defined owners for their data, and often the IT teams that manage the applications end up as de facto owners. This leads to scientists and research teams being dependent on IT teams to access and analyse data.
There’s also the unwanted evolution of the IT team into a common data analytics team, performing ETL operations on behalf of the data users.
A lack of self-service analytical tools or knowledge leads to dependency on data scientists, who perform ETL and draw inferences without complete knowledge of the process that produced the data. Organizations often end up hiring generalist data scientists who work across business functions without the scientific focus required for drug discovery research.
Data democratization is the process of making data accessible to everyone, with no gatekeepers creating bottlenecks at the data gateway. Democratization is not limited to access. It also requires that the accessed data is understandable, and that teams have a way to analyse and act on it. One of the earliest proponents of large-scale data democratization, Airbnb, has provided a stellar proof of concept for organizations to follow.
To understand how democratized data is within your organization, it helps to ask yourself a few simple questions. Can you:
- directly access all data relevant to your position
- search for datasets and inferences made by other teams related to your position
- access unprocessed data to perform exploratory analysis
- create new datasets that can be added to a central repository
- set up new experiments with the existing models and data
- access metadata information related to all datasets
If the answer to even one of these questions is No, then your organization needs to work on democratizing its data.
Consider the example of a batch control process in a bioprocess lab, where a scientist is trying to improve batch-end quality.
Usually, quality analysis is performed at the end of each batch by running full factorial tests on the measured variables. Data for the analysis must be gathered from quality-parameter sensors in order to identify accurate changes to the inputs. This is usually done with open-loop control, which does not happen in real time.
If you instead make the entire process data-centric, with data-driven models that let your team access and analyse data at every step, including the end result, scientists and engineers can work on controlling and optimizing batch performance.
Business units are always in need of data from different sources for optimization and decision-making. Enabling them to be data-centric can help achieve a paradigm shift in labs and industries of the future.
How do you democratize data?
Set up data infrastructure:
Set up data storage, platforms and analytical tools suited to your end use, keeping the core components of democratization in mind. Apply the FAIR principles (findability, accessibility, interoperability and reusability) to all data over time.
Restructure your team:
Move towards a decentralized data team, where every team has the ability and knowledge to perform the role of a data scientist.
Implement Metadata Hubs:
A Metadata Hub is a part of your infrastructure that goes a long way towards democratizing your data. Metadata Hubs make data accessible by indexing datasets, the contents of their tables, and their connections to other data. Combined with semantics, this lets any data user understand the origin, context and historical use of a data source before deciding to work with it.
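The indexing idea described above can be sketched in a few lines of Python. This is only an illustrative in-memory toy, not a description of any particular product: the `DatasetRecord` and `MetadataHub` classes, their field names and the example datasets are all hypothetical, and the sketch assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Metadata for one dataset (all field names are hypothetical)."""
    name: str
    owner: str
    description: str
    columns: list[str]
    derived_from: list[str] = field(default_factory=list)  # upstream dataset names


class MetadataHub:
    """Toy in-memory index of dataset metadata."""

    def __init__(self) -> None:
        self._records: dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        """Add or overwrite the metadata entry for a dataset."""
        self._records[record.name] = record

    def search(self, term: str) -> list[str]:
        """Return sorted names of datasets whose name, description or columns mention term."""
        term = term.lower()
        hits = []
        for rec in self._records.values():
            haystack = " ".join([rec.name, rec.description, *rec.columns]).lower()
            if term in haystack:
                hits.append(rec.name)
        return sorted(hits)

    def lineage(self, name: str) -> list[str]:
        """Return every upstream dataset of `name`, nearest ancestors first."""
        seen: list[str] = []
        queue = list(self._records[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._records:
                queue.extend(self._records[parent].derived_from)
        return seen


# Hypothetical example datasets from an assay workflow.
hub = MetadataHub()
hub.register(DatasetRecord(
    "raw_plate_reads", "assay-lab", "Raw absorbance reads per well",
    ["plate_id", "well", "absorbance"]))
hub.register(DatasetRecord(
    "normalized_reads", "assay-lab", "Reads normalized to plate controls",
    ["plate_id", "well", "norm_signal"], derived_from=["raw_plate_reads"]))
hub.register(DatasetRecord(
    "dose_response_fits", "pharmacology", "IC50 fits per compound",
    ["compound_id", "ic50_nm"], derived_from=["normalized_reads"]))

print(hub.search("well"))
print(hub.lineage("dose_response_fits"))
```

Searching tells a user which datasets exist, while the lineage lookup answers where a dataset came from before anyone decides to build on it. A production hub would add persistence, access control and a controlled vocabulary, none of which is attempted here.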
Quality checks and data stewards:
One of the biggest challenges in decentralizing data is maintaining the quality and consistency of datasets created by different teams. Open data formats, in-house QC tools, automated checks and anomaly detectors go a long way in helping with quality checks. Data stewards can help implement data governance standards for democratized data to ensure the quality of both content and metadata.
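As a rough illustration of what an automated check might look like, the sketch below runs three simple tests on a single numeric column: missing values, a plausible-range check, and a z-score anomaly flag. The function name `qc_report`, its parameters and the pH readings are made up for the example, and the z-score rule merely stands in for whatever anomaly detector an organization actually uses.

```python
import statistics


def qc_report(values, lower, upper, z_threshold=3.0):
    """Run simple automated checks on one numeric column.

    values: list of floats, with None marking a missing measurement.
    lower, upper: plausible range, assumed to be supplied by a data steward.
    z_threshold: how many standard deviations from the mean counts as anomalous.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    out_of_range = [v for v in present if not lower <= v <= upper]

    # Flag anomalies by z-score; skipped when there are too few points or no spread.
    anomalies = []
    if len(present) >= 2:
        mean = statistics.mean(present)
        stdev = statistics.stdev(present)
        if stdev > 0:
            anomalies = [v for v in present if abs(v - mean) / stdev > z_threshold]

    return {
        "missing": missing,
        "out_of_range": out_of_range,
        "anomalies": anomalies,
        "passed": missing == 0 and not out_of_range and not anomalies,
    }


# Hypothetical pH readings from one batch, with one gap and one suspicious value.
ph_readings = [7.0, 7.1, 6.9, 7.0, 7.2, 6.8, 7.0, 7.1, 6.9, 12.5, None]
report = qc_report(ph_readings, lower=0.0, upper=14.0, z_threshold=2.5)
print(report)
```

Note that the 12.5 reading sits inside the physically valid pH range, so only the anomaly check catches it. Z-scores are a blunt instrument on small samples, which is why the threshold is lowered here; a steward would tune or replace this rule per dataset.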
Realigning focus by training your team on ETL:
A decentralized team lets users perform Extract, Transform and Load (ETL) operations on their own data instead of relying on IT teams and data engineers. This reduces the burden on data engineers, who can then focus on work that matters across the organization, such as improving and maintaining the data infrastructure. It also creates an agile process in which business units have analysis-ready data whenever they need it.
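For readers new to the term, the three ETL steps can be sketched with nothing but the Python standard library. Everything specific here is an assumption made for the example: the CSV layout, the column names, the unit conversion and the `concentrations` table are hypothetical, and an in-memory SQLite database stands in for a real data store.

```python
import csv
import io
import sqlite3

# Hypothetical instrument export; a real one would be read from a file.
RAW_CSV = """sample_id,conc_mg_ml,operator
S1, 1.50 ,alice
S2,,bob
S3,2.25,alice
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no concentration and convert mg/mL to ug/mL."""
    clean = []
    for row in rows:
        raw = row["conc_mg_ml"].strip()
        if not raw:
            continue  # skip missing measurements
        clean.append((row["sample_id"].strip(), float(raw) * 1000.0))
    return clean


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concentrations "
        "(sample_id TEXT PRIMARY KEY, conc_ug_ml REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO concentrations VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute(
    "SELECT sample_id, conc_ug_ml FROM concentrations ORDER BY sample_id"
).fetchall()
print(rows)
```

Keeping the three steps as separate functions means a scientist can change the cleaning rules in `transform` without touching how data is read or stored.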
“Democracy is the art of thinking independently together.” - Alexander Meiklejohn
Data democratization allows life sciences and healthcare organizations to uncover opportunities in research data, expediting decision-making and, in turn, allowing solutions to reach patients faster.
A data-democratized organization has teams that are data-centric and empowered through a connected ecosystem and the knowledge to handle and analyse data.
Numerous hidden patterns and undiscovered insights in existing data can come to light, leading to accelerated drug discovery and development.