MORE DATA, MORE PROBLEMS:
Data generation has never been a problem in the life sciences industry. Labs have been producing enormous amounts of data for many years, and the volume is only increasing thanks to recent technological advances in multi-omics and image-based analytical techniques.
Our data requirements have always been intensive and are only getting more complicated thanks to fields such as omics (genomics, transcriptomics, proteomics, metabolomics), biomarker analysis, and clinical statistics, to name a few.
Managing this data is becoming increasingly time-consuming and complex, and our ability to manage it is not keeping pace with our capacity to generate it. Scientists often face formidable challenges in accessing, cleaning and managing data, and end up spending their time on those tasks rather than on analysing the data and drawing inferences from it.
The most common problems faced by scientists are:
Data Availability and Accessibility:
The availability and accessibility of data for business and exploratory needs are perhaps the foremost problems for scientists working in organizations with siloed data. Data is often stored in native or custom formats specific to each instrument, or in scientific applications built around a single business process. Scientists often hit roadblocks when searching for related data from their processes, or end up chasing “data owners” to get access to relevant data.
Lack of Data Standards:
Data exchange within and between organizations is widely seen as necessary for driving innovation, but it is often impeded by the lack of data standards among instruments, labs, business units, and entire organizations. The need for data standards became clear during the Ebola outbreak of 2014, when data sharing helped scientists trace the origins of the virus and control the epidemic. The need has been reiterated in the current COVID-19 pandemic, with the WHO defining data sharing and reporting protocols. Even with those protocols defined, the pandemic has highlighted how the lack of data sharing standards in real-world scenarios exposes the shortcomings of data infrastructure and, in turn, of health infrastructure.
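To make the standards problem concrete, here is a minimal sketch of what a shared data standard buys in practice: two instruments report the same kind of measurement under different field names and units, and each export is mapped onto one common record. Everything in it is an assumption made for illustration (the `Measurement` record, both export layouts, the field names, the units and the sample values); none of it comes from a real instrument or a published standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical shared record that every source is mapped onto."""
    sample_id: str
    analyte: str
    value_ng_per_ml: float


def from_instrument_a(row: dict) -> Measurement:
    # Hypothetical export layout A: concentration is already in ng/mL.
    return Measurement(
        sample_id=row["SampleID"],
        analyte=row["Analyte"].lower(),
        value_ng_per_ml=float(row["Conc_ng_mL"]),
    )


def from_instrument_b(row: dict) -> Measurement:
    # Hypothetical export layout B: concentration is in ug/mL,
    # so convert to ng/mL (1 ug = 1000 ng).
    return Measurement(
        sample_id=row["sample"],
        analyte=row["compound"].lower(),
        value_ng_per_ml=float(row["conc_ug_ml"]) * 1000.0,
    )


# Illustrative values only, not real data.
records = [
    from_instrument_a({"SampleID": "S-001", "Analyte": "Marker1", "Conc_ng_mL": "12.5"}),
    from_instrument_b({"sample": "S-002", "compound": "marker1", "conc_ug_ml": "0.02"}),
]

for record in records:
    print(record)
```

Once both exports land in the same record type, they can be searched, compared and shared without each consumer having to know the quirks of each instrument. Without an agreed standard, every lab ends up writing and maintaining its own version of these mapping functions.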
Ownership and usage:
Most organizations do not have clearly defined owners for their data, and the IT teams that manage the applications often end up as de facto owners. This leaves scientists and research teams dependent on IT teams to access and analyse data.
As an unwanted side effect, the IT team also evolves into a shared data analytics team, performing extract, transform, load (ETL) operations on behalf of the data users.
Lack of Self-service Analytical Tools:
A lack of self-service analytical tools, or of the knowledge to use them, leads to dependency on data scientists to perform ETL and draw inferences without complete knowledge of the process that generated the data. Organizations often end up hiring generalist data scientists who work across business functions without the scientific focus required for drug discovery research.