Processing neurology clinical data for knowledge discovery: Scalable data flows using distributed computing

Document Type

Book Section


Rapidly advancing neurotechnologies are generating massive volumes of complex multi-modal data. This neurological big data can be leveraged to provide new insights into complex neurological disorders using data mining and knowledge discovery techniques. For example, electrophysiological signal data consisting of electroencephalogram (EEG) and electrocardiogram (ECG) recordings can be analyzed for brain connectivity research, for physiological associations with neural activity, and for the diagnosis and care of patients with epilepsy. However, existing approaches to storing and modeling electrophysiological signal data have several limitations, which make it difficult to use signal data directly in data analysis, signal visualization tools, and knowledge discovery applications. Therefore, the use of neurological big data for secondary analysis and the potential development of personalized treatment strategies require scalable data processing platforms. In this chapter, we describe the development of a high-performance data flow system called Signal Data Cloud (SDC) to pre-process large-scale electrophysiological signal data using open source Apache Pig. The features of this neurological big data processing system are: (a) efficient partitioning of signal data into fixed-size segments for easier storage in a high-performance distributed file system, (b) integration and semantic annotation of clinical metadata using an epilepsy domain ontology, and (c) transformation of raw signal data into an appropriate format for use in signal analysis platforms. We also discuss the various challenges facing the biomedical informatics community in the context of Big Data, especially the increasing need to ensure data quality and scientific reproducibility. © Springer International Publishing AG 2016.