Data and Parallelism Design

Description of Model and Data
Technical Description of the Parallel Application

Description of Model and Data

As mentioned earlier, GENIE3 is a Python package that uses random forests to compute the relative pairwise significance between genes. Both the paper and the package are openly available online. We obtained healthy human gene expression data from the Expression Atlas, specifically from the HipSci Project. The dataset contains RNA sequencing measurements of 49,196 genes across 193 human subject samples. With N = 49,196, the analysis requires N random forest computations and 1000*N decision trees (1,000 per forest), and the resulting pairwise significance matrix contains N^2 entries, which is more than 2.4 billion. A workload of this size is what motivates the parallel program we designed.
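To make the scale concrete, the counts above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the gene, sample, and tree counts come from the text, while the 8 bytes per matrix entry (a dense float64 matrix) is our assumption for illustration.

```python
# Workload size implied by the dataset described above.
N_GENES = 49_196          # genes (from the text)
N_SAMPLES = 193           # human subject samples (from the text)
TREES_PER_FOREST = 1_000  # implied by "1000*N decision trees"

n_forests = N_GENES                    # one random forest per target gene
n_trees = TREES_PER_FOREST * N_GENES   # total decision trees
n_entries = N_GENES ** 2               # entries in the pairwise matrix

# ASSUMPTION: entries stored as dense float64 (8 bytes each).
BYTES_PER_ENTRY = 8
matrix_gib = n_entries * BYTES_PER_ENTRY / 2**30

print(f"forests:        {n_forests:,}")
print(f"trees:          {n_trees:,}")
print(f"matrix entries: {n_entries:,}")
print(f"dense matrix:   {matrix_gib:.1f} GiB (float64 assumed)")
```

Under that storage assumption the full matrix would not fit in the memory of the smaller instances listed below, which is one more reason to split the work across machines.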

Technical Description of the Parallel Application

The parallel application is a hybrid model consisting of two parts:

Part 1: Using AWS SageMaker to parallelize and orchestrate the random forest computation of pairwise gene correlations across multiple EC2 instances, which together form a distributed-memory parallel system.
Part 2: Using PySpark on an AWS EMR Spark Hadoop cluster to build a grep-style tool. It filters the pairwise gene results to keep the significant pairwise correlations as graph edges and weights, then uses flatMap and de-duplication to obtain the distinct gene names as graph vertices. This part is also distributed-memory parallelism. To compare performance across instances, we also run it on a single-node Spark instance as a baseline.
Platform: AWS SageMaker and AWS EMR Spark Hadoop Cluster
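One way to picture the distributed-memory split in Part 1 is to divide the N target genes into near-equal contiguous ranges, one per instance, so that each instance trains only the random forests for its own targets. The sketch below is an assumption about how such a split could look, not a description of the actual orchestration notebooks; `split_targets` is a hypothetical helper and the worker count of 5 is arbitrary.

```python
def split_targets(n_genes, n_workers):
    """Split target-gene indices 0..n_genes-1 into n_workers contiguous,
    near-equal ranges (sizes differ by at most one)."""
    base, extra = divmod(n_genes, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional gene.
        size = base + (1 if worker < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Hypothetical example: 49,196 target genes over 5 instances.
chunks = split_targets(49_196, 5)
for worker, chunk in enumerate(chunks):
    print(f"instance {worker}: targets {chunk.start}..{chunk.stop - 1} "
          f"({len(chunk)} forests)")
```

Because each target gene's forest is independent of the others, the instances need no communication until their partial results are combined.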

AWS Service	AWS SageMaker	AWS Spark on Single Node	AWS EMR Spark Hadoop Cluster
Infrastructure	Experimented across: ml.m4.xlarge; ml.m5.2xlarge; ml.m5.4xlarge; ml.m5.10xlarge; ml.c5.2xlarge	m4.xlarge: 4 vCPUs, 2.4 GHz, 16 GiB memory, EBS only; g3.4xlarge: 16 vCPUs, 2.7 GHz, 122 GiB memory, EBS only	m4.xlarge: 4 vCPUs, 2.4 GHz, 16 GiB memory, EBS only; g3.4xlarge: not used due to vCPU limit
Parallelism	Distributed-memory computation across several instances; shared-memory computation using multiple processes	Shared-memory parallel programming on a single node	Distributed memory in a cluster; shared memory; parallel data processing
Programming model	GENIE3 gene expression analysis programming model; Within a SageMaker Notebook instance, use different notebooks to orchestrate multiple jobs on separate instances.	Spark, grep with filter	Spark, grep with filter, then use flatMap() and distinct() to de-duplicate
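The filter, flatMap, and distinct steps named in the table can be illustrated on a toy edge list. The sketch below uses plain Python so it runs without a cluster; in PySpark the corresponding RDD methods are `filter`, `flatMap`, and `distinct`. The gene names, weights, and the 0.005 threshold are hypothetical values chosen for illustration, not data from the project.

```python
from itertools import chain

# HYPOTHETICAL rows: (regulator gene, target gene, importance weight).
rows = [
    ("GENE_A", "GENE_B", 0.012),
    ("GENE_A", "GENE_C", 0.001),
    ("GENE_B", "GENE_C", 0.020),
    ("GENE_C", "GENE_A", 0.0004),
]

THRESHOLD = 0.005  # hypothetical significance cutoff

# "grep with filter": keep only the significant pairs as weighted edges.
edges = [(src, dst, w) for src, dst, w in rows if w >= THRESHOLD]

# flatMap: emit both endpoints of every edge as one flat sequence.
endpoints = chain.from_iterable((src, dst) for src, dst, _ in edges)

# distinct: de-duplicate the gene names to get the graph vertices.
vertices = sorted(set(endpoints))

print("edges:", edges)
print("vertices:", vertices)
```

In the Spark version each step is applied per partition, so the same logic is spread over the worker nodes of the cluster or over the cores of the single-node baseline.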
