
Table of contents

  1. Performance Evaluation for GENIE3 on AWS SageMaker
    1. Experiment 1: Comparison of shared memory parallelism solutions
    2. Experiment 2: Optimal configuration of shared memory parallelism
    3. Experiment 3: Cost-effectiveness of vertical scaling
    4. Experiment 4: Effect of memory on throughput and successful completion of job

Performance Evaluation for GENIE3 on AWS SageMaker

We ran detailed experiments to investigate the speed-up achieved by different combinations of parallelism settings and to find the optimal configuration. As mentioned, the levels of parallelism include the number of instances, the instance type (which determines the number of CPUs and GPUs and the amount of memory), the number of jobs for the SKLearn random forest estimator, and the number of processes for Python multiprocessing.
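
To make the SKLearn-level knob concrete, here is a minimal GENIE3-style sketch: each target gene is regressed on all the other genes with a random forest, and the forest's feature importances are taken as regulatory scores. The function and parameter names here are illustrative, not the reference GENIE3 implementation; `n_jobs` is the SKLearn parallelism knob varied in the experiments below.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def genie3_importances(expr, n_estimators=10, n_jobs=1):
    """Regress each target gene on all other genes and record the
    forest's feature importances as regulatory scores (illustrative
    sketch, not the reference GENIE3 implementation)."""
    n_samples, n_genes = expr.shape
    importances = np.zeros((n_genes, n_genes))
    for target in range(n_genes):
        others = np.arange(n_genes) != target
        rf = RandomForestRegressor(
            n_estimators=n_estimators,
            n_jobs=n_jobs,  # SKLearn-level parallelism knob
            random_state=0,
        )
        rf.fit(expr[:, others], expr[:, target])
        importances[target, others] = rf.feature_importances_
    return importances

# 50 samples x 6 genes of synthetic expression data
expr = np.random.default_rng(0).normal(size=(50, 6))
scores = genie3_importances(expr)
```

The per-target loop is what Python multiprocessing would split across processes, while `n_jobs` parallelises the tree fitting inside each call.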

Experiment 1: Comparison of shared memory parallelism solutions

Does parallelism actually work? Is there a difference between Python multiprocessing and SKLearn multiple jobs? Should we use both, or is one better than the other?

| Type of instance | No. of CPUs | Python multiprocessing | No. of SKLearn jobs | No. of genes computed | Total processing time incl. start-up (s) | Elapsed time (s) | Speed-up (for elapsed time) |
|---|---|---|---|---|---|---|---|
| ml.m4.xlarge | 4 | 1 | 1 | 10 | 161 | 101.89 | 1 |
| ml.m4.xlarge | 4 | 4 | 1 | 10 | 108 | 44.43 | 2.29 |
| ml.m4.xlarge | 4 | 1 | 4 | 10 | 108 | 49.44 | 2.07 |
| ml.m4.xlarge | 4 | 4 | 4 | 10 | 134 | 43.03 | 2.37 |

Takeaways:

  • Using more processes clearly reduced runtime, though the speed-up was not necessarily linear in the number of processes used.
  • Running on only 10 genes, the speed-up from using 4 Python processes is slightly better than from using 4 SKLearn jobs.
  • When using Python multiprocessing and SKLearn multiple jobs together, the elapsed run time decreased only minimally, while the total processing time actually increased significantly compared to using either alone. This suggests that the two mechanisms may rely on similar underlying parallelism, so combining them causes interference rather than improvement.

Experiment 2: Optimal configuration of shared memory parallelism

At a higher number of genes, is Python multiprocessing or SKLearn multiple jobs better? And how many jobs does “all jobs” correspond to in SKLearn’s n_jobs parameter?

| Type of instance | No. of CPUs | Python multiprocessing | No. of SKLearn jobs | No. of genes computed | Elapsed time (s) | Speed-up (for elapsed time) |
|---|---|---|---|---|---|---|
| ml.m4.xlarge | 4 | 1 | 1 | 100 | 1018.9 (extrapolated) | 1 |
| ml.m4.xlarge | 4 | 1 | All | 100 | 488.20 | 2.09 |
| ml.m4.xlarge | 4 | 1 | 32 | 100 | 502.27 | 2.03 |
| ml.m4.xlarge | 4 | 32 | 1 | 100 | 977.04 | 1.04 |

Takeaways:

  • Running with 32 jobs using SKLearn is much more efficient than running with 32 Python processes. Going forward, we have decided to use SKLearn multiple-jobs instead of Python multiprocessing.
  • With 32 processes, Python multiprocessing incurred so much overhead that it cancelled out the gains of parallelism, yielding a runtime close to the extrapolated serial runtime.
  • SKLearn running with all jobs is slightly faster than with 32 jobs, though not significantly.
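
For reference, “All” in the table corresponds to `n_jobs=-1`, which scikit-learn resolves via joblib to the number of available CPU cores. A quick way to inspect the resolved value:

```python
from joblib import effective_n_jobs

# n_jobs=-1 ("all jobs") resolves to the number of available cores;
# n_jobs=1 means serial execution within the estimator.
all_jobs = effective_n_jobs(-1)
one_job = effective_n_jobs(1)
```

On the 4-CPU ml.m4.xlarge this resolves to 4, which is why requesting 32 jobs performs about the same as “All”: the extra jobs beyond the core count add only scheduling overhead.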

Experiment 3: Cost-effectiveness of vertical scaling

How much speed-up can we get from using instances with more CPUs? What about GPU instances? Are the more expensive EC2 instances worth it?

| Type of instance | No. of CPUs | No. of SKLearn jobs | No. of genes computed | Total processing time incl. start-up (s) | Speed-up | Price per hour (mark-up) |
|---|---|---|---|---|---|---|
| ml.m4.xlarge | 4 | 32 | 100 | 591 | 1 | $0.24 (1×) |
| ml.m5.2xlarge | 8 | 32 | 100 | 374 | 1.58 | $0.461 (1.92×) |
| ml.m5.4xlarge | 16 | 32 | 100 | 347 | 1.70 | $0.922 (3.84×) |
| ml.m5.10xlarge | 40 | 32 | 100 | 342 | 1.73 | $2.40 (10×) |

Note: Speed-up for this experiment is calculated based on total processing time which is the total billable time.

Takeaways:

  • GPU instances were actually not available to us because we were using a SKLearn estimator wrapper to run our custom script.
  • The speed-up (with the 4-CPU instance as the baseline) is significant when the number of CPUs increases to 8; however, from 8 to 16 and from 16 to 40 CPUs, the additional speed-up is minimal.
  • Economically, it is not worth it to use the larger instances as the speed-up does not match the mark-up in the cost. Horizontal scaling with smaller instances might be more cost-effective.
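
The cost argument can be made concrete by multiplying each instance’s hourly price by its billable processing time (numbers copied from the table above):

```python
# (price per hour in USD, total processing time in seconds) per instance,
# copied from the Experiment 3 table above.
runs = {
    "ml.m4.xlarge":   (0.24,  591),
    "ml.m5.2xlarge":  (0.461, 374),
    "ml.m5.4xlarge":  (0.922, 347),
    "ml.m5.10xlarge": (2.40,  342),
}

# Billable cost of one 100-gene run: price per hour * hours billed.
cost_per_run = {
    name: price * secs / 3600 for name, (price, secs) in runs.items()
}

cheapest = min(cost_per_run, key=cost_per_run.get)
```

On these numbers the smallest instance is the cheapest per run: the ml.m5.10xlarge run costs roughly six times as much for only a 1.73× speed-up.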

Experiment 4: Effect of memory on throughput and successful completion of job

| Type of instance | No. of CPUs | Memory | No. of SKLearn jobs | No. of genes computed | Elapsed time (s) | Speed-up (compared across same throughput) | Output stage |
|---|---|---|---|---|---|---|---|
| ml.m5.2xlarge | 8 | 32 GB | All | 6000 | 15947.86 | 1 | memory error |
| ml.m5.2xlarge | 8 | 32 GB | All | 3000 | 8178.34 | 1 | memory error |
| ml.m5.4xlarge | 16 | 64 GB | All | 6000 | 13236.91 | 1.20 | memory error |
| ml.m5.4xlarge | 16 | 64 GB | All | 3000 | 6470.35 | 1.26 | OK |
| ml.c5.2xlarge | 8 | 16 GB | All | 6000 | 14034.08 | 1.14 | memory error |
| ml.c5.2xlarge | 8 | 16 GB | All | 3000 | 6891.09 | 1.29 | memory error |
| ml.c5.4xlarge | 16 | 32 GB | All | 6000 | 10550.77 | 1.51 | memory error |
| ml.c5.4xlarge | 16 | 32 GB | All | 3000 | 5401.38 | 1.51 | memory error |
| ml.c5.9xlarge | 32 | 72 GB | All | 6000 | 9715.57 | 1.64 | memory error |
| ml.c5.9xlarge | 32 | 72 GB | All | 3000 | 5081.83 | 1.61 | OK |
| ml.c5.18xlarge | 72 | 144 GB | All | 6000 | 10317.16 | 1.55 | OK |

Note: Speed-up is compared within the same throughput, i.e. 6000 genes or 3000 genes.

Takeaways:

  • When running separate instances to get the final output, we unexpectedly ran into memory errors. Even though the jobs finished computing all of the genes, there was not enough memory to write the pairwise correlations one by one to the output file.
  • When running on 6000 genes, only the instance with 144 GB of memory could produce output without error; when running on 3000 genes, only the instances with 64 GB or 72 GB of memory could. This adds a new dimension to consider when choosing instances.
  • The highest speed-up was 1.64, even though at one point we used 9 times as many CPUs as the baseline. This points to limited gains from increasing the CPU count, probably as a result of communication overhead.
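
One way to sidestep the output-stage memory error would be to stream results to the output file one target gene at a time, rather than materialising every pairwise score in memory before writing. A sketch (the function name and row format are illustrative, not our pipeline’s actual output code):

```python
import csv
import io

def write_links_streaming(importances_per_gene, out):
    """Write (target, regulator, score) rows one gene at a time instead
    of accumulating all pairwise scores in memory first -- a sketch of
    how the output-stage memory error above might be avoided."""
    writer = csv.writer(out)
    writer.writerow(["target", "regulator", "score"])
    for target, scores in importances_per_gene:  # may be a generator
        for regulator, score in scores.items():
            writer.writerow([target, regulator, score])

# Toy usage with two target genes; `out` can be any writable file object.
buf = io.StringIO()
write_links_streaming(
    [("g1", {"g2": 0.7, "g3": 0.3}), ("g2", {"g1": 0.5, "g3": 0.5})],
    buf,
)
```

Because peak memory then depends only on one gene’s scores rather than all pairwise results, the instance size could be chosen for compute rather than for the output stage.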