When submitting N instances of a job, roughly N/2 of them run in the expected time and the other N/2 take longer to complete. The system has 10 nodes, each with 32 slots, and uses a shared filesystem (GlusterFS). All of the executables and data files are located on the shared filesystem; however, the problem does not seem to be an I/O or network bottleneck.
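One way to check the I/O claim per job is to compare wall-clock time with CPU time on each node. The following is only a minimal sketch, assuming Python 3 is available on the compute nodes; `timed_run` is a hypothetical helper name, not part of any scheduler or of GlusterFS.

```python
import os
import socket
import subprocess
import sys
import time


def timed_run(cmd):
    """Run cmd (a list of arguments) and report host, wall time, and CPU time.

    Hypothetical helper for illustration only.
    """
    before = os.times()
    start = time.perf_counter()
    returncode = subprocess.call(cmd)  # blocks until the child exits
    wall = time.perf_counter() - start
    after = os.times()
    # CPU seconds (user + system) consumed by child processes that were waited for.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return {
        "host": socket.gethostname(),
        "wall_s": wall,
        "cpu_s": cpu,
        "returncode": returncode,
    }


if __name__ == "__main__":
    # Usage: python timed_run.py <executable> [args...]
    if len(sys.argv) > 1:
        result = timed_run(sys.argv[1:])
        print("{host} wall={wall_s:.2f}s cpu={cpu_s:.2f}s rc={returncode}".format(**result))
```

If the slow jobs show a wall time well above their CPU time, they were probably waiting on something (I/O, or a free CPU); if their CPU time is itself higher, the slowdown more likely comes from contention while computing. Grouping the output by host would also show whether the slow half is tied to particular nodes.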
When submitting 2 instances, the two times are the following:
When submitting 22 instances, the difference in times is more drastic. The times fall into the following two categories:
Does anybody have insight into this issue?