[Condor-users] Shadow exception with LamMpi jobs



Hi,

We have a Condor cluster (60 CPUs) where we're trying to run parallel
jobs using LAM/MPI.

$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

The basic configuration is now OK: we're able to run parallel jobs via
condor_submit on 8 nodes, and they complete correctly. But when we run
the same job (same binary and data files) on 32 nodes instead of 8, we
get a shadow exception in the job log file:

000 (18789.000.000) 03/14 10:55:35 Job submitted from host: <10.7.7.250:55139>
...
014 (18789.000.000) 03/14 10:58:10 Node 0 executing on host: <10.7.7.20:59381>
...
014 (18789.000.001) 03/14 10:58:28 Node 1 executing on host: <10.7.7.11:59230>
...
[many lines later]
...
007 (18789.000.000) 03/14 11:06:49 Shadow exception!
        Error from starter on slot3@xxxxxxxxxxxxxx: Failed to transfer files
        0  -  Run Bytes Sent By Job
        47481929728  -  Run Bytes Received By Job
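
In case it helps anyone go through a longer job log, here is a minimal
Python sketch that pulls the shadow exceptions out of it. It assumes the
event layout seen in the excerpt above (a three-digit event code, the job
ID in parentheses, a timestamp, then indented detail lines, with "..."
separating events). The names parse_events and shadow_exceptions are
hypothetical helpers made up for this example, not part of Condor.

```python
import re

# Header line of a job-log event, as it appears in the excerpt, e.g.
# "007 (18789.000.000) 03/14 11:06:49 Shadow exception!"
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(text):
    """Split job-log text into events (assumed layout, see above)."""
    events = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {
                "code": match.group(1),
                "cluster": int(match.group(2)),
                "proc": int(match.group(3)),
                # Third field of the job ID; in the excerpt it matches
                # the node number of the parallel job.
                "node": int(match.group(4)),
                "time": match.group(5),
                "message": match.group(6),
                "details": [],
            }
            events.append(current)
        elif line.strip() == "...":
            # "..." closes the current event.
            current = None
        elif current is not None and line.strip():
            # Indented lines under a header belong to that event.
            current["details"].append(line.strip())
    return events


def shadow_exceptions(events):
    """Keep only events with code 007 (shadow exception in the excerpt)."""
    return [event for event in events if event["code"] == "007"]


SAMPLE_LOG = """\
000 (18789.000.000) 03/14 10:55:35 Job submitted from host: <10.7.7.250:55139>
...
014 (18789.000.000) 03/14 10:58:10 Node 0 executing on host: <10.7.7.20:59381>
...
014 (18789.000.001) 03/14 10:58:28 Node 1 executing on host: <10.7.7.11:59230>
...
007 (18789.000.000) 03/14 11:06:49 Shadow exception!
        Error from starter on slot3@xxxxxxxxxxxxxx: Failed to transfer files
        0  -  Run Bytes Sent By Job
        47481929728  -  Run Bytes Received By Job
"""

if __name__ == "__main__":
    for event in shadow_exceptions(parse_events(SAMPLE_LOG)):
        print(event["time"], event["message"])
        for detail in event["details"]:
            print("   ", detail)
```

To run it against a real log, replace SAMPLE_LOG with the contents of
the file named by the "log" entry of the submit file.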

After this, the job goes idle, then tries to start again within
minutes and fails again with the same error. We're monitoring the
cluster with Ganglia, and it doesn't look like the cluster is running
out of resources (memory, CPU, network bandwidth, or disk space) while
trying to run the job. We had similar problems months ago with
Condor 6.8.2, but at that time the jobs at least started correctly on
32 nodes and the shadow exception was rare, while now it is
systematic.

I've posted the relevant ShadowLog here: http://pastebin.com/f4ac86994
and the relevant SchedLog here: http://pastebin.com/f6c52778e

Please let me know if more info or logs are needed. Any help will be
greatly appreciated!

Regards,
Pasquale