
[Condor-users] result files upload problem



Hi,

On our cluster, we have a problem when result files are uploaded from
the nodes running the job to the cluster head node. The job is parallel
and otherwise runs fine, but when it generates multi-GB files and copies
them back at the end of the job, we get this in the job log file:

022 (18800.000.000) 03/17 13:18:55 007 (18800.000.000) 03/17 13:18:55
Shadow exception!
        JobDisconnectedEvent::writeEvent() called without startd_addr
        0  -  Run Bytes Sent By Job
        69064941568  -  Run Bytes Received By Job

We're also monitoring the cluster with Ganglia. The load on the head
node is fine until the results transfer period, when it rises above 10
and the head node becomes mostly unresponsive. After about 20 minutes,
all the jobs in the Condor cluster go idle (Ganglia estimate), with the
head node load still above 10. After we receive the shadow exception,
condor_q shows all jobs as idle, including jobs unrelated to this one
that could otherwise have kept running without problems. The shadow
exception is emitted about 40 minutes after the job stops running on
the nodes (Ganglia estimate). We are currently using
STARTER_UPLOAD_TIMEOUT = 3600.
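
For scale, here is a rough back-of-the-envelope check of what the
timeout implies. It is only a sketch: it assumes the byte count in the
log excerpt is the volume that has to reach the head node, and the
sustained rates in the loop are hypothetical values, not measurements
from our cluster. The function names are made up for illustration.

```python
# Byte count copied from the job log excerpt. Treating it as the volume
# that must be transferred is an assumption.
BYTES_LOGGED = 69_064_941_568
STARTER_UPLOAD_TIMEOUT = 3600  # seconds, the value currently configured


def transfer_seconds(num_bytes, rate_mb_per_s):
    """Seconds needed to move num_bytes at a sustained rate (1 MB = 10**6 bytes)."""
    return num_bytes / (rate_mb_per_s * 1e6)


def min_rate_mb_per_s(num_bytes, timeout_s):
    """Lowest sustained rate (MB/s) that moves num_bytes within timeout_s."""
    return num_bytes / timeout_s / 1e6


if __name__ == "__main__":
    print(f"volume: {BYTES_LOGGED / 2**30:.1f} GiB")
    needed = min_rate_mb_per_s(BYTES_LOGGED, STARTER_UPLOAD_TIMEOUT)
    print(f"rate needed to finish within the timeout: {needed:.1f} MB/s")
    # Hypothetical sustained rates, for comparison only.
    for rate in (10, 20, 50, 100):
        minutes = transfer_seconds(BYTES_LOGGED, rate) / 60
        print(f"at {rate:>3} MB/s: {minutes:.0f} min")
```

If the head node cannot sustain the required rate while under load, the
transfer would not complete inside the timeout window, whatever else is
going wrong.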

Regards,
Pasquale

$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
$CondorPlatform: X86_64-LINUX_RHEL3 $