
Re: [Condor-users] result files upload problem



More on this:

007 (18811.000.000) 03/17 19:29:22 Shadow exception!
        Assertion ERROR on (result)
        56330838016  -  Run Bytes Sent By Job
        69064957952  -  Run Bytes Received By Job
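
For scale, the byte counters in an event like the one above can be pulled out and converted to GiB. This is only a minimal sketch in Python: the regular expressions and the `parse_event` / `to_gib` helper names are my own assumptions based on the layout shown here, not part of Condor.

```python
import re

# Assumed layout of a user-log event header, e.g.
# "007 (18811.000.000) 03/17 19:29:22 Shadow exception!"
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)
# Assumed layout of a counter line, e.g.
# "        56330838016  -  Run Bytes Sent By Job"
COUNTER = re.compile(r"^\s*(\d+)\s+-\s+(Run Bytes (?:Sent|Received) By Job)\s*$")


def parse_event(lines):
    """Return a dict with the event code, job id, message and byte counters."""
    event = {"counters": {}}
    for line in lines:
        header = EVENT_HEADER.match(line)
        if header:
            event["code"] = header.group(1)
            event["cluster"] = int(header.group(2))
            event["proc"] = int(header.group(3))
            event["message"] = header.group(7)
            continue
        counter = COUNTER.match(line)
        if counter:
            event["counters"][counter.group(2)] = int(counter.group(1))
    return event


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes), rounded to two decimals."""
    return round(num_bytes / 2**30, 2)


sample = [
    "007 (18811.000.000) 03/17 19:29:22 Shadow exception!",
    "        Assertion ERROR on (result)",
    "        56330838016  -  Run Bytes Sent By Job",
    "        69064957952  -  Run Bytes Received By Job",
]
event = parse_event(sample)
for name, value in event["counters"].items():
    print(f"{name}: {to_gib(value)} GiB")
```

With these numbers, the job had sent roughly 52 GiB and received roughly 64 GiB when the shadow raised the exception.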

and:

/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988): condor_write(): Socket closed when trying to write 13 bytes to <10.7.7.15:44456>, fd is 33
/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988): Buf::write(): condor_write() failed
/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988): ERROR "Assertion ERROR on (result)" at line 232 in file NTreceivers.C

Any idea?

Thanks,
Pasquale
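
P.S. In case it helps anyone looking at similar output, ShadowLog lines like the ones quoted above can be grouped by job id so that the sequence of messages for one job is easy to follow. This is a rough sketch in Python; the line pattern is an assumption inferred from the excerpt (including the optional `path:` prefix that grep adds), and `group_by_job` is a hypothetical helper, not a Condor tool.

```python
import re
from collections import defaultdict

# Assumed layout: "[path:]M/D HH:MM:SS (cluster.proc) (pid): message"
SHADOW_LINE = re.compile(
    r"^(?:(?P<path>[^:]+):)?"
    r"(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): ?(?P<msg>.*)$"
)


def group_by_job(lines):
    """Map each job id to its (time, message) pairs, in log order."""
    by_job = defaultdict(list)
    for line in lines:
        match = SHADOW_LINE.match(line.strip())
        if match:  # lines that do not fit the assumed layout are skipped
            by_job[match.group("job")].append(
                (match.group("time"), match.group("msg"))
            )
    return dict(by_job)


sample = [
    "/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988): "
    "condor_write(): Socket closed when trying to write 13 bytes to "
    "<10.7.7.15:44456>, fd is 33",
    "/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988): "
    "Buf::write(): condor_write() failed",
    "/home/condor/log/ShadowLog:3/17 19:29:22 (18811.0) (23988): "
    'ERROR "Assertion ERROR on (result)" at line 232 in file NTreceivers.C',
]
by_job = group_by_job(sample)
for job, entries in by_job.items():
    print(job)
    for time, msg in entries:
        print(f"  {time}  {msg}")
```

Read in order, the three messages show the socket write failing first, with the assertion following from it.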

On Mon, Mar 17, 2008 at 1:42 PM, Pasquale Tricarico <tricaric@xxxxxxx> wrote:
> Hi,
>
>  In our cluster, we're having a problem uploading the result files
>  from the execute nodes to the cluster head node. The job is
>  parallel and otherwise runs fine, but when it generates multi-GB
>  files and copies them back at the end of the job, we get this in the
>  job log file:
>
>  022 (18800.000.000) 03/17 13:18:55 007 (18800.000.000) 03/17 13:18:55
>  Shadow exception!
>         JobDisconnectedEvent::writeEvent() called without startd_addr
>         0  -  Run Bytes Sent By Job
>         69064941568  -  Run Bytes Received By Job
>
>  We're also monitoring the cluster with Ganglia. The load on the head
>  node is OK until the results transfer period, when it goes above 10
>  and the head node becomes mostly unresponsive. After about 20 minutes,
>  all the jobs in the Condor cluster go idle (Ganglia estimate), with
>  the load on the head node still above 10. After we receive the shadow
>  exception, condor_q shows all jobs as IDLE, even those that are
>  unrelated to this job and could have kept running without problems.
>  The shadow exception is emitted about 40 minutes after the job stops
>  running on the nodes (Ganglia estimate), and we currently have
>  STARTER_UPLOAD_TIMEOUT = 3600.
>
>  Regards,
>  Pasquale
>
>  $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
>  $CondorPlatform: X86_64-LINUX_RHEL3 $
>