
[Condor-users] Daemon performance



Hello,

I have been using file transfer settings such as:

when_to_transfer_output=ON_EXIT
should_transfer_files=yes
stream_error=true
stream_output=true

When the scheduler server gets busy, jobs seem to die and get placed back into the queue because (I think) they can't keep up with the file transfer and I/O. Is there a way to figure out whether that is really the cause?

In the schedd log I see the following for a particular job:

...cur_host=1, status=2
...cur_host=1, status=2
...cur_host=1, status=2
...cur_host=1, status=2
Shadow pid 23323 for job 145.3 exited with status 107
Match record (slot1@xxxxxxx 145.3) for group user deleted
Deleting Shadow rec for PID 23323, job (145.3)
Maked job as IDLE 
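
To get a rough idea of how often shadows exit this way, one option is to tally the exit statuses recorded in the schedd log. The following is only a minimal sketch: it assumes the log lines look like the "Shadow pid ... exited with status ..." line quoted above (the exact format may differ between Condor versions), and the function names and the log path in the comment are hypothetical.

```python
import re
from collections import Counter

# Pattern taken from the schedd log line quoted above. Assumption: other
# Condor versions may word this message differently.
SHADOW_EXIT = re.compile(
    r"Shadow pid (\d+) for job (\d+)\.(\d+) exited with status (\d+)"
)


def count_shadow_exits(lines):
    """Return a Counter mapping shadow exit status -> number of occurrences."""
    counts = Counter()
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match:
            counts[int(match.group(4))] += 1
    return counts


def summarize_log(path):
    """Print one line per exit status found in the log file at `path`."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as handle:
        for status, n in count_shadow_exits(handle).most_common():
            print(f"status {status}: {n} shadow exit(s)")


# Example call (hypothetical path; use the location of your own schedd log):
# summarize_log("/var/log/condor/SchedLog")
```

Comparing the tally from a busy period against a quiet one would show whether the status 107 exits really track the load on the scheduler machine.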

In the shadow log I see this at almost exactly the same time:
condor_read(): socket closed when trying to read 5 bytes from startd slot1@xxxxxxx
IO: EOF reading packet header
Can no longer talk to condor_starter
FileLock::obtain(1) ... now WRITE
FileLock::obtain(2) ... now UNLOCKED
Trying to reconnect...
Trying to reconnnect disconnected job


Any thoughts or ideas on why the daemons would be behaving like this? Are there any tuning parameters I can use to improve performance?






--
--- Get your facts first, then you can distort them as you please.--