
[Condor-users] Daemon performance


I have been using data transfers such as,


When the scheduler server gets busy, jobs seem to die and get placed back into the queue, I think because they can't keep up with the file transfer and I/O. Is there a way to confirm whether this is the cause?

In the schedd log, I see the following for a particular job:

...cur_host=1, status=2
...cur_host=1, status=2
...cur_host=1, status=2
...cur_host=1, status=2
Shadow pid 23323 for job 145.3 exited with status 107
Match record (slot1@xxxxxxx 145.3) for group user deleted
Deleting Shadow rec for PID 23323, job (145.3)
Marked job as IDLE
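
One way to check whether this happens to many jobs or only a few would be to tally the shadow exit lines in the schedd log. The following is only a minimal sketch: the regular expression is modeled on the single "Shadow pid ... exited with status ..." line in the excerpt above, and the function name tally_shadow_exits is a hypothetical helper, not part of Condor. Real log lines may carry timestamps or other prefixes, which the search-based match tolerates.

```python
import re
from collections import Counter

# Pattern modeled on the excerpt above (an assumption about the log format):
#   "Shadow pid 23323 for job 145.3 exited with status 107"
SHADOW_EXIT = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)


def tally_shadow_exits(lines):
    """Count (job id, exit status) pairs found in schedd log lines.

    `lines` is any iterable of strings, e.g. an open log file.
    Lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match:
            _pid, job_id, status = match.groups()
            counts[(job_id, int(status))] += 1
    return counts
```

To use it, open the schedd log file (its location depends on the local configuration) and pass the file object to tally_shadow_exits; a job that shows the same exit status many times is being restarted repeatedly.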

In the shadow log, I see this at around the same time:
condor_read(): socket closed when trying to read 5 bytes from startd slot1@xxxxxxx
IO: EOF reading packet header
Can no longer talk to condor_starter
FileLock::obtain(1) ... now WRITE
FileLock::obtain(2) ... now UNLOCKED
Trying to reconnect...
Trying to reconnect disconnected job

Any thoughts on why the daemons would be behaving like this? Are there any tuning parameters I can use to improve performance?

--- Get your facts first, then you can distort them as you please.--