
[HTCondor-users] Job causes condor_schedd to crash

One of our clients has seen an issue with HTCondor 7.6.3 where a
single job causes the schedd to crash. The job is one of many
thousands in a cluster, nothing about it appears unusual, and we
cannot reproduce the problem on demand.

What seems to be happening is that the job starts about 1800 times
within a 30-second period, and the job_history.log file ends up with
approximately 30 million lines containing that specific job.process
ID. The schedd dies repeatedly until the job is removed from the
history file.
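
For anyone wanting to check their own history file for the same
pattern, a rough way to spot a job ID that dominates the file is
sketched below. This is an illustration only, not part of our
diagnosis: it assumes job IDs show up in the file as plain
<cluster>.<proc> tokens (for example 1234.56), and the function names
count_job_ids and top_offenders are made up for this example. It reads
the file line by line, so a 30-million-line file does not need to fit
in memory.

```python
import re
from collections import Counter

# Assumption: job IDs appear as "<cluster>.<proc>" tokens, e.g. "1234.56".
# The lookarounds skip longer dotted numbers such as IP addresses.
JOB_ID = re.compile(r"(?<![\d.])(\d+\.\d+)(?!\.?\d)")


def count_job_ids(lines):
    """Count how many lines mention each job ID (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Use a set so an ID repeated on one line is counted once.
        for job_id in set(JOB_ID.findall(line)):
            counts[job_id] += 1
    return counts


def top_offenders(path, n=5):
    """Return the n job IDs mentioned on the most lines of the file."""
    # Iterating over the file object streams it one line at a time.
    with open(path, errors="replace") as fh:
        return count_job_ids(fh).most_common(n)
```

Calling top_offenders("job_history.log") would list the most frequent
IDs with their line counts. Other decimal numbers in the file (such as
fractional timestamps) would also match the pattern, so the results
need a sanity check against the real file format.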

I don't see any mention of a fix for this in subsequent release
notes, so I'm wondering whether anyone else has seen it.

Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

twitter: @cyclecomputing