
Re: [Condor-users] SCHEDD dying on multiple process submission

On 5/17/07, Scillieri, John <John.Scillieri@xxxxxxxxxxxxxxxxx> wrote:

I'm having trouble with a job that sporadically kills the SCHEDD on the
submission host. I think the job in question is our large job that
queues up multiple processes (about 25). I've attached the majority of
the job description file below; is there something I'm doing that is bad
etiquette or unsupported? The MasterLog file reports "The SCHEDD (pid
XXXX) died due to exception ACCESS_VIOLATION", if that helps anyone. I've
submitted the job both as a standalone submission and as a piece within
a DAG, and it happens both ways.

Also, because the schedd keeps dying on start-up there's no way for me
to use condor_rm to remove the bad jobs. Is there another way to
manually remove a job from scheduling?

Option 1 - accept the loss of all jobs in the queue and delete the
job_queue.log file in your SPOOL directory. Restart, and the problem is
gone.
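The first option boils down to a handful of commands. A hedged sketch follows: condor_off, condor_on, and condor_config_val are the standard HTCondor admin tools, but since this wipes the whole queue, the destructive step is only demonstrated here against a throwaway directory; on a real pool you would point at the actual SPOOL directory instead.

```shell
# Real sequence on the submission host (shown as comments, since it
# destroys the queue):
#
#   condor_off -schedd                 # stop the schedd first
#   SPOOL=$(condor_config_val SPOOL)   # locate the spool directory
#   rm "$SPOOL/job_queue.log"
#   condor_on -schedd                  # restart with an empty queue
#
# Dry-run demonstration of the delete step on a fake spool directory:
SPOOL=$(mktemp -d)
touch "$SPOOL/job_queue.log"
rm "$SPOOL/job_queue.log"
test ! -e "$SPOOL/job_queue.log" && echo "queue log removed"
```

Make a copy of job_queue.log before deleting it if there's any chance you'll want to inspect the bad job afterwards.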

Option 2 - take a look inside the file (while Condor is off) and see
if you can work out how to remove a specific job. (I did this myself a
long time ago in 6.6, but I doubt I could remember how to do it now,
and certainly not for 6.8 without some fiddling - it's not that hard
as I recall, though.)
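For Option 2, a toy sketch of the surgery. The job_queue.log is a line-oriented ClassAd log in which job records are keyed by something like "<cluster>.<proc>" in the second field, but the exact key format varies by Condor version (it may be zero-padded, e.g. "0042.000"), and the sample records below are fabricated for illustration - inspect your real file and adjust the pattern before deleting anything.

```shell
# Fabricated three-record file standing in for job_queue.log; the
# cluster/proc keys and attribute lines here are illustrative only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
103 41.0 JobStatus 2
103 42.0 JobStatus 5
102 42.0
EOF

# Drop every record whose key belongs to cluster 42.
cleaned=$(mktemp)
awk '$2 !~ /^42\./' "$sample" > "$cleaned"
cat "$cleaned"
```

Always work on a backup (cp job_queue.log job_queue.log.bak) and only swap the cleaned file in while the schedd is stopped, or it will be overwritten or corrupted.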