
Re: [HTCondor-users] Strange Condor Behavior - Possible Bug



The first thing I'd suggest is to look at the starter log file for one of the problem jobs, which would be on btbal3600 or 3610 based on your logs below. It looks like you may be using partitionable slots? If so, the file would be StarterLog.slot1_*; if not, StarterLog.slot1.

That may give you a bit more insight into why the termination took place. Sounds like there's precious little in the stdout and stderr from what you wrote.

I've seen this sort of thing when a job balloons its memory and gets nailed by the kernel's out-of-memory killer. Your "memory-used" and "memory requested" figures in the log file show 1709, so that may be unlikely. Still, if you see "oom" in the /var/log/syslog file, that indicates the killer was triggered, and you can glean the details from the syslog. How much memory do the exec nodes have?

You can implement your 12-hour time limit internally to the job using a periodic_hold or periodic_remove expression:

periodic_hold = ( time() - JobCurrentStartDate > 12*$(HOUR) )
periodic_hold_reason = "Job exceeded 12-hour runtime limit."
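
A slightly more defensive variant of the same idea, offered as a sketch rather than a tested recipe: it adds a JobStatus == 2 (running) guard so that only running jobs are checked, and it spells out the limit in seconds in case the HOUR macro isn't defined where your submit file is processed (that is an assumption about your setup, not something I've checked). The periodic_remove line is the alternative mentioned above, left commented out; you'd use one or the other.

```
# 12 hours = 12 * 3600 = 43200 seconds
periodic_hold = ( JobStatus == 2 ) && ( time() - JobCurrentStartDate > 12 * 3600 )
periodic_hold_reason = "Job exceeded 12-hour runtime limit."

# Alternative: remove the job from the queue instead of holding it.
# periodic_remove = ( JobStatus == 2 ) && ( time() - JobCurrentStartDate > 12 * 3600 )
```

A held job stays in the queue, where condor_q -hold will show the reason and condor_release can restart it. A removed job is simply gone, so holding is the gentler choice while you're still diagnosing.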


 

Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax

michael.v.pelletier@xxxxxxxxxxxx