
[Condor-users] Hawkeye module and condor_q problems in condor-6.6.6



Hi,

I'm having problems with Hawkeye modules under Condor v6.6.6: sometimes I get a continuous stream of "Cron: Job 'blah' is still running!" messages, even though the processes no longer appear in the process list. Is there any way to fix this short of bouncing the startd?

Second: we deal with some very large-footprint Condor jobs in the vanilla universe. Most of the footprint (static FORTRAN array space) gets swapped out, but if a job gets killed on a machine, it will then never run on another machine because its ImageSize is greater than the (per-VM) memory available on the machine. To get around this, I have been running the following command:

condor_qedit -name lawrence -constraint \
'JobStatus == 1 && ImageSize > 0.0' \
ImageSize 0.0

which works, but condor_q then says:

-- Schedd: rockwell.fnal.gov : <131.225.52.131:32774>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 --- ???? ---
3034.0   jocelyn         6/6  11:23   0+03:45:52 R  0   1389.5 AnalysisFramework_
 --- ???? ---
 --- ???? ---
3037.0   jocelyn         6/6  11:23   0+03:45:29 R  0   1389.5 AnalysisFramework_
3040.0   jocelyn         6/6  11:23   0+03:45:22 R  0   1385.5 AnalysisFramework_
3041.0   jocelyn         6/6  11:23   0+03:45:18 R  0   1106.5 AnalysisFramework_
 --- ???? ---
3043.0   jocelyn         6/6  11:24   0+03:44:51 R  0   1108.5 AnalysisFramework_
3044.0   jocelyn         6/6  11:24   0+03:44:51 R  0   1106.5 AnalysisFramework_
 --- ???? ---
3048.0   jocelyn         6/6  11:24   0+03:45:03 R  0   1396.0 AnalysisFramework_

where the " --- ???? --- " lines represent the jobs that were edited (job numbers 3033.0, 3035.0, 3036.0, 3042.0 and 3047.0 here). Is this a bug, or something I did wrong? Either way, how do I fix or work around the problem?

Thanks,
Chris.

--
Chris Green, MiniBooNE / LANL. Email greenc@xxxxxxxx
Tel: (630) 840-2167. Fax: (630) 840-3867