
Re: [Condor-users] Condor_q analyze question



Brandon Leeds wrote:
> Hi All,

Hi Brandon, some hopefully helpful comments below....

> We are trying to understand why a job appears to be running and accumulating cpu time in the condor_q output,

Note that if you just do "condor_q", the time you are seeing is RUN_TIME, i.e. wall-clock time. To see CPU time you need to pass the "-cputime" flag to condor_q. CPU time is then displayed instead of wall-clock time; note that Condor only updates CPU time periodically, so you will not see it incrementing every second with condor_q.
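For example, using the job id from later in this message (the cluster id and pool/schedd names are just whatever applies at your site):

```shell
# Default output: the time column is RUN_TIME (wall-clock time)
condor_q 527579.0

# Same query, but show accumulated CPU time instead; this value
# is only refreshed periodically by the starter, so it will lag
condor_q -cputime 527579.0
```

A job that shows large RUN_TIME but near-zero CPU time with -cputime is a good hint that it is blocked, suspended, or staging files rather than computing.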

> but are told by the end user that his job is no longer accessing the files it should be along the computation's typical pathway. In hopes of understanding whether the priority is so low that it is starving,

If his job is marked as running by condor_q, then there is no low-priority starvation issue.

Some thoughts:

a) the job will still be displayed as "running" in condor_q even if it is currently suspended at the execute node because the SUSPEND expression in the config file evaluated to true. You can do a condor_status to see if the node running the job is in the Suspended state. Or, if the user specified a job event log (log=<some-file>) in the submit description file, that log will also state whether the job was suspended.
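A quick way to check both of those, sketched below (the hostname and log filename are hypothetical placeholders for whatever your job actually uses):

```shell
# See the activity of the execute node running the job; a
# "Suspended" activity means the job is paused, not computing
condor_status exec-node.example.com

# If the submit file contained a line such as:
#   log = myjob.log
# then suspend/unsuspend events are recorded there:
grep -i suspend myjob.log
```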

b) the job will still be displayed as "running" in condor_q when in fact files are being staged (copied) onto or off of the execute node.

c) if the user is expecting to see output files "grow" as the job runs, note there are many circumstances where that may not happen. For instance, if the job is vanilla universe and file transfer is being used (i.e. no shared file system), the job's files will only get updated when the job completes or, optionally, when it is preempted. If the job is standard universe, files may only get updated when the program does a sync to disk - i.e. file I/O may be cached in RAM for long periods of time.

> he looked at using the analyze
> flag to condor_q. Unfortunately we get this result:


> $ condor_q -pool condor -name blaze -analyze 527579.0
> Error: Collector has no record of schedd/submitter



This error is saying your pool does not have a schedd (submission point) named "blaze". If "blaze" is a hostname, perhaps you need the fully-qualified name? Also, you can do "condor_status -schedd" to see a list of all the values that are valid with the "-name" option.
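Concretely, something like the following (the fully-qualified name shown is an assumption about your site's domain):

```shell
# List every schedd the collector knows about; the Name column
# holds exactly the strings condor_q -name will accept
condor_status -schedd

# Then retry the analysis with the name as the collector reports it
condor_q -pool condor -name blaze.example.com -analyze 527579.0
```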

Or is "blaze" perhaps the login of the submitting user? Then you meant to use the "-submitter" option to condor_q instead of the "-name" option.
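In that case the query would look like this:

```shell
# Query by submitting user rather than by schedd name
condor_q -pool condor -submitter blaze -analyze 527579.0
```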

-Todd

--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257