
[Condor-users] Condor jobs cannot resume



Hi,
I am submitting a Condor job along with several Java class files. The job is a shell script, and it runs perfectly.
When I run condor_vacate_job on the submitted job, it becomes idle again. When I check with condor_q,
the status of my job looks like this:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
  94.0   XXXXX          9/4  22:55   0+00:00:59 I  0   2197.3 test.sh helloWorld

Now, the problem is that sometimes the SIZE field shows 0.0, and Condor then reschedules my job on another machine or on the same one.
But sometimes the SIZE field shows a large number, as in this example (2197.3), and then Condor cannot relocate my job
to any other machine. When I try to see what is happening with condor_q -analyze, it shows something like:

"094.000:  Run analysis summary.  Of XYZ machines,
    XYZ are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
    Last successful match: Thu Sep  4 22:56:31 2008
    Last failed match: Thu Sep  4 23:08:47 2008
    Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)"

The Requirements expression was not set in my Condor submit file. So my guess is that Condor cannot find another machine whose memory satisfies (Memory * 1024) >= ImageSize,
where ImageSize in my example might correspond to the 2197.3 shown under SIZE,
or that it cannot find a machine with Disk >= DiskUsage for my job.
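
To make the memory part of that guess concrete, here is a minimal sketch of the clause (Memory * 1024) >= ImageSize. It assumes, without confirmation, that Memory is in megabytes, that ImageSize is in kilobytes, and that the SIZE column of condor_q is ImageSize expressed in megabytes. The machine names and memory sizes are made up for illustration; they are not from my pool.

```python
# Sketch of the memory clause from the Requirements expression:
#   (Memory * 1024) >= ImageSize
# Assumptions (not confirmed): Memory is in MB, ImageSize is in KB,
# and condor_q's SIZE column is ImageSize expressed in MB.

def memory_clause_matches(machine_memory_mb, job_image_size_kb):
    """Return True if the machine satisfies (Memory * 1024) >= ImageSize."""
    return machine_memory_mb * 1024 >= job_image_size_kb

# SIZE reported by condor_q for job 94.0, converted to KB under the assumption above.
size_mb = 2197.3
image_size_kb = size_mb * 1024

# Hypothetical machines (name -> Memory in MB); these values are invented.
machines_mb = {"small": 1024, "medium": 2048, "large": 4096}

matches = {
    name: memory_clause_matches(memory_mb, image_size_kb)
    for name, memory_mb in machines_mb.items()
}

for name, ok in matches.items():
    print(f"{name}: {'matches' if ok else 'rejected by memory clause'}")
```

Under these assumptions, a SIZE of 2197.3 would reject every machine with less than about 2.2 GB of memory, while a SIZE of 0.0 would pass the clause on any machine, which would be consistent with what I am seeing.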

 Now, my questions are:

1. If I do not send the vacate signal and let the job run, it finishes just fine. So why can Condor not find a match for the job once it has been interrupted?
Do any of you know a workaround that I can apply? I am sure my job does not use much memory, although I am sending a number of .class files with it.

2. What does the virtual image size / image size of a job shown in Condor's .log file mean? Does it include the sizes of my input files as well?

3. I am not sure whether the SIZE shown by condor_q is the disk space used or the memory size. Which is it?


Have any of you come across this problem, or do you have any ideas about it? I would appreciate any help. Thank you.

Tan
--
Tanzima Zerin Islam
Graduate Student
School of Electrical & Computer Engineering
Purdue University