
Re: [Condor-users] Problem with periodic_release and globus_resubmit



On 12-Feb-09, at 10:23 AM, Steven Timm wrote:
Patrick--we are doing almost the same thing here at FermiGrid.

That's great! It gives me hope that I'll get this to work.

On the jobs that are idle:
1) What does the UserLog say?

http://pastie.org/387553

2) What does condor_q -ana say?

2389.000:  Run analysis summary.  Of 3 machines,
      1 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
	Last successful match: Thu Feb 12 11:55:09 2009

The one resource is rejected because I have (TARGET.Name =!= LastMatchName0) in my Requirements section.
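To illustrate why that clause rejects exactly one machine, here is a rough Python sketch of how a "not identical" (=!=) test behaves. The machine names are hypothetical, None stands in for an undefined LastMatchName0, and this only models the operator; it is not how Condor itself is implemented.

```python
# Rough model of the ClassAd "is not identical to" operator (=!=).
# None stands in for UNDEFINED; machine names below are hypothetical.

def is_not_identical(a, b):
    """Always returns True or False, even when one side is undefined."""
    return not (type(a) is type(b) and a == b)


def eligible(machine_names, last_match_name):
    """Keep the machines that pass (TARGET.Name =!= LastMatchName0)."""
    return [name for name in machine_names
            if is_not_identical(name, last_match_name)]


machines = ["gk-a.example.org", "gk-b.example.org", "gk-c.example.org"]

# After a match to gk-b, only that one machine is rejected.
after_match = eligible(machines, "gk-b.example.org")

# Before any match, LastMatchName0 is undefined and nothing is rejected.
before_match = eligible(machines, None)
```

With three machines and one previous match, one machine fails the test and two remain, which lines up with the "1 are rejected by your job's requirements" line in the analysis above.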


3) What does condor_q -globus say?

-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:65121> : ms-gavia-testing.phys.UVic.CA
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE
2389.0   dev07          UNSUBMITTED fork [?????]             run-run.sh



In our experience, for all Grid universe jobs, GridJobStatus is never
undefined. If it is undefined, something is very wrong. I just ran a query
on my queue of more than 1000 jobs and it is defined for all of them.
I don't see how the GlobusResubmit statement you have below could ever
be true.

Well, I don't know what to tell you. Here's the output of condor_q -l for one of my perpetually Idle jobs: http://pastie.org/387562

Also,

[root@ms-gavia-testing ~]# condor_q -constraint "(GridJobStatus =?= UNDEFINED) && (NumSystemHolds >= NumJobMatches)"

-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:65121> : ms-gavia-testing.phys.UVic.CA
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
2389.0   dev07           2/12 11:51   0+00:00:00 I  0   0.0  run-run.sh

So, it _is_ evaluating to true.
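For anyone following along, the way that constraint evaluates can be sketched roughly in Python as below. This is only a simplified model of ClassAd semantics as I understand them (it ignores ERROR values), and the attribute values in the example ads are made up for illustration, not taken from my job.

```python
# Simplified model of how the constraint above evaluates.
# UNDEFINED is a sentinel object; attribute values are invented examples.

class _Undefined:
    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = _Undefined()


def meta_equals(a, b):
    """Model of =?=: always True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def greater_equal(a, b):
    """Model of >=: UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a >= b


def logical_and(a, b):
    """Model of &&: False wins; otherwise UNDEFINED propagates."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def constraint(ad):
    """(GridJobStatus =?= UNDEFINED) && (NumSystemHolds >= NumJobMatches)"""
    return logical_and(
        meta_equals(ad.get("GridJobStatus", UNDEFINED), UNDEFINED),
        greater_equal(ad.get("NumSystemHolds", UNDEFINED),
                      ad.get("NumJobMatches", UNDEFINED)),
    )


# A job ad with no GridJobStatus attribute at all (hypothetical counts).
stuck_job = {"NumSystemHolds": 2, "NumJobMatches": 1}

# A job ad where GridJobStatus is defined (hypothetical value).
healthy_job = {"GridJobStatus": "ACTIVE", "NumSystemHolds": 2, "NumJobMatches": 1}

stuck_result = constraint(stuck_job)      # True
healthy_result = constraint(healthy_job)  # False
```

The point is that =?= never yields UNDEFINED itself, so a job ad that simply lacks GridJobStatus makes the first clause true rather than leaving the whole expression undefined.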


Also you mentioned a periodic_hold statement below but you
did not actually show us what it was.

Sorry, that was a slip: I meant periodic_release. I'm only using periodic_release and globus_resubmit.

Thanks for the interest, but I'm still not sure what could be wrong.

--patrick