Re: [Condor-users] Problem with periodic_release and globus_resubmit
- Date: Thu, 12 Feb 2009 13:17:17 -0800
- From: Patrick Armstrong <patricka@xxxxxxx>
- Subject: Re: [Condor-users] Problem with periodic_release and globus_resubmit
On 12-Feb-09, at 10:23 AM, Steven Timm wrote:
Patrick--we are doing almost the same thing here at FermiGrid.
That's great! It gives me hope that I'll get this to work.
On the jobs that are idle:
1) What does the UserLog say?
http://pastie.org/387553
2) What does condor_q -ana say?
2389.000:  Run analysis summary.  Of 3 machines,
      1 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      2 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
Last successful match: Thu Feb 12 11:55:09 2009
The one resource is rejected because I have (TARGET.Name =!= LastMatchName0) in my Requirements section.
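For readers unfamiliar with the =!= operator in that clause, the following is a minimal Python sketch of how the expression behaves. It only models the ClassAd "is not identical to" semantics and is not Condor code; the machine names, the UNDEFINED sentinel, and the helper functions are hypothetical, chosen to mirror the 3-machine pool in the analysis above.

```python
# Illustrative model only; not part of Condor.
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def is_not_identical(a, b):
    """Model of ClassAd `=!=`: always TRUE or FALSE, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        # Two UNDEFINED values are identical; UNDEFINED vs. a value is not.
        return a is not b
    return not (type(a) is type(b) and a == b)


def requirement(job_ad, machine_ad):
    """Models the clause: TARGET.Name =!= LastMatchName0."""
    return is_not_identical(
        machine_ad.get("Name", UNDEFINED),
        job_ad.get("LastMatchName0", UNDEFINED),
    )


# Hypothetical pool of three machines; the job last matched the first one.
machines = [
    {"Name": "gk-a.example.org"},
    {"Name": "gk-b.example.org"},
    {"Name": "gk-c.example.org"},
]
job = {"LastMatchName0": "gk-a.example.org"}

rejected = [m["Name"] for m in machines if not requirement(job, m)]
print(rejected)  # only the previously matched machine is rejected
```

In this model a job that has never matched (no LastMatchName0 attribute) is accepted by every machine, because comparing a defined name against UNDEFINED with =!= yields TRUE rather than UNDEFINED.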
3) What does condor_q -globus say?
-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:65121> : ms-gavia-testing.phys.UVic.CA
 ID      OWNER   STATUS       MANAGER  HOST     EXECUTABLE
2389.0   dev07   UNSUBMITTED  fork     [?????]  run-run.sh
In our experience, GridJobStatus is never undefined for any Grid universe
job. If it is undefined, something is very wrong. I just queried my queue
of more than 1000 jobs, and it is defined for all of them. I don't see how
the GlobusResubmit statement you have below could ever be true.
Well, I don't know what to tell you. Here's the output of condor_q -l
for one of my perpetually idle jobs: http://pastie.org/387562
Also,
[root@ms-gavia-testing ~]# condor_q -constraint "(GridJobStatus =?= UNDEFINED) && (NumSystemHolds >= NumJobMatches)"
-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:65121> : ms-gavia-testing.phys.UVic.CA
 ID      OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
2389.0   dev07   2/12 11:51   0+00:00:00  I   0    0.0   run-run.sh
So, it _is_ evaluating to true.
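To make the evaluation concrete, here is a rough Python sketch of how that constraint resolves under ClassAd three-valued logic. It is a model under stated assumptions, not Condor's evaluator: the helper names and the UNDEFINED sentinel are hypothetical, and the counter values and the "ACTIVE" status are made-up illustrations rather than values taken from job 2389.

```python
# Illustrative model only; not part of Condor.
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def is_identical(a, b):
    """Model of ClassAd `=?=`: always TRUE or FALSE, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def greater_equal(a, b):
    """Model of `>=`: UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a >= b


def logical_and(a, b):
    """Model of `&&`: FALSE dominates, otherwise UNDEFINED propagates."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def constraint(ad):
    """(GridJobStatus =?= UNDEFINED) && (NumSystemHolds >= NumJobMatches)"""
    return logical_and(
        is_identical(ad.get("GridJobStatus", UNDEFINED), UNDEFINED),
        greater_equal(
            ad.get("NumSystemHolds", UNDEFINED),
            ad.get("NumJobMatches", UNDEFINED),
        ),
    )


# Hypothetical ads: an idle job with no GridJobStatus, and a submitted one.
idle_job = {"NumSystemHolds": 1, "NumJobMatches": 1}
submitted_job = {"GridJobStatus": "ACTIVE", "NumSystemHolds": 1, "NumJobMatches": 1}

print(constraint(idle_job) is True)        # constraint selects the idle job
print(constraint(submitted_job) is False)  # a defined status never matches
```

The point of using =?= rather than == is that the first clause yields TRUE when the attribute is missing; with ==, a comparison against a missing attribute would evaluate to UNDEFINED and the job would not be selected.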
Also, you mentioned a periodic_hold statement below, but you
did not actually show us what it was.
Sorry, I'm just using periodic_release and globus_resubmit. I meant
periodic_release.
Thanks for the interest, but I'm still not sure what could be wrong.
--patrick