
RE: [Condor-users] Flocking - jobs matched but not started



To see what the job is doing on the flocked-to pool, you have to run
condor_q from there, using the -name option to point to the original
submitter.  It winds up looking something like this:

condor_q -name <name of submitter's schedd> -analyze <job>

In your case, I think it's

condor_q -name ws-60-56.dhcp.plymouth.ac.uk -analyze 48

Maybe somebody can correct me on that if I don't have it right (I'm a
little rusty).
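
If you end up running that a lot, a small wrapper might save some typing.
This is only a rough sketch: the function name flock_analyze and the DRY_RUN
variable are made up for illustration and are not part of Condor.  The only
Condor pieces are the condor_q -name and -analyze options shown above.

```shell
# Hypothetical helper (not part of Condor): analyze a job that lives in
# another machine's schedd, e.g. when checking from the flocked-to pool.
# Usage: flock_analyze <schedd-name> <job-id>
flock_analyze() {
    if [ "$#" -ne 2 ]; then
        echo "usage: flock_analyze <schedd-name> <job-id>" >&2
        return 2
    fi
    # DRY_RUN=1 is an assumption of this sketch: it prints the command
    # instead of running it, which is handy for checking the arguments.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "condor_q -name $1 -analyze $2"
    else
        condor_q -name "$1" -analyze "$2"
    fi
}
```

With that defined, `flock_analyze ws-60-56.dhcp.plymouth.ac.uk 48` would run
the same command as above.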


Also, how long have you waited for the job to run?  We've regularly seen
(and people from the Condor team have confirmed) significant lag before
flocked jobs start.  For example, the last time we flocked there were 12
jobs, and it took almost an hour for any of them to start running.  I don't
really know where the lag comes from, so maybe there are config options that
could be changed to reduce it.
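
If you want to poke at the configuration, one place to start might be to
print a few settings on the submit machine.  This is a guess on my part, not
a diagnosis: FLOCK_TO, NEGOTIATOR_INTERVAL and SCHEDD_INTERVAL are real
Condor configuration macros, but whether any of them explains the lag is
untested.  The function name show_flock_knobs is made up for illustration.

```shell
# Hypothetical helper (not part of Condor): print the current value of a
# few configuration macros that might be worth a look when flocked jobs
# are slow to start.  The choice of macros is an assumption of this sketch.
show_flock_knobs() {
    for knob in FLOCK_TO NEGOTIATOR_INTERVAL SCHEDD_INTERVAL; do
        # condor_config_val prints the configured value of one macro and
        # exits non-zero if the macro is not defined.
        if value=$(condor_config_val "$knob" 2>/dev/null); then
            echo "$knob = $value"
        else
            echo "$knob is not defined"
        fi
    done
}
```

Running `show_flock_knobs` on the submit machine and again on the flocked-to
pool's central manager (if you have access) would at least show whether the
two sides are configured differently.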

Michael.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of John Horne
> Sent: Friday, August 19, 2005 10:20 AM
> To: Condor-Users Mail List
> Subject: RE: [Condor-users] Flocking - jobs matched but not started
> 
> On Fri, 2005-08-19 at 10:57 -0500, Michael Rusch wrote:
> > For what it's worth, I have had what sounds like a similar problem for
> > quite a while, though it has been much harder for me to debug, since I
> > don't have access to logs on the flocked-to pool.  Out of curiosity,
> > when your jobs "match" but don't run, are they still listed as idle in
> > the queue?
> >
> Yes, as a snippet of condor_q shows:
> 
> -- Submitter: ws-60-56.dhcp.plymouth.ac.uk : <141.163.60.56:44957> :
> ws-60-56.dhcp.plymouth.ac.uk
>  ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
>   48.0   john     8/18 18:05   0+00:17:26 I  0   1.6  loop.remote 200
> 
> >
> > When you condor_q -analyze, are they shown as having machines that are
> > available to run the job?  I'm trying to figure out if this is the same
> > problem, in which case I may have 2 cents to put in...
> >
> No, I don't see that. condor_q -analyze shows:
> 
> ==============================================================
> [root@ws-60-56 log]# condor_q -analyze 48.0
> 
> 
> -- Submitter: ws-60-56.dhcp.plymouth.ac.uk : <141.163.60.56:44957> :
> ws-60-56.dhcp.plymouth.ac.uk
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> ---
> 048.000:  Run analysis summary.  Of 0 machines,
>       0 are rejected by your job's requirements
>       0 reject your job because of their own requirements
>       0 match, but are serving users with a better priority in the pool
>       0 match, match, but reject the job for unknown reasons
>       0 match, but will not currently preempt their existing job
>       0 are available to run your job
>         Last successful match: Fri Aug 19 17:13:00 2005
>         Last failed match: Fri Aug 19 17:15:00 2005
>         Reason for last match failure: no match found
> 
> WARNING:  Be advised:
>    No resources matched request's constraints
>    Check the Requirements expression below:
> 
> Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && ((CkptArch ==
> Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) ||
> (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >=
> ImageSize)
> 
> 
> WARNING:  Be advised:   Request 48.0 did not match any resource's
> constraints
> ==============================================================
> 
> However, this will just be picked up on the remote server and matched
> with a client in its pool. So I think 'condor_q -analyze' is a bit
> misleading here, as it seems to show a job that is having a problem
> running. Having said that, the condor_q command is looking at the job
> and seeing whether it can run locally (which it can't). In my case I
> have stopped startd, so the job won't run locally and must be flocked.
> 
> 
> 
> John.
> 
> --
> ---------------------------------------------------------------
> John Horne, University of Plymouth, UK  Tel: +44 (0)1752 233914
> E-mail: John.Horne@xxxxxxxxxxxxxx       Fax: +44 (0)1752 233839
> 