
Re: [Condor-users] Condor 6.8.n: job scheduling process delays



Kevin.Buckley@xxxxxxxxxxxxx wrote:
> 
> > You may want to check the StartLog on the machine in question.  It
> > appears that there may be some network issues between the shadow and
> > starter.
> >
> > When the shadow returned 100, I believe that is the OS errno.
> >
> > For Linux that is:
> > #define ENETDOWN        100     /* Network is down */
> 
> OK, will do.

No, the shadow exit code is not a Unix errno value.

See: http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=MagicNumbers

100 - JOB_EXITED - The job exited (not killed)

This is the normal status value for a completed job.

> > Also, if "<IP.AD.DR.ESS:port>" is the actual value in the logs then you
> > likely have some condor_config issues (again check your execute node),
> > which I believe could lead to the aforementioned error.
> 
> Nope, that was just me anonymising things.

Back to your original question: this is entirely a scalability issue.
Prior to the 6.9.3 release, the schedd simply couldn't handle more than
a few thousand jobs in the job queue without severe performance
degradation.  I believe your previous message stated you had around
17,500 jobs in the queue; that simply won't work with Condor 6.8.

The easiest temporary solution is to keep only a few thousand jobs in
the queue at any one time.  Since it was a single user with that many
jobs, maybe they can submit them manually in chunks or convert to using
DAGMan.  There was a recent thread, "restricting the number of jobs",
that talked about this:

https://lists.cs.wisc.edu/archive/condor-users/2009-October/msg00068.shtml
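
If submitting in chunks by hand gets tedious, the idea can be scripted.
What follows is only a rough sketch in Python, not anything shipped with
Condor.  The names submit_in_chunks, queue_length and submit are
hypothetical: queue_length and submit stand for callables you would
write yourself (for example, thin wrappers around condor_q and
condor_submit), and the limits of 2000 queued jobs and 500 jobs per
chunk are assumptions chosen to match "a few thousand", not tuned
values.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def submit_in_chunks(
    jobs: Sequence,
    queue_length: Callable[[], int],      # hypothetical: returns jobs currently queued
    submit: Callable[[List], None],       # hypothetical: submits one batch of jobs
    max_queued: int = 2000,               # assumed ceiling, "a few thousand"
    chunk_size: int = 500,                # assumed batch size
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Submit `jobs` batch by batch, never pushing the queue past `max_queued`.

    Returns the total number of jobs handed to `submit`.
    """
    if chunk_size > max_queued:
        # A batch larger than the ceiling could never be submitted.
        raise ValueError("chunk_size must not exceed max_queued")

    submitted = 0
    for batch in chunked(jobs, chunk_size):
        # Wait until the queue has drained enough to take this batch.
        while queue_length() + len(batch) > max_queued:
            sleep(poll_seconds)
        submit(batch)
        submitted += len(batch)
    return submitted
```

The two callables are passed in so the throttling logic stays separate
from however you actually talk to the schedd, and so it can be tried
out against a fake queue first.  DAGMan has its own throttling options
and is probably the better long-term answer.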

-- 
Daniel K. Forrest		Space Science and
dan.forrest@xxxxxxxxxxxxx	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison