
RE: [Condor-users] BLAST jobs go to 0% CPU; condor thinks they're running



I'm not sure what you mean by asking whether the jobs are still alive when
the CPU drops to 0%.  The processes still exist, as I can see them in Task
Manager (I'm on Windows XP, so there is no ps command), but they never get
any CPU time.

But the good news is that it's working now.  Why?  I have no idea.  After
having these problems, I switched a couple of the machines to the UWCS
default settings for starting, suspending, and preempting jobs.  I screwed
up one of them pretty badly, which made startd crash constantly, so that
node disappeared from the pool.  After that, the BLAST jobs ran fine.  When
I fixed the config script and the node came back, they still ran fine.  I
had not modified the config on that node at all while it wasn't working (it
was the same as the rest), but after breaking it and fixing it, it worked.

Go figure.

Michael.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
> Sent: Monday, March 07, 2005 1:56 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] BLAST jobs go to 0% CPU; condor thinks
> they're running
> 
> On Tue, 1 Mar 2005, Michael Rusch wrote:
> 
> > I have a four machine condor pool (three are dual-processor, so there
> > are 7 virtual machines), with all machines running Windows XP.
> >
> > I have tried several times to submit a job cluster that has sixteen
> > individual jobs/processes.  They're all BLAST searches, for those who
> > are familiar with BLAST.  Each job uses two input files and a batch
> > script issues the two commands necessary (formatdb and blastall).
> > There are a total of four input files and the submit script queues one
> > process for every ordered pair of input files (for 4x4 = 16 jobs).
> >
> > Every time I've submitted the cluster it completes the first four jobs
> > (searching a single input file against each of the other ones), and it
> > runs the others for about a minute, after which the execute machine
> > beeps (it's the "Asterisk" sound), and then processes drop down to 0%
> > CPU.  They do not drop down at the same time, but close to one
> > another.  Condor_q reports that they are still running, but they are
> > not.  In one case, they resumed for a brief period of time after
> > several hours of not doing anything.  Nothing in the condor logs.
> >
> > If you run the jobs without condor, it works fine (though it takes
> > forever).  Also, I noticed that for some reason the jobs when run
> > through condor use significantly more CPU than when you just run
> > individually on the local machine.
> 
> Are the jobs still alive when the CPU drops to 0%? You can check by
> logging into the machines, running ps and looking for processes named
> condor_exec.exe. If you are programming savvy and know something about
> the BLAST code, you can attach to them with a debugger to see why they're
> stuck.
> 
> +----------------------------------+---------------------------------+
> |            Jaime Frey            |  Public Split on Whether        |
> |        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
> |  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
> +----------------------------------+---------------------------------+
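
For readers following the thread, the job layout described in the quoted
message (four input files, one process per ordered pair, 4x4 = 16 jobs) can
be sketched as below.  This is only an illustration: the file names are
hypothetical, since the original message does not name its inputs, and
which file of each pair acts as the query and which as the database is an
assumption.

```python
from itertools import product

# Hypothetical file names; the original message does not name its inputs.
input_files = ["a.fasta", "b.fasta", "c.fasta", "d.fasta"]

# Every ordered pair of input files, including a file paired with itself.
# Treating the first element as the query and the second as the database
# is an assumption made for illustration.
jobs = list(product(input_files, repeat=2))

for process_id, (query, database) in enumerate(jobs):
    print(f"process {process_id}: query={query} database={database}")
```

With this ordering, the first four processes all pair the first input file
with each of the four files in turn.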