
Re: [Condor-users] BLAST jobs go to 0% CPU; condor thinks they're running



I've run into the same problem.

Apparently when BLAST starts up, it loads the database into memory.
During this time, CPU usage is essentially 0% because all of the
activity is I/O, and the process's memory footprint grows steadily
until the database is loaded.

Once BLAST is ready, CPU usage jumps to 100% and the memory footprint
fluctuates wildly (according to Task Manager) until the job is done.
By wildly, I mean from 100MB up to 500MB and back down to 100MB in a
matter of seconds.

There are two ways that BLAST appears to get stuck during the database
loading phase.
1. BLAST continuously allocates memory for an unreasonably long period
of time. Perhaps it's working just fine, but spending a couple of hours
loading the database is bothersome.
A job that has accumulated three hours on the remote machine, with an
image size of more than half a gigabyte and no CPU time to speak of,
gets a condor_hold and a condor_release to make it try again elsewhere
(see the sketch after this list).

2. BLAST doesn't allocate memory, and remains at 0% CPU indefinitely.
Often this happens to every job in the cluster. In one case, changing
the job_renice_increment from 19 to 10 appeared to do the trick. In
another case, that change had no effect.
The solution for this one is elusive, and help would be appreciated.
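
For the curious, the hold-and-release step looks roughly like the
sketch below. The attribute names are standard job ClassAd attributes,
but the thresholds are just what suits our setup (ImageSize is in KB,
the times are in seconds), so treat it as a starting point rather than
a recipe:

    # Hold running jobs that look stuck: 3+ hours of wall time,
    # an image size over ~500MB, and almost no CPU consumed.
    condor_hold -constraint 'JobStatus == 2 && RemoteWallClockTime > 10800 \
        && ImageSize > 500000 && RemoteUserCpu < 60'

    # Release the held jobs so they can match and start elsewhere.
    # (This releases everything on hold; narrow the constraint if you
    # hold jobs for other reasons.)
    condor_release -constraint 'JobStatus == 5'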


Presently we have four dual-processor machines dedicated to the BLAST
searches. These dedicated machines have the database staged on the
local filesystem to save time, and they are the machines on which the
job_renice_increment=10 trick worked.
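
In case anyone wants to try the same change, it amounts to the
following on each execute machine (assuming the usual local config
file; the exact file location varies by installation):

    # In the local condor_config:
    job_renice_increment = 10

Then run condor_reconfig so the running daemons re-read their
configuration.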

The database isn't staged on machines in the general pool, so a
database file is transferred with each job. Everything worked well when
the files were under 25MB. With larger files, BLAST starts but doesn't
allocate memory and sits at 0% CPU until killed. Changing the
job_renice_increment did nothing for these jobs.
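
For reference, the transfer is set up with the standard submit file
commands, roughly as below; the executable and database file names are
placeholders for our actual ones:

    universe                = vanilla
    executable              = blastall
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # The formatted database files shipped with each job:
    transfer_input_files    = nr.phr, nr.pin, nr.psq
    queue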

-James Martin




"David E. Konerding" wrote:
> 
> Michael Rusch wrote:
> 
> >I don't know what you mean by your question: are the jobs still alive when
> >the CPU drops to 0%?  The processes still exist, as I can see them using
> >Task Manager (I'm on Windows XP--no ps command), but they never get any
> >CPU time.
> >
> >But, the good news is that it's working now.  Why?  I have no idea.  After
> >having these problems, I switched a couple of the machines to the UWCS
> >default settings for starting, suspending, preempting jobs, etc.  I screwed
> >up one of them pretty badly, which made startd crash constantly, so the
> >node disappeared from the pool.  After that, running BLAST worked
> >fine.  When I fixed the config script and the node came back, it still
> >worked fine.  I had not modified the config on that node at all when it
> >wasn't working... it was the same as the rest, but after breaking it and
> >fixing it, it worked.
> >
> >Go figure.
> >
> >Michael.
> >
> >
> >
> I don't know anything about the Windows version of Condor's detection of
> user activity on the machine, or its ability to suspend jobs temporarily.
> I am guessing Condor on Windows somehow detected that the machine was not
> idle (maybe you logged in to administer the machine?) and suspended the
> job.
> 
> With UNIX, you can typically run ps and see the job in the T state
> (stopped) when Condor has suspended it.
> Microsoft makes "Services for UNIX" and distributes it on its web
> site.  It includes a Win32 ps command which may tell you whether the job
> has been stopped.
> 
> Dave
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users