
RE: [Condor-users] BLAST jobs go to 0% CPU; condor thinks they're running



FYI: in my case, the database was small and not staged on the individual
machines.  It wasn't working at first, and then it was--not sure what
changed.  I didn't check memory allocation, and I can't reproduce the
problem because it's working now.

Michael.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of James Martin
> Sent: Thursday, April 14, 2005 4:39 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] BLAST jobs go to 0% CPU; condor
> thinks they're running
> 
> I've run into the same problem.
> 
> Apparently when BLAST starts up, it loads the database into memory.
> During this time, the CPU usage is essentially 0% because all of the
> activity is I/O. While this is going on, the size in memory increases
> steadily until the database is loaded.
> 
> Once BLAST is ready, the cpu usage jumps to 100% and the size in memory
> fluctuates wildly (according to the Task Manager) until the job is done.
> By wildly, I mean from 100MB to 500MB back down to 100MB in a matter of
> seconds.
> 
> There are two ways that BLAST appears to get stuck during the database
> loading phase.
> 1. BLAST continuously allocates memory for an unreasonably long period
> of time. Perhaps it's working just fine, but spending a couple hours
> loading the database is bothersome.
> A job with 3 hours of time on the remote machine, an image size of more
> than half a gig and no cpu time to speak of, will get a condor_hold and
> a condor_release to make it try again elsewhere.
> 
> 2. BLAST doesn't allocate memory, and remains at 0% cpu indefinitely.
> Often, this happens to all jobs in the cluster. In one case, changing
> the job_renice_increment from 19 to 10 appeared to do the trick. In
> another case, that change had no effect.
> The solution for this is elusive, and help would be appreciated.
> 
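The hold-and-release rule described in case 1 can be sketched as a simple predicate. This is only an illustration, not the script actually in use: the 3-hour and half-gigabyte thresholds come from the message above, while the `JobStats` fields, the `looks_stuck` name, and the 60-second cutoff for "no cpu time to speak of" are assumptions.

```python
from dataclasses import dataclass


@dataclass
class JobStats:
    """Hypothetical summary of one running job (field names are assumptions)."""
    remote_wall_seconds: float  # time spent on the remote machine
    image_size_mb: float        # current image size in memory
    cpu_seconds: float          # CPU time accumulated so far


def looks_stuck(job, min_wall_seconds=3 * 3600, min_image_mb=512,
                max_cpu_seconds=60):
    """Return True if the job matches the 'still loading the database' pattern.

    A job qualifies when it has been on the remote machine for a long time,
    has a large memory image, and has used almost no CPU. The 60-second CPU
    cutoff is an assumed stand-in for "no cpu time to speak of".
    """
    return (job.remote_wall_seconds >= min_wall_seconds
            and job.image_size_mb > min_image_mb
            and job.cpu_seconds <= max_cpu_seconds)
```

A job flagged this way would then get the condor_hold followed by condor_release mentioned above, so that it tries again on another machine.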
> 
> Presently we have 4 dual-processor machines dedicated to the BLAST
> searches. These dedicated machines have the database staged on the local
> filesystem to save time. These are the machines that the
> job_renice_increment=10 trick worked on.
> 
> The database isn't staged on machines in the general pool, so a database
> file is transferred with each job. Everything worked well when the files
> were under 25MB. With larger files, BLAST starts but doesn't allocate
> memory and stays at 0% cpu until killed. Changing the
> job_renice_increment didn't do anything for these jobs.
> 
> -James Martin
> 
> 
> 
> 
> "David E. Konerding" wrote:
> >
> > Michael Rusch wrote:
> >
> > >I don't know what you mean by your question: are the jobs still alive
> > >when the CPU drops to 0%.  The processes still exist, as I can see them
> > >using Task Manager (I'm in Windows XP--no ps command), but they never
> > >get any CPU time.
> > >
> > >But, the good news is that it's working now.  Why?  I have no idea.
> > >After having these problems, I switched a couple of the machines to the
> > >UWCS default settings for starting, suspending, preempting jobs, etc.
> > >I screwed up one of them pretty badly, which made startd crash
> > >constantly, so that that node disappeared from the pool.  After that,
> > >running the BLAST worked fine.  When I fixed the config script and the
> > >node came back, it still worked fine.  I had not modified the config on
> > >that node at all when it wasn't working...it was the same as the rest,
> > >but after breaking it and fixing it, it worked.
> > >
> > >Go figure.
> > >
> > >Michael.
> > >
> > >
> > >
> > I don't know anything about the Windows version of Condor's detection of
> > user activity on the machine, or its ability to suspend jobs temporarily.
> > I am guessing somehow Windows detected the machine was not idle (maybe
> > you logged in to administer the machine?) and suspended the job.
> >
> > With UNIX, you can typically run ps and see the job in the T state
> > (stopped) when condor has suspended it.
> > Microsoft makes "Services for UNIX" and distributes it on their web
> > site.  It includes a win32 ps command which may tell you whether the job
> > has been stopped.
> >
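As a rough illustration of the ps check described above, the following sketch picks out stopped processes from ps output. It assumes the text came from something like `ps -o pid,stat,comm` (a header line, then PID, state, and command columns); the `stopped_pids` helper is hypothetical and only parses text, it does not run ps itself.

```python
def stopped_pids(ps_output):
    """Return the PIDs whose state column starts with 'T' (stopped).

    Assumes `ps -o pid,stat,comm` style output: one header line, then
    whitespace-separated PID, STAT and COMMAND columns.
    """
    stopped = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # blank or malformed line
        pid, stat = fields[0], fields[1]
        if pid.isdigit() and stat.startswith("T"):
            stopped.append(int(pid))
    return stopped


sample = """  PID STAT COMMAND
  101 R    blastall
  102 T    blastall
  103 Ss   bash
"""
print(stopped_pids(sample))  # [102]
```

A BLAST process showing up in this list while Condor reports it as running would be consistent with the job having been suspended.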
> > Dave
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users