
Re: [HTCondor-users] Blast job crash !



Thank you for your response, Dimitri.

Dimitri Maziuk <dmaziuk@...> writes:

> The $15 question is how many slots are hitting the share at once.

For the time being I am using a test pool with 10-12 slots minimum (and 
about 20-25 max). I place all input and output files on the NAS, accessed 
over NFS (or CIFS).
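
For reference, the submit file for my test jobs looks roughly like this (a 
simplified sketch; the blastall arguments, paths and job count are only 
examples, not my exact setup):

    universe              = vanilla
    executable            = /usr/bin/blastall
    # input, database and output all live on the NFS-mounted NAS share
    arguments             = -p blastn -d /mnt/nas/db/nt -i query_$(Process).fasta -o result_$(Process).txt
    initialdir            = /mnt/nas/blast_run
    # no Condor file transfer: the jobs rely on the shared filesystem
    should_transfer_files = NO
    log                   = blast.log
    error                 = blast_$(Process).err
    queue 20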

> We blast about 8,000 sequences weekly on a ~40-core cluster. The search 
> database is custom, > 10GB (I forget exactly how many).

In the future (once my tests with Condor are complete) we will use 
approximately 120-140 CPUs in the cluster, and probably even more later.
The job sizes are not always the same: I run my tests with small jobs, but 
on the real grid we need to run bigger ones.

> Placing input and output files on a basic linux nfs server (opteron 
> supermicro w/ decent amount of ram & desktop-level sata drives in 
> software raid) works fine here. 

The NAS is not administered by me, and its administrator tells me that 
"there is no overloading". It is a relatively big server with plenty of 
I/O capacity, so apparently the problem is not NAS performance. Could it 
be an NFS server/client configuration problem?
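
For instance, one thing I can check with the NAS administrator is how the 
share is mounted on the execute nodes: as far as I know a "soft" NFS mount 
can return I/O errors to the application when the server is slow to reply, 
while a "hard" mount keeps retrying. An example fstab line (the server 
name, path and options are only an illustration, not our real settings):

    # /etc/fstab on an execute node -- example values only
    nas-server:/export/blast  /mnt/nas  nfs  hard,intr,rsize=32768,wsize=32768  0  0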

> Placing the database on nfs doesn't 
> work, we rsync the database on each host at the start of the batch.

You mean that I have to transfer all the files to each machine every time? 
I think that would be more constraining (slower) than using CIFS.
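
If I understand correctly, it would mean running something like this on 
each execute node at the start of a batch (a sketch; the host name and 
paths are just examples):

    #!/bin/sh
    # copy the BLAST database from the NAS to local disk before the batch;
    # rsync re-transfers only the files that changed since the last copy,
    # so the cost mainly depends on how often the database is updated
    rsync -a nas-server:/export/blastdb/ /scratch/blastdb/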

I hope there is a way to use NFS correctly...

For the time being the real pool uses Sun Grid Engine (SGE), which is 
complicated to configure and does not really meet our needs. But with SGE 
there is no NFS problem, so the issue may come from my Condor configuration.

If somebody is using Condor in the same or a similar topology as mine, I 
would appreciate your help.

Thank you.
Regards.

-* Romain *-