I have a four-machine Condor pool (three of the machines are dual-processor, so there are 7 virtual machines), with all machines running Windows XP.
I have tried several times to submit a job cluster that has sixteen individual jobs (processes). They are all BLAST searches, for those who are familiar with BLAST. Each job uses two input files, and a batch script issues the two necessary commands (formatdb and blastall). There are four input files in total, and the submit script queues one process for every ordered pair of input files (4 x 4 = 16 jobs).
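To make the job layout concrete, here is a rough Python sketch of how the sixteen ordered pairs map onto a submit description. The input file names (a.fasta and so on) and the batch script name run_blast.bat are hypothetical placeholders, not the real names, and the generated text only shows the general shape of such a submit file, not the actual one used here.

```python
from itertools import product

# Hypothetical input file names; the real ones are not given in this post.
INPUTS = ["a.fasta", "b.fasta", "c.fasta", "d.fasta"]


def ordered_pairs(files):
    """Return every ordered pair of files, including each file paired with itself."""
    return list(product(files, repeat=2))


def submit_description(files, script="run_blast.bat"):
    """Build submit-description text with one 'queue' statement per ordered pair.

    'script' is a placeholder name for the batch file that runs
    formatdb and then blastall on the two files it is given.
    """
    lines = [
        "universe = vanilla",
        f"executable = {script}",
    ]
    for first, second in ordered_pairs(files):
        # Each job gets its own pair of input files as arguments.
        lines.append(f"arguments = {first} {second}")
        lines.append("queue")
    return "\n".join(lines)


if __name__ == "__main__":
    print(submit_description(INPUTS))
```

With four input files this produces sixteen arguments/queue pairs, one per job in the cluster.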
Every time I’ve submitted the cluster, it completes the first four jobs (searching a single input file against each of the other ones) and runs the others for about a minute, after which the execute machine beeps (it’s the “Asterisk” sound) and the processes drop to 0% CPU. They do not all drop at the same moment, but within a short time of one another. condor_q reports that the jobs are still running, but they are not. In one case, they resumed for a brief period after several hours of doing nothing. There is nothing in the Condor logs.
If I run the jobs without Condor, they work fine (though it takes forever). I also noticed that, for some reason, the jobs use significantly more CPU when run through Condor than when run individually on the local machine.