
[Condor-users] parallel universe: machine_count does not mean number of jobs?



Hi

I'm just taking my first steps with the parallel universe. I've started with 
the tiny example from the manual:
--------------8><-----------------8><--------------8><-----------
$ cat test.sub
######################################
## Parallel example submit description file
######################################
universe = parallel
executable = /bin/cat
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 20
queue
--------------8><-----------------8><--------------8><-----------
So far so good. I've set up a couple of our quad-core compute nodes (dedicated 
to running Condor jobs) to use my submit host as the DedicatedScheduler. After 
I submit the job, some nodes start and then the job finishes (successfully, from 
Condor's point of view):
--------------8><-----------------8><--------------8><-----------
$ cat logfile
000 (683458.000.000) 09/08 08:32:16 Job submitted from host: 
<10.20.30.1:41791>
...
014 (683458.000.000) 09/08 08:32:45 Node 0 executing on host: 
<10.10.16.6:32809>
...
014 (683458.000.003) 09/08 08:32:45 Node 3 executing on host: 
<10.10.16.8:60916>
...
014 (683458.000.005) 09/08 08:32:45 Node 5 executing on host: 
<10.10.16.9:52771>
...
014 (683458.000.016) 09/08 08:32:45 Node 16 executing on host: 
<10.10.16.51:34268>
...
015 (683458.000.000) 09/08 08:32:45 Node 0 terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Node
        0  -  Run Bytes Received By Node
        0  -  Total Bytes Sent By Node
        0  -  Total Bytes Received By Node
...
005 (683458.000.000) 09/08 08:32:45 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
--------------8><-----------------8><--------------8><-----------
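
To cross-check the log against machine_count, a small script along these lines might help. It is only a sketch: it assumes the "014 ... Node N executing on host:" header format shown in the excerpt above, and the function names (executing_nodes, missing_nodes) are made up for illustration and are not part of Condor.

```python
import re

# Header of a "node executing" event as it appears in the log excerpt, e.g.
# "014 (683458.000.003) 09/08 08:32:45 Node 3 executing on host:"
NODE_EXEC = re.compile(r"^014 \(\d+\.\d+\.\d+\) .*\bNode (\d+) executing on host:")

def executing_nodes(log_text):
    """Return the sorted node numbers that logged an 'executing' event."""
    nodes = set()
    for line in log_text.splitlines():
        match = NODE_EXEC.match(line)
        if match:
            nodes.add(int(match.group(1)))
    return sorted(nodes)

def missing_nodes(log_text, machine_count):
    """Return node numbers in 0..machine_count-1 with no 'executing' event."""
    seen = set(executing_nodes(log_text))
    return [n for n in range(machine_count) if n not in seen]

# Event headers copied from the log excerpt above (host lines omitted).
SAMPLE = """\
014 (683458.000.000) 09/08 08:32:45 Node 0 executing on host:
...
014 (683458.000.003) 09/08 08:32:45 Node 3 executing on host:
...
014 (683458.000.005) 09/08 08:32:45 Node 5 executing on host:
...
014 (683458.000.016) 09/08 08:32:45 Node 16 executing on host:
...
"""

if __name__ == "__main__":
    print("executing:", executing_nodes(SAMPLE))
    print("missing:  ", missing_nodes(SAMPLE, 20))
```

To run it on the real log, the contents of "logfile" can be read and passed in place of SAMPLE. With machine_count = 20, it lists every node that never reported an executing event.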

Not surprisingly, only the four nodes named in the log produced output:
$ find . -name "outfile.*" -size "+0c"
./outfile.0
./outfile.3
./outfile.16
./outfile.5

All the others are empty. I guess I'm missing something here. Can someone please 
tell me which wall I'm running into?

Cheers

Carsten