
[HTCondor-users] Parallel MPI Job remaining idle



Hi Guys,

I'm relatively new to HTCondor, but this one has me stumped. I'm attempting to submit an MPI job in the parallel universe with the following submit file:

universe             = parallel
executable           = /home/condor/release/etc/examples/mp1script
arguments            = cpi
machine_count        = 2
log                  = cpi.log
output               = cpi.$(NODE).out
error                = cpi.$(NODE).err
transfer_input_files = cpi
queue

The job is queued and then remains idle, and I'm unable to work out why.
It looks like 24 machines meet the criteria (these are slots on physical machines), but the job is never scheduled.
Usually the line "Sent ad to 1 collectors for condor@xxxxxxxxxx" is followed by "Using negotiation protocol: NEGOTIATE", but in this case negotiation doesn't seem to be attempted.
If anyone has any ideas about what to look at or what is causing this, that would be great.
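
A quick way to confirm that pattern over a longer log is a small script along these lines. It is only a minimal sketch: the function name is made up for illustration, the sample lines are copied from the SchedLog excerpt in this message, and the marker strings are simply the two phrases quoted above.

```python
# Sketch only: count "ad sent" lines versus "negotiation" lines in a SchedLog.
# The marker strings are the phrases quoted in this message, nothing more.

def count_markers(log_lines):
    """Return (ads_sent, negotiations) for an iterable of log lines."""
    ads_sent = 0
    negotiations = 0
    for line in log_lines:
        if "Sent ad to" in line and "collectors for" in line:
            ads_sent += 1
        if "Using negotiation protocol" in line:
            negotiations += 1
    return ads_sent, negotiations

# Sample lines taken from the SchedLog excerpt in this message.
SAMPLE = [
    "07/10/13 15:45:58 (pid:13106) Sent ad to 1 collectors for condor@xxxxxxxxxx",
    "07/10/13 15:45:58 (pid:13106) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1",
    "07/10/13 15:50:59 (pid:13106) Sent ad to 1 collectors for condor@xxxxxxxxxx",
]

if __name__ == "__main__":
    sent, negotiated = count_markers(SAMPLE)
    print("ads sent:", sent, "negotiations:", negotiated)
```

On the excerpt here it finds ads being sent but no negotiation lines, which matches what I'm seeing by eye.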

The logs are shown below.

The SchedLog:
07/10/13 15:45:58 (pid:13106) Sent ad to central manager for condor@xxxxxxxxxx
07/10/13 15:45:58 (pid:13106) Sent ad to 1 collectors for condor@xxxxxxxxxx
07/10/13 15:45:58 (pid:13106) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1
07/10/13 15:46:15 (pid:13106) Number of Active Workers 1
07/10/13 15:46:15 (pid:16253) Number of Active Workers 0
07/10/13 15:46:22 (pid:13106) Number of Active Workers 1
07/10/13 15:46:22 (pid:16255) Number of Active Workers 0
07/10/13 15:50:59 (pid:13106) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
07/10/13 15:50:59 (pid:13106) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
07/10/13 15:50:59 (pid:13106) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
07/10/13 15:50:59 (pid:13106) Sent ad to central manager for condor@xxxxxxxxxx
07/10/13 15:50:59 (pid:13106) Sent ad to 1 collectors for condor@xxxxxxxxxx
07/10/13 15:50:59 (pid:13106) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1


Output of condor_q -better-analyze:
029.000:  Request has not yet been considered by the matchmaker.

User priority for condor@xxxxxxxxxx is not available, attempting to analyze without it.
---
029.000:  Run analysis summary.  Of 28 machines,
      4 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     24 are available to run your job

The Requirements expression for your job is:

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

Your job defines the following attributes:

    FileSystemDomain = "york.ac.uk"
    DiskUsage = 1500
    ImageSize = 2
    RequestDisk = 1500
    RequestMemory = 1

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[1]          24  TARGET.OpSys == "LINUX"
[7]          28  TARGET.HasFileTransfer

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.OpSys == "LINUX" )       24
2   ( TARGET.Arch == "X86_64" )       28
3   ( TARGET.Disk >= 1500 )           28
4   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                      28
5   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "york.ac.uk" ) )
                                      28
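
As a sanity check on the analysis above, the Requirements expression can be mirrored in plain Python. This is only a sketch under stated assumptions: the slot dictionaries are made-up examples rather than real ads from this pool, the function name is hypothetical, and real matchmaking uses ClassAd semantics (undefined values, for instance) that this ignores.

```python
# Sketch only: mirrors the job's Requirements expression from the
# condor_q output. The slot dictionaries below are hypothetical examples.

# Job attributes as reported by condor_q in this message.
JOB = {
    "FileSystemDomain": "york.ac.uk",
    "RequestDisk": 1500,
    "RequestMemory": 1,
}

def slot_matches(slot, job=JOB):
    """Return True if a slot ad (a plain dict) satisfies the job's Requirements."""
    return (
        slot.get("Arch") == "X86_64"
        and slot.get("OpSys") == "LINUX"
        and slot.get("Disk", 0) >= job["RequestDisk"]
        and slot.get("Memory", 0) >= job["RequestMemory"]
        and (
            bool(slot.get("HasFileTransfer", False))
            or slot.get("FileSystemDomain") == job["FileSystemDomain"]
        )
    )

# Hypothetical slots: one Linux slot and one otherwise identical non-Linux slot.
linux_slot = {
    "Arch": "X86_64",
    "OpSys": "LINUX",
    "Disk": 100000,
    "Memory": 2048,
    "HasFileTransfer": True,
}
other_slot = dict(linux_slot, OpSys="WINDOWS")

if __name__ == "__main__":
    print("linux slot matches:", slot_matches(linux_slot))
    print("other slot matches:", slot_matches(other_slot))
```

Under these assumptions only the OpSys clause separates matching from rejected slots, which agrees with the 24 of 28 reported above. It does not explain why the job stays idle.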

Thanks,
Matt