
[HTCondor-users] Dedicated scheduler with CCB (connection refused: connection to startd failed)



Hello!

My parallel job does not start on my startd connected via CCB.

I have three machines behind a firewall on a private network: the first runs the schedd (Submit Node), the second runs the collector (Cluster Manager), and the third runs a startd (Inner Worker Node). Another computer outside the firewall also runs a startd daemon (Outer Worker Node).

My parallel tasks work on the Inner Worker Node machine as expected. I have the following settings on the Inner Worker Node:
DedicatedScheduler = "DedicatedScheduler@parallel_schedd@submit.htcondor"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK = Scheduler =?= $(DedicatedScheduler)

So I connected the Outer Worker Node to the Cluster Manager using the CCB mechanism, and I added the same DedicatedScheduler settings to the Outer Worker Node. The `condor_status` command on the Submit Node and on the Cluster Manager shows the correct information about my startd daemons: 32 slots on the Inner Worker Node (Linux) and 4 slots on the Outer Worker Node (Windows).

But when I try to submit a parallel task consisting of two subtasks, I get the following errors in the logs:
condor_schedd[1286]: attempt to connect to <10.7.128.15:50371> failed: Connection refused (connect errno = 111).
condor_schedd[1286]: Failed to send REQUEST_CLAIM to startd slot1@w7-demo15 <10.7.128.15:50371?addrs=10.7.128.15-50371&alias=htcondor-remote> for DedicatedScheduler@parallel_schedd: SECMAN:2003:TCP connection to startd slot1@w7-demo15 <10.7.128.15:50371?addrs=10.7.128.15-50371&alias=htcondor-remote> for DedicatedScheduler@parallel_schedd failed.
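
One way to inspect the address the schedd tried is to split it into its query parameters and look for CCB contact information. The following is only a minimal sketch: it assumes that a startd registered with a CCB server advertises a `CCBID` parameter in its address, and the helper name `sinful_params` is hypothetical, not part of HTCondor.

```python
from urllib.parse import parse_qs


def sinful_params(sinful: str) -> dict:
    """Return the query parameters of an address like <host:port?key=value&...>."""
    inner = sinful.strip().lstrip("<").rstrip(">")
    _, _, query = inner.partition("?")
    # parse_qs drops parameters without a value and returns lists; keep the first value.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Address copied from the schedd log above.
addr = "<10.7.128.15:50371?addrs=10.7.128.15-50371&alias=htcondor-remote>"
params = sinful_params(addr)

# Assumption: a CCB-registered daemon would carry a CCBID parameter here.
has_ccb = "CCBID" in params
print(params)
print("CCB contact info present:", has_ccb)
```

If the check reports no `CCBID` for the address in the log, that would be consistent with the schedd attempting a direct TCP connection to the startd rather than a CCB-brokered one.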

Does this mean that the Submit Node is trying to open a TCP connection directly to the Outer Worker Node? If so, how should I configure the schedd and startd nodes so that I can submit my parallel tasks when using CCB? (Only connections from the Outer Worker Node to the Cluster Manager are possible.)
Could you help me overcome this, please?

Thank you very much in advance!

P.S. diagnostic on the Submit Node:

# condor_q -better-analyze -verbose -allusers -debug
Fetching Machine ads... 36 ads.
Fetching job ads... 2 ads

The Requirements expression for job 2.000 is

  ((OpSys == "LINUX") && (Arch == "X86_64")) && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 2.000 defines the following attributes:

  DiskUsage = 1
  ImageSize = 1
  RequestDisk = DiskUsage
  RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 2.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          32  OpSys == "LINUX"


002.000:  Job has not yet been considered by the matchmaker.


002.000:  Run analysis summary ignoring user priority. Of 36 machines,
   4 are rejected by your job's requirements [slot1@w7-demo15, slot2@w7-demo15, slot3@w7-demo15, slot4@w7-demo15]
   0 reject your job because of their own requirements
   2 match and are already running your jobs
   0 match but are serving other users
  30 are able to run your job

The Requirements expression for job 2.001 is

  ((OpSys == "WINDOWS") && (Arch == "X86_64")) && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 2.001 defines the following attributes:

  DiskUsage = 1
  ImageSize = 1
  RequestDisk = DiskUsage
  RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 2.001 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           4  OpSys == "WINDOWS"


002.001:  Job has not yet been considered by the matchmaker.


002.001:  Run analysis summary ignoring user priority. Of 36 machines,
  32 are rejected by your job's requirements [slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, slot2@xxxxxxxxxxxxxxxx]
   0 reject your job because of their own requirements
   0 match and are already running your jobs
   0 match but are serving other users
   4 are able to run your job

--
Sincerely yours,
Ivan Ergunov                         mailto:hozblok@xxxxxxxxx