
Re: [HTCondor-users] How to find the name of the condor_collector and the name of the condor_schedd daemon? Thank you very much.



Dear Todd,

Thank you very much for your reply, but the problem is still there. Can you help me resolve it?

I have already modified the /etc/condor/condor_config file for pool A (named 181.nodeljA), adding the following content to the file:
FLOCK_TO = 188.nodeljB
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS), $(IP_ADDRESS)

CONDOR_GAHP = $(SBIN)/condor_c-gahp
C_GAHP_LOG = /tmp/CGAHPLog.$(USERNAME)
C_GAHP_WORKER_THREAD_LOG = /tmp/CGAHPWorkerLog.$(USERNAME)
C_GAHP_WORKER_THREAD_LOCK = /tmp/CGAHPWorkerLock.$(USERNAME)
     
The same file for pool B (named 188.nodeljB) was modified as follows:
FLOCK_FROM=181.nodeljA
FLOCK_TO=
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
ALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(IP_ADDRESS)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ=*.nodeljB
ALLOW_WRITE=*.nodeljB
ALLOW_NEGOTIATOR = xxx@$(CONDOR_HOST), $(IP_ADDRESS)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS), $(IP_ADDRESS)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD    = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_READ_COLLECTOR  = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD     = $(ALLOW_READ), $(FLOCK_FROM)
USE_NFS         = True
LOCK            = $(LOCAL_DIR)/lock/condor
       
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
 
The HTCondor submit description file (named sub.txt) looks like this:
universe=grid
executable=/data/condor_test/CondorTest.class
input=/data/condor_test/list.txt
arguments=CondorTest181795_2014-05-14_152801.mp4
log=/data/condor_test/condor.log
error=/data/condor_test/condor.error
grid_resource=condor  188.nodeljB   188.nodeljB
+remote_universe=10
+remote_requirements=True
+remote_ShouldTransferFiles='YES'
queue

When I run the command "condor_submit sub.txt", all the jobs are held, and the job log says:
012 (031.239.000) 06/14 08:54:52 Job was held.
        GridResource missing pool name
        Code 0 Subcode 0


And the content of /var/log/condor/GridmanagerLog.xxx is as follows:

06/14/16 09:01:06 ******************************************************
06/14/16 09:01:06 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
06/14/16 09:01:06 ** /usr/sbin/condor_gridmanager
06/14/16 09:01:06 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
06/14/16 09:01:06 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
06/14/16 09:01:06 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
06/14/16 09:01:06 ** $CondorPlatform: x86_64_RedHat6 $
06/14/16 09:01:06 ** PID = 3607
06/14/16 09:01:06 ** Log last touched 6/14 08:54:57
06/14/16 09:01:06 ******************************************************
06/14/16 09:01:06 Using config source: /etc/condor/condor_config
06/14/16 09:01:06 Using local config sources:
06/14/16 09:01:06    /etc/condor/condor_config.local
06/14/16 09:01:06 config Macros = 62, Sorted = 62, StringBytes = 1644, TablesBytes = 2272
06/14/16 09:01:06 CLASSAD_CACHING is ENABLED
06/14/16 09:01:06 Daemon Log is logging: D_ALWAYS D_ERROR
06/14/16 09:01:06 Daemoncore: Listening at <0.0.0.0:55344> on TCP (ReliSock) and UDP (SafeSock).
06/14/16 09:01:06 DaemonCore: command socket at <192.168.1.181:55344?addrs=192.168.1.181-55344>
06/14/16 09:01:06 DaemonCore: private command socket at <192.168.1.181:55344?addrs=192.168.1.181-55344>
06/14/16 09:01:09 [3607] Found job 32.0 --- inserting
06/14/16 09:01:09 [3607] Found job 32.1 --- inserting
06/14/16 09:01:10 [3607] (32.0) doEvaluateState called: gmState GM_HOLD, remoteState -1
06/14/16 09:01:10 [3607] (32.1) doEvaluateState called: gmState GM_HOLD, remoteState -1
06/14/16 09:01:15 [3607] No jobs left, shutting down
06/14/16 09:01:15 [3607] Got SIGTERM. Performing graceful shutdown.
06/14/16 09:01:15 [3607] **** condor_gridmanager (condor_GRIDMANAGER) pid 3607 EXITING WITH STATUS 0

I think the reason is that I have not given the right parameters for grid_resource in the submit description file.
The output of the command "condor_status -schedd" is as follows:
[root@188 ~]# condor_status -schedd
Name            Machine      RunningJobs   IdleJobs   HeldJobs

188.nodeljB     188.nodeljB            0          0          0

                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs


               Total                 0                  0                  0

The output of the command "condor_status -collector" is as follows:

[root@188 ~]# condor_status -collector
Name                                             Machine                                          RunningJobs IdleJobs HostsTotal

"Condor Pool of LJ"@151.nodelj               151.nodelj                                                 0        0          0
"CPLJ"@188.nodeljB                           188.nodeljB                                                0        0          4
"Condor Pool of LJ"@188.nodeljB              188.nodeljB                                                0        0          0


Can you help me find a solution to this problem?
Thank you very much.
Best regards.

    
Date: Mon, 13 Jun 2016 12:46:37 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to find the name of the
condor_collecter and the name of condor_schedd daemon?Thank you very
much.
Message-ID: <575EF17D.3010607@xxxxxxxxxxx>
Content-Type: text/plain; CHARSET=US-ASCII; format=flowed
 
On 6/12/2016 3:23 AM, HTCondor wrote:
> Dear all,
>      I am configuring HTCondor flocking. There is a parameter in the
> submit description file for a job, namely grid_resource.
>     According to the HTCondor manual, the third field is the name of
> the remote pool's condor_collector, and the second field is the name
> of the remote condor_schedd daemon.
>    Can you tell me how to find the actual values for them?
>    Thank you very much.
> Best regards
>
>      David
>
 
Hi David,
 
Flocking is HTCondor's way of allowing jobs that cannot immediately run
within the pool of machines where the job was submitted to instead run
on a different HTCondor pool. If a machine within HTCondor pool A can
send jobs to be run on HTCondor pool B, then we say that jobs from
machine A flock to pool B.
 
If Flocking is what you want, you don't need to mess around with grid
universe, grid_resource, or any of that.    On the condor_config for
machines in pool A just modify the FLOCK_TO line to include the hostname
of pool B central manager, and on the condor_configs for machines in
pool B just modify the FLOCK_FROM line to include the hostname of pool A
central manager.
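 
A minimal sketch of the flocking setup described above, assuming the
hypothetical hostnames cm.poola.example (pool A's central manager, assumed
here to also be the machine that submits jobs) and cm.poolb.example (pool
B's central manager). Substitute the real hostnames of your pools. Pool B's
authorization settings (for example ALLOW_WRITE) may also need to admit
pool A's submit machine, which this sketch does not cover.

```
# condor_config on the submit machine(s) in pool A
# (hostnames are placeholders, not taken from this thread)
FLOCK_TO = cm.poolb.example
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)

# condor_config on the machines in pool B
FLOCK_FROM = cm.poola.example
```

After editing the files, running condor_reconfig on the affected machines
should make the daemons pick up the new values. Jobs are then submitted in
the usual way (for example in the vanilla universe), with no grid_resource
line in the submit description file.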
 
Details are in:
 
http://research.cs.wisc.edu/htcondor/manual/v8.4/5_2Connecting_HTCondor.html
 
 
regards,
Todd

btdan@xxxxxxx