Re: [HTCondor-users] Routing of jobs to a different condor pool
- Date: Tue, 31 Jul 2018 14:02:31 +0200
- From: "R. Florian von Cube" <ralf.florian.von.cube@xxxxxxx>
- Subject: Re: [HTCondor-users] Routing of jobs to a different condor pool
Yes, we have multiple routers running. Assigning each a unique identifier as you suggested solved the problem. Thank you!
According to http://research.cs.wisc.edu/htcondor/manual/v8.6/3_5Configuration_Macros.html#SECTION004519000000000000000, only the first router to start up should keep running if JOB_ROUTER_NAME is not unique. However, apart from my initial problem, the routers work(ed) as they should...
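A minimal sketch of the change described above, assuming the second router is the one started with -local-name JOB_ROUTER_REMOTE in the configuration quoted below. The value RemoteRouter is simply the name used in that configuration; any identifier that is unique among the routers on the machine should do.

```
# Give the router started with "-local-name JOB_ROUTER_REMOTE" its own identifier.
# The local-name prefix is the same mechanism the quoted configuration already
# uses for JOB_ROUTER_REMOTE.JOB_ROUTER_ENTRIES.
JOB_ROUTER_REMOTE.JOB_ROUTER_NAME = RemoteRouter
```

One possible reading, not confirmed in this thread: the quoted JOB_ROUTER_REMOTE_ENVIRONMENT line sets _CONDOR_ROUTER_NAME rather than _CONDOR_JOB_ROUTER_NAME, which would leave JOB_ROUTER_NAME at its default and would be consistent with RoutedBy = "jobrouter" in the job ad.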
> On 30. Jul 2018, at 17:23, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
> Hi Florian,
> Are there multiple job routers running on the pool? If so, it's possible that they have the same identifier and are fighting over the same jobs. Note this in the job ad:
>> RoutedBy = "jobrouter"
> If this is the case, you want to give the "RemoteRouter" a unique identifier with the JOB_ROUTER_NAME configuration variable.
>> On Jul 30, 2018, at 10:13 AM, R. Florian von Cube <ralf.florian.von.cube@xxxxxxx> wrote:
>> Hi all,
>> I'm having trouble routing jobs from one condor pool to another. A job submitted in pool1 takes the following route:
>> JOB_ROUTER_REMOTE = $(JOB_ROUTER)
>> JOB_ROUTER_REMOTE_ARGS = -local-name JOB_ROUTER_REMOTE
>> JOB_ROUTER_REMOTE_LOG = $(LOG)/RemoteRouterLog
>> JOB_ROUTER_REMOTE_ENVIRONMENT = "_CONDOR_JOB_ROUTER_LOG=$(LOG)/RemoteRouterLog _CONDOR_JOB_ROUTER_LOCK=$(LOCK)/RemoteRouterLock _CONDOR_ROUTER_NAME=RemoteRouter"
>> DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER_REMOTE
>> JOB_ROUTER_POLLING_PERIOD = 10
>> PIPE_BUFFER_MAX = 102400
>> JOB_ROUTER_REMOTE.JOB_ROUTER_ENTRIES = \
>> [ \
>> name = "RemoteRouteVanilla"; \
>> requirements = ( target.INPUT_FILES is undefined && target.JobUniverse is 5 && target.JobWasRouted isnt True && target.WantDocker is undefined && target.RouteMeToCentral is True ); \
>> GridResource = "condor sg03 sg03"; \
>> set_remote_jobuniverse = 5; \
>> set_Rank = 2000; \
>> delete_RouteMeToCentral = True; \
>> ] \
>> The job reaches pool2 (sg03), appears for a few seconds in condor_q, and starts running. Immediately after starting, it stops and disappears from condor_q. With condor_history I see, among others, the following ClassAd attributes:
>> SubmitterGlobalJobId = "sg02#428.0#1532950161"
>> ExitStatus = 0
>> Iwd = "/var/lib/condor/spool/9025/0/cluster259025.proc0.subproc0"
>> RouteName = "RemoteRouteVanilla"
>> SubmitterId = "sg02"
>> LastHoldReasonCode = 16
>> GlobalJobId = "sg03#259025.0#1532950179"
>> LastRemoteHost = "slot1@sg04"
>> StartdPrincipal = "execute-side@matchsession/xxx.xxx.xxx.148"
>> RoutedBy = "jobrouter"
>> ReleaseReason = "Data files spooled"
>> RemoveReason = "JobRouter orphan (by user condor)"
>> ClusterId = 259025
>> JobStatus = 3
>> LastJobStatus = 2
>> LastPublicClaimId = "<xxx.xxx.xxx.148:9618>#1532078126#3103#..."
>> RoutedFromJobId = "427.0"
>> I know the job started running, as I touched a file in a specific /tmp-location in my executable and it appeared on the execution machine. However, as mentioned above, it stops running after a few seconds. On my submit machine the only log I get is that the job was submitted.
>> Do you have any idea what is going wrong? The HTCondor version is 8.6.5 on all machines.
>> Thank you,
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> The archives can be found at: