
Re: [HTCondor-users] Routing of jobs to a different condor pool



Hi Florian,

Are there multiple job routers running on the pool?  If so, it's possible that they have the same identifier and are fighting over the same jobs.  Note this in the job ad:

> RoutedBy = "jobrouter"

If this is the case, you want to give the "RemoteRouter" a unique identifier with the JOB_ROUTER_NAME configuration variable.
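
A minimal sketch of what that could look like, assuming the second router really is started with "-local-name JOB_ROUTER_REMOTE" as in your configuration below (the value RemoteRouter is only an example; any name that differs from the other router's will do):

```
# Assumption: this daemon runs with "-local-name JOB_ROUTER_REMOTE", so a
# setting prefixed with that local name applies to this router only.
JOB_ROUTER_REMOTE.JOB_ROUTER_NAME = RemoteRouter
```

Note that the environment line in your configuration sets _CONDOR_ROUTER_NAME, not _CONDOR_JOB_ROUTER_NAME. An environment override only takes effect when the part after _CONDOR_ matches the configuration variable's name, so this may be why the router still reports the default name "jobrouter". After a restart, RoutedBy in the routed job's ad should show the new name if the setting was picked up.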

Brian

> On Jul 30, 2018, at 10:13 AM, R. Florian von Cube <ralf.florian.von.cube@xxxxxxx> wrote:
> 
> Hi all, 
> 
> I'm having trouble routing jobs from one condor pool to another. A job submitted in pool1 takes the following route: 
> 
> JOB_ROUTER_REMOTE = $(JOB_ROUTER)
> 
> JOB_ROUTER_REMOTE_ARGS = -local-name JOB_ROUTER_REMOTE
> JOB_ROUTER_REMOTE_LOG = $(LOG)/RemoteRouterLog
> JOB_ROUTER_REMOTE_ENVIRONMENT = "_CONDOR_JOB_ROUTER_LOG=$(LOG)/RemoteRouterLog _CONDOR_JOB_ROUTER_LOCK=$(LOCK)/RemoteRouterLock _CONDOR_ROUTER_NAME=RemoteRouter"
> 
> DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER_REMOTE
> JOB_ROUTER_POLLING_PERIOD = 10
> PIPE_BUFFER_MAX = 102400
> 
> JOB_ROUTER_REMOTE.JOB_ROUTER_ENTRIES = \
>         [ \
>                 name = "RemoteRouteVanilla"; \
>                 requirements = ( target.INPUT_FILES is undefined && target.JobUniverse is 5 && target.JobWasRouted isnt True && target.WantDocker is undefined && target.RouteMeToCentral is True ); \
>                 GridResource = "condor sg03 sg03"; \
>                 set_remote_jobuniverse = 5; \
>                 set_Rank = 2000; \
>                 delete_RouteMeToCentral = True; \
>         ] \
> 
> The job reaches pool2 (sg03), appears for a few seconds in condor_q, and starts running. Immediately after starting, it stops and disappears from condor_q. With condor_history I see, among others, the following job ad attributes:
> 
> SubmitterGlobalJobId = "sg02#428.0#1532950161"
> ExitStatus = 0
> Iwd = "/var/lib/condor/spool/9025/0/cluster259025.proc0.subproc0"
> RouteName = "RemoteRouteVanilla"
> SubmitterId = "sg02"
> LastHoldReasonCode = 16
> GlobalJobId = "sg03#259025.0#1532950179"
> LastRemoteHost = "slot1@sg04"
> StartdPrincipal = "execute-side@matchsession/xxx.xxx.xxx.148"
> RoutedBy = "jobrouter"
> ReleaseReason = "Data files spooled"
> RemoveReason = "JobRouter orphan (by user condor)"
> ClusterId = 259025
> JobStatus = 3
> LastJobStatus = 2
> LastPublicClaimId = "<xxx.xxx.xxx.148:9618>#1532078126#3103#..."
> RoutedFromJobId = "427.0"
> 
> I know the job started running, as I touched a file in a specific /tmp-location in my executable and it appeared on the execution machine. However, as mentioned above, it stops running after a few seconds. On my submit machine the only log I get is that the job was submitted.
> 
> Do you have any idea what is going wrong? The HTCondor version is 8.6.5 on all machines.
> 
> Thank you,
> Florian