
[HTCondor-users] Routing of jobs to a different condor pool



Hi all, 

I'm having trouble routing jobs from one condor pool to another. A job submitted in pool1 is routed with the following configuration:

JOB_ROUTER_REMOTE = $(JOB_ROUTER)

JOB_ROUTER_REMOTE_ARGS = -local-name JOB_ROUTER_REMOTE
JOB_ROUTER_REMOTE_LOG = $(LOG)/RemoteRouterLog
JOB_ROUTER_REMOTE_ENVIRONMENT = "_CONDOR_JOB_ROUTER_LOG=$(LOG)/RemoteRouterLog _CONDOR_JOB_ROUTER_LOCK=$(LOCK)/RemoteRouterLock _CONDOR_ROUTER_NAME=RemoteRouter"

DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER_REMOTE
JOB_ROUTER_POLLING_PERIOD = 10
PIPE_BUFFER_MAX = 102400

JOB_ROUTER_REMOTE.JOB_ROUTER_ENTRIES = \
        [ \
                name = "RemoteRouteVanilla"; \
                requirements = ( target.INPUT_FILES is undefined && target.JobUniverse is 5 && target.JobWasRouted isnt True && target.WantDocker is undefined && target.RouteMeToCentral is True ); \
                GridResource = "condor sg03 sg03"; \
                set_remote_jobuniverse = 5; \
                set_Rank = 2000; \
                delete_RouteMeToCentral = True; \
        ] \

The job reaches pool2 (sg03), appears in condor_q for a few seconds, and starts running. Immediately after starting, it stops and disappears from condor_q. condor_history shows, among others, the following ClassAd attributes:

SubmitterGlobalJobId = "sg02#428.0#1532950161"
ExitStatus = 0
Iwd = "/var/lib/condor/spool/9025/0/cluster259025.proc0.subproc0"
RouteName = "RemoteRouteVanilla"
SubmitterId = "sg02"
LastHoldReasonCode = 16
GlobalJobId = "sg03#259025.0#1532950179"
LastRemoteHost = "slot1@sg04"
StartdPrincipal = "execute-side@matchsession/xxx.xxx.xxx.148"
RoutedBy = "jobrouter"
ReleaseReason = "Data files spooled"
RemoveReason = "JobRouter orphan (by user condor)"
ClusterId = 259025
JobStatus = 3
LastJobStatus = 2
LastPublicClaimId = "<xxx.xxx.xxx.148:9618>#1532078126#3103#..."
RoutedFromJobId = "427.0"
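
For reading the numeric codes in the ad above, here is a minimal Python sketch, not something I ran on the pool itself. It assumes the standard HTCondor meanings of the JobStatus values and of hold reason code 16; the helper names `parse_ad` and `summarize` are hypothetical, and the excerpt contains only attributes copied from the listing above.

```python
# Standard HTCondor JobStatus codes.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}

# Only the hold reason code that occurs in the ad above is listed here.
HOLD_REASON = {16: "Input files are being spooled"}

# Attributes copied from the condor_history output above.
EXCERPT = """\
JobStatus = 3
LastJobStatus = 2
LastHoldReasonCode = 16
RemoveReason = "JobRouter orphan (by user condor)"
"""


def parse_ad(text):
    """Parse 'Name = value' lines into a dict (ints where possible, else unquoted strings)."""
    ad = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        value = value.strip()
        try:
            ad[name.strip()] = int(value)
        except ValueError:
            ad[name.strip()] = value.strip('"')
    return ad


def summarize(ad):
    """Describe the last status transition, the remove reason, and the last hold reason."""
    before = JOB_STATUS.get(ad.get("LastJobStatus"), "unknown")
    after = JOB_STATUS.get(ad.get("JobStatus"), "unknown")
    hold = HOLD_REASON.get(ad.get("LastHoldReasonCode"), "not listed here")
    reason = ad.get("RemoveReason", "none given")
    return f"{before} -> {after}; remove reason: {reason}; last hold: {hold}"


if __name__ == "__main__":
    print(summarize(parse_ad(EXCERPT)))
```

Read this way, the routed copy on sg03 went from Running to Removed with the reason "JobRouter orphan", and its last hold was the input spooling step, which matches ReleaseReason = "Data files spooled".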

I know the job started running because my executable touches a file in a specific location under /tmp, and that file appeared on the execution machine. However, as mentioned above, the job stops running after a few seconds. On my submit machine, the only log entry I get is that the job was submitted.

Do you have any idea what is going wrong? The HTCondor version is 8.6.5 on all machines.

Thank you,
Florian