[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] evicted multicore jobs



On 11/14/2017 8:06 AM, Mihai Ciubancan wrote:



Hi everyone,

I'm running multicore jobs and all are evicted and I don't understand what
is the reason.

Hi Mihai,

When you say "evicted", do you mean the job went from state Running back to state Idle (i.e. still queued and waiting to be rescheduled), or did something else happen?

Doing just a quick look at the shadow log below, looks like job 21509.0 was either removed or put on hold.

Is the job still in the queue (i.e. does it still appear with condor_q)?

If not, what does "condor_history -l 21509.0" reveal? If the job was removed or placed on hold, the attributes RemoveReason or HoldReason, respectively, should tell why.

regards
Todd



This the shadow log output for a job(21509.0):

1/14/17 15:34:00 (21509.0) (3772013): Calling Timer handler 33
(dc_touch_log_file)
11/14/17 15:34:00 (21509.0) (3772013): Return from Timer handler 33
(dc_touch_log_file)
11/14/17 15:34:02 (21509.0) (3772013): Calling Timer handler 13
(checkPeriodic)
11/14/17 15:34:02 (21509.0) (3772013): Return from Timer handler 13
(checkPeriodic)
11/14/17 15:34:15 (21509.0) (3772013): Calling Timer handler 2 (check_parent)
11/14/17 15:34:15 (21509.0) (3772013): Return from Timer handler 2
(check_parent)
11/14/17 15:34:35 (21509.0) (3772013): In handleJobRemoval(), sig 10
11/14/17 15:34:35 (21509.0) (3772013): setting exit reason on
slot1_1@xxxxxxxxxxxxxx to 102
11/14/17 15:34:35 (21509.0) (3772013): Resource slot1_1@xxxxxxxxxxxxxx
changing state from EXECUTING to FINISHED
11/14/17 15:34:35 (21509.0) (3772013): Requesting graceful removal of job.
11/14/17 15:34:35 (21509.0) (3772013): Entering
DCStartd::deactivateClaim(graceful)
11/14/17 15:34:35 (21509.0) (3772013):
DCStartd::deactivateClaim(DEACTIVATE_CLAIM,...) making connection to
<192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=134,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): SharedPortClient: sent connection
request to <192.168.181.182:9618> for shared port id 8572_41ea_3
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=785,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): encrypting secret
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=164,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
<192.168.181.182:9618>,,size=21,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
<192.168.181.182:9618>,,size=23,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:35 (21509.0) (3772013): DCStartd::deactivateClaim:
successfully sent command
11/14/17 15:34:35 (21509.0) (3772013): CLOSE TCP <192.168.181.13:29201> fd=5
11/14/17 15:34:35 (21509.0) (3772013): Killed starter (graceful) at
<192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
11/14/17 15:34:37 (21509.0) (3772013): Calling Handler <HandleSyscalls> (1)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=452,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:37 (21509.0) (3772013): Inside
RemoteResource::updateFromStarter()
1/14/17 15:34:37 (21509.0) (3772013): condor_write(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): Return from Handler
<HandleSyscalls> 0.000604s
11/14/17 15:34:39 (21509.0) (3772013): Calling Handler <HandleSyscalls> (1)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=158,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:39 (21509.0) (3772013): Inside
RemoteResource::updateFromStarter()
11/14/17 15:34:39 (21509.0) (3772013): Inside RemoteResource::resourceExit()
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): Job 21509.0 is being evicted from
slot1_1@xxxxxxxxxxxxxx
11/14/17 15:34:39 (21509.0) (3772013):
Daemon::startCommand(QMGMT_WRITE_CMD,...) making connection to
<81.180.86.133:20346>
11/14/17 15:34:39 (21509.0) (3772013): CONNECT bound to
<81.180.86.133:3323> fd=5 peer=<81.180.86.133:20346>
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
<81.180.86.133:20346>,,size=818,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
<81.180.86.133:20346>,,size=40,timeout=300,flags=0,non_blocking=0)
............................
11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP <81.180.86.133:3323> fd=5
11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP  fd=20
11/14/17 15:34:39 (21509.0) (3772013): **** condor_shadow (condor_SHADOW)
pid 3772013 EXITING WITH STATUS 102

And the Startd log for the same job is:

11/14/17 15:34:35 Calling Handler
<SharedPortEndpoint::HandleListenerAccept> (0)
11/14/17 15:34:35 SharedPortEndpoint: received command 76
SHARED_PORT_PASS_SOCK on named socket
ca0b82630723aa5347090cd7dabd050dd2c84dee3cbd9f6af878bffd3daa11fc/8572_41ea_3
11/14/17 15:34:35 SharedPortEndpoint: received forwarded connection from
<192.168.181.13:29201>.
11/14/17 15:34:35 Return from Handler
<SharedPortEndpoint::HandleListenerAccept> 0.000208s
11/14/17 15:34:35 Calling Handler
<DaemonCommandProtocol::WaitForSocketData> (2)
11/14/17 15:34:35 Calling HandleReq <command_handler> (0) for command 403
(DEACTIVATE_CLAIM) from condor_pool@xxxxxxxx <192.168.181.13:29201>
11/14/17 15:34:35 slot1_1: Called deactivate_claim()
11/14/17 15:34:35 slot1_1: In Starter::kill() with pid 1446, sig 15 (SIGTERM)
11/14/17 15:34:35 Send_Signal(): Doing kill(1446,15) [SIGTERM]
11/14/17 15:34:35 slot1_1: Using max vacate time of 600s for this job.
11/14/17 15:34:35 Return from HandleReq <command_handler> (handler:
0.000193s, sec: 0.000s, payload: 0.000s)
11/14/17 15:34:35 Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 0.000480s


Have any one any idea?


Thanks in advance,
Mihai

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/