
Re: [HTCondor-users] evicted multicore jobs



Hi Todd,

I see, so I have to look somewhere else to solve the problem... it is
not HTCondor.

Thank you,
Mihai
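(A note for the archive: if you want to scan many finished jobs for who
removed them, the RemoveReason strings quoted below can be parsed
mechanically. A minimal sketch, assuming the "via condor_rm (by user ...)"
wording shown in this thread; the exact attribute text is not guaranteed
to stay the same across HTCondor versions:)

```python
import re

# Matches the RemoveReason wording seen in this thread, e.g.
#   via condor_rm (by user griduser01)
# This format is an assumption based on the quoted output, not a
# documented contract.
REMOVE_RE = re.compile(r'via condor_rm \(by user (\S+)\)')

def removing_user(remove_reason):
    """Return the user named in a RemoveReason string, or None."""
    m = REMOVE_RE.search(remove_reason)
    return m.group(1) if m else None
```

For the job in this thread, removing_user("via condor_rm (by user
griduser01)") returns "griduser01", while reasons set by periodic
expressions return None.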



> On 11/15/2017 4:11 AM, Mihai Ciubancan wrote:
>> Hi Todd,
>>
>> Sorry that I didn't mention in the first email: all the jobs are
>> removed by the user:
>>
>>   condor_history 21509.0 -l|grep Remove
>> PeriodicRemove = false
>> RemoveReason = "via condor_rm (by user griduser01)"
>> OnExitRemove = true
>> I also want to mention that I'm running ATLAS (CERN) jobs with the
>> Docker universe, and the memory and CPU requested are:
>>
>> RequestCpus = 8
>> RequestMemory = 15000
>> While the machines have 8 cores and 16 GB of RAM.
>> I noticed that the jobs are also killed while the input files are
>> being transferred, so very little memory is in use.
>> I didn't find any pattern (clue) in the job-killing process. Jobs can
>> run for a few minutes or up to 1 hour.
>>
>> Best wishes,
>> Mihai
>>
>
> Hi Mihai,
>
> According to HTCondor, someone (or some script or service) contacted the
> condor_schedd on your submit machine, authenticated as user griduser01,
> and told HTCondor to remove job 21509.0.  When HTCondor is told to
> remove a job, HTCondor will behave exactly as the logs and history ads
> show, i.e. it will kill any running instance of that job and then
> remove the job from the queue.
>
> Maybe some admin at your site set up a script somewhere that periodically
> scans the queue and issues condor_rm on jobs it doesn't like?
>
> If the job was submitted by a remote service, it seems like the remote
> service decided to remove it after a while...
>
> regards
> Todd
>
>
>>
>>
>>> On 11/14/2017 8:06 AM, Mihai Ciubancan wrote:
>>>>
>>>>
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm running multicore jobs and they are all evicted, and I don't
>>>> understand the reason.
>>>
>>> Hi Mihai,
>>>
>>> When you say "evicted", do you mean the job went from state Running
>>> back
>>> to state Idle (i.e. still queued and waiting to be rescheduled), or did
>>> something else happen?
>>>
>>> Taking just a quick look at the shadow log below, it looks like job
>>> 21509.0 was either removed or put on hold.
>>>
>>> Is the job still in the queue (i.e. does it still appear with
>>> condor_q)?
>>>
>>> If not, what does "condor_history -l 21509.0" reveal?  If the job was
>>> removed or placed on hold, the attributes RemoveReason or HoldReason,
>>> respectively, should tell why.
>>>
>>> regards
>>> Todd
>>>
>>>
>>>
>>>> This the shadow log output for a job(21509.0):
>>>>
>>>> 11/14/17 15:34:00 (21509.0) (3772013): Calling Timer handler 33
>>>> (dc_touch_log_file)
>>>> 11/14/17 15:34:00 (21509.0) (3772013): Return from Timer handler 33
>>>> (dc_touch_log_file)
>>>> 11/14/17 15:34:02 (21509.0) (3772013): Calling Timer handler 13
>>>> (checkPeriodic)
>>>> 11/14/17 15:34:02 (21509.0) (3772013): Return from Timer handler 13
>>>> (checkPeriodic)
>>>> 11/14/17 15:34:15 (21509.0) (3772013): Calling Timer handler 2
>>>> (check_parent)
>>>> 11/14/17 15:34:15 (21509.0) (3772013): Return from Timer handler 2
>>>> (check_parent)
>>>> 11/14/17 15:34:35 (21509.0) (3772013): In handleJobRemoval(), sig 10
>>>> 11/14/17 15:34:35 (21509.0) (3772013): setting exit reason on
>>>> slot1_1@xxxxxxxxxxxxxx to 102
>>>> 11/14/17 15:34:35 (21509.0) (3772013): Resource slot1_1@xxxxxxxxxxxxxx
>>>> changing state from EXECUTING to FINISHED
>>>> 11/14/17 15:34:35 (21509.0) (3772013): Requesting graceful removal of
>>>> job.
>>>> 11/14/17 15:34:35 (21509.0) (3772013): Entering
>>>> DCStartd::deactivateClaim(graceful)
>>>> 11/14/17 15:34:35 (21509.0) (3772013):
>>>> DCStartd::deactivateClaim(DEACTIVATE_CLAIM,...) making connection to
>>>> <192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
>>>> <192.168.181.182:9618>,,size=134,timeout=20,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:35 (21509.0) (3772013): SharedPortClient: sent
>>>> connection
>>>> request to <192.168.181.182:9618> for shared port id 8572_41ea_3
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
>>>> <192.168.181.182:9618>,,size=785,timeout=20,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:35 (21509.0) (3772013): encrypting secret
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
>>>> <192.168.181.182:9618>,,size=164,timeout=20,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
>>>> <192.168.181.182:9618>,,size=21,timeout=20,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned
>>>> 1
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
>>>> <192.168.181.182:9618>,,size=23,timeout=20,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
>>>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned
>>>> 1
>>>> 11/14/17 15:34:35 (21509.0) (3772013): DCStartd::deactivateClaim:
>>>> successfully sent command
>>>> 11/14/17 15:34:35 (21509.0) (3772013): CLOSE TCP
>>>> <192.168.181.13:29201>
>>>> fd=5
>>>> 11/14/17 15:34:35 (21509.0) (3772013): Killed starter (graceful) at
>>>> <192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
>>>> 11/14/17 15:34:37 (21509.0) (3772013): Calling Handler
>>>> <HandleSyscalls>
>>>> (1)
>>>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
>>>> slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
>>>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned
>>>> 1
>>>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
>>>> slot1_1@xxxxxxxxxxxxxx,,size=452,timeout=300,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
>>>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned
>>>> 1
>>>> 11/14/17 15:34:37 (21509.0) (3772013): Inside
>>>> RemoteResource::updateFromStarter()
>>>> 11/14/17 15:34:37 (21509.0) (3772013): condor_write(fd=4 startd
>>>> slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:37 (21509.0) (3772013): Return from Handler
>>>> <HandleSyscalls> 0.000604s
>>>> 11/14/17 15:34:39 (21509.0) (3772013): Calling Handler
>>>> <HandleSyscalls>
>>>> (1)
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
>>>> slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned
>>>> 1
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
>>>> slot1_1@xxxxxxxxxxxxxx,,size=158,timeout=300,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned
>>>> 1
>>>> 11/14/17 15:34:39 (21509.0) (3772013): Inside
>>>> RemoteResource::updateFromStarter()
>>>> 11/14/17 15:34:39 (21509.0) (3772013): Inside
>>>> RemoteResource::resourceExit()
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=4 startd
>>>> slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:39 (21509.0) (3772013): Job 21509.0 is being evicted
>>>> from
>>>> slot1_1@xxxxxxxxxxxxxx
>>>> 11/14/17 15:34:39 (21509.0) (3772013):
>>>> Daemon::startCommand(QMGMT_WRITE_CMD,...) making connection to
>>>> <81.180.86.133:20346>
>>>> 11/14/17 15:34:39 (21509.0) (3772013): CONNECT bound to
>>>> <81.180.86.133:3323> fd=5 peer=<81.180.86.133:20346>
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
>>>> <81.180.86.133:20346>,,size=818,timeout=300,flags=0,non_blocking=0)
>>>> 11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
>>>> <81.180.86.133:20346>,,size=40,timeout=300,flags=0,non_blocking=0)
>>>> ............................
>>>> 11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP <81.180.86.133:3323>
>>>> fd=5
>>>> 11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP  fd=20
>>>> 11/14/17 15:34:39 (21509.0) (3772013): **** condor_shadow
>>>> (condor_SHADOW)
>>>> pid 3772013 EXITING WITH STATUS 102
>>>>
>>>> And the Startd log for the same job is:
>>>>
>>>> 11/14/17 15:34:35 Calling Handler
>>>> <SharedPortEndpoint::HandleListenerAccept> (0)
>>>> 11/14/17 15:34:35 SharedPortEndpoint: received command 76
>>>> SHARED_PORT_PASS_SOCK on named socket
>>>> ca0b82630723aa5347090cd7dabd050dd2c84dee3cbd9f6af878bffd3daa11fc/8572_41ea_3
>>>> 11/14/17 15:34:35 SharedPortEndpoint: received forwarded connection
>>>> from
>>>> <192.168.181.13:29201>.
>>>> 11/14/17 15:34:35 Return from Handler
>>>> <SharedPortEndpoint::HandleListenerAccept> 0.000208s
>>>> 11/14/17 15:34:35 Calling Handler
>>>> <DaemonCommandProtocol::WaitForSocketData> (2)
>>>> 11/14/17 15:34:35 Calling HandleReq <command_handler> (0) for command
>>>> 403
>>>> (DEACTIVATE_CLAIM) from condor_pool@xxxxxxxx <192.168.181.13:29201>
>>>> 11/14/17 15:34:35 slot1_1: Called deactivate_claim()
>>>> 11/14/17 15:34:35 slot1_1: In Starter::kill() with pid 1446, sig 15
>>>> (SIGTERM)
>>>> 11/14/17 15:34:35 Send_Signal(): Doing kill(1446,15) [SIGTERM]
>>>> 11/14/17 15:34:35 slot1_1: Using max vacate time of 600s for this job.
>>>> 11/14/17 15:34:35 Return from HandleReq <command_handler> (handler:
>>>> 0.000193s, sec: 0.000s, payload: 0.000s)
>>>> 11/14/17 15:34:35 Return from Handler
>>>> <DaemonCommandProtocol::WaitForSocketData> 0.000480s
>>>>
>>>>
>>>> Does anyone have any idea?
>>>>
>>>>
>>>> Thanks in advance,
>>>> Mihai
>>>>
>>>> _______________________________________________
>>>> HTCondor-users mailing list
>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>>>> with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>>
>>>
>>
>>
>>
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685
>
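(For anyone debugging a similar case from the archives: the shadow log
above carries two clear removal markers, the "In handleJobRemoval()" line
and the shadow exiting with status 102. A minimal grep-style sketch in
Python for picking those lines out of a saved ShadowLog; the patterns are
taken from the log excerpt in this thread, not from any HTCondor API:)

```python
def removal_markers(log_lines):
    """Collect shadow-log lines that indicate the job was removed.

    The two substrings below are the removal markers visible in the
    ShadowLog excerpt quoted in this thread.
    """
    markers = []
    for line in log_lines:
        if 'handleJobRemoval' in line or 'EXITING WITH STATUS 102' in line:
            markers.append(line)
    return markers
```

Run over the excerpt above, this keeps exactly the handleJobRemoval line
and the final "EXITING WITH STATUS 102" line, which together point at an
explicit remove rather than a preemption or crash.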


Dr. Mihai Ciubancan
IT Department
National Institute of Physics and Nuclear Engineering "Horia Hulubei"
Str. Reactorului no. 30, P.O. BOX MG-6
077125, Magurele - Bucharest, Romania
http://www.ifin.ro
Work:   +40214042360
Mobile: +40761345687
Fax:    +40214042395