
Re: [HTCondor-users] condor_starter processes at 100% CPU on Debian after completing job, before transferring files back



Hi Andrew,
This is a known problem with Debian; I ran into it recently. Please check out this recent thread: https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-April/msg00059.shtml
Thanks.
-Samik

On 14-Jul-17 4:08 AM, Andrew Cunningham wrote:
I am running Condor on Debian as a compute node. The Condor host (which is also a compute node) is Red Hat 7, running 8.6.4.
The submit node is a Windows 7 machine running 8.6.4 64-bit.
The job is a vanilla job that sends some files and returns some files from a computation.
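
For reference, the submit description file is roughly along these lines (the file names below are placeholders, not the real ones; the arguments match what shows up in the starter log further down):

universe                = vanilla
executable              = myprog.exe
arguments               = VA22_2017 BEM_Fluid 1 27012@licenseserver
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input_a.dat, input_b.dat
output                  = job.out
error                   = job.err
log                     = job.log
queue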

$CondorVersion: 8.6.4 Jun 21 2017 BuildID: 408625 $
$CondorPlatform: x86_64_Debian8 $

I have divided the Debian machine into 8 equal slots. Only condor_config.local was modified:

CONDOR_HOST=condor
ALLOW_WRITE = $(FULL_HOSTNAME),*.mydomainname.com
NUM_SLOTS =8
DAEMON_LIST=MASTER STARTD
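
As a sanity check, the slot layout can be confirmed with the standard tools (the hostname below is a placeholder for the Debian node):

# on the Debian node itself
condor_config_val NUM_SLOTS DAEMON_LIST
# from anywhere in the pool, should show slot1 through slot8
condor_status debian-node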

When I submit a job to our "cluster" of 2 machines, the Debian machine completes the jobs, but the condor_starter processes get stuck at 100% CPU, apparently spinning while trying to transfer the files back to the submit host.
The other Linux node completes its jobs and sends back the computed data with no issues.
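
For what it's worth, a stuck starter can be inspected in place with standard tools; the PID below is a placeholder for whatever top reports:

top -b -n 1 | grep condor_starter               # find the spinning starter PID
strace -f -tt -p <PID>                          # shows whether it is looping in a system call
gdb -p <PID> -batch -ex 'thread apply all bt'   # userspace stack trace of the spin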

Examining the /var/lib/condor/execute directory shows the execute directories, with the output all normal and complete.
 
Nothing in the Starter logs on the Debian node indicates any issues.
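
If more verbose logs would help, I can turn up the starter debug level on the Debian node and re-run, roughly like this:

# in condor_config.local on the Debian node
STARTER_DEBUG = D_FULLDEBUG

# pick up the change without restarting the daemons
condor_reconfig

The ShadowLog on the submitting machine might also show something, since the output files are transferred back to the shadow there.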

07/13/17 15:25:33 (pid:34323) Communicating with shadow <192.168.0.211:9618?addrs=192.168.0.211-9618&noUDP&sock=9892_fcaa_3>
07/13/17 15:25:33 (pid:34323) Submitting machine is "bose"
07/13/17 15:25:33 (pid:34323) setting the orig job name in starter
07/13/17 15:25:33 (pid:34323) setting the orig job iwd in starter
07/13/17 15:25:33 (pid:34323) SLOT2_USER set, so running job as acu
07/13/17 15:25:33 (pid:34323) Chirp config summary: IO false, Updates false, Delayed updates true.
07/13/17 15:25:33 (pid:34323) Initialized IO Proxy.
07/13/17 15:25:33 (pid:34323) Done setting resource limits
07/13/17 15:25:33 (pid:34323) File transfer completed successfully.
07/13/17 15:25:33 (pid:34323) Job 84.1 set to execute immediately
07/13/17 15:25:33 (pid:34323) Starting a VANILLA universe job with ID: 84.1
07/13/17 15:25:33 (pid:34323) IWD: /var/lib/condor/execute/dir_34323
07/13/17 15:25:33 (pid:34323) Output file: /var/lib/condor/execute/dir_34323/_condor_stdout
07/13/17 15:25:33 (pid:34323) Error file: /var/lib/condor/execute/dir_34323/_condor_stderr
07/13/17 15:25:33 (pid:34323) Renice expr "0" evaluated to 0
07/13/17 15:25:33 (pid:34323) About to exec /var/lib/condor/execute/dir_34323/condor_exec.exe VA22_2017 BEM_Fluid 1 27012@licenseserver
07/13/17 15:25:33 (pid:34323) Running job as user acu
07/13/17 15:25:33 (pid:34323) Create_Process succeeded, pid=34340
07/13/17 15:25:33 (pid:34323) Cgroup controller for memory accounting is not available.



I am forced to restart condor on this node to stop the condor_starter processes.
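
Things I have not yet tried that might be lighter-weight than a full restart (not sure whether they would work while the starter is unresponsive):

condor_vacate_job 84.1          # from the submit machine, evict just that job
condor_restart -daemon startd   # on the Debian node, restart only the startd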

Thanks for any advice.

