
Re: [HTCondor-users] condor_dagman crashed when suspend it in 7.8.2 version



Hi,

Did you look in the SchedLog?  Is it possible the schedd also crashed?
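
A rough way to check is to scan the log for startup banners, since each restart of a daemon writes one. The sketch below is only an illustration: it assumes the schedd log uses the same timestamp and banner format as the dagman output quoted further down, and the function names `find_startups` and `report` are made up for this example.

```python
import re

# Matches banner lines such as:
#   09/05/13 14:30:28 ** condor_scheduniv_exec.327.0 (CONDOR_DAGMAN) STARTING UP
# Assumption: other daemon logs use the same "MM/DD/YY HH:MM:SS ** name (SUBSYS) STARTING UP" layout.
BANNER = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \*\* "
    r"(?P<name>\S+) \((?P<subsys>\w+)\) STARTING UP\s*$"
)


def find_startups(lines):
    """Return a (timestamp, subsystem) pair for every startup banner in lines."""
    hits = []
    for line in lines:
        match = BANNER.match(line)
        if match:
            hits.append((match.group("stamp"), match.group("subsys")))
    return hits


def report(path):
    """Print every startup banner found in the log file at path (hypothetical helper)."""
    with open(path, errors="replace") as log:
        for stamp, subsys in find_startups(log):
            print(stamp, subsys, "started")
```

If the log shows a startup banner at about the time of the condor_suspend, that would suggest the schedd itself restarted. The log path is site-specific; `condor_config_val SCHEDD_LOG` should print it.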

Brian

On Sep 5, 2013, at 1:51 AM, <kyleqian@xxxxxxxxx> wrote:

Hi, I am using Condor for a personal pool. Its version is:
$CondorVersion: 7.8.2 Sep 30 2012 Debian-7.8.2~dfsg.1-1+deb7u1 $
$CondorPlatform: X86_64-Ubuntu_ $
I found that running condor_suspend on a dagman job can make it crash and go into RECOVERY mode. This is the dagman output when I issue the suspend command:
......
09/05/13 14:29:44 MultiLogFiles: truncating log file /home/kyle/csf/RS-9/RS-9.log
09/05/13 14:29:44 Submitting Condor Node RS-9.1 job(s)...
09/05/13 14:29:44 submitting: condor_submit -a dag_node_name' '=' 'RS-9.1 -a +DAGManJobId' '=' '327 -a DAGManJobId' '=' '327 -a submit_event_notes' '=' 'DAG' 'Node:' 'RS-9.1 -a DAG_STATUS' '=' '0 -a FAILED_COUNT' '=' '0 -a +DAGParentNodeNames' '=' '"" -a +KeepClaimIdle' '=' '20 RS-9.1.sub
09/05/13 14:29:44 From submit: Submitting job(s).
09/05/13 14:29:44 From submit: 1 job(s) submitted to cluster 328.
09/05/13 14:29:44     assigned Condor ID (328.0.0)
09/05/13 14:29:44 Just submitted 1 job this cycle...
09/05/13 14:29:44 Currently monitoring 1 Condor log file(s)
09/05/13 14:29:44 Event: ULOG_SUBMIT for Condor Node RS-9.1 (328.0.0)
09/05/13 14:29:44 Number of idle job procs: 1
09/05/13 14:29:44 Of 2 nodes total:
09/05/13 14:29:44  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
09/05/13 14:29:44   ===     ===      ===     ===     ===        ===      ===
09/05/13 14:29:44     0       0        1       0       0          1        0
09/05/13 14:29:44 0 job proc(s) currently held
09/05/13 14:29:54 Currently monitoring 1 Condor log file(s)
09/05/13 14:29:54 Event: ULOG_EXECUTE for Condor Node RS-9.1 (328.0.0)
09/05/13 14:29:54 Number of idle job procs: 0
09/05/13 14:30:04 Currently monitoring 1 Condor log file(s)
09/05/13 14:30:04 Event: ULOG_IMAGE_SIZE for Condor Node RS-9.1 (328.0.0)
09/05/13 14:30:28 Setting maximum accepts per cycle 8.
09/05/13 14:30:28 ******************************************************
09/05/13 14:30:28 ** condor_scheduniv_exec.327.0 (CONDOR_DAGMAN) STARTING UP
09/05/13 14:30:28 ** /usr/bin/condor_dagman
09/05/13 14:30:28 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
09/05/13 14:30:28 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
09/05/13 14:30:28 ** $CondorVersion: 7.8.2 Sep 30 2012 Debian-7.8.2~dfsg.1-1+deb7u1 $
09/05/13 14:30:28 ** $CondorPlatform: X86_64-Ubuntu_ $
09/05/13 14:30:28 ** PID = 15394
09/05/13 14:30:28 ** Log last touched 9/5 14:30:04
09/05/13 14:30:28 ******************************************************
......
I think the line in red is the last output before dagman crashed. The terminal session looked like this:
kyle@scorpio ~/csf/RS-7 $ condor_q
-- Submitter: scorpio.otitan.com : <127.0.0.1:38147> : scorpio.otitan.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
 325.0   kyle            9/5  14:20   0+00:00:11 R  0   0.3  condor_dagman    
 326.0   kyle            9/5  14:20   0+00:00:00 I  0   2.7  csfexec          
2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended

kyle@scorpio ~/csf/RS-7 $ condor_suspend 325.0
Job 325.0 suspended

kyle@scorpio ~/csf/RS-7 $ condor_q
-- Failed to fetch ads from: <127.0.0.1:38147> : scorpio.otitan.com
CEDAR:6001:Failed to connect to <127.0.0.1:38147>


kyle@scorpio ~/csf/RS-7 $ condor_q
-- Submitter: scorpio.otitan.com : <127.0.0.1:59970> : scorpio.otitan.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
 325.0   kyle            9/5  14:20   0+00:00:00 I  0   0.3  condor_dagman    
 326.0   kyle            9/5  14:20   0+00:00:00 I  0   2.7  csfexec          
2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

kyle@scorpio ~/csf/RS-7 $ condor_q
-- Submitter: scorpio.otitan.com : <127.0.0.1:59970> : scorpio.otitan.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
 325.0   kyle            9/5  14:20   0+00:00:12 R  0   0.3  condor_dagman    
 326.0   kyle            9/5  14:20   0+00:00:00 I  0   2.7  csfexec          
2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended

There was also a job disconnect event for this dag node job.
So did dagman crash?  What should I do? Should I upgrade my Condor software?