
Re: [Condor-users] condor job idle



Hi Nathan,
Thanks for the tip. 
I reduced the number of retries to 1.

The weird thing is that when I first launch the DAG, both the DAGMan job and the job from that node show as R, so I assume the job has started. But then the node job goes back to the I state.
See below: I keep getting R, but then it goes back to I.
How do I find out what is causing the job (node) to leave the run state?

I went back and tried submitting the job without DAGMan, and I keep getting something related to this compute node:
 slot2@xxxxxxxxxxxxxxxxxx

It looks like it is trying to run, but then I get a socket connection error.
I will try it again to see if it lands on another slot this time.
I might need to make sure my submit file works well on its own first, and then move on to the DAG.

Victor



...
000 (266905.000.000) 05/10 22:36:08 Job submitted from host: <128.104.153.183:9618?PrivAddr=%3c10.129.28.28:9618%3fsock%3d5400_b3d1_2%3e&PrivNet=morgridge&noUDP&sock=5400_b3d1_2>
...
001 (266905.000.000) 05/10 22:36:32 Job executing on host: <128.104.55.43:57255>
...
022 (266905.000.000) 05/10 22:36:32 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266905.000.000) 05/10 22:36:32 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job
...
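
To see every point where the job left the run state, the job's user log can be scanned for the disconnect and reconnect-failure events. This is a minimal sketch, assuming the log follows the layout of the excerpt above: a three-digit event code, the job ID in parentheses, a date and time, a summary, indented detail lines, and "..." closing each event. The names parse_events and problem_events are hypothetical helpers, not part of Condor. The codes 022 and 024 are taken from the excerpt itself.

```python
import re

# Header line of one event, e.g.
# "022 (266905.000.000) 05/10 22:36:32 Job disconnected, attempting to reconnect"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)

# Lines copied from the log excerpt in this message (host name is redacted there).
SAMPLE_LOG = """\
001 (266905.000.000) 05/10 22:36:32 Job executing on host: <128.104.55.43:57255>
...
022 (266905.000.000) 05/10 22:36:32 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266905.000.000) 05/10 22:36:32 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job
...
"""


def parse_events(text):
    """Split user-log text into a list of event dicts (hypothetical helper)."""
    events = []
    current = None
    for line in text.splitlines():
        if line.strip() == "...":
            current = None  # "..." terminates the current event
            continue
        match = EVENT_RE.match(line)
        if match:
            current = {
                "code": match.group(1),
                "cluster": int(match.group(2)),
                "proc": int(match.group(3)),
                "date": match.group(5),
                "time": match.group(6),
                "summary": match.group(7),
                "details": [],
            }
            events.append(current)
        elif current is not None and line.strip():
            # Indented lines carry the explanation for the event above them.
            current["details"].append(line.strip())
    return events


def problem_events(events, codes=("022", "024")):
    """Keep only the disconnect (022) and reconnection-failed (024) events."""
    return [event for event in events if event["code"] in codes]


if __name__ == "__main__":
    for event in problem_events(parse_events(SAMPLE_LOG)):
        print(event["date"], event["time"], event["code"], event["summary"])
        for detail in event["details"]:
            print("    " + detail)
```

Running it against the real log file instead of SAMPLE_LOG (read the file and pass its text to parse_events) would show whether the failures always involve the same slot or also happen on other machines.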

Thanks,
Victor



-- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
   3.0   soaruser       11/9  12:47 182+05:44:54 R  0   0.0  checkprogress.cron
158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
160662.0   galaxy          1/18 15:31 112+20:01:41 R  0   73.2 checkprogress.cron
266877.0   galaxy          5/10 22:09   0+00:02:52 R  0   7.3  condor_dagman -f -

6 jobs; 3 idle, 3 running, 0 held
-bash-3.2$ condor_q -dag 


-- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
   3.0   soaruser       11/9  12:47 182+05:44:55 R  0   0.0  checkprogress.cron
158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
160662.0   galaxy          1/18 15:31 112+20:01:42 R  0   73.2 checkprogress.cron
266877.0   galaxy          5/10 22:09   0+00:02:53 R  0   7.3  condor_dagman -f -
266881.0    |-fastq_file1  5/10 22:12   0+00:00:00 I  0   0.0  chtcjobwrapper --t

-- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
   3.0   soaruser       11/9  12:47 182+05:46:23 R  0   0.0  checkprogress.cron
158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
160662.0   galaxy          1/18 15:31 112+20:03:10 R  0   73.2 checkprogress.cron
266877.0   galaxy          5/10 22:09   0+00:04:21 R  0   7.3  condor_dagman -f -
266882.0    |-fastq_file1  5/10 22:13   0+00:00:00 I  0   0.0  chtcjobwrapper --t

7 jobs; 4 idle, 3 running, 0 held
-bash-3.2$ condor_q -dag 


-- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
   3.0   soaruser       11/9  12:47 182+05:46:26 R  0   0.0  checkprogress.cron
158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
160662.0   galaxy          1/18 15:31 112+20:03:13 R  0   73.2 checkprogress.cron
266877.0   galaxy          5/10 22:09   0+00:04:24 R  0   7.3  condor_dagman -f -
266882.0    |-fastq_file1  5/10 22:13   0+00:00:00 R  0   0.0  chtcjobwrapper --t

On May 10, 2012, at 6:30 PM, Nathan Panike wrote:

> On Thu, May 10, 2012 at 05:32:15PM -0500, Victor wrote:
>> Hi,
>> I'm very new at creating dags so sorry in advance as this might be a mistake on my part.
>> I'm hoping someone can point me out on how to check why the node->job is always idle. 
>> I created a very simple dag and started it via 
>> condor_submit_dag mydag.dag
>> 
>> CONFIG dagman_config
>> JOB fastq_file1 process.cmd dir fastq_file1
>> SCRIPT POST fastq_file1 /opt/galaxy/dagme/ChtcRun/postjob.pl
>> RETRY fastq_file1 10
>> 
>> I can see the dagman running, so something within the node is causing this would be my guess.
>> 
>> 266865.0   galaxy          5/10 17:14   0+00:00:24 R  0   7.3  condor_dagman -f -
>> 266866.0    |-fastq_file1  5/10 17:14   0+00:00:00 I  0   0.0  chtcjobwrapper --t
> 
> What universe is 266866 running in? Unless it is a local universe or
> scheduler, this behavior is expected.
> 
> From this, I read that DAGMan has been running for 24 seconds.  That
> means that your job has been in the queue for about 12 seconds.  It is
> quite likely that it has not yet been considered for a match.  I
> counsel a certain amount of patience.
> 
> Nathan Panike