Re: [Condor-users] condor job idle



Hi,

Across multiple submissions, the job always lands on slot2@xxxxxxxxxxxxxxxxxx.
This compute node then fails to execute a simple job.
Is this normal?
Sorry in advance if I'm doing something wrong.

Victor
 

-bash-3.2$ condor_submit process.cmd 
Submitting job(s).
1 job(s) submitted to cluster 266909.
  
-bash-3.2$ tail -f process.log  
009 (266908.000.000) 05/10 22:49:02 Job was aborted by the user.
	via condor_rm (by user galaxy)
...
000 (266909.000.000) 05/10 22:49:19 Job submitted from host: <128.104.153.183:9618?PrivAddr=%3c10.129.28.28:9618%3fsock%3d5400_b3d1_2%3e&PrivNet=morgridge&noUDP&sock=5400_b3d1_2>
...
001 (266909.000.000) 05/10 22:50:01 Job executing on host: <128.104.55.43:57255>
...
022 (266909.000.000) 05/10 22:50:01 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266909.000.000) 05/10 22:50:01 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job

022 (266905.000.000) 05/10 22:43:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266905.000.000) 05/10 22:43:58 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (266906.000.000) 05/10 22:44:46 Job executing on host: <128.104.55.43:57255>
...
022 (266906.000.000) 05/10 22:44:46 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot5@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266906.000.000) 05/10 22:44:46 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot5@xxxxxxxxxxxxxxxxxx, rescheduling job

000 (266908.000.000) 05/10 22:47:33 Job submitted from host: <128.104.153.183:9618?PrivAddr=%3c10.129.28.28:9618%3fsock%3d5400_b3d1_2%3e&PrivNet=morgridge&noUDP&sock=5400_b3d1_2>
...
001 (266908.000.000) 05/10 22:47:43 Job executing on host: <128.104.55.43:57255>
...
022 (266908.000.000) 05/10 22:47:43 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot5@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266908.000.000) 05/10 22:47:43 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot5@xxxxxxxxxxxxxxxxxx, rescheduling job
...
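
One way to check whether the failures cluster on a particular slot is to tally the 024 (reconnection failed) events in the user log by the slot named in each. The following is a minimal Python sketch, assuming the log layout shown above: a three-digit event code, the job ID in parentheses, a timestamp, a message, and then indented detail lines. The names parse_events, failed_reconnects_by_slot, and SAMPLE_LOG are hypothetical helpers for illustration and are not part of Condor.

```python
import re
from collections import Counter

# Event header as it appears in the log above, e.g.
# "024 (266909.000.000) 05/10 22:50:01 Job reconnection failed"
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
# Detail line of a 024 event that names the slot that could not be reached.
RECONNECT_TARGET = re.compile(r"Can not reconnect to (\S+), rescheduling job")

# Short excerpt in the same layout as the log above (assumed representative).
SAMPLE_LOG = """\
001 (266909.000.000) 05/10 22:50:01 Job executing on host: <128.104.55.43:57255>
...
022 (266909.000.000) 05/10 22:50:01 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266909.000.000) 05/10 22:50:01 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job

024 (266906.000.000) 05/10 22:44:46 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot5@xxxxxxxxxxxxxxxxxx, rescheduling job
...
"""


def parse_events(log_text):
    """Split log text into events: a header line plus its indented details."""
    events = []
    current = None
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line)
        if header:
            code, cluster, proc, _sub, stamp, message = header.groups()
            current = {
                "code": code,
                "job": f"{cluster}.{proc}",
                "time": stamp,
                "message": message,
                "details": [],
            }
            events.append(current)
        elif line.strip() == "...":
            # "..." closes the current event.
            current = None
        elif current is not None and line.strip():
            current["details"].append(line.strip())
    return events


def failed_reconnects_by_slot(events):
    """Count 024 (reconnection failed) events per slot name."""
    counts = Counter()
    for event in events:
        if event["code"] != "024":
            continue
        for detail in event["details"]:
            match = RECONNECT_TARGET.search(detail)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for slot, n in failed_reconnects_by_slot(parse_events(SAMPLE_LOG)).items():
        print(slot, n)
```

To run it against the real log, pass the contents of process.log to parse_events instead of SAMPLE_LOG. If every failure names slots on the same host, that would suggest a problem on the execute machine rather than in the submit file, though the log alone does not prove it.
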
On May 10, 2012, at 10:41 PM, Victor wrote:

> Hi Nathan,
> Thanks for the tip. 
> I reduced the number of retries to 1.
> 
> The weird thing is that when I first launch the DAG, both the DAG and the job from that node show as R, so I assume the job has started. But then the job goes back to the I state.
> See below: I keep getting R, but then it goes back to I.
> How do I find out what is causing the job (node) to leave the run state?
> 
> I went back and submitted the job without DAGMan, and I keep getting something related to this compute node:
> slot2@xxxxxxxxxxxxxxxxxx
> 
> It looks like it is trying to run, but then I get a socket connection error.
> I will try again to see if it lands on another slot this time.
> I might need to make sure my submit file works well first, and then the DAG.
> 
> Victor
> 
> 
> 
> ...
> 000 (266905.000.000) 05/10 22:36:08 Job submitted from host: <128.104.153.183:9618?PrivAddr=%3c10.129.28.28:9618%3fsock%3d5400_b3d1_2%3e&PrivNet=morgridge&noUDP&sock=5400_b3d1_2>
> ...
> 001 (266905.000.000) 05/10 22:36:32 Job executing on host: <128.104.55.43:57255>
> ...
> 022 (266905.000.000) 05/10 22:36:32 Job disconnected, attempting to reconnect
>    Socket between submit and execute hosts closed unexpectedly
>    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
> ...
> 024 (266905.000.000) 05/10 22:36:32 Job reconnection failed
>    Job not found at execution machine
>    Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job
> ...
> 
> Thanks,
> Victor
> 
> 
> 
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
>   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
>   3.0   soaruser       11/9  12:47 182+05:44:54 R  0   0.0  checkprogress.cron
> 158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
> 158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
> 160662.0   galaxy          1/18 15:31 112+20:01:41 R  0   73.2 checkprogress.cron
> 266877.0   galaxy          5/10 22:09   0+00:02:52 R  0   7.3  condor_dagman -f -
> 
> 6 jobs; 3 idle, 3 running, 0 held
> -bash-3.2$ condor_q -dag 
> 
> 
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
>   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
>   3.0   soaruser       11/9  12:47 182+05:44:55 R  0   0.0  checkprogress.cron
> 158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
> 158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
> 160662.0   galaxy          1/18 15:31 112+20:01:42 R  0   73.2 checkprogress.cron
> 266877.0   galaxy          5/10 22:09   0+00:02:53 R  0   7.3  condor_dagman -f -
> 266881.0    |-fastq_file1  5/10 22:12   0+00:00:00 I  0   0.0  chtcjobwrapper --t
> 
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
>   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
>   3.0   soaruser       11/9  12:47 182+05:46:23 R  0   0.0  checkprogress.cron
> 158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
> 158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
> 160662.0   galaxy          1/18 15:31 112+20:03:10 R  0   73.2 checkprogress.cron
> 266877.0   galaxy          5/10 22:09   0+00:04:21 R  0   7.3  condor_dagman -f -
> 266882.0    |-fastq_file1  5/10 22:13   0+00:00:00 I  0   0.0  chtcjobwrapper --t
> 
> 7 jobs; 4 idle, 3 running, 0 held
> -bash-3.2$ condor_q -dag 
> 
> 
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
>   2.0   soaruser       11/9  12:47  30+17:05:34 I  0   73.2 continuous.cron 20
>   3.0   soaruser       11/9  12:47 182+05:46:26 R  0   0.0  checkprogress.cron
> 158265.0   soaruser        1/11 09:27   0+10:27:40 I  0   170.9 scrubber.cron 20  
> 158266.0   galaxy          1/11 09:32   0+16:57:14 I  0   244.1 scrubber.cron 20  
> 160662.0   galaxy          1/18 15:31 112+20:03:13 R  0   73.2 checkprogress.cron
> 266877.0   galaxy          5/10 22:09   0+00:04:24 R  0   7.3  condor_dagman -f -
> 266882.0    |-fastq_file1  5/10 22:13   0+00:00:00 R  0   0.0  chtcjobwrapper --t
> 
> On May 10, 2012, at 6:30 PM, Nathan Panike wrote:
> 
>> On Thu, May 10, 2012 at 05:32:15PM -0500, Victor wrote:
>>> Hi,
>>> I'm very new to creating DAGs, so sorry in advance, as this might be a mistake on my part.
>>> I'm hoping someone can point me to how to check why the node's job is always idle.
>>> I created a very simple dag and started it via 
>>> condor_submit_dag mydag.dag
>>> 
>>> CONFIG dagman_config
>>> JOB fastq_file1 process.cmd dir fastq_file1
>>> SCRIPT POST fastq_file1 /opt/galaxy/dagme/ChtcRun/postjob.pl
>>> RETRY fastq_file1 10
>>> 
>>> I can see DAGMan running, so my guess is that something within the node is causing this.
>>> 
>>> 266865.0   galaxy          5/10 17:14   0+00:00:24 R  0   7.3  condor_dagman -f -
>>> 266866.0    |-fastq_file1  5/10 17:14   0+00:00:00 I  0   0.0  chtcjobwrapper --t
>> 
>> What universe is 266866 running in? Unless it is the local or
>> scheduler universe, this behavior is expected.
>> 
>> From this, I read that DAGMan has been running for 24 seconds.  That
>> means that your job has been in the queue for about 12 seconds.  It is
>> quite likely that it has not yet been considered for a match.  I
>> counsel a certain amount of patience.
>> 
>> Nathan Panike
>