
Re: [Condor-users] Job keeps re-executing after reaching ~ 1GB size but never finishes...



Leo,

Are all of the machines in your Condor pool in that 10.0.40 network? If not, are you using GCB or some other trick to get around the routing issues that arise from using private network space for your execute nodes? Normally, your submit nodes must be able to communicate with your execute nodes. The shadow process on the submit node will try to communicate with the execute node for the job. It looks like that is failing. It may be a routing issue or it may be that you have firewalls running on the machines.
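
A couple of quick checks you can run by hand (a sketch, not gospel; the address below is taken from your ShadowLog, and LOWPORT/HIGHPORT are the stock port-range knobs from the 6.8 manual):

   # From the submit node: can you reach the startd's command port at all?
   # "Network is unreachable" here corresponds to the errno 101 in ShadowLog.
   telnet 10.0.40.112 32771

   # Is there a route covering the 10.0.40 network?
   /sbin/ip route     # or: netstat -rn

If a host firewall turns out to be the culprit, one option is to pin Condor to a fixed port range in condor_config and open just that range, for example:

   LOWPORT  = 9600
   HIGHPORT = 9700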

- dave


Leo Cristobal C. Ambolode II wrote:
Dave,

All three machines are identical: each has 1 GB of RAM, runs Scientific
Linux 4.3, and has at least 40 GB of available disk. I am using Condor
version 6.8.2.

The ShadowLog:
##########################################################
7/29 22:22:22 ******************************************************
7/29 22:22:22 Using config source: /home/condor/condor/etc/condor_config
7/29 22:22:22 Using local config sources:
7/29 22:22:22 /home/condor/condor/local.phys-ugradlab02/condor_config.local
7/29 22:22:22 DaemonCore: Command Socket at <10.0.40.139:32809>
7/29 22:22:22 Initializing a VANILLA shadow for job 552.2
7/29 22:22:22 (552.2) (4805): Request to run on <10.0.40.112:32771> was ACCEPTED
7/30 12:45:25 (552.1) (4719): Got SIGTERM. Performing graceful shutdown.
7/30 12:45:25 (552.0) (4718): Got SIGTERM. Performing graceful shutdown.
7/30 12:45:25 (552.2) (4805): Got SIGTERM. Performing graceful shutdown.
######################################################################

I don't know why the above happens; this is the point at which the jobs
are re-executed.

...continued ShadowLog...
######################################################################
7/30 12:45:26 (552.1) (4719): attempt to connect to <10.0.40.139:32772> failed: Invalid argument (connect errno = 22).  Will keep trying for 20 total seconds (19 to go).

7/30 12:45:26 (552.0) (4718): attempt to connect to <10.0.40.148:32771> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (19 to go).

7/30 12:45:26 (552.2) (4805): attempt to connect to <10.0.40.112:32771> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (19 to go).
#################################################################

The SchedLog:
#################################################################
7/30 12:35:20 (pid:3407) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:40:20 (pid:3407) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:40:20 (pid:3407) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:45:20 (pid:3407) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:45:20 (pid:3407) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:45:25 (pid:3407) Got SIGTERM. Performing graceful shutdown.
7/30 12:45:26 (pid:3407) Called preempt( 1 )
7/30 12:45:27 (pid:3407) SafeMsg: sending small msg failed. errno: 101
7/30 12:45:27 (pid:3407) Can't send EOM to <10.0.40.148:32771>
7/30 12:45:27 (pid:3407) Sent vacate command to <10.0.40.148:32771> for job 552.0
7/30 12:45:29 (pid:3407) Called preempt( 1 )
7/30 12:45:29 (pid:3407) SafeMsg: sending small msg failed. errno: 22
7/30 12:45:29 (pid:3407) Can't send EOM to <10.0.40.139:32772>
7/30 12:45:29 (pid:3407) Sent vacate command to <10.0.40.139:32772> for job 552.1
##################################################################
The above shows the errors just before my jobs are vacated...

...continued SchedLog...
##################################################################
7/30 12:46:40 (pid:3402) ******************************************************
7/30 12:46:40 (pid:3402) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
7/30 12:46:40 (pid:3402) ** /home/condor/condor/sbin/condor_schedd
7/30 12:46:40 (pid:3402) ** $CondorVersion: 6.8.2 Oct 12 2006 $
7/30 12:46:40 (pid:3402) ** $CondorPlatform: I386-LINUX_RHEL3 $
7/30 12:46:40 (pid:3402) ** PID = 3402
7/30 12:46:40 (pid:3402) ** Log last touched 7/30 12:45:29
7/30 12:46:40 (pid:3402) ******************************************************
7/30 12:46:40 (pid:3402) Using config source: /home/condor/condor/etc/condor_config
7/30 12:46:40 (pid:3402) Using local config sources:
7/30 12:46:40 (pid:3402) /home/condor/condor/local.phys-ugradlab02/condor_config.local
##################################################################
(then the jobs are re-executed, as shown above)

Remark: I cannot access lists.cs.wisc.edu (the mailing-list archive).


Thanks,

Leo

Leo,

You will need to say what version of Condor, what operating system, and
maybe some information about the machines in your pool.  How much memory
and virtual memory do the execute nodes have?  Condor jobs can certainly
use more than 1GB of memory.
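
(You can read those numbers straight out of the machine ClassAds; something
like the following, where, if I remember right, Memory is reported in
megabytes and VirtualMemory in kilobytes:)

   condor_status -long | egrep '^(Name|Memory|VirtualMemory) '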

You can also search the log files on both the submit node and the
execute node for one of those job numbers to see if you can find more
information about what is happening to your jobs.
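
For example (the log directory varies from install to install;
condor_config_val will tell you where yours actually is, and <log-dir>
below is just a placeholder):

   condor_config_val LOG
   # on the submit node:
   grep '552\.' <log-dir>/ShadowLog <log-dir>/SchedLog
   # on the execute node:
   grep '552\.' <log-dir>/StartLog <log-dir>/StarterLog*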

- dave


Leo Cristobal C. Ambolode II wrote:
Hi all,

Can anyone explain what happened to my jobs? The batch jobs I submitted
go idle for a while once one of them reaches approximately 1 GB of
output, and then the jobs are re-executed. This repeated twice, so I
decided to remove the jobs because they never finish. I am expecting
each job to return approximately 5 GB. Is this some restriction in the
Condor environment? If so, how can I fix it?

The following is the log file of my job:
############################################
000 (552.000.000) 07/28 21:58:24 Job submitted from host: <10.0.40.139:32771>
...
000 (552.001.000) 07/28 21:58:24 Job submitted from host: <10.0.40.139:32771>
...
000 (552.002.000) 07/28 21:58:24 Job submitted from host: <10.0.40.139:32771>
...
001 (552.000.000) 07/28 21:58:27 Job executing on host: <10.0.40.148:32771>
...
001 (552.001.000) 07/28 21:58:29 Job executing on host: <10.0.40.139:32772>
...
001 (552.002.000) 07/28 21:58:32 Job executing on host: <10.0.40.112:32771>
...
006 (552.000.000) 07/28 21:58:36 Image size of job updated: 232508
...
006 (552.001.000) 07/28 21:58:37 Image size of job updated: 222952
...
006 (552.002.000) 07/28 21:58:40 Image size of job updated: 240012
...
006 (552.000.000) 07/28 22:18:36 Image size of job updated: 299492
...
006 (552.001.000) 07/28 22:18:37 Image size of job updated: 300036
...
006 (552.002.000) 07/28 22:18:40 Image size of job updated: 300564
...
006 (552.000.000) 07/28 22:38:35 Image size of job updated: 299992
...
006 (552.001.000) 07/28 22:38:37 Image size of job updated: 300588
...
006 (552.002.000) 07/28 22:38:40 Image size of job updated: 301064
...
006 (552.002.000) 07/28 22:58:40 Image size of job updated: 301196
...
006 (552.000.000) 07/28 23:18:35 Image size of job updated: 303588
...
006 (552.001.000) 07/28 23:18:37 Image size of job updated: 304188
...
006 (552.002.000) 07/28 23:18:40 Image size of job updated: 304660
...
006 (552.000.000) 07/28 23:38:35 Image size of job updated: 304708
...
006 (552.001.000) 07/28 23:38:37 Image size of job updated: 305304
...
006 (552.002.000) 07/28 23:38:40 Image size of job updated: 305776
...
006 (552.000.000) 07/29 00:38:36 Image size of job updated: 325480
...
006 (552.001.000) 07/29 00:38:37 Image size of job updated: 324032
...
006 (552.002.000) 07/29 00:38:39 Image size of job updated: 324516
...
006 (552.000.000) 07/29 00:58:36 Image size of job updated: 327124
...
006 (552.001.000) 07/29 00:58:37 Image size of job updated: 327980
...
006 (552.002.000) 07/29 01:18:39 Image size of job updated: 327168
...
006 (552.000.000) 07/29 01:38:36 Image size of job updated: 327640
...
006 (552.002.000) 07/29 02:18:40 Image size of job updated: 328708
...
006 (552.000.000) 07/29 03:38:35 Image size of job updated: 338992
...
006 (552.000.000) 07/29 04:38:36 Image size of job updated: 588500
...
006 (552.001.000) 07/29 04:38:37 Image size of job updated: 333832
...
006 (552.002.000) 07/29 04:38:40 Image size of job updated: 339928
...
006 (552.001.000) 07/29 05:38:37 Image size of job updated: 599632
...
006 (552.002.000) 07/29 05:38:40 Image size of job updated: 594704
...
006 (552.002.000) 07/29 05:58:40 Image size of job updated: 601292
...
001 (552.000.000) 07/29 06:23:53 Job executing on host: <10.0.40.139:32772>
...
001 (552.001.000) 07/29 06:34:15 Job executing on host: <10.0.40.148:32771>
...
001 (552.002.000) 07/29 06:34:17 Job executing on host: <10.0.40.112:32771>
...
006 (552.000.000) 07/29 06:44:01 Image size of job updated: 300600
...
006 (552.001.000) 07/29 06:54:23 Image size of job updated: 300148
...
006 (552.002.000) 07/29 06:54:25 Image size of job updated: 299148
...
006 (552.000.000) 07/29 07:04:01 Image size of job updated: 301100
...
006 (552.001.000) 07/29 07:14:23 Image size of job updated: 300616
...
006 (552.002.000) 07/29 07:14:25 Image size of job updated: 299456
...
006 (552.000.000) 07/29 07:24:01 Image size of job updated: 304700
...
006 (552.001.000) 07/29 07:34:23 Image size of job updated: 304212
...
006 (552.002.000) 07/29 07:34:25 Image size of job updated: 303052
...
006 (552.000.000) 07/29 07:44:01 Image size of job updated: 305816
...
006 (552.001.000) 07/29 07:54:23 Image size of job updated: 305332
...
006 (552.002.000) 07/29 07:54:26 Image size of job updated: 304172
...
006 (552.000.000) 07/29 08:44:01 Image size of job updated: 324556
...
006 (552.001.000) 07/29 08:54:23 Image size of job updated: 325088
...
006 (552.002.000) 07/29 08:54:25 Image size of job updated: 323752
...
006 (552.000.000) 07/29 09:04:01 Image size of job updated: 328840
...
006 (552.001.000) 07/29 09:14:23 Image size of job updated: 326720
...
006 (552.002.000) 07/29 09:14:25 Image size of job updated: 326972
...
006 (552.001.000) 07/29 09:54:23 Image size of job updated: 327492
...
006 (552.002.000) 07/29 09:54:25 Image size of job updated: 327228
...
006 (552.000.000) 07/29 11:44:01 Image size of job updated: 334344
...
006 (552.001.000) 07/29 11:54:23 Image size of job updated: 333860
...
006 (552.002.000) 07/29 11:54:25 Image size of job updated: 332700
...
006 (552.002.000) 07/29 12:34:26 Image size of job updated: 439668
...
006 (552.000.000) 07/29 12:44:01 Image size of job updated: 596972
...
006 (552.001.000) 07/29 12:54:23 Image size of job updated: 595696
...
006 (552.002.000) 07/29 12:54:25 Image size of job updated: 598368
...
006 (552.000.000) 07/29 20:04:01 Image size of job updated: 924564
...
006 (552.001.000) 07/29 20:14:23 Image size of job updated: 846144
...
006 (552.000.000) 07/29 20:24:01 Image size of job updated: 1033628
...
001 (552.000.000) 07/29 22:17:22 Job executing on host: <10.0.40.148:32771>
...
001 (552.001.000) 07/29 22:17:24 Job executing on host: <10.0.40.139:32772>
...
001 (552.002.000) 07/29 22:22:24 Job executing on host: <10.0.40.112:32771>
...
006 (552.000.000) 07/29 22:37:30 Image size of job updated: 299556
...
006 (552.001.000) 07/29 22:37:32 Image size of job updated: 298980
...
006 (552.002.000) 07/29 22:42:31 Image size of job updated: 301080
...
006 (552.000.000) 07/29 22:57:31 Image size of job updated: 300056
...
006 (552.001.000) 07/29 22:57:32 Image size of job updated: 299476
...
006 (552.002.000) 07/29 23:02:31 Image size of job updated: 301108
...
006 (552.000.000) 07/29 23:17:30 Image size of job updated: 303652
...
006 (552.002.000) 07/29 23:22:31 Image size of job updated: 304704
...
006 (552.000.000) 07/29 23:37:30 Image size of job updated: 304772
...
006 (552.001.000) 07/29 23:37:32 Image size of job updated: 303076
...
006 (552.002.000) 07/29 23:42:32 Image size of job updated: 305820
...
006 (552.001.000) 07/29 23:57:32 Image size of job updated: 304192
...
006 (552.000.000) 07/30 00:37:30 Image size of job updated: 323696
...
006 (552.002.000) 07/30 00:42:32 Image size of job updated: 326856
...
006 (552.000.000) 07/30 00:57:30 Image size of job updated: 327188
...
006 (552.001.000) 07/30 00:57:32 Image size of job updated: 322928
...
006 (552.002.000) 07/30 01:02:31 Image size of job updated: 328236
...
006 (552.001.000) 07/30 01:17:32 Image size of job updated: 326608
...
006 (552.002.000) 07/30 01:42:32 Image size of job updated: 328752
...
006 (552.001.000) 07/30 02:17:33 Image size of job updated: 327124
...
006 (552.000.000) 07/30 03:37:30 Image size of job updated: 333300
...
006 (552.002.000) 07/30 03:42:31 Image size of job updated: 334352
...
006 (552.001.000) 07/30 03:57:33 Image size of job updated: 372392
...
006 (552.000.000) 07/30 04:17:30 Image size of job updated: 376872
...
006 (552.002.000) 07/30 04:22:31 Image size of job updated: 443972
...
006 (552.000.000) 07/30 04:37:30 Image size of job updated: 597524
...
006 (552.002.000) 07/30 04:42:32 Image size of job updated: 588952
...
006 (552.001.000) 07/30 04:57:33 Image size of job updated: 591804
...
006 (552.002.000) 07/30 06:22:32 Image size of job updated: 596632
...
006 (552.001.000) 07/30 06:37:33 Image size of job updated: 594992
...
006 (552.002.000) 07/30 11:42:32 Image size of job updated: 851356
...
006 (552.000.000) 07/30 11:57:30 Image size of job updated: 977040
...
006 (552.001.000) 07/30 11:57:33 Image size of job updated: 679004
...
006 (552.002.000) 07/30 12:02:33 Image size of job updated: 1034548
...
006 (552.000.000) 07/30 12:17:31 Image size of job updated: 1033368
...
006 (552.001.000) 07/30 12:17:33 Image size of job updated: 1031860
...

...then the job became idle again :(
######################################################################


Thanks,

Leo

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/