
[Condor-users] DAGman job submissions stall



Hi,

I have a problem with DAGman: when I submit a DAGman submission file, things
go smoothly at first (DAGman submits jobs to Condor in blocks of
DAGMAN_MAX_SUBMITS_PER_INTERVAL), but after a while submission stalls and
DAGman submits only one job every 30 seconds, and later only one per minute.
I have no clue why this happens. Can anyone help me with this?

Thanks,

- Jan

---------------------------------

Here is part of xxx.dagman.out. The jobs are named S0 through S255 and are
all independent of each other. (Jobs that depend on these follow later.)
My remarks are between <<...>>.

<<snip>>

6/22 00:26:13 Dag contains 392 total jobs
6/22 00:26:13 Deleting any older versions of log files...
6/22 00:26:13 Bootstrapping...
6/22 00:26:13 Number of pre-completed jobs: 0
6/22 00:26:13 Registering condor_event_timer...
6/22 00:26:14 Submitting Condor Job S0 ...
6/22 00:26:14 submitting: condor_submit  -a 'dag_node_name = S0' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S0.out' -a 'MDFIVE = 0-FFFFFFFFE' -a 'TRANSFER =
S0.out0,S0.out1,S0.out2,S0.out3,S0.out4,S0.out5,S0.out6,S0.out7'
sorter.condor 2>&1
6/22 00:26:14   assigned Condor ID (3555.0.0)
6/22 00:26:14 Submitting Condor Job S1 ...

<<same for jobs S2 ... S19>>

6/22 00:26:16 Submitting Condor Job S19 ...
6/22 00:26:16 submitting: condor_submit  -a 'dag_node_name = S19' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S19.out' -a 'MDFIVE = 12FFFFFFFFF-13FFFFFFFFE' -a 'TRANSFER =
S19.out0,S19.out1,S19.out2,S19.out3,S19.out4,S19.out5,S19.out6,S19.out7'
sorter.condor 2>&1
6/22 00:26:16   assigned Condor ID (3574.0.0)
6/22 00:26:16 Just submitted 20 jobs this cycle...
6/22 00:26:16 Event: ULOG_SUBMIT for Condor Job S0 (3555.0.0)
6/22 00:26:16 Event: ULOG_SUBMIT for Condor Job S1 (3556.0.0)

<<same for S2 .. S19>>

6/22 00:26:16 Event: ULOG_SUBMIT for Condor Job S19 (3574.0.0)
6/22 00:26:16 Of 392 nodes total:
6/22 00:26:16  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
6/22 00:26:16   ===     ===      ===     ===     ===        ===      ===
6/22 00:26:16     0       0       20       0     236        136        0
6/22 00:26:21 Submitting Condor Job S20 ...

<<everything is fine so far: 20 (= DAGMAN_MAX_SUBMITS_PER_INTERVAL) jobs are
submitted per cycle>>
<<things remain fine until S231. We continue xxx.dagman.out from the submit
event for job S219>>

6/22 00:28:08 Event: ULOG_SUBMIT for Condor Job S219 (3774.0.0)
6/22 00:28:08 Of 392 nodes total:
6/22 00:28:08  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
6/22 00:28:08   ===     ===      ===     ===     ===        ===      ===
6/22 00:28:08    35       0      185       0      44        128        0

<<everything still fine: 220 jobs submitted, 185 queued and 35 finished>>
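As a sanity check on those counts, here is a minimal Python sketch that parses one status row from the log. It assumes the seven-column layout shown in the header above (Done, Pre, Queued, Post, Ready, Un-Ready, Failed); the function name parse_status_row is just something I made up for illustration.

```python
def parse_status_row(line):
    """Return the seven node counts from a status row as a dict."""
    fields = ["Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed"]
    # The first two tokens are the date and time; the rest are the counts.
    counts = [int(tok) for tok in line.split()[2:]]
    if len(counts) != len(fields):
        raise ValueError("not a status row: %r" % line)
    return dict(zip(fields, counts))

row = parse_status_row(
    "6/22 00:28:08    35       0      185       0      44        128        0"
)
submitted = row["Done"] + row["Queued"]  # nodes already handed to Condor
print(submitted, sum(row.values()))      # prints "220 392"
```

The second number matches the "Of 392 nodes total" line, so the bookkeeping itself looks consistent at this point.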
<<at job S231 things start to stall: its condor_submit takes 30 seconds>>

6/22 00:28:13 Submitting Condor Job S220 ...
6/22 00:28:13 submitting: condor_submit  -a 'dag_node_name = S220' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S220.out' -a 'MDFIVE = DBFFFFFFFFF-DCFFFFFFFFE' -a 'TRANSFER =
S220.out0,S220.out1,S220.out2,S220.out3,S220.out4,S220.out5,S220.out6,S220.out7'
sorter.condor 2>&1
6/22 00:28:13   assigned Condor ID (3775.0.0)
6/22 00:28:13 Submitting Condor Job S221 ...
6/22 00:28:13 submitting: condor_submit  -a 'dag_node_name = S221' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S221.out' -a 'MDFIVE = DCFFFFFFFFF-DDFFFFFFFFE' -a 'TRANSFER =
S221.out0,S221.out1,S221.out2,S221.out3,S221.out4,S221.out5,S221.out6,S221.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3776.0.0)
6/22 00:28:14 Submitting Condor Job S222 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S222' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S222.out' -a 'MDFIVE = DDFFFFFFFFF-DEFFFFFFFFE' -a 'TRANSFER =
S222.out0,S222.out1,S222.out2,S222.out3,S222.out4,S222.out5,S222.out6,S222.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3777.0.0)
6/22 00:28:14 Submitting Condor Job S223 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S223' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S223.out' -a 'MDFIVE = DEFFFFFFFFF-DFFFFFFFFFE' -a 'TRANSFER =
S223.out0,S223.out1,S223.out2,S223.out3,S223.out4,S223.out5,S223.out6,S223.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3778.0.0)
6/22 00:28:14 Submitting Condor Job S224 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S224' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S224.out' -a 'MDFIVE = DFFFFFFFFFF-E0FFFFFFFFE' -a 'TRANSFER =
S224.out0,S224.out1,S224.out2,S224.out3,S224.out4,S224.out5,S224.out6,S224.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3779.0.0)
6/22 00:28:14 Submitting Condor Job S225 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S225' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S225.out' -a 'MDFIVE = E0FFFFFFFFF-E1FFFFFFFFE' -a 'TRANSFER =
S225.out0,S225.out1,S225.out2,S225.out3,S225.out4,S225.out5,S225.out6,S225.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3780.0.0)
6/22 00:28:14 Submitting Condor Job S226 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S226' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S226.out' -a 'MDFIVE = E1FFFFFFFFF-E2FFFFFFFFE' -a 'TRANSFER =
S226.out0,S226.out1,S226.out2,S226.out3,S226.out4,S226.out5,S226.out6,S226.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3781.0.0)
6/22 00:28:14 Submitting Condor Job S227 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S227' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S227.out' -a 'MDFIVE = E2FFFFFFFFF-E3FFFFFFFFE' -a 'TRANSFER =
S227.out0,S227.out1,S227.out2,S227.out3,S227.out4,S227.out5,S227.out6,S227.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3782.0.0)
6/22 00:28:14 Submitting Condor Job S228 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S228' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S228.out' -a 'MDFIVE = E3FFFFFFFFF-E4FFFFFFFFE' -a 'TRANSFER =
S228.out0,S228.out1,S228.out2,S228.out3,S228.out4,S228.out5,S228.out6,S228.out7'
sorter.condor 2>&1
6/22 00:28:14   assigned Condor ID (3783.0.0)
6/22 00:28:14 Submitting Condor Job S229 ...
6/22 00:28:14 submitting: condor_submit  -a 'dag_node_name = S229' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S229.out' -a 'MDFIVE = E4FFFFFFFFF-E5FFFFFFFFE' -a 'TRANSFER =
S229.out0,S229.out1,S229.out2,S229.out3,S229.out4,S229.out5,S229.out6,S229.out7'
sorter.condor 2>&1
6/22 00:28:15   assigned Condor ID (3784.0.0)
6/22 00:28:15 Submitting Condor Job S230 ...
6/22 00:28:15 submitting: condor_submit  -a 'dag_node_name = S230' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S230.out' -a 'MDFIVE = E5FFFFFFFFF-E6FFFFFFFFE' -a 'TRANSFER =
S230.out0,S230.out1,S230.out2,S230.out3,S230.out4,S230.out5,S230.out6,S230.out7'
sorter.condor 2>&1
6/22 00:28:15   assigned Condor ID (3785.0.0)
6/22 00:28:15 Submitting Condor Job S231 ...
6/22 00:28:15 submitting: condor_submit  -a 'dag_node_name = S231' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S231.out' -a 'MDFIVE = E6FFFFFFFFF-E7FFFFFFFFE' -a 'TRANSFER =
S231.out0,S231.out1,S231.out2,S231.out3,S231.out4,S231.out5,S231.out6,S231.out7'
sorter.condor 2>&1

<< !!! Condor suddenly waits for 30 seconds before submitting a new job !!! >>

6/22 00:28:45   assigned Condor ID (3786.0.0)
6/22 00:28:45 Submitting Condor Job S232 ...
6/22 00:28:45 submitting: condor_submit  -a 'dag_node_name = S232' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S232.out' -a 'MDFIVE = E7FFFFFFFFF-E8FFFFFFFFE' -a 'TRANSFER =
S232.out0,S232.out1,S232.out2,S232.out3,S232.out4,S232.out5,S232.out6,S232.out7'
sorter.condor 2>&1

<<and another 30 secs>>

6/22 00:29:16   assigned Condor ID (3787.0.0)
6/22 00:29:16 Submitting Condor Job S233 ...
6/22 00:29:16 submitting: condor_submit  -a 'dag_node_name = S233' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S233.out' -a 'MDFIVE = E8FFFFFFFFF-E9FFFFFFFFE' -a 'TRANSFER =
S233.out0,S233.out1,S233.out2,S233.out3,S233.out4,S233.out5,S233.out6,S233.out7'
sorter.condor 2>&1

<<and another 30>>

6/22 00:29:47   assigned Condor ID (3788.0.0)
6/22 00:29:47 Submitting Condor Job S234 ...
6/22 00:29:47 submitting: condor_submit  -a 'dag_node_name = S234' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S234.out' -a 'MDFIVE = E9FFFFFFFFF-EAFFFFFFFFE' -a 'TRANSFER =
S234.out0,S234.out1,S234.out2,S234.out3,S234.out4,S234.out5,S234.out6,S234.out7'
sorter.condor 2>&1

<<you get the picture>>

6/22 00:30:17   assigned Condor ID (3789.0.0)
6/22 00:30:17 Submitting Condor Job S235 ...
6/22 00:30:17 submitting: condor_submit  -a 'dag_node_name = S235' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S235.out' -a 'MDFIVE = EAFFFFFFFFF-EBFFFFFFFFE' -a 'TRANSFER =
S235.out0,S235.out1,S235.out2,S235.out3,S235.out4,S235.out5,S235.out6,S235.out7'
sorter.condor 2>&1
6/22 00:30:48   assigned Condor ID (3790.0.0)
6/22 00:30:48 Submitting Condor Job S236 ...
6/22 00:30:48 submitting: condor_submit  -a 'dag_node_name = S236' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S236.out' -a 'MDFIVE = EBFFFFFFFFF-ECFFFFFFFFE' -a 'TRANSFER =
S236.out0,S236.out1,S236.out2,S236.out3,S236.out4,S236.out5,S236.out6,S236.out7'
sorter.condor 2>&1

<<here it waits for a whole minute>>

6/22 00:31:48   assigned Condor ID (3791.0.0)
6/22 00:31:48 Submitting Condor Job S237 ...
6/22 00:31:48 submitting: condor_submit  -a 'dag_node_name = S237' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S237.out' -a 'MDFIVE = ECFFFFFFFFF-EDFFFFFFFFE' -a 'TRANSFER =
S237.out0,S237.out1,S237.out2,S237.out3,S237.out4,S237.out5,S237.out6,S237.out7'
sorter.condor 2>&1
6/22 00:32:48   assigned Condor ID (3792.0.0)
6/22 00:32:48 Submitting Condor Job S238 ...
6/22 00:32:48 submitting: condor_submit  -a 'dag_node_name = S238' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S238.out' -a 'MDFIVE = EDFFFFFFFFF-EEFFFFFFFFE' -a 'TRANSFER =
S238.out0,S238.out1,S238.out2,S238.out3,S238.out4,S238.out5,S238.out6,S238.out7'
sorter.condor 2>&1
6/22 00:33:49   assigned Condor ID (3793.0.0)
6/22 00:33:49 Submitting Condor Job S239 ...
6/22 00:33:49 submitting: condor_submit  -a 'dag_node_name = S239' -a
'+DAGManJobID = 3554.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)'
-a 'OUT = S239.out' -a 'MDFIVE = EEFFFFFFFFF-EFFFFFFFFFE' -a 'TRANSFER =
S239.out0,S239.out1,S239.out2,S239.out3,S239.out4,S239.out5,S239.out6,S239.out7'
sorter.condor 2>&1
6/22 00:34:49   assigned Condor ID (3794.0.0)
6/22 00:34:49 Just submitted 20 jobs this cycle...
6/22 00:36:49 Event: ULOG_JOB_TERMINATED for Condor Job S27 (3582.0.0)
6/22 00:36:49 Job S27 completed successfully.
6/22 00:39:19 Event: ULOG_JOB_TERMINATED for Condor Job S31 (3586.0.0)
6/22 00:39:19 Job S31 completed successfully.
6/22 00:40:49 Event: ULOG_EXECUTE for Condor Job S39 (3594.0.0)
6/22 00:41:19 Event: ULOG_EXECUTE for Condor Job S40 (3595.0.0)
6/22 00:41:19 Event: ULOG_IMAGE_SIZE for Condor Job S33 (3588.0.0)
6/22 00:41:19 Event: ULOG_JOB_TERMINATED for Condor Job S33 (3588.0.0)
6/22 00:41:19 Job S33 completed successfully.
6/22 00:41:19 Event: ULOG_JOB_TERMINATED for Condor Job S36 (3591.0.0)
6/22 00:41:19 Job S36 completed successfully.
6/22 00:41:19 Event: ULOG_SUBMIT for Condor Job S220 (3775.0.0)
6/22 00:41:19 Event: ULOG_SUBMIT for Condor Job S221 (3776.0.0)
6/22 00:41:19 Event: ULOG_SUBMIT for Condor Job S222 (3777.0.0)
6/22 00:41:19 Event: ULOG_SUBMIT for Condor Job S223 (3778.0.0)
6/22 00:41:19 Event: ULOG_SUBMIT for Condor Job S224 (3779.0.0)
6/22 00:41:19 Event: ULOG_SUBMIT for Condor Job S225 (3780.0.0)
6/22 00:41:19 Event: ULOG_SUBMIT for Condor Job S226 (3781.0.0)

<<etc...>>
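
In case it helps anyone reproduce the numbers, here is a rough Python sketch that measures how long each condor_submit call took, based only on the two kinds of lines visible in the excerpt ("submitting: condor_submit ..." followed by "assigned Condor ID ..."). The names submit_latencies, STAMP and NODE are my own, and the regular expressions assume the timestamp and 'dag_node_name = ...' formats shown above. The log carries no year, so a dummy year is prepended purely so that time differences can be computed.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt, e.g. "6/22 00:28:15".
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")
NODE = re.compile(r"'dag_node_name = (\w+)'")

def submit_latencies(lines):
    """Yield (node, seconds) between a 'submitting:' line and the
    next 'assigned Condor ID' line."""
    pending = None  # (node name, time the condor_submit call was logged)
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # wrapped continuation lines carry no timestamp
        # Dummy leap year: only differences between timestamps matter.
        when = datetime.strptime("2004 " + m.group(1), "%Y %m/%d %H:%M:%S")
        rest = m.group(2)
        if rest.startswith("submitting:"):
            node = NODE.search(rest)
            pending = (node.group(1) if node else "?", when)
        elif rest.strip().startswith("assigned Condor ID") and pending:
            yield pending[0], (when - pending[1]).total_seconds()
            pending = None

sample = [
    "6/22 00:28:15 Submitting Condor Job S230 ...",
    "6/22 00:28:15 submitting: condor_submit  -a 'dag_node_name = S230' -a",
    "6/22 00:28:15   assigned Condor ID (3785.0.0)",
    "6/22 00:28:15 Submitting Condor Job S231 ...",
    "6/22 00:28:15 submitting: condor_submit  -a 'dag_node_name = S231' -a",
    "6/22 00:28:45   assigned Condor ID (3786.0.0)",
]
for node, secs in submit_latencies(sample):
    print(node, secs)
```

On the six sample lines (taken from the excerpt above) this reports 0 seconds for S230 and 30 seconds for S231, which is where the stall first shows up.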