[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] strange errors in ShadowLog




We are seeing a lot of strange errors and uncalled-for evictions
on our condor cluster at the moment.  Here is an
excerpt from ShadowLog for a single job that gives the highlights.

There are a couple errors that we haven't normally seen:
One is the logEvictEvent with unknown reason,
and the other is the AUTHENTICATE:1002 failure performing handshake.
I am not sure why the daemons would be handshaking at all as we
are not using either kerberos or GSI authentication at this time.

Any ideas what might be going on?  As you can see, the job in question has
twice been evicted and now is running on a third node.

There are two other mysteries:
"Top" shows that the process is taking only 575M of RAM, well within
the 1000M allocated on this VM. but the ImageSize shows 1315728K,
more than twice as much.  My cluster runs with
a requirement ((Memory * 1024 ) >= ImageSize

Imagesize was bigger than this both times the job was evicted.
It shouldn't have been able to restart at all, but somehow it did.
Many other jobs like it didn't.

Any clues?

Steve Timm



4/12 09:18:03 Initializing a VANILLA shadow for job 238192.0
4/12 09:18:03 (238192.0) (11506): Request to run on <131.225.167.147:27555> was ACCEPTED
4/12 09:18:03 (238192.0) (11506): Asked to write event of number 1.
4/12 09:18:11 (238192.0) (11506): Asked to write event of number 6.
4/12 09:36:48 (238192.0) (11506): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 09:36:48 (238192.0) (11506): IO: Failed to read packet header
4/12 09:40:36 (238192.0) (11506): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 09:40:36 (238192.0) (11506): IO: Failed to read packet header
4/12 09:40:36 (238192.0) (11506): AUTHENTICATE: handshake failed!
4/12 09:40:36 (238192.0) (11506): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 09:40:36 (238192.0) (11506): Asked to write event of number 6.
4/12 09:59:31 (238192.0) (11506): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 09:59:31 (238192.0) (11506): IO: Failed to read packet header
4/12 10:04:31 (238192.0) (11506): condor_read(): timeout reading buffer.
4/12 10:04:31 (238192.0) (11506): IO: Failed to read packet header
4/12 10:04:31 (238192.0) (11506): AUTHENTICATE: handshake failed!
4/12 10:04:31 (238192.0) (11506): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 10:04:31 (238192.0) (11506): Asked to write event of number 6.
4/12 10:18:11 (238192.0) (11506): Asked to write event of number 6.
4/12 10:24:31 (238192.0) (11506): condor_read(): timeout reading buffer.
4/12 10:24:31 (238192.0) (11506): IO: Failed to read packet header
4/12 10:29:31 (238192.0) (11506): condor_read(): timeout reading buffer.
4/12 10:29:31 (238192.0) (11506): IO: Failed to read packet header
4/12 10:29:31 (238192.0) (11506): AUTHENTICATE: handshake failed!
4/12 10:29:31 (238192.0) (11506): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 10:38:11 (238192.0) (11506): Asked to write event of number 6.
4/12 10:48:26 (238192.0) (11506): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 10:48:26 (238192.0) (11506): IO: Failed to read packet header
4/12 10:52:30 (238192.0) (11506): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 10:52:30 (238192.0) (11506): IO: Failed to read packet header
4/12 10:52:30 (238192.0) (11506): AUTHENTICATE: handshake failed!
4/12 10:52:30 (238192.0) (11506): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 10:58:11 (238192.0) (11506): Asked to write event of number 6.
4/12 12:00:25 Initializing a VANILLA shadow for job 238192.0
4/12 12:00:25 (238192.0) (10960): Request to run on <131.225.167.147:27555> was REFUSED
4/12 12:00:25 (238192.0) (10960): Job 238192.0 is being evicted
4/12 12:00:25 (238192.0) (10960): logEvictEvent with unknown reason (108), aborting
4/12 12:05:25 (238192.0) (10960): condor_read(): timeout reading buffer.
4/12 12:05:25 (238192.0) (10960): IO: Failed to read packet header
4/12 12:09:20 (238192.0) (10960): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 12:09:20 (238192.0) (10960): IO: Failed to read packet header
4/12 12:09:20 (238192.0) (10960): AUTHENTICATE: handshake failed!
4/12 12:09:20 (238192.0) (10960): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 12:09:20 (238192.0) (10960): Failed to update job queue!
4/12 12:09:20 (238192.0) (10960): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
4/12 12:31:23 Initializing a VANILLA shadow for job 238192.0
4/12 12:31:23 (238192.0) (5788): Request to run on <131.225.167.190:32777> was ACCEPTED
4/12 12:31:23 (238192.0) (5788): Asked to write event of number 1.
4/12 12:31:31 (238192.0) (5788): Asked to write event of number 6.
4/12 12:51:23 (238192.0) (5788): condor_read(): timeout reading buffer.
4/12 12:51:23 (238192.0) (5788): IO: Failed to read packet header
4/12 12:55:18 (238192.0) (5788): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 12:55:18 (238192.0) (5788): IO: Failed to read packet header
4/12 12:55:18 (238192.0) (5788): AUTHENTICATE: handshake failed!
4/12 12:55:18 (238192.0) (5788): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 12:55:18 (238192.0) (5788): Asked to write event of number 6.
4/12 13:14:35 (238192.0) (5788): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 13:14:35 (238192.0) (5788): IO: Failed to read packet header
4/12 13:17:44 (238192.0) (5788): getpeername failed so connect must have failed
4/12 13:22:50 (238192.0) (5788): condor_read(): timeout reading buffer.
4/12 13:22:50 (238192.0) (5788): IO: Failed to read packet header
4/12 13:22:50 (238192.0) (5788): AUTHENTICATE: handshake failed!
4/12 13:22:50 (238192.0) (5788): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 13:22:50 (238192.0) (5788): Asked to write event of number 6.
4/12 13:31:31 (238192.0) (5788): Asked to write event of number 6.
4/12 14:10:05 Initializing a VANILLA shadow for job 238192.0
4/12 14:10:05 (238192.0) (27233): Request to run on <131.225.167.190:32777> was REFUSED
4/12 14:10:05 (238192.0) (27233): Job 238192.0 is being evicted
4/12 14:10:06 (238192.0) (27233): logEvictEvent with unknown reason (108), aborting
4/12 14:15:06 (238192.0) (27233): condor_read(): timeout reading buffer.
4/12 14:15:06 (238192.0) (27233): IO: Failed to read packet header
4/12 14:16:37 (238192.0) (27233): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
4/12 14:23:16 Initializing a VANILLA shadow for job 238192.0
4/12 14:23:16 (238192.0) (8008): Request to run on <131.225.167.186:32772> was ACCEPTED
4/12 14:23:16 (238192.0) (8008): Asked to write event of number 1.
4/12 14:23:25 (238192.0) (8008): Asked to write event of number 6.
4/12 14:42:12 (238192.0) (8008): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 14:42:12 (238192.0) (8008): IO: Failed to read packet header
4/12 14:47:12 (238192.0) (8008): condor_read(): timeout reading buffer.
4/12 14:47:12 (238192.0) (8008): IO: Failed to read packet header
4/12 14:47:12 (238192.0) (8008): AUTHENTICATE: handshake failed!
4/12 14:47:12 (238192.0) (8008): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 14:47:12 (238192.0) (8008): Asked to write event of number 6.
4/12 15:07:12 (238192.0) (8008): condor_read(): timeout reading buffer.
4/12 15:07:12 (238192.0) (8008): IO: Failed to read packet header
4/12 15:12:12 (238192.0) (8008): condor_read(): timeout reading buffer.
4/12 15:12:12 (238192.0) (8008): IO: Failed to read packet header
4/12 15:12:12 (238192.0) (8008): AUTHENTICATE: handshake failed!
4/12 15:12:12 (238192.0) (8008): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 15:12:12 (238192.0) (8008): Asked to write event of number 6.
4/12 15:23:25 (238192.0) (8008): Asked to write event of number 6.
4/12 15:43:25 (238192.0) (8008): Asked to write event of number 6.
4/12 16:01:29 (238192.0) (8008): condor_read(): recv() returned -1, errno = 104, assuming failure.
4/12 16:01:29 (238192.0) (8008): IO: Failed to read packet header
4/12 16:01:29 (238192.0) (8008): AUTHENTICATE: handshake failed!
4/12 16:01:29 (238192.0) (8008): Authentication Error: AUTHENTICATE:1002:Failure performing handshake 4/12 16:03:25 (238192.0) (8008): Asked to write event of number 6.
4/12 16:21:29 (238192.0) (8008): condor_read(): timeout reading buffer.
4/12 16:21:29 (238192.0) (8008): IO: Failed to read packet header
4/12 16:21:29 (238192.0) (8008): AUTHENTICATE: handshake failed!
4/12 16:21:29 (238192.0) (8008): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
4/12 16:23:25 (238192.0) (8008): Asked to write event of number 6.

----------------

xml-based condor log:

[root@fngp-osg log]# more /export/osg/grid/globus/tmp/gram_job_state/gram_condor_log.23994.1144789277
<c>
    <a n="MyType"><s>SubmitEvent</s></a>
    <a n="EventTypeNumber"><i>0</i></a>
    <a n="EventTime"><s>2006-04-11T16:01:21</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="SubmitHost"><s>&lt;131.225.167.42:19965&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2006-04-12T07:22:51</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
   <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;131.225.167.156:32772&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T07:23:00</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>10004</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T07:46:32</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1297420</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T08:09:22</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1312780</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T08:23:00</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1315852</i></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2006-04-12T09:18:03</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
   <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;131.225.167.147:27555&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T09:18:11</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>9996</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T09:40:36</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1298328</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T10:04:31</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1312664</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T10:18:11</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1315736</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T10:38:11</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1316760</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T10:58:11</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1317784</i></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
   <a n="EventTime"><s>2006-04-12T12:31:23</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;131.225.167.190:32777&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T12:31:31</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>10000</i></a>
</c>
c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T12:55:18</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1298304</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T13:22:50</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1312640</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T13:31:31</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1315712</i></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2006-04-12T14:23:16</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;131.225.167.186:32772&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T14:23:25</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>10012</i></a>
</c>
c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T14:47:12</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1298320</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T15:12:12</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1312656</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T15:23:25</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1315728</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
   <a n="EventTime"><s>2006-04-12T15:43:25</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1316752</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T16:03:25</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1360784</i></a>
</c>
c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T16:23:25</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1462160</i></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2006-04-12T16:43:25</s></a>
    <a n="Cluster"><i>238192</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>1463184</i></a>
</c>



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team