
[Condor-users] fsync() failed in UserLog::writeEvent



Hi.  I am seeing jobs being evicted for no apparent reason.  These
occurrences are accompanied by some odd errors in my SchedLog and
ShadowLog on the master, which is also the submit machine.  The Condor
installation is on NFS, which initially led me to suspect a filesystem
problem.  However, there are no related errors on either the server or
any of the many clients this error has been associated with.  As this is
preventing many jobs from finishing, it is quickly becoming a Condor
show-stopper for us.  Any ideas?

-Jacob

Here are some log snippets:

SchedLog:
---------------------------------------------------------------------
./lop1/log/SchedLog:6/9 10:52:11 Started shadow for job 25885.1 on
"<192.168.1.26:32773>", (shadow pid = 31827)
./lop1/log/SchedLog:6/9 11:05:16 fsync() failed in UserLog::writeEvent -
errno 22 (Invalid argument)6/9 11:05:16 match
(<192.168.1.26:32773>#1113882228#835) out of jobs (cluster id 25885);
relinquishing
./lop1/log/SchedLog:6/9 11:05:16 Match record (<192.168.1.26:32773>,
25885, -1) deleted
./lop1/log/SchedLog:6/9 11:05:43 fsync() failed in UserLog::writeEvent -
errno 22 (Invalid argument)6/9 11:05:43 match
(<192.168.1.25:32773>#1108063081#1164) out of jobs (cluster id 25885);
relinquishing
./lop1/log/SchedLog:6/9 11:05:43 Match record (<192.168.1.25:32773>,
25885, -1) deleted

ShadowLog:
----------------------------------------------------------------------
./lop1/log/ShadowLog:6/8 16:22:16 (25885.1) (3217): Request to run on
<192.168.1.24:32773> was ACCEPTED
./lop1/log/ShadowLog:6/8 16:22:16 (25885.1) (3217): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:22:18
******************************************************
./lop1/log/ShadowLog:6/8 16:22:22 (25886.0) (3219): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:22:24 (25885.1)
(3217): fsync() failed in UserLog::writeEvent - errno 22 (Invalid
argument)6/8 16:22:27 (25885.2) (3218): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:22:30 (25886.0)
(3219): fsync() failed in UserLog::writeEvent - errno 22 (Invalid
argument)6/8 16:22:44 (25883.6) (3048): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:22:44 (25883.6)
(3048): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
./lop1/log/ShadowLog:6/8 16:42:09 (25895.0) (3924): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:42:18 (25895.0)
(3924): fsync() failed in UserLog::writeEvent - errno 22 (Invalid
argument)6/8 16:42:24 (25885.1) (3217): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:43:36 (25887.4)
(3277): fsync() failed in UserLog::writeEvent - errno 22 (Invalid
argument)6/8 16:43:39 (25885.2) (3218): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:43:44 (25887.3)
(3291): fsync() failed in UserLog::writeEvent - errno 22 (Invalid
argument)6/8 16:43:50 (25895.0) (3924): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/8 16:43:50 (25895.0)
(3924): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
./lop1/log/ShadowLog:6/9 02:20:43 (25885.1) (3217): Job 25885.1 is being
evicted
./lop1/log/ShadowLog:6/9 02:20:44 (25887.1) (3289): fsync() failed in
UserLog::writeEvent - errno 22 (Invalid argument)6/9 02:20:47 (25885.1)
(3217): fsync() failed in UserLog::writeEvent - errno 22 (Invalid
argument)6/9 02:20:47 (25887.1) (3289): **** condor_shadow
(condor_SHADOW) EXITING WITH STATUS 107
./lop1/log/ShadowLog:6/9 02:20:47 (25885.1) (3217): **** condor_shadow
(condor_SHADOW) EXITING WITH STATUS 107
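
One way to narrow this down is to check whether fsync() fails on its own,
outside Condor, in the directory that holds the job's user log.  The
following is only a minimal sketch, assuming Python is available on the
submit machine; check_fsync is a hypothetical helper name, not part of
Condor.  It creates a scratch file, writes to it, calls fsync(), and reports
the errno name if the call fails.

```python
import errno
import os
import tempfile


def check_fsync(directory):
    """Write a scratch file in `directory` and call fsync() on it.

    Returns (True, None) if fsync() succeeds, or (False, errno_name)
    if it raises OSError, e.g. (False, "EINVAL") for errno 22.
    """
    # Hypothetical probe file; it is removed again before returning.
    fd, path = tempfile.mkstemp(prefix="fsync_probe_", dir=directory)
    try:
        os.write(fd, b"fsync probe\n")
        try:
            os.fsync(fd)
        except OSError as exc:
            # Map the numeric errno to its symbolic name where known.
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        return True, None
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling check_fsync() on the NFS-mounted log directory and again on a local
directory such as /tmp would show whether the EINVAL comes from the
filesystem itself or only appears when Condor writes the log.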