Re: [Condor-users] starter condor_write() failed



On 5/15/2012 10:00 PM, 杨萌萌 wrote:
Hi,

I set up a Condor pool with two machines.
The version of Condor is 7.6.7 and the OS is Fedora 14.
When I use Condor to run a workflow, it fails as follows.


Could you give us some context about what appears wrong?

Does your job go on hold?  What do you see when you do condor_q?

If your job is on hold (state "H"), try entering "condor_q -hold" to see the reason why it was held.

Based on the logs below, my guess is Condor put your job on hold because your submit file specified a file (maybe named diff.000004.000008.fits?) in transfer_output_files that did not exist after the job exited.
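
One way to check this guess against a starter log is to pull out the job's exit status and any files the starter could not stat. The following is a minimal Python sketch, not a Condor tool: the function name summarize_starter_log and the two regular expressions are my own assumptions, written against the wording of the log lines quoted below, and the message text may differ in other Condor versions. Because mail clients wrap long log lines, the sketch collapses whitespace before matching.

```python
import re

# Patterns assume the message wording seen in the starter log quoted in this thread.
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(\d+)")
STAT_RE = re.compile(r"Failed to stat file '([^']+)': No such file or directory")


def summarize_starter_log(text):
    """Return (exits, missing): (pid, status) pairs and paths that could not be stat'ed."""
    # Collapse newlines so entries wrapped across several lines still match.
    flat = " ".join(text.split())
    exits = [(int(pid), int(status)) for pid, status in EXIT_RE.findall(flat)]
    missing = STAT_RE.findall(flat)
    return exits, missing


if __name__ == "__main__":
    sample = """\
05/15/12 22:17:20 Process exited, pid=8053, status=1
05/15/12 22:17:20 ReliSock::put_file_with_permissions(): Failed to stat
file
'/home/condor/localcondor/execute/dir_8051/diff.000004.000008.fits': No
such file or directory (errno: 2, si_error: 1)
"""
    exits, missing = summarize_starter_log(sample)
    print("exit statuses:", exits)
    print("missing output files:", missing)
```

A non-zero exit status together with a missing output path would be consistent with the guess above, though it does not by itself prove why the job was held.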

I also notice that your job exited in less than one second with status 1, so maybe your program died before it could create the output file you specified. You may want to confirm your program runs properly outside of Condor, and/or take a look at the stderr of your program (specified via error= in the submit description file).
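
To make the "run it outside of Condor" check concrete, here is a small Python sketch under stated assumptions: run_and_check is a hypothetical helper of mine, and the failing command in the demo is only a stand-in for your real executable and arguments. It runs a command in a scratch directory, captures stderr, and lists any expected output files that were not created.

```python
import os
import subprocess
import sys
import tempfile


def run_and_check(cmd, expected_outputs, workdir):
    """Run cmd in workdir; report exit code, stderr, and expected files that are missing."""
    result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    missing = [name for name in expected_outputs
               if not os.path.exists(os.path.join(workdir, name))]
    return {"returncode": result.returncode, "stderr": result.stderr, "missing": missing}


if __name__ == "__main__":
    # Stand-in for the real job: exits with status 1 and writes no output file.
    failing_job = [sys.executable, "-c",
                   "import sys; sys.stderr.write('boom\\n'); sys.exit(1)"]
    with tempfile.TemporaryDirectory() as scratch:
        report = run_and_check(failing_job, ["diff.000004.000008.fits"], scratch)
    print(report)
```

If the real program shows a non-zero return code or a non-empty "missing" list when run this way, the problem is likely in the program or its inputs rather than in Condor's file transfer.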

Hope this helps,
Todd

Startlog
05/15/12 22:17:20 Output file:
/home/condor/localcondor/execute/dir_8051/_condor_stdout
05/15/12 22:17:20 Error file:
/home/condor/localcondor/execute/dir_8051/_condor_stderr
05/15/12 22:17:20 About to exec
/home/condor/localcondor/execute/dir_8051/condor_exec.exe
05/15/12 22:17:20 Create_Process succeeded, pid=8053
05/15/12 22:17:20 Process exited, pid=8053, status=1
05/15/12 22:17:20 ReliSock::put_file_with_permissions(): Failed to stat
file
'/home/condor/localcondor/execute/dir_8051/diff.000004.000008.fits': No
such file or directory (errno: 2, si_error: 1)
05/15/12 22:17:20 DoUpload: (Condor error code 13, subcode 2) STARTER
at 192.168.1.105 failed to send file(s) to <192.168.1.105:38037>: error
reading from
/home/condor/localcondor/execute/dir_8051/diff.000004.000008.fits:
(errno 2) No such file or directory; SHADOW failed to receive file(s)
from <192.168.1.105:55967>
05/15/12 22:17:20 JICShadow::notifyJobTermination(): Sending mock
terminate event.
05/15/12 22:17:20 JIC::transferOutput() failed, waiting for job lease to
expire or for a reconnect attempt
05/15/12 22:17:20 Returning from CStarter::JobReaper()
05/15/12 22:17:20 Got SIGQUIT. Performing fast shutdown.
05/15/12 22:17:20 ShutdownFast all jobs.
05/15/12 22:17:20 condor_read() failed: recv() returned -1, errno = 104
Connection reset by peer, reading 5 bytes from <192.168.1.105:36233>.
05/15/12 22:17:20 IO: Failed to read packet header
05/15/12 22:17:20 condor_write(): Socket closed when trying to write 97
bytes to <192.168.1.105:36233>, fd is 6
05/15/12 22:17:20 Buf::write(): condor_write() failed
05/15/12 22:17:20 Failed to send job exit status to shadow
05/15/12 22:17:20 JobExit() failed, waiting for job lease to expire or
for a reconnect attempt
05/15/12 22:17:40 Got SIGTERM. Performing graceful shutdown.
05/15/12 22:17:40 ShutdownGraceful all jobs.
05/15/12 22:17:40 condor_write(): Socket closed when trying to write 97
bytes to <192.168.1.105:36233>, fd is 6
05/15/12 22:17:40 Buf::write(): condor_write() failed
05/15/12 22:17:40 Failed to send job exit status to shadow
05/15/12 22:17:40 JobExit() failed, waiting for job lease to expire or
for a reconnect attempt
05/15/12 22:17:40 **** condor_starter (condor_STARTER) pid 8051 EXITING
WITH STATUS 0


Matchlog
<192.168.1.105:55934> preempting none <192.168.1.106:45394> xuwei.shanda.com
05/15/12 22:15:59 Matched 114.0 condor@xxxxxxxxxx <192.168.1.105:55934> preempting none
<192.168.1.105:49373> yang.shanda.com
05/15/12 22:15:59 Rejected 115.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found
05/15/12 22:15:59 Rejected 108.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found
05/15/12 22:16:19 Rejected 118.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found
05/15/12 22:16:19 Rejected 108.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found
05/15/12 22:17:19 Matched 115.0 condor@xxxxxxxxxx <192.168.1.105:55934> preempting none
<192.168.1.106:45394> xuwei.shanda.com
05/15/12 22:17:19 Matched 116.0 condor@xxxxxxxxxx <192.168.1.105:55934> preempting none
<192.168.1.105:49373> yang.shanda.com
05/15/12 22:17:19 Rejected 118.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found
05/15/12 22:17:19 Rejected 108.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found

NegotiatorLog
05/15/12 22:17:19 ---------- Started Negotiation Cycle ----------
05/15/12 22:17:19 Phase 1: Obtaining ads from collector ...
05/15/12 22:17:19 Getting all public ads ...
05/15/12 22:17:19 Sorting 7 ads ...
05/15/12 22:17:19 Getting startd private ads ...
05/15/12 22:17:19 Got ads: 7 public and 2 private
05/15/12 22:17:19 Public ads include 1 submitter, 2 startd
05/15/12 22:17:19 Phase 2: Performing accounting ...
05/15/12 22:17:19 Phase 3: Sorting submitter ads by priority ...
05/15/12 22:17:19 Phase 4.1: Negotiating with schedds ...
05/15/12 22:17:19 Negotiating with condor@xxxxxxxxxx at <192.168.1.105:55934>
05/15/12 22:17:19 0 seconds so far
05/15/12 22:17:19 Request 00115.00000:
05/15/12 22:17:19 Matched 115.0 condor@xxxxxxxxxx <192.168.1.105:55934> preempting none
<192.168.1.106:45394> xuwei.shanda.com
05/15/12 22:17:19 Successfully matched with xuwei.shanda.com
05/15/12 22:17:19 Request 00116.00000:
05/15/12 22:17:19 Matched 116.0 condor@xxxxxxxxxx <192.168.1.105:55934> preempting none
<192.168.1.105:49373> yang.shanda.com
05/15/12 22:17:19 Successfully matched with yang.shanda.com
05/15/12 22:17:19 Request 00118.00000:
05/15/12 22:17:19 Rejected 118.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found
05/15/12 22:17:19 Request 00108.00000:
05/15/12 22:17:19 Rejected 108.0 condor@xxxxxxxxxx <192.168.1.105:55934>: no match found
05/15/12 22:17:19 Got NO_MORE_JOBS; done negotiating
05/15/12 22:17:19 negotiateWithGroup resources used scheddAds length 1
05/15/12 22:17:19 ---------- Finished Negotiation Cycle ----------
05/15/12 22:18:17 Got SIGTERM. Performing graceful shutdown.
05/15/12 22:18:17 **** condor_negotiator (condor_NEGOTIATOR) pid 7249
EXITING WITH STATUS 0


I'm new to Condor.
I would appreciate any answers and advice.
Thank you for your help.

Yang






--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
Condor Project Technical Lead          1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685