RE: [condor-users] Condor starter crashes in NTsenders -- Eureka!!!



Eureka!  I have found it!  Well, maybe not found it, but I have a huge clue.  Apparently, this crash is closely tied to having files in the job's main spool directory: if files are there, the starter crashes continually.  If they aren't, it doesn't.

Basically, when I first submit a job (from the Central Manager), it starts off fine on the grid node.  When it's vacated, it returns, and intermediate files are placed in the spool tmp directory.  Sometimes they're committed to the spool main directory, sometimes not (that's another issue I don't understand).  Anyway, if they're committed to the main directory under spool (i.e., \condor\spool\cluster#.proc#.subproc#\), then starter continually crashes when the job is resubmitted.  If the contents of that directory are cleared, the job will begin normally.
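For anyone who wants to try the same workaround, here is a minimal sketch in Python of locating and emptying a job's spool directory. The directory name follows the cluster#.proc#.subproc# pattern described above; the function names and the dry-run default are hypothetical, introduced only for illustration. Note that emptying the directory discards the intermediate files, and nothing in this thread establishes whether that is safe while the job is still queued.

```python
import shutil
from pathlib import Path


def job_spool_dir(spool_root, cluster, proc, subproc=0):
    """Build the per-job spool path, e.g. <spool>/cluster1820.proc0.subproc0."""
    return Path(spool_root) / f"cluster{cluster}.proc{proc}.subproc{subproc}"


def clear_job_spool(spool_root, cluster, proc, subproc=0, dry_run=True):
    """List the entries in a job's spool directory; delete them unless dry_run."""
    job_dir = job_spool_dir(spool_root, cluster, proc, subproc)
    if not job_dir.is_dir():
        return []
    entries = sorted(job_dir.iterdir())
    if not dry_run:
        for entry in entries:
            # Remove real subdirectories recursively; unlink files and symlinks.
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return [entry.name for entry in entries]
```

Calling clear_job_spool(r"C:\Condor\spool", 1820, 0) only reports what is in the directory; passing dry_run=False actually deletes the contents.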

So that's a big step toward figuring out my communications problems.  It still leaves me wondering why files in the spool directory cause problems, and why they sometimes aren't committed from the tmp directory to the main directory after returning to the submit machine.

-David

-----Original Message-----
From: David Vestal 
Sent: Wednesday, March 10, 2004 2:22 PM
To: condor-users@xxxxxxxxxxx
Subject: Re: [condor-users] Condor starter crashes in NTsenders


Colin,

Unfortunately, I can't.  I saved the client's logs at the time, but not the log files on the Central Manager, and those have long since rolled over.

Since then, I've changed the network connection that grid node was using from a wireless card to a direct line into the LAN.  Starter is still continuously exiting, but not with that particular error.

For the current problem, I checked ShadowLog on the Central Manager, and found this:
3/10 14:09:56 (fd:5) (1820.0) (2940): DoUpload: Permission denied to read file C:\Condor/spool\cluster1820.proc0.subproc0\azetidine_t2.out.0!
3/10 14:09:56 (fd:5) (1820.0) (2940): DoUpload: exiting at 1154

The file in question is one of the intermediate files created by the job as a manual checkpoint.
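One way to check whether other files in that directory have the same problem is to try opening each one for reading. The sketch below is only an illustration: the function name is hypothetical, and it assumes the script is run under the same account as the process that reads the spool files, since a different account may have different permissions.

```python
from pathlib import Path


def unreadable_files(directory):
    """Return (name, error) pairs for regular files that cannot be opened for reading."""
    problems = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and other non-regular entries
        try:
            with open(path, "rb") as handle:
                handle.read(1)  # force an actual read, not just an open
        except OSError as exc:
            problems.append((path.name, exc.strerror or str(exc)))
    return problems
```

An empty list means every regular file in the directory could be read by the current account; it says nothing about what another account would see.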

StarterLog on the grid node at this time period reads:
3/10 14:09:56 (fd:5) DaemonCore: Command received via UDP from host <192.168.33.165:3208>
3/10 14:09:56 (fd:5) DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/10 14:09:56 (fd:5) DaemonCore: tid 1280 exited with status 0, invoking reaper 2 <FileTransfer::Reaper()>
3/10 14:09:56 (fd:5) File transfer failed (status=0).
3/10 14:09:56 (fd:3) Calling client FileTransfer handler function.
3/10 14:09:56 (fd:3) ERROR "Failed to transfer files" at line 577 in file ..\src\condor_starter.V6.1\starter_class.C
3/10 14:09:56 (fd:3) ShutdownFast all jobs.
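To pull lines like the ERROR entry above out of a long StarterLog, something like the following Python sketch may help. The regular expression is an assumption based only on the log lines quoted in this thread, not on any documented log format, and it will not match an entry that has been wrapped across two lines by a mail client.

```python
import re

# Pattern inferred from the quoted log lines: timestamp, then
# ERROR "<message>" at line <N> in file <path>.
ERROR_RE = re.compile(
    r'^(?P<stamp>\d+/\d+ \d+:\d+:\d+) .*?ERROR "(?P<message>[^"]*)" '
    r'at line (?P<line>\d+) in file (?P<file>\S+)'
)


def find_errors(log_lines):
    """Return (timestamp, message, line number, source file) for each ERROR entry."""
    hits = []
    for line in log_lines:
        match = ERROR_RE.match(line.strip())
        if match:
            hits.append(
                (match["stamp"], match["message"], int(match["line"]), match["file"])
            )
    return hits
```

Passing an open log file works too, since iterating a file yields its lines: find_errors(open("StarterLog")).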

Does this help?
-David


-----Original Message-----
From: Colin Stolley [mailto:stolley@xxxxxxxxxxx]
Sent: Tuesday, March 09, 2004 6:14 PM
To: condor-users@xxxxxxxxxxx
Subject: Re: [condor-users] Condor starter crashes in NTsenders


>then immediately crashes.  The StarterLog on the run machine contains this 
>to explain the crashes:
>
>3/4 16:41:16 (fd:3) In CStarter::StartJob()
>3/4 16:41:16 (fd:3) Doing CONDOR_get_job_info
>3/4 16:41:16 (fd:3) ERROR "Assertion ERROR on (result)" at line 148 in 
>file ..\src\condor_starter.V6.1\NTsenders.C
>3/4 16:41:16 (fd:3) ShutdownFast all jobs.

Can you post a snippet of the corresponding ShadowLog when this happens?

thanks,
Colin
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>