
[condor-users] Output file getting lost



Hi,
I'm running Condor on 20 computers that transform a given file, and I've run into a problem: sometimes the output isn't returned. The job does a lot of transformations and runs overnight, and when one of the files isn't returned, the whole run gets stuck for the night, which is really painful, as we are short of time.

The job's log file says:
--------
005 (2884.000.000) 03/27 15:03:16 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:16, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:16, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	205203  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	205203  -  Total Bytes Received By Job
--------

which is wrong, as the job should return at least 200 KB, yet the log shows 0 bytes sent. I checked StartLog and it doesn't contain anything suspicious. One strange thing I noticed, though, is that this failure happens when something has been messing with the system time. StartLog says:
--------
3/27 15:02:50 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/27 15:02:50 Got activate_claim request from shadow (<10.11.2.89:3177>)
3/27 15:02:50 Remote job ID is 2884.0
3/27 15:02:51 Got universe "VANILLA" (5) from request classad
3/27 15:02:51 State change: claim-activation protocol successful
3/27 15:02:51 Changing activity: Idle -> Busy
3/27 14:57:19 DaemonCore: Command received via TCP from host <10.11.2.89:3229>
3/27 14:57:19 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/27 14:57:19 Called deactivate_claim_forcibly()
3/27 14:57:20 DaemonCore: Command received via UDP from host <10.11.2.208:1327>
3/27 14:57:20 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/27 14:57:20 Starter pid 1884 exited with status 0
3/27 14:57:20 State change: starter exited
3/27 14:57:20 Changing activity: Busy -> Idle
3/27 14:57:20 DaemonCore: Command received via UDP from host <10.11.2.89:3234>
3/27 14:57:20 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
3/27 14:57:20 State change: received RELEASE_CLAIM command
3/27 14:57:20 Changing state and activity: Claimed/Idle -> Preempting/Vacating
3/27 14:57:20 State change: No preempting claim, returning to owner
3/27 14:57:20 ERROR in CpuAttributes::cpu_busy_time() - negative cpu busy time!, returning 0
3/27 14:57:20 Changing state and activity: Preempting/Vacating -> Owner/Idle
3/27 14:57:20 State change: IS_OWNER is false
3/27 14:57:20 Changing state: Owner -> Unclaimed
3/27 14:57:21 DaemonCore: Command received via UDP from host <10.11.2.89:3235>
3/27 14:57:21 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
3/27 14:57:21 Error: can't find resource with capability (<10.11.2.208:3835>#1697071605)
3/27 15:03:20 loadavg thread died, restarting. (exit code=6)
--------
I had to kill the job after some time, which explains command 404, but notice that the time went *backwards* (from 15:02 to 14:57). Could that cause this failure?
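
To check whether the same thing happens on the other machines, a rough Python sketch like the one below could scan a StartLog for timestamps that go backwards. It assumes every entry starts with the "M/D HH:MM:SS" prefix seen in the excerpt above. The log lines carry no year, so the year used here is an arbitrary placeholder, and the function name find_backward_jumps is made up for illustration.

```python
import re
from datetime import datetime

# Matches the "M/D HH:MM:SS " prefix seen in the StartLog excerpt.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def find_backward_jumps(lines, year=2000):
    """Return (line_no, previous, current) for each timestamp that goes backwards.

    The year is an assumption: the log does not record one, so a fixed
    leap year is used only to make the timestamps comparable.
    """
    jumps = []
    prev = None
    for line_no, line in enumerate(lines, start=1):
        m = STAMP.match(line)
        if not m:
            continue  # lines without a timestamp prefix are skipped
        month, day, hh, mm, ss = map(int, m.groups())
        current = datetime(year, month, day, hh, mm, ss)
        if prev is not None and current < prev:
            jumps.append((line_no, prev, current))
        prev = current
    return jumps


# A few lines taken from the StartLog excerpt.
sample = [
    "3/27 15:02:51 Changing activity: Idle -> Busy",
    "3/27 14:57:19 DaemonCore: Command received via TCP from host <10.11.2.89:3229>",
    "3/27 14:57:20 Starter pid 1884 exited with status 0",
    "3/27 15:03:20 loadavg thread died, restarting. (exit code=6)",
]

for line_no, prev, cur in find_backward_jumps(sample):
    print(f"line {line_no}: clock went back from {prev:%H:%M:%S} to {cur:%H:%M:%S}")
```

Running it over the whole StartLog of each client (for example with open(path) as f: find_backward_jumps(f)) would show how often and by how much the clock steps back.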

On all my clients I have Automachron installed, which synchronizes the time every 30 seconds (with ntp.maths.tcd.ie), so these system time fluctuations shouldn't happen.

Does anybody know what causes this problem (and how to fix it)?
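
Until the cause is known, a small check like this could at least flag the affected jobs early instead of leaving the run stuck all night. It is only a sketch based on the job log excerpt above: it assumes event 005 is the "Job terminated" event, as shown there, and that a "Total Bytes Sent By Job" value of 0 means no output came back. The function name jobs_with_no_output is hypothetical.

```python
import re

# Event header as in the excerpt: "005 (2884.000.000) 03/27 15:03:16 Job terminated."
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")
# Detail line as in the excerpt: "\t0  -  Total Bytes Sent By Job"
SENT = re.compile(r"^\s*(\d+)\s+-\s+Total Bytes Sent By Job")


def jobs_with_no_output(lines):
    """Return "cluster.proc" ids of terminated jobs that report 0 total bytes sent."""
    suspicious = []
    current = None  # job id of the termination event being read, if any
    for line in lines:
        h = HEADER.match(line)
        if h:
            # Only track 005 events (assumed to be "Job terminated", per the excerpt).
            if h.group(1) == "005":
                current = f"{int(h.group(2))}.{int(h.group(3))}"
            else:
                current = None
            continue
        s = SENT.match(line)
        if s and current is not None and int(s.group(1)) == 0:
            suspicious.append(current)
    return suspicious


# The termination event from the job log excerpt, shortened.
sample_log = [
    "005 (2884.000.000) 03/27 15:03:16 Job terminated.",
    "\t(1) Normal termination (return value 0)",
    "\t0  -  Run Bytes Sent By Job",
    "\t205203  -  Run Bytes Received By Job",
    "\t0  -  Total Bytes Sent By Job",
    "\t205203  -  Total Bytes Received By Job",
]

print(jobs_with_no_output(sample_log))
```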

Cheers,
Michal
____________________________________________________________
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>