
[Condor-users] job stopped then restarted from scratch



Hi,

I have the following job, which should run for 2-3 days. Instead, it seems to finish and then restarts from scratch, without returning any results:
_____
Universe = vanilla

Executable      = closest_condor.sh

output          = closest.out
error           = closest.err
Log             = closest.log

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = (machine == "atlas.galaxy.ibpc.fr")

notify_user     = user_email@xxxxxxx
notification    = always

queue
_________


I am attaching several files (sorry for flooding your mailbox), since I think the answer is somewhere in them:
- the log file (.out and .err are empty)
- the SchedLog file
- the StartLog file of the target machine

Could you explain what is happening to my job (ID 254)? I would be very grateful.

Nicolas


-----------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
------------------------------------------------

Attachment: Atlas-StartLog
Description: Binary data

Attachment: Fab-SchedLog
Description: Binary data

001 (077.000.000) 04/04 17:43:52 Job executing on host: <193.49.27.66:32772>
...
006 (077.000.000) 04/04 18:04:00 Image size of job updated: 58352
...
001 (077.000.000) 04/04 18:23:56 Job executing on host: <193.49.27.56:32772>
...
006 (077.000.000) 04/04 18:44:05 Image size of job updated: 58352
...
006 (077.000.000) 04/05 05:24:20 Image size of job updated: 58840
...
010 (077.000.000) 04/05 06:28:36 Job was suspended.
	Number of processes actually suspended: 3
...
011 (077.000.000) 04/05 06:30:42 Job was unsuspended.
...
004 (077.000.000) 04/05 11:28:59 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 16:03:45, Sys 0 00:02:39  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	1346  -  Run Bytes Received By Job
...
009 (077.000.000) 04/05 11:28:59 Job was aborted by the user.
	via condor_rm (by user cailliez)
...
000 (254.000.000) 04/05 11:53:59 Job submitted from host: <193.49.27.73:32772>
...
001 (254.000.000) 04/05 11:54:07 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/05 11:54:15 Image size of job updated: 31140
...
006 (254.000.000) 04/05 12:14:15 Image size of job updated: 58352
...
006 (254.000.000) 04/05 22:34:30 Image size of job updated: 58608
...
010 (254.000.000) 04/06 06:28:29 Job was suspended.
	Number of processes actually suspended: 3
...
011 (254.000.000) 04/06 06:30:49 Job was unsuspended.
...
010 (254.000.000) 04/07 06:28:32 Job was suspended.
	Number of processes actually suspended: 3
...
011 (254.000.000) 04/07 06:38:23 Job was unsuspended.
...
004 (254.000.000) 04/07 06:38:23 Job was evicted.
	(0) Job was not checkpointed.
		Usr 1 18:13:21, Sys 0 00:07:01  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	1346  -  Run Bytes Received By Job
...
001 (254.000.000) 04/07 14:46:15 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/07 15:06:24 Image size of job updated: 58352
...
006 (254.000.000) 04/07 16:06:25 Image size of job updated: 58608
...
010 (254.000.000) 04/08 06:28:41 Job was suspended.
	Number of processes actually suspended: 3
...
011 (254.000.000) 04/08 06:30:46 Job was unsuspended.
...
010 (254.000.000) 04/08 12:04:06 Job was suspended.
	Number of processes actually suspended: 3
...
011 (254.000.000) 04/08 12:06:51 Job was unsuspended.
...
001 (254.000.000) 04/10 11:52:24 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/10 12:12:32 Image size of job updated: 58120
...
006 (254.000.000) 04/10 12:32:33 Image size of job updated: 58352
...
006 (254.000.000) 04/10 16:12:38 Image size of job updated: 58608
...
010 (254.000.000) 04/11 06:28:42 Job was suspended.
	Number of processes actually suspended: 3
...
011 (254.000.000) 04/11 06:30:37 Job was unsuspended.
...
006 (254.000.000) 04/11 20:12:38 Image size of job updated: 58840
...
010 (254.000.000) 04/12 06:28:27 Job was suspended.
	Number of processes actually suspended: 3
...
011 (254.000.000) 04/12 06:30:28 Job was unsuspended.
...
001 (254.000.000) 04/13 09:47:31 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/13 10:07:40 Image size of job updated: 58352
...
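
To see at a glance how many times each job was started, suspended, or evicted, the event log above can be tallied with a short script. The following is only a minimal sketch in Python: it assumes every event header line has the shape shown above (a three-digit event code, the job ID in parentheses, a date, a time, then a message), and the names `summarize`, `execution_starts`, and `SAMPLE` are hypothetical helpers made up for illustration, not part of Condor.

```python
import re
from collections import Counter, defaultdict

# Assumed header format, taken from the log excerpt above, e.g.:
# 001 (254.000.000) 04/07 14:46:15 Job executing on host: <193.49.27.56:32772>
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.\d+\.\d+\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$"
)


def summarize(log_text):
    """Count event codes per cluster ID; returns {cluster: Counter}."""
    counts = defaultdict(Counter)
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:  # "..." separators and tab-indented detail lines are skipped
            code, cluster = match.group(1), int(match.group(2))
            counts[cluster][code] += 1
    return counts


def execution_starts(log_text, cluster):
    """Return the (date, time) of every 001 'Job executing' event of a cluster."""
    result = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match and match.group(1) == "001" and int(match.group(2)) == cluster:
            result.append((match.group(3), match.group(4)))
    return result


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
000 (254.000.000) 04/05 11:53:59 Job submitted from host: <193.49.27.73:32772>
...
001 (254.000.000) 04/05 11:54:07 Job executing on host: <193.49.27.56:32772>
...
004 (254.000.000) 04/07 06:38:23 Job was evicted.
\t(0) Job was not checkpointed.
...
001 (254.000.000) 04/07 14:46:15 Job executing on host: <193.49.27.56:32772>
...
"""

if __name__ == "__main__":
    for cluster, events in sorted(summarize(SAMPLE).items()):
        print(cluster, dict(events))
    print(execution_starts(SAMPLE, 254))
```

Running the same functions on the full contents of closest.log, for example `summarize(open("closest.log").read())`, would show whether each 001 (executing) event for job 254 is preceded by a matching 004 (evicted) event.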