
Re: [Condor-users] Evictions



Here is the entry from the shadow log file that corresponds to the hello_world.log that I sent in the last e-mail. I have also included a zip file of the entire log; this entry comes near the end of the file.
 
2/1 18:40:04 (?.?) (3477):******* Standard Shadow starting up *******
2/1 18:40:04 (?.?) (3477):** $CondorVersion: 6.7.14 Dec 13 2005 $
2/1 18:40:04 (?.?) (3477):** $CondorPlatform: I386-LINUX_RH9 $
2/1 18:40:04 (?.?) (3477):*******************************************
2/1 18:40:04 (?.?) (3477):uid=0, euid=501, gid=0, egid=501
2/1 18:40:04 (?.?) (3477):Hostname = "<192.168.0.2:32773>", Job = 22.0
2/1 18:40:04 (22.0) (3477):Requesting Primary Starter
2/1 18:40:04 (22.0) (3477):Shadow: Request to run a job was ACCEPTED
2/1 18:40:04 (22.0) (3477):Shadow: RSC_SOCK connected, fd = 17
2/1 18:40:04 (22.0) (3477):Shadow: CLIENT_LOG connected, fd = 18
2/1 18:40:04 (22.0) (3477):My_Filesystem_Domain = "condor.local"
2/1 18:40:04 (22.0) (3477):My_UID_Domain = "condor01.condor.local"
2/1 18:40:04 (22.0) (3477): Entering pseudo_get_file_stream
2/1 18:40:04 (22.0) (3477): file = "/home/condor/local_scratch/spool/cluster22.ickpt.subproc0"
2/1 18:40:16 (22.0) (3477):Reaped child status - pid 3478 exited with status 0
2/1 18:40:17 (22.0) (3477):Shadow: Job 22.0 exited, termsig = 9, coredump = 0, retcode = 110
2/1 18:40:17 (22.0) (3477):Shadow: Job was kicked off without a checkpoint
2/1 18:40:17 (22.0) (3477):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/local_scratch/spool/cluster22.proc0.subproc0.tmp'
2/1 18:40:17 (22.0) (3477):Trying to unlink /home/condor/local_scratch/spool/cluster22.proc0.subproc0.tmp
2/1 18:40:17 (22.0) (3477):user_time = 2 ticks
2/1 18:40:17 (22.0) (3477):sys_time = 16 ticks
2/1 18:40:17 (22.0) (3477):********** Shadow Exiting(107) **********
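
To find entries like this one without reading the whole log by eye, a small script can pull out the "Job ... exited" lines. This is only a sketch: the regular expression is inferred from the excerpt above, not from any documented log format, and the name find_job_exits is made up for illustration.

```python
import re

# Pattern inferred from the shadow log excerpt above (an assumption,
# not a documented format): "(job id) (pid):Shadow: Job X exited, ..."
EXIT_RE = re.compile(
    r"\((?P<job>\d+\.\d+)\)\s+\(\d+\):Shadow: Job \S+ exited, "
    r"termsig = (?P<termsig>\d+), coredump = (?P<coredump>\d+), "
    r"retcode = (?P<retcode>\d+)"
)


def find_job_exits(lines):
    """Return one dict per 'Job ... exited' line found in the given lines."""
    exits = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            exits.append({
                "job": match.group("job"),
                "termsig": int(match.group("termsig")),
                "coredump": int(match.group("coredump")),
                "retcode": int(match.group("retcode")),
            })
    return exits


# Two lines copied from the excerpt above.
SAMPLE = [
    "2/1 18:40:16 (22.0) (3477):Reaped child status - pid 3478 exited with status 0",
    "2/1 18:40:17 (22.0) (3477):Shadow: Job 22.0 exited, termsig = 9, coredump = 0, retcode = 110",
]

if __name__ == "__main__":
    for entry in find_job_exits(SAMPLE):
        print(entry)
```

To scan a real file, pass an open file object instead of SAMPLE, for example find_job_exits(open(path)).
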
----- Original Message -----
From: Jaime Frey
Sent: Thursday, February 02, 2006 10:06 AM
Subject: Re: [Condor-users] Evictions

Ah, you're using the standard universe. That changes things. You don't have to worry about shared file systems or file transfer in that case. You used condor_compile to link your program, yes?

The logs are in the Condor log directory. Run 'condor_config_val log' to determine where that is.
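
If you want to script that lookup, something like the following might work. It is a rough sketch: it assumes condor_config_val is on the PATH and that the shadow log uses the file name ShadowLog inside the log directory (a site can configure a different name). The function names are invented for this example.

```python
import os
import subprocess


def condor_log_dir(run=subprocess.run):
    """Ask condor_config_val for the log directory and return it as a string.

    The 'run' parameter exists only so the command can be replaced in tests.
    """
    result = run(
        ["condor_config_val", "log"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def shadow_log_path(log_dir):
    """Build the shadow log path, assuming the file is named ShadowLog."""
    return os.path.join(log_dir, "ShadowLog")


if __name__ == "__main__":
    print(shadow_log_path(condor_log_dir()))
```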

 -- Jaime

On Jan 31, 2006, at 5:09 PM, Stephen Broughton wrote:

I don't currently have access to the grid; I will check the shadow log this evening. Just in case, where is this log file?
 
I have a Condor NFS share with all the binaries that all the nodes connect to; each of the nodes has a local scratch folder with its local config file. The program is located in a subfolder of the Condor shared folder, with permissions (I believe) set to read/write/execute. I have not explicitly designated file transfer. The program does output to a text file in the program source folder, and I was able to run this as a test in the previous installation.
 
I will check the file permissions; that seems like a likely problem.
 
My main purpose in this program is just to output a few time stamps into a single output file. Is there a way to do this that is supported more directly in Condor and would work better than just writing to a text file?
 
####################
##
## Prime Number Condor command file
##
####################
Universe   = standard
executable      = prime_new
 
log    = prime_new.log
 
#output   = prime_new.$(Process).out
output   = prime_new.out
error   = prime_new.err
 
# 1
arguments  = 10000000 10000500 1
queue
 
# 2
arguments  = 10000501 10001000 2
queue
The program runs from /home/condor/condor/prime, which exists on all nodes through an NFS share.
----- Original Message -----
From: Jaime Frey
Sent: Tuesday, January 31, 2006 3:50 PM
Subject: Re: [Condor-users] Evictions

On Jan 31, 2006, at 11:26 AM, Stephen Broughton wrote:

I just noticed from the log that all the evictions are from the nodes. The job only completes on the master, which is also the submitting machine and the NFS server for the Condor installation binaries. This test program worked when I had a Condor 6.7.12 install with all the same configuration settings.

Does the shadow log on your submit machine or the starter log on the evicting execute machines contain any interesting error messages? If you don't have file transfer enabled, is the executable, input, output, or error on a local disk?

+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+




Attachment: ShadowLog.zip
Description: Zip compressed data