
[condor-users] Why are jobs not executing properly on some machines?



Hi all, 

I am a Condor newbie joining an established Condor 6.4.7 pool here at our
lab. My problem is that I cannot figure out why my jobs are not executing
properly on some execute machines. I am running a straightforward Fortran
executable on WinNT5.0 machines that needs three input files and produces
many output files. I typically queue 80 jobs at a time.

For failed jobs, the output (stdout log) file is empty, so I think my
Fortran executable isn't even starting. According to the Condor log files,
jobs on the failing machines end with "Normal termination (return value 128)",
as shown in this snippet:

000 (017.001.000) 10/08 09:23:42 Job submitted from host:
<1xx.1xx.15x.x22:1045>
...
001 (017.001.000) 10/08 09:24:25 Job executing on host:
<1xx.1xx.xx0.x20:1041>
...
005 (017.001.000) 10/08 09:24:25 Job terminated.
	(1) Normal termination (return value 128)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	27040368  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	27040368  -  Total Bytes Received By Job

The FAQ says termination with code 128 is a problem with DLLs, but I know that
the program doesn't use any DLLs. Furthermore, I discovered that my own
machine is also one that fails to execute jobs, even though the program
runs fine on it when I start it directly! These failing machines keep
accepting most of the jobs, and Condor thinks they are succeeding. I am
only getting about a 5-10% success rate.

The very same jobs do run correctly on most machines with log file entries
like:

...
000 (020.000.000) 10/08 10:10:21 Job submitted from host:
<1xx.1xx.15x.x22:1045>
...
001 (020.000.000) 10/08 10:10:43 Job executing on host:
<1xx.4x.xx5.x10:1043>
...
006 (020.000.000) 10/08 10:10:51 Image size of job updated: 6992
...
006 (020.000.000) 10/08 10:30:51 Image size of job updated: 35296
...
005 (020.000.000) 10/08 10:37:44 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:25:44, Sys 0 00:00:23  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:25:44, Sys 0 00:00:23  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	76296528  -  Run Bytes Sent By Job
	17431122  -  Run Bytes Received By Job
	76296528  -  Total Bytes Sent By Job
	17431122  -  Total Bytes Received By Job

I have been working around the problem by excluding machines that fail (e.g.
requirements = Name != "joBlow.nrel.gov"), but that involves time-consuming
iteration.
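
To cut down on that iteration, the user log could be scanned automatically.
Below is a minimal Python sketch, assuming the log keeps the layout shown in
the snippets above (a "001 ... Job executing on host: <addr>" event followed
later by a "005 ... Job terminated." event with a "return value N" line). The
function name exit_codes_by_host is hypothetical, not part of Condor. Note
that the log records host addresses rather than machine Names, so mapping an
address back to a Name for the requirements expression is left out here.

```python
import re
from collections import defaultdict

# Event header, e.g. "001 (017.001.000) 10/08 09:24:25 Job executing on host:"
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")
# Host address in angle brackets, e.g. "<1xx.1xx.xx0.x20:1041>"
HOST_RE = re.compile(r"<([^>]+)>")
# Exit code line, e.g. "(1) Normal termination (return value 128)"
RETURN_RE = re.compile(r"return value (-?\d+)")


def exit_codes_by_host(log_text):
    """Map each execute-host address to the exit codes of jobs that ran there.

    Hypothetical helper: assumes the user-log layout shown in this post.
    """
    last_host = {}               # (cluster, proc) -> host from latest 001 event
    results = defaultdict(list)  # host address -> list of exit codes
    event, job = None, None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            event = header.group(1)
            job = (int(header.group(2)), int(header.group(3)))
        if event == "001":
            # The address may sit on the header line or on the next line
            # (mail clients wrap it), so search every line of the event.
            host = HOST_RE.search(line)
            if host:
                last_host[job] = host.group(1)
        elif event == "005":
            code = RETURN_RE.search(line)
            if code and job in last_host:
                results[last_host[job]].append(int(code.group(1)))
    return dict(results)
```

Calling exit_codes_by_host(open("E+job.log").read()) in a job directory and
keeping the hosts whose list contains a non-zero code would give the
candidate machines to exclude.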

Shouldn't Condor know that a job that returned no data did not succeed?
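
In the meantime, failed jobs can at least be spotted after the fact by their
empty output file. The following is a rough Python sketch, assuming the
directory layout from my submit file (initialdir = MCA_$(Process), output =
E+screen.log, 80 jobs). The function name jobs_with_empty_output is made up
for illustration.

```python
from pathlib import Path


def jobs_with_empty_output(base_dir, n_jobs=80, output_name="E+screen.log"):
    """Return the process numbers whose stdout file is missing or empty.

    Assumes one directory per job named MCA_<process number> under base_dir,
    as set by initialdir in the submit file.
    """
    failed = []
    for proc in range(n_jobs):
        out = Path(base_dir) / f"MCA_{proc}" / output_name
        # A missing or zero-length stdout file suggests the job never ran.
        if not out.exists() or out.stat().st_size == 0:
            failed.append(proc)
    return failed
```

The returned process numbers are the jobs that would need to be resubmitted.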

I am at a total loss as to why jobs are failing...  does anyone have
suggestions?

Thanks in advance, 

Brent
 

 
P.S. 
Here is a snippet from the shadow log for a job that failed:

10/8 09:23:57 ******************************************************
10/8 09:23:57 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/8 09:23:57 ** $CondorVersion: 6.4.7 Jan 27 2003 $
10/8 09:23:57 ** $CondorPlatform: INTEL-WINNT40 $
10/8 09:23:57 ** PID = 2492
10/8 09:23:57 ******************************************************
10/8 09:23:57 DaemonCore: Command Socket at <1xx.xx7.x59.x22:4392>
10/8 09:23:58 Initializing a VANILLA shadow
10/8 09:23:58 (17.1) (2492): entering init_user_ids()...watch out.
10/8 09:23:58 (17.1) (2492): entering init_user_ids()...watch out.
10/8 09:23:58 (17.1) (2492): Request to run on <1xx.xx4.x0.x20:1041> was
ACCEPTED

10/8 09:23:58 (17.1) (2492): perm::init: Lookup Account Name zzzzzz failed
(err=1722), using Everyone
10/8 09:24:00 (17.1) (2492): perm::init: Lookup Account Name zzzzzz failed
(err=1722), using Everyone
10/8 09:24:00 (17.1) (2492): perm::init: Lookup Account Name zzzzzz failed
(err=1722), using Everyone

And here is my typical submit file (with workaround):

universe = vanilla
executable = EpRun.bat
initialdir = MCA_$(Process)
transfer_input_files = ..\EpRun.bat, ..\Energy+.idd, ..\in.epw, ..\EnergyPlus.exe, in.idf
requirements =  (( Disk > 100000  ) && \
                (ARCH  == "INTEL")  && \
                (OpSys == "WINNT50") && \
                (Name != "badmachin1.nrel.gov")  && \
                (Name != "badmachin2.nrel.gov"  )  && \
                (Name != "vm1@xxxxxxxxxxxxxxxxxxx"  )  && \
                (Name != "vm2@xxxxxxxxxxxxxxxxxxx"  )  && \
                (Name != "badmachin4.nrel.gov" ) && \
                (Name != "badmachin5.nrel.gov"   ) && \
                (Name != "badmachin6.nrel.gov"   ) && \
                (Name != "vm1@xxxxxxxxxxxxxxxxxxx"  ))
Notification = Error
output = E+screen.log
log    = E+job.log
transfer_files = ALWAYS
queue 80

