
Re: [Condor-users] Jobs returning blank output



I logged into one of the other machines (node1) with ssh and executed the console app... it executed perfectly.
I then added the requirement Machine = "node1", which it wouldn't accept, so I changed it to Machine = "node1.cluster.int", which
it did accept.
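For reference, the requirements line in the submit file now reads roughly like this (a sketch; the hostname clause is simply added alongside the existing Arch/OpSys test):

    requirements = (Machine == "node1.cluster.int") && (Arch == "X86_64") && (OpSys == "LINUX")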
 
The same empty files were returned.
 
Chris
 
 
>>Chris!!!
>>  I've gone through all the log files, and I don't think there is any problem as far as Condor is concerned; the logs look fine. Since you are still not getting any output, I have a few suggestions for you.
>>
>>a) Run the console application "console_som" on any of the X86_64 machines from a command prompt (either ssh to it from any of the remote machines, or run it on that console itself).
>>
>>b) If (a) comes out positive, meaning it executes and prints something to the screen (because of "echo" or "printf"), run the same thing from Condor, but this time submit it from the same X86_64 machine you want to execute on. You can tell Condor to run the executable on the same machine by setting the "Machine" variable in the requirements to its hostname (see the sketch below). This is just to ensure that the problem is not with Condor itself but with some environment settings.
>>
>>c) If you are getting output in (b), you are done; then you will have to look at the environment part.
>>
>>The above are a few tricks to narrow down the problem. Hopefully this time we will be much closer to the exact root of the problem.
>>
>>Cheers!!
>>Neeraj
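A minimal submit file along the lines of suggestion (b) might look like the sketch below (the hostname is a placeholder for whichever X86_64 machine the job should be pinned to):

    # Sketch only: run the test binary on one specific X86_64 execute machine.
    executable = console_som
    universe = vanilla
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    # Pin the job to a single host; the FQDN below is a placeholder.
    requirements = (Machine == "node1.cluster.int") && (Arch == "X86_64") && (OpSys == "LINUX")
    output = test_$(Process).out
    error = test_$(Process).err
    log = test.log
    queue 1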

>I have compiled my own console application (no additional libraries or anything)
>and here is the submission file I am now using.
>
>executable = console_som
>universe = vanilla
>should_transfer_files = YES
>when_to_transfer_output = ON_EXIT
>requirements = (Arch == "X86_64") && (OpSys == "LINUX")
>output  = process_$(Process).out
>error = error.log
>log = master.log
>Queue 5
>
>I have attached various log information as well.
>
>Also, the submission machine I am using is not X86_64; it is INTEL. This is just the manager machine of the cluster
>and is not actually taking any jobs itself.
>
>vm1@thebeast. LINUX      INTEL  Unclaimed  Idle      1.000  512  0+03:55:16
>vm2@thebeast. LINUX      INTEL  Unclaimed  Idle      1.000  512  0+03:55:14
>vm3@thebeast. LINUX      INTEL  Unclaimed  Idle      0.190  512  0+03:55:10
>vm4@thebeast. LINUX      INTEL  Unclaimed  Idle      0.000  512  0+03:55:07
>vm1@xxxxxxxxx LINUX      X86_64 Unclaimed  Idle      0.000  2048[?????]
>vm2@xxxxxxxxx LINUX      X86_64 Unclaimed  Idle      0.000  2048[?????]
>vm1@xxxxxxxxx LINUX      X86_64 Unclaimed  Idle      0.000  2048[?????]
>vm2@xxxxxxxxx LINUX      X86_64 Unclaimed  Idle      0.000  2048[?????]
>vm1@xxxxxxxxx LINUX      X86_64 Unclaimed  Idle      0.000  2048[?????]
>vm2@xxxxxxxxx LINUX      X86_64 Unclaimed  Idle      0.000  2048[?????]
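As a quick cross-check, a query like the one below should list exactly the slots that match the job's requirements expression (the constraint string simply mirrors the submit file above):

    condor_status -constraint '(Arch == "X86_64") && (OpSys == "LINUX")'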
>
>First, the StarterLog on one of the executing machines, and then snips from the tails of the other logs:
>
>10/6 01:11:27 ******************************************************
>10/6 01:11:27 ** condor_starter (CONDOR_STARTER) STARTING UP
>10/6 01:11:27 ** /home/condor/condor/sbin/condor_starter
>10/6 01:11:27 ** $CondorVersion: 6.7.10 Aug  3 2005 $
>10/6 01:11:27 ** $CondorPlatform: I386-LINUX_RH9 $
>10/6 01:11:27 ** PID = 5769
>10/6 01:11:27 ******************************************************
>10/6 01:11:27 Using config file: /home/condor/condor_config
>10/6 01:11:27 Using local config files: /home/condor/condor/hosts/node1/condor_config.local
>10/6 01:11:27 DaemonCore: Command Socket at <192.168.1.101:34307>
>10/6 01:11:27 Done setting resource limits
>10/6 01:11:27 Communicating with shadow <192.168.1.1:34943>
>10/6 01:11:27 Submitting machine is "mgmnt.cluster.int"
>10/6 01:11:27 File transfer completed successfully.
>10/6 01:11:28 Starting a VANILLA universe job with ID: 49.0
>10/6 01:11:28 IWD: /home/condor/condor/hosts/node1/execute/dir_5769
>10/6 01:11:28 Output file: /home/condor/condor/hosts/node1/execute/dir_5769/process_0.out
>10/6 01:11:28 Error file: /home/condor/condor/hosts/node1/execute/dir_5769/error.log
>10/6 01:11:28 About to exec /home/condor/condor/hosts/node1/execute/dir_5769/condor_exec.exe
>10/6 01:11:28 Create_Process succeeded, pid=5773
>10/6 01:11:28 Process exited, pid=5773, status=0
>10/6 01:11:28 Got SIGQUIT.  Performing fast shutdown.
>10/6 01:11:28 ShutdownFast all jobs.
>10/6 01:11:28 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>
>
>
>-=-= Shadow Log =-=-
>
>10/6 00:12:12 ******************************************************
>10/6 00:12:12 ** condor_shadow (CONDOR_SHADOW) STARTING UP
>10/6 00:12:12 ** /home/condor/condor/sbin/condor_shadow
>10/6 00:12:12 ** $CondorVersion: 6.7.10 Aug  3 2005 $
>10/6 00:12:12 ** $CondorPlatform: I386-LINUX_RH9 $
>10/6 00:12:12 ** PID = 13189
>10/6 00:12:12 ******************************************************
>10/6 00:12:12 Using config file: /home/condor/condor_config
>10/6 00:12:12 Using local config files: /home/condor/condor/hosts/thebeast/condor_config.local
>10/6 00:12:12 DaemonCore: Command Socket at <192.168.1.1:34963>
>10/6 00:12:12 Initializing a VANILLA shadow for job 49.4
>10/6 00:12:13 (49.4) (13189): Request to run on <192.168.1.103:33919> was ACCEPTED
>10/6 00:12:14 (49.4) (13189): Job 49.4 terminated: exited with status 0
>10/6 00:12:14 (49.4) (13189): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
>
>
>
>-=-= ScheddLog =-=-
>
>10/6 00:12:12 -------- Done starting jobs --------
>10/6 00:12:13 DaemonCore: Command received via UDP from host <192.168.1.1:35349>
>10/6 00:12:13 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
>10/6 00:12:14 DaemonCore: Command received via TCP from host <192.168.1.1:34967>
>10/6 00:12:14 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
>10/6 00:12:14 AUTHENTICATE_FS: used file /tmp/qmgr_3EznP9, status: 1
>10/6 00:12:14 OwnerCheck retval 1 (success), super_user
>10/6 00:12:14 OwnerCheck retval 1 (success), super_user
>10/6 00:12:14 OwnerCheck retval 1 (success), super_user
>10/6 00:12:14 OwnerCheck retval 1 (success), super_user
>10/6 00:12:14 OwnerCheck retval 1 (success), super_user
>10/6 00:12:14 condor_read(): Socket closed when trying to read buffer
>10/6 00:12:14 QMGR Connection closed
>10/6 00:12:14 DaemonCore: No more children processes to reap.
>10/6 00:12:14 Shadow pid 13189 for job 49.4 exited with status 100
>10/6 00:12:14 Reaper: JOB_EXITED
>10/6 00:12:14 Entered delete_shadow_rec( 13189 )
>10/6 00:12:14 Deleting shadow rec for PID 13189, job (49.4)
>10/6 00:12:14 Entered check_zombie( 13189, 0x0x851c2fc, st=4 )
>10/6 00:12:14 Job 49.4 is finished
>10/6 00:12:14 Added data to SelfDrainingQueue job_is_finished_queue, now has 1 element(s)
>10/6 00:12:14 Registered timer for SelfDrainingQueue job_is_finished_queue, period: 0 (id: 1072)
>10/6 00:12:14 Exited check_zombie( 13189, 0x0x851c2fc )
>10/6 00:12:14
>10/6 00:12:14 ..................
>10/6 00:12:14 .. Shadow Recs (0/1)
>10/6 00:12:14 ..................
>10/6 00:12:14 -------- Done starting jobs --------
>10/6 00:12:14 Inside SelfDrainingQueue::timerHandler() for job_is_finished_queue
>10/6 00:12:14 Job cleanup for 49.4 will not block, calling jobIsFinished() directly
>10/6 00:12:14 jobIsFinished() completed, calling DestroyProc(49.4)
>10/6 00:12:14 KEEP_OUTPUT_SANDBOX is undefined, using default value of False
>10/6 00:12:14 Saving classad to history file
>10/6 00:12:14 SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
>10/6 00:12:14 Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 1072)
>10/6 00:12:14 Got VACATE_SERVICE from <192.168.1.103:33930>
>10/6 00:12:14 mrec for "<192.168.1.103:33919>#1128557575#1" not found -- match not deleted
>10/6 00:12:17 DaemonCore: Command received via TCP from host <192.168.1.1:34968>
>10/6 00:12:17 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
>10/6 00:12:17 condor_read(): Socket closed when trying to read buffer
>10/6 00:12:17 QMGR Connection closed
>10/6 00:13:17 Getting monitoring info for pid 11068
>10/6 00:17:12 JobsRunning = 0
>10/6 00:17:12 JobsIdle = 0
>10/6 00:17:12 JobsHeld = 0
>10/6 00:17:12 JobsRemoved = 0
>10/6 00:17:12 LocalUniverseJobsRunning = 0
>10/6 00:17:12 LocalUniverseJobsIdle = 0
>10/6 00:17:12 SchedUniverseJobsRunning = 0
>10/6 00:17:12 SchedUniverseJobsIdle = 0
>10/6 00:17:12 N_Owners = 0
>10/6 00:17:12 MaxJobsRunning = 200
>10/6 00:17:12 ENABLE_SOAP is undefined, using default value of False
>10/6 00:17:12 Trying to update collector <192.168.1.1:9618>
>10/6 00:17:12 Attempting to send update via UDP to collector thebeast.cluster.int <192.168.1.1:9618>
>10/6 00:17:12 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
>10/6 00:17:12 Sent HEART BEAT ad to 1 collectors. Number of submittors=0
>10/6 00:17:12 Changed attribute: Name = "condor@xxxxxxxxxxxxxxxxxxxx"
>10/6 00:17:12 Trying to update collector <192.168.1.1:9618>
>10/6 00:17:12 Attempting to send update via UDP to collector thebeast.cluster.int <192.168.1.1:9618>
>10/6 00:17:12 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
>10/6 00:17:12 Sent owner (0 jobs) ad to 1 collectors
>10/6 00:17:12 ============ Begin clean_shadow_recs =============
>10/6 00:17:12 ============ End clean_shadow_recs =============
>10/6 00:17:14 -------- Begin starting jobs --------
>10/6 00:17:14 -------- Done starting jobs --------
>10/6 00:17:17 Getting monitoring info for pid 11068
>
>
>
>-=-= CollectorLog =-=-
>
>10/6 00:21:25 Found StartdIpAddr
>10/6 00:21:25 Got IP = '<192.168.1.117:33591>'
>10/6 00:21:26 Found StartdIpAddr
>10/6 00:21:26 Got IP = '<192.168.1.118:33594>'
>10/6 00:21:26 Found StartdIpAddr
>10/6 00:21:26 Got IP = '<192.168.1.119:33584>'
>10/6 00:21:26 Found StartdIpAddr
>10/6 00:21:26 Got IP = '<192.168.1.121:33810>'
>10/6 00:21:26 Found StartdIpAddr
>10/6 00:21:26 Got IP = '<192.168.1.120:33587>'
>10/6 00:21:26 Found StartdIpAddr
>10/6 00:21:26 Got IP = '<192.168.1.122:33596>'
>10/6 00:21:26 Found StartdIpAddr
>10/6 00:21:26 Got IP = '<192.168.1.123:33609>'
>10/6 00:22:01 (Sending 100 ads in response to query)
>10/6 00:22:01 Got QUERY_STARTD_PVT_ADS
>10/6 00:22:01 (Sending 50 ads in response to query)
>10/6 00:22:12 Found ScheddIpAddr
>10/6 00:22:12 Got IP = '<192.168.1.1:51772>'
>10/6 00:22:24 Found StartdIpAddr
>10/6 00:22:24 Got IP = '<192.168.1.1:51771>'
>10/6 00:22:25 Found StartdIpAddr
>10/6 00:22:25 Got IP = '<192.168.1.1:51771>'
>10/6 00:22:26 Found StartdIpAddr
>10/6 00:22:26 Got IP = '<192.168.1.1:51771>'
>10/6 00:22:27 Found StartdIpAddr
>10/6 00:22:27 Got IP = '<192.168.1.1:51771>'
>10/6 00:22:29 NegotiatorAd  : Inserting ** "< thebeast.cluster.int >"
>
>
>
>-=-= NegotiatorLog =-=-
>
>10/6 00:12:00 ---------- Started Negotiation Cycle ----------
>10/6 00:12:00 Phase 1:  Obtaining ads from collector ...
>10/6 00:12:00  Getting all public ads ...
>10/6 00:12:00  Sorting 100 ads ...
>10/6 00:12:00  Getting startd private ads ...
>10/6 00:12:00 Got ads: 100 public and 50 private
>10/6 00:12:00 Public ads include 1 submitter, 50 startd
>10/6 00:12:00 Phase 2:  Performing accounting ...
>10/6 00:12:00 Phase 3:  Sorting submitter ads by priority ...
>10/6 00:12:00 Phase 4.1:  Negotiating with schedds ...
>10/6 00:12:00  Negotiating with condor@xxxxxxxxxxxxxxxxxxxx at <192.168.1.1:51772>
>10/6 00:12:00    Request 00049.00000:
>10/6 00:12:00      Matched 49.0 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34300> vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Successfully matched with vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00    Request 00049.00001:
>10/6 00:12:00      Matched 49.1 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34300> vm2@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Successfully matched with vm2@xxxxxxxxxxxxxxxxx
>10/6 00:12:00    Request 00049.00002:
>10/6 00:12:00      Matched 49.2 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34757> vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Successfully matched with vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00    Request 00049.00003:
>10/6 00:12:00      Matched 49.3 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34757> vm2@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Successfully matched with vm2@xxxxxxxxxxxxxxxxx
>10/6 00:12:00    Request 00049.00004:
>10/6 00:12:00      Matched 49.4 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.103:33919> vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Successfully matched with vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00    Got NO_MORE_JOBS;  done negotiating
>10/6 00:12:00 ---------- Finished Negotiation Cycle ----------
>10/6 00:17:00 ---------- Started Negotiation Cycle ----------
>10/6 00:17:00 Phase 1:  Obtaining ads from collector ...
>10/6 00:17:00  Getting all public ads ...
>10/6 00:17:00  Sorting 100 ads ...
>10/6 00:17:00  Getting startd private ads ...
>10/6 00:17:00 Got ads: 100 public and 50 private
>10/6 00:17:00 Public ads include 1 submitter, 50 startd
>10/6 00:17:00 Phase 2:  Performing accounting ...
>10/6 00:17:01 Phase 3:  Sorting submitter ads by priority ...
>10/6 00:17:01 Phase 4.1:  Negotiating with schedds ...
>10/6 00:17:01 ---------- Finished Negotiation Cycle ----------
>10/6 00:22:01 ---------- Started Negotiation Cycle ----------
>10/6 00:22:01 Phase 1:  Obtaining ads from collector ...
>10/6 00:22:01  Getting all public ads ...
>10/6 00:22:01  Sorting 100 ads ...
>10/6 00:22:01  Getting startd private ads ...
>10/6 00:22:01 Got ads: 100 public and 50 private
>10/6 00:22:01 Public ads include 1 submitter, 50 startd
>10/6 00:22:01 Phase 2:  Performing accounting ...
>10/6 00:22:01 Phase 3:  Sorting submitter ads by priority ...
>10/6 00:22:01 Phase 4.1:  Negotiating with schedds ...
>10/6 00:22:01 ---------- Finished Negotiation Cycle ----------
>10/6 00:27:01 ---------- Started Negotiation Cycle ----------
>10/6 00:27:01 Phase 1:  Obtaining ads from collector ...
>10/6 00:27:01  Getting all public ads ...
>10/6 00:27:01  Sorting 100 ads ...
>10/6 00:27:01  Getting startd private ads ...
>10/6 00:27:01 Got ads: 100 public and 50 private
>10/6 00:27:01 Public ads include 1 submitter, 50 startd
>10/6 00:27:01 Phase 2:  Performing accounting ...
>10/6 00:27:01 Phase 3:  Sorting submitter ads by priority ...
>10/6 00:27:01 Phase 4.1:  Negotiating with schedds ...
>10/6 00:27:01 ---------- Finished Negotiation Cycle ----------
>
>
>
>-=-= MatchLog =-=-
>
>de3.cluster.int
>10/5 00:32:17      Matched 47.0 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34179> vm1@xxxxxxxxxxxxxxxxx
>10/5 00:32:17      Matched 47.1 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34179> vm2@xxxxxxxxxxxxxxxxx
>10/5 00:32:17      Matched 47.2 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34636> vm1@xxxxxxxxxxxxxxxxx
>10/5 00:32:17      Matched 47.3 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34636> vm2@xxxxxxxxxxxxxxxxx
>10/5 00:32:17      Matched 47.4 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.103:33852> vm1@xxxxxxxxxxxxxxxxx
>10/5 23:52:17      Matched 48.0 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34179> vm1@xxxxxxxxxxxxxxxxx
>10/5 23:52:17      Matched 48.1 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34179> vm2@xxxxxxxxxxxxxxxxx
>10/5 23:52:17      Matched 48.2 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34636> vm1@xxxxxxxxxxxxxxxxx
>10/5 23:52:17      Matched 48.3 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34636> vm2@xxxxxxxxxxxxxxxxx
>10/5 23:52:17      Matched 48.4 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.103:33852> vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Matched 49.0 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34300> vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Matched 49.1 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.101:34300> vm2@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Matched 49.2 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34757> vm1@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Matched 49.3 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.102:34757> vm2@xxxxxxxxxxxxxxxxx
>10/6 00:12:00      Matched 49.4 condor@xxxxxxxxxxxxxxxxxxxx <192.168.1.1:51772> preempting none <192.168.1.103:33919> vm1@xxxxxxxxxxxxxxxxx
>
>  ----- Original Message -----
>  From: Neeraj Chourasia
>  To: Condor-Users Mail List ; chrismiles@xxxxxxxxxxxxxxxx
>  Sent: Wednesday, October 05, 2005 7:41 AM
>  Subject: Re: Re: [Condor-users] Jobs returning blank output
>
>
>
>
>
>  On Wed, 05 Oct 2005 Chris Miles wrote :
>  >Here is a copy of my job file now.
>  >
>  >executable = /bin/hostname
>  >universe = vanilla
>  >
>  >TransferExecutable = false
>  >should_transfer_files = YES
>  >when_to_transfer_output = ON_EXIT
>  >
>  >requirements = (Arch == "X86_64") && (OpSys == "LINUX")
>  >
>  >output  = output/Process_$(Process).out
>  >error = error/Error.log
>  >log = log/Master.log
>  >
>  >Queue 5
>  >
>  >...
>  >
>  >Still getting no output, no error. And the main log file just says the same
>  >as before.
>  >
>  >005 (046.002.000) 10/05 00:26:47 Job terminated.
>  >        (1) Normal termination (return value 0)
>  >                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>  >                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>  >                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>  >                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>  >        0  -  Run Bytes Sent By Job
>  >        0  -  Run Bytes Received By Job
>  >        0  -  Total Bytes Sent By Job
>  >        0  -  Total Bytes Received By Job
>  >
>
>  Hey Chris,
>
>  Set TransferExecutable = True in your submit file, or better yet delete that line (by default it is True). You have set it to False, so Condor is probably not transferring the executable to the remote machine, which is essential for its execution. One more thing I want to know: are all of the machines of "X86_64" architecture? I fear there may be incompatibilities in the executable. Why don't you send me the "shadow log" of the submitter machine and the "starter log" of the execute machine? That would probably help us get a better picture. (See the sketch at the end of this message.)
>
>  Regards
>  Neeraj
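For reference, applying Neeraj's suggestion to the /bin/hostname test job would mean a submit file roughly like this (a sketch only; the TransferExecutable line is dropped so the default of True applies):

    executable = /bin/hostname
    universe = vanilla
    # TransferExecutable deliberately omitted; it defaults to True.
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    requirements = (Arch == "X86_64") && (OpSys == "LINUX")
    output = output/Process_$(Process).out
    error = error/Error.log
    log = log/Master.log
    queue 5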