
[Condor-users] [Fwd: Re: VMGAHP_ERR_CRITICAL]





--- Begin Message ---
Hi Yoon,

      Thanks, it's working. Thank you for your explanation.

 Then, if I submit a job, it says:

  1. 1 match but reject the job for unknown reasons
      Where can I find out the reasons? (No VMGAHPLOG file was created.)
      I can't find them in the daemon log files either.

  2. 1 are rejected by your job's requirements
       When I use condor_q -analyze, it points to the above statement and
       shows some statements that I have given in the job description
       file. Please tell me how I can trace which requirement is not correct.

       
This is the setup I need to build:

      1. Zeus (central manager, submitter), username: idealgrid
      2. Pluto (executor), username: idealgrid

     Before making this setup, I tried to start a VM on Pluto, both from
     Pluto and from Zeus. It shows error no. 2 (I am using the same job
     file which I sent to you).


  Until now I used the user Johnson to submit jobs, but I have now changed
the user to idealgrid, and VMGAHPLOG.idealgrid shows the same
VMGAHP_ERR_CRITICAL.

  This happened when I tried to submit a job from Zeus to Zeus.
    (I have attached the file.)

On Mon, 2008-03-17 at 12:22 -0500, Jaeyoung Yoon wrote:
> Hi Johnson,
> 
> Let me clarify your problem in your system environment. As you said,
> you have two machines (Zeus, Pluto) for Condor. 
> 
> 1. First of all, because your job submit description file has
> 'Requirements   = (Machine == "zeus.pesgrid.wipro.com")', your VM job
> can be executed only on Zeus.
> 
> 2. When you submit a VM job from Zeus, the job should be assigned to
> Zeus, and I think you should have NO problem.
> 
> 3. When you submit a VM job from Pluto, the job should also be
> assigned to Zeus due to your job requirements, and I think this is
> where you get the VMGAHP problem.
> 
> Here are my observations for your case.
> 
> In your environment, your Condor daemons on Zeus seem to run as root
> with "CONDOR_IDS=daemon,daemon". So ordinary Condor jobs like Vanilla,
> Standard, and Java from another machine (Pluto) will run as "Nobody" or
> with the same UID as on the submit machine.
> 
> VMware requires that a user starting a virtual machine have a writable
> working directory. Your problem happened because UID=2 (daemon), like
> "Nobody", doesn't have a writable working directory. Unlike ordinary
> Condor jobs, VM jobs don't use "Nobody" when the UID on the submit
> machine doesn't exist on the execute machine. Instead of "Nobody", VM
> jobs try to use the UID of the Condor daemon, generally "condor".
> 
> With the VMGAHP log files you sent, you can see what happened on
> Zeus.
> 
> When you submit a VM job from Zeus to Zeus, the VMGAHPLOG.Johnson log
> says that your VM job successfully ran as "UID=Johnson".
> But when you submit a VM job from Pluto to Zeus, VMGAHPLOG.daemon says
> that your VM job tried to run as "UID=daemon" and failed.
> 
> So here is a solution for you.
> 
> If you run Condor as root and have specified CONDOR_IDS=daemon,daemon,
> please add the following configuration parameter to the Condor
> configuration file on Zeus:
> VM_UNIV_NOBODY_USER = "login name of a user who has home directory"
> 
> With the above parameter, VM jobs from Pluto will use the UID specified
> in "VM_UNIV_NOBODY_USER".
> 
> In Condor manual section 3.3.26, you can see the configuration
> parameters for VM universe.
> 
> If you have questions, please let me know.
> 
> Best,
> 
> -Jaeyoung
> 
> 
> On Mon, Mar 17, 2008 at 8:01 AM, JohnsonKoilraj
> <johnson.raj@xxxxxxxxx> wrote:
>         Hi Yoon,
>         
>                How are you?
>           Here is the scenario. I have 2 systems in my Condor pool:
>           1. Zeus (central manager, submitter, executor), username:
>              Johnson
>
>           2. Pluto (submitter, executor), username: condor (the user
>              who submits jobs)
>
>           Now, I can start a VM on Zeus from Zeus.
>           When I try to start a VM on Pluto from Zeus, no match is
>           found.
>
>           When I try to start a VM on Zeus from Pluto, the error
>           occurs.
>         
>           I am using:
>           Condor         -  7.0.1
>           VMware Server  -  1.0.4
>         
>          1. I have attached the job description file (firstvm.sh).
>
>          2. I have attached VMGAHPLOG.daemon (I think Condor updated
>             that file when I submitted the job from Pluto (condor) to
>             Zeus).
>
>          3. I have attached VMGAHPLOG.Johnson (this file was updated
>             when the VM was started on Zeus from Zeus (Johnson)).
>
>          4. I have attached the log file created by the job
>             description file.
>         
>         Thank you for your response
>         
>         
>         

> 
3/18 18:25:23 ******************************************************
3/18 18:25:23 ** condor_vm-gahp (CONDOR_VM_GAHP) STARTING UP
3/18 18:25:23 ** /opt/condor-7.0.1/sbin/condor_vm-gahp
3/18 18:25:23 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/18 18:25:23 ** $CondorPlatform: I386-LINUX_RHEL5 $
3/18 18:25:23 ** PID = 3584
3/18 18:25:23 ** Log last touched time unavailable (No such file or directory)
3/18 18:25:23 ******************************************************
3/18 18:25:23 Using config source: /opt/condor-7.0.1/etc/condor_config
3/18 18:25:23 Using local config sources: 
3/18 18:25:23    /opt/condor-7.0.1/local.zeus/condor_config.local
3/18 18:25:23 DaemonCore: Command Socket at <10.201.40.155:47380>
3/18 18:25:24 Initialized the following authorization table:
3/18 18:25:24 host 10.201.40.155: user *: WRITE,NEGOTIATOR,ADMINISTRATOR,OWNER,DAEMON,ADVERTISE_STARTD,ADVERTISE_SCHEDD,ADVERTISE_MASTER
3/18 18:25:24 Will use UDP to update collector zeus.pesgrid.wipro.com <10.201.40.155:9618>
3/18 18:25:24 VMGAHP[3584]: VM-GAHP initialized with run-mode 1
3/18 18:25:24 VMGAHP[3584]: Initial UID/GUID=49527/49527, EUID/EGUID=49527/49527, Condor UID/GID=49527,49527
3/18 18:25:24 VMGAHP[3584]: Initialize Uids: caller=idealgrid, job user=idealgrid
3/18 18:25:24 VMGAHP[3584]: VM_HARDWARE_VT is undefined, using default value of False
3/18 18:25:24 VMGAHP[3584]: Worker Env = VMGAHP_WORKING_DIR=/opt/condor-7.0.1/local.zeus/execute/dir_3573 VMGAHP_USER_GID=49527 CONDOR_IDS=49527.49527 VMGAHP_VMTYPE=vmware VMGAHP_USER_UID=49527 VMGAHP_CONFIG=/opt/condor-7.0.1/etc/condor_vmgahp_config.vmware
3/18 18:25:24 VMGAHP[3584]: Starting worker : /opt/condor-7.0.1/sbin/condor_vm-gahp -f -t -M 2
3/18 18:25:24 Create_Process: using fast clone() to create child process.
3/18 18:25:24 VMGAHP[3584]: Worker pid=3585
3/18 18:25:24 VMGAHP[3584]: Command: COMMANDS
3/18 18:25:24 Getting monitoring info for pid 3584
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ******************************************************
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** condor_vm-gahp (CONDOR_VM_GAHP) STARTING UP
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** /opt/condor-7.0.1/sbin/condor_vm-gahp
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** $CondorPlatform: I386-LINUX_RHEL5 $
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** PID = 3585
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ** Log last touched time unavailable (Success)
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 ******************************************************
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Using config source: /opt/condor-7.0.1/etc/condor_config
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Using local config sources: 
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24    /opt/condor-7.0.1/local.zeus/condor_config.local
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 DaemonCore: Command Socket at <10.201.40.155:34018>
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Initialized the following authorization table:
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 host 10.201.40.155: user *: WRITE,NEGOTIATOR,ADMINISTRATOR,OWNER,DAEMON,ADVERTISE_STARTD,ADVERTISE_SCHEDD,ADVERTISE_MASTER
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: 3/18 18:25:24 Will use UDP to update collector zeus.pesgrid.wipro.com <10.201.40.155:9618>
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: VM-GAHP initialized with run-mode 2
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: Initial UID/GUID=49527/49527, EUID/EGUID=49527/49527, Condor UID/GID=49527,49527
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: Initialize Uids: caller=idealgrid, job user=idealgrid
3/18 18:25:24 VMGAHP[3584]: Worker[3585]: VM_HARDWARE_VT is undefined, using default value of False
3/18 18:25:25 VMGAHP[3584]: Command: SUPPORT_VMS
3/18 18:25:25 DaemonCore: in SendAliveToParent()
3/18 18:25:25 DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3573 is alive.
3/18 18:25:45 condor_read(): timeout reading 5 bytes from <10.201.40.155:33190>.
3/18 18:25:45 IO: Failed to read packet header
3/18 18:25:45 Failed to read ClassAd size.
3/18 18:25:45 DaemonCore: Leaving SendAliveToParent() - success
3/18 18:25:45 VMGAHP[3584]: Command: ASYNC_MODE_ON
3/18 18:25:45 DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3573 is alive.
3/18 18:25:45 VMGAHP[3584]: Command: CLASSAD
3/18 18:25:45 DaemonCore: Command received via UDP from host <10.201.40.155:32836>, access level IMMEDIATE_FAMILY
3/18 18:25:45 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
3/18 18:25:46 VMGAHP[3584]: Sending Job ClassAd to worker
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Command: CLASSAD
3/18 18:25:48 VMGAHP[3584]: Command: CONDOR_VM_START
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Command: CONDOR_VM_START
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: USE_SCRIPT_TO_CREATE_CONFIG is undefined, using default value of False
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Start
3/18 18:25:48 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Snapshot
3/18 18:25:49 VMGAHP[3584]: Command: RESULTS
3/18 18:25:50 VMGAHP[3584]: Worker[3585]: register(/opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx) = 1
3/18 18:25:52 VMGAHP[3584]: Command: RESULTS
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: /usr/bin/vmware-cmd: Could not connect to VM /opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx
3/18 18:25:54 VMGAHP[3584]: Worker[3585]:   (VMControl error -14: Unexpected response from vmware-authd: The process exited with an error:
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: End of error message)
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: (ERROR) Can't create vm with /opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Failed to execute my_system: perl /opt/condor-7.0.1/sbin/condor_vm_vmware.pl start /opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx /opt/condor-7.0.1/local.zeus/execute/dir_3573/vmware_status.condor
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Unregister
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: unregister(/opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx) = 1
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Result "2 1 VMGAHP_ERR_CRITICAL"
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Inside VMwareType::Shutdown
3/18 18:25:54 VMGAHP[3584]: Worker[3585]: executeStart fail!
3/18 18:25:55 VMGAHP[3584]: Command: RESULTS
3/18 18:25:57 VMGAHP[3584]: Command: QUIT
3/18 18:25:57 VMGAHP[3584]: Started timer to call quitFast in 30 seconds
3/18 18:25:57 VMGAHP[3584]: Worker[3585]: Command: QUIT
3/18 18:25:59 VMGAHP[3584]: EOF reached on DaemonCore pipe 65539
3/18 18:25:59 VMGAHP[3584]: VM GAHP Worker result buffer closed, exiting...
3/18 18:25:59 VMGAHP[3584]: Inside VMwareType::killVMFast
3/18 18:25:59 VMGAHP[3584]: killVMFast is called
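
In a log like the one above, only a few lines matter for this problem: the user the VM-GAHP ran the job as, the VM-GAHP result code, and any "(ERROR)" message. The following is a minimal Python sketch for pulling those out. It is not part of Condor; the function name summarize_vmgahp_log is hypothetical, and the patterns assume the line layout seen in this particular 7.0.1 log, so other versions may need adjusting.

```python
import re

# Patterns modelled on the log lines shown above (assumption: other
# Condor versions may format these lines differently).
UID_RE = re.compile(r"Initialize Uids: caller=([^,\s]+), job user=(\S+)")
RESULT_RE = re.compile(r'Result "\d+ \d+ (VMGAHP_ERR_\w+)"')
ERROR_RE = re.compile(r"\(ERROR\)\s*(.*)")


def summarize_vmgahp_log(lines):
    """Collect the job user, VM-GAHP error codes, and (ERROR) messages."""
    summary = {"job_user": None, "error_codes": [], "errors": []}
    for line in lines:
        match = UID_RE.search(line)
        if match and summary["job_user"] is None:
            # Keep the first job user reported in the log.
            summary["job_user"] = match.group(2)
        match = RESULT_RE.search(line)
        if match:
            summary["error_codes"].append(match.group(1))
        match = ERROR_RE.search(line)
        if match:
            summary["errors"].append(match.group(1).strip())
    return summary


# Three lines copied from the log above, used as sample input.
sample = [
    "3/18 18:25:24 VMGAHP[3584]: Initialize Uids: caller=idealgrid, job user=idealgrid",
    "3/18 18:25:54 VMGAHP[3584]: Worker[3585]: (ERROR) Can't create vm with "
    "/opt/condor-7.0.1/local.zeus/execute/dir_3573/vm3K7txP_condor.vmx",
    '3/18 18:25:54 VMGAHP[3584]: Worker[3585]: Result "2 1 VMGAHP_ERR_CRITICAL"',
]

print(summarize_vmgahp_log(sample))
```

To run it against a real log, pass an open file object instead of the sample list. Comparing the reported job user with the accounts that have a writable home directory on the execute machine is one way to check for the situation described in the quoted reply.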

--- End Message ---