
Re: [HTCondor-users] Gahp server (failure issues ) exited with status 1 unexpectedly



Can you send the portion of the GridmanagerLog.hbaig file in the HTCondor log directory from around the time one of these jobs went into held status?

The "ImportError: No module named site" is suspicious, and odd that it's not printed when you run remote_gahp on the command line.

The RESOURCE USAGE POLICY banner could also be the cause. Such banners are usually suppressed when ssh is given a command to run, and the output of remote_gahp is interpreted by the HTCondor gridmanager daemon, which isn't expecting the banner.
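
As a rough way to see what extra text arrives ahead of the GAHP's own output, here is a minimal Python sketch. It assumes the "$GahpVersion:" line is the first thing the gridmanager wants to read, which I have not checked against the gridmanager code; the function name unexpected_preamble and the sample text are only illustrative.

```python
def unexpected_preamble(output, marker="$GahpVersion:"):
    """Return the non-empty lines seen before the first line starting with marker.

    Assumption (not confirmed from HTCondor source): anything printed before
    the version line is text the consumer of this output does not expect.
    """
    preamble = []
    for line in output.splitlines():
        if line.startswith(marker):
            return preamble
        if line.strip():
            preamble.append(line.strip())
    # Marker never appeared, so every non-empty line counts as unexpected.
    return preamble


# Shortened copy of the command-line output quoted in this thread.
sample = "\n".join([
    "Agent pid 14621",
    "!!!        RESOURCE USAGE POLICY       !!!",
    "$GahpVersion: 1.8.0 Mar 31 2008 INFN\\ blahpd\\ (poly,new_esc_format) $",
])

for extra in unexpected_preamble(sample):
    print("extra line:", extra)
```

Running this on the captured output of the remote_gahp command would list the banner lines, which gives a concrete picture of what needs to be silenced on the remote side.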

 - Jaime

On Jan 7, 2021, at 5:10 PM, hasanbaigg@xxxxxxxxx wrote:

Hi Again,

I also tried to monitor the status of the submitted jobs. The result is given below and might help you figure out what is going on:

$ condor_q -hold

-- Schedd: <hostname> : <127.0.0.1:11000?... @ 01/07/21 18:01:30
 ID      OWNER          HELD_SINCE  HOLD_REASON
  46.0   hbaig         1/6  13:34  Failed to start GAHP: Agent pid 3832\nImportError: No module named site\nAgent pid 3832 killed\n

Thanks for any help. 

regards
Hasan

On Jan 7, 2021, at 4:37 PM, Hasan Baig <hasanbaigg@xxxxxxxxx> wrote:

Hello,

Thanks for the response. I tried to run the command you suggested and got the following response

Agent pid 14621
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/home/FCAM/hbaig/bosco/glite/bin/batch_gahp.symlink: /home/FCAM/hbaig/bosco/glite/bin/../lib/condor/libglobus_common.so.0: no version information available (required by /home/FCAM/hbaig/bosco/glite/bin/batch_gahp.symlink)
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $

I am able to connect to the remote server where bosco is installed, and I don't understand how it could be an SSH issue.

Sorry for asking naive questions, but I am a complete beginner and do not understand how to proceed. Thanks for your help and responses.

regards
Hasan

On Jan 7, 2021, at 2:29 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On Jan 7, 2021, at 9:32 AM, hasanbaigg@xxxxxxxxx wrote:

I am working on a web-based tool which takes jobs from a user and submits them to bosco resources (compute nodes). I am using a bosco version (condor 8.8.12) on Linux CentOS 7. The web interface allows a user to add a bosco pool, which the user can then use to submit jobs. However, when I try to submit a job, it fails. I also tried to test the pool using the following command:


bosco_cluster --test 


It gives me the following GAHP error:


This is probably an ssh failure (network, authentication, or authorization). Bosco runs the following command to access the remote cluster submit host:

<sbin>/remote_gahp <user>@<hostname> batch_gahp

You can run it on the command line to get more details about what's going wrong. remote_gahp is a bash script, so you can dig in further, if necessary.

 - Jaime

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

