
Re: [HTCondor-users] Gahp server (failure issues ) exited with status 1 unexpectedly



Hi again,

While waiting to hear back from you, I tried running the following commands to add and test a bosco cluster, and got the responses below.

------------

[cloudcopasi@cloud-copasi-new bin]$ eval `ssh-agent`; ssh-add keypath; ./bosco_cluster --platform RH7 --add hbaig@hostname slurm;kill $SSH_AGENT_PID;
Agent pid 30086
Identity added: keypath (keypath)
Enter the password to copy the ssh keys to hbaig@hostname:
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
bash: /usr/bin/scp: Permission denied
lost connection
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
bash: /usr/bin/scp: Permission denied
lost connection
Downloading for hbaig@xxxxxxxxxxxxxxx
Unpacking.
Sending libraries to hbaig@hostname.
Creating BOSCO for the WN's..........................................
Installing on cluster hbaig@xxxxxxxxxxx
Installation complete
The cluster hbaig@hostname has been added to BOSCO
It is available to run jobs submitted with the following values:
> universe = grid
> grid_resource = batch slurm hbaig@hostname
[cloudcopasi@cloud-copasi-new bin]$ bosco_cluster --test
bosco_cluster: option '--test' requires an argument
usage: /home/cloudcopasi/bosco/bin/bosco_cluster command

commands:
 -l|--list                  List the installed clusters
 -a|--add host sched        Install and add a cluster, with scheduler sched
 -r|--remove [host]         Remove the installed cluster (first in list)
 -s|--status [host]         Get status of installed cluster
 -z|--pool_status [host]    Get status of cluster resources
 -t|--test [host]           Test the installed cluster (all clusters)
 -h|--help                  Show this help message

Where host is user@xxxxxxxxxxxxxxxx
/home/cloudcopasi/bosco/bin/bosco_cluster can manage max 254 clusters

Terminating...
[cloudcopasi@cloud-copasi-new bin]$ bosco_cluster --list
hbaig@hostname/slurm
[cloudcopasi@cloud-copasi-new bin]$ bosco_cluster --test hbaig@hostname
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Passed!
Testing bosco submission...Passed!
Submission and log files for this job are in /home/cloudcopasi/bosco/local.bosco/bosco-test/boscotest.qofoJk
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
01/15/21 11:39:52 [31246] (4.0) doEvaluateState called: gmState GM_SUBMIT, remoteState 0
01/15/21 11:39:52 [31246] (4.0) blah_job_submit() failed: submission command failed (exit code = 1) (stdout:) (stderr:sbatch: error: Batch job submission failed: Invalid qos specification-Error from sbatch: -)
01/15/21 11:39:56 [31246] No jobs left, shutting down
01/15/21 11:39:56 [31246] Got SIGTERM. Performing graceful shutdown.
01/15/21 11:39:56 [31246] **** condor_gridmanager (condor_GRIDMANAGER) pid 31246 EXITING WITH STATUS 0


------------

where keypath is the file containing the SSH private key.
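Looking at the last test step, I wonder whether the remaining failure is a Slurm QOS problem rather than ssh: sbatch on the cluster rejected the job with "Invalid qos specification", so the generated submit file apparently carries no QOS (or one my account isn't allowed to use). If the cluster requires an explicit QOS, would something like blahp's local submit attributes hook be the right fix? A sketch is below; the script path is from a default bosco layout and the QOS name "general" is just a guess for this cluster:

```shell
#!/bin/sh
# Hypothetical ~/bosco/glite/bin/slurm_local_submit_attributes.sh
# (path assumed from a default bosco install; "general" is a
# placeholder QOS name). blahp runs this hook at submit time and
# appends its stdout to the sbatch file it generates, so extra
# #SBATCH directives can be injected here.
echo "#SBATCH --qos=general"
```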

I would really appreciate it if you could point me toward a next step.

Thanks

regards
Hasan

On Jan 12, 2021, at 3:01 PM, Hasan Baig <hasanbaigg@xxxxxxxxx> wrote:

Hi Jaime,

I don't see any GridmanagerLog.hbaig file under the /var/log/condor/ directory. All I see are the following files:

-rw-r--r-- 1 root   root       2813 Dec  8 11:32 KernelTuning.log
-rw-r--r-- 1 condor condor    25079 Jan 12 12:32 MasterLog
-rw-r--r-- 1 root   root     289589 Jan 12 13:29 ProcLog
-rw-r--r-- 1 root   root    1000095 Jan 12 08:22 ProcLog.old
-rw-r--r-- 1 condor condor  3921735 Jan 12 13:30 SharedPortLog
-rw-r--r-- 1 condor condor 10485973 Dec  9 03:37 SharedPortLog.old


Let me share the complete output I get when I try to add the cluster. I run the following command (on the web server) to add the cluster to the Bosco compute pool:

[user@servername ~]$ bosco_cluster --add user@hostname slurm
Enter the password to copy the ssh keys to user@hostname:
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ControlSocket /tmp/bosco_ssh_control.user@hostname:22 already exists, disabling multiplexing
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
bash: /usr/bin/scp: Permission denied
lost connection
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
bash: /usr/bin/scp: Permission denied
lost connection
Downloading for user@xxxxxxxxxxxxxxx
Unpacking.
Sending libraries to user@hostname.
Creating BOSCO for the WN's..........................................
Installing on cluster user@xxxxxxxxxxx
Installation complete
The cluster user@hostname has been added to BOSCO
It is available to run jobs submitted with the following values:
> universe = grid
> grid_resource = batch slurm user@hostname


Looking forward to hearing back soon.

Thanks 

Regards
Hasan

On Jan 11, 2021, at 6:05 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

Can you send the portion of the GridmanagerLog.hbaig file in the HTCondor log directory from around the time one of these jobs went to held status?

The "ImportError: No module named siteâ is suspicious, and odd that itâs not printed when you run remote_gahp on the command line.

The RESOURCE USAGE POLICY banner could also be the cause. Such banners are usually suppressed when ssh is given a command to run, and the output of remote_gahp is interpreted by the HTCondor gridmanager daemon, which isn't expecting the banner.
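A quick way to confirm the banner theory is to check whether non-interactive ssh prints anything at all, since the gridmanager parses remote_gahp's stdout as a GAHP protocol stream and any extra output corrupts it. A sketch (the user@host is a placeholder for your cluster login):

```shell
# Returns success (0) only if the given command prints nothing on
# stdout; any output means a banner/MOTD would pollute the GAHP stream.
check_clean() {
    out=$(eval "$1" 2>/dev/null)   # $1: command whose output to inspect
    [ -z "$out" ]
}

# Example: check_clean "ssh hbaig@hostname /bin/true"
```

If the check fails, ask the cluster admins to emit the banner only for interactive logins (e.g. guard it in the shell profile, or move it to sshd's Banner, which goes to the ssh client rather than the command's stdout).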

 - Jaime

On Jan 7, 2021, at 5:10 PM, hasanbaigg@xxxxxxxxx wrote:

Hi Again,

I also tried to monitor the status of the submitted job; the results are given below and might help you figure out what is going on:

$ condor_q -hold

-- Schedd: <hostname> : <127.0.0.1:11000?... @ 01/07/21 18:01:30
 ID      OWNER          HELD_SINCE  HOLD_REASON
  46.0   hbaig         1/6  13:34  Failed to start GAHP: Agent pid 3832\nImportError: No module named site\nAgent pid 3832 killed\n

Thanks for any help. 

regards
Hasan

On Jan 7, 2021, at 4:37 PM, Hasan Baig <hasanbaigg@xxxxxxxxx> wrote:

Hello,

Thanks for the response. I tried to run the command you suggested and got the following response

Agent pid 14621
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
         !!!        RESOURCE USAGE POLICY       !!!
         !!! Uploading and/or processing of PHI !!!
         !!! or other protected data in the HPC !!!
         !!! environment is prohibited.         !!!
         !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/home/FCAM/hbaig/bosco/glite/bin/batch_gahp.symlink: /home/FCAM/hbaig/bosco/glite/bin/../lib/condor/libglobus_common.so.0: no version information available (required by /home/FCAM/hbaig/bosco/glite/bin/batch_gahp.symlink)
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $

I am able to connect to the remote server where bosco is installed, and I don't understand how this could be an SSH issue.

Sorry for asking naive questions, but I am a total beginner and do not understand how to proceed. Thanks for your help and responses.

regards
Hasan

On Jan 7, 2021, at 2:29 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On Jan 7, 2021, at 9:32 AM, hasanbaigg@xxxxxxxxx wrote:

I am working on a web-based tool which takes jobs from a user and submits them to bosco resources (compute nodes). I am using bosco (condor 8.8.12) on Linux CentOS 7. The web interface allows a user to add a bosco pool, which the user can then use to submit jobs. However, when I try to submit a job, it fails. I also tried to test the pool with the following command:


bosco_cluster --test 


It gives me the following GAHP error:


This is probably an ssh failure (network, authentication, or authorization). Bosco runs the following command to access the remote cluster submit host:

<sbin>/remote_gahp <user>@<hostname> batch_gahp

You can run it on the command line to get more details about what's going wrong. remote_gahp is a bash script, so you can dig in further, if necessary.

 - Jaime

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

