
[HTCondor-users] Error, failed while reading from pipe.



Hello everybody,

I am having some trouble submitting a .dag file to Condor through Python. What I want to do is submit a job from a machine without a shared file system, using Condor's SOAP service. If I submit the .dag job from the console with condor_submit_dag, everything runs correctly, but when I run my Python script, all the files are transferred fine, yet my dagman.out shows the error "failed while reading from pipe."

I have noticed two strange things.

The first is that all the files sent through SOAP have odd permissions:

-rw------- 1 condor condor 7180 2013-03-24 10:00 bucle
-rw-r--r-- 1 condor condor    0 2013-03-24 10:06 _bucleA.log
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleA.submit
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleB.submit
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleC.submit
-rw------- 1 condor condor  118 2013-03-24 10:00 bucle.dag
-rw-r--r-- 1 condor condor  520 2013-03-24 10:06 bucle.dagman.log
-rw-r--r-- 1 condor condor 9402 2013-03-24 10:06 bucle.dagman.out
-rw------- 1 condor condor   29 2013-03-24 10:06 bucle.dagman.stdout
-rw-r--r-- 1 condor condor  338 2013-03-24 10:06 bucle.dag.rescue001
-rw------- 1 condor condor  141 2013-03-24 10:00 bucleD.submit
-rw------- 1 condor condor    0 2013-03-24 10:05 bucle.stderr

bucle does not have execute permission. Is that expected? It is probably fine, since there is only one way to send files through SOAP.
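For what it is worth, this is how I checked the permissions from Python. It is a generic sketch, nothing Condor-specific, and mode_string is just my own helper name:

```python
import os
import stat
import tempfile

def mode_string(path):
    """Return an ls-style permission string like '-rw-------'."""
    return stat.filemode(os.stat(path).st_mode)

def is_executable(path):
    """True if the current user could execute the file."""
    return os.access(path, os.X_OK)

# Hypothetical example on a temporary file, mimicking the 0600
# mode that the SOAP-transferred files ended up with:
with tempfile.NamedTemporaryFile(delete=False) as f:
    tmp = f.name
os.chmod(tmp, 0o600)
print(mode_string(tmp))      # -rw-------
print(is_executable(tmp))    # False
os.remove(tmp)
```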

The second strange thing: I ran condor_q -long to check and compare (with meld) the job launched from condor_submit_dag against the one launched from my Python script, and I did not find any significant differences.


condor_submit_dag:

Arguments = "-f -l . -Debug 3 -Lockfile bucle.lock -AutoRescue 1 -DoRescueFrom 0 -Dag bucle.dag -CsdVersion $CondorVersion:' '7.4.4' 'Oct' '14' '2010' 'BuildID:' '279383' '$"
BufferBlockSize = 32768
BufferSize = 524288
ClusterId = 752
Cmd = "/opt/condor/current/bin/condor_dagman"
CommittedTime = 0
CompletionDate = 1364115965
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL5 $"
CondorVersion = "$CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $"
CoreSize = -1
CumulativeSuspensionTime = 0
CurrentHosts = 0
EnteredCurrentStatus = 1364115965
Env = "_CONDOR_DAGMAN_LOG=bucle.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0; DAGMAN_PROHIBIT_MULTI_JOBS=True"
Err = "bucle.stderr"
ExitBySignal = FALSE
ExitCode = 1
ExitStatus = 1
FilesRetrieved = FALSE
getenv = TRUE
GlobalJobId = "c-head.micluster.com#752.0#1364115625"
ImageSize = 0
ImageSize_RAW = 0
In = "/dev/null"
Iwd = "/home/condor/hosts/c-head/spool/cluster752.proc0.subproc0"
JobCurrentStartDate = 1364115903
JobFinishedHookDone = 1364115965
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1364115903
JobStatus = 4
JobUniverse = 7
KillSig = "SIGTERM"
LastJobStatus = 2
LastSuspensionTime = 0
LeaveJobInQueue = FilesRetrieved =?= FALSE
LocalSysCpu = 0.000000
LocalUserCpu = 0.000000
MaxHosts = 1
MinHosts = 1
NiceUser = FALSE
NumCkpts = 0
NumCkpts_RAW = 0
NumJobStarts = 1
NumRestarts = 0
NumSystemHolds = 0
> =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
OrigMaxHosts = 1
Out = "bucle.dagman.stdout"
Owner = "usuario"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
ProcId = 0
QDate = 1364115625
RemoteSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteWallClockTime = 62.000000
Requirements = TRUE
RootDir = "/"
ServerTime = 1364117955
ShouldTransferFiles = "YES"
StageInFinish = 1
StageInStart = 1
TotalSuspensions = 0
TransferFiles = "ONEXIT"
TransferInput = "bucleD.submit,bucle.dag,bucle,bucleA.submit,bucleB.submit,bucleC.submit"
UserLog = "bucle.dagman.log"
User = "usuario@xxxxxxxxxxxxx"
WantCheckpoint = FALSE
WantRemoteIO = TRUE
WantRemoteSyscalls = FALSE
WhenToTransferOutput = "ON_EXIT"

my Python script:

Arguments = "-f -l . -Debug 3 -Lockfile bucle.lock -AutoRescue 1 -DoRescueFrom 0 -Dag bucle.dag -CsdVersion $CondorVersion:' '7.4.4' 'Oct' '14' '2010' 'BuildID:' '279383' '$"
BufferBlockSize = 32768
BufferSize = 524288
ClusterId = 752
Cmd = "/opt/condor/current/bin/condor_dagman"
CommittedTime = 0
CompletionDate = 0
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL5 $"
CondorVersion = "$CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $"
CoreSize = -1
CumulativeSuspensionTime = 0
CurrentHosts = 1
EnteredCurrentStatus = 1364115902
Env = "_CONDOR_DAGMAN_LOG=bucle.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0; DAGMAN_PROHIBIT_MULTI_JOBS=True"
Err = "bucle.stderr"
ExitBySignal = FALSE
ExitStatus = 0
FilesRetrieved = FALSE
getenv = TRUE
GlobalJobId = "c-head.micluster.com#752.0#1364115625"
ImageSize = 0
ImageSize_RAW = 0
In = "/dev/null"
Iwd = "/home/condor/hosts/c-head/spool/cluster752.proc0.subproc0"
JobCurrentStartDate = 1364115903
JobNotification = 0
JobPrio = 0
JobRunCount = 1
JobStartDate = 1364115903
JobStatus = 2
JobUniverse = 7
KillSig = "SIGTERM"
LastJobStatus = 1
LastSuspensionTime = 0
LeaveJobInQueue = FilesRetrieved =?= FALSE
LocalSysCpu = 0.000000
LocalUserCpu = 0.000000
MaxHosts = 1
MinHosts = 1
NiceUser = FALSE
NumCkpts = 0
NumCkpts_RAW = 0
NumJobStarts = 1
NumRestarts = 0
NumSystemHolds = 0
> =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
OrigMaxHosts = 1
Out = "bucle.dagman.stdout"
Owner = "usuario"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
ProcId = 0
QDate = 1364115625
RemoteSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteWallClockTime = 0.000000
Requirements = TRUE
RootDir = "/"
ServerTime = 1364115938
ShadowBday = 1364115903
ShouldTransferFiles = "YES"
StageInFinish = 1
StageInStart = 1
TotalSuspensions = 0
TransferFiles = "ONEXIT"
TransferInput = "bucleD.submit,bucle.dag,bucle,bucleA.submit,bucleB.submit,bucleC.submit"
UserLog = "bucle.dagman.log"
User = "usuario@xxxxxxxxxxxxx"
WantCheckpoint = FALSE
WantRemoteIO = TRUE
WantRemoteSyscalls = FALSE
WhenToTransferOutput = "ON_EXIT"


Here is the bucle.dagman.out:

03/24 10:05:03 ******************************************************
03/24 10:05:03 ** condor_scheduniv_exec.752.0 (CONDOR_DAGMAN) STARTING UP
03/24 10:05:03 ** /exports/condor/condor-7.4.4/bin/condor_dagman
03/24 10:05:03 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
03/24 10:05:03 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
03/24 10:05:03 ** $CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $
03/24 10:05:03 ** $CondorPlatform: I386-LINUX_RHEL5 $
03/24 10:05:03 ** PID = 2174
03/24 10:05:03 ** Log last touched time unavailable (No such file or directory)
03/24 10:05:03 ******************************************************
03/24 10:05:03 Using config source: /home/condor/condor_config
03/24 10:05:03 Using local config sources:
03/24 10:05:03    /opt/condor/current/etc/condor_config.local
03/24 10:05:03    /opt/condor/etc/condor_config.cluster
03/24 10:05:03    /opt/condor/etc/condor_config.c-head
03/24 10:05:03 DaemonCore: Command Socket at <192.168.1.20:9320>
03/24 10:05:03 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
03/24 10:05:03 DAGMAN_DEBUG_CACHE_ENABLE setting: False
03/24 10:05:03 DAGMAN_SUBMIT_DELAY setting: 0
03/24 10:05:03 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
03/24 10:05:03 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
03/24 10:05:03 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
03/24 10:05:03 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
03/24 10:05:03 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
03/24 10:05:03 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
03/24 10:05:03 DAGMAN_RETRY_NODE_FIRST setting: 0
03/24 10:05:03 DAGMAN_MAX_JOBS_IDLE setting: 0
03/24 10:05:03 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
03/24 10:05:03 DAGMAN_MUNGE_NODE_NAMES setting: 1
03/24 10:05:03 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
03/24 10:05:03 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
03/24 10:05:03 DAGMAN_ABORT_DUPLICATES setting: 1
03/24 10:05:03 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
03/24 10:05:03 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
03/24 10:05:03 DAGMAN_AUTO_RESCUE setting: 1
03/24 10:05:03 DAGMAN_MAX_RESCUE_NUM setting: 100
03/24 10:05:03 DAGMAN_DEFAULT_NODE_LOG setting: null
03/24 10:05:03 ALL_DEBUG setting:
03/24 10:05:03 DAGMAN_DEBUG setting:
03/24 10:05:03 argv[0] == "condor_scheduniv_exec.752.0"
03/24 10:05:03 argv[1] == "-Debug"
03/24 10:05:03 argv[2] == "3"
03/24 10:05:03 argv[3] == "-Lockfile"
03/24 10:05:03 argv[4] == "bucle.lock"
03/24 10:05:03 argv[5] == "-AutoRescue"
03/24 10:05:03 argv[6] == "1"
03/24 10:05:03 argv[7] == "-DoRescueFrom"
03/24 10:05:03 argv[8] == "0"
03/24 10:05:03 argv[9] == "-Dag"
03/24 10:05:03 argv[10] == "bucle.dag"
03/24 10:05:03 argv[11] == "-CsdVersion"
03/24 10:05:03 argv[12] == "$CondorVersion: 7.4.4 Oct 14 2010 BuildID: 279383 $"
03/24 10:05:03 Default node log file is: </home/condor/hosts/c-head/spool/cluster752.proc0.subproc0/bucle.dag.nodes.log>
03/24 10:05:03 DAG Lockfile will be written to bucle.lock
03/24 10:05:03 DAG Input file is bucle.dag
03/24 10:05:03 Parsing 1 dagfiles
03/24 10:05:03 Parsing bucle.dag ...
03/24 10:05:03 Dag contains 4 total jobs
03/24 10:05:03 Sleeping for 12 seconds to ensure ProcessId uniqueness
03/24 10:05:15 Bootstrapping...
03/24 10:05:15 Number of pre-completed nodes: 0
03/24 10:05:15 Registering condor_event_timer...
03/24 10:05:16 Sleeping for one second for log file consistency
03/24 10:05:17 Submitting Condor Node A job(s)...
03/24 10:05:17 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:17 failed while reading from pipe.
03/24 10:05:17 Read so far:
03/24 10:05:17 ERROR: submit attempt failed
03/24 10:05:17 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:17 Job submit try 1/6 failed, will try again in >= 1 second.
03/24 10:05:17 Of 4 nodes total:
03/24 10:05:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
03/24 10:05:17   ===     ===      ===     ===     ===        ===      ===
03/24 10:05:17     0       0        0       0       1          3        0
03/24 10:05:22 Sleeping for one second for log file consistency
03/24 10:05:23 Submitting Condor Node A job(s)...
03/24 10:05:23 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:23 failed while reading from pipe.
03/24 10:05:23 Read so far:
03/24 10:05:23 ERROR: submit attempt failed
03/24 10:05:23 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:23 Job submit try 2/6 failed, will try again in >= 2 seconds.
03/24 10:05:28 Sleeping for one second for log file consistency
03/24 10:05:29 Submitting Condor Node A job(s)...
03/24 10:05:29 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:29 failed while reading from pipe.
03/24 10:05:29 Read so far:
03/24 10:05:29 ERROR: submit attempt failed
03/24 10:05:29 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:29 Job submit try 3/6 failed, will try again in >= 4 seconds.
03/24 10:05:34 Sleeping for one second for log file consistency
03/24 10:05:35 Submitting Condor Node A job(s)...
03/24 10:05:35 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:35 failed while reading from pipe.
03/24 10:05:35 Read so far:
03/24 10:05:35 ERROR: submit attempt failed
03/24 10:05:35 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:35 Job submit try 4/6 failed, will try again in >= 8 seconds.
03/24 10:05:46 Sleeping for one second for log file consistency
03/24 10:05:47 Submitting Condor Node A job(s)...
03/24 10:05:47 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:47 failed while reading from pipe.
03/24 10:05:47 Read so far:
03/24 10:05:47 ERROR: submit attempt failed
03/24 10:05:47 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:05:47 Job submit try 5/6 failed, will try again in >= 16 seconds.
03/24 10:06:04 Sleeping for one second for log file consistency
03/24 10:06:05 Submitting Condor Node A job(s)...
03/24 10:06:05 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:06:05 failed while reading from pipe.
03/24 10:06:05 Read so far:
03/24 10:06:05 ERROR: submit attempt failed
03/24 10:06:05 submit command was: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '752 -a DAGManJobId' '=' '752 -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" bucleA.submit
03/24 10:06:05 Job submit failed after 6 tries.
03/24 10:06:05 Shortcutting node A retries because of submit failure(s)
03/24 10:06:05 Of 4 nodes total:
03/24 10:06:05  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
03/24 10:06:05   ===     ===      ===     ===     ===        ===      ===
03/24 10:06:05     0       0        0       0       0          3        1
03/24 10:06:05 ERROR: the following job(s) failed:
03/24 10:06:05 ---------------------- Job ----------------------
03/24 10:06:05       Node Name: A
03/24 10:06:05          NodeID: 0
03/24 10:06:05     Node Status: STATUS_ERROR   
03/24 10:06:05 Node return val: -1
03/24 10:06:05           Error: Job submit failed
03/24 10:06:05 Job Submit File: bucleA.submit
03/24 10:06:05   Condor Job ID: [not yet submitted]
03/24 10:06:05       Q_PARENTS: <END>
03/24 10:06:05       Q_WAITING: <END>
03/24 10:06:05      Q_CHILDREN: B, C, <END>
03/24 10:06:05 ---------------------------------------    <END>
03/24 10:06:05 Aborting DAG...
03/24 10:06:05 Writing Rescue DAG to bucle.dag.rescue001...
03/24 10:06:05 Note: 0 total job deferrals because of -MaxJobs limit (0)
03/24 10:06:05 Note: 0 total job deferrals because of -MaxIdle limit (0)
03/24 10:06:05 Note: 0 total job deferrals because of node category throttles
03/24 10:06:05 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
03/24 10:06:05 Note: 0 total POST script deferrals because of -MaxPost limit (0)
03/24 10:06:05 **** condor_scheduniv_exec.752.0 (condor_DAGMAN) pid 2174 EXITING WITH STATUS 1
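For context: as far as I can tell, DAGMan launches condor_submit as a child process and parses its stdout through a pipe, so "failed while reading from pipe." with an empty "Read so far:" seems to mean the child produced no output at all. A simplified Python sketch of that pattern (run_and_read is my own hypothetical name, and echo stands in for condor_submit):

```python
import subprocess

def run_and_read(cmd):
    """Spawn a command and read its stdout through a pipe,
    roughly the way condor_dagman reads condor_submit's output."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output = proc.stdout.read().decode()
    proc.wait()
    if proc.returncode != 0 or not output:
        # An empty read here would correspond to DAGMan's
        # "failed while reading from pipe." with nothing after
        # "Read so far:" in the log above.
        raise RuntimeError("failed while reading from pipe")
    return output

# Stand-in for condor_submit; the real command line is in the log.
print(run_and_read(["echo", "1 job(s) submitted to cluster 752."]))
```

If that is the mechanism, the failure would point at condor_submit itself failing to run or to print anything inside the spool directory, rather than at the DAG file.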

If anyone can tell me what I am doing wrong, I would appreciate it.

Fernando.