Overview
On grid2, we need to install/configure an additional service: GRAM2.
I presume that Pegasus generates DAGMan jobs as either standard, vanilla or gt2. If the job is gt2, the GRAM2 client (Condor-G) on a Submit node (grid1) accesses the GRAM2 service (gatekeeper and jobmanager) on grid2. Condor-G is part of the Condor installation package. The Globus jobmanager is configured to use Condor, which is also installed on grid2.
Ref: http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/ -> http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/index.html ->
http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/gram2-admin-configuring.html
On grid1, only Condor-G, the MyProxy client and GridFTP may be required. Installing Condor on the Submit node, configuring it as a Submit node and submitting a gt2 job via condor_submit will probably invoke Condor-G.
Pre-requisites
- Condor installed and configured as a Submit node (also Execute node to test Vanilla/Standard jobs).
- If Condor is configured with USE_NFS = True, network-shared accounts (LDAP/Kerberos) and NFS-shared home directories are required. For this (with NFSv4), RHEL 5.2 requires installing and configuring Kerberos, NFS, LDAP, nss_ldap and pam_krb5. Refer to "3.7 LDAP Clients - For SSO with Kerberos and NFS" and related pages for details.
- Globus Toolkit installed and configured on grid2 already.
- Firewall: ports must be open for GRAM2 client traffic (the gatekeeper port, 2119/tcp) and for GridFTP ephemeral ports. If a GridFTP server is used, 2811/tcp should be open. If GT2 MDS GRIS/GIIS is used, 2135/tcp should be open.
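To make the port list above concrete, here is a minimal sketch that only prints iptables rules for these ports; it does not change the firewall. The function name is hypothetical, and it assumes the plain INPUT chain (the stock RHEL 5 ruleset uses RH-Firewall-1-INPUT instead) and 40000:41000 as the ephemeral range, matching GLOBUS_TCP_PORT_RANGE=40000,41000 used for the gatekeeper below.

```bash
# Hypothetical helper: print (not apply) iptables rules for the GRAM2-related
# ports listed above. Review the output, then run it as root if it looks right.
# Assumptions: chain defaults to INPUT; ephemeral range defaults to 40000:41000.
gram2_firewall_rules() {
    local chain="${1:-INPUT}"
    local range="${2:-40000:41000}"
    local port
    # 2119: gatekeeper, 2811: GridFTP server, 2135: GT2 MDS GRIS/GIIS,
    # plus the ephemeral range used for callbacks and data channels.
    for port in 2119 2811 2135 "$range"; do
        echo "iptables -A $chain -p tcp -m state --state NEW -m tcp --dport $port -j ACCEPT"
    done
}
```

For example, `gram2_firewall_rules RH-Firewall-1-INPUT` prints four rules for the RHEL 5 chain. Rules appended with -A after an existing REJECT rule never match, so on a stock ruleset they may need to be inserted with -I instead.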
gsigatekeeper
# locate globus-gatekeeper.conf
/nfs/software/globus/4.2.0/etc/globus-gatekeeper.conf
/usr/local/globus/etc/globus-gatekeeper.conf
# cat /usr/local/globus/etc/globus-gatekeeper.conf
-x509_cert_dir /etc/grid-security/certificates
-x509_user_cert /etc/grid-security/hostcert.pem
-x509_user_key /etc/grid-security/hostkey.pem
-gridmap /etc/grid-security/grid-mapfile
-home /usr/local/globus
-e libexec
-logfile var/globus-gatekeeper.log
-port 2119
-grid_services etc/grid-services
-inetd
These paths and files are mostly correct or already installed. The exception is the -inetd option: although we use xinetd, the gatekeeper has no xinetd option, and -inetd is used for xinetd as well.
Make sure that /etc/services has listed the service and port:
# cat /etc/services | grep 2119
gsigatekeeper 2119/tcp # GSIGATEKEEPER
gsigatekeeper 2119/udp # GSIGATEKEEPER
Ports are already registered and they are opened in the Firewall. (not sure when I did this)
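Instead of reading the grep output by eye, the check can be scripted. This is a minimal sketch with a hypothetical function name; it takes the services file as an argument so it can be tried on a copy, and it only checks the tcp entry, since that is the one the gatekeeper listens on.

```bash
# Hypothetical helper: succeed if a services file lists gsigatekeeper on 2119/tcp.
# The file defaults to /etc/services but can be any copy of it.
has_gatekeeper_service() {
    local file="${1:-/etc/services}"
    # Match "gsigatekeeper <whitespace> 2119/tcp" at the start of a line,
    # followed by whitespace or end of line (so 21190/tcp does not match).
    grep -Eq '^gsigatekeeper[[:space:]]+2119/tcp([[:space:]]|$)' "$file"
}
```

Usage: `has_gatekeeper_service && echo "gsigatekeeper is registered"`.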
For xinetd, we need to create this file:
vi /etc/xinetd.d/globus-gatekeeper
-- insert --
service gsigatekeeper
{
socket_type = stream
protocol = tcp
wait = no
user = root
env = LD_LIBRARY_PATH=/usr/local/globus/lib
server = /usr/local/globus/sbin/globus-gatekeeper
server_args = -conf /usr/local/globus/etc/globus-gatekeeper.conf
disable = no
env += GLOBUS_TCP_PORT_RANGE=40000,41000
}
NB: A client connects from an ephemeral port to the gatekeeper on 2119/tcp. If data is to be returned, the jobmanager may connect from an ephemeral port on the server to an ephemeral port on the client.
# /etc/rc.d/init.d/xinetd restart
or
# service xinetd reload
Now it is listening:
# netstat -aut | grep gatekeeper
tcp 0 0 *:gsigatekeeper *:* LISTEN
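The NB above relies on the ephemeral ports staying inside the range given by GLOBUS_TCP_PORT_RANGE=40000,41000 in the xinetd file, since that is the range opened in the firewall. A minimal sketch for checking whether a given port falls inside such a "min,max" range; the function name is hypothetical.

```bash
# Hypothetical helper: check whether a TCP port lies inside a Globus-style
# "min,max" range such as GLOBUS_TCP_PORT_RANGE=40000,41000.
# Returns 0 if min <= port <= max, 1 if outside, 2 if the input is malformed.
port_in_globus_range() {
    local port="$1"
    local range="${2:-$GLOBUS_TCP_PORT_RANGE}"
    local min="${range%%,*}"   # text before the first comma
    local max="${range##*,}"   # text after the last comma
    case "$port" in ''|*[!0-9]*) return 2 ;; esac
    case "$min" in ''|*[!0-9]*) return 2 ;; esac
    case "$max" in ''|*[!0-9]*) return 2 ;; esac
    # Reject anything that is not exactly two numbers separated by one comma.
    [ "$range" = "$min,$max" ] || return 2
    [ "$port" -ge "$min" ] && [ "$port" -le "$max" ]
}
```

For example, `port_in_globus_range 40050` (with the variable exported as in /etc/profile) succeeds for the GASS server port used in the tests further down, while 2119 is outside the range and has to be opened separately.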
Authentication for gatekeeper
The gatekeeper accepts incoming requests and passes them on to the jobmanager if the user's certificate subject has a matching entry in /etc/grid-security/grid-mapfile. Users were already added to the grid-mapfile when their credentials were created by MyProxy.
# cat /etc/grid-security/grid-mapfile
"/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama" yoichi
"/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Luke Foxton" lfoxton
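To confirm which local account a given DN maps to, the grid-mapfile format shown above (a quoted DN followed by the account name) can be looked up with a small awk script. This is a minimal sketch with a hypothetical function name; it returns only the first account if an entry lists several.

```bash
# Hypothetical helper: print the local account mapped to a certificate DN in a
# grid-mapfile (lines of the form "<DN in double quotes>" <account>).
# Prints nothing and returns 1 if the DN has no entry.
gridmap_lookup() {
    local dn="$1"
    local file="${2:-/etc/grid-security/grid-mapfile}"
    # With " as the field separator, $2 is the DN and $3 holds the account.
    awk -F'"' -v dn="$dn" '
        $2 == dn { split($3, acct, " "); print acct[1]; found = 1; exit }
        END { exit !found }
    ' "$file"
}
```

For example, `gridmap_lookup "$(grid-proxy-info -identity)"` should print yoichi for Yoichi's proxy if the entry above is in place.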
jobmanager
Ref: http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/gram2-admin-jobmanager.html
The jobmanager was configured when Globus was installed (./configure --with-gram-condor) and re-configured after hostcert.pem was installed (refer to the Globus installation section).
# cat $GLOBUS_LOCATION/etc/globus-job-manager.conf
-home "/usr/local/globus"
-globus-gatekeeper-host grid2.ramscommunity.org
-globus-gatekeeper-port 2119
-globus-gatekeeper-subject "/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/CN=host/grid2.ramscommunity.org"
-globus-host-cputype i686
-globus-host-manufacturer pc
-globus-host-osname Linux
-globus-host-osversion 2.6.18-92.1.10.el5
-globus-toolkit-version 4.2.0
-save-logfile on_error
-state-file-dir /usr/local/globus/tmp/gram_job_state
-machine-type unknown
It also uses this jobmanager-condor definition in grid-services: (this has been auto-generated during the installation)
# cat $GLOBUS_LOCATION/etc/grid-services/jobmanager-condor
stderr_log,local_cred - /usr/local/globus/libexec/globus-job-manager globus-job-manager -conf /usr/local/globus/etc/globus-job-manager.conf -type condor -rdn jobmanager-condor -machine-type unknown -publish-jobs -condor-arch INTEL -condor-os LINUX
Scheduler Event Generator / Job Manager Integration
The event generator is supposed to be started.
Adding a -seg entry to globus-job-manager.conf is supposed to instruct the job manager to use the event generator.
# vi $GLOBUS_LOCATION/etc/globus-job-manager.conf
-- insert --
-seg
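As an alternative to editing the file in vi, the -seg line can be added by a script that does nothing if the line is already present, so it is safe to re-run (for example after re-configuring the jobmanager). A minimal sketch; the function name is hypothetical.

```bash
# Hypothetical helper: append the -seg option to a globus-job-manager.conf
# only if it is not already there, so re-running it is harmless.
add_seg_option() {
    local conf="${1:-$GLOBUS_LOCATION/etc/globus-job-manager.conf}"
    # Already present (allowing surrounding whitespace)? Then nothing to do.
    if grep -Eq '^[[:space:]]*-seg[[:space:]]*$' "$conf"; then
        return 0
    fi
    # If the file does not end with a newline, add one so -seg gets its own line.
    if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi
    printf '%s\n' '-seg' >> "$conf"
}
```

Run it as the user that owns $GLOBUS_LOCATION (globus here), e.g. `add_seg_option`.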
It uses globus-job-manager-seg.conf, which must be configured beforehand:
$ cat $GLOBUS_LOCATION/etc/globus-job-manager-seg.conf
condor_log_path=/usr/local/globus/var/globus-job-manager-seg-condor
condor_test_log_path=/usr/local/globus/var/globus-job-manager-seg-condor_test
test_log_path=/usr/local/globus/var/globus-job-manager-seg-test
It seems OK.
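If a script needs one of these paths (for example to check that it exists and is owned by globus), the key=value format above can be read with awk. A minimal sketch with a hypothetical function name; it splits on the first "=" only.

```bash
# Hypothetical helper: print the value of one key from a key=value file such
# as globus-job-manager-seg.conf, e.g. condor_log_path.
seg_conf_get() {
    local key="$1"
    local conf="${2:-$GLOBUS_LOCATION/etc/globus-job-manager-seg.conf}"
    # The line must start with "key=" exactly, so condor_test_log_path is not
    # mistaken for test_log_path; the value is everything after that "=".
    awk -v k="$key" 'index($0, k "=") == 1 { print substr($0, length(k) + 2); exit }' "$conf"
}
```

For example, `ls -ld "$(seg_conf_get condor_log_path)"` shows the owner and permissions of the configured Condor path.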
$ $GLOBUS_LOCATION/sbin/globus-job-manager-event-generator -scheduler condor
Hmm, it appears to freeze and never returns to the prompt. I suspended it with Ctrl-Z and resumed it in the background with bg. It seems to return some time later.
$ ps -ef
...
globus 5250 5208 0 17:46 pts/1 00:00:00 perl /usr/local/globus/sbin/globus-job-manager-event-generator -s condor
globus 5251 5250 0 17:46 pts/1 00:00:00 /usr/local/globus/libexec/globus-scheduler-event-generator -s condor -t 1223444972
...
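If the event generator is meant to keep running (the ps output suggests it does, but that is an assumption here), it can be started detached from the terminal instead of using Ctrl-Z and bg. This is a generic nohup sketch with a hypothetical function name and hypothetical pidfile/log paths; it does not use any Globus-specific background option.

```bash
# Hypothetical helper: start a long-running command detached from the terminal,
# appending its output to a log file and recording its PID in a pidfile.
start_detached() {
    local pidfile="$1" logfile="$2"
    shift 2
    # nohup keeps it alive after logout; stdin comes from /dev/null.
    nohup "$@" >> "$logfile" 2>&1 < /dev/null &
    echo $! > "$pidfile"
}
```

Usage, as the globus user:
`start_detached /tmp/seg-condor.pid /tmp/seg-condor.log $GLOBUS_LOCATION/sbin/globus-job-manager-event-generator -scheduler condor`
and later `kill "$(cat /tmp/seg-condor.pid)"` to stop it.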
Test:
# su - globus
$ $GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s condor -t 1
001;1223347759;028.000.000;1;0
001;1223348671;029.000.000;1;0
001;1223427478;030.000.000;1;0
001;1223442622;031.000.000;1;0
001;1223443014;032.000.000;1;0
001;1223444954;028.000.000;4;0
001;1223444957;029.000.000;4;0
001;1223444961;030.000.000;4;0
001;1223444970;031.000.000;4;0
001;1223444972;032.000.000;4;0
(It stays stuck there. Is this the expected behaviour?)
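To make the test output above easier to read, the sketch below decodes each line. It assumes the SEG message format is 001;TIMESTAMP;JOBID;STATE;EXIT_CODE and that STATE uses the GRAM protocol job state codes (1 PENDING, 2 ACTIVE, 4 FAILED, 8 DONE, 16 SUSPENDED); both are assumptions to verify against the GT 4.2 documentation. Read that way, the output would show jobs 028.000.000 to 032.000.000 first reported as PENDING and later as FAILED. The function name is hypothetical.

```bash
# Hypothetical helper: turn SEG output lines read from stdin into readable text.
# Assumed line format: 001;TIMESTAMP;JOBID;STATE;EXIT_CODE
# Assumed state codes: GRAM protocol job states (verify against the GT docs).
decode_seg() {
    awk -F';' '
        BEGIN {
            name[1] = "PENDING"; name[2] = "ACTIVE"; name[4] = "FAILED"
            name[8] = "DONE";    name[16] = "SUSPENDED"
        }
        $1 == "001" && NF == 5 {
            # Unknown codes are shown as STATE_<n> rather than guessed.
            state = ($4 in name) ? name[$4] : "STATE_" $4
            printf "%s job=%s state=%s exit=%s\n", $2, $3, state, $5
        }
    '
}
```

Usage: `$GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s condor -t 1 | decode_seg`. The timestamps are Unix epoch seconds; GNU date converts one with e.g. `date -d @1223444972`.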
Audit logging
Skipped for now.
Testing GRAM2
(Do the actual test once we have the 3rd Condor node as an Execute node.)
http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/gram2-admin-testing.html
The documentation suggests testing it like this:
% grid-proxy-init -debug -verify
% globus-personal-gatekeeper -start
GRAM Contact: grid1.ramscommunity.org:4589:/O=Grid/O=Globus/CN=Your Name
% globus-job-run "grid1.ramscommunity.org:4589:/O=Grid/O=Globus/CN=Your Name" /bin/date
% globus-personal-gatekeeper -killall
% grid-proxy-destroy
This fails, but I was advised that I should test the real gatekeeper/jobmanager (port 2119):
Ref http://www.globus.org/toolkit/docs/2.4/admin/guide-user.html#gram http://gridinfo.niees.ac.uk/index.php/Using_Globus_4.0.1_at_NIEeS
# su - yoichi
$ myproxy-logon -s grid2
Enter MyProxy pass phrase:
A credential has been received for user yoichi in /tmp/x509up_u500.
(just ping the gatekeeper)
$ globusrun -a -r grid2.ramscommunity.org/jobmanager-condor
GRAM Authentication test successful
$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)
Apparently this is a bug related to OpenSSL on this platform.
Patches applied to grid1
I did the following, as per the advice from gt-user@globus.org:
(This patch improves the error message so that it reports the cause of the error.)
# su - globus
[globus@grid1 ~]$ wget http://www.mcs.anl.gov/~bester/patches/globus_gram_protocol-7.5.tar.gz
[globus@grid1 ~]$ gpt-build globus_gram_protocol-7.5.tar.gz gcc32dbg gcc32dbgpthr
gpt-build ====> CHECKING BUILD DEPENDENCIES FOR globus_gram_protocol
gpt-build ====> Changing to /home/globus/BUILD/globus_gram_protocol-7.5/
gpt-build ====> BUILDING FLAVOR gcc32dbg
gpt-build ====> Changing to /home/globus/BUILD
gpt-build ====> Changing to /home/globus/BUILD/globus_gram_protocol-7.5/
gpt-build ====> BUILDING FLAVOR gcc32dbgpthr
gpt-build ====> Changing to /home/globus/BUILD
[globus@grid1 ~]$ exit
(exit the globus account but stay on this host)
[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ grid-proxy-init
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
Enter GRID pass phrase for this identity:
Creating proxy .............................................. Done
Your proxy is valid until: Tue Nov 4 23:47:52 2008
[yoichi@grid1 ~]$ cat a.rsl
&(executable="/bin/env")(stdout="https://grid1.ramscommunity.org:40050/dev/stdout")
[yoichi@grid1 ~]$ globusrun -r grid2 -f a.rsl
globus_gram_client_callback_allow successful
GRAM Job submission failed because globus_xio: globus_l_xio_gsi_wrapped_buffer_to_iovec failed.
GSS Major Status: General failure
globus_gsi_gssapi: internal problem with SSL BIO: SSL_read rc=-1
OpenSSL Error: s3_pkt.c:438: in library: SSL routines, function SSL3_GET_RECORD: bad decompression (error code 10)
This revealed an SSL problem, so I was advised to install globus_gssapi_gsi-5.4.tar.gz from http://www.globus.org/toolkit/advisories.html.
[root@grid1 ~]# su - globus
[globus@grid1 ~]$ wget http://www-unix.globus.org/ftppub/gt4/4.2.0/updates/src/globus_gssapi_gsi-5.4.tar.gz
--01:21:20-- http://www-unix.globus.org/ftppub/gt4/4.2.0/updates/src/globus_gssapi_gsi-5.4.tar.gz
Resolving www-unix.globus.org... 192.5.186.90
Connecting to www-unix.globus.org|192.5.186.90|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 251220 (245K) [application/x-tar]
Saving to: `globus_gssapi_gsi-5.4.tar.gz'
100%[===================================================================>] 251,220 153K/s in 1.6s
01:21:26 (153 KB/s) - `globus_gssapi_gsi-5.4.tar.gz' saved [251220/251220]
[globus@grid1 ~]$ gpt-build globus_gssapi_gsi-5.4.tar.gz gcc32dbg gcc32dbgpthr
gpt-build ====> CHECKING BUILD DEPENDENCIES FOR globus_gssapi_gsi
gpt-build ====> Changing to /home/globus/BUILD/globus_gssapi_gsi-5.4/
gpt-build ====> BUILDING FLAVOR gcc32dbg
gpt-build ====> Changing to /home/globus/BUILD
gpt-build ====> Changing to /home/globus/BUILD/globus_gssapi_gsi-5.4/
gpt-build ====> BUILDING FLAVOR gcc32dbgpthr
gpt-build ====> Changing to /home/globus/BUILD
[globus@grid1 ~]$ exit
[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ grid-proxy-init
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
Enter GRID pass phrase for this identity:
Creating proxy .................................................................... Done
Your proxy is valid until: Wed Nov 5 13:27:58 2008
[yoichi@grid1 ~]$ globus-gass-server -o -e -p 40050
https://grid1.ramscommunity.org:40050
(blocks)
(much later I got the stdout)

(in another shell)
[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ export GLOBUS_GSSAPI_DEBUG_LEVEL=3
[yoichi@grid1 ~]$ cat a.rsl
&(executable="/bin/env")(stdout="https://grid1.ramscommunity.org:40050/dev/stdout")
[yoichi@grid1 ~]$ globusrun -r grid2 -f a.rsl
...
GRAM Job submission successful
...
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
...
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
...
...
_CONDOR_ANCESTOR_15155=15156:1225808947:1025947456
_CONDOR_ANCESTOR_4714=15155:1225808947:90104122
_CONDOR_ANCESTOR_4708=4714:1225630528:949999424
LD_LIBRARY_PATH=
_CONDOR_SCRATCH_DIR=/scratch/condor/execute/dir_15155
_CONDOR_SLOT=1
_CONDOR_HIGHPORT=9670
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://grid2.ramscommunity.org:40002/
GLOBUS_GRAM_JOB_CONTACT=https://grid2.ramscommunity.org:40001/8398/1225808944/
_CONDOR_LOWPORT=9620
LOGNAME=yoichi
GLOBUS_LOCATION=/usr/local/globus
X509_USER_PROXY=/home/yoichi/.globus/job/grid2.ramscommunity.org/8398.1225808944/x509_up
HOME=/home/yoichi
Now it seems it works.
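The a.rsl file used in these tests can also be generated rather than written by hand, which helps when the GASS server host or port changes. A minimal sketch with a hypothetical function name; it does not implement RSL escaping, so it simply refuses values containing a double quote.

```bash
# Hypothetical helper: build a one-line RSL like a.rsl above, sending the job's
# stdout to a GASS server URL (e.g. one started with globus-gass-server).
make_stdout_rsl() {
    local executable="$1" gass_url="$2"
    # Refuse values containing a double quote rather than guessing RSL escaping.
    case "$executable$gass_url" in *'"'*) return 1 ;; esac
    printf '&(executable="%s")(stdout="%s/dev/stdout")\n' "$executable" "$gass_url"
}
```

Usage, reproducing the file from the test above:
`make_stdout_rsl /bin/env https://grid1.ramscommunity.org:40050 > a.rsl` followed by `globusrun -r grid2 -f a.rsl`.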
On advice from gt-user@globus.org, I also made sure the users have GLOBUS_TCP_PORT_RANGE set in their environment.
# su - yoichi
$ cat /etc/profile
...
export GLOBUS_LOCATION=/usr/local/globus
source $GLOBUS_LOCATION/etc/globus-user-env.sh
export GLOBUS_TCP_PORT_RANGE=40000,41000
...
Patches applied to grid2 and grid4
Since grid2 and grid4 have Globus installed, the globus_gssapi_gsi patch was applied to them, too.
Re-test
[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ myproxy-logon -s grid2
Enter MyProxy pass phrase:
A credential has been received for user yoichi in /tmp/x509up_u500.
[yoichi@grid1 ~]$ globusrun -a -r grid2.ramscommunity.org/jobmanager-condor
GRAM Authentication test successful
[yoichi@grid1 ~]$ globus-job-run grid2.ramscommunity.org/jobmanager-fork /bin/hostname
grid2.ramscommunity.org
(this works now)
[yoichi@grid1 ~]$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
(blocks)

(in another shell)
[yoichi@grid1 ~]$ condor_status
Name               OpSys  Arch   State  Activity LoadAv Mem ActvtyTime
grid1.ramscommunit LINUX  INTEL  Owner  Idle     0.000  249 0+00:05:04
grid4.ramscommunit LINUX  INTEL  Owner  Idle     0.030  503 0+00:05:04

             Total Owner Claimed Unclaimed Matched Preempting Backfill
 INTEL/LINUX     2     2       0         0       0          0        0
       Total     2     2       0         0       0          0        0
(hosts are busy)

(several minutes later)
grid4.ramscommunity.org
So, it seems it works now.
