4.9 Globus - GRAM2 (gsigatekeeper, jobmanager) on grid2

Overview

On grid2, we need to install/configure an additional service: GRAM2.

I presume that Pegasus generates DAGMan jobs as either standard, vanilla or gt2 jobs. If the job is gt2, the GRAM2 client (Condor-G) on the Submit node (grid1) contacts the GRAM2 service (gatekeeper and jobmanager) on grid2. Condor-G is part of the Condor installation package. The Globus jobmanager is configured to use Condor, which is also installed on grid2.

Ref: http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/ -> http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/index.html ->
http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/gram2-admin-configuring.html

On grid1, only Condor-G, a MyProxy client and GridFTP may be required. Installing Condor on the Submit node, configuring it as a Submit node and submitting a gt2 job via condor_submit will probably invoke Condor-G.

Pre-requisites

  1. Condor installed and configured as a Submit node (also Execute node to test Vanilla/Standard jobs).
  2. If Condor is configured with USE_NFS = True, network-shared accounts (LDAP/Kerberos) and NFS-shared home directories are required. For this (with NFSv4), Kerberos, NFS, LDAP, nss_ldap and pam_krb5 must be installed and configured on RHEL 5.2. For details, refer to "3.7 LDAP Clients - For SSO with Kerberos and NFS" and other pages.
  3. Globus Toolkit installed and configured on grid2 already.
  4. Firewall: Ports must be open for GRAM2 client traffic (the gatekeeper listens on 2119/tcp) and for the GridFTP ephemeral ports. If a GridFTP server is used, 2811/tcp should be open. If GT2 MDS GRIS/GIIS is used, 2135/tcp should be open.
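
The firewall item above can be sketched as a small script. This is only a sketch: it prints iptables commands for review instead of applying them, and the ephemeral range 40000:41000 is an assumption borrowed from the GLOBUS_TCP_PORT_RANGE value used later on this page. The function name print_gram2_rules is made up for illustration.

```shell
#!/bin/bash
# Print (do NOT apply) iptables rules for the ports named in the prerequisites.
# The default range 40000-41000 is an assumption taken from the
# GLOBUS_TCP_PORT_RANGE setting used later on this page.
print_gram2_rules() {
    local range_lo=${1:-40000} range_hi=${2:-41000}
    echo "iptables -I INPUT -p tcp --dport 2119 -j ACCEPT"   # gatekeeper
    echo "iptables -I INPUT -p tcp --dport 2811 -j ACCEPT"   # GridFTP server, if used
    echo "iptables -I INPUT -p tcp --dport 2135 -j ACCEPT"   # GT2 MDS GRIS/GIIS, if used
    echo "iptables -I INPUT -p tcp --dport ${range_lo}:${range_hi} -j ACCEPT"   # ephemeral ports
}

print_gram2_rules
```

Review the printed lines, drop the ones for services that are not in use, and run the rest as root.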

gsigatekeeper

# locate globus-gatekeeper.conf
/nfs/software/globus/4.2.0/etc/globus-gatekeeper.conf
/usr/local/globus/etc/globus-gatekeeper.conf

# cat /usr/local/globus/etc/globus-gatekeeper.conf

  -x509_cert_dir /etc/grid-security/certificates
  -x509_user_cert /etc/grid-security/hostcert.pem
  -x509_user_key /etc/grid-security/hostkey.pem
  -gridmap /etc/grid-security/grid-mapfile
  -home /usr/local/globus
  -e libexec
  -logfile var/globus-gatekeeper.log
  -port 2119
  -grid_services etc/grid-services
  -inetd

The gatekeeper uses these locations and files, most of which are already correct or installed. The one exception is the -inetd option: we use xinetd, but the gatekeeper has no separate xinetd option, so -inetd is used for xinetd as well.
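
Several values in this file (libexec, var/globus-gatekeeper.log, etc/grid-services) are relative paths. The sketch below assumes they are interpreted relative to the -home value, which is my reading of the file and not something verified here. The function name resolve_gatekeeper_paths is made up for illustration; it reads a conf file (or standard input) and prints the resolved paths so they can be checked with ls.

```shell
#!/bin/bash
# Resolve the relative option values of a gatekeeper conf against -home.
# Assumption: relative paths are taken relative to the -home directory.
resolve_gatekeeper_paths() {
    awk '
        # Remember every "option value" pair in file order.
        { opts[$1] = $2; order[++n] = $1 }
        END {
            home = opts["-home"]
            for (i = 1; i <= n; i++) {
                k = order[i]; v = opts[k]
                if (k == "-e" || k == "-logfile" || k == "-grid_services") {
                    if (v !~ /^\//) v = home "/" v   # prefix only relative paths
                    print k, v
                }
            }
        }' "$@"
}

# Demo with the values shown above.
resolve_gatekeeper_paths <<'EOF'
  -home /usr/local/globus
  -e libexec
  -logfile var/globus-gatekeeper.log
  -grid_services etc/grid-services
EOF
```

On the real host the same function can be pointed at /usr/local/globus/etc/globus-gatekeeper.conf.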

Make sure that /etc/services lists the service and port:

# cat /etc/services | grep 2119
gsigatekeeper	2119/tcp			# GSIGATEKEEPER
gsigatekeeper	2119/udp			# GSIGATEKEEPER

The ports are already registered and they are open in the firewall (not sure when I did this).

For xinetd, we need to create this file:

vi /etc/xinetd.d/globus-gatekeeper
-- insert --
service gsigatekeeper
{
   socket_type  = stream
   protocol     = tcp
   wait         = no
   user         = root
   env          = LD_LIBRARY_PATH=/usr/local/globus/lib
   server       = /usr/local/globus/sbin/globus-gatekeeper
   server_args  = -conf /usr/local/globus/etc/globus-gatekeeper.conf
   disable      = no
   env         += GLOBUS_TCP_PORT_RANGE=40000,41000
}

NB: A client may connect from an ephemeral port to the gatekeeper on 2119/tcp. If data is to be returned, the jobmanager may connect from an ephemeral port on the server to an ephemeral port on the client.

# /etc/rc.d/init.d/xinetd restart

or

# service xinetd reload

Now it is listening:

# netstat -aut | grep gatekeeper
tcp        0      0 *:gsigatekeeper             *:*                         LISTEN

Authentication for gatekeeper

The gatekeeper accepts incoming requests and passes them on to the jobmanager if the user's certificate subject has a matching entry in /etc/grid-security/grid-mapfile. Users were already added to the grid-mapfile when their credentials were created by MyProxy.

# cat /etc/grid-security/grid-mapfile
"/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama" yoichi
"/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Luke Foxton" lfoxton
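
To check which local account a given certificate subject maps to, a lookup like the following may help. It is a minimal sketch that assumes the simple format shown above (a quoted DN followed by one account name per line); the function name gridmap_lookup is made up for illustration.

```shell
#!/bin/bash
# Print the local account mapped to a certificate subject (DN).
# Assumes the simple grid-mapfile format: "DN" account, one mapping per line.
# Returns non-zero if the DN has no entry.
gridmap_lookup() {
    local dn=$1 mapfile=${2:-/etc/grid-security/grid-mapfile}
    awk -v dn="$dn" '
        # Split on double quotes: field 2 is the DN, field 3 is the account.
        BEGIN { FS = "\"" }
        $2 == dn { gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3; found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$mapfile"
}
```

For example, gridmap_lookup "/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama" should print yoichi with the mapfile above.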

jobmanager

Ref: http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/gram2-admin-jobmanager.html

The jobmanager was already configured as part of configuring the Globus installation (./configure --with-gram-condor), and it was re-configured after hostcert.pem was installed (refer to the Globus installation section).

# cat $GLOBUS_LOCATION/etc/globus-job-manager.conf

	-home "/usr/local/globus"
	-globus-gatekeeper-host grid2.ramscommunity.org
	-globus-gatekeeper-port 2119
	-globus-gatekeeper-subject "/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/CN=host/grid2.ramscommunity.org"
	-globus-host-cputype i686
	-globus-host-manufacturer pc
	-globus-host-osname Linux
	-globus-host-osversion 2.6.18-92.1.10.el5
        -globus-toolkit-version 4.2.0
	-save-logfile on_error
	-state-file-dir /usr/local/globus/tmp/gram_job_state
	-machine-type unknown

It also uses this jobmanager-condor definition in grid-services (auto-generated during the installation):

# cat $GLOBUS_LOCATION/etc/grid-services/jobmanager-condor

stderr_log,local_cred - /usr/local/globus/libexec/globus-job-manager globus-job-manager -conf /usr/local/globus/etc/globus-job-manager.conf -type condor -rdn jobmanager-condor -machine-type unknown -publish-jobs -condor-arch INTEL -condor-os LINUX

Scheduler Event Generator / Job Manager Integration

The event generator is supposed to be started.

Adding a -seg entry to globus-job-manager.conf is supposed to instruct the job manager to use the event generator.

# vi $GLOBUS_LOCATION/etc/globus-job-manager.conf
-- insert --
-seg

It uses globus-job-manager-seg.conf, which must have been configured beforehand:

$ cat $GLOBUS_LOCATION/etc/globus-job-manager-seg.conf

condor_log_path=/usr/local/globus/var/globus-job-manager-seg-condor
condor_test_log_path=/usr/local/globus/var/globus-job-manager-seg-condor_test
test_log_path=/usr/local/globus/var/globus-job-manager-seg-test

It seems OK.

$ $GLOBUS_LOCATION/sbin/globus-job-manager-event-generator -scheduler condor

Hmmm, it freezes up and never returns... I suspended it with ctrl-z and sent it to the background with bg. It seems that it returns some time later.

$ ps -ef
...
globus    5250  5208  0 17:46 pts/1    00:00:00 perl /usr/local/globus/sbin/globus-job-manager-event-generator -s condor
globus    5251  5250  0 17:46 pts/1    00:00:00 /usr/local/globus/libexec/globus-scheduler-event-generator -s condor -t 1223444972
...

Test:

# su - globus
$ $GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s condor  -t 1

001;1223347759;028.000.000;1;0
001;1223348671;029.000.000;1;0
001;1223427478;030.000.000;1;0
001;1223442622;031.000.000;1;0
001;1223443014;032.000.000;1;0
001;1223444954;028.000.000;4;0
001;1223444957;029.000.000;4;0
001;1223444961;030.000.000;4;0
001;1223444970;031.000.000;4;0
001;1223444972;032.000.000;4;0
(stuck there - Is this a correct reaction??)
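
To make these lines easier to read, the sketch below decodes them. It rests on two assumptions that I have not confirmed against this installation: that each line has the fields 001;timestamp;job id;state;exit code, and that the state numbers follow the GRAM job state constants (1 = PENDING, 2 = ACTIVE, 4 = FAILED, 8 = DONE). If that mapping holds, the output above would show jobs 28 to 32 going from pending to failed. The function name decode_seg_events is made up for illustration.

```shell
#!/bin/bash
# Decode scheduler event generator lines into a readable form.
# Assumed line format: 001;TIMESTAMP;JOBID;STATE;EXIT_CODE
# Assumed state numbers: GRAM job state constants (1, 2, 4, 8).
decode_seg_events() {
    awk -F';' '
        BEGIN {
            name[1] = "PENDING"; name[2] = "ACTIVE"
            name[4] = "FAILED";  name[8] = "DONE"
        }
        # Only handle well-formed job state change lines.
        $1 == "001" && NF == 5 {
            state = ($4 in name) ? name[$4] : "UNKNOWN(" $4 ")"
            printf "%s job=%s state=%s exit=%s\n", $2, $3, state, $5
        }
    ' "$@"
}

# Demo with two of the lines shown above.
printf '%s\n' '001;1223347759;028.000.000;1;0' '001;1223444954;028.000.000;4;0' | decode_seg_events
```

The event generator output can be piped straight into decode_seg_events.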

Audit logging

Skipped for now.

Testing GRAM2

(Do the actual test when we have the 3rd Condor node as an Execute node.)

http://www.globus.org/toolkit/docs/4.2/4.2.0/execution/gram2/admin/gram2-admin-testing.html

It is supposed to be tested like this:

% grid-proxy-init -debug -verify
% globus-personal-gatekeeper -start

GRAM Contact: grid1.ramscommunity.org:4589:/O=Grid/O=Globus/CN=Your Name

% globus-job-run "grid1.ramscommunity.org:4589:/O=Grid/O=Globus/CN=Your Name" /bin/date

% globus-personal-gatekeeper -killall
% grid-proxy-destroy

This fails, but I was advised that I should test the real gatekeeper/jobmanager (port 2119):

Ref http://www.globus.org/toolkit/docs/2.4/admin/guide-user.html#gram http://gridinfo.niees.ac.uk/index.php/Using_Globus_4.0.1_at_NIEeS

# su - yoichi
$ myproxy-logon -s grid2

Enter MyProxy pass phrase:
A credential has been received for user yoichi in /tmp/x509up_u500.

(just ping the gatekeeper)
$ globusrun -a -r grid2.ramscommunity.org/jobmanager-condor

GRAM Authentication test successful

$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)

Apparently this is a bug to do with openssl and the platform.

Patches applied to grid1

I did the following as per the advice from gt-user@globus.org:

(This patch improves the error message so that it reports the cause of the error.)

# su - globus
[globus@grid1 ~]$ wget http://www.mcs.anl.gov/~bester/patches/globus_gram_protocol-7.5.tar.gz
[globus@grid1 ~]$ gpt-build globus_gram_protocol-7.5.tar.gz gcc32dbg gcc32dbgpthr
gpt-build ====> CHECKING BUILD DEPENDENCIES FOR globus_gram_protocol
gpt-build ====> Changing to /home/globus/BUILD/globus_gram_protocol-7.5/
gpt-build ====> BUILDING FLAVOR gcc32dbg
gpt-build ====> Changing to /home/globus/BUILD
gpt-build ====> Changing to /home/globus/BUILD/globus_gram_protocol-7.5/
gpt-build ====> BUILDING FLAVOR gcc32dbgpthr
gpt-build ====> Changing to /home/globus/BUILD
[globus@grid1 ~]$ exit (exiting the globus account but not this host)

[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ grid-proxy-init
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
Enter GRID pass phrase for this identity:
Creating proxy .............................................. Done
Your proxy is valid until: Tue Nov  4 23:47:52 2008

[yoichi@grid1 ~]$ cat a.rsl
&(executable="/bin/env")(stdout="https://grid1.ramscommunity.org:40050/dev/stdout")
[yoichi@grid1 ~]$ globusrun -r grid2 -f a.rsl
globus_gram_client_callback_allow successful
GRAM Job submission failed because globus_xio: globus_l_xio_gsi_wrapped_buffer_to_iovec failed.
GSS Major Status: General failure
globus_gsi_gssapi: internal problem with SSL BIO: SSL_read rc=-1
OpenSSL Error: s3_pkt.c:438: in library: SSL routines, function SSL3_GET_RECORD: bad decompression
 (error code 10)

This reported an SSL problem. I was then advised to install globus_gssapi_gsi-5.4.tar.gz from http://www.globus.org/toolkit/advisories.html.

[root@grid1 ~]# su - globus
[globus@grid1 ~]$ wget http://www-unix.globus.org/ftppub/gt4/4.2.0/updates/src/globus_gssapi_gsi-5.4.tar.gz
--01:21:20--  http://www-unix.globus.org/ftppub/gt4/4.2.0/updates/src/globus_gssapi_gsi-5.4.tar.gz
Resolving www-unix.globus.org... 192.5.186.90
Connecting to www-unix.globus.org|192.5.186.90|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 251220 (245K) [application/x-tar]
Saving to: `globus_gssapi_gsi-5.4.tar.gz'

100%[===================================================================>] 251,220      153K/s   in 1.6s   

01:21:26 (153 KB/s) - `globus_gssapi_gsi-5.4.tar.gz' saved [251220/251220]

[globus@grid1 ~]$ gpt-build globus_gssapi_gsi-5.4.tar.gz gcc32dbg gcc32dbgpthr
gpt-build ====> CHECKING BUILD DEPENDENCIES FOR globus_gssapi_gsi
gpt-build ====> Changing to /home/globus/BUILD/globus_gssapi_gsi-5.4/
gpt-build ====> BUILDING FLAVOR gcc32dbg
gpt-build ====> Changing to /home/globus/BUILD
gpt-build ====> Changing to /home/globus/BUILD/globus_gssapi_gsi-5.4/
gpt-build ====> BUILDING FLAVOR gcc32dbgpthr
gpt-build ====> Changing to /home/globus/BUILD

[globus@grid1 ~]$ exit



[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ grid-proxy-init
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
Enter GRID pass phrase for this identity:
Creating proxy .................................................................... Done
Your proxy is valid until: Wed Nov  5 13:27:58 2008
[yoichi@grid1 ~]$ globus-gass-server -o -e -p 40050
https://grid1.ramscommunity.org:40050

(blocks)
(much later I got the stdout)



(in another shell)

[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ export GLOBUS_GSSAPI_DEBUG_LEVEL=3
[yoichi@grid1 ~]$ cat a.rsl
&(executable="/bin/env")(stdout="https://grid1.ramscommunity.org:40050/dev/stdout")
[yoichi@grid1 ~]$ globusrun -r grid2 -f a.rsl
...
GRAM Job submission successful
...
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
...
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
...
...
_CONDOR_ANCESTOR_15155=15156:1225808947:1025947456
_CONDOR_ANCESTOR_4714=15155:1225808947:90104122
_CONDOR_ANCESTOR_4708=4714:1225630528:949999424
LD_LIBRARY_PATH=
_CONDOR_SCRATCH_DIR=/scratch/condor/execute/dir_15155
_CONDOR_SLOT=1
_CONDOR_HIGHPORT=9670
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://grid2.ramscommunity.org:40002/
GLOBUS_GRAM_JOB_CONTACT=https://grid2.ramscommunity.org:40001/8398/1225808944/
_CONDOR_LOWPORT=9620
LOGNAME=yoichi
GLOBUS_LOCATION=/usr/local/globus
X509_USER_PROXY=/home/yoichi/.globus/job/grid2.ramscommunity.org/8398.1225808944/x509_up
HOME=/home/yoichi

Now it seems it works.
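
The a.rsl file used above pairs an executable with a stdout URL that points at the GASS server started with globus-gass-server. A small helper like this one can write such a file for other commands; it is only a sketch, and the function name make_rsl is made up for illustration.

```shell
#!/bin/bash
# Print a one-line RSL that runs an executable and sends its stdout
# to a GASS server, in the same form as the a.rsl file used above.
make_rsl() {
    local exe=$1 gass_url=$2
    printf '&(executable="%s")(stdout="%s/dev/stdout")\n' "$exe" "$gass_url"
}

# Reproduces the a.rsl content shown above.
make_rsl /bin/env https://grid1.ramscommunity.org:40050
```

Redirect the output to a file (for example make_rsl /bin/hostname https://grid1.ramscommunity.org:40050 > b.rsl) and pass it to globusrun with -f as before.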

On advice from gt-user@globus.org, I also made sure that users get GLOBUS_TCP_PORT_RANGE in their environment.

# su - yoichi
$ cat /etc/profile
...
export GLOBUS_LOCATION=/usr/local/globus
source $GLOBUS_LOCATION/etc/globus-user-env.sh
export GLOBUS_TCP_PORT_RANGE=40000,41000
...
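
A quick way to confirm that a port, such as the 40050 used for the GASS server above, falls inside the configured range is sketched below. It assumes the comma-separated "min,max" form shown in /etc/profile; the function name port_in_range is made up for illustration.

```shell
#!/bin/bash
# Succeed if a port lies inside a "min,max" range.
# Defaults to the GLOBUS_TCP_PORT_RANGE value from the environment.
port_in_range() {
    local port=$1 range=${2:-$GLOBUS_TCP_PORT_RANGE}
    local lo=${range%%,*} hi=${range##*,}
    # An empty or unset range never matches.
    [ -n "$lo" ] && [ -n "$hi" ] && [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]
}

port_in_range 40050 40000,41000 && echo "40050 is inside the range"
```

The same range has to be open in the firewall on both ends for the callbacks to get through.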

Patches applied to grid2 and grid4

Since grid2 and grid4 also have Globus installed, the globus_gssapi_gsi patch was applied to these, too.

Re-test

[root@grid1 ~]# su - yoichi
[yoichi@grid1 ~]$ myproxy-logon -s grid2
Enter MyProxy pass phrase:
A credential has been received for user yoichi in /tmp/x509up_u500.
[yoichi@grid1 ~]$ globusrun -a -r grid2.ramscommunity.org/jobmanager-condor

GRAM Authentication test successful

[yoichi@grid1 ~]$ globus-job-run grid2.ramscommunity.org/jobmanager-fork /bin/hostname
grid2.ramscommunity.org
(this works now)

[yoichi@grid1 ~]$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
(blocks)

(in another shell)
[yoichi@grid1 ~]$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

grid1.ramscommunit LINUX      INTEL  Owner     Idle     0.000   249  0+00:05:04
grid4.ramscommunit LINUX      INTEL  Owner     Idle     0.030   503  0+00:05:04

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX     2     2       0         0       0          0        0

               Total     2     2       0         0       0          0        0
(hosts are busy)

(several minutes later)
grid4.ramscommunity.org

So, it seems it works now.
