Manuals for GEMS

This manual is intended to help users get started with the GEMS tools. GEMS users will typically only need to refer to the client section. Adding storage to an existing system requires reviewing the storage section, and creating a new system requires the server section.

The GEMS API docs are included for users who want to access the GEMS classes directly from their applications. This is particularly useful for Java and Beanshell developers who want full control of the GEMS clients. Additionally, a Java API is provided for Chirp.

GEMS client tools

Overview

The GEMS system stores datasets, specifically directory trees, called "configs", on a distributed network of file servers. The files are replicated automatically by the system. Metadata specified by the users is kept in a central database that may be queried for information about stored configs. The GEMS clients allow the user to create, view, and download configs that are stored in the system via a GUI. Tools are also included to insert, query, and retrieve configs programmatically from the shell.

Setup

You have two options: using the JAR file or installing the complete program including the scripts.
Note: If the full GEMS services are installed on your system, you will have access to the client tools as well.
JAR File Client install
  1. Benefits: This is the simpler method, requires no compilation, and enables the user to distribute the client over the compute grid quite easily.
  2. This option simply requires the Java SDK 1.4.2 or better. Type

    java -version

    to see if you are capable of running the GEMS clients.
  3. Download the appropriate JAR file from the Downloads page.
  4. To start up the GUI, just double-click on GEMS.jar or type

    > java -jar GEMS.jar

  5. To get started with the command line tools, type

    > java -jar GEMS.jar --help

Complete Client install
  1. Benefits: Allows the user to run GEMS clients via the scripts, which manage the CLASSPATH automatically.
  2. This option requires the JDK 1.4.2 or later, Ant 1.6.5 or later, and the sh shell.
  3. Download the complete source tarball from the Downloads page.
  4. Unpack the tarball, for example,

    > bunzip2 GEMS-0.9.0-src.tar.bz2
    > tar xf GEMS-0.9.0-src.tar

  5. Build:

    > ant -Ddist.dir=${HOME}/GEMS client dist

    where dist.dir indicates that the tools will be installed in ${HOME}/GEMS
  6. Following the example above, add ${HOME}/GEMS/bin to your PATH in your startup script, such as .bashrc, like so:

    PATH=${PATH}:${HOME}/GEMS/bin

  7. You should now be able to run GEMSput like so:

    > GEMSput

    and obtain the response No args.

Browsing GEMS

A graphical method to view the records stored in GEMS is provided by the GEMSview tool. If using the JAR file method, double-click on the GEMS.jar file or type

> java -jar GEMS.jar &

or, if using the client install,

> GEMSview &

Main window


To get started with GEMSview:
  1. Type the name of your GEMS installation in the Server field.
  2. The Account field allows you to specify an authentication mode when downloading files or creating new configs. Check the ACL on the config you wish to download to determine the mode you can use.
  3. The left hand panel allows you to enter keys and values to match against configs in the system. Click match to obtain your results.
  4. The Preview panel allows you to download all files in the config by clicking on the link. Config titles (use the title tag) are underlined.
  5. Additional features below are available when entering the Details panel.
  6. When in the Details panel, GEMSview lets you view the Params or the ACL for the config.
  7. Right-clicking on the config key allows you to copy the number, which is useful for pasting to the command line.
  8. The bottom right panel displays the file information for this config. Right-clicking on a file name allows you to preview this file, download the file, or download all the files from this config.
  9. You may delete this config if you have write access by right-clicking on the owner name.
  10. To simplify using the GEMSview tool, see the .gemsclient startup file info below.
  11. You may put a new config into GEMS by clicking Put.

Storage Map

Each config in GEMS is associated with a Storage Map. The map indicates which Chirp hosts are eligible to store the files in the config, and allows the files to be replicated to separate clusters to improve survivability.
  • Each cluster in the map has a nickname that you may specify.
  • A cluster consists of a list of patterns. The cluster is made up of Chirp servers in the Chirp Catalog that match one of these patterns.
  • Your data files will only be stored on these clusters.
  • The GEMS Replicator will attempt to place one replica in each cluster before creating additional replicas.
  • You may create stand-alone maps for future use by saving the map to a file.
  • The Default setting is stored in your .gemsclient file.
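
To make the cluster rules above concrete, here is a minimal Python sketch. It assumes that map patterns behave like shell-style wildcards and that "one replica in each cluster first" amounts to a round-robin over clusters. Both are assumptions for illustration, not a description of the GEMS Replicator itself, and the host names and function names are hypothetical.

```python
from fnmatch import fnmatchcase

def hosts_by_cluster(storage_map, catalog_hosts):
    """Map each cluster nickname to the catalog hosts matching one of its patterns.

    storage_map maps a nickname to a list of wildcard patterns.
    Shell-style matching is an assumption, not GEMS's actual matcher.
    """
    return {
        nickname: [
            host for host in catalog_hosts
            if any(fnmatchcase(host, pattern) for pattern in patterns)
        ]
        for nickname, patterns in storage_map.items()
    }

def placement_order(clusters):
    """List hosts so that every cluster is visited once before any is visited twice."""
    order = []
    depth = max((len(hosts) for hosts in clusters.values()), default=0)
    for i in range(depth):
        for hosts in clusters.values():
            if i < len(hosts):
                order.append(hosts[i])
    return order

# Hypothetical map and catalog contents.
storage_map = {
    "university1": ["*.university1.edu"],
    "university2": ["*.university2.edu"],
}
catalog = ["a.university1.edu", "b.university1.edu",
           "c.university2.edu", "d.elsewhere.org"]
clusters = hosts_by_cluster(storage_map, catalog)
```

Hosts that match no pattern (d.elsewhere.org here) never appear in the result, which mirrors the rule that data files are stored only on the listed clusters.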

Access Control

Each config in GEMS is associated with an Access Control List (ACL). These ACLs are based on the ACL used in the underlying Chirp system, and are similar to AFS ACLs, for example.
  • Each user is identified by an authentication mode and a name, for example unix:sorin.
  • A wildcard (*) may be used to match multiple users. (unix:* is useful to match all users on your filesystem.) Note that wildcard matching here is similar to the pattern matching in the Storage Map above.
  • A user may have read permission for the config or admin privileges. Admin privileges allow the user to modify or delete the whole config. (In Chirp, read is rl and admin is lrwda.)
  • The config owner and the GEMS system always have admin privileges.
  • You may create stand-alone ACLs for future use by saving the ACL to a file.
  • The Default setting is stored in your .gemsclient file.
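
The following Python sketch illustrates how such an ACL could be evaluated. It uses only the two privilege levels and the Chirp rights strings quoted above, and it assumes shell-style wildcard matching on the whole mode:name identity. The function name and the sample ACL are hypothetical, and the real check is performed by GEMS and Chirp, not by client code like this.

```python
from fnmatch import fnmatchcase

# Chirp rights strings for the two privilege levels described above.
RIGHTS = {"read": "rl", "admin": "lrwda"}

def allowed(acl, identity, needed="read"):
    """Return True if an ACL entry matching identity grants every right in `needed`.

    acl is a list of (principal pattern, privilege name) pairs.
    """
    wanted = set(RIGHTS[needed])
    for pattern, privilege in acl:
        if fnmatchcase(identity, pattern) and wanted <= set(RIGHTS[privilege]):
            return True
    return False

# Hypothetical ACL: every unix user may read, the GEMS host is an admin.
acl = [("unix:*", "read"), ("hostname:gems.university.edu", "admin")]
```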

Inserting into GEMS

Overview

Datasets in GEMS are called configs. Each config consists of a directory subtree of files in which the path information is maintained, as in tar or zip archives. Each config is associated with a set of metadata tags called params that may be searched as above with GEMSview or GEMSmatch. Additionally, a Chirp-formatted Access Control List (ACL) is associated with each config. This ACL is propagated to each server that hosts this config. At a minimum, GEMS will grant access to itself and to the submitting user or owner.

Technique

The standard way to insert data into GEMS is to use GEMSput, either by putting all the information on the command line or by creating a GEMSput XML document to specify the operation.
  1. Determine the directory under which all of your files for this config reside. Move to that directory. In the example below, the directory is ~/exp01.
  2. Determine the username that you use when contacting a Chirp system. In this example, this is unix:sorin. This indicates that the user will be able to authenticate to all eligible Chirp servers by this name. See the Chirp documentation for more information about authentication methods in Chirp.
  3. Determine a set of metadata tags that you will be able to search for later. In this example, the scientist's name is Sorin, and the application is exp.
  4. Determine the files that will be stored and replicated by GEMS. In this case, there are two files, exp.in and exp.out. All files under the current directory may be inserted automatically as described below.
  5. Determine which users will be able to access your files: in this case, all UNIX users that can gain access to the Chirp network will be able to download the whole config. If you would prefer that only you have access, omit the ACL entries.
  6. Determine which Chirp servers will be allowed to serve your data. You may construct a storage map and use the --map option to specify the eligible servers.
  7. Type:

    ~/exp01> GEMSput --owner unix:sorin \
        scientist_lname=Sorin app=exp count=1 \
        --file exp.in --file exp.out \
        --acl-entry "unix:*" read \
        --acl-entry "hostname:*.university.edu" read

    Note that JAR file users must run GEMSput as

    java -jar GEMS.jar put

    but the arguments remain the same.
  8. To obtain the list of options at any time just pass in the argument --help.
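
When GEMSput is driven from a script, it can help to assemble the arguments programmatically. The Python sketch below builds the argument list for the example above using only flags documented in this manual. The helper name gemsput_args is hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex

def gemsput_args(owner, params, files, acl_entries):
    """Build a GEMSput argument list from the choices made in steps 1 to 6."""
    args = ["GEMSput", "--owner", owner]
    args += [f"{key}={value}" for key, value in params.items()]
    for name in files:
        args += ["--file", name]
    for principal, perms in acl_entries:
        args += ["--acl-entry", principal, perms]
    return args

args = gemsput_args(
    owner="unix:sorin",
    params={"scientist_lname": "Sorin", "app": "exp", "count": "1"},
    files=["exp.in", "exp.out"],
    acl_entries=[("unix:*", "read"), ("hostname:*.university.edu", "read")],
)
# Print a shell-quoted form for review before running anything.
print(shlex.join(args))
```

To execute the command, pass args to subprocess.run with cwd set to the config directory (~/exp01 in the example). JAR file users would replace the first element with java, -jar, GEMS.jar, put.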

GEMSput Options

Flag                             Description
-i <host>                        The hostname on which the GEMS services are running. If omitted, defaults to the local machine (127.0.0.1).
-p <port>                        The port number on which the GEMS server is listening. If omitted, defaults to 7101.
--owner <identity>               Set the owner of this config to identity, a Chirp-formatted identity.
--host <host>                    Recommend a target Chirp host for the initial upload.
--localhost                      Recommend the local Chirp host for the initial upload.
<key>=<value>                    Adds a param pair to the params list.
--reps <count>                   All of the following files will have this replica count, up to the next --reps flag.
--file <file>                    Adds file to the set of files to be uploaded.
--find                           Adds all files under the current directory and subdirectories to the set of files to be uploaded.
--key                            Print the config key number that corresponds to this new config.
--hosts                          Print the name of the Chirp host actually used for the upload.
--acl-entry <principal> <perms>  Add an ACL entry.
--acl <file.xml>                 Specify an ACL file. ACL files may be generated with the GEMSview ACL GUI described above.
--map <file.xml>                 Specify a map file. Storage map files may be generated with the GEMSview map GUI described above.
--auto-params <prog> <args>      Specify an external program or script that will generate the params for this config. The program must print one param pair per line: the first word is the tag name, and the rest of the line is the value.
--debug                          Turn on verbose debugging.
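
As an illustration of the --auto-params output format, here is a minimal sketch of such a program in Python. The param values are hypothetical placeholders (a real script would derive them from the dataset), and the script name you save it under is up to you.

```python
#!/usr/bin/env python3
import sys

def param_lines(params):
    """Format params one per line: the first word is the tag, the rest is the value."""
    lines = []
    for tag, value in params.items():
        # A tag containing whitespace would be split into tag and value.
        if not tag or any(ch.isspace() for ch in tag):
            raise ValueError(f"tag must be a single word: {tag!r}")
        lines.append(f"{tag} {value}")
    return lines

def main(argv):
    # Hypothetical params; replace with values computed from your data.
    params = {
        "scientist_lname": "Sorin",
        "app": "exp",
        "title": "Experiment 01 input and output",
    }
    # Any extra arguments are recorded as a free-form note.
    if len(argv) > 1:
        params["note"] = " ".join(argv[1:])
    print("\n".join(param_lines(params)))

if __name__ == "__main__":
    main(sys.argv)
```

Because the value is the rest of the line, values may contain spaces, but tags may not.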

Searching in GEMS

Overview

Once configs have been stored in GEMS, it becomes necessary to be able to search for them. GEMSmatch matches sets of params to config keys, which can then be used to download the files. A typical GEMSmatch may produce multiple matches, resulting in multiple config keys. The user may retrieve a full XML-formatted response, or a simple list of keys.

Technique

  1. A simple match to obtain all of the config keys that scientist Sorin has submitted:

    > GEMSmatch scientist_lname=Sorin --keys
    141325

    This will simply print the config keys, one per line. Multiple param pairs may be put on the same command line to further restrict the result set.
  2. If the user needs to retrieve the param and file information for a known config key in XML, the user may submit:

    > GEMSmatch --config 141325

    which produces the full XML.
  3. Non-XML output may be obtained using the command line options below.
  4. A typical GEMSmatch XML response is shown below:
    <GEMS>
      <GEMSmatch_Response>
    
      <configs>
      <config key="24858230" owner="unix:sorin"> 
    
       <params> 
        <scientist_lname>Sorin</scientist_lname>
        <app>exp</app>
       </params> 
    
       <files>
          <file path="./" name="exp.in" 
                type="exp input" reps="4"  io="i">
             <host>machine.university.edu</host>
             <host>host2.university.edu</host> </file>
          <file path="./" name="exp.out" 
                type="exp output" reps="2"  io="o" >
             <host>host1.university.edu</host>
             <host>host2.university.edu</host> </file>
       </files>
    
      </config>
      </configs>
      </GEMSmatch_Response>
    </GEMS>
    
    • Note that multiple <config> elements may be returned under the <configs> element.
    • Params are returned as originally specified by GEMSput.
    • Note that each file is now associated with multiple hosts, which will often differ from the original <host> suggested to GEMSput. These are the hosts from which this file may be retrieved. Retrieving these files by hand is tedious; the next tool, GEMSget, automates this process.
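
Scripts that consume the XML response can parse it with any XML library. The sketch below uses Python's standard xml.etree.ElementTree and relies only on the element and attribute names visible in the sample response. The function name summarize is hypothetical, and the embedded sample is an abbreviated copy of the response format.

```python
import xml.etree.ElementTree as ET

def summarize(response_xml):
    """Return one dict per <config>: key, owner, params, and hosts per file."""
    root = ET.fromstring(response_xml)
    summaries = []
    for config in root.iter("config"):
        params = {p.tag: (p.text or "").strip() for p in config.findall("params/*")}
        files = {
            f.get("path", "") + f.get("name", ""):
                [(h.text or "").strip() for h in f.findall("host")]
            for f in config.iter("file")
        }
        summaries.append({
            "key": config.get("key"),
            "owner": config.get("owner"),
            "params": params,
            "files": files,
        })
    return summaries

# Abbreviated sample in the format of a GEMSmatch response.
SAMPLE = """<GEMS><GEMSmatch_Response><configs>
 <config key="24858230" owner="unix:sorin">
  <params><scientist_lname>Sorin</scientist_lname><app>exp</app></params>
  <files>
   <file path="./" name="exp.in" type="exp input" reps="4" io="i">
    <host>machine.university.edu</host><host>host2.university.edu</host></file>
  </files>
 </config>
</configs></GEMSmatch_Response></GEMS>"""

result = summarize(SAMPLE)
```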

GEMSmatch Options

Flag             Description
-i <host>        The hostname on which the GEMS services are running. If omitted, defaults to the local machine (127.0.0.1).
-p <port>        The port number on which the GEMS server is listening. If omitted, defaults to 7101.
<file.xml>       Use this XML file. Only one such file may be submitted at a time.
-                Read XML file from stdin.
--config <key>   Ignore param pairs, just search for this config key.
<key>=<value>    Adds a param pair to the params list for searching.
--locate <file>  Print a valid full Chirp-formatted virtual filename for this abstract file. Abstract files are formatted as /<key>/path/file, or as path/file if a --config is given.
--hosts          When used with --locate, print all possible host locations for this file.
--keys           Print the config key number that corresponds to each matching config.
--params         Print the params for each matching config.
--files          Print the files and current replica count for each matching config.
--acls           Print the ACL for each matching config.
--maps           Print the storage map for each matching config.
--owners         Print the owner for each matching config.
--first <n>      Omit the first n-1 configs. Configs are ordered by the config number.
--last <n>       Omit the configs after config n. Configs are ordered by the config number.
--debug          Turn on verbose debugging.

Retrieving data from GEMS

Overview

Once the user has obtained the config key by using GEMSmatch or GEMSview, the data files may be downloaded by using the GEMSget tool. All or some of the files may be obtained, and will be stored in the specified output directory.

Technique

  1. GEMSget users must already know the config key to use to access the config. If you do not know the config key, use GEMSmatch or GEMSview to find it.
  2. If the config you want to download is 123, simply execute:

    ~/tmp> GEMSget --config 123

    This will download all the directories and files in the config and place them under ~/tmp.
  3. To download only certain files, indicate each file with --file. When this flag is used, unmentioned files will not be downloaded.
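
GEMSmatch and GEMSget are often chained: match on params, then download each resulting config. The Python sketch below turns the output of GEMSmatch --keys (one key per line) into GEMSget argument lists, using only flags documented in this manual. The helper names and the downloads directory are hypothetical, and the sketch builds the commands without running them.

```python
def parse_keys(keys_output):
    """Turn 'GEMSmatch ... --keys' output (one key per line) into a list of keys."""
    return [line.strip() for line in keys_output.splitlines() if line.strip()]

def gemsget_commands(keys, output_root, files=()):
    """Build one GEMSget argument list per config key.

    Each config is given its own output directory under output_root so that
    files with the same name in different configs do not collide.
    """
    commands = []
    for key in keys:
        args = ["GEMSget", "--config", key, "--output", f"{output_root}/{key}"]
        for name in files:
            args += ["--file", name]
        commands.append(args)
    return commands

# Example: the single key returned by the earlier GEMSmatch example.
commands = gemsget_commands(parse_keys("141325\n"), "downloads", files=["exp.out"])
```

Each argument list can then be passed to subprocess.run. Whether GEMSget creates a missing output directory is not stated here, so creating it beforehand is the safer choice.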

GEMSget Options

Flag                  Description
-i <host>             The hostname on which the GEMS services are running. If omitted, defaults to the local machine (127.0.0.1).
-p <port>             The port number on which the GEMS server is listening. If omitted, defaults to 7101.
<file.xml>            Use this XML file. Only one such file may be submitted at a time.
-                     Read XML file from stdin.
--config <key>        The config to download.
--file <file>         Download this file. Do not download files that are not on the command line.
--output <directory>  Specify an output directory for the downloads.
--auth-mode <mode>    Use this Chirp authentication mode, e.g., unix.
--hosts               Print the hosts used to download the files.
--debug               Turn on verbose debugging.

Deleting data from GEMS

Overview

Once the user has obtained the config key by using GEMSmatch or GEMSview, the whole config may be deleted.

Technique

  1. GEMSdelete deletes a whole config. Users must have the w permission on the config and must already know its config key. If you do not know the config key, use GEMSmatch or GEMSview to find it.
  2. If the config you want to delete is 123, simply execute:

    > GEMSdelete --config 123

    This will delete the whole record from the database, and the files in this config will be garbage collected.

GEMSdelete Options

Flag                Description
-i <host>           The hostname on which the GEMS services are running. If omitted, defaults to the local machine.
-p <port>           The port number on which the GEMS server is listening. If omitted, defaults to 7101.
<file.xml>          Use this XML file. Only one such file may be submitted at a time.
-                   Read XML file from stdin.
--config <key>      The config to delete.
--auth-mode <mode>  Use this Chirp authentication mode, e.g., unix.
--debug             Turn on verbose debugging.

Using a .gemsclient file

Overview

Using a .gemsclient file greatly simplifies many GEMS client operations. Simply create an XML file like the one shown below and save it as .gemsclient in your home directory. Windows users should put this file in their "Documents and Settings\<username>" directory.
<GEMS>
  <GEMSclient>

  <chirp>
   <user>unix:sorin</user> 
   <user>hostname:sorin.university.edu</user> 
  </chirp>

  <acl>
    <entry principal="*.university.edu" perms="lr" />
  </acl> 

   <storage name="two_universities">
    <cluster name="university1">
     <pattern>*.university1.edu</pattern>
    </cluster>
    <cluster name="university2">
     <pattern>*.university2.edu</pattern>
    </cluster>
   </storage>

  <servers>
   <host>gems.university.edu</host>
  </servers>

  <GEMSview>
   <keys>
    <scientist_lname> Sorin </scientist_lname>
   </keys>
  </GEMSview>

 </GEMSclient> 
</GEMS>
The sections of this file provide, in order:
  • <chirp>: Automatic authentication methods. Especially useful for GEMSview.
  • <acl>: Automatic ACL entries, used by GEMSput if no ACL is otherwise supplied.
  • <storage>: Storage map to be used unless superseded in a GEMSput file.
  • <servers>: Default GEMSd server locations, only used by GEMSview.
  • <GEMSview>: Auto-entry keys for use with GEMSview. Try it, you'll like it.
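
If you maintain .gemsclient files for several users or machines, generating them can be less error-prone than editing XML by hand. The Python sketch below produces the chirp, storage, and servers sections in the layout of the example above. The function name build_gemsclient is hypothetical, and the acl and GEMSview sections are left out for brevity.

```python
import xml.etree.ElementTree as ET

def build_gemsclient(users, servers, storage_name, clusters):
    """Return .gemsclient XML text with <chirp>, <storage>, and <servers> sections."""
    root = ET.Element("GEMS")
    client = ET.SubElement(root, "GEMSclient")

    # Chirp identities to try when authenticating.
    chirp = ET.SubElement(client, "chirp")
    for user in users:
        ET.SubElement(chirp, "user").text = user

    # Default storage map: one <cluster> per nickname, one <pattern> per wildcard.
    storage = ET.SubElement(client, "storage", name=storage_name)
    for nickname, patterns in clusters.items():
        cluster = ET.SubElement(storage, "cluster", name=nickname)
        for pattern in patterns:
            ET.SubElement(cluster, "pattern").text = pattern

    # Default GEMSd server locations.
    hosts = ET.SubElement(client, "servers")
    for server in servers:
        ET.SubElement(hosts, "host").text = server

    return ET.tostring(root, encoding="unicode")

text = build_gemsclient(
    users=["unix:sorin"],
    servers=["gems.university.edu"],
    storage_name="two_universities",
    clusters={"university1": ["*.university1.edu"],
              "university2": ["*.university2.edu"]},
)
```

Writing the returned text to .gemsclient in your home directory installs it for the client tools.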

Providing resources for GEMS

Overview

GEMS allows storage owners to volunteer space to the system on a temporary basis by running a small server, a service that does not require root access. Storage may be revoked at any time. Additionally, GEMS ensures that your disk does not fill up with data, and actually removes GEMS data as the disk becomes full. In short, volunteering storage space to GEMS is safe, administratively easy, non-committal, and does not interfere with disk consumption by regular users.

Setup

UNIX users may volunteer space to GEMS using the method outlined below.
  1. Requirements: You must be able to run the Chirp service, which is a small, relatively portable program that compiles with the gcc tools.
  2. First, install and run the Chirp server.
  3. Create space for GEMS by creating a directory that is known and accessible to the central GEMS services. The Chirp server must report to the Chirp catalog used by GEMSd. Typical installations of GEMS use the /GEMS/ directory and require hostname access. To accomplish this in Chirp, run the server and then use the client as shown below, assuming your server is on myhost.university.edu and your GEMSd service and catalog server are on gems.university.edu.

    > chirp_server -r /tmp/chirp -u gems.university.edu
    > chirp myhost.university.edu
    chirp:myhost:/> mkdir /GEMS
    chirp:myhost:/> setacl /GEMS hostname:gems.university.edu admin

    That's it!

GEMS services

Overview

The GEMS services are the center of a GEMS installation. The software consists of a Java-based GEMS daemon called GEMSd, which manages client connections and responds to queries. This service manages metadata only: the actual data files are transmitted directly from clients to the Chirp servers. The service requires a Postgres database and Java 1.5.0. It should be run entirely as a non-root user. Installation may also be performed by a non-root user by modifying the installation process below, but we assume here that the installation is performed by root.

Setup

  1. Install PostgreSQL. An example installation process is shown below:

    ~/postgres-src# ./configure --prefix=/opt/pgsql
    ~/postgres-src# make && make install
    ~/postgres-src# chmod a+rx /opt/pgsql
    ~/postgres-src# chmod a+rx /opt/pgsql/*
    ~/postgres-src# mkdir /opt/pgsql/data
    ~/postgres-src# useradd postgres
    ~/postgres-src# chown postgres /opt/pgsql/data
    ~/postgres-src# su - postgres
    > cd /opt/pgsql
    > bin/initdb data
    > bin/pg_ctl -D data start
    postmaster successfully started

  2. Install the Chirp catalog. Ensure that your resource providers report to this catalog. Note that this service need not be run on the same machine that runs GEMSd.
  3. Download the latest full GEMS tarball, GEMS-???-src.tar from the Downloads page.
  4. Unpack the tarball, for example,

    > tar xf GEMS-???-src.tar

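  5. Build, using dist.dir to choose the installation directory:

    > ant -Ddist.dir=/opt/GEMS dist

  6. The remaining steps assume that GEMS was installed in /opt/GEMS, as in the example above. Substitute your own directory if you chose a different one.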
  7. We recommend you run GEMS as its own user, named gems.

    # useradd gems

  8. Make the GEMS directory world-readable, and let the gems user modify the configuration file:

    # chmod a+rx /opt/GEMS
    # chown gems /opt/GEMS/.gemsdconfig

  9. Obtain a suitable PostgreSQL JDBC driver file. Place this file at /opt/GEMS/jdbc/postgresql.jdbc.jar.
  10. Login as postgres, or a privileged database user:

    # su - postgres

  11. Create an empty database named GEMSd:

    > /opt/pgsql/bin/createdb GEMSd

  12. Give gems permissions on the database:

    > /opt/pgsql/bin/psql GEMSd
    GEMSd=# create user gems;
    GEMSd=# grant all on database "GEMSd" to gems;
    GEMSd=# \q

  13. Login as gems:

    # su - gems

  14. Run DBcreate to set up the GEMSd database:

    > /opt/GEMS/bin/DBcreate

  15. Start the server:

    > /opt/GEMS/bin/GEMSd -d -f /opt/GEMS/.gemsdconfig

  16. To stop the server:

    > /opt/GEMS/bin/GEMSadmin --stop

GEMSd Options

Flag                 Description
-d                   Daemon mode: disable the console. The console is intended for debugging only; use this option to restrict output to useful log messages, for example:

> GEMSd -d > /opt/GEMS/gemsd.log

--debug, --debug-sql, --debug-chirp, --debug-log, --debug-msg, --debug-all, --quiet, --no-debug
                     Control display of verbose logging and debugging information, including Chirp operations, SQL statements, etc.
-f <file>            Specifies the path to the GEMSd config file.
-w <port>            Starts the GEMS web interface on the given port.
-h                   Displays GEMSd options.

GEMSd Configuration File

An example config file is shown below:
<GEMS> 
<GEMSadmin> 
  <Catalog> 
    <host>catalog.university.edu:9097</host> 
    <refresh>5</refresh> 
  </Catalog> 
  <ChirpRoot> 
      /GEMS/ 
  </ChirpRoot> 
  <Metadatabase>  
    <host>gems.university.edu</host> 
    <user>gems</user> 
    <password /> 
  </Metadatabase>  
  <GEMSprincipal> 
     hostname:gems.university.edu 
  </GEMSprincipal>  
  <Threads> 
      ... 
  </Threads> 
  <Groups> 
   <group name="A"> 
    <host name="*.A.university.edu" /> 
   </group>  
   <group name="BCd"> 
    <host name="*.B.university.edu" /> 
    <host name="*.C.university.edu" /> 
    <host name="d.D.university.edu" /> 
   </group> 
  </Groups> 
</GEMSadmin> 
</GEMS> 
  • The Catalog host indicates the location of the Chirp catalog service as described above. The refresh interval is in minutes, i.e., this service will be queried every 5 minutes. Note that the service itself has a significant delay interval: see the Chirp options to increase its update rate.
  • The ChirpRoot indicates the directory that will be donated to GEMS on the Chirp servers. As above, this is typically /GEMS/; the slashes are required.
  • The Metadatabase section indicates the location of the Postgres database, to be accessed with the user and password given.
  • GEMSprincipal is the identity that GEMSd will attempt to use when contacting the Chirp servers. As above, hostname authentication is preferred. Note that clients may use any Chirp authentication mode, including unix and globus.
  • The Threads section indicates the intervals used by the automatic replica management services (Auditor, Replicator, and GarbageCollector) and by the statistics service, Statistics. The intervals are given in seconds. Decreasing an interval increases CPU and network consumption but may result in faster responses to changes in the Chirp subsystem and to lost replicas.
  • The Groups section allows the administrator to cordon Chirp servers into groups to avoid the possibility that all file replicas end up in the same geographical, logical, or administrative location. Host names may, as shown, be specified by wildcard. The effect is that GEMS prefers to store replicas in different groups, and will avoid storing multiple replicas in the same group. This section may be omitted entirely to revert to default behavior, in which the server is agnostic about the host organization.

GEMSd Maintenance

  1. The log output that GEMSd writes to stdout can grow large. Be sure to stop the server and compress the log files regularly.
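
The compression step can be scripted. The Python sketch below gzips a log file and removes the original. It is a generic helper, not part of GEMS, and it assumes the server has already been stopped (for example with GEMSadmin --stop) so that nothing is still writing to the file. The function name compress_log is hypothetical.

```python
import gzip
import shutil
from pathlib import Path

def compress_log(log_path):
    """Gzip log_path to '<name>.gz' beside it, remove the original, return the new path."""
    source = Path(log_path)
    target = source.with_name(source.name + ".gz")
    # Stream the copy so that large logs are never held in memory at once.
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    source.unlink()
    return target
```

For the redirect example shown earlier, this would be called as compress_log("/opt/GEMS/gemsd.log") before the server is restarted.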