====================================
$Id: INSTALL 290 2008-08-02 12:26:53Z lapp $
====================================

BioSQL - INSTALL
--------------

This document gives an overview of the installation process for
BioSQL on 3 different platforms: MySQL, Postgres, and Oracle. 
It's written from a Bioperl and Unix perspective but most of the 
instructions will work for Unix generally, independent of language.

In each case the installation has multiple steps, and there may
be peculiarities for a given server and a given operating system.
If you've read this document and the documentation for your server
and are still having a problem with the install please contact 
bioperl-l@bioperl.org.

BioSQL is a database for storing bioinformatics data, particularly 
sequences and associated data. It has been written jointly by the 
members of the BioJava, BioPerl, BioRuby and BioPython projects, 
and you can use applications from any of these projects to load and 
fetch data. You don't have to use these projects - you can query 
the database directly with SQL (Structured Query Language) or using 
the direct bindings of the different languages (e.g. JDBC, Perl DBI).

BioSQL needs a SQL database server system, currently one of MySQL, 
Postgres or Oracle (but any SQL database will work with a little 
tweaking - we test regularly against these engines). BioSQL will
typically be used with a client program written in one of Perl, 
Java or Python (for example, the Perl client is the bioperl-db 
package). Sensible uses of BioSQL include:

   (a) Storing the entire GenBank or EMBL or Swissprot public
       repository in the BioSQL schema, providing efficient access to
       individual records there.

   (b) Storing your own sequences which you are working on in the lab.

   (c) Storing a particular genome sequence. 

Yes, there are many other SQL-compatible databases which can serve
these purposes. The great thing about BioSQL is the interoperability 
between the various Bio* projects (see www.open-bio.org for more on 
all these efforts).


Server Setup
----------

To run BioSQL yourself you need to have a database server. Depending
on your circumstances there will be different setups to use. Here are
a couple of examples:

   - Running a BioSQL database on a laptop providing Swissprot in a
     flexible and useable way. Here one might use MySQL or Postgres
     running locally on the laptop.

   - Running a BioSQL database for a small lab group in order to share 
     sequence annotations. Here one might use Postgres, or MySQL with
     transaction support. Postgres comes with transaction support built-in.

   - Running a BioSQL database in a company to manage a mid-scale EST
     project. Here one might use an Oracle or Postgres database.

The database server you use should really be chosen by ease of use
for people; for example, in a group with a lot of Oracle experience,
use Oracle, similarly for a MySQL group, use MySQL. If you are
starting out, probably MySQL is the easiest database to work with, but
Postgres is a very sensible option which comes as a more "standard"
database.  Below gives you the basic set up of the three databases.


MYSQL
-----

For the MySQL version to instantiate and run properly you need to
have MySQL version 3.23.50 or later, because only from that version on 
are nullable columns handled in a bug-free manner, according to the
MySQL documentation. If you are lucky your Linux or Mac distribution, 
or whatever, came with MySQL.  Look at your service tools and switch
on MySQL. If not, don't worry, installing MySQL is easy.

Most of us on Unix typically install from source. Download the tar 
file from one of the mirrors linked off www.mysql.com. Uncompress and 
untar the tar file.

   >tar -zxvf mysql-4.0.10-gamma.tar.gz 
   >./configure
   >make
   >make install

You'll probably need root privileges to run the 'make install'.
Once the MySQL files are installed you are ready to create setup
accounts and start the server, and the steps are well-documented
in the MySQL manual. Another good document to look at is in the 
bioperl-db package: bioperl-db/docs/HOWTO-MySQL.html. This one
discusses both MySQL installation and bioperl-db.

If you are on Windows you're better off using a binary. Please
see the installation instructions at www.mysql.com.

If you want InnoDB compiled in MySQL you should probably download the
stable MySQL 4* version, but if you have already installed MySQL 3.23 
MAX and you want to activate InnoDB in your server you need to
uncomment InnoDB-related lines in the my.cnf MySQL configuration file. 

By default there is no configuration file, so if you don't have one
you need to copy one of the sample files in the supporting_files/
directory to /etc/. There are four my.cnf files called my-small.cnf,
my-medium.cnf, my-large.cnf, and my-huge.cnf corresponding to
predefined defaults for small, medium and large databases
respectively. Pick any of them and copy it to /etc/.

For example:

  >cp /usr/local/mysql/support-files/my-small.cf /etc/my.cnf

Then edit the /etc/my.cnf to uncomment lines referring to InnoDB, 
specifically the following lines:

innodb_data_home_dir = /usr/local/mysql/var/
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /usr/local/mysql/var/
innodb_log_arch_dir = /usr/local/mysql/var/
set-variable = innodb_buffer_pool_size=16M
set-variable = innodb_additional_mem_pool_size=2M
set-variable = innodb_log_file_size=5M
set-variable = innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=1
set-variable = innodb_lock_wait_timeout=50

Finally you need to make sure the innodb_data_home_dir exists and is
owned by the mysql user, e.g.

  >mkdir /usr/local/mysql/var/
  >chown mysql /usr/local/mysql/var/

Finally start the mysql server as per normal:

  >safe_mysqld &


POSTGRESQL
----------

Like Mysql, the best thing is to install from source. Go to
www.postgresql.org and choose a mirror to download the postgres tar
ball.

Unzip and Untar, in one step:

  >tar -zxvf postgresql-8.2.9.tar.gz

Or in two steps:
 
  >gunzip postgresql-8.2.9.tar.gz
  >tar -xvf postgresql-8.2.9.tar

===============
NOTE: In version 8.3 and higher, PostgreSQL introduced a significant
change to the behavior of the database server that at present renders
it incompatible with at least Bioperl-db, the Perl language binding
for BioSQL. Specifically, PostgreSQL v8.3+ removes the automatic
typecasts of numeric values to strings, and hence comparing a
character-type column to an unquoted numeric value results in an
error. This will break the binding or client code for untyped
languages (such as Perl) if the code does not explicitly cast bound
query parameters for character columns and if the bound data happens
to be a valid number. This can happen, for example, for accession
numbers (which ironically mostly are not numbers).

Until this incompatibility is resolved, PostgreSQL v8.3 or higher is
unsupported (though it may work well for language bindings such as
Biojava).
===============

Now make the package by going:

  >./configure
  >make
  >make install

You may have to be root to do this.

In some systems, readline is not installed. In this case either:

    (a) install the readline package from ftp.gnu.org, or
    (b) do ./configure --without-readline

Now make a new postgres user, for example 'postgres', which will
actually run the postgres server on your machine. On linux and other
unix's this is done by something like:

  >adduser postgres

On Mac OS X, go to SystemPreferences --> Users --> New User

You now need to make a place for your data directory for postgres.
For example, as root:

  >cd /usr/local
  # pgsql/ may already exist from installing PostgreSQL
  >mkdir pgsql
  >mkdir pgsql/data
  >chown -R postgres pgsql/

The chown command makes the postgres user the owner of the pgsql/
directory.

Since this is where the postgres data will be stored, you may well want
to have this directory on a separate disk partion. If so, simply build
a directory tree (e.g. /somewhere/disk12/pgsql/data). Remember that
the postgres server will be accessing this directory all the time, so
wherever possible make sure this disk is local storage, *not* network
mounted.

Now log in as the postgres user to install the postgres server data files:

  >su postgres 

The postgres server needs to know where to find the data and where
the postgres binaries are. For the postgres user, it is best to store
this once in a init shell file, e.g. ~/.tcshrc for tcsh:

  >setenv PGDATA /usr/local/pgsql/data
  >setenv PATH ${PATH}:/usr/local/pgsql/bin

The PGDATA environment variable tells postgres where the data files
are stored; the path is to put all the postgres binaries on the path

The initdb command builds all the metadata files and basic starting
files for the postgres server. you only need to run this once on your
machine

  >/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/

*WARNING* In some situations, you need to init the database with the
--no-locale command (this disables some of the sorting cases in the
postgres server). If when you start the postmaster command you see a
message like:

  FATAL:  invalid value for option 'LC_MONETARY': 'en_US'

Then remove the files in /usr/local/postgres/data/ and rerun initdb
with --no-locale command.

  >rm -rf /usr/local/postgres/data/*
  >/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/ --no-locale

Now start the postgres server. This actually pulls up a instance of
the postgres server for running things:

  >/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ -l logfile start
  
To test that your postgres server works, you need to make a test
database and then connect to the server using the psql command

   >createdb test
   >psql test

*WARNING* If the above commands don't work, then perhaps your locale
settings have gotten messed up. Check out the logfile for the psqldb user
and if it has message like:

  FATAL:  invalid value for option 'LC_MONETARY': 'en_US'

Then remove the files in /usr/local/postgres/data/ and rerun initdb
with --no-locale command

  >rm -rf /usr/local/pgsql/data/*
  >/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/ --no-locale

The postgres server is now up and running.

ORACLE Installation 
----------------

Oracle installations are certainly more complicated than MySQL
installs. Depending on the situation one could choose from the free 
Personal Oracle or Oracle Enterprise. The latter package is installed 
using the Oracle Universal Installer which uses a graphic interface. 
Both available through the Oracle Technology Network.

Installing Oracle, with its relative complexity, is somewhat beyond
the scope of this document. However, if this is your choice then bear
in mind that there are people in bioperl-l@bioperl.org who can assist
should you run into trouble. Just as with MySQL and Postgres you will
probably need to create special users to install the database and
you will need root privileges. Unlike MySQL and Postgres Oracle uses
configuration files, such as tnsnames.ora and listener.ora, that keep
track of essential information and many problems are solved by making 
changes in these files. Other common problems concern the
environmental variables $ORACLE_HOME and $ORACLE_SID, make sure that
these are set correctly.


Starting BioSQL and loading the first database
------------------------------------

You have to first choose the language or project you want to use in 
to load your data. You currently have three options: Perl and BioPerl, 
Java and BioJava and Python and BioPython. You will need to download
the schema, the same for all three languages, and the supporting modules 
for each language. In each case the package is a mixture of generic 
modules for database access and then the BioSQL binding code.


Schema Loading
-------------

The BioSQL schema is distributed separately from the language
bindings. Use the svn checkout from the anonymous server:

 >svn co svn://code.open-bio.org/biosql/biosql-schema/trunk biosql-schema

Create a database, which we'll call 'biosql', in the data instance:

For MySQL do:

  >mysqladmin -u root create biosql

For postgres do, as the postgres user:

  >createdb biosql

To load the schema, use the appropiate SQL dialect in
biosql-schema/sql.

For mysql do:

  >mysql -u root biosql < biosqldb-mysql.sql

For postgres do:

  >psql biosql < biosqldb-pg.sql

For Oracle there's more than a single script, and the database itself
is more complex, complete with triggers, indices, and different
configurations. Please take a look at the INSTALL file in the 
biosql/biosql-schema/sql/biosql-ora directory for more information.

At this point we now have a database instance with the server set up
correctly, we now need to load it with data.


Perl Loading
----------

To load with Perl, you need to have the DBI module installed as well
as the driver for your SQL database of choice. The Oracle, MySQL, 
and Postgres modules are available at CPAN:

  >sudo perl -MCPAN -e shell

'sudo' is a command that allows users to issue commands as root.
Alternatively become root.
 
At the CPAN shell prompt go:

  cpan>install DBI

Then for the particular SQL flavour you need to install the DBD.

For MySQL:

  cpan>install DBD::mysql

For Postgres:

  cpan>install DBD::Pg

For Oracle:

  cpan>install DBD::Oracle

The DBD modules will need the database client libraries for
the relevant database on the local system to compile and link
correctly. Occasionally this install fails in an indecipherable way.
If the install in the CPAN shell fails it is usually best to try the
install by hand, looking at the compile error message carefully. 
If you interrupt CPAN the files are downloaded to ~/.cpan/build. 
If you are root, this will be either off /.cpan/ or /root/.cpan,
depending on the operating system.

For example, in my case, I got the following error on installing DBD::Pg:

  /usr/bin/ld: table of contents for archive: 
  /usr/local/pgsql/lib/libpq.a is out of date; rerun ranlib(1) (can't load from it)
  make: *** [blib/arch/auto/DBD/Pg/Pg.bundle] Error 1

Looking at this error I then did, as root:

  >ranlib /usr/local/pgsql/lib/libpq

and then re-ran make and make install.

Another common source of error comes from Perl assuming a library file
is in a certain place when it's elsewhere. For example, this partial 
command, after the "make" command:

  ... gcc -c -L/usr/lib/mysql -lmysqlclient -lm -lz ...

means that gcc assumes the file "libmysqlclient.a" might be found in
the directory /usr/lib/mysql, or in one of the standard library
directories like /lib or /usr/lib. If it's actually in
/usr/local/lib/mysql, for example, then you'll need to tell Perl to 
construct the Makefile appropriately. First clean up:

  >make clean

Then make the Makefile:

  >perl Makefile.PL --libs="-L/usr/local/lib/mysql -lmysqlclient -lm -lz"
  >make

Then proceed to "make test" and "make install".


Downloading bioperl-db and installing
-------------------------------

With DBI and the driver of choice set up you now should download
the bioperl and bioperl-db package.

bioperl-1.5.2 is available from CPAN or www.bioperl.org, and
bioperl-db is available from CPAN, or from svn:

  >svn co svn://code.open-bio.org/bioperl/bioperl-db/trunk bioperl-db

Look at the INSTALL files in the bioperl and bioperl-db directories, 
they will give specific instructions. Bioperl uses other CPAN modules 
but don't make the mistake of trying to install each and every
accessory module, most of these are related to specific modules in 
Bioperl that you may not be using.

Also, you don't need root permission to use bioperl and bioperl-db and
you don't need to run the final 'make install'. You could use these packages
with the packages just untar'd in your local directory and PERL5LIB
set appropriately, e.g.:

  >setenv PERL5LIB /sw/lib/perl5:/Users/birney/src/bioperl-live:/Users/birney/src/bioperl-db

Many of the bioperl installation details are discussed in the INSTALL
file found in the bioperl package, look there for more information.


Loading taxonomy data
-------------------

With bioperl and bioperl-db installed you are ready to load some data.
It is advisable to pre-load the NCBI taxonomy database (use
scripts/load_taxonomy.pl in the biosql-schema package, the details are
in its documentation). Otherwise you'll see errors from misparsed
organisms. 


Loading sequence data
------------------

To load sequence data you can use the load_seqdatabase.pl script in 
bioperl-db/scripts/biosql/, for example:

  >perl load_seqdatabase.pl -dbuser postgres -dbname biosql -namespace swissprot -format swiss sprot40.dat

Note that in this example for a PostgreSQL database it is the database
superuser who loads the sequences. Just as one would not use the
system root user on a Unix system as a normal user account, one would
create separate users with more limited privileges as the ones
"owning" a schema, and loading or manipulating data. Each RDBMS has
its own way of creating users; look up the documentation for yours for
the 'CREATE USER' command.

The -dbname parameter is the name of the database you have set up for 
your data.

The -dbuser parameter is the user who will load the data, who has
write-privileges on the database.

The -namespace 'swissprot' provides the string under which this dataset
is stored. A single bioSQL database can store many different "sequence
databases" (e.g., it can store both Swissprot and EMBL and
GenBank).

The -format parameter indicates the format of the files provided. All the
SeqIO formats are supported (see www.bioperl.org/HOWTOs for more on 
SeqIO), though of course not all formats have the same amount of 
information in them.

The last argument to the command above is the sequence file, "sprot40.dat".
Because you have your choice of formats you have your choice of input
files. The complete Swissprot file can be obtained at ftp.expasy.ch,
along many sub-databases. All of Genbank is available at
ftp://ftp.ncbi.nih.gov/genbank, in genbank format. Many microbial 
genomes are available at ftp://ftp.tigr.org/pub/data/. Enjoy!
