GenBank HOWTO

This is a quick synopsis of the steps needed to initialize a GBrowse database from a GenBank record. For illustration, we will use the RefSeq record for M. bovis, accession NC_002945.

Using the GBrowse in-memory database

1. Convert from GenBank format into GFF format

Download the GenBank record and convert it into GFF format. You can do this easily using the bp_genbank2gff.pl script, which is part of Bioperl (scripts/Bio-DB-GFF/genbank2gff.pl):

   bp_genbank2gff.pl -stdout -accession NC_002945 > mbovis.gff

This will download the record for M. bovis (RefSeq NC_002945) and save it to the file mbovis.gff.

If you already have the GenBank record available as a file named NC_002945.gb, you can convert it like this:

   bp_genbank2gff.pl -stdout -file NC_002945.gb > mbovis.gff

The newly-converted file uses GFF2.5 format, which combines feature data with sequence/DNA data. This means that you do not need a separate FASTA file for the sequence.
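
As a quick sanity check on the converted file, the following sketch tallies the features by type. It assumes the usual nine tab-separated GFF columns with the feature type in column 3, and it skips comment lines and the embedded sequence lines. The function name gff_feature_counts is made up for this illustration and is not part of Bioperl or GBrowse.

```shell
# Hypothetical helper: count features per type (GFF column 3).
# Lines with fewer than 9 tab-separated fields (comments, embedded
# sequence) are skipped.
gff_feature_counts() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

Once the function is defined, run it on the converted file:

   gff_feature_counts mbovis.gff

An empty result would suggest that the conversion did not produce any feature lines.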

2. Install the GFF file into the databases directory

Copy this file into your in-memory GFF databases directory, as described in the tutorial. We will assume that this directory is /usr/local/apache/htdocs/gbrowse/databases.

  mkdir /usr/local/apache/htdocs/gbrowse/databases/mbovis
  chmod o+rwx /usr/local/apache/htdocs/gbrowse/databases/mbovis
  cp mbovis.gff /usr/local/apache/htdocs/gbrowse/databases/mbovis

3. Set up the configuration file

Use the configuration file 08.genbank.conf as your starting template. This is located in contrib/conf_files:

  cp contrib/conf_files/08.genbank.conf /usr/local/apache/conf/gbrowse.conf/mb.conf

4. Edit the configuration file as appropriate

You will need to change the [GENERAL] section to use the in-memory adaptor and to point to the location of the M. bovis GFF file:

 [GENERAL]
 description   = Mycobacterium bovis In-Memory
 db_adaptor    = Bio::DB::GFF
 db_args       = -adaptor memory
                 -dir     /usr/local/apache/htdocs/gbrowse/databases/mbovis
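
A mistyped -dir path is an easy mistake to make at this step. The sketch below pulls the -dir value out of a configuration file and checks that the directory holds at least one .gff file. The function name check_gff_dir is hypothetical, and the sketch assumes that the path contains no spaces and that the first -dir in the file is the one of interest.

```shell
# Hypothetical helper: print the -dir value from a GBrowse conf file
# and succeed only if that directory holds at least one .gff file.
# Assumes the path contains no whitespace.
check_gff_dir() {
    conf=$1
    # Find the first "-dir" token and take the word that follows it.
    dir=$(awk '{ for (i = 1; i < NF; i++)
                   if ($i == "-dir") { print $(i + 1); exit } }' "$conf")
    [ -n "$dir" ] || { echo "no -dir setting in $conf" >&2; return 1; }
    ls "$dir"/*.gff >/dev/null 2>&1 ||
        { echo "no .gff files in $dir" >&2; return 1; }
    echo "$dir"
}
```

For the file created above, it would be called like this:

   check_gff_dir /usr/local/apache/conf/gbrowse.conf/mb.conf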

You might also want to change the ``examples'' tag to introduce the accession number for the whole genome, and a few choice gene names and search terms:

  examples = NC_002945 Mb1800 galT glucose

That's all there is to it. However, because this is a fairly large chunk of DNA (> 4 Mbp), the in-memory database uses a considerable amount of memory, and performance will be sluggish unless you have a fast machine with plenty of RAM. You may therefore prefer to serve it from a MySQL, PostgreSQL or Oracle database. Instructions for doing so follow.

Using the GBrowse GFF database with MySQL

We will assume that you are using a MySQL database.

1. Create the database

Create the database using mysqladmin:

  mysqladmin create mbovis

As described in the GBrowse tutorial, give yourself write permission for the database, and give the web server user (e.g. ``nobody'') select permission.
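
The permissions just described can be sketched as two GRANT statements. The helper below only prints them so that you can review them first. Its name, gbrowse_grants, and the login name "yourlogin" are placeholders. We assume that both MySQL accounts already exist and connect from localhost; recent MySQL versions do not create accounts implicitly on GRANT.

```shell
# Hypothetical helper: print GRANT statements for a GBrowse database.
# $1 = database, $2 = account that loads data, $3 = web server account.
# Assumes both MySQL accounts already exist on localhost.
gbrowse_grants() {
    printf "GRANT ALL PRIVILEGES ON %s.* TO '%s'@'localhost';\n" "$1" "$2"
    printf "GRANT SELECT ON %s.* TO '%s'@'localhost';\n" "$1" "$3"
}

# "yourlogin" is a placeholder for your own MySQL account name.
gbrowse_grants mbovis yourlogin nobody
```

If the statements look right, pipe them to the mysql client as an administrative user:

   gbrowse_grants mbovis yourlogin nobody | mysql -u root -p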

2. Convert from GenBank format into GFF format and load it into the database

The bp_genbank2gff.pl script can download the accession, convert it into GFF and load the database directly in one smooth step:

  bp_genbank2gff.pl -create -dsn mbovis -accession NC_002945

If you prefer, you can do this in two steps by first creating the GFF file as described for the in-memory adaptor, and then loading it with Bioperl's bp_bulk_load_gff.pl or bp_fast_load_gff.pl.

If you are using a PostgreSQL or Oracle database, you must pass the appropriate adaptor to bp_genbank2gff.pl with the -adaptor option. For Oracle, for example:

  bp_genbank2gff.pl -create -dsn mbovis -adaptor dbi::oracle -accession NC_002945

3. Set up the configuration file

Use the configuration file 08.genbank.conf as your starting template. This is located in contrib/conf_files:

  cp contrib/conf_files/08.genbank.conf /usr/local/apache/conf/gbrowse.conf/mb.conf

4. Edit the configuration file as appropriate

You will need to change the [GENERAL] section to use the appropriate database adaptor:

 [GENERAL]
 description   = Mycobacterium bovis Database
 db_adaptor    = Bio::DB::GFF
 db_args       = -adaptor dbi::mysql
                 -dsn     dbi:mysql:database=mbovis;host=localhost
                 -user    nobody
                 -passwd  ""

You might also want to change the ``examples'' tag to introduce the accession number for the whole genome, and a few choice gene names and search terms:

  examples = NC_002945 Mb1800 galT glucose

That should be it!

NOTE

You can load as many accessions into the database as you like. Each one will appear as a ``chromosome'' named after the accession number of the entry.
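
For the in-memory setup, one way to prepare several accessions is to convert each one into its own GFF file and copy them all into the same databases directory. This rests on the assumption that the in-memory adaptor reads every GFF file it finds under -dir. The sketch below prints one conversion command per accession rather than running it, so the commands can be inspected first. The function name genbank_fetch_commands is hypothetical.

```shell
# Hypothetical helper: print one conversion command per accession,
# writing each record to its own GFF file named after the accession.
genbank_fetch_commands() {
    for acc in "$@"; do
        printf 'bp_genbank2gff.pl -stdout -accession %s > %s.gff\n' "$acc" "$acc"
    done
}

genbank_fetch_commands NC_002945
```

Pass any further accession numbers as additional arguments, and pipe the output to sh to run the commands.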