Data & Tools


1. Sequence data for gene catalogs:

  • Non-redundant gene catalog (nucleotide sequences, fasta)    
  • Non-redundant gene catalog (amino acid sequences, fasta)    

2. Gene annotation table:

  • GeneAnnotationAndSummaryTable.xls.gz    

3. Public data used, including:

  • Genes of 1,384 genomes of 66 respiratory tract related bacteria in the Integrated Microbial Genomes (IMG)    
  • Gene set of respiratory tracts from the Human Microbiome Project (HMP)    
  • Genes of 73 respiratory tract related bacteria in the Pathosystems Resource Integration Center (PATRIC)    
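
The FASTA files listed under item 1 can be read with a few lines of standard-library Python. The sketch below is only illustrative: it assumes the gene ID is the first whitespace-delimited token of each header line, and the example records are made up rather than taken from the catalog.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (gene_id, sequence) pairs from FASTA-formatted lines."""
    gene_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if gene_id is not None:
                yield gene_id, "".join(chunks)
            # Assumption: the ID is the first token after ">".
            header = line[1:].split()
            gene_id = header[0] if header else ""
            chunks = []
        elif gene_id is not None:
            chunks.append(line)
    if gene_id is not None:
        yield gene_id, "".join(chunks)


def gene_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each gene ID to the length of its sequence."""
    return {gid: len(seq) for gid, seq in read_fasta(lines)}


# Made-up records for illustration only.
EXAMPLE_FASTA = """>gene_1 hypothetical protein
ATGCGT
ACGTAA
>gene_2
ATGAAATAG
""".splitlines()

print(gene_lengths(EXAMPLE_FASTA))
```

For a gzip-compressed download, the same functions accept the handle returned by `gzip.open(path, "rt")`.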

Table format

  • Gene ID: Unique ID
  • Gene Length: Length of nucleotide sequence
  • Taxonomic Annotation(Phylum Level): Annotated phylum for a gene
  • Taxonomic Annotation(Genus Level): Annotated genus for a gene
  • Taxonomic Annotation(Species Level): Annotated species for a gene
  • eggNOG Annotation: Annotated eggNOG(s) for a gene
  • eggNOG Functional Categories: eggNOG functional category(ies) of the annotated eggNOG(s)
  • KEGG Annotation: Annotated KO(s) for a gene
  • KEGG Functional Categories: KEGG functional category(ies) of the annotated KO(s)
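
As a rough illustration of how the columns above might be used, the sketch below counts genes per phylum. It rests on assumptions the page does not confirm: that the decompressed table is tab-delimited text with a header row, and that the header names match the column names listed above exactly. The label `unannotated` for empty cells and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed header text, copied from the column list above.
PHYLUM_COLUMN = "Taxonomic Annotation(Phylum Level)"


def count_by_phylum(handle) -> Counter:
    """Count genes per annotated phylum in a tab-delimited table."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        # Hypothetical convention: empty cells are grouped as "unannotated".
        counts[row.get(PHYLUM_COLUMN) or "unannotated"] += 1
    return counts


# Made-up rows for illustration only.
SAMPLE_TABLE = (
    "Gene ID\tGene Length\tTaxonomic Annotation(Phylum Level)\n"
    "g1\t900\tFirmicutes\n"
    "g2\t450\tProteobacteria\n"
    "g3\t1200\tFirmicutes\n"
    "g4\t300\t\n"
)

print(count_by_phylum(io.StringIO(SAMPLE_TABLE)))
```

If the real file turns out to be a binary spreadsheet rather than delimited text, a spreadsheet reader would be needed instead of `csv`.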


Gene catalog construction

SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads.

Website: http://soap.genomics.org.cn/soapdenovo.html

MetaGeneMark is a program designed to predict genes in metagenomes.

Website: http://exon.gatech.edu/GeneMark/index.html

CD-HIT is a very widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT helps to significantly reduce the computational and manual effort in many sequence analysis tasks, and aids in understanding the data structure and correcting the bias within a dataset.

Website: http://weizhong-lab.ucsd.edu/cd-hit/
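
To give a feel for how redundancy removal by clustering works, here is a toy sketch of greedy incremental clustering: sequences are processed from longest to shortest, and each one either joins the first representative it resembles or becomes a new representative. This is not CD-HIT's implementation. The k-mer overlap measure below is a simplified stand-in for its alignment-based identity, and the threshold and sequences are hypothetical.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Fraction of the shorter sequence's k-mers found in the longer one.

    Toy stand-in for sequence identity, not what CD-HIT computes.
    """
    shorter, longer = sorted((a, b), key=len)
    short_kmers = kmers(shorter, k)
    if not short_kmers:
        return 0.0
    return len(short_kmers & kmers(longer, k)) / len(short_kmers)


def greedy_cluster(seqs: Dict[str, str], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Cluster sequences greedily, longest first.

    Returns a mapping from representative ID to member IDs.
    """
    clusters: Dict[str, List[str]] = {}
    # Longest sequences first; ties broken by ID for determinism.
    for gid in sorted(seqs, key=lambda g: (-len(seqs[g]), g)):
        for rep in clusters:
            if similarity(seqs[gid], seqs[rep]) >= threshold:
                clusters[rep].append(gid)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[gid] = [gid]
    return clusters


# Made-up sequences for illustration only.
TOY_SEQS = {
    "a": "ATGCGTACGTTAGC",
    "b": "ATGCGTACGTTA",
    "c": "TTTTGGGGCCCCAA",
}

print(greedy_cluster(TOY_SEQS))
```

The representatives (the dictionary keys) play the role of the non-redundant set; the members grouped under them are the sequences considered redundant.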

SOAPaligner/soap2 is a program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The program is designed to handle the huge amounts of short reads generated by parallel sequencing using the new generation Illumina-Solexa sequencing technology.

Website: http://soap.genomics.org.cn/soapaligner.html

Gene annotation

BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.

Website: https://blast.ncbi.nlm.nih.gov/Blast.cgi
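
When BLAST results are used for annotation, a common step is to keep the best significant hit per query. The sketch below assumes the results were written in BLAST+ tabular format (`-outfmt 6`) with its default twelve columns; the page does not say which output format or cutoff was used, so the `max_evalue` value and the sample hits are hypothetical.

```python
import csv
import io
from typing import Dict, Tuple

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_TAB_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float, float]]:
    """Return {query: (subject, bitscore, evalue)} for the top-scoring hit.

    Hits with an e-value above max_evalue (a hypothetical cutoff) are skipped.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_TAB_FIELDS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore, evalue)
    return best


# Made-up hits for illustration only.
SAMPLE_HITS = (
    "g1\tK00001\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380\n"
    "g1\tK00002\t80.0\t150\t30\t0\t1\t150\t1\t150\t1e-30\t120\n"
    "g2\tK00003\t35.0\t60\t39\t0\t1\t60\t1\t60\t0.5\t30.1\n"
)

print(best_hits(io.StringIO(SAMPLE_HITS)))
```

In this sample, `g2` has no hit below the cutoff and is therefore left without an annotation.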

KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

Website: http://www.genome.jp/kegg/

eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (i.e., derived from the original COG/KOG categories).

Website: http://eggnogdb.embl.de/#/app/home

Alignment against the gene catalog

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

Website: http://bio-bwa.sourceforge.net/
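
A minimal sketch of how reads might be mapped against the nucleotide gene catalog with BWA is shown below. It only builds the command lines and does not run them. The choice of the `bwa mem` algorithm, the thread count, and all file names are assumptions; the page does not state which BWA mode or parameters were used.

```python
import shlex
from typing import List, Optional


def bwa_index_cmd(catalog_fasta: str) -> List[str]:
    """Command that builds the BWA index for the catalog FASTA."""
    return ["bwa", "index", catalog_fasta]


def bwa_mem_cmd(
    catalog_fasta: str,
    reads_1: str,
    reads_2: Optional[str] = None,
    threads: int = 4,
) -> List[str]:
    """Command that maps single-end or paired-end reads with `bwa mem`."""
    cmd = ["bwa", "mem", "-t", str(threads), catalog_fasta, reads_1]
    if reads_2:
        cmd.append(reads_2)  # second file of a read pair
    return cmd


# Hypothetical file names for illustration only.
for cmd in (
    bwa_index_cmd("gene_catalog.fa"),
    bwa_mem_cmd("gene_catalog.fa", "sample_1.fq.gz", "sample_2.fq.gz"),
):
    print(shlex.join(cmd))
```

`bwa mem` writes SAM records to standard output, so in practice the command would be run with something like `subprocess.run(cmd, stdout=sam_handle, check=True)` to capture the alignments in a file.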