Analysis of exported IMG taxon tables

RM Bowers


Usage

1. Export a taxon table from IMG: https://img.jgi.doe.gov

Search for a taxon string, e.g. 'Alphaproteobacteria', in the 'Quick Genome Search:' box at the top of the IMG page. Select all hits, then change the number of rows shown in the present view from 100 to 10 to shrink the display. Choose the desired columns under 'IMG Table Configuration' and click the 'Redisplay' button. Click the 'Select All' button, then the 'Export' button to save a copy of the taxon table, which is the input to the Shiny app.

Recommended for the current version: remove Environmental or Engineered from the Phylum column. This requires a regex. For example, to remove Environmental from the Phylum column, try < [^Environmental] > . This is a Perl regex designed to remove 'Environmental' while leaving everything else. For more information on IMG regex, click on the ? mark next to the 'Apply' button.
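
The pattern above is a negated character class: it matches any single character that is not one of the letters in 'Environmental'. A value made up only of those letters therefore produces no match. The following is a minimal Python sketch of that behavior, not the IMG implementation. It assumes that the filter keeps a row when the pattern matches somewhere in the value, and the Phylum values are made up for illustration.

```python
import re

# Negated character class: matches any single character that is NOT one of
# the letters E, n, v, i, r, o, m, e, t, a, l.
pattern = re.compile(r"[^Environmental]")

# Hypothetical Phylum values, for illustration only.
phyla = ["Proteobacteria", "Environmental", "Engineered", "Firmicutes"]

# Assumption: a row is kept when the pattern matches somewhere in the value.
kept = [p for p in phyla if pattern.search(p)]
print(kept)
```

In this sketch 'Engineered' is still kept, because it contains letters such as 'g' and 'd' that fall outside the class, so under the same assumption it would need its own pattern.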

2. Load the IMG taxon table with the Browse button.

3. Explore the data by clicking on the tabs.


Notes

Requires at least one of the 6 taxonomy levels: Phylum, Class, Order, Family, Genus, Species. Using all 6 is recommended.

Other categorical variables that should be included: ANI Cluster ID, Is Public. All GOLD Ecosystem and Geography columns are recommended.

All numeric columns will be used as Genome parameters, but ID columns will not, e.g. TAXON_IDs, ITS_SP_ID, etc.
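
The requirements above can be checked on an exported table before loading it. The following is a rough sketch rather than the app's own logic. It assumes the export is tab-delimited, and the `looks_like_id` rule (names ending in `_ID`, `_IDS` or `_OID`) is a hypothetical heuristic, as are the sample rows.

```python
import csv
import io

TAXONOMY_LEVELS = ["Phylum", "Class", "Order", "Family", "Genus", "Species"]

def looks_like_id(name):
    # Hypothetical heuristic: treat names ending in _ID, _IDS or _OID as IDs.
    upper = name.upper().replace(" ", "_")
    return upper.endswith(("_ID", "_IDS", "_OID"))

def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def summarize_columns(tsv_text):
    # Assumption: the exported table is tab-delimited with one header row.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []
    taxonomy = [c for c in columns if c in TAXONOMY_LEVELS]
    if not taxonomy:
        raise ValueError("table needs at least one taxonomy level column")
    numeric = []
    for c in columns:
        if looks_like_id(c):
            continue  # ID columns are not genome parameters
        values = [(r.get(c) or "").strip() for r in rows]
        values = [v for v in values if v]  # ignore empty cells
        if values and all(is_number(v) for v in values):
            numeric.append(c)
    return taxonomy, numeric

# Made-up sample rows, for illustration only.
sample = "\n".join([
    "taxon_oid\tPhylum\tGenus\tGenome Size\tGene Count\tIs Public",
    "1001\tProteobacteria\tEscherichia\t4600000\t4300\tYes",
    "1002\tFirmicutes\tBacillus\t4200000\t4100\tNo",
])
taxonomy, numeric = summarize_columns(sample)
print(taxonomy, numeric)
```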


Application

The purpose of this app is to use the data in the IMG database to explore genome counts by taxonomy, the categorical metadata such as GOLD Ecosystem and/or ANI group, and the numeric data associated with each genome, such as genome size, gene counts, scaffold counts, etc.

Note: If a taxonomy level displays no results, adjust 'Total genomes included' and/or choose 'Select All' in the 'Select sublevel' pull-down.

Note: Users can supply a list of taxon_oids (i.e. genome IDs) in the Data table panel, once 'Select All' has been selected in the 'Select sublevel' panel, to isolate data specific to that set of taxon_oids. To provide a list, place the pipe separator '|' between each taxon_oid or other search term.
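
For a long list, the pipe-separated string can be built with a short script instead of by hand. A minimal Python sketch, in which the taxon_oids are made up for illustration:

```python
# Hypothetical taxon_oids, used only for illustration.
taxon_oids = ["1001", "1002", "1003"]

# Join the IDs with the pipe separator expected by the Data table panel.
search_string = "|".join(taxon_oids)
print(search_string)  # prints 1001|1002|1003
```

The printed string can then be pasted into the search box of the Data table panel.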

Note: Additional data can be added to the exported IMG metadata table, including but not limited to parameters that can be averaged across each genome, such as: strain ids, geographic location, genome completeness, contamination, SNP count, etc.

Note: Some columns selected in the IMG Table Configuration will inevitably appear in the 'Select metadata variable' box or the axis variable pull-down boxes even though they are not appropriate there. For example, if you try to group by individual Genome ID, every group is a singleton, so no meaningful plot can be created.

Note: Additionally, some IMG columns in the IMG Table Configuration are usually entirely blank and therefore do not provide valuable information. Column names that may be problematic: Number_of_Filtered_Reads_assembled, Number_of_Mapped_Reads_assembled, Assembled_Reads_assembled, Total_Filtered_Bases_assembled, Total_Mapped_Bases_assembled, Average_Coverage_of_Assembled_Sequences_assembled

Other columns from IMG that are problematic: any column that may be a duplicate, such as Coding Base Count. Note that some columns may still be problematic because IMG puts special characters in some of the columns. I am working on this...
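
Entirely blank and duplicated columns can be spotted in the exported table before it is loaded. The following is a sketch only. It assumes a tab-delimited export, takes 'duplicate' to mean a repeated column name, and uses made-up sample rows; `find_problem_columns` is a hypothetical helper, not part of the app.

```python
import csv
import io
from collections import Counter

def find_problem_columns(tsv_text):
    # Assumption: tab-delimited text whose first row holds the column names.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    rows = list(reader)
    # A column is blank when every row is empty (or missing) at that position.
    blank = [
        name
        for i, name in enumerate(header)
        if all(i >= len(row) or row[i].strip() == "" for row in rows)
    ]
    # Assumption: a duplicate is a column name that appears more than once.
    duplicated = [name for name, n in Counter(header).items() if n > 1]
    return blank, duplicated

# Made-up sample rows, for illustration only.
sample = "\n".join([
    "taxon_oid\tCoding Base Count\tCoding Base Count\tAverage_Coverage_of_Assembled_Sequences_assembled",
    "1001\t4000000\t4000000\t",
    "1002\t3900000\t3900000\t",
])
blank, duplicated = find_problem_columns(sample)
print(blank, duplicated)
```

Columns reported by either list are candidates to deselect in the IMG Table Configuration before exporting again.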