Wednesday, February 10, 2016

Watch out for gene symbols in Excel

We do not recommend using gene symbols (or synonyms) as the primary identifier in a gene list, especially in Excel. Excel irreversibly converts certain symbols into dates, and the problem gets much worse when gene synonyms are used.

We checked all primary human gene symbols and found that Excel automatically converts the following 35 into dates: FEB1, FEB2, FEB5, FEB6, FEB7, FEB9, FEB10, MARCH1, MARC1, MARCH2, MARC2, MARCH3, MARCH4, MARCH5, MARCH6, MARCH7, MARCH8, MARCH9, MARCH10, MARCH11, SEPT1, SEPT2, SEPT3, SEPT4, SEPT5, SEPT6, SEPT7, SEPT8, SEPT9, SEPT10, SEPT11, SEPT12, SEPT14, SEP15, DEC1.
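A list can be screened for such symbols before it ever reaches Excel. The following is a minimal sketch, assuming that a month-like prefix followed by one or two digits is what triggers the conversion. The prefixes are taken only from the symbols listed above, the function name is hypothetical, and the pattern is an approximation that may over-flag or under-flag compared with Excel's real behavior.

```python
import re

# Month-like prefixes seen in the symbols listed above. This is an assumed,
# simplified stand-in for Excel's date guessing, not its actual rule set.
MONTH_PREFIXES = ("FEB", "MARCH", "MARC", "SEPT", "SEP", "DEC")
DATE_LIKE = re.compile(
    r"^(?:%s)\d{1,2}$" % "|".join(MONTH_PREFIXES), re.IGNORECASE
)


def is_date_like(symbol):
    """Return True if the symbol is a month-like prefix plus 1-2 digits."""
    return DATE_LIKE.match(symbol.strip()) is not None


symbols = ["MARCH1", "SEPT7", "DEC1", "TP53", "BRCA1"]
flagged = [s for s in symbols if is_date_like(s)]
print(flagged)
```

Flagged symbols can then be swapped for a stable identifier, or entered as text, before the list is opened in a spreadsheet.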

The conversion is irreversible for two reasons. First, the original symbol is lost: the cell stores only an integer representing the number of days since Jan 1, 1900. Second, MARCH1 and MARC1 map to the same date, as do MARCH2 and MARC2, so a date cannot be traced back to a single symbol.
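Both reasons can be illustrated with a few lines of Python. This is a sketch under two stated assumptions: that Excel reads MARCH1 and MARC1 as March 1 of the year in which they are typed (2016 is used here), and that the workbook uses the default 1900 date system. The helper name is hypothetical.

```python
from datetime import date

# In Excel's 1900 date system, the serial number of any date after
# Feb 28, 1900 equals the number of days since Dec 30, 1899. The offset
# absorbs Excel's treatment of 1900 as a leap year.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial(d):
    """Return the Excel serial number for a date after Feb 28, 1900."""
    return (d - EXCEL_EPOCH).days


# Assumption: both symbols are interpreted as March 1 of the entry year.
ASSUMED_YEAR = 2016
as_entered = {
    "MARCH1": date(ASSUMED_YEAR, 3, 1),
    "MARC1": date(ASSUMED_YEAR, 3, 1),
}
serials = {symbol: excel_serial(d) for symbol, d in as_entered.items()}
print(serials)
```

Both symbols end up with the same integer, and nothing in that integer records which symbol was typed.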

The situation becomes even worse if we allow gene synonyms. For example, SEP53 (Gene ID 49860) becomes Sep 1953, 2E4 (Gene ID 11133) becomes 20000, and 9-27 (Gene ID 8519) becomes Sep 27.

It gets even wilder when we look at mouse and rat. There are primary symbols such as 201E9, 9130022E09, 3e46, NA, NaN, etc.
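Symbols of this kind can be caught with a simple check. The sketch below is illustrative only: the function name is hypothetical, the numeric pattern covers plain numbers and scientific notation, and the list of missing-value tokens is an assumed, non-exhaustive set of strings that some data tools read as empty values.

```python
import re

# Plain numbers and scientific notation, e.g. 2E4, 3e46, 9130022E09.
NUMERIC_LIKE = re.compile(r"^[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?$")

# Tokens that some data tools treat as missing values (assumed, not exhaustive).
MISSING_LIKE = {"NA", "NAN", "NULL", "N/A"}


def risk(symbol):
    """Return 'numeric', 'missing', or None for a gene symbol."""
    s = symbol.strip()
    if NUMERIC_LIKE.match(s):
        return "numeric"
    if s.upper() in MISSING_LIKE:
        return "missing"
    return None


for symbol in ["201E9", "9130022E09", "3e46", "NA", "NaN", "Actb"]:
    print(symbol, risk(symbol))
```

This check does not cover date-like symbols such as 9-27, which need a separate pattern.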

How do you fix a gene symbol in Excel? To enter MARCH1 into a cell, type 'MARCH1 (prefix it with a single quote). This tells Excel to preserve the input as text, while the single quote stays invisible to Excel formulas and in data exports.

For all the reasons above, we recommend using other gene identifier types with Excel. Metascape supports Entrez Gene ID, RefSeq, UniProt, Ensembl and UCSC identifiers, all of which work safely with Excel.

Tuesday, February 9, 2016

Why should DAVID no longer be used?

The National Cancer Institute's DAVID provides a set of popular bioinformatics tools that help biologists make sense of a list of genes. However, evidence shows that DAVID has not been updated since January 2010 (Jan 27, 2010 according to Wikipedia)! The six-year-old backend database makes DAVID an inadequate tool for our bioinformatics analysis.

The DAVID home page shows that the most recent DAVID version is 6.7, released in Jan 2010. On the DAVID Forum, we can see that the DAVID team has not answered a single user post in the past six years. As one user concluded on the forum, "DAVID Died"!

Today we would like to systematically check how up to date the backend database behind DAVID is.
 
First, we took all the latest human Entrez Gene IDs (59803 in total) and tested DAVID's ID Conversion tool. DAVID recognized only 58% and missed 42%. As shown in the bar graph below, DAVID missed 87% of non-coding RNAs, since ncRNA became a hot topic only within the past three to four years and DAVID's database is too old to capture them. Things are not as bad for protein-coding genes, of which DAVID missed 6.4%.
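The per-category miss rates in this kind of test come down to a set comparison. The following sketch shows one way to compute them. The function name, the gene IDs, and the category labels are made-up placeholders for illustration, not the data or the code used for the numbers above.

```python
from collections import defaultdict


def missed_rate_by_type(gene_types, recognized):
    """Return the fraction of submitted IDs per gene type that were not recognized.

    gene_types: dict mapping each submitted gene ID to its type.
    recognized: set of IDs that the tool under test returned as known.
    """
    totals = defaultdict(int)
    missed = defaultdict(int)
    for gene_id, gene_type in gene_types.items():
        totals[gene_type] += 1
        if gene_id not in recognized:
            missed[gene_type] += 1
    return {t: missed[t] / totals[t] for t in totals}


# Toy input with placeholder IDs (not real Entrez Gene IDs).
gene_types = {
    "g1": "protein-coding",
    "g2": "protein-coding",
    "g3": "ncRNA",
    "g4": "ncRNA",
}
recognized = {"g1", "g2", "g3"}
print(missed_rate_by_type(gene_types, recognized))
```

The same comparison applies whether the submitted identifiers are Gene IDs, symbols, or RefSeq accessions.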


Second, we repeated the same experiment using the primary gene symbols (ignoring synonyms), focusing only on protein-coding genes. Among these 20921 symbols, DAVID failed to recognize 12.6%, including DXO, CTSV, ACKR1, MYCL, PKM and MYRF, to name a few.


Third, the same experiment with high-quality protein-coding NM_* RefSeq entries shows that DAVID missed 31.5% of the human transcriptome!

Our tests demonstrate that DAVID is seriously out of date. We all spend a tremendous amount of resources obtaining our gene lists, so why should we compromise our discoveries by using a six-year-old resource? This is why we created Metascape as a DAVID replacement. Metascape always keeps its data sources fresh. Try it today!