Metascape: DAVID

We live in a big data era, where biological data, and thus the knowledge extracted from them, grow rapidly. Tools such as Metascape sit on top of various bioinformatics knowledge bases, so the quality of analysis results depends heavily on the freshness of the underlying data content.

We know DAVID had not been updated for over ten years. As a result, Wadi et al. estimated that a total of 2,601 publications in 2015 alone captured only ~20% of the annotations that should have been captured [1]! Given all the effort and cost that went into generating our precious data sets, losing 80% of the insights to an outdated tool is a serious issue. Although DAVID finally updated its database after Wadi's publication, there has been no further activity since; 1.5 years have gone by and counting ...

At Metascape, one of our main goals is to keep our data sources sushi-fresh. Metascape's update engine used to run once a month. However, because of the large number of data sources Metascape integrates (Figure 1) and the more than ten organisms it covers, the automated pipeline broke a few times: some sources changed their formats, species-specific data went missing from NCBI by mistake, some data sources switched to a more protected access mode for funding reasons (OMIM), etc. The volunteers at Metascape were no longer able to keep up with these changes on a monthly basis; therefore, our updates have lagged this year.

Figure 1. To bring a rich set of features to users, Metascape integrates many data sources for over ten model organisms. Previously, when one data source broke, the update workflow halted. In the future, the existing snapshot will be used for a problematic data source, so that the update can proceed for the remaining sources and produce a release.

We have focused on polishing the data update workflow for the past few months. Two measures are now in place:

First, when the pipeline fails to fetch a data source, the copy from the previous snapshot is used, so that computation can continue unaffected. We will be notified and take action afterwards (sometimes the fix can take a while if the issue resides on the data provider's side). Nevertheless, we will still be able to produce a release.
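The fallback behaviour can be illustrated with a minimal Python sketch. This is not Metascape's actual code: the function name `fetch_with_fallback`, the `fetch` callback, and the directory layout (one file per data source in a snapshot folder) are all assumptions made for illustration.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def fetch_with_fallback(
    name: str,
    fetch: Callable[[Path], None],
    current_dir: Path,
    previous_dir: Path,
    failures: List[str],
) -> Path:
    """Fetch one data source into current_dir, or reuse the previous snapshot.

    `fetch` is a hypothetical downloader that writes the file at the given path
    and raises an exception when the source cannot be retrieved.
    """
    current_dir.mkdir(parents=True, exist_ok=True)
    target = current_dir / name
    try:
        fetch(target)
    except Exception as exc:  # any fetch failure triggers the fallback
        previous = previous_dir / name
        if not previous.exists():
            # Nothing to fall back on, so this source really is blocking.
            raise RuntimeError(
                f"{name}: fetch failed and no previous snapshot exists"
            ) from exc
        shutil.copy2(previous, target)  # reuse last release's copy
        failures.append(name)  # recorded so that maintainers can be notified
    return target
```

The `failures` list is the hook for the notification step: the release can still be built, while the names collected there tell the maintainers which sources need a manual fix later.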

Second, the pipeline automatically generates a graphical report at the end, comparing the data in the new release to those in the previous one. An example report is shown here. This is critical for catching issues that do not cause code to crash, e.g., all locus_tag entries for a certain species being missing in a new NCBI release. We review the report before we trigger the official deployment of the new knowledge base. The snapshot below (Figure 2) is compiled for A. thaliana. It clearly shows some additions to UniProt identifiers, highlighted in green, and some GO annotations, highlighted in orange, that were removed, probably due to clean-up efforts by curators. As these changes are minor, we can assume there is no obvious issue in the new release. Outstanding green/orange bars deserve our attention; in that case, the release is held off until a careful examination is done.

Figure 2. Comparison plots are automatically generated by Metascape's update engine; we can easily review where the changes are and the magnitude of the change between two releases. Problems can be caught and corrected before they propagate into the release.
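The idea behind such a comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the update engine itself: the function `compare_releases`, the representation of a release as a dictionary of identifier sets per category, and the 5% default threshold are all hypothetical.

```python
from typing import Dict, Set


def compare_releases(
    old: Dict[str, Set[str]],
    new: Dict[str, Set[str]],
    threshold: float = 0.05,
) -> Dict[str, dict]:
    """Summarise per-category additions and removals between two releases."""
    report = {}
    for category in sorted(set(old) | set(new)):
        before = old.get(category, set())
        after = new.get(category, set())
        added = len(after - before)    # would be drawn as a green bar
        removed = len(before - after)  # would be drawn as an orange bar
        base = max(len(before), 1)     # avoid dividing by zero for new categories
        report[category] = {
            "added": added,
            "removed": removed,
            # Flag the category when the change is large relative to the old size.
            "flagged": max(added, removed) / base > threshold,
        }
    return report


# Toy data: every locus_tag disappears, while GO gains one annotation.
old_release = {"locus_tag": {"AT1G01010", "AT1G01020"}, "GO": {"GO:1", "GO:2", "GO:3"}}
new_release = {"locus_tag": set(), "GO": {"GO:1", "GO:2", "GO:3", "GO:4"}}

for category, row in compare_releases(old_release, new_release, threshold=0.5).items():
    print(category, row)
```

A flagged category corresponds to an outstanding bar in the report: nothing crashed, but the release should wait until someone has looked at why the numbers moved so much.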

We believe with these two new mechanisms in place, Metascape will continue to provide fresh data, so that our users can always extract the maximum value from gene lists.

Metascape has been cited over 70 times as of this blog post [link]. Thank you for using Metascape and helping spread the word. The best reward for Metascape volunteers is to see Metascape helping users.

Reference

1. Wadi L, et al. Impact of outdated gene annotations on pathway enrichment analysis. Nat Methods. 2016 Aug 30;13(9):705-6. [link]

National Cancer Institute's DAVID provides a set of popular bioinformatics tools that help biologists make sense of a list of genes. However, evidence shows that DAVID has not been updated since January 2010 (Jan 27, 2010 according to Wikipedia)! The six-year-old backend database makes DAVID an inadequate tool for our bioinformatics analysis.

The DAVID home page shows that the most recent DAVID version is 6.7, released in January 2010. Visiting the DAVID Forum, we can see that the DAVID team has not answered a single user post for the past six years. As a user concluded on the forum, "DAVID Died"!

Today we would like to systematically check how up to date the backend database behind DAVID is.

First, we took all the latest human Entrez Gene IDs (59803 in total) and tested DAVID's ID Conversion tool. DAVID recognized only 58% and missed 42%. As shown in the bar graph below, DAVID missed 87% of non-coding RNAs, as ncRNA became a hot topic only within the past three to four years and DAVID's database is too old to capture them. Things do not seem so bad for protein-coding genes, where we lost only 6.4%.

Second, we repeated the same experiment using the primary gene symbols (ignoring synonyms), focusing only on protein-coding genes. Among these 20921 symbols, DAVID failed to recognize 12.6%; to name just a few: DXO, CTSV, ACKR1, MYCL, PKM, MYRF, etc.

Third, the same experiment with high-quality protein-coding NM_* RefSeq entries shows that DAVID missed 31.5% of the human transcriptome!
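For readers who want to run a similar check on another tool, the tally behind these percentages is simple to reproduce. The Python sketch below assumes you already have two inputs: the list of submitted identifiers with their gene types, and the set of identifiers the tool recognized. The function name `missed_fraction_by_type` and the toy identifiers are hypothetical; this is not the script we used, and it does not call DAVID.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def missed_fraction_by_type(
    genes: Iterable[Tuple[str, str]],
    recognized: Set[str],
) -> Dict[str, float]:
    """Return, per gene type, the fraction of submitted IDs that were not recognized.

    `genes` holds (identifier, gene_type) pairs; `recognized` holds the
    identifiers that the tool under test was able to map.
    """
    total = defaultdict(int)
    missed = defaultdict(int)
    for gene_id, gene_type in genes:
        total[gene_type] += 1
        if gene_id not in recognized:
            missed[gene_type] += 1
    return {gene_type: missed[gene_type] / total[gene_type] for gene_type in total}


# Toy input: made-up identifiers, not real Entrez Gene IDs.
submitted = [
    ("g1", "protein-coding"),
    ("g2", "protein-coding"),
    ("g3", "ncRNA"),
    ("g4", "ncRNA"),
]
for gene_type, fraction in missed_fraction_by_type(submitted, {"g1", "g2", "g3"}).items():
    print(f"{gene_type}: {fraction:.1%} missed")
```

Breaking the result down by gene type is what separates a small loss among protein-coding genes from a large loss among non-coding RNAs.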

Our tests demonstrate that DAVID is seriously outdated. We all spend a tremendous amount of resources obtaining our gene lists, so why should we compromise our discoveries by using a six-year-old resource? This is why we created Metascape as a DAVID replacement. Metascape always keeps its data sources fresh; try it today!

Metascape

Thursday, December 14, 2017

How We Keep Data Fresh?

Tuesday, February 9, 2016

Why DAVID should no longer be used?