Thursday, December 14, 2017

How Do We Keep Data Fresh?

We live in the big data era, in which biological data, and the knowledge extracted from them, grow rapidly. Tools such as Metascape sit on top of various bioinformatics knowledge bases, so the quality of analysis results depends heavily on the freshness of the underlying data.

We know DAVID had not been updated for over ten years. As a result, Wadi et al. estimated that a total of 2,601 publications in 2015 alone captured only ~20% of the annotations that should have been captured [1]! Given all the effort and cost that went into generating our precious data sets, losing 80% of the insights to an outdated tool is a serious issue. Although DAVID finally updated its database after Wadi's publication, there has been no activity since; a year and a half has gone by, and counting ...

At Metascape, one of our main goals is to keep our data sources sushi-fresh. Metascape's update engine used to run once a month. However, because of the large number of data sources Metascape integrates (Figure 1) and the more than ten organisms it covers, the automated pipeline broke a few times: some sources changed their formats, species-specific data went missing in NCBI by mistake, some data sources switched to a more protected access mode for funding reasons (OMIM), etc. The volunteers at Metascape were no longer able to keep up with these changes on a monthly basis; therefore, our updates have lagged somewhat this year.

Figure 1. To bring a rich set of features to users, Metascape integrates many data sources for over ten model organisms. Previously, when one data source broke, the update workflow halted. In the future, the existing snapshot will be used for a problematic data source, so that the update can proceed for the remaining sources and produce a release.

We have focused on polishing the data update workflow for the past few months. Two measures are now in place:

First, when the pipeline fails to fetch a data source, the copy from the previous snapshot is used, so that computation can continue unaffected. We are still notified and take action afterwards (sometimes the fix can take a while if the issue resides on the data provider's side). Nevertheless, we are able to produce a release.
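The snapshot fallback can be sketched roughly as follows. This is a minimal illustration in Python, not Metascape's actual pipeline code: the function name `fetch_with_fallback`, the layout of one file per data source in a release folder, and the `failures` list used for later notification are all assumptions made for the example.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Tuple


def fetch_with_fallback(
    name: str,
    fetch: Callable[[Path], None],
    new_dir: Path,
    prev_dir: Path,
    failures: List[Tuple[str, str]],
) -> Path:
    """Fetch one data source into new_dir; reuse the previous snapshot on failure.

    All names here are hypothetical. `fetch` is any callable that writes
    the downloaded data to the path it is given.
    """
    target = new_dir / name
    try:
        fetch(target)
    except Exception as exc:
        previous = prev_dir / name
        if not previous.exists():
            # No older copy to fall back on, so the failure must surface.
            raise
        # Reuse the last good copy so the rest of the update can proceed.
        shutil.copy2(previous, target)
        # Record the problem so the maintainers can be notified afterwards.
        failures.append((name, str(exc)))
    return target
```

After all sources have been processed, a non-empty `failures` list would tell the maintainers which sources were served from the previous snapshot and why, while the release itself can still be built.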

Second, the pipeline automatically generates a graphical report at the end, comparing the data in the new release to the previous one. An example report is shown here. This is critical for catching issues that do not cause the code to crash, e.g., all locus_tag values for a certain species are missing in the new NCBI release. We review the report before we trigger the official deployment of the new knowledge base. The snapshot below (Figure 2) is compiled for A. thaliana. It is clear that there are some additions to UniProt identifiers, highlighted in green, and that some GO annotations, highlighted in orange, were removed, probably due to clean-up efforts by curators. As these changes are minor, we can assume there is no obvious issue in the new release. Outstanding green or orange bars deserve our attention; in that case, the release is held off and a careful examination is required.
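The kind of check behind such a report can be sketched as below. The sketch is only an illustration under stated assumptions: the record layout `(species, id_type, value)`, the function names, and the 20% threshold are hypothetical, and the actual Metascape report is graphical and reviewed by eye rather than thresholded automatically.

```python
from collections import Counter


def count_by_category(records):
    """Count records per (species, id_type); the record layout is an assumption."""
    return Counter((species, id_type) for species, id_type, _value in records)


def compare_releases(old_counts, new_counts, threshold=0.2):
    """Return one row per category with the relative change between releases.

    A row is flagged when the relative change exceeds `threshold`
    (a hypothetical cutoff chosen only for this example).
    """
    rows = []
    for key in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(key, 0)
        new = new_counts.get(key, 0)
        if old:
            change = (new - old) / old
        else:
            # A category that is new in this release has no baseline.
            change = float("inf") if new else 0.0
        rows.append(
            {
                "key": key,
                "old": old,
                "new": new,
                "change": change,
                "flagged": abs(change) > threshold,
            }
        )
    return rows
```

With counts like these, a category whose entries all vanish (a change of -100%) stands out immediately, even though nothing in the pipeline crashed.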

Figure 2. Comparison plots are automatically generated by Metascape's update engine; we can easily review where the changes are and how large they are between two releases. Problems can be caught and corrected before they propagate into the release.

We believe that, with these two new mechanisms in place, Metascape will continue to provide fresh data, so that our users can always extract the maximum value from their gene lists.

Metascape had been cited over 70 times at the time of this post [link]. Thank you for using Metascape and for helping spread the word. The best reward for Metascape volunteers is to see Metascape helping users.

Reference

1. Wadi L, et al. Impact of outdated gene annotations on pathway enrichment analysis. Nat Methods. 2016 Aug 30;13(9):705-6. [link]



Monday, January 30, 2017

Protein-protein Network Analysis


Protein-protein interaction (PPI) network analysis was introduced into Metascape (http://www.metascape.org) on Nov 2, 2016. We initially relied on BioGRID [1] as the proxy for all public-domain physical protein-protein interaction data sources. BioGRID contains over 200k unique human PPIs; it is well maintained and frequently updated. Its coverage of all organisms fits very well with our goal of supporting key model organisms.

The quality of the PPI network analysis certainly depends on the underlying PPI database; therefore, we have been keeping an eye on new PPI data sources. Two recent members caught our attention: OmniPath [2] and InWeb_IM [3], published in Nature Methods at the end of 2016. OmniPath focuses on human signaling interactions from literature-curated signaling pathways, while InWeb_IM focuses on integrating and scoring physical PPI pairs from eight resources (BIND, BioGRID, DIP, IntAct, MatrixDB, NetPath, Reactome, WikiPathways). In addition, InWeb_IM uses a conservative ortholog mapping approach to "transfer" some interactions from non-human organisms to human.

The following Venn diagram shows the overlap of unique human physical PPI pairs among the three databases. Metascape now uses the combined database (~600k), which triples the number of human interactions provided by BioGRID alone. Readers might notice that BioGRID is one of the eight sources for InWeb_IM, so an immediate question is why a portion of BioGRID is still not covered by InWeb_IM. Communications with the authors solved the puzzle: InWeb_IM only retains data for proteins that have been reviewed by UniProtKB. For example, R9QTR3 [4] is "unreviewed" at this time. It interacts with SSX2 and SSX3 according to BioGRID; however, it has no data in InWeb_IM.
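Combining several sources into one set of unique pairs, and working out which source contributes which pair, can be sketched as follows. This is a simplified illustration rather than Metascape's actual code: the function names are hypothetical, proteins are represented by plain identifier strings, and the small input at the bottom is toy data made up for the example, not real database content.

```python
from collections import Counter


def normalize_pairs(pairs):
    """Treat A-B and B-A as the same interaction by sorting each pair."""
    return {tuple(sorted(pair)) for pair in pairs}


def combine_sources(sources):
    """Merge per-source pair lists into one set of unique pairs.

    Returns the combined set and, for each pair, the set of source names
    that report it (the information a Venn diagram summarizes).
    """
    normalized = {name: normalize_pairs(pairs) for name, pairs in sources.items()}
    combined = set().union(*normalized.values())
    membership = {
        pair: frozenset(name for name, pairs in normalized.items() if pair in pairs)
        for pair in combined
    }
    return combined, membership


def venn_counts(membership):
    """Count how many pairs fall into each region of the Venn diagram."""
    return Counter(membership.values())


# Toy input for illustration only; these lists are not real database content.
toy_sources = {
    "BioGRID": [("SSX2", "R9QTR3"), ("TLR7", "MYD88")],
    "InWeb_IM": [("MYD88", "TLR7"), ("TLR7", "MLF1")],
    "OmniPath": [("TLR7", "MYD88")],
}
toy_combined, toy_membership = combine_sources(toy_sources)
toy_regions = venn_counts(toy_membership)
```

Sorting each pair before deduplication matters because the same interaction may be listed as A-B in one source and B-A in another; without it, the combined count would be inflated.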

Out of curiosity, we compared the public-domain data to a commercial literature-based human PPI database. A large discrepancy remains. Protein-protein interactions depend on many factors, such as post-translational modifications and timing (complexes are not formed until key proteins are available); still, since the commercial database is also literature-based, the weak overlap deserves some attention. We have not analyzed this topic in detail; nevertheless, a quick search using TLR7 as an example identified interactions unique to each source. The TLR7-MMP9 interaction was found in the commercial source, supported by a co-immunoprecipitation study [5]; this is a valid link missed by the public sources. Most of the InWeb_IM-only links originated from interactions inferred through ortholog data, and were understandably missed by the commercial database. The TLR7-MLF1 interaction was included in the InWeb_IM release file (through UniProt IDs P58340 and Q9NYK1), indicating that there is experimental support missed by the commercial source. This interaction pair has a confidence score of 0.148, which is below the threshold used in the InWeb_IM web tool. However, no threshold was mentioned in the InWeb_IM publication, and private communications with the authors confirmed that the analyses presented in the paper were largely based on all interaction pairs; we therefore retain all interaction pairs for Metascape analysis. We also need to point out that the commercial database contains non-PPI interactions as well (not included in the Venn diagram), such as protein-gene regulation, which are still meaningful for network analysis. Our initial check indicates that commercial sources contain many literature-based PPI data that are missed by the public sources, while public sources provide some additional experimental data and some inferred interactions. The two remain complementary.



A comprehensive PPI database is only one side of the coin for an informative PPI analysis. The PPI analysis in Metascape is rather unique in that we automatically apply the Molecular Complex Detection (MCODE) algorithm [6] to the resulting networks to identify tightly connected network cores. This is extremely helpful when the larger network is hard to read. Metascape also automatically analyzes each network component for pathway enrichment, thereby assigning it biological functions for easy interpretation. All networks identified are available in PNG, PDF and Cytoscape [7] formats. These features set Metascape apart from other online PPI analysis tools.

In summary, Metascape currently contains comprehensive public-domain PPI data sources. Combined with its broad-spectrum algorithms, protein network analyses have never been better.

Reference:
  1. Stark C, et al. Nucleic Acids Res. 2006. 34:D535-539 (https://thebiogrid.org)
  2. Turei D, et al. Nat Methods. 2016. 13:966-967 (http://omnipathdb.org)
  3. Li T, et al. Nat Methods. 2017. 14:61-64 (https://www.intomics.com/inbiomap)
  4. http://www.uniprot.org/uniprot/R9QTR3
  5. Abdulkhalek S, Szewczuk MR. Cell Signal. 2013. 25:2093-2105.
  6. Bader GD, Hogue CW. BMC Bioinformatics 2003. 4:2.
  7. Shannon P, et al. Genome Res. 2003. 13:2498-2504 (http://cytoscape.org)