HTP

# of TMPs:	15327
Server version:	v2.0.0
Database release:	d.2.2
Release date:	22.07.2024

Documents

Database used/linked

The human proteome was downloaded from UniProt (Release 2023_03). It comprises 20,586 sequences. Alternatively spliced protein sequences are not used in the current version of the HTP database.

Topology data for the constrained prediction methods were collected from three different resources. The most reliable data come from the PDBTM database, which contains the 3D structures of transmembrane proteins together with the most likely membrane orientation as determined by the TMDET algorithm. Since PDBTM contains only topography, i.e. the sequential localization of the transmembrane helices, this information has been extended to topology data using the TOPDB database.

The TOPDB database was established in 2008 and contains experimentally established topology data of transmembrane proteins. It was updated in 2015 and in 2024 using several sources.

The third resource was TOPDOM, a collection of domains and sequence motifs that are conservatively located on the cytosolic or extra-cytosolic side of transmembrane proteins. We used the search engine of the TOPDOM homepage to locate these domains and motifs in the human sequences, and used the position and topological localization of the results as constraints.

Preparation of benchmark sets

The TOPDB database was split into two parts: the first contains entries with a known 3D structure, while the second contains entries whose topology is confirmed only by molecular biology experiments. Entries whose reliability is above 99% for bitopic and above 95% for polytopic transmembrane proteins were selected. For each sequence in the human proteome, a BLAST search was run against these two sets. The resulting hits were aligned with the query sequences using HSPs, and a hit was kept if its sequence similarity was above 40%, the overlapping sequences covered all TM helices of the TOPDB entry, and the length of the hit sequence was above 80% of the length of the query sequence. Finally, these sets were filtered with the CD-HIT algorithm at 40% similarity. This resulted in 134 sequences whose homologous partner has a known structure ("3D benchmark set") and 333 sequences whose homologous partner has only experimental topology data ("experimental benchmark set").
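The three acceptance criteria for a BLAST hit can be illustrated with a minimal Python sketch. The `Hit` record, its field names and the representation of the alignment as a single contiguous range are assumptions made for illustration only; they are not taken from the HTP code base, and real HSP-based alignments may consist of several segments.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit of a human query against a TOPDB entry."""
    similarity: float        # sequence similarity of the alignment, in percent
    query_length: int        # length of the human query sequence
    hit_length: int          # length of the TOPDB hit sequence
    aligned_range: tuple     # (start, end) of the alignment on the TOPDB entry, inclusive
    tm_helices: list         # [(start, end), ...] TM helix spans of the TOPDB entry, inclusive


def covers_all_helices(hit):
    """True if every TM helix of the TOPDB entry lies inside the aligned range."""
    start, end = hit.aligned_range
    return all(start <= h_start and h_end <= end for h_start, h_end in hit.tm_helices)


def keep_hit(hit, min_similarity=40.0, min_length_ratio=0.8):
    """Apply the three criteria described in the text; all must hold."""
    return (
        hit.similarity > min_similarity
        and covers_all_helices(hit)
        and hit.hit_length > min_length_ratio * hit.query_length
    )
```

A hit with 55% similarity, a hit length of 280 against a query length of 300, and an alignment spanning all helices would be kept under these assumptions; lowering any one of the three values below its threshold rejects it.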

Filtering transmembrane proteins

Eight prediction methods were tested for filtering transmembrane proteins, i.e. for determining whether a sequence codes for a transmembrane protein or a non-transmembrane one. These methods are Memsat, Octopus, Philius, Phobius, Pro-TMHMM, Scampi-single, Scampi-msa and TMHMM. They were executed on preprocessed sequences, i.e. after removing transit and/or signal peptides from the query sequences. As none of these methods was as accurate as desired, a simple consensus approach was used to increase the prediction accuracy. Dozens of combinations of methods and parameters were tested, and the best one was chosen as the final consensus algorithm. The highest accuracy was reached when three specific methods were used for filtering, namely Phobius, Scampi-single and TMHMM, and a sequence was accepted when at least two of them predicted at least one membrane region.
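The final consensus rule is a two-out-of-three vote, sketched below in Python. The function name and the input format (a mapping from method name to the number of predicted membrane regions) are assumptions for illustration; running the predictors themselves is outside the scope of this sketch.

```python
# Methods used in the final consensus, as named in the text.
CONSENSUS_METHODS = ("Phobius", "Scampi-single", "TMHMM")


def is_transmembrane(region_counts, min_votes=2):
    """Return True if at least `min_votes` methods predict one or more membrane regions.

    region_counts: dict mapping method name -> number of predicted membrane
    regions for one preprocessed sequence (hypothetical input format).
    A method missing from the dict is counted as predicting no region.
    """
    votes = sum(1 for method in CONSENSUS_METHODS if region_counts.get(method, 0) >= 1)
    return votes >= min_votes
```

For example, `is_transmembrane({"Phobius": 1, "Scampi-single": 0, "TMHMM": 7})` accepts the sequence, while a sequence for which only TMHMM predicts a membrane region is rejected.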

Constrained Consensus TOPology prediction

The Constrained Consensus Topology prediction method (CCTOP) consists of three basic steps. The first step is the prediction of signal peptides. If a signal peptide is predicted, it is cut off before any further analysis, because most topology prediction methods confuse signal peptides with transmembrane helices (TMHs). Next, CCTOP decides whether the investigated protein sequence codes for a transmembrane protein (TMP) or a non-TMP. The final step is the topology prediction. As its name shows, CCTOP uses several methods to perform these tasks and incorporates already known topology data and bioinformatics evidence into the final topology prediction as constraints. More information about these steps and the prediction results can be found in the HTP manuscript.
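The order of the three steps can be sketched as a small Python skeleton. This is not the CCTOP implementation: the function name, the three predictor callables and the result dictionary are hypothetical, and the sketch assumes that constraint positions already refer to the sequence after signal peptide removal.

```python
def run_pipeline(sequence, predict_signal_peptide, is_tmp, predict_topology, constraints=()):
    """Run the three steps in order, using caller-supplied predictors.

    predict_signal_peptide(seq) -> length of the predicted signal peptide (0 if none)
    is_tmp(seq)                 -> True if the sequence is predicted to be a TMP
    predict_topology(seq, cons) -> topology prediction under the given constraints
    All three callables are placeholders for real prediction methods.
    """
    # Step 1: predict the signal peptide and cut it off if one is found.
    sp_length = predict_signal_peptide(sequence)
    mature = sequence[sp_length:]

    result = {"signal_peptide_length": sp_length, "is_tmp": False, "topology": None}

    # Step 2: decide between TMP and non-TMP on the cut sequence.
    if not is_tmp(mature):
        return result

    # Step 3: predict the topology, passing known data as constraints.
    result["is_tmp"] = True
    result["topology"] = predict_topology(mature, list(constraints))
    return result
```

Passing the predictors in as arguments keeps the sketch independent of any particular tool, so each step can be replaced by a consensus of several methods.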

Measuring the reliability of the consensus prediction

The reliability of a topology prediction is calculated by summing the posterior probabilities along the state path determined by the Viterbi algorithm in the HMMTOP program. The reliability correlates highly with the prediction accuracy measured on a human benchmark set; it can therefore be used to estimate the accuracy of an individual prediction without any prior information about the topology.
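The summation can be illustrated with a short Python sketch. The data layout (one dictionary of state posteriors per residue) and the function names are assumptions, and the normalization to a percentage in the second function is likewise an assumption added for readability; the text above specifies only the sum.

```python
def path_posterior_sum(posteriors, path):
    """Sum the posterior probabilities of the states along a given path.

    posteriors: list with one dict per residue, mapping state -> posterior probability
    path:       list with one state per residue (e.g. the Viterbi path)
    """
    if len(posteriors) != len(path):
        raise ValueError("posteriors and path must have the same length")
    return sum(residue[state] for residue, state in zip(posteriors, path))


def path_reliability_percent(posteriors, path):
    """Assumed normalization: mean posterior along the path, expressed in percent."""
    if not path:
        raise ValueError("path must not be empty")
    return 100.0 * path_posterior_sum(posteriors, path) / len(path)
```

A path whose states all carry high posterior probability yields a value close to 100 under this normalization, while a path running through poorly supported states yields a lower one.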

Generating the human transmembrane proteome database within the framework of UniTmp

Using CCTOP as an accurate and precise filtering, signal peptide and topology prediction method, we investigated all sequences in the human proteome as defined by the UniProt database (Release 2023_03). We classified 5506 sequences as TMPs, which is 27% of the proteome, as expected from earlier studies. HTP is now part of the UniTmp framework, which means that the information in PDBTM, TOPDB and TOPDOM is stored in the same MySQL database, and new topology data in any of the source databases are therefore instantly reflected in HTP.

API documentation

Click here to access API documentation.