A web server for predicting transporters' substrate specificities using word embeddings
Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis. This study represents a protein sequence by the word embeddings, including sub-word information, of its biological words; these embeddings then serve as input features for binary classifiers. We tried different ways of generating the word embeddings by varying the length of the biological words that make up a protein sequence.
Fig. 1. The model architecture used to identify the substrate specificities of transporters
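To make the idea of "biological words" concrete, here is a minimal sketch of how a sequence could be split into words of a chosen length and how sub-word units could be derived from each word. The function names, the overlapping and non-overlapping options, and the fastText-style boundary markers (`<` and `>`) are illustrative assumptions; the page does not state which splitting scheme or n-gram range the server actually uses.

```python
def split_into_words(sequence, k, overlapping=True):
    """Split a protein sequence into k-residue 'biological words'.

    Assumption: words are either a sliding window (step 1) or
    consecutive blocks (step k); trailing residues that do not fill
    a whole word are dropped.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    step = 1 if overlapping else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


def subword_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of a word, with '<' and '>' marking its boundaries.

    Assumption: this mirrors the way sub-word units are commonly built
    for sub-word-aware embeddings; the n-gram range here is hypothetical.
    """
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


seq = "MAGIYLFVVA"  # first ten residues of the O68460 example sequence
print(split_into_words(seq, 3))                     # sliding window
print(split_into_words(seq, 3, overlapping=False))  # consecutive blocks
print(subword_ngrams("MAG"))
```

Changing `k` corresponds to the "different lengths of biological words" mentioned above; each resulting word list would then be passed to an embedding model.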
Compared with four other feature types generated from protein sequences, our proposed features help prediction models achieve superior performance. FastTran, our best model, reaches an average area under the curve (AUC) of 0.96 on 5-fold cross-validation data and 0.99 on independent test data. This study therefore provides a basis for further research on applying natural language processing techniques in bioinformatics.
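For readers unfamiliar with the metric, the following sketch shows one way an average AUC over cross-validation folds can be computed. It uses the pairwise-ranking definition of AUC (the fraction of positive/negative pairs in which the positive example receives the higher score). The function names and the toy labels and scores are hypothetical and are not the server's data or evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    example scores higher; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(folds):
    """Average the AUC over a list of (labels, scores) folds."""
    aucs = [roc_auc(labels, scores) for labels, scores in folds]
    return sum(aucs) / len(aucs)


# Toy folds (hypothetical values, for illustration only).
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1], [0.2, 0.9]),
]
print(mean_auc(folds))
```

The quadratic pairwise loop is chosen for clarity; a rank-based implementation would be preferable for large test sets.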
The dataset used by this server was retrieved from UniProt. Details of the dataset are listed in the table below.
Number of proteins per class:
Class name | Original | After 20% similarity check | Train dataset | Test dataset |
Amino acid transporter | 189 | 73 | 61 | 12 |
Electron transporter | 596 | 221 | 184 | 37 |
Cation transporter | 294 | 88 | 73 | 15 |
Lipid transporter | 144 | 78 | 66 | 12 |
Protein/mRNA transporter | 1056 | 455 | 380 | 75 |
Sugar transporter | 205 | 84 | 71 | 13 |
Other transporter | 471 | 198 | 165 | 33 |
Membrane | 3853 | 1050 | 875 | 175 |
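As a quick arithmetic check on the table, the sketch below encodes the counts and confirms that the train and test columns add up to the number of proteins remaining after the similarity check. It also prints the share of each class held out for testing. The dictionary layout and variable names are illustrative assumptions; only the numbers come from the table.

```python
# (original, after 20% similarity check, train, test), copied from the table
counts = {
    "Amino acid transporter":   (189, 73, 61, 12),
    "Electron transporter":     (596, 221, 184, 37),
    "Cation transporter":       (294, 88, 73, 15),
    "Lipid transporter":        (144, 78, 66, 12),
    "Protein/mRNA transporter": (1056, 455, 380, 75),
    "Sugar transporter":        (205, 84, 71, 13),
    "Other transporter":        (471, 198, 165, 33),
    "Membrane":                 (3853, 1050, 875, 175),
}

test_share = {}
for name, (original, filtered, train, test) in counts.items():
    # Train and test sets should partition the filtered set.
    assert train + test == filtered, name
    test_share[name] = test / filtered
    print(f"{name:<26} kept {filtered:>4} of {original:>4}, "
          f"test share = {test_share[name]:.3f}")
```

In every class the test set is roughly one sixth of the filtered proteins.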
If you would like to build your own model or evaluate ours, the dataset is available at the link below.
Download dataset.zip
To avoid errors, please submit sequences in FASTA format (example FASTA files are provided). You can submit in two ways: paste the sequences into the text area or upload a sequence file. You can submit a single FASTA file or multiple FASTA files. The result page shows, for each sequence, whether it belongs to the electron transport proteins or not.
>O68460
MAGIYLFVVAAALAALGYGALTIKTIMAADAGTARMQEISGAVQEGASAFLNRQYKTIAV
VGAVVFVILTALLGISVGFGFLIGAVCSGIAGYVGMYISVRANVRVAAGAQQGLARGLEL
AFQSGAVTGMLVAGLALLSVAFYYILLVGIGATGRALIDPLVALGFGASLISIFARLGGG
IFTKGADVGADLVGKVEAGIPEDDPRNPAVIADNVGDNVGDCAGMAADLFETYAVTVVAT
MVLASIFFAGVPAMTSMMAYPLAIGGVCILASILGTKFVKLGPKNNIMGALYRGFLVSAG
ASFVGIILATAIVPGFGDIQGANGVLYSGFDLFLCAVIGLLVTGLLIWVTEYYTGTNFRP
VRSVAKASTTGHGTNVIQGLAISMEATALPALIICAAIITTYQLSGLFGIAITVTSMLAL
AGMVVALDAYGPVTDNAGGIAEMANLPEDVRKTTDALDAVGNTTKAVTKGYAIGSAGLGA
LVLFAAYTEDLAFFKANVDAYPAFAGVDVNFSLSSPYVVVGLFIGGLLPYLFGSMGMTAV
GRAAGSVVEEVRRQFREIPGIMEGTAKPEYGRCVDMLTKAAIKEMIIPSLLPVLAPIVLY
FVILGIADKSAAFSALGAMLLGVIVTGLFVAISMTAGGGAWDNAKKYIEDGHYGGKGSEA
HKAAVTGDTVGDPYKDTAGPAVNPMIKITNIVALLLLAVLAH
>O06342
MFPAAVGVLWQSGLRDPTPPGGPHGIEGLSLAFEKPSPVTALTQELRFATTMTGGVSLAI
WMAGVTREINLLAQASQWRRLGGTFPTNSQLTNESAASLRLYAQLIDLLDMVVDVDILSG
TSAGGINAALLASSRVTGSDLGGIRDLWLDLGALTELLRDPRDKKTPSLLYGDERIFAAL
AKRLPKLATGPFPPTTFPEAARTPSTTLYITTTLLAGETSRFTDSFGTLVQDVDLRGLFT
FTETDLARPDTAPALALAARSSASFPLAFEPSFLPFTKGTAKKGEVPARPAMAPFTSLTR
PHWVSDGGLLDNRPIGVLFKRIFDRPARRPVRRVLLFVVPSSGPAPDPMHEPPPDNVDEP
LGLIDGLLKGLAAVTTQSIAADLRAIRAHQDCMEARTDAKLRLAELAATLRNGTRLLTPS
LLTDYRTREATKQAQTLTSALLRRLSTCPPESGPATESLPKSWSAELTVGGDADKVCRQQ
ITATILLSWSQPTAQPLPQSPAELARFGQPAYDLAKGCALTVIRAAFQLARSDADIAALA
EVTEAIHRAWRPTASSDLSVLVRTMCSRPAIRQGSLENAADQLAADYLQQSTVPGDAWER
LGAALVNAYPTLTQLAASASADSGAPTDSLLARDHVAAGQLETYLSYLGTYPGRADDSRD
APTMAWKLFDLATTQRAMLPADAEIEQGLELVQVSADTRSLLAPDWQTAQQKLTGMRLHH
FGAFYKRSWRANDWMWGRLDGAGWLVHVLLDPRRVRWIVGERADTNGPQSGAQWFLGKLK
ELGAPDFPSPGYPLPAVGGGPAQHLTEDMLLDELGFLDDPAKPLPASIPWTALWLSQAWQ
QRVLEEELDGLANTVLDPQPGKLPDWSPTSSRTWATKVLAAHPGDAKYALLNENPIAGET
FASDKGSPLMAHTVAKAAATAAGAAGSVRQLPSVLKPPLITLRTLTLSGYRVVSLTKGIA
RSTIIAGALLLVLGVAAAIQSVTVFGVTGLIAAGTGGLLVVLGTWQVSGRLLFALLSFSV
VGAVLALATPVVREWLFGTQQQPGWVGTHAYWLGAQWWHPLVVVGLIALVAIMIAAATPG
RR
>P11166
MEPSSKKLTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTWVHRYGESILPTT
LTTLWSLSVAIFSVGGMIGSFSVGLFVNRFGRRNSMLMMNLLAFVSAVLMGFSKLGKSFE
MLILGRFIIGVYCGLTTGFVPMYVGEVSPTALRGALGTLHQLGIVVGILIAQVFGLDSIM
GNKDLWPLLLSIIFIPALLQCIVLPFCPESPRFLLINRNEENRAKSVLKKLRGTADVTHD
LQEMKEESRQMMREKKVTILELFRSPAYRQPILIAVVLQLSQQLSGINAVFYYSTSIFEK
AGVQQPVYATIGSGIVNTAFTVVSLFVVERAGRRTLHLIGLAGMAGCAILMTIALALLEQ
LPWMSYLSIVAIFGFVAFFEVGPGPIPWFIVAELFSQGPRPAAIAVAGFSNWTSNFIVGM
CFQYVEQLCGPYVFIIFTVLLVLFFIFTYFKVPETKGRTFDEIASGFRQGGASQSDKTPE
ELFHPLGADSQV
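Before submitting, it can help to check locally that a file parses as FASTA. The sketch below reads multi-record FASTA text into a dictionary and reports any characters outside the 20 standard amino acids. The function names are hypothetical, and the assumption that only the 20 standard residue letters are acceptable is ours; the page does not say how the server treats ambiguity codes such as X or B.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids (assumed)


def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first whitespace-delimited token after '>'.
    Spaces inside sequence lines are removed and lines are concatenated.
    """
    records = {}
    current_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            current_id = header.split()[0]
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = []
        else:
            if current_id is None:
                raise ValueError("sequence data before the first '>' header")
            records[current_id].append(line.replace(" ", "").upper())
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def invalid_residues(sequence):
    """Return the sorted characters that are not standard residue letters."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Short illustrative input: the first residues of two example entries.
example = """>O68460
MAGIYLFVVA AALAALGYGA
LTIKTIMAAD
>P11166
MEPSSKKLTG
"""
parsed = parse_fasta(example)
for seq_id, seq in parsed.items():
    print(seq_id, len(seq), invalid_residues(seq))
```

An empty list from `invalid_residues` means every character in the sequence is a standard residue letter.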
Department of Computer Science and Engineering
Yuan Ze University
135 Yuan-Tung Road, Chung-Li, Taiwan 32003, R.O.C.
School of Humanities
Nanyang Technological University
48 Nanyang Ave, Singapore 6397983
Department of Statistics – Informatics
University of Economics, University of Danang
71 Ngu Hanh Son St, Danang, Vietnam 550000
If you have any problems with or suggestions for our website, feel free to contact us via email: [email protected]