FastTrans

A web server for predicting transporters' substrate specificities using word embeddings

Introduction

Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying these substrate specificities is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis. This study represents a protein sequence by the word embeddings of its biological words, including their sub-word information; these embeddings then serve as features for binary classifiers. We tried different ways of generating the word embeddings by varying the length of the biological words that make up a protein sequence.



Fig. 1. The model architecture used to identify the substrate specificities of transporters
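
To make the idea of "biological words" concrete, the sketch below splits a protein sequence into fixed-length words and lists the character n-grams of one word, which is the kind of sub-word unit that sub-word-aware embeddings build on. It is an illustration only: the function names, the overlapping-window default, and the n-gram range are assumptions, since this page does not state how the server tokenizes sequences.

```python
def split_into_words(sequence, word_length, overlapping=True):
    """Split a protein sequence into 'biological words' of a fixed length.

    Assumption: whether words overlap is not stated on this page, so both
    variants are offered; trailing residues that do not fill a word are dropped.
    """
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    step = 1 if overlapping else word_length
    last_start = len(sequence) - word_length
    return [sequence[i:i + word_length] for i in range(0, last_start + 1, step)]


def subword_ngrams(word, n_min=1, n_max=2):
    """List the character n-grams of a word for n_min <= n <= n_max."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            ngrams.append(word[i:i + n])
    return ngrams


if __name__ == "__main__":
    fragment = "MAGIYL"  # first residues of sample sequence O68460
    print(split_into_words(fragment, 3))                     # overlapping words
    print(split_into_words(fragment, 3, overlapping=False))  # non-overlapping words
    print(subword_ngrams("MAG"))
```

Changing `word_length` corresponds to the experiment described above, in which embeddings were generated from biological words of different lengths.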

Result

Compared with four different feature types generated from protein sequences, our proposed features help prediction models achieve superior performance. FastTrans, our best model, reaches an average area under the curve (AUC) of 0.96 on 5-fold cross-validation data and 0.99 on independent test data. Accordingly, this study provides a basis for further research on applying natural language processing techniques in bioinformatics.
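
For readers unfamiliar with the metric, the following sketch shows one common way to compute the area under the ROC curve for a binary classifier and to average it over cross-validation folds. It is not the evaluation code behind the figures above; the function names and the toy labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example receives
    the higher score; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auc(folds):
    """Average the AUC over a list of (labels, scores) pairs, one per fold."""
    return sum(roc_auc(labels, scores) for labels, scores in folds) / len(folds)


if __name__ == "__main__":
    # Toy data only: two hypothetical folds with made-up classifier scores.
    toy_folds = [
        ([1, 0], [0.9, 0.1]),
        ([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]),
    ]
    print(mean_auc(toy_folds))
```

The pairwise formulation is quadratic in the number of examples, which is acceptable for a dataset of this size; a rank-based implementation would be preferable for much larger sets.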

Dataset

The dataset used by this server was retrieved from UniProt. Its details are listed in the table below.

Number of proteins per class

Class name                Original  After 20% similarity check  Train dataset  Test dataset
Amino acid transporter         189                          73             61            12
Electron transporter           596                         221            184            37
Cation transporter             294                          88             73            15
Lipid transporter              144                          78             66            12
Protein/mRNA transporter      1056                         455            380            75
Sugar transporter              205                          84             71            13
Other transporter              471                         198            165            33
Membrane                      3853                        1050            875           175

If you would like to build your own model or evaluate ours, the dataset is available at the link below.

Download dataset.zip

Submission

To avoid errors, please submit your sequences in FASTA format (example FASTA files are provided). There are two ways to submit: paste the sequences into the text area or upload a sequence file. You can submit a single FASTA file or multiple FASTA files. The result page shows, for each sequence, whether or not it belongs to the electron transport proteins.

Sample FASTA sequence(s)
>O68460
MAGIYLFVVAAALAALGYGALTIKTIMAADAGTARMQEISGAVQEGASAFLNRQYKTIAV
VGAVVFVILTALLGISVGFGFLIGAVCSGIAGYVGMYISVRANVRVAAGAQQGLARGLEL
AFQSGAVTGMLVAGLALLSVAFYYILLVGIGATGRALIDPLVALGFGASLISIFARLGGG
IFTKGADVGADLVGKVEAGIPEDDPRNPAVIADNVGDNVGDCAGMAADLFETYAVTVVAT
MVLASIFFAGVPAMTSMMAYPLAIGGVCILASILGTKFVKLGPKNNIMGALYRGFLVSAG
ASFVGIILATAIVPGFGDIQGANGVLYSGFDLFLCAVIGLLVTGLLIWVTEYYTGTNFRP
VRSVAKASTTGHGTNVIQGLAISMEATALPALIICAAIITTYQLSGLFGIAITVTSMLAL
AGMVVALDAYGPVTDNAGGIAEMANLPEDVRKTTDALDAVGNTTKAVTKGYAIGSAGLGA
LVLFAAYTEDLAFFKANVDAYPAFAGVDVNFSLSSPYVVVGLFIGGLLPYLFGSMGMTAV
GRAAGSVVEEVRRQFREIPGIMEGTAKPEYGRCVDMLTKAAIKEMIIPSLLPVLAPIVLY
FVILGIADKSAAFSALGAMLLGVIVTGLFVAISMTAGGGAWDNAKKYIEDGHYGGKGSEA
HKAAVTGDTVGDPYKDTAGPAVNPMIKITNIVALLLLAVLAH
>O06342
MFPAAVGVLWQSGLRDPTPPGGPHGIEGLSLAFEKPSPVTALTQELRFATTMTGGVSLAI
WMAGVTREINLLAQASQWRRLGGTFPTNSQLTNESAASLRLYAQLIDLLDMVVDVDILSG
TSAGGINAALLASSRVTGSDLGGIRDLWLDLGALTELLRDPRDKKTPSLLYGDERIFAAL
AKRLPKLATGPFPPTTFPEAARTPSTTLYITTTLLAGETSRFTDSFGTLVQDVDLRGLFT
FTETDLARPDTAPALALAARSSASFPLAFEPSFLPFTKGTAKKGEVPARPAMAPFTSLTR
PHWVSDGGLLDNRPIGVLFKRIFDRPARRPVRRVLLFVVPSSGPAPDPMHEPPPDNVDEP
LGLIDGLLKGLAAVTTQSIAADLRAIRAHQDCMEARTDAKLRLAELAATLRNGTRLLTPS
LLTDYRTREATKQAQTLTSALLRRLSTCPPESGPATESLPKSWSAELTVGGDADKVCRQQ
ITATILLSWSQPTAQPLPQSPAELARFGQPAYDLAKGCALTVIRAAFQLARSDADIAALA
EVTEAIHRAWRPTASSDLSVLVRTMCSRPAIRQGSLENAADQLAADYLQQSTVPGDAWER
LGAALVNAYPTLTQLAASASADSGAPTDSLLARDHVAAGQLETYLSYLGTYPGRADDSRD
APTMAWKLFDLATTQRAMLPADAEIEQGLELVQVSADTRSLLAPDWQTAQQKLTGMRLHH
FGAFYKRSWRANDWMWGRLDGAGWLVHVLLDPRRVRWIVGERADTNGPQSGAQWFLGKLK
ELGAPDFPSPGYPLPAVGGGPAQHLTEDMLLDELGFLDDPAKPLPASIPWTALWLSQAWQ
QRVLEEELDGLANTVLDPQPGKLPDWSPTSSRTWATKVLAAHPGDAKYALLNENPIAGET
FASDKGSPLMAHTVAKAAATAAGAAGSVRQLPSVLKPPLITLRTLTLSGYRVVSLTKGIA
RSTIIAGALLLVLGVAAAIQSVTVFGVTGLIAAGTGGLLVVLGTWQVSGRLLFALLSFSV
VGAVLALATPVVREWLFGTQQQPGWVGTHAYWLGAQWWHPLVVVGLIALVAIMIAAATPG
RR
>P11166
MEPSSKKLTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTWVHRYGESILPTT
LTTLWSLSVAIFSVGGMIGSFSVGLFVNRFGRRNSMLMMNLLAFVSAVLMGFSKLGKSFE
MLILGRFIIGVYCGLTTGFVPMYVGEVSPTALRGALGTLHQLGIVVGILIAQVFGLDSIM
GNKDLWPLLLSIIFIPALLQCIVLPFCPESPRFLLINRNEENRAKSVLKKLRGTADVTHD
LQEMKEESRQMMREKKVTILELFRSPAYRQPILIAVVLQLSQQLSGINAVFYYSTSIFEK
AGVQQPVYATIGSGIVNTAFTVVSLFVVERAGRRTLHLIGLAGMAGCAILMTIALALLEQ
LPWMSYLSIVAIFGFVAFFEVGPGPIPWFIVAELFSQGPRPAAIAVAGFSNWTSNFIVGM
CFQYVEQLCGPYVFIIFTVLLVLFFIFTYFKVPETKGRTFDEIASGFRQGGASQSDKTPE
ELFHPLGADSQV
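
Before submitting, it can help to check locally that a file parses as FASTA. The sketch below reads FASTA text like the sample above into a dictionary and reports any characters outside the 20 standard amino-acid letters. The function names and the choice of allowed alphabet are assumptions made for illustration; the checks the server itself performs are not documented here.

```python
# Assumption: only the 20 standard amino-acid letters are treated as valid.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping identifier -> sequence."""
    records = {}
    current_id = None
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {line_number}: header has no identifier")
            current_id = fields[0]  # first token after '>' is the identifier
            if current_id in records:
                raise ValueError(f"line {line_number}: duplicate identifier {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError(f"line {line_number}: sequence data before any '>' header")
        else:
            # Sequence lines may be wrapped; join them and normalize case.
            records[current_id] += line.upper()
    return records


def invalid_residues(sequence):
    """Return the sorted list of characters that are not standard residues."""
    return sorted(set(sequence) - VALID_RESIDUES)


if __name__ == "__main__":
    example = ">P11166\nMEPSSKKLTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTWVHRYGESILPTT\n"
    for identifier, sequence in parse_fasta(example).items():
        print(identifier, len(sequence), invalid_residues(sequence))
```

An empty list from `invalid_residues` means the sequence contains only standard residues under the assumption above.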

Members

Yu-Yen Ou
Associate Professor

Department of Computer Science and Engineering
Yuan Ze University
135 Yuan-Tung Road, Chung-Li, Taiwan 32003, R.O.C.

Trung-Duong Nguyen-Trinh
Research Scholar

Department of Computer Science and Engineering
Yuan Ze University
135 Yuan-Tung Road, Chung-Li, Taiwan 32003, R.O.C.

Quang-Thai Ho
Research Scholar

Department of Computer Science and Engineering
Yuan Ze University
135 Yuan-Tung Road, Chung-Li, Taiwan 32003, R.O.C.

Nguyen-Quoc-Khanh Le
Research Scholar

School of Humanities
Nanyang Technological University
48 Nanyang Ave, Singapore 6397983

Dinh-Van Phan
Research Scholar

Department of Statistics – Informatics
University of Economics, University of Danang
71 Ngu Hanh Son St, Danang, Vietnam 550000

Contact us


Department of Computer Science and Engineering
Graduate Program in Biomedical Informatics
Bioinformatics Laboratory (R1607B)
Address: No. 135, Yuandong Road, Chungli City, Taoyuan County, Taiwan, R.O.C. 32003
Tel: (03) 463-8800

If you have any problems with or suggestions for our website, feel free to contact us by email: [email protected]