These scripts can be used to mine database files from UniProt and TopDB into transmembrane helix (TMH) datasets and to perform sequence distribution analysis on those datasets.
The downloaded zip includes the original files downloaded from the respective databases and the UniProt non-redundant datasets. It also contains the Python scripts used to generate the datasets, tables and figures, as well as the parsed .csv files of each dataset. Each .csv file contains the ID from the respective database, the full protein sequence, the TMH sequences, the flank sequences, the number of TMHs in the given protein, and the orientation of the TMH. Each dataset is provided at three cut-off flank lengths (5, 10 or 20), with a separate file for each cut-off.
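As a quick way to check which of the fields listed above a given .csv file holds, the following minimal sketch reads the first few records. It assumes that the first row of each file is a header row, which is not confirmed here; the function name preview_dataset and its parameters are hypothetical and are not part of the provided scripts.

```python
import csv
from itertools import islice
from typing import Iterable, List, Tuple


def preview_dataset(lines: Iterable[str], n_rows: int = 3) -> Tuple[List[str], List[List[str]]]:
    """Return (header, first n_rows records) from an iterable of csv lines.

    ASSUMPTION: the first row is a header naming the columns. If the
    dataset files have no header row, the "header" is simply the first record.
    """
    reader = csv.reader(lines)
    # An empty input yields an empty header instead of raising StopIteration.
    header = next(reader, [])
    # Read at most n_rows further records without loading the whole file.
    rows = list(islice(reader, n_rows))
    return header, rows
```

The function accepts any iterable of lines, so it can be given a file handle opened with open(path, newline="") for one of the dataset files.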
Scripts.zip includes the scripts used for many of the figures and tables throughout the results.
The scripts folder also includes various text files to assist the scripts.
Datasets.zip includes the processed database files used in the distribution analysis throughout the results.
For each of the database files, there are several processed dataset files. The name of each dataset file describes the processing used to generate it. The csv file name includes, in order: the name of the original database file; the maximum allowed flank length; the state of the Flankclash variable; and an optional additional condition, present when only TMH records with flanks of either half or full length are included in the dataset. If Flankclash is true, the dataset does not allow overlap between the flanking regions of TMHs. If it is false, the dataset allows overlap.
TopDB_5_flanklength_flankclashFalse_only_full_flanks.csv
TopDB_5_flanklength_flankclashFalse_only_half_flanks.csv
TopDB_5_flanklength_flankclashFalse.csv
TopDB_5_flanklength_flankclashTrue_only_full_flanks.csv
TopDB_5_flanklength_flankclashTrue_only_half_flanks.csv
TopDB_5_flanklength_flankclashTrue.csv
TopDB_10_flanklength_flankclashFalse_only_full_flanks.csv
TopDB_10_flanklength_flankclashFalse_only_half_flanks.csv
TopDB_10_flanklength_flankclashFalse.csv
TopDB_10_flanklength_flankclashTrue_only_full_flanks.csv
TopDB_10_flanklength_flankclashTrue_only_half_flanks.csv
TopDB_10_flanklength_flankclashTrue.csv
TopDB_20_flanklength_flankclashFalse_only_full_flanks.csv
TopDB_20_flanklength_flankclashFalse_only_half_flanks.csv
TopDB_20_flanklength_flankclashFalse.csv
TopDB_20_flanklength_flankclashTrue_only_full_flanks.csv
TopDB_20_flanklength_flankclashTrue_only_half_flanks.csv
TopDB_20_flanklength_flankclashTrue.csv
UniArch_5_flanklength_flankclashFalse_only_full_flanks.csv
UniArch_5_flanklength_flankclashFalse_only_half_flanks.csv
UniArch_5_flanklength_flankclashFalse.csv
UniArch_5_flanklength_flankclashTrue_only_full_flanks.csv
UniArch_5_flanklength_flankclashTrue_only_half_flanks.csv
UniArch_5_flanklength_flankclashTrue.csv
UniArch_10_flanklength_flankclashFalse_only_full_flanks.csv
UniArch_10_flanklength_flankclashFalse_only_half_flanks.csv
UniArch_10_flanklength_flankclashFalse.csv
UniArch_10_flanklength_flankclashTrue_only_full_flanks.csv
UniArch_10_flanklength_flankclashTrue_only_half_flanks.csv
UniArch_10_flanklength_flankclashTrue.csv
UniArch_20_flanklength_flankclashFalse_only_full_flanks.csv
UniArch_20_flanklength_flankclashFalse_only_half_flanks.csv
UniArch_20_flanklength_flankclashFalse.csv
UniArch_20_flanklength_flankclashTrue_only_full_flanks.csv
UniArch_20_flanklength_flankclashTrue_only_half_flanks.csv
UniArch_20_flanklength_flankclashTrue.csv
UniBacilli_5_flanklength_flankclashFalse_only_full_flanks.csv
UniBacilli_5_flanklength_flankclashFalse_only_half_flanks.csv
UniBacilli_5_flanklength_flankclashFalse.csv
UniBacilli_5_flanklength_flankclashTrue_only_full_flanks.csv
UniBacilli_5_flanklength_flankclashTrue_only_half_flanks.csv
UniBacilli_5_flanklength_flankclashTrue.csv
UniBacilli_10_flanklength_flankclashFalse_only_full_flanks.csv
UniBacilli_10_flanklength_flankclashFalse_only_half_flanks.csv
UniBacilli_10_flanklength_flankclashFalse.csv
UniBacilli_10_flanklength_flankclashTrue_only_full_flanks.csv
UniBacilli_10_flanklength_flankclashTrue_only_half_flanks.csv
UniBacilli_10_flanklength_flankclashTrue.csv
UniBacilli_20_flanklength_flankclashFalse_only_full_flanks.csv
UniBacilli_20_flanklength_flankclashFalse_only_half_flanks.csv
UniBacilli_20_flanklength_flankclashFalse.csv
UniBacilli_20_flanklength_flankclashTrue_only_full_flanks.csv
UniBacilli_20_flanklength_flankclashTrue_only_half_flanks.csv
UniBacilli_20_flanklength_flankclashTrue.csv
UniCress_5_flanklength_flankclashFalse_only_full_flanks.csv
UniCress_5_flanklength_flankclashFalse_only_half_flanks.csv
UniCress_5_flanklength_flankclashFalse.csv
UniCress_5_flanklength_flankclashTrue_only_full_flanks.csv
UniCress_5_flanklength_flankclashTrue_only_half_flanks.csv
UniCress_5_flanklength_flankclashTrue.csv
UniCress_10_flanklength_flankclashFalse_only_full_flanks.csv
UniCress_10_flanklength_flankclashFalse_only_half_flanks.csv
UniCress_10_flanklength_flankclashFalse.csv
UniCress_10_flanklength_flankclashTrue_only_full_flanks.csv
UniCress_10_flanklength_flankclashTrue_only_half_flanks.csv
UniCress_10_flanklength_flankclashTrue.csv
UniCress_20_flanklength_flankclashFalse_only_full_flanks.csv
UniCress_20_flanklength_flankclashFalse_only_half_flanks.csv
UniCress_20_flanklength_flankclashFalse.csv
UniCress_20_flanklength_flankclashTrue_only_full_flanks.csv
UniCress_20_flanklength_flankclashTrue_only_half_flanks.csv
UniCress_20_flanklength_flankclashTrue.csv
UniEcoli_5_flanklength_flankclashFalse_only_full_flanks.csv
UniEcoli_5_flanklength_flankclashFalse_only_half_flanks.csv
UniEcoli_5_flanklength_flankclashFalse.csv
UniEcoli_5_flanklength_flankclashTrue_only_full_flanks.csv
UniEcoli_5_flanklength_flankclashTrue_only_half_flanks.csv
UniEcoli_5_flanklength_flankclashTrue.csv
UniEcoli_10_flanklength_flankclashFalse_only_full_flanks.csv
UniEcoli_10_flanklength_flankclashFalse_only_half_flanks.csv
UniEcoli_10_flanklength_flankclashFalse.csv
UniEcoli_10_flanklength_flankclashTrue_only_full_flanks.csv
UniEcoli_10_flanklength_flankclashTrue_only_half_flanks.csv
UniEcoli_10_flanklength_flankclashTrue.csv
UniEcoli_20_flanklength_flankclashFalse_only_full_flanks.csv
UniEcoli_20_flanklength_flankclashFalse_only_half_flanks.csv
UniEcoli_20_flanklength_flankclashFalse.csv
UniEcoli_20_flanklength_flankclashTrue_only_full_flanks.csv
UniEcoli_20_flanklength_flankclashTrue_only_half_flanks.csv
UniEcoli_20_flanklength_flankclashTrue.csv
UniER_5_flanklength_flankclashFalse_only_full_flanks.csv
UniER_5_flanklength_flankclashFalse_only_half_flanks.csv
UniER_5_flanklength_flankclashFalse.csv
UniER_5_flanklength_flankclashTrue_only_full_flanks.csv
UniER_5_flanklength_flankclashTrue_only_half_flanks.csv
UniER_5_flanklength_flankclashTrue.csv
UniER_10_flanklength_flankclashFalse_only_full_flanks.csv
UniER_10_flanklength_flankclashFalse_only_half_flanks.csv
UniER_10_flanklength_flankclashFalse.csv
UniER_10_flanklength_flankclashTrue_only_full_flanks.csv
UniER_10_flanklength_flankclashTrue_only_half_flanks.csv
UniER_10_flanklength_flankclashTrue.csv
UniER_20_flanklength_flankclashFalse_only_full_flanks.csv
UniER_20_flanklength_flankclashFalse_only_half_flanks.csv
UniER_20_flanklength_flankclashFalse.csv
UniER_20_flanklength_flankclashTrue_only_full_flanks.csv
UniER_20_flanklength_flankclashTrue_only_half_flanks.csv
UniER_20_flanklength_flankclashTrue.csv
UniFungi_5_flanklength_flankclashFalse_only_full_flanks.csv
UniFungi_5_flanklength_flankclashFalse_only_half_flanks.csv
UniFungi_5_flanklength_flankclashFalse.csv
UniFungi_5_flanklength_flankclashTrue_only_full_flanks.csv
UniFungi_5_flanklength_flankclashTrue_only_half_flanks.csv
UniFungi_5_flanklength_flankclashTrue.csv
UniFungi_10_flanklength_flankclashFalse_only_full_flanks.csv
UniFungi_10_flanklength_flankclashFalse_only_half_flanks.csv
UniFungi_10_flanklength_flankclashFalse.csv
UniFungi_10_flanklength_flankclashTrue_only_full_flanks.csv
UniFungi_10_flanklength_flankclashTrue_only_half_flanks.csv
UniFungi_10_flanklength_flankclashTrue.csv
UniFungi_20_flanklength_flankclashFalse_only_full_flanks.csv
UniFungi_20_flanklength_flankclashFalse_only_half_flanks.csv
UniFungi_20_flanklength_flankclashFalse.csv
UniFungi_20_flanklength_flankclashTrue_only_full_flanks.csv
UniFungi_20_flanklength_flankclashTrue_only_half_flanks.csv
UniFungi_20_flanklength_flankclashTrue.csv
UniGolgi_5_flanklength_flankclashFalse_only_full_flanks.csv
UniGolgi_5_flanklength_flankclashFalse_only_half_flanks.csv
UniGolgi_5_flanklength_flankclashFalse.csv
UniGolgi_5_flanklength_flankclashTrue_only_full_flanks.csv
UniGolgi_5_flanklength_flankclashTrue_only_half_flanks.csv
UniGolgi_5_flanklength_flankclashTrue.csv
UniGolgi_10_flanklength_flankclashFalse_only_full_flanks.csv
UniGolgi_10_flanklength_flankclashFalse_only_half_flanks.csv
UniGolgi_10_flanklength_flankclashFalse.csv
UniGolgi_10_flanklength_flankclashTrue_only_full_flanks.csv
UniGolgi_10_flanklength_flankclashTrue_only_half_flanks.csv
UniGolgi_10_flanklength_flankclashTrue.csv
UniGolgi_20_flanklength_flankclashFalse_only_full_flanks.csv
UniGolgi_20_flanklength_flankclashFalse_only_half_flanks.csv
UniGolgi_20_flanklength_flankclashFalse.csv
UniGolgi_20_flanklength_flankclashTrue_only_full_flanks.csv
UniGolgi_20_flanklength_flankclashTrue_only_half_flanks.csv
UniGolgi_20_flanklength_flankclashTrue.csv
UniHuman_5_flanklength_flankclashFalse_only_full_flanks.csv
UniHuman_5_flanklength_flankclashFalse_only_half_flanks.csv
UniHuman_5_flanklength_flankclashFalse.csv
UniHuman_5_flanklength_flankclashTrue_only_full_flanks.csv
UniHuman_5_flanklength_flankclashTrue_only_half_flanks.csv
UniHuman_5_flanklength_flankclashTrue.csv
UniHuman_10_flanklength_flankclashFalse_only_full_flanks.csv
UniHuman_10_flanklength_flankclashFalse_only_half_flanks.csv
UniHuman_10_flanklength_flankclashFalse.csv
UniHuman_10_flanklength_flankclashTrue_only_full_flanks.csv
UniHuman_10_flanklength_flankclashTrue_only_half_flanks.csv
UniHuman_10_flanklength_flankclashTrue.csv
UniHuman_20_flanklength_flankclashFalse_only_full_flanks.csv
UniHuman_20_flanklength_flankclashFalse_only_half_flanks.csv
UniHuman_20_flanklength_flankclashFalse.csv
UniHuman_20_flanklength_flankclashTrue_only_full_flanks.csv
UniHuman_20_flanklength_flankclashTrue_only_half_flanks.csv
UniHuman_20_flanklength_flankclashTrue.csv
UniPM_5_flanklength_flankclashFalse_only_full_flanks.csv
UniPM_5_flanklength_flankclashFalse_only_half_flanks.csv
UniPM_5_flanklength_flankclashFalse.csv
UniPM_5_flanklength_flankclashTrue_only_full_flanks.csv
UniPM_5_flanklength_flankclashTrue_only_half_flanks.csv
UniPM_5_flanklength_flankclashTrue.csv
UniPM_10_flanklength_flankclashFalse_only_full_flanks.csv
UniPM_10_flanklength_flankclashFalse_only_half_flanks.csv
UniPM_10_flanklength_flankclashFalse.csv
UniPM_10_flanklength_flankclashTrue_only_full_flanks.csv
UniPM_10_flanklength_flankclashTrue_only_half_flanks.csv
UniPM_10_flanklength_flankclashTrue.csv
UniPM_20_flanklength_flankclashFalse_only_full_flanks.csv
UniPM_20_flanklength_flankclashFalse_only_half_flanks.csv
UniPM_20_flanklength_flankclashFalse.csv
UniPM_20_flanklength_flankclashTrue_only_full_flanks.csv
UniPM_20_flanklength_flankclashTrue_only_half_flanks.csv
UniPM_20_flanklength_flankclashTrue.csv
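The naming convention used by the files listed above can be decoded programmatically. The following is a minimal sketch, not part of the provided scripts: the names DatasetName and parse_dataset_name are hypothetical, and the pattern is inferred only from the file names in this list.

```python
import re
from typing import NamedTuple, Optional


class DatasetName(NamedTuple):
    database: str                # original database file, e.g. "TopDB" or "UniHuman"
    flank_length: int            # maximum allowed flank length (5, 10 or 20 in this list)
    flankclash: bool             # True: flanking regions of TMHs may not overlap
    flank_filter: Optional[str]  # "full", "half", or None if there is no extra condition


# Pattern inferred from the listed file names (an assumption, not taken from the scripts).
_PATTERN = re.compile(
    r"^(?P<database>[A-Za-z]+)"
    r"_(?P<flank_length>\d+)_flanklength"
    r"_flankclash(?P<flankclash>True|False)"
    r"(?:_only_(?P<flank_filter>full|half)_flanks)?"
    r"\.csv$"
)


def parse_dataset_name(filename: str) -> DatasetName:
    """Split a dataset file name into its processing settings."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognised dataset file name: {filename!r}")
    return DatasetName(
        database=match["database"],
        flank_length=int(match["flank_length"]),
        flankclash=match["flankclash"] == "True",
        flank_filter=match["flank_filter"],  # None when the suffix is absent
    )
```

For example, parse_dataset_name("UniER_20_flanklength_flankclashTrue.csv") gives the database "UniER", a flank length of 20, Flankclash set to true, and no additional flank condition.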
The scripts can be used to mine UniProt files and the FASTA TopDB file into csv tables that hold easier-to-handle information about each transmembrane domain and its neighbouring residue sequences.
Additional scripts that were used to analyse the data are also included. However, these are provided as is and may not work out of the box, since they rely on additional modules.