The goal of this document is to get you up and running with BaMORC as quickly as possible. BaMORC stands for the Bayesian Model Optimized Reference Correction (BaMORC) Method for Assigned and Unassigned Protein NMR Spectra. For a detailed explanation of the algorithm, please refer to “Automatic 13C chemical shift reference correction for unassigned protein NMR spectra”.
There are two important parts to BaMORC: the bamorc()
reference correction function for assigned 13C protein NMR spectra, and the unassigned_bamorc()
reference correction function for unassigned 13C protein NMR spectra. In the first section, you’ll learn about the basics of running both of these functions. In the second section, you will dive into the basics of input data process, and in third section, you’ll learn a little more about the functions used behind-the-scenes by the BaMORC algorithm.
To make a correction, first load BaMORC, then call bamorc()
or unassigned_bamorc()
with the correct arguments:
Here we will using the built-in data to demonstrate the arguments that will be passed into bamorc()
.
## Arguments:
sequence = paste(RefDB_data$carbonDat[[1]]$AA, collapse = "")
secondary_structure = paste(RefDB_data$carbonDat[[1]]$SS, collapse = "")
chemical_shifts_input = RefDB_data$carbonDat[[1]][, c(4, 5)]
from = -5
to = 5
## Running bamorc() function:
bamorc(sequence = sequence, secondary_structure = secondary_structure,
chemical_shifts_input = chemical_shifts_input, from = -5,
to = 5)
#> [1] 0.0142443
sequence
: secondary_structure
: chemical_shifts_input
: <data frame n by 2> the carbon 13 chemical shift of the protein.from
and to
: from
must be lower than to
.The length of the sequence
and secondary_structure
should be the same, if not please assign secondary_structure = NULL
, however, the peaklist groups number could be more or less than the length of the sequence. Often peak lists will have fewer CA/CB pairs than the length of the protein sequence.
Printing an argument object gives you some useful information: the actual format, the size, the object type, and if it’s a string.
Next, we will use the built-in data to demonstrate the arguments that will be passed in the unassigned_bamorc()
. The output will be slightly different each time it runs, due to the randomness of the optimization.
## Arguments:enerate a temperary sample NMR spectra file and
## later will be removed.
sequence = "RPAFCLEPPYAGPGKARIIRYFYNAAAGAAQAFVYGGVRAKRNNFASAADALAACAAA"
sample_data_generator(input_type = "ssc_sample")
file_path = "./bpti_HNcoCACB.txt" # temperary sample file path.
## Running unassigned_bamorc() function: unassigned_bamorc
## will take a while to run
unassigned_bamorc(peakList_file_loc = file_path, sequence = sequence,
secondary_structure = NULL, from = -5, to = 5)
## Delete the temperary sample file.
unlink("./bpti_HNcoCACB.txt")
The data passed in both above functions should be pre-processed. Luckly, we provide a variety of helper functions. See the following on how to process the data.
There are three file parsing functions within the BaMORC package: read_raw_file()
, read_nmrstar_file()
, and read_db_file()
. These functions handle a wide range of the input files. You can check the sample file using the sample_data_generator()
for a recommended format:
Delimiter of white space:
## Arguments:enerate a temperary sample NMR spectra file and
## later will be removed.
input_type = "ws"
sample_data_generator(input_type = input_type)
## Running reading function
head(read_raw_file(file_path = "sample_input_ws.txt", delim = "ws"))
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> X2 = col_double()
#> )
#> [[1]]
#> [1] "NHQDHNNFQTLPYVPCSTCEGNLACLSLCHIE"
#>
#> [[2]]
#> # A tibble: 32 x 2
#> X1 X2
#> <dbl> <dbl>
#> 1 NA 38.6
#> 2 NA 28.8
#> 3 56.3 30.4
#> 4 54.4 41.3
#> 5 NA 28.8
#> 6 55.1 38.9
#> 7 NA 38.8
#> 8 NA 39.3
#> 9 NA 30.3
#> 10 62.1 69.8
#> # ... with 22 more rows
unlink("sample_input_ws.txt")
Delimiter of comma:
## Arguments:enerate a temperary sample NMR spectra file and
## later will be removed.
input_type = "csv"
sample_data_generator(input_type = input_type)
## Running reading function
head(read_raw_file(file_path = "sample_input.csv", delim = "comma"))
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> X2 = col_double()
#> )
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 2)
#> Warning: 1 parsing failure.
#> row # A tibble: 1 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 33 <NA> 2 columns 1 columns 'sample_input.csv' file # A tibble: 1 x 5
#> [[1]]
#> [1] "NHQDHNNFQTLPYVPCSTCEGNLACLSLCHIE"
#>
#> [[2]]
#> # A tibble: 33 x 2
#> X1 X2
#> <dbl> <dbl>
#> 1 NA 38.6
#> 2 NA 28.8
#> 3 56.3 30.4
#> 4 54.4 41.3
#> 5 NA 28.8
#> 6 55.1 38.9
#> 7 NA 38.8
#> 8 NA 39.3
#> 9 NA 30.3
#> 10 62.1 69.8
#> # ... with 23 more rows
unlink("sample_input.csv")
Delimiter of semicolon:
## Arguments:enerate a temperary sample NMR spectra file and
## later will be removed.
input_type = "sc"
sample_data_generator(input_type = input_type)
## Running reading function
head(read_raw_file(file_path = "sample_input_sc.txt", delim = "semicolon"))
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> X2 = col_double()
#> )
#> [[1]]
#> [1] "NHQDHNNFQTLPYVPCSTCEGNLACLSLCHIE"
#>
#> [[2]]
#> # A tibble: 32 x 2
#> X1 X2
#> <dbl> <dbl>
#> 1 NA 38.6
#> 2 NA 28.8
#> 3 56.3 30.4
#> 4 54.4 41.3
#> 5 NA 28.8
#> 6 55.1 38.9
#> 7 NA 38.8
#> 8 NA 39.3
#> 9 NA 30.3
#> 10 62.1 69.8
#> # ... with 22 more rows
unlink("sample_input_sc.txt")
read_raw_file()
: parses user-provided file in customed format as show above examples.
read_nmrstar_file()
: parses user-provided file in BMRB Star 2/3 format.
## Download a BMRB file
library(BMRBr)
bmrb_download(4020, output_dir = "./")
## Read in BMRB file and procec
file_path = "bmr4020.str"
head(read_nmrstar_file(file_path = file_path))
## Delete downloaded BMRB file
unlink("./bmr4020.str")
read_db_file()
: parses file from BMRB by a given entry ID.Secondary structure information predicted based on the protein sequence is required for the optimization. This is predicted through the JPred and jpredapi.
The BaMORC algorithm minimizes the difference between the actual relative cummulative frequency (RCF) of the protein sequence and the estimated RCF predicted from the chemical shifts information. The statistical functions used by the algorithm are shown below with examples. For detailed function descriptions, please see the reference:
calculate_aa_prob()
: returns the probability (density) for a certain type of amino acid based on a chi-squared statistics wtih 2 degrees of freedom.calculate_aa_prob(chi_squared_stat = c(0.05, 0.1, 0.5), df = 2)
#> [1] 0.4876550 0.4756147 0.3894004
calculate_chi_squared_stat()
: given a pair of alpha and beta carbons chemical shifts, this function will return a list of calculated chi-squared statistics based on the combination of amino acid typings and secondary structures. Here we illustrate with a pair of chemical shifts from alpha and beta carbon.calculate_chi_squared_stat(cacb_pair = c(54, 45))
#> AA H B C
#> 1: A 1023.16868895477 169.099242775473 469.048336795045
#> 2: B 180.178895468627 54.4937018085712 74.2438106347521
#> 3: C 24.0426461856623 6.02618817827703 23.2477754200126
#> 4: D 14.4848786940741 2.79850128029531 10.2715883462805
#> 5: E 13.3897615181103 1.88818580572447 4.60208889084118
#> 6: F 165.514869003217 77.9438324817418 62.5815610717518
#> 7: H 321.477487110107 53.4257171238817 91.6072511229365
#> 8: I 253.099344063356 44.4933903131377 94.8605497904786
#> 9: K 128.989994525817 35.9002492383691 39.0974374716513
#> 10: L 100.123349476134 19.1632752720135 22.6939781262421
#> 11: M 14.8396700069463 0.383003397426531 2.60892370474794
#> 12: N 222.558693666307 34.5192736984713 53.7891077582836
#> 13: P 89.4591494720313 19.26900894692 26.5279676961367
#> 14: Q 40.3446974047416 5.81357432329253 11.1635747587969
#> 15: R 267.220874922648 179.412483703255 236.676686103036
#> 16: S 203.258956942293 177.779740165393 190.175848300641
#> 17: T 451.922344250779 363.817995197962 408.684228724244
#> 18: V 58.3956321465275 6.77558973670799 13.8132678105348
#> 19: W 127.135809610673 63.1703556188059 77.6629767003917
#> 20: Y 358.237467618581 47.5540232633832 80.5896644421314
calculate_rcf()
: calculates the relative cummulative freqeucny of amino acid and secondary structure combination.## Arguments:
sequence = paste(RefDB_data$carbonDat[[1]]$AA, collapse = "")
secondary_structure = paste(RefDB_data$carbonDat[[1]]$SS, collapse = "")
## Function:
calculate_rcf(sequence = sequence, secondary_structure = secondary_structure)
#> [[1]]
#> [1] "S-C" "I-C" "P-C" "C-B" "L-B" "L-B" "S-B" "P-C" "W-B" "S-B" "E-B"
#> [12] "W-B" "S-B" "D-C" "C-B" "S-B" "V-B" "T-C" "C-C" "K-B" "M-B" "R-B"
#> [23] "T-B" "R-B" "Q-B" "R-B" "M-B" "L-B" "K-B" "S-C" "L-C" "A-C" "E-C"
#> [34] "L-C" "D-C" "C-C" "N-C" "E-C" "D-C" "L-B" "E-B" "Q-B" "A-B" "E-B"
#> [45] "K-B" "C-B" "M-B" "L-B" "P-B" "E-B" "C-C" "P-C"
#>
#> [[2]]
#> AA_SS Freq
#> 1 A-B 0.01923077
#> 2 A-C 0.01923077
#> 3 C-B 0.05769231
#> 4 C-C 0.05769231
#> 5 D-C 0.05769231
#> 6 E-B 0.07692308
#> 7 E-C 0.03846154
#> 8 I-C 0.01923077
#> 9 K-B 0.05769231
#> 10 L-B 0.09615385
#> 11 L-C 0.03846154
#> 12 M-B 0.05769231
#> 13 N-C 0.01923077
#> 14 P-B 0.01923077
#> 15 P-C 0.05769231
#> 16 Q-B 0.03846154
#> 17 R-B 0.05769231
#> 18 S-B 0.07692308
#> 19 S-C 0.03846154
#> 20 T-B 0.01923077
#> 21 T-C 0.01923077
#> 22 V-B 0.01923077
#> 23 W-B 0.03846154
calculate_mse()
: calculates mean squared error for each correction value (step).