Configuring Kojak
Due to the large number of parameters used in any Kojak analysis, all parameters are set in a configuration file rather than typed out on the command line. A sample configuration file is provided with Kojak, and can also be copied from this example.
Parameters can be changed with any ASCII text editor, such as vim in Linux or notepad in Windows.
For ease of reading, parameter names use English words instead of single letters. A note is any text that appears after the # symbol on a line. Notes are not processed as parameters and are there to provide helpful information to users. Feel free to add notes as needed for your analysis.
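To illustrate the name = value # note layout described above, here is a minimal Python sketch that reads such text into parameter pairs. It is not Kojak's own parser; the function name parse_config and the sample text are hypothetical, and the exact whitespace rules Kojak applies are assumed.

```python
def parse_config(text):
    """Return (name, value) pairs from 'name = value # note' lines."""
    params = []
    for line in text.splitlines():
        # Everything after the first '#' is a note and is ignored.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params.append((name.strip(), value.strip()))
    return params


sample = """\
# settings for my first analysis
database = polymerase_proteins.fasta
cross_link = 1 1 138.068742 #normal BS3
cross_link = 1 1 150.143404 #heavy isotope-labeled BS3
"""

for name, value in parse_config(sample):
    print(name, "->", value)
```

The sketch returns a list of pairs rather than a dictionary because some parameters, such as cross_link, may appear on more than one line.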
Helpful Hint
Make a different configuration file for each analysis you wish to perform. Simply copy an existing configuration file, then give it a unique name and adjust the parameters. Thus, you preserve the unique parameter settings for each of your analyses.
For clarity, the parameters have been divided into six groups by similarity. This page will describe the most important parameters within those groups. A complete list of all parameters and their descriptions can be found on the Parameters page.
1. Data I/O
The Data I/O parameters describe files that will be imported into and exported from Kojak during analysis. There are two key input files, your MS data file and a protein sequence database:
database = polymerase_proteins.fasta
MS_data_file = my_data.mzML
There are two points worth mentioning in the above example. First, the protein sequence database is an ASCII text file in the FASTA format. It should contain both target proteins (sequences you hope to find) and decoy proteins (sequences you know are false). Second, the mass spectrometry data file must be in a supported format. On all systems, those formats are mzML (recommended) and mzXML. Additionally, Thermo RAW files may be used on Windows systems if the appropriate drivers have been installed.
Helpful Hint
FASTA sequence database files should contain both target and decoy sequences if you wish to use Percolator for results validation. Below is an example of what a target protein sequence and its matching decoy sequence might look like:
>Protein_1
MTHISISMYPRTEINSEKWENCE
>Decoy_1_reverse
ECNEWKESNIETRPYMSISIHTM
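If your database has no decoys yet, reversed sequences like the one above can be generated with a short script. The following Python sketch is one possible approach, not a tool shipped with Kojak; the function name add_reversed_decoys and the Decoy_ label prefix are hypothetical choices.

```python
def add_reversed_decoys(fasta_text, decoy_prefix="Decoy_"):
    """Return FASTA text with a reversed decoy after every target entry."""
    entries = []
    label, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if label is not None:
                entries.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequences may span several lines
    if label is not None:
        entries.append((label, "".join(chunks)))

    out = []
    for label, seq in entries:
        out.append(">" + label)
        out.append(seq)
        out.append(">" + decoy_prefix + label + "_reverse")
        out.append(seq[::-1])  # decoy is the target read backwards
    return "\n".join(out) + "\n"


print(add_reversed_decoys(">Protein_1\nMTHISISMYPRTEINSEKWENCE\n"))
```

Whatever prefix you choose, it must not occur in any target label, and it has to agree exactly (including case) with the decoy word set later in the configuration file.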
Helpful Hint
Kojak looks for files in the current working directory, i.e. the directory you are in. You can provide relative paths to parent- and sub-directories if desired. To ensure input and output are directed from and to a specific location no matter which directory you are in, provide an absolute path with your files:
database = C:\databases\e_coli.fasta
The remaining parameters in this group define exported content after Kojak analysis is complete. All output is in tab-delimited ASCII text files. The standard output file contains a complete summary of the results in a single file:
output_file = my_results.txt
The Percolator parameters return the same results, but formatted such that they can be directly imported to Percolator for validation of the peptide spectrum matches (PSMs):
percolator_file = my_results_perc_format
percolator_version = 2.07
For the percolator_file parameter, no extension is necessary. The parameter requires only a base name because Kojak will automatically divide the Percolator-formatted results into multiple files according to PSM type (e.g. intra-protein cross-links) and append a txt extension. The percolator_version number should match the version of Percolator you will be using, as input formats differ between Percolator versions. Kojak will automatically format the results to match the version you specify.
2. Data Descriptors
The Data Descriptor parameters describe the spectra to be analyzed. Accurate description of the spectra ensures the correct spectrum processing is performed during the Kojak analysis.
During Kojak analysis, all profile MS peaks must be converted to centroid MS peaks. Kojak has built-in functionality to perform this operation, but requires additional information about the instrument used and the resolution settings. Correctly setting these values will produce the most accurate centroid peak conversion. First, identify whether or not the spectra are centroided at the MS and MS/MS levels:
MS1_centroid = 0
MS2_centroid = 1
In the above example, 0 indicates profile peaks and 1 indicates centroid peaks. The parameters thus tell Kojak that the MS spectra are profile, and the MS/MS spectra are centroid. As a result, only the MS peaks will be converted to centroid values during the analysis. However, Kojak will need additional information about the peaks, such as what instrument they were acquired on, and at what resolution:
instrument = 0
MS1_resolution = 55000
Here, the 0 for instrument indicates an orbitrap mass analyzer. The resolution is approximated at 400 m/z, which may differ from the m/z value at which the manufacturer reports resolution. Make sure you report the resolution at 400 m/z. MS2_resolution is not used in this example, because we set MS2_centroid = 1, and thus any value for that parameter is ignored during analysis.
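If your instrument software reports resolution at a different m/z (200 m/z is common on some orbitrap instruments), the number needs converting before it is entered as MS1_resolution. The Python sketch below assumes the widely cited rule that orbitrap resolution falls with the square root of m/z. That scaling rule and the function name resolution_at_400 are assumptions for illustration and are not taken from the Kojak documentation, so check them against your instrument's documentation.

```python
import math


def resolution_at_400(reported_resolution, reported_mz):
    """Estimate orbitrap resolution at 400 m/z.

    Assumes resolution is proportional to 1 / sqrt(m/z), which is an
    approximation for orbitrap analyzers only.
    """
    return reported_resolution * math.sqrt(reported_mz / 400.0)


# A setting quoted as 70000 at 200 m/z, expressed at 400 m/z.
print(round(resolution_at_400(70000, 200)))
```

Other analyzer types scale differently with m/z, so this conversion should not be reused for them.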
Helpful Hint
Don't want to worry about setting the correct parameters to convert profile peaks to centroid? Simply give Kojak a data file where all the peaks are already centroided. With the right tool, such as msconvert from ProteoWizard, you can do this when converting your raw data to an mzML file.
The enrichment parameter is always set to 0, unless you performed peptide C-terminus labeling during the enzymatic digestion step of your sample prep. Details of how and when to use this parameter are found on the Parameters page.
3. Cross-links
Although there are only two parameters in this category, they have a special syntax that allows for analysis of data cross-linked with a large, diverse set of chemicals.
Chemical cross-linkers are assumed to be bifunctional. This means that they have two reactive groups that may bind to the same (homobifunctional) or different (heterobifunctional) moieties. Cross-linkers also have different spacer arm compositions that influence their mass. As an example, here is the parameter you would set if you used either DSS (or BS3) as your cross-linker:
cross_link = 1 1 138.068742
There are three values for the parameter. The first two values indicate the reaction targets on both ends of the cross-linker. DSS is homobifunctional and specifically targets primary amines (lysine residues or the N-terminus). Thus, the first two values are both 1, which is the chemical code for primary amines. The third value is the amount of additional mass the spacer arm adds to the combined mass of the two cross-linked peptides.
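The role of the third value can be shown with a little arithmetic: the mass of a cross-linked pair is the sum of the two peptide masses plus the spacer arm mass. The Python sketch below uses hypothetical helper names and made-up peptide masses, and it keeps the reaction target codes as plain text rather than interpreting them.

```python
def parse_cross_link(value):
    """Split a cross_link value into (target_a, target_b, linker_mass)."""
    target_a, target_b, mass = value.split()
    return target_a, target_b, float(mass)


def cross_linked_mass(peptide_a_mass, peptide_b_mass, linker_mass):
    """Neutral mass of two peptides joined by one cross-linker."""
    return peptide_a_mass + peptide_b_mass + linker_mass


target_a, target_b, dss_mass = parse_cross_link("1 1 138.068742")

# Two hypothetical peptide masses in Daltons, for illustration only.
print(cross_linked_mass(1000.0, 1500.0, dss_mass))
```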
It is also common for cross-linkers to produce partial linkages. These occur when one side of the cross-linker binds to a protein, but the other side becomes inert through hydrolysis. The resulting peptide remains unlinked, but with a differential modification equal to the hydrolyzed cross-linker at the target moiety:
mono_link = 1 156.0786
Similar to the cross_link parameter, the mono_link parameter requires a reaction target, albeit only one. The differential mass is different as well, to reflect the different atomic composition of the incompletely linked spacer arm.
Often, you will use both the cross_link and mono_link parameters in the same configuration file. However, some cross-linkers may not produce partial links, and thus the mono_link lines can be deleted or commented out using the note # character. Also, as is the case for zero-length cross-linkers, the differential mass may be negative if there is no corresponding spacer arm and instead the target residues are linked directly in a condensation reaction that releases water.
It is also possible to add multiple cross_link and mono_link parameter lines to a configuration file if you used more than one cross-linker in your sample preparation, or there is more than one outcome for the reactions. For example, if you used a light-heavy labeled combination of BS3, your parameters may look like this:
cross_link = 1 1 138.068742 #normal BS3
cross_link = 1 1 150.143404 #heavy isotope-labeled BS3
mono_link = 1 156.0786 #normal
mono_link = 1 168.15393 #heavy
Or perhaps you used two cross-linkers of different chemistries. Your parameters may look like this:
cross_link = 1 1 138.068742 #DSS
cross_link = 1 2 -18.010595 #EDC
mono_link = 1 156.0786 #DSS half-reaction
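A quick consistency check for amine-reactive linkers such as DSS and BS3: the hydrolyzed (mono-link) form carries one extra water compared with the bridging form, so the mono_link mass should be close to the cross_link mass plus about 18.0106 Da. The Python sketch below applies that check to the light and heavy BS3 values shown earlier. The helper name and the 0.001 Da tolerance are hypothetical, and the relationship is not assumed to hold for every linker chemistry.

```python
WATER = 18.010565  # monoisotopic mass of H2O in Daltons


def expected_mono_link(cross_link_mass):
    """Mono-link mass expected when the free end is hydrolyzed."""
    return cross_link_mass + WATER


# (cross_link mass, mono_link mass) pairs taken from the examples above.
pairs = [(138.068742, 156.0786), (150.143404, 168.15393)]

for cross, mono in pairs:
    difference = mono - expected_mono_link(cross)
    status = "ok" if abs(difference) < 0.001 else "check values"
    print(f"{cross:.6f} -> {mono:.5f}: {status}")
```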
4. Modifications
The Modifications parameters refer to amino acid modifications that are independent of the cross-linking reactions. These occur naturally (e.g. post-translational modifications) or as a result of sample preparation (e.g. carbamidomethylcysteine):
fixed_modification = C 57.02146
modification = M 15.9949
max_mods_per_peptide = 2
The fixed_modification and modification parameters each require two values: a target and a mass. The fixed_modification parameter specifies a static mass that is always added to the normal mass of the indicated amino acid. The modification parameter specifies a differential mass that may or may not be present on that amino acid. Because each differential modification can increase search time exponentially, you can use max_mods_per_peptide to limit the number of differential modifications per peptide to a practical number. You can also specify multiple fixed_modification and modification parameter lines to indicate the different mass modifications you expect to find in your analysis.
Helpful Hint
You can specify multiple differential modifications for the same amino acid. Place each modification on a separate line, but indicate the same amino acid with a different mass.
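To see why a cap on differential modifications matters, consider how many distinct modified forms a single peptide can take. The Python sketch below counts them for one modification type; it is a simplified model written for illustration (the function name is hypothetical) and says nothing about how Kojak enumerates candidates internally.

```python
from math import comb  # available in Python 3.8 and later


def count_modified_forms(n_sites, max_mods):
    """Number of ways to modify at most max_mods of n_sites residues.

    The unmodified peptide is included in the count.
    """
    limit = min(n_sites, max_mods)
    return sum(comb(n_sites, k) for k in range(limit + 1))


# A hypothetical peptide with six methionines and M 15.9949 as the
# only differential modification.
for limit in (0, 1, 2, 3, 6):
    print(limit, count_modified_forms(6, limit))
```

With no cap, every site doubles the number of forms; a small cap keeps the count manageable while still covering the forms most likely to occur.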
5. Scoring Algorithm
The Scoring Algorithm parameters describe the tolerances applied to the scoring algorithm for analyzing MS/MS spectra of different resolutions and mass accuracy. High-resolution MS/MS spectra call for a narrow mass tolerance and no mass offset. Below is an example of values appropriate for high-resolution (e.g. orbitrap) MS/MS spectra:
fragment_bin_size = 0.03
fragment_bin_offset = 0.0
Low-resolution (e.g. ion trap) MS/MS spectra need a much wider mass tolerance, plus an offset so that peak values are most likely to fall near the center of the mass tolerance. For low-resolution MS/MS spectra, the following parameter values are recommended:
fragment_bin_size = 1.0005
fragment_bin_offset = 0.4
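The effect of the bin size and offset can be sketched with a binning formula commonly used by database search engines, in which a peak's bin number is its m/z divided by the bin size, shifted by one minus the offset. Whether Kojak uses exactly this formula is an assumption; the Python function below is only meant to show how the two settings interact.

```python
def bin_index(mz, bin_size, bin_offset):
    """Assign an m/z value to an integer bin.

    Assumed convention: index = floor(mz / bin_size + (1 - bin_offset)).
    """
    return int(mz / bin_size + (1.0 - bin_offset))


# Low-resolution settings: peaks spread around 500.25 share one bin.
for mz in (499.8, 500.25, 500.5):
    print(mz, bin_index(mz, 1.0005, 0.4))

# High-resolution settings: peaks 0.04 apart land in different bins.
for mz in (500.26, 500.30):
    print(mz, bin_index(mz, 0.03, 0.0))
```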
6. Analysis
The Analysis parameters are the largest set of parameters, but are easy to understand and use. Not all parameters are listed in these examples. You can find the complete list on the Parameters page.
enzyme = [KR]|{P}
max_miscleavages = 2
The above two parameters indicate the trypsin enzyme cleavage rule and the number of missed cleavages allowed. The missed cleavages include those caused by cross-linkers binding to the enzyme target residues.
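As an illustration of what these two settings imply, the Python sketch below digests a sequence with the trypsin rule (cleave after K or R, but not before P) and lists the peptides allowed by a given number of missed cleavages. The function name and test sequence are hypothetical, and the regular expression is this sketch's reading of the rule rather than Kojak's implementation.

```python
import re


def digest(sequence, max_miscleavages):
    """Return peptides from a tryptic digest with missed cleavages."""
    # Split after K or R unless the next residue is P.
    # Splitting on an empty match requires Python 3.7 or later.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = []
    for start in range(len(pieces)):
        for missed in range(max_miscleavages + 1):
            end = start + missed + 1
            if end > len(pieces):
                break
            # Joining adjacent pieces models skipped cleavage sites.
            peptides.append("".join(pieces[start:end]))
    return peptides


print(digest("MKRPLKAA", 1))
```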
min_peptide_mass = 500.0
max_peptide_mass = 6000.0
The peptide mass boundaries shown above are important to understand, but once set, you probably will use the same values for all Kojak analyses. These parameters determine the smallest and largest peptides to be searched, by mass in Daltons, not by number of amino acids. Setting a low min_peptide_mass will allow you to search for cross-links to small peptides, for example, peptides with just three amino acids. Although links to small peptides do happen, they are not likely to be seen after validation because there would be very few fragment ion peaks to contribute to the score of the PSM. Thus, a more practical value such as 500.0 Daltons is recommended. For max_peptide_mass, consider that cross-linkers binding to enzymatic cleavage sites will produce a larger number of long peptides. As a result, it might make sense to use a large value, such as 6000.0 Daltons, for a Kojak analysis, although such a number might be considered excessive in a typical shotgun database search.
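To get a feel for what these limits exclude, the Python sketch below computes monoisotopic peptide masses from standard residue masses and tests them against the boundaries. The helper names and example peptides are hypothetical, and the sketch assumes the limits apply to the neutral mass of the unmodified peptide.

```python
# Monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def within_mass_limits(sequence, min_mass=500.0, max_mass=6000.0):
    """True if the peptide mass lies inside the search boundaries."""
    return min_mass <= peptide_mass(sequence) <= max_mass


for peptide in ("GAK", "PEPTIDEK"):
    print(peptide, round(peptide_mass(peptide), 4), within_mass_limits(peptide))
```

The three-residue peptide falls well below 500.0 Daltons, which matches the reasoning above for excluding very small peptides.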
ppm_tolerance_pre = 25.0
prefer_precursor_pred = 1
These two parameters help define the search space for each spectrum. ppm_tolerance_pre defines the tolerance around the precursor mass, in parts per million, that a PSM is allowed to have to be accepted. Setting a value similar to your instrument mass accuracy is recommended, but there are various reasons to use much larger or smaller values that are outside the scope of this manual. Sometimes, such as with Thermo RAW data, a suggested monoisotopic precursor mass is provided for a spectrum. It is often helpful to use this value, which is indicated by setting prefer_precursor_pred to 1. If you set the value to 0, then Kojak will compute new monoisotopic precursor masses for the spectra. Regardless of this parameter setting, if no precursor monoisotopic masses were previously assigned to the spectra, Kronik will compute them.
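Because the tolerance is relative, the accepted window in Daltons grows with precursor mass. The Python sketch below converts a parts-per-million tolerance into a mass window and checks a candidate against it; the function names and masses are hypothetical and the code is not taken from Kojak.

```python
def precursor_window(mass, ppm_tolerance):
    """Return the (low, high) mass limits for a ppm tolerance."""
    delta = mass * ppm_tolerance / 1_000_000.0
    return mass - delta, mass + delta


def within_tolerance(observed, theoretical, ppm_tolerance):
    """True if the mass error, in ppm, is within the tolerance."""
    error_ppm = abs(observed - theoretical) / theoretical * 1_000_000.0
    return error_ppm <= ppm_tolerance


low, high = precursor_window(2000.0, 25.0)
print(f"{low:.4f} to {high:.4f}")
print(within_tolerance(2000.04, 2000.0, 25.0))
```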
spectrum_processing = 1
max_spectrum_peaks = 0
The two parameters above determine if and how MS/MS spectra are processed. spectrum_processing can be turned off by setting the value to 0. As spectrum_processing is only intended for high-resolution MS/MS, it should be turned off for low-resolution MS/MS data. Processing the spectra involves collapsing isotope distributions to the monoisotopic peak to facilitate better scoring algorithm performance. Additionally, you can remove presumed noise peaks by limiting the number of peaks to analyze after collapsing using the max_spectrum_peaks parameter. For example, setting this parameter to 200 will limit the analysis to only the tallest 200 peaks in the spectrum. Setting the value to 0 causes all peaks in the spectrum to be analyzed.
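The peak limit described above amounts to keeping the most intense peaks and discarding the rest. A minimal Python sketch of that idea follows; the function name and the (m/z, intensity) tuple layout are hypothetical, and isotope collapsing is not shown.

```python
def keep_tallest_peaks(peaks, max_spectrum_peaks):
    """Keep the most intense peaks; 0 means keep every peak.

    peaks is a list of (mz, intensity) tuples.
    """
    if max_spectrum_peaks == 0 or len(peaks) <= max_spectrum_peaks:
        return list(peaks)
    tallest = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    # Restore m/z order after selecting by intensity.
    return sorted(tallest[:max_spectrum_peaks])


spectrum = [(100.0, 5.0), (200.0, 50.0), (300.0, 1.0), (400.0, 20.0)]
print(keep_tallest_peaks(spectrum, 2))
```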
decoy_filter = DECOY
This last parameter is critical for downstream validation using Percolator. It defines a case-sensitive word that is found only in the labels of your decoy database sequences. The decoy word can be any that you prefer, but make sure it does not appear in the labels of your target sequences, or they will incorrectly be assigned as decoys.
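Case sensitivity is an easy detail to miss here. The Python sketch below treats the decoy word as a case-sensitive substring of the sequence label, which is this sketch's assumption about the matching; the function name and labels are hypothetical.

```python
def is_decoy(label, decoy_filter="DECOY"):
    """True if the case-sensitive decoy word occurs in the label."""
    return decoy_filter in label


labels = ["Protein_1", "DECOY_Protein_1", "Decoy_1_reverse"]
for label in labels:
    print(label, is_decoy(label))
```

Under this assumption, a label such as Decoy_1_reverse would not be recognized when decoy_filter = DECOY, so the word in the configuration file has to match the capitalization used in your FASTA labels.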