Kojak Configuration

Configuring Kojak

Due to the large number of parameters used in any Kojak analysis, all parameters are set in a configuration file rather than typed out on the command line. A sample configuration file is provided with Kojak, and can also be copied from this example.

Parameters can be changed in any ASCII text editor, such as vim on Linux or Notepad on Windows. For ease of reading, parameter names use English words instead of single letters. Notes are any text that appears after the # symbol on a line. Notes are not processed as parameters and are there to provide helpful information to users. Feel free to add notes as needed for your analysis.
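The file layout described above (name = value lines, with notes after #) is simple enough to read with a few lines of code. The following is a minimal Python sketch, not Kojak's own parser. The function name parse_kojak_config is hypothetical, and the sketch assumes one parameter per line and that a parameter name may appear on more than one line.

```python
def parse_kojak_config(text):
    """Return a dict mapping each parameter name to a list of its values."""
    params = {}
    for line in text.splitlines():
        # Everything after '#' is a note and is ignored.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # A parameter may be given on several lines, so collect values in a list.
        params.setdefault(name.strip(), []).append(value.strip())
    return params


example = """
# Data I/O
database = polymerase_proteins.fasta
MS_data_file = my_data.mzML   # recommended format
"""
config = parse_kojak_config(example)
print(config["MS_data_file"])
```

Storing every value in a list keeps the sketch uniform for parameters that occur once and for those that are repeated.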

Helpful Hint

Make a different configuration file for each analysis you wish to perform. Simply copy an existing configuration file, then give it a unique name and adjust the parameters. Thus, you preserve the unique parameter settings for each of your analyses.
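If you run many analyses, the copy-and-adjust step can be scripted. The sketch below is one possible approach in Python, assuming each parameter to be changed appears on a single line. The function name derive_config and the file names are illustrative and are not part of Kojak.

```python
import re


def derive_config(template_text, overrides):
    """Return a copy of template_text with the listed parameter values replaced."""
    lines = []
    for line in template_text.splitlines():
        match = re.match(r"\s*([A-Za-z0-9_]+)\s*=", line)
        if match and match.group(1) in overrides:
            name = match.group(1)
            # Keep any trailing note so the new file stays documented.
            note = ""
            if "#" in line:
                note = "  #" + line.split("#", 1)[1]
            line = f"{name} = {overrides[name]}{note}"
        lines.append(line)
    return "\n".join(lines) + "\n"


template = "MS_data_file = my_data.mzML\noutput_file = my_results.txt  #summary\n"
new_text = derive_config(
    template,
    {"MS_data_file": "run2.mzML", "output_file": "run2_results.txt"},
)
print(new_text)
```

Writing new_text to a uniquely named file then gives you one configuration file per analysis, with all other settings unchanged.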

For clarity, the parameters have been divided into six groups by similarity. This page will describe the most important parameters within those groups. A complete list of all parameters and their descriptions can be found on the Parameters page.

1. Data I/O

The Data I/O parameters describe files that will be imported and exported from Kojak during analysis. There are two key files for input: your MS data file and a protein sequence database:

database = polymerase_proteins.fasta
MS_data_file = my_data.mzML

There are two points worth mentioning in the above example. First, the protein sequence database is an ASCII text file in FASTA format. It should contain both target proteins (sequences you hope to find) and decoy proteins (sequences you know are false). Second, the mass spectrometry data file must be in a supported format. On all systems, those formats are mzML (recommended) and mzXML. Additionally, Thermo RAW files may be used on Windows systems if the appropriate drivers have been installed.

Helpful Hint

FASTA sequence database files should contain both target and decoy sequences if you wish to use Percolator for results validation. Below is an example of what a target protein sequence and its matching decoy sequence might look like:

>Protein_1
MTHISISMYPRTEINSEKWENCE
>Decoy_1_reverse
ECNEWKESNIETRPYMSISIHTM
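Reversed decoys like the one above can be generated from a target-only FASTA file. The following Python sketch shows one way to do it. The function name add_reversed_decoys and the DECOY_ prefix are assumptions chosen for illustration, and other decoy strategies exist.

```python
def add_reversed_decoys(fasta_text, decoy_prefix="DECOY_"):
    """Return FASTA text in which each target is followed by its reversed decoy."""
    # Collect (label, sequence) pairs; sequences may span several lines.
    entries = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            entries.append([line[1:], ""])
        elif line and entries:
            entries[-1][1] += line

    out = []
    for label, sequence in entries:
        out.append(">" + label)
        out.append(sequence)
        out.append(">" + decoy_prefix + label)
        out.append(sequence[::-1])  # reversed amino acid sequence
    return "\n".join(out) + "\n"


targets = ">Protein_1\nMTHISISMYPRTEINSEKWENCE\n"
print(add_reversed_decoys(targets))
```

Whatever prefix you choose, it should be a word that never occurs in the labels of your target sequences.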

Helpful Hint

Kojak looks for files in the current working directory, i.e. the directory you are in. You can provide relative paths to parent- and sub-directories if desired. To ensure that input is read from and output is written to a specific location no matter which directory you are in, provide an absolute path with your files:

database = C:\databases\e_coli.fasta

The remaining parameters in this group define exported content after Kojak analysis is complete. All output is in tab-delimited ASCII text files. The standard output file contains a complete summary of the results in a single file:

output_file = my_results.txt

The Percolator parameters return the same results, but formatted such that they can be directly imported to Percolator for validation of the peptide spectrum matches (PSMs):

percolator_file = my_results_perc_format
percolator_version = 2.07

For the percolator_file parameter, no extension is necessary. The parameter requires only a base name because Kojak will automatically divide the Percolator-formatted results into multiple files according to PSM type (e.g. intra-protein cross-links) and append a txt extension. The percolator_version number should match the version of Percolator you will be using, as input formats differ between Percolator versions. Kojak will automatically format the results to match the version you specify.

2. Data Descriptors

The Data Descriptor parameters describe the spectra to be analyzed. Accurate description of the spectra ensures the correct spectrum processing is performed during the Kojak analysis.

During Kojak analysis, all profile MS peaks must be converted to centroid MS peaks. Kojak has built-in functionality to perform this operation, but requires additional information about the instrument used and the resolution settings. Correctly setting these values will produce the most accurate centroid peak conversion. First, identify whether or not the spectra are centroided at the MS and MS/MS levels:

MS1_centroid = 0
MS2_centroid = 1

In the above example, 0 indicates profile peaks and 1 indicates centroid peaks. The parameters thus tell Kojak that the MS spectra are profile, and the MS/MS spectra are centroid. As a result, only the MS peaks will be converted to centroid values during the analysis. However, Kojak will need additional information about the peaks, such as what instrument they were acquired on, and at what resolution:

instrument = 0
MS1_resolution = 55000

Here, the 0 for instrument indicates an Orbitrap mass analyzer. The resolution value is the approximate resolution at 400 m/z, which may differ from the m/z value at which the manufacturer reports resolution. Make sure you report the resolution at 400 m/z. MS2_resolution is not used in this example, because we set MS2_centroid = 1, and thus any value for that parameter is ignored during analysis.
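If your instrument software reports resolution at a different m/z, the value can be rescaled before entering it. The Python sketch below assumes an Orbitrap analyzer, whose resolving power is commonly described as inversely proportional to the square root of m/z. Other analyzers scale differently, and the function name and example numbers are illustrative only.

```python
import math


def orbitrap_resolution_at_400(resolution, reported_mz):
    """Rescale an Orbitrap resolution reported at reported_mz to 400 m/z.

    Assumes resolution is proportional to 1 / sqrt(m/z).
    """
    return resolution * math.sqrt(reported_mz / 400.0)


# Hypothetical instrument setting: 70,000 resolution reported at 200 m/z.
print(round(orbitrap_resolution_at_400(70000, 200.0)))
```

Under that assumption, a nominal 70,000 setting at 200 m/z corresponds to a value a little under 50,000 for MS1_resolution.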

Helpful Hint

Don't want to worry about setting the correct parameters to convert profile peaks to centroid? Simply give Kojak a data file where all the peaks are already centroided. With the right tool, such as msconvert from ProteoWizard, you can do this when converting your raw data to an mzML file.

The enrichment parameter is always set to 0, unless you performed peptide C-terminus labeling during the enzymatic digestion step of your sample prep. Details of how and when to use this parameter are found on the Parameters page.

3. Cross-links

Although there are only two parameters in this category, they have a special syntax that allows for analysis of data cross-linked with a large, diverse set of chemicals.

Chemical cross-linkers are assumed to be bifunctional. This means that they have two reactive groups that may bind to the same (homobifunctional) or different (heterobifunctional) moieties. Cross-linkers also have different spacer arm compositions that influence their mass. As an example, here is the parameter you would set if you used either DSS (or BS3) as your cross-linker:

cross_link = 1 1 138.068742

There are three values for the parameter. The first two values indicate the reaction targets on both ends of the cross-linker. DSS is homobifunctional and specifically targets primary amines (lysine residues or the N-terminus). Thus, the first two values are both 1, which is the chemical code for primary amines. The third value is the amount of additional mass the spacer arm adds to the mass of two cross-linked peptides.
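The role of the third value can be illustrated with a short calculation. In the Python sketch below, the helper names and the two peptide masses are hypothetical. Only the three-field layout and the DSS spacer mass are taken from the example above.

```python
def parse_cross_link(value):
    """Split a cross_link value into (target_a, target_b, spacer_mass)."""
    target_a, target_b, mass = value.split()
    return int(target_a), int(target_b), float(mass)


def cross_linked_mass(peptide_a_mass, peptide_b_mass, spacer_mass):
    """Mass of two peptides joined by a cross-linker spacer arm."""
    return peptide_a_mass + peptide_b_mass + spacer_mass


target_a, target_b, spacer = parse_cross_link("1 1 138.068742")
# Hypothetical neutral masses for two linked peptides.
total = cross_linked_mass(1200.6000, 950.4500, spacer)
print(f"{total:.4f}")
```

The precursor mass that must be matched for a cross-linked pair is therefore larger than the sum of the two peptide masses by exactly the spacer mass.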

It is also common for cross-linkers to produce partial linkages. These occur when one side of the cross-linker binds to a protein, but the other side becomes inert through hydrolysis. The resulting peptide remains unlinked, but with a differential modification equal to the hydrolyzed cross-linker at the target moiety:

mono_link = 1 156.0786

Similar to the cross_link parameter, the mono_link parameter requires a reaction target, albeit only one. The differential mass is different as well, to reflect the different atomic composition of the incompletely linked spacer arm.

Often, you will use both the cross_link and mono_link parameters in the same configuration file. However, some cross-linkers may not produce partial links, and thus the mono_link lines can be deleted or commented out using the note # character. Also, as is the case for zero-length cross-linkers, the differential mass may be negative if there is no spacer arm and the target residues are instead joined directly in a condensation reaction that releases water.

It is also possible to add multiple cross_link and mono_link parameter lines to a configuration file if you used more than one cross-linker in your sample preparation, or there is more than one outcome for the reactions. For example, if you used a light-heavy labeled combination of BS3, your parameters may look like this:

cross_link = 1 1 138.068742  #normal BS3
cross_link = 1 1 150.143404  #heavy isotope-labeled BS3
mono_link = 1 156.0786       #normal
mono_link = 1 168.15393      #heavy

Or perhaps you used two cross-linkers of different chemistries. Your parameters may look like this:

cross_link = 1 1 138.068742  #DSS
cross_link = 1 2 -18.010595  #EDC
mono_link = 1 156.0786       #DSS half-reaction

4. Modifications

The Modifications parameters refer to amino acid modifications that are independent of the cross-linking reactions. These occur naturally (e.g. post-translational modifications) or as a result of sample preparation (e.g. carbamidomethylcysteine):

fixed_modification = C 57.02146
modification = M 15.9949
max_mods_per_peptide = 2

The fixed_modification and modification parameters each require two values: a target amino acid and a mass. The fixed_modification parameter specifies a static mass that is always added to the normal mass of the indicated amino acid. The modification parameter specifies a differential mass that may or may not be present on that amino acid. Because each differential modification exponentially increases search time, use max_mods_per_peptide to limit the number of differential modifications per peptide to a practical number. You can also specify multiple fixed_modification and modification parameter lines to indicate the different mass modifications you expect to find in your analysis.
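To see why a limit on differential modifications matters, consider how many distinct forms of one peptide must be searched. The Python sketch below counts them for a single modification type. It is a simplified illustration rather than Kojak's search logic, and the function name is hypothetical.

```python
from math import comb


def count_modified_forms(n_sites, max_mods_per_peptide):
    """Count forms of a peptide with n_sites modifiable residues.

    Each site is either modified or not, and at most
    max_mods_per_peptide sites may be modified at once.
    """
    upper = min(n_sites, max_mods_per_peptide)
    return sum(comb(n_sites, k) for k in range(upper + 1))


# A peptide with six methionines and one differential modification (oxidation).
print(count_modified_forms(6, 6))  # no effective limit: 2**6 = 64 forms
print(count_modified_forms(6, 2))  # limited to two modifications: 22 forms
```

Each additional modification type multiplies these counts further, which is why a small limit keeps search times practical.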

Helpful Hint

You can specify multiple differential modifications for the same amino acid. Place each modification on a separate line, but indicate the same amino acid with a different mass.

5. Scoring Algorithm

The Scoring Algorithm parameters describe the tolerances applied by the scoring algorithm when analyzing MS/MS spectra of different resolutions and mass accuracies. High-resolution MS/MS spectra use a very small mass tolerance and no mass offset. Below is an example of values appropriate for high-resolution (e.g. Orbitrap) MS/MS spectra:

fragment_bin_size = 0.03
fragment_bin_offset = 0.0

Low-resolution (e.g. ion trap) MS/MS spectra need a much wider mass tolerance, and an offset to make sure the peak values are most likely to fall within the center of the mass tolerance. For low-resolution MS/MS spectra, the following parameter values are recommended:

fragment_bin_size = 1.0005
fragment_bin_offset = 0.4
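The effect of bin size and offset can be sketched in a few lines of Python. This manual does not show Kojak's actual binning code, so the formula below (bin index = integer part of m/z / bin size + 1 - offset) is an assumption borrowed from a common convention in database search engines, and the function name is hypothetical.

```python
def fragment_bin(mz, fragment_bin_size, fragment_bin_offset):
    """Assign a fragment m/z to an integer bin (assumed convention)."""
    return int(mz / fragment_bin_size + (1.0 - fragment_bin_offset))


# Low-resolution settings: peaks within roughly 1 m/z share a bin.
low_res = [fragment_bin(mz, 1.0005, 0.4) for mz in (499.70, 500.25, 500.60)]
print(low_res)

# High-resolution settings: only peaks within a few hundredths of m/z share a bin.
high_res = [fragment_bin(mz, 0.03, 0.0) for mz in (500.26, 500.27, 500.30)]
print(high_res)
```

With the low-resolution values, the offset shifts the bin boundaries so that a typical peptide fragment near 500.25 m/z sits well inside its bin rather than at an edge.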

6. Analysis

The Analysis parameters are the largest set of parameters, but are easy to understand and use. Not all parameters are listed in these examples. You can find the complete list on the Parameters page.

enzyme = [KR]|{P}
max_miscleavages = 2

The above two parameters indicate the trypsin enzyme cleavage rule and the number of missed cleavages allowed. The missed cleavages include those caused by cross-linkers binding to the enzyme target residues.
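The cleavage rule and missed-cleavage limit can be illustrated with a small in-silico digestion. The Python sketch below reads [KR]|{P} as "cleave after K or R unless the next residue is P", which is the usual trypsin rule. The function name digest is hypothetical, and the sketch ignores cross-linked residues.

```python
import re


def digest(sequence, max_miscleavages=2):
    """Return peptides from cleaving after K or R, but not before P."""
    # End positions of every K or R that is not followed by P.
    cut_points = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    bounds = [0] + [c for c in cut_points if c < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Joining j - i - 1 internal cut points equals that many missed cleavages.
        last = min(i + 2 + max_miscleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


print(digest("MTHISISMYPRTEINSEKWENCE", max_miscleavages=0))
print(len(digest("MTHISISMYPRTEINSEKWENCE", max_miscleavages=2)))
```

Because a lysine that carries a cross-link is not cleaved, cross-linked peptides routinely need at least one missed cleavage to be found.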

min_peptide_mass = 500.0
max_peptide_mass = 6000.0

The peptide mass boundaries shown above are important to understand, but once set, you probably will use the same values for all Kojak analyses. These parameters determine the smallest and largest peptides to be searched, by mass in Daltons, not by number of amino acids. Setting a low min_peptide_mass will allow you to search for cross-links to small peptides, for example, peptides with just three amino acids. Although links to small peptides do happen, they are not likely to be seen after validation because there would be very few fragment ion peaks to contribute to the score of the PSM. Thus, a more practical value such as 500.0 Daltons is recommended. For max_peptide_mass, consider that cross-linkers binding to enzymatic cleavage sites will produce a larger number of long peptides. As a result, it might make sense to use a large value, such as 6000.0 Daltons, for a Kojak analysis, although such a number might be considered excessive in a typical shotgun database search.
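The mass limits act as a simple filter on candidate peptides. The Python sketch below uses standard monoisotopic residue masses to show why a three-residue peptide falls below the 500.0 Dalton limit. The function names are hypothetical, and modifications are ignored.

```python
# Standard monoisotopic residue masses in Daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in sequence) + WATER


def within_mass_limits(sequence, min_peptide_mass=500.0, max_peptide_mass=6000.0):
    """True if the peptide mass lies inside the configured limits."""
    return min_peptide_mass <= peptide_mass(sequence) <= max_peptide_mass


for peptide in ("GAK", "TEINSEK"):
    print(peptide, round(peptide_mass(peptide), 4), within_mass_limits(peptide))
```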

ppm_tolerance_pre = 25.0
prefer_precursor_pred = 1

These two parameters help define the search space for each spectrum. ppm_tolerance_pre defines the tolerance around the precursor mass, in parts per million, within which a PSM must fall to be accepted. Setting a value similar to your instrument mass accuracy is recommended, but there are various reasons to use much larger or smaller values that are outside the scope of this manual. Sometimes, such as with Thermo RAW data, a suggested monoisotopic precursor mass is provided for a spectrum. It is often helpful to use this value, which is indicated by setting prefer_precursor_pred to 1. If you set the value to 0, then Kojak will compute new monoisotopic precursor masses for the spectra. Regardless of this parameter setting, if no precursor monoisotopic masses were previously assigned to the spectra, Kojak will compute them.
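A tolerance in parts per million translates to a mass window that grows with precursor mass. The Python sketch below shows the arithmetic. The function name and the 2000.0 Dalton example mass are illustrative.

```python
def precursor_window(mass, ppm_tolerance_pre):
    """Return the (low, high) mass range allowed around a precursor mass."""
    delta = mass * ppm_tolerance_pre / 1e6
    return mass - delta, mass + delta


low, high = precursor_window(2000.0, 25.0)
print(f"{low:.3f} to {high:.3f}")
```

At 25 ppm, a 2000.0 Dalton precursor accepts candidates within 0.05 Daltons on either side.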

spectrum_processing = 1
max_spectrum_peaks = 0

The two parameters above determine if and how MS/MS spectra are processed. spectrum_processing can be turned off by setting the value to 0. As spectrum_processing is only intended for high-resolution MS/MS, it should be turned off for low-resolution MS/MS data. Processing the spectra involves collapsing isotope distributions to the monoisotopic peak to facilitate better scoring algorithm performance. Additionally, you can remove presumed noise peaks by using the max_spectrum_peaks parameter to limit the number of peaks analyzed after collapsing. For example, setting this parameter to 200 will limit the analysis to only the tallest 200 peaks in the spectrum. Setting the value to 0 causes all peaks in the spectrum to be analyzed.
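The peak limit amounts to keeping the most intense peaks and discarding the rest. The Python sketch below illustrates the idea on a list of (m/z, intensity) pairs. It is not Kojak's implementation, and the function name is hypothetical.

```python
def limit_peaks(peaks, max_spectrum_peaks):
    """Keep the tallest max_spectrum_peaks peaks; 0 keeps every peak."""
    if max_spectrum_peaks <= 0 or len(peaks) <= max_spectrum_peaks:
        return list(peaks)
    tallest = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    # Restore ascending m/z order after selecting by intensity.
    return sorted(tallest[:max_spectrum_peaks])


spectrum = [(100.0, 5.0), (200.0, 50.0), (300.0, 1.0), (400.0, 20.0)]
print(limit_peaks(spectrum, 2))
print(limit_peaks(spectrum, 0))
```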

decoy_filter = DECOY

This last parameter is critical for downstream validation using Percolator. It defines a case-sensitive word that is found only in the labels of your decoy database sequences. The decoy word can be any that you prefer, but make sure it does not appear in the labels of your target sequences, or they will incorrectly be assigned as decoys.
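The case sensitivity of the decoy word is easy to overlook. The Python sketch below treats the decoy word as a case-sensitive substring of the sequence label, which is an assumption about how the match is made. The function name and labels are illustrative.

```python
def is_decoy(label, decoy_filter="DECOY"):
    """True if the case-sensitive decoy word occurs in the sequence label."""
    return decoy_filter in label


labels = ["DECOY_Protein_1", "Protein_1", "decoy_Protein_2"]
print([is_decoy(label) for label in labels])
```

Under this assumption, the third label would be treated as a target because its case does not match, so it is worth checking your database labels against the exact word you set.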