Validating PSMs

Validating Kojak PSMs

The Kojak results list the best peptide-spectrum match (PSM) for each spectrum analyzed. Each PSM is simply the best theoretical sequence match to the observed spectrum peaks, and not necessarily the correct sequence match. There are many reasons why the correct PSM may not have been found (e.g. the sequence was not in your database, the peptide contains an unknown PTM, or the spectrum quality is poor), and thus the observed PSMs must be statistically validated.

Kojak results come pre-formatted for use with Percolator, a semi-supervised machine learning algorithm for validating PSMs at a user-defined false discovery rate. Percolator is developed and maintained by the Käll Lab at KTH, and is freely available for download. Extensive discussion of Percolator and its features is beyond the scope of these instructions; however, basic instructions are provided below.

Advanced Users

There are many alternatives to Percolator for validating PSMs. If you prefer an alternative, all the parameters and scoring metrics from the Kojak algorithm are provided in the basic Kojak results file. These results should be parsed and formatted for your algorithm of choice.

Running Percolator

Percolator is run from the command line. The command-line syntax and available options change with each version. These instructions describe Percolator version 2.08. To list the command-line options for your version, run Percolator with the -h option.

Run Percolator from the directory containing your Kojak results. To validate Kojak results, provide the Percolator-formatted file as input, and redirect output to a new file:

$ /usr/bin/percolator data_set1.perc.intra.txt > data_set1.intra.validated.txt

In the above example, intra-protein cross-linked PSMs from Kojak are analyzed with Percolator to obtain a non-redundant list of unique cross-linked peptides. Each unique cross-linked peptide is assigned a q-value, which can be used to filter the list at a chosen false discovery rate.

Helpful Hint

Kojak formats and partitions all PSMs into subsets designed for Percolator analysis. The files for these subsets are labeled with perc to indicate that they are input for Percolator. Each file is also labeled with inter, intra, loop, or single to describe the type of PSM it contains.

Label	Description
inter	inter-protein cross-linked PSMs
intra	intra-protein cross-linked PSMs
loop	loop-linked PSMs
single	non-linked PSMs (i.e. typical shotgun PSMs)
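
To validate every subset in one pass, the four labels in the table above can be looped over. The following is a minimal sketch, not part of Kojak or Percolator. It assumes that each subset file is named `<base>.perc.<label>.txt` and that the output should be named `<base>.<label>.validated.txt`, both inferred from the single `data_set1.perc.intra.txt` example above. The function name and the `PERCOLATOR` variable are hypothetical.

```shell
# Sketch: run Percolator on each Kojak subset that exists for a data set.
# Assumption: files are named <base>.perc.<label>.txt (inferred from the example).
# PERCOLATOR is a hypothetical override for the path to the Percolator binary.
run_percolator_subsets() {
  base="$1"
  percolator_bin="${PERCOLATOR:-/usr/bin/percolator}"
  for label in inter intra loop single; do
    input="${base}.perc.${label}.txt"
    output="${base}.${label}.validated.txt"
    if [ -f "$input" ]; then
      # Same invocation as the single-file example: input file, output redirected.
      "$percolator_bin" "$input" > "$output"
    else
      echo "skipping ${label}: ${input} not found" >&2
    fi
  done
}
```

For example, `run_percolator_subsets data_set1` would process whichever of the four subset files are present in the current directory and skip the rest.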

Interpreting Percolator Results

Percolator results are tab-delimited and can be viewed with spreadsheet software. Detailed instructions are found in the Percolator Results section.
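
Because the results are tab-delimited, they can also be filtered on the command line instead of in a spreadsheet. The sketch below keeps the header line and every row whose q-value is at or below a threshold. It assumes the first line of the results file is a header containing a column named exactly `q-value`; check this against your own output before relying on it. The function name and the 0.05 threshold are illustrative only.

```shell
# Sketch: keep rows with q-value <= threshold from a tab-delimited results file.
# Assumption: the first line is a header with a column named "q-value".
filter_by_qvalue() {
  # $1 = results file, $2 = maximum q-value to keep (e.g. 0.05)
  awk -F '\t' -v max_q="$2" '
    NR == 1 {
      # Locate the q-value column by name rather than by fixed position.
      for (i = 1; i <= NF; i++) if ($i == "q-value") col = i
      if (!col) { print "no q-value column found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    ($col + 0) <= (max_q + 0) { print }
  ' "$1"
}
```

A call such as `filter_by_qvalue data_set1.intra.validated.txt 0.05 > data_set1.intra.filtered.txt` would write the filtered rows to a new file (the output file name here is hypothetical).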

Additional Information

Percolator trains its parameters on a subset of the input PSMs. If the number of input PSMs is low (e.g. fewer than 100), the train and test procedure may fail, in which case Percolator reports an error message. If the procedure completes despite a low number of input PSMs, the results may not accurately model the true error distribution, and Percolator may issue a warning. In such cases, an alternative form of validation may be necessary.
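
A quick row count before running Percolator can flag subsets that are likely too small. The following is a rough sketch that assumes the Percolator-formatted file has exactly one header line followed by one PSM per line. The function names are hypothetical, and the default of 100 simply mirrors the example figure above rather than any hard limit in Percolator.

```shell
# Sketch: count PSM rows in a Percolator-formatted input file.
# Assumption: exactly one header line, then one PSM per non-empty line.
count_psms() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Warn (and return non-zero) when a file has fewer PSMs than a minimum.
check_psm_count() {
  file="$1"
  min="${2:-100}"
  n=$(count_psms "$file")
  if [ "$n" -lt "$min" ]; then
    echo "warning: ${file} has only ${n} PSMs (fewer than ${min})" >&2
    return 1
  fi
  echo "${file}: ${n} PSMs"
}
```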

Percolator requires that the input PSMs contain both target and decoy labels. Thus, it is essential that Kojak was run with decoy protein sequences in the database. Additionally, the decoy_filter parameter must be set to a tag that is contained in the descriptions of all decoy protein sequences. Additional details can be found in the Configuration instructions. If Kojak is not run this way, Percolator cannot assign q-values to the PSMs.
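
One way to confirm that both label types are present is to tally them in the Percolator-formatted file before validation. This sketch assumes the layout of Percolator's tab-delimited input, where the first line is a header and the second column holds the label, with 1 for targets and -1 for decoys. Verify that layout against your own files; the function name is hypothetical.

```shell
# Sketch: tally target and decoy PSMs in a Percolator-formatted input file.
# Assumption: one header line; column 2 is the label (1 = target, -1 = decoy).
count_labels() {
  awk -F '\t' '
    NR == 1 { next }              # skip the header line
    $2 == "1"  { targets++ }
    $2 == "-1" { decoys++ }
    END { printf "targets=%d decoys=%d\n", targets, decoys }
  ' "$1"
}
```

If either count is zero, the decoy sequences in the database or the decoy_filter setting are the first things to check.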