Video Tutorial

Click here to download the files used in the video tutorial. These files are described further in the reference located here. A more detailed written tutorial appears further down this page.

Uploading datasets to CIIPro

Profiling Compounds and Optimizing Bioprofiles

Making Predictions on a Test Set

Interpreting CIIPro Similarity Graphs

Tutorial


The tutorial introduces CIIPro in two parts:


    I. CIIProfiler
    II. CIIP Predictor

Please forward feedback to danrusso@scarletmail.rutgers.edu.

Before using CIIPro, upload a compound dataset to work with. This is done under the Datasets tab. Compounds can be uploaded in a variety of formats and are automatically converted to their PubChem Compound Identifiers (CIDs), if available. The dataset is stored and made available to the user. (Please note: identifiers that must be converted to PubChem CIDs are converted through PubChem's Power User Gateway. This may take a long time for large datasets.)

Each uploaded dataset must be labeled as either a training set or a test set. Training sets are used for biological profiling and can be used to predict the biological activity of test sets. Training sets and test sets should both be uploaded as tab-delimited text files containing two columns. The first column should contain the chemical identifier (e.g., PubChem CID, CAS Registry Number, or IUPAC name), and the second should contain the compound's activity as a binary value: 1 for active or 0 for inactive.
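The two-column layout described above can be checked locally before uploading. The following is a minimal sketch, not part of CIIPro: the function name read_activity_file is hypothetical, the file is assumed to have no header row, and the activity labels in the sample are illustrative only.

```python
import csv
import io

def read_activity_file(handle):
    """Parse a two-column, tab-delimited dataset into (identifier, activity) pairs.

    Hypothetical helper for pre-upload checking; assumes no header row.
    """
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        identifier, activity = fields[0].strip(), fields[1].strip()
        if activity not in ("0", "1"):
            raise ValueError(f"line {line_no}: activity must be 0 or 1, got {activity!r}")
        rows.append((identifier, int(activity)))
    return rows

# Illustrative sample: two PubChem CIDs with made-up activity labels.
sample = "2244\t1\n3672\t0\n"
dataset = read_activity_file(io.StringIO(sample))
print(dataset)  # [('2244', 1), ('3672', 0)]
```

To check a file on disk, pass an open file handle instead of the in-memory sample, for example open("training_set.txt", newline="").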

The upload form defaults to PubChem CID as the chemical identifier. If your dataset uses a different identifier, use the radio buttons to specify which one. Specifying the wrong identifier may cause errors when the dataset is used in the CIIP Predictor. If this happens, delete the uploaded file and upload it again using the correct identifier.

I. CIIProfiler


After uploading a dataset, create a biological profile under the CIIProfiler tab. The biological profile is created by extracting all the relevant biological testing results for compounds in the training set. This biological profile can be optimized by requiring a minimum number of active responses per assay (the default minimum is 6). The CIIProfiler workflow is shown in Figure 1.

The resulting biological profile is a matrix consisting of m rows of compounds and n columns of assays, and it is automatically displayed as a heatmap. In the matrix, cell a_{i,j} is the response of the compound in the i-th row against the assay in the j-th column. The response for a compound is categorized as 1 for active (dark blue), 0 for inconclusive or untested (grey), and -1 for inactive (light blue). Hovering over a cell with the mouse displays its activity. The heatmap can be resized by clicking and dragging it in the desired dimension, and it can be downloaded by clicking the save icon in the upper right-hand corner.

The CIIProfiler also calculates the performance of each in vitro assay for predicting the in vivo responses. A table of performance statistics for each assay is displayed along with the heatmap. Assays can be ranked by a given statistic by clicking its column header. Click here for a glossary of the statistical metrics calculated by CIIPro.
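The optimization step, which drops assays with too few active responses, can be sketched as a column filter over the profile matrix. This is an illustration under stated assumptions rather than CIIPro's own code: the function filter_assays and the assay names are hypothetical, and the matrix uses the 1 / 0 / -1 encoding described in this section.

```python
def filter_assays(profile, assay_ids, min_actives=6):
    """Keep only assay columns with at least `min_actives` active (1) responses.

    `profile` is a list of rows (one per compound); each row holds one
    response per assay: 1 active, 0 inconclusive/untested, -1 inactive.
    """
    keep = [
        j for j in range(len(assay_ids))
        if sum(1 for row in profile if row[j] == 1) >= min_actives
    ]
    kept_ids = [assay_ids[j] for j in keep]
    kept_profile = [[row[j] for j in keep] for row in profile]
    return kept_profile, kept_ids

# Toy profile: 3 compounds x 3 assays (values are illustrative).
profile = [
    [1, 0, -1],
    [1, -1, 0],
    [1, 1, -1],
]
assay_ids = ["assay_A", "assay_B", "assay_C"]

# A small cutoff is used here only because the toy matrix is tiny.
filtered, kept = filter_assays(profile, assay_ids, min_actives=2)
print(kept)      # ['assay_A']
print(filtered)  # [[1], [1], [1]]
```

Raising the cutoff removes more sparse assays at the cost of a narrower profile, which is the trade-off the minimum-actives setting controls.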

Figure 1. The CIIProfiler tool will remove insignificant assays by allowing the user to adjust the minimum number of actives required per assay; this creates a less biased, optimized biological profile.

II. CIIP Predictor


After creating a biological profile for the training set compounds, use CIIP Predictor to calculate the Weighted Estimated Biological Similarity (WEBS) between the compounds in the test set and the compounds in the training set. The WEBS tool calculates two values for each compound pair: the biological similarity and its confidence score. The biological similarity is a value between 0 and 1 that estimates the similarity of two compounds based on their respective in vitro responses. Two compounds with a similarity score of 1.0 would be considered identical, and two compounds with a similarity score of 0.0 would be considered totally dissimilar. Because some data may be missing, each biological similarity value is assigned a confidence score that estimates its reliability. A higher confidence score indicates a more reliable biological similarity value.

The output files for both biological similarity and confidence scores are matrices in which rows represent compounds in the test set and columns represent compounds in the training set. Cell a_{i,j} in the biological similarity matrix contains the similarity score of the compound in the i-th row and the compound in the j-th column. The same cell in the confidence score matrix contains the reliability (i.e., the confidence score) of the biological similarity calculated for those two compounds.

From the biological similarity and confidence scores calculated by the WEBS tool, the biological nearest neighbors are determined by applying cutoffs to both values. The biological similarity cutoff is the minimum biological similarity score for a compound to be considered a nearest neighbor of the target compound. Before running CIIP Predictor, set this value to a floating point number between 0.0 and 1.0. The default value is 0.5. Then, enter a confidence score cutoff. This cutoff is the percentage of assays in the biological profile in which both compounds must have responses for a biological similarity calculation to be considered meaningful. The CIIP Predictor performs best on most datasets with a confidence cutoff of 0%; however, you can adjust this value as needed for specific datasets.
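The two cutoffs can be illustrated with a short sketch. It does not reproduce the WEBS calculation: the similarity values are taken as given, the confidence score follows only the description above (the percentage of assays in which both compounds have a response, assuming a response means a non-zero entry), and treating the cutoffs as inclusive is an assumption. The function names and training-set identifiers are hypothetical.

```python
def confidence_percent(profile_a, profile_b):
    """Percentage of assays in which both compounds have a response.

    Assumes the 1 / 0 / -1 encoding, where 0 means inconclusive or untested.
    """
    shared = sum(1 for a, b in zip(profile_a, profile_b) if a != 0 and b != 0)
    return 100.0 * shared / len(profile_a)

def nearest_neighbors(similarity_row, confidence_row, train_ids,
                      sim_cutoff=0.5, conf_cutoff=0.0):
    """Return (similarity, training id) pairs passing both cutoffs, best first.

    The rows correspond to one test compound's row in the similarity and
    confidence matrices; cutoffs are treated as inclusive (an assumption).
    """
    candidates = [
        (sim, cid)
        for sim, conf, cid in zip(similarity_row, confidence_row, train_ids)
        if sim >= sim_cutoff and conf >= conf_cutoff
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates

# Both compounds have responses in assays 0 and 3 out of 4.
print(confidence_percent([1, 0, -1, 1], [1, -1, 0, -1]))  # 50.0

# One test compound against three hypothetical training compounds.
sims = [0.9, 0.4, 0.7]
confs = [50.0, 80.0, 10.0]
ids = ["train_1", "train_2", "train_3"]
print(nearest_neighbors(sims, confs, ids))                    # [(0.9, 'train_1'), (0.7, 'train_3')]
print(nearest_neighbors(sims, confs, ids, conf_cutoff=25.0))  # [(0.9, 'train_1')]
```

With the confidence cutoff at 0%, only the similarity cutoff removes candidates; raising it excludes pairs whose similarity rests on few shared assays.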

Lastly, select the number of biological nearest neighbors to be used for predictions. This is a number from 1 to 5. The activities of each test compound's biological nearest neighbors are averaged to predict the target compound's activity. Compounds that do not have enough biological nearest neighbors to make a prediction are labeled 'N/A'.
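The averaging step can be sketched as follows. This is an illustration, not the CIIP Predictor implementation: the function predict_activity is hypothetical, the average is an unweighted mean, and "not enough neighbors" is assumed to mean fewer than the selected number of neighbors. The source does not state whether the averaged value is further rounded to 0 or 1, so the sketch returns the raw mean.

```python
def predict_activity(neighbors, train_activity, k=3):
    """Average the activities of the k most similar neighbors, or return 'N/A'.

    `neighbors` is a list of (similarity, training id) pairs that already
    passed the similarity and confidence cutoffs. Returning 'N/A' when fewer
    than k neighbors are available is an assumption of this sketch.
    """
    if len(neighbors) < k:
        return "N/A"
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(train_activity[cid] for _, cid in top) / k

# Hypothetical neighbors and training-set activities (1 active, 0 inactive).
neighbors = [(0.9, "train_a"), (0.8, "train_b"), (0.6, "train_c")]
train_activity = {"train_a": 1, "train_b": 1, "train_c": 0}

print(predict_activity(neighbors, train_activity, k=2))      # 1.0
print(predict_activity(neighbors, train_activity, k=3))      # 0.666... (two of three neighbors active)
print(predict_activity(neighbors[:1], train_activity, k=2))  # N/A
```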

Once you have set all the input parameters, click "Predict". CIIP Predictor returns a table listing the compounds in the test set, each compound's in vivo activity, and the activity predicted for each compound by CIIP Predictor. The workflow of the CIIP Predictor is shown in Figure 2.

To visualize the biological nearest neighbors and the chemical nearest neighbors (i.e., compounds in the training set that are structurally similar to the target compound in the test set) of a predicted compound, click the PubChem CID of any compound in the prediction table for which an activity prediction was returned. This opens a new browser tab containing a similarity graph: a plot that shows the chemical nearest neighbors to the left and the biological nearest neighbors to the right of the target compound's predicted activity. The y-axis represents the similarity score for each compound, ranging from 0 to 1. Hovering over a data point displays information for that compound. The vertical bar in the center shows the prediction scores from the chemical nearest neighbors (left half) and the biological nearest neighbors (right half). Chemical nearest neighbors are calculated using MACCS keys as features and the Tanimoto coefficient as the similarity metric.
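For reference, the Tanimoto coefficient between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents each fingerprint as a set of "on" bit positions; the bit positions are made up for illustration and are not real MACCS keys, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints (an assumption)
    return len(a & b) / union

# Illustrative bit positions only; a cheminformatics toolkit would supply real keys.
target = {1, 5, 9, 42}
candidate = {5, 9, 42, 100}
print(tanimoto(target, candidate))  # 0.6  (3 shared bits / 5 bits in the union)
print(tanimoto(target, target))     # 1.0
```

Ranking training compounds by this value against a test compound gives chemical nearest neighbors in the same way the biological similarity score ranks biological ones.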




Figure 2. Under the CIIP Predictor tab, users can calculate biological similarities, confidence scores, and biological nearest neighbors. With activity data, users can generate in vitro-in vivo correlations, generate predictions for compounds, and cross-validate their model.