What is Inductive Logic Programming?
Inductive Logic Programming (ILP)ILP (Muggleton 1991). combines Machine Learning and Logic Programming. Given a formal encoding of the background knowledge and a set of examples, an ILP system will derive hypotheses which explain all the positive examples and none, or almost none, of the negative examples. In this approach, logic is used as a language to induce hypotheses from the examples and background knowledge. Thus, the result of learning is represented as a logical formula in predicate logic, typically a Prolog program. Briefly, the basic form of the ILP problem is defined as follows.
- A background knowledge B which is the knowledge available before the learning.
- A finite set of examples E, E = E+ U E- where E+ is a nonempty set of positive examples, and E- is a set of negative examples.
- Find: hypotheses H (set of rules), such that:
- All or almost all positive examples e∈E+ are covered by H.
- No or few negative examples e∈E- are covered by H.
|Background knowledge (B)||Positive examples (E+)||Negative example (E-)|
|parent(Ann, Mary)||daughter(Mary, Ann)||daughter(Tom, Ann)|
|parent(Ann, Tom)||daughter(Eve, Tom)||daughter(Eve, Ann)|
Advantages of ILP
In comparison with other machine learning approaches, ILP has several advantages.
- Firstly, in data mining, ILP is able to discover knowledge from a multi-relational database consisting of multiple tables. Thus, ILP is also called multi-relational data mining (Dzeroski 2003).
- Secondly, using logic programming allows to encode more general forms of background knowledge such as recursions, functions or quantifiers (Srinivasan, King et al. 1999).
- Finally, the learned rules, which are based on first-order logic, are comprehensible to humans and computers and can be interpreted without the need for visualization.
Applications of ILP to real problems in bioinformaticsILP has been used recently in bioinformatics studies because of its ability to take into account background knowledge during the learning stage and to work directly with structured data. It has been successfully applied to various bioinformatics problems including toxicology (Srinivasan, et al., 1997), breast cancer (Woods, et al., 2009), protein structure prediction (Cootes, et al., 2003), gene function prediction (King, 2004), protein function class prediction (King, et al., 2000), protein-protein interaction prediction (Nguyen and Ho, 2008), protein-ligand interaction prediction (Kelley, et al., 2009) and microarray data classification (Ryeng and Alsberg, 2010).
Main steps for an ILP applicationThe goal is to use ILP to translate a mutation database to a mutation knowledge base. The steps to achieve this goal are illustrated in the figure below and the following sections expand on each of the stages.
Postive / negative example definitionThe first task involves identifying the problem and translating it into the positive and negative examples. Here, we have limited our study to the task of discriminating mutations linked to known human diseases (deleterious) from those associated with the "polymorphism" term (neutral), in accordance with the nomenclature used in the UniProtKB database
The predicate for the positive examples is "is_deleterious". For example, a positive example in Prolog syntax is "is_deleterious(m_Q13496_Asn180Lys)". This clause indicates that, in the Q13496 protein, the replacement of the Asparagine at position 180 by a Lysine is deleterious. The positive examples are written in a file with a ".f" extension.
The negative examples use the same predicate (is_deleterious) but they are written in difference file (with a ".n" extension).
Training set and validation set can be download here.
Background knowledge construction
- The construction of ILP background knowledge includes two steps.
- The first one allows the identification of the predicates. This is an important step as the theory returned is established by the predicates.
- In the second step, we use SQL scripts to translate this information stored in a relational database management system
Figure. Overview of the implementation of the background knowledge extracted from MSV3d database.
Figure. ER diagram for background knowledge
Table: Summary of background knowledge predicates used in our study.
|Information type||Predicates||Value||Calculation method||Description and comments|
|physico-chemical changes induced by the substitution||modif_size(+MutationID, #Value)||+ / - / =||Described in Physico-chemical changes Section||Size, charge, polarity and hydrophobicity modification.
+ : increase
- : decrease
= : same order of magnitude
!= : opposite
|modif_charge(+MutationID, #Value)||+ / - / = / !=|
|modif_polarity(+MutationID, #Value)||+ / - / =|
|modif_hydrophobicity(+MutationID, #Value)||+ / - / =|
|g_or_p(+MutationID, #Value)||+ / - / =||Local python script||Glycine or Proline disparition or apparition. For example: p.Arg2Gly (apparition), p.Pro2Leu (disparition)|
|functional and structural features||conservation_class(+MutationID, #Value)||Global rank1 / Global rank 2 / Sub-family conservation / no conservation typification||Described in Conservation Ranking Section||Conservation rank of the substituted position in the sampled alignment.|
|conservation_mut(+MutationID, -Value)||Percentage||Local python script||Percentage of mutated residue in the column of the alignment|
|freq_at_pos(+MutationID, -Value)||Integer||Database count||Number of known mutations at this position|
|secondary_struc(+MutationID, #Value)||Helix / sheet / no / no structural data available||DSSP||In a secondary structure element?|
|is_in_site(+MutationID, #Value)||yes / no||MACSIMS annotations (Thompson 2006)||The position of the mutation is within/outside an annotated functional site?|
|structural modifications induced by the substitution, based on the mutant models (data are computed only for sequences that share more the 50% identity with the template used for the model construction)||gain_contact(+MutationID, #Value)||
Hydrophilic - Hydrophilic contact
Aromatic - Aromatic contact
Hydrophobic - Hydrophobic contact
Hydrophobic - Hydrophilic contact
|CSU (Sobolev 1999)||Contacts between:
- the wild type residue and its direct 3D neighbours, based on the wild type 3D model
- the mutant residue and its direct 3D neighbours, based on the mutant 3D model are computed and compared
It should be noted that a residue can gain or lost from 0 to N the contacts with its direct 3D neighbours by the substitution.
|wt_accessibility(+MutationID, #Value)||buried / intermediate / accessible||Described in Accessibility Section||Wild type relative accessibility|
|mut_accessibility(+MutationID, #Value)||buried / intermediate / accessible||Mutant relative accessibility|
|stability(+MutationID, #Value)||Decrease / Increase||Described in Protein stability changes Section||I-mutant DDG variation|
|fold(+MutationID, #Value)||Fold name||Mining from SCOP DB||Structural Classification of Proteins|
Physico-chemical changes induced by the substitutionA classification of the physical and chemical properties of amino acids in KD4v is based on [Taylor et al., 1986]. The Taylor classification is displayed as a Venn diagram.
In this Venn diagram, charged is subcategorized into positive and negative. So modifications in charge are calculated as follows:
|Modifications in charge||Mutant residue|
|Positively charged||Negatively charged||Neutral|
|Wild type residue||Positively charged||unchanged||opposite||decrease|
Based on polarity, amino acids are classified into two groups: polar and non-polar. So modifications in polarity are calculated as follows:
|Modifications in polarity||Mutant residue|
|Wild type residue||Non polar||unchanged||increase|
The same method with modifications in size
|Modifications in size||Mutant residue|
|Wild type residue||Tiny||unchanged||increase||increase|
Hydrophobicity index for common amino acids is based on [Monera et al., 1995]. Affinity for water is described by 3 sets: hydrophobic, neutral and hydrophilic. So modifications in hydrophobicity are calculated as follows:
|Modifications in hydrophobicity||Mutant residue|
|Wild type residue||Hydrophobic||unchanged||decrease||decrease|
- Conservation ranking is calculated using an in-house developed method based on a
- (i) two independent conservation scores are computed for each column of the sampled MACS, namely, the free energy score [Lockless and Ranganathan, 1999] and a score based on the two-dimensional vector representation of the amino acids (Figure 7);
- (ii) these scores are classified with a DPC algorithm [Wicker et al., 2002] to define "groups of conservation;"
- (iii) the groups are then ranked. This process is reiterated for each MACS subfamily. Thus, two global conservation classes (rank 1 and rank 2) and one subfamily conservation class per subfamily are finally defined.
- If accessibility <= 10 then residue is buried.
- If 10 < accessibility <= 30 then residue is intermediate.
- If accessibility > 30 then residue is accessible.
Figure 6. Schema of the conservation ranking
Figure 7. Schema of the column conservation score calculation, based on the vector representation of amino acids. If N is the number of sequences in the alignment, the score is computed as follows: the N vectors of the residues in the column are added to obtain Vm, the representative vector of the column. The norm of the vector Vm is then divided by the sum of the norms of all vectors from the column and normalized by the number of residues in the column. The score will be 1 if all the residues in the column are identical and less than 1 otherwise.
- Solvent accessible surface (A**2) for wild type residue and mutant residue is calcluted by
CSU software (Sobolev 1999).
Protein stability changesThe change in protein relative stability upon single-site mutation (DDG value) is predicted with I-mutant2.0 [Capriotti et al., 2005] : a positive DDG value implies a protein stability increase, whereas a negative DDG value suggests a destabilizing mutation.
Cross-validation performance on the trainings setTo study the role of fold classification of proteins in the prediction, we identify 2 subsets of background knowledge. The table below shows the average of different measures of predictive for 5 fold cross-validation performance on the trainings set for each of background knowledge. As expected, the background with fold information would lead to better performance.
Table:Results of 5 fold cross-validation for comparison between different background knowledge on the traing set
|Background knowledge without Fold predicate||872||121||328||279||0,7267||0,6985||0,8784||0,7267||0,7197||0,7953|
|Background knowledge with Fold predicate||893||111||307||289||0,7442||0,723||0,8897||0,7442||0,7389||0,8104|
Click here to see the the definition of TP, FP, FN, TN, Sensitivity, Specificity, Precision, Recall, Accuracy and F-measure
ILP system selection and parameter identificationMany ILP systems have been proposed and they have been successfully applied to diverse domains. Examples of ILP systems are FOIL, Progol, Tidle and Aleph. We chose Aleph (Version 5) with the SWI-Prolog compiler (Version 5.6.47) to learn rules from a set of examples because of its popularity, frequent update and flexibility. It permits customization of all the settings of the learning task. For researchers, Aleph is also very attractive since it is coded in Prolog, and is consequently relatively easy to modify in order to adequately perform rule searches. See Aleph website for more details.
- There are many parameters of Aleph we can vary in our experiments.
- First, the parameter minpos, indicating the minimum number of positive examples to be covered by an acceptable clause, was set to 15.
- Second, we needed a larger default search space and we thus set the parameter nodes to 200,000 (default 5,000). The parameter nodes defines the maximum number of nodes on the search space to be explored by the algorithm.
- Finally, the parameter with the largest affect on the final theory was the noise, defined as the maximum number of negative examples to be covered by an acceptable clause. To estimate this parameter, we tuned the model with different values. We used 5-fold cross-validation on the training set for optimizing noise values and the optimal value, i.e., the value which resulted in the best performance on our data sets, was then used for the final model. The most stringent parameter noise was set to 5.
Model evaluationWe compare the performance of our best model with SIFT and PolyPhen-2, the most common and widely used methods that can be acquired, implemented and run locally. In the table below, different measures of predictive performance are reported on the validation set.
Table: Comparison of our KD4v with other systems on the validation set (658 disease-causing mutations and 298 neutral polymorphisms).
Where True Positives (TP) and True Negatives (TN) are the number of correct predictions of the positive and negative examples, respectively; False Positives (FP) is the number of negative examples incorrectly predicted as positive; and False Negatives (FN) is the number of positive examples incorrectly predicted as negative.
The F-measure is the harmonic mean of precision and recall. This score reaches its best value at 1 and worst score at 0. It is defined as follows:
Distribution of the predicted variantsThe chart beneath presents the distribution of predicted variants by our method, SIFT and PolyPhen-2, considering only 3D models from dbSNP.
How to interpret the rules
To illustrate how to understand the rules on the KD4v website, we can consider the first rule (humvar398_1) in above figure. This rule states that a mutation A is deleterious if:
+ The mutated residue belongs to the "global conservation rank 1 class". (See "conservation score calculation" in "Background knowledge construction" section)
+ There are two or more two observed mutations at the mutant position.
This rule correctly identified 475 deleterious mutations while misclassifying 2 neutral mutations as deleterious on training set
This rule states that a mutation A is deleterious if:
+ The mutated residue belongs to the "global conservation rank 1 class". (See "conservation score calculation" in "Background knowledge construction" section)
+ The wild-type residue is buried.
+ After point mutation, there is one more contact between mutant residue and its direct 3D neighbours. This contact is destabilizing (hydrophobic-hydrophilic contact).
Prediction service implementationBased on the rules learnt by the ILP algorithm described above, a service aimed at estimating nsSNP effects has been incorporated in the KD4v server. It can be accessed via SOAP API or the Prediction link on web interface, in the main menu. The input form allows users to specify the amino acid position and substitution of a given protein to be predicted including: the Uniprot accession number of the human protein, the mutation position, the wild type residue and the mutant residue. The wild type residue must correspond to the current protein sequence.
Figure. Screenshots of Input form. The mutation predicted in this example is the p.Gly455Asp of Probable G-protein coupled receptor 179 protein. (http://dx.doi.org/10.1016/j.ajhg.2011.12.007)
Given the input data, the MSV3d annotation service generates a multi-level characterization of the mutant. The process starts with the generation of mutant 3D models. Then, physico-chemical changes and structural modifications induced by the substitution, as well as functional and structural features related to the mutated position are calculated. If a 3D model is available, these values are transferred to the Convert2Prolog module to convert them into prolog facts which then become the input for the prediction engine. Thanks to Prolog, the deductive reasoning process immediately derives a conclusion (deleterious or neutral missense variant).
Figure. Screenshots of ouput page. The output page provides prediction results as well as multi-level (physico-chemical, functional, structural and evolutionary) characterizations of the mutation. The important advantage, in comparison with other predicition system in the same field, is that KD4v supplies the rules which explain the prediction result.
Web Browser Requirements
- The KD4v is optimized for the following internet browsers:
- Firefox 3.5+
- Google Chrome 16
- The latest java plugin should be installed in order to use the Jmol Applet.
How to cite usLuu, Dao; Rusu, Alin; Walter, Vincent; Linard, Benjamin; Poidevin, Laetitia; Ripp, Raymond; Moulinier, Luc; Muller, Jean; Raffelsberger, Wolfgang; Wicker, Nicolas; Lecompte, Odile; Thompson, Julie; Poch, Olivier; Nguyen, Hoan (2012). KD4v: Comprehensible Knowledge Discovery System For Missense Variant. Nucleic Acids Res
What is Rule Ranking?
Rule Ranking is based on the evaluation function of ILP system. Evaluation function used in KD4v is coverage. With coverage function, rule utility is P - N, where P, N are the number of positive and negative examples covered by the rule. A rule is ranked highest (#1) as this rule explains maximum number of the positive examples and none, or almost none, of the negative examples.
Why KD4v doesn't have prediction score?
Given a set of positive and negative examples, an ILP system finds a set of rules that differentiates between the positive and negative examples. An unseen example is classified as positive if it satisfies the conditions of one of the rules. If the example does not match any rule, it receives the negative classification. Thus we have no classification score.
What is MACSIMS?
MACSIMS stands for Multiple Alignment of Complete Sequences Information Management System. MACSIMS allows the automatic annotation of a MACS by integration of mined and predicted structural and functional information. By extension, the output of this program is called a MACSIMS.
See (Thompson 2006) for more details.
Why is the MACSIMS and/or Jmol loading so long?
The size of a MACSIMS file (XML format) can reach more than 1Mo, which can be too much for some old browsers. The Jmol applet loading is performed in parallel of the MACSIMS loading, this explains why the loading time can be quite long.
Why does the Jmol applet stay in loading state?
Jmol applet can take time to load a molecule. After several minutes, please reload the page (F5). If the problem persists, check your browser.
Why does the Jmol applet not work?
Several reasons may be invoked.
Jmol works with Java version >= 1.4. If your Java version is under 1.4, please update your Java library. If you are working on MacOS system, the problem comes from JAVA liveConnect that is not supported natively by this system. See Java or MacOS documentation about LiveConnect. On GNU/linux system (tested on Fedora), all functions work fine Firefox and Mozilla browsers. The interface has also been extensively tested on MS Windows system: no problem occurred with Firefox, Mozilla and Google Chrome under XP, Windows Vista, Windows 7. If you have problems, check your Java configuration.
Taylor, W.R. (1986) The classification of amino acid conservation. J Theor Biol, 119, 205-218.
Monera, O.D., Sereda, T.J., Zhou, N.E., Kay, C.M., and Hodges, R.S. (1995). Relationship of sidechain hydrophobicity and alpha-helical propensity on the stability of the single-stranded amphipathic alpha-helix. J Pept Sci 1, 319-329.
Srinivasan, A., King, R.D., Muggleton, S. and Sternberg, M.J.E. (1997) Carcinogenesis Predictions Using ILP. Proceedings of the 7th International Workshop on Inductive Logic Programming. Springer-Verlag.
Muggleton, S. (1991) Inductive logic programming. New Generation Computing 8(4): 295-318.
Sobolev, V., Sorokine, A., Prilusky, J., Abola, E.E. and Edelman, M. (1999) Automated analysis of interatomic contacts in proteins. Bioinformatics, 15, 327-332.
Srinivasan, A., R. King, et al. (1999) The role of background knowledge: using a problem from chemistry to examine the performance of an ILP program. Oxford University Computing Laboratory.
Woods, R.W., Oliphant, L., Shinki, K., Page, D., Shavlik, J. and Burnside, E. (2009) Validation of Results from Knowledge Discovery: Mass Density as a Predictor of Breast Cancer. J Digit Imaging.
King, R.D., Karwath, A., Clare, A. and Dehaspe, L. (2000) Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining. Yeast, 17, 283-293.
Cootes, A.P., Muggleton, S.H. and Sternberg, M.J. (2003) The automatic discovery of structural principles describing protein fold space. J Mol Biol, 330, 839-850.
Dzeroski, S. (2003) Multi-relational data mining: an introduction. SIGKDD Explor. Newsl. 5(1): 1-16.
King, R.D. (2004) Applying inductive logic programming to predicting gene function. AI Mag., 25, 57-68.
Capriotti, E., Fariselli, P. and Casadio, R. (2005) I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res, 33, W306-310.
Thompson, J.D., Muller, A., Waterhouse, A., Procter, J., Barton, G.J., Plewniak, F. and Poch, O. (2006) MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics, 7, 318.
Nguyen, T.P. and Ho, T.B. (2008) An integrative domain-based approach to predicting protein-protein interactions. J Bioinform Comput Biol, 6, 1115-1132.
Kelley, L.A., Shrimpton, P.J., Muggleton, S.H. and Sternberg, M.J. (2009) Discovering rules for protein-ligand specificity using support vector inductive logic programming. Protein Eng Des Sel, 22, 561-567.
Ryeng, E. and Alsberg, B.K. (2010) Microarray data classification using inductive logic programming and gene ontology background information. Journal of Chemometrics, 24, 231-240.