Machine learning for language toolkit
In this tutorial we describe training maximum entropy document classifiers with expectation constraints that specify affinities between words and labels. See [Druck, Mann, and McCallum 2008] for more information. We assume that the task is classifying baseball and hockey documents and that we have processed data sets baseball-hockey.train.vectors and baseball-hockey.test.vectors. These methods require unlabeled training data. We can hide labels using Vectors2Vectors.
java cc.mallet.classify.tui.Vectors2Vectors \
--input baseball-hockey.train.vectors \
--output baseball-hockey.unlabeled.vectors \
--hide-targets
If the data is truly unlabeled, then the easiest way to import it is to assign an arbitrary label to each document, ensuring that each label is used at least once.
Suppose we know a priori that the words baseball and puck are good indicators of labels baseball and hockey respectively. Specifically, suppose that we estimate that 90% of the documents in which the word puck occurs should be labeled hockey, and similarly for baseball. We may specify these constraints in a file as follows.
baseball hockey:0.1 baseball:0.9
puck hockey:0.9 baseball:0.1
The general format for a constraints file is:
feature_name label_name:probability label_name:probability ...
The number of probabilities must be equal to the number of labels. The feature and label names must match the names in the data and target alphabets exactly.
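Before training, it can help to check a constraints file against the rules above. The following is a minimal sketch, not part of MALLET: parse_constraints is a hypothetical helper that reads lines in the colon-separated form used in the example and checks that every label appears exactly once per feature. The check that the probabilities sum to one is our own assumption, not a documented requirement.

```python
def parse_constraints(lines, labels):
    """Parse lines of the form: feature label:prob label:prob ..."""
    constraints = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        feature, *pairs = line.split()
        dist = {}
        for pair in pairs:
            label, prob = pair.rsplit(":", 1)
            dist[label] = float(prob)
        # One probability per label, with names matching the target alphabet.
        if len(pairs) != len(labels) or set(dist) != set(labels):
            raise ValueError(f"{feature}: expected one probability for each of {sorted(labels)}")
        # Assumption: the targets form a distribution over labels.
        if abs(sum(dist.values()) - 1.0) > 1e-6:
            raise ValueError(f"{feature}: probabilities do not sum to 1")
        constraints[feature] = dist
    return constraints


example = [
    "baseball hockey:0.1 baseball:0.9",
    "puck hockey:0.9 baseball:0.1",
]
constraints = parse_constraints(example, ["baseball", "hockey"])
print(constraints)
```

Checking that feature names exist in the data alphabet would require loading the vectors file and is left out of this sketch.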
The following command trains a MaxEnt classifier with the above constraints (assumed to be in the file baseball-hockey.constraints) using Generalized Expectation (GE), as described in Druck, Mann, and McCallum 2008. We specify the constraints file using constraintsFile and a regularization penalty using gaussianPriorVariance.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,
constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
By default, the difference between the target and model expectations is penalized using KL divergence (as in Druck, Mann, and McCallum 2008). Instead, we can impose an L2 penalty using the L2 option.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntGETrainer,gaussianPriorVariance=0.1,L2=true,
constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
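To get a feel for how the two penalties differ, the sketch below compares them for a single constraint. It is an illustration only: the function names are hypothetical, the model expectation is a made-up value, and the exact form and scaling used inside MaxEntGETrainer may differ. We assume the KL divergence is taken from the target distribution to the model expectation, and that the L2 penalty is the squared Euclidean distance between the two.

```python
import math


def kl_divergence(target, model):
    """KL(target || model) over labels; terms with zero target mass contribute 0."""
    return sum(t * math.log(t / model[label]) for label, t in target.items() if t > 0)


def l2_penalty(target, model):
    """Squared Euclidean distance between target and model expectations."""
    return sum((t - model[label]) ** 2 for label, t in target.items())


# Target from the constraints file for the feature "puck".
target = {"hockey": 0.9, "baseball": 0.1}
# Hypothetical model expectation partway through training.
model = {"hockey": 0.6, "baseball": 0.4}

kl = kl_divergence(target, model)
l2 = l2_penalty(target, model)
print(f"KL = {kl:.4f}, L2 = {l2:.4f}")
```

Both penalties are zero when the model expectation equals the target. KL grows without bound as the model assigns vanishing probability to a label with positive target mass, whereas the L2 penalty stays bounded.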
The underlying trainer is cc.mallet.classify.MaxEntGETrainer. New GE constraints and penalties for training MaxEnt models can be defined by implementing cc.mallet.classify.constraints.ge.MaxEntGEConstraint.
It is also possible to specify L2 constraints that do not impose a penalty if the model expectation is within some target range. For example, we can encourage model expectations to be in the range 90-100%.
baseball baseball:0.9,1
hockey hockey:0.9,1
In general, the format for range constraints is:
feature_name label_name:lower_probability,upper_probability ...
Support for such constraints is provided by MaxEntGERangeTrainer.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntGERangeTrainer,gaussianPriorVariance=0.1,
constraintsFile=\"baseball-hockey.range_constraints\"" \
--report test:accuracy
The underlying trainer is cc.mallet.classify.MaxEntGERangeTrainer. New GE constraints and penalties for training MaxEnt models can be defined by implementing cc.mallet.classify.constraints.ge.MaxEntGEConstraint.
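The idea behind a range constraint can be sketched as an L2 penalty that is zero whenever the model expectation falls inside the target range and grows quadratically with the distance to the nearest bound otherwise. This is our reading of the description above, not the actual MaxEntGERangeTrainer code, and range_l2_penalty is a hypothetical name.

```python
def range_l2_penalty(expectation, lower, upper):
    """Squared distance from the expectation to the interval [lower, upper]."""
    if expectation < lower:
        return (lower - expectation) ** 2
    if expectation > upper:
        return (expectation - upper) ** 2
    return 0.0  # inside the range: no penalty


# Range constraint "baseball baseball:0.9,1" from the example above.
for expectation in (0.70, 0.90, 0.95, 1.00):
    print(expectation, range_l2_penalty(expectation, 0.9, 1.0))
```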
There is also support for training MaxEnt models with Posterior Regularization (PR) (Ganchev, Graça, Gillenwater, and Taskar 2010). The following command trains a MaxEnt classifier using the above constraints (assumed to be in the file baseball-hockey.constraints) with PR for 100 iterations. We specify the constraints file using constraintsFile and a regularization penalty for each step (cf. Bellare, Druck, and McCallum 2009) using pGaussianPriorVariance and qGaussianPriorVariance.
mallet train-classifier \
--training-file baseball-hockey.unlabeled.vectors \
--testing-file baseball-hockey.test.vectors \
--trainer "MaxEntPRTrainer,minIterations=100,maxIterations=100,
pGaussianPriorVariance=0.1,qGaussianPriorVariance=1000,
constraintsFile=\"baseball-hockey.constraints\"" \
--report test:accuracy
The underlying trainer is cc.mallet.classify.MaxEntPRTrainer. New PR constraints and penalties for training MaxEnt models can be defined by implementing cc.mallet.classify.constraints.pr.MaxEntPRConstraint.
Below, we discuss machine-assisted methods for obtaining constraints. Note that these methods do not yet support target ranges.
Rather than specifying the target expectations directly, we may instead specify “labels” for features, and have these converted into target expectations. Suppose we know that the word puck is associated with the label hockey, and the word baseball is associated with the label baseball. We may specify these labeled features in a file (baseball-hockey.labeled_features) as follows.
baseball baseball
puck hockey
The general format for a file with labeled features is:
feature_name label_name label_name ...
Vectors2FeatureConstraints can estimate target expectations from a file with labeled features. A simple heuristic for obtaining expectations from labeled features is to uniformly divide a constant probability mass among the labels for a feature. By default, 0.9 probability is allocated to the labels for a feature. To select this estimation method, pass heuristic to the targets command option.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.labeled_features \
--targets heuristic
The option majority-prob can be used to specify a value other than 0.9. We can use the constraints file baseball-hockey.constraints to perform GE training as above.
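The heuristic can be illustrated with a short sketch. The source states only that the majority mass (0.9 by default) is divided uniformly among the labels assigned to a feature. How the remaining mass is handled is our assumption: here it is spread uniformly over the other labels, and a feature labeled with every label gets a uniform distribution. The function heuristic_targets is hypothetical and is not the MALLET implementation.

```python
def heuristic_targets(feature_labels, all_labels, majority_prob=0.9):
    """Turn the labels assigned to one feature into a target distribution."""
    chosen = [label for label in all_labels if label in set(feature_labels)]
    if not chosen:
        raise ValueError("feature must be assigned at least one known label")
    others = [label for label in all_labels if label not in chosen]
    if not others:
        # Assumption: a feature labeled with every label is uninformative.
        return {label: 1.0 / len(all_labels) for label in all_labels}
    targets = {label: majority_prob / len(chosen) for label in chosen}
    # Assumption: the leftover mass is shared equally by the remaining labels.
    for label in others:
        targets[label] = (1.0 - majority_prob) / len(others)
    return targets


labels = ["baseball", "hockey"]
puck_targets = heuristic_targets(["hockey"], labels)
print(puck_targets)
```

For the labeled feature "puck hockey", this sketch reproduces the hand-written constraint from the start of the tutorial: 0.9 for hockey and 0.1 for baseball.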
We may obtain a set of candidate features for which constraints may be expressed using the Latent Dirichlet Allocation (LDA) based method of Druck, Mann, and McCallum 2008.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--targets none \
--num-constraints 10
The lda-file is a serialized LDA model file. See the topic modeling tutorial for more information. Setting targets to none tells Vectors2FeatureConstraints to output candidate features only. baseball-hockey.features will then contain a list of ten candidate features, one per line.
The above method is unsupervised (i.e., it does not look at the true labels). We can also select candidate features using an “oracle” information gain method (infogain) that looks at the true labels. (Note that when using the true labels to obtain constraints, baseball-hockey.train.vectors, rather than baseball-hockey.unlabeled.vectors, must be used.)
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.features \
--feature-selection infogain \
--targets none \
--num-constraints 10
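As a rough illustration of what information gain measures, the sketch below scores a feature by how much knowing whether it occurs in a document reduces the entropy of the label distribution. The toy documents and function names are made up for this example, and MALLET's infogain implementation may differ in its details.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, feature):
    """Entropy reduction from splitting documents on presence of the feature."""
    with_feature = [y for d, y in zip(docs, labels) if feature in d]
    without_feature = [y for d, y in zip(docs, labels) if feature not in d]
    gain = entropy(labels)
    for subset in (with_feature, without_feature):
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


# Toy labeled documents, each represented as a set of words.
docs = [{"puck", "ice"}, {"puck", "goal"}, {"bat", "pitch"}, {"bat", "ice"}]
labels = ["hockey", "hockey", "baseball", "baseball"]

vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda f: information_gain(docs, labels, f), reverse=True)
print(ranked)
```

In this toy data, puck and bat separate the two classes perfectly and rank first, while ice occurs equally often in both classes and has zero gain.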
Given a set of candidate features, we may estimate constraints using two methods. The first method is to have the machine label the features (by revealing the true labels and using the method of Druck, Mann, and McCallum 2008), and convert these labels into expectations using the same heuristic as above.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets heuristic
Note that if the candidate features are also machine-provided, we may perform both steps at the same time using, for example, the command:
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--feature-selection lda \
--lda-file baseball-hockey.train.lda \
--num-constraints 10 \
--targets heuristic
The second method is to use the exact target expectations computed from the labeled data. The targets option for this is oracle.
java cc.mallet.classify.tui.Vectors2FeatureConstraints \
--input baseball-hockey.train.vectors \
--output baseball-hockey.constraints \
--features-file baseball-hockey.features \
--targets oracle
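Conceptually, an oracle target for a feature is the empirical label distribution over the labeled documents that contain the feature. The sketch below shows that computation on toy data. It assumes document-level presence counts, in line with the "90% of the documents in which the word puck occurs" reading used earlier. Whether Vectors2FeatureConstraints counts documents or feature occurrences is not stated here, and oracle_targets is a hypothetical name.

```python
from collections import Counter


def oracle_targets(docs, labels, feature, all_labels):
    """Empirical label distribution over documents containing the feature."""
    counts = Counter(y for d, y in zip(docs, labels) if feature in d)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"feature {feature!r} does not occur in the data")
    return {label: counts[label] / total for label in all_labels}


# Toy labeled documents, each represented as a set of words.
docs = [{"puck"}, {"puck", "stick"}, {"puck", "bat"}, {"bat"}]
labels = ["hockey", "hockey", "baseball", "baseball"]

puck_oracle = oracle_targets(docs, labels, "puck", ["baseball", "hockey"])
print(puck_oracle)
```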
Note that when using heuristic targets, the machine may discard candidate features in the labeling process (cf. Druck, Mann, and McCallum 2008). However, the machine does not discard any candidate features when using --targets oracle.
A gaussianPriorVariance of 1 is a reasonable default choice. Large values for qGaussianPriorVariance and small values for pGaussianPriorVariance work best.