dpcluster Package
algorithms Module
-
class dpcluster.algorithms.OnlineVDP(distr, w=0.1, k=25, tol=0.001, max_items=100)
Experimental online clustering algorithm.
Parameters: - distr – likelihood-prior distribution pair governing clusters. For now the only option is an instance of dpcluster.distributions.GaussianNIW.
- w – non-negative prior weight. The prior has as much influence as w data points.
- k – maximum number of clusters.
- tol – convergence tolerance.
- max_items – maximum queue length.
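The claim that the prior has as much influence as w data points can be illustrated with a conjugate update of a mean. The sketch below is not taken from dpcluster. It assumes that w acts like the pseudo-count of the prior, and `posterior_mean`, `mu0` and `xbar` are hypothetical names used only for this illustration.

```python
def posterior_mean(mu0, w, xbar, n):
    """Blend a prior mean with a sample mean, weighting the prior like w observations."""
    return (w * mu0 + n * xbar) / (w + n)

# With w = 0.1 the prior barely moves the estimate away from the sample mean.
weak = posterior_mean(mu0=0.0, w=0.1, xbar=1.0, n=10)
# With w = 10 the prior counts as much as the ten observations.
strong = posterior_mean(mu0=0.0, w=10.0, xbar=1.0, n=10)
print(weak, strong)
```

Under this reading, a small w such as the default 0.1 lets the data dominate after only a few points.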
-
get_model()
Get current model.
Returns: instance of dpcluster.algorithms.VDP
-
class dpcluster.algorithms.Predictor(model, ix, iy)
-
distr_fit(*args)
-
precomp(*args)
-
predict(*args)
-
-
class dpcluster.algorithms.VDP(distr, w=0.1, k=50, tol=1e-05, max_iters=10000)
Bases: object
Variational Dirichlet Process clustering algorithm following “Variational Inference for Dirichlet Process Mixtures” by Blei and Jordan (2006).
Parameters: - distr – likelihood-prior distribution pair governing clusters. For now the only option is an instance of dpcluster.distributions.GaussianNIW.
- w – non-negative prior weight. The prior has as much influence as w data points.
- k – maximum number of clusters.
- tol – convergence tolerance.
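In the cited paper the variational distribution uses a truncated stick-breaking representation, which is one way to read a cap on the number of clusters. The sketch below is a standalone illustration, not dpcluster code. Treating k as the truncation level is an assumption, and `stick_breaking_weights` is a hypothetical helper.

```python
def stick_breaking_weights(fractions):
    """Turn k-1 break fractions in [0, 1] into k mixture weights that sum to 1."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a cluster
    for v in fractions:
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last cluster takes whatever is left
    return weights

weights = stick_breaking_weights([0.5, 0.5, 0.5])
print(weights)
```

Because the last weight absorbs the remainder, the weights always sum to one regardless of how many clusters are actually used.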
-
batch_learn(x, verbose=False, sort=True)
Learn clusters from data. This is a batch algorithm that requires all data to be loaded in memory.
Parameters: - x – sufficient statistics of the data to be clustered. Can be obtained from raw data by calling dpcluster.distributions.ConjugatePair.sufficient_stats().
- verbose – print a progress report.
- sort – algorithm optimization. Sort clusters at every step.
Basic usage example:
>>> distr = GaussianNIW(data.shape[2])
>>> x = distr.sufficient_stats(data)
>>> vdp = VDP(distr)
>>> vdp.batch_learn(x)
>>> print vdp.cluster_parameters()
-
conditional_expectation(*args)
-
conditional_ll(x, cond)
Conditional log likelihood.
Parameters: - x – sufficient statistics of data.
- cond – slice representing variables to condition on
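A conditional log likelihood can always be written as a joint log likelihood minus the marginal log likelihood of the conditioning variables. The sketch below shows this identity for a single 2-D Gaussian with a symmetric covariance, conditioning on coordinate 0. It is not the library's implementation, which works on the trained mixture model, and all function names are hypothetical.

```python
import math

def joint_ll(x, mean, cov):
    """Log density of a 2-D Gaussian at x = (x0, x1)."""
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    quad = (cov[1][1] * d0 * d0 - 2.0 * cov[0][1] * d0 * d1
            + cov[0][0] * d1 * d1) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def univariate_ll(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def gaussian_conditional_ll(x, mean, cov):
    """log p(x1 | x0) = log p(x0, x1) - log p(x0)."""
    return joint_ll(x, mean, cov) - univariate_ll(x[0], mean[0], cov[0][0])

x, mean, cov = (0.3, -0.2), (0.0, 1.0), [[2.0, 0.5], [0.5, 1.0]]
print(gaussian_conditional_ll(x, mean, cov))
```

The result agrees with the closed-form conditional Gaussian, whose mean is shifted by cov[0][1] / cov[0][0] times the deviation of x0 and whose variance is reduced by cov[0][1]**2 / cov[0][0].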
-
ll(x, ret_ll_gr_hs=(True, False, False))
Compute the log likelihoods (ll) of data with respect to the trained model.
Parameters: - x – sufficient statistics of the data.
- ret_ll_gr_hs – what to return: likelihood, gradient, hessian. Derivatives taken with respect to data, not sufficient statistics.
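To make the meaning of derivatives with respect to data concrete, the sketch below evaluates the log likelihood, gradient and Hessian of a 1-D Gaussian at a data point. It is a hypothetical stand-in, not the library's ll method. In particular, returning a tuple that holds only the requested items is an assumption about how the three flags might be interpreted.

```python
import math

def gaussian_ll(x, mean, var, ret_ll_gr_hs=(True, False, False)):
    """Return the requested subset of (ll, gradient, hessian) for a 1-D Gaussian.

    Both derivatives are taken with respect to the data point x.
    """
    ll = -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)
    gr = -(x - mean) / var
    hs = -1.0 / var
    values = (ll, gr, hs)
    return tuple(v for v, wanted in zip(values, ret_ll_gr_hs) if wanted)

ll, gr = gaussian_ll(0.5, mean=1.0, var=2.0, ret_ll_gr_hs=(True, True, False))
print(ll, gr)
```

A finite-difference check on ll is a cheap way to confirm that a gradient of this kind is taken with respect to x rather than with respect to the sufficient statistics.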
-
marginal(*args)
-
plot_clusters(**kwargs)
Asks each cluster to plot itself. For multidimensional Gaussian clusters, pass
slc=np.array([i,j])
as an argument to project clusters on the plane defined by the i’th and j’th coordinate.
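Projecting a Gaussian cluster onto two coordinates amounts to keeping the matching entries of its mean and the matching rows and columns of its covariance. The sketch below illustrates that selection with plain lists. How dpcluster performs the projection internally is not documented here, so `project_gaussian` is a hypothetical helper.

```python
def project_gaussian(mean, cov, slc):
    """Keep only the coordinates listed in slc from a Gaussian's mean and covariance."""
    sub_mean = [mean[i] for i in slc]
    sub_cov = [[cov[i][j] for j in slc] for i in slc]
    return sub_mean, sub_cov

mean = [0.0, 1.0, 2.0]
cov = [[1.0, 0.1, 0.2],
       [0.1, 2.0, 0.3],
       [0.2, 0.3, 3.0]]
# Project onto the plane spanned by coordinates 0 and 2.
sub_mean, sub_cov = project_gaussian(mean, cov, [0, 2])
print(sub_mean, sub_cov)
```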
-
pseudo_resp(*args)
-
pseudo_resp_cache(*args)
-
resp(*args)
-
resp_cache(*args)
distributions Module
-
class dpcluster.distributions.ConjugatePair(evidence_distr, prior_distr, prior_param)
Conjugate prior-evidence pair of distributions in the exponential family. Conjugacy means that the posterior has the same form as the prior, with updated parameters.
Parameters: - evidence_distr – Evidence distribution. Must be an instance of
ExponentialFamilyDistribution
- prior_distr – Prior distribution. Must be an instance of
ExponentialFamilyDistribution
- prior_param – Prior parameters.
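Conjugacy is easiest to see on a small pair. The sketch below uses the Beta-Bernoulli pair, which dpcluster does not provide, purely because its update fits in a few lines: the posterior is again a Beta distribution whose parameters are the prior parameters plus observed counts. The function name is hypothetical.

```python
def beta_bernoulli_update(a, b, observations):
    """Update a Beta(a, b) prior with 0/1 observations; the result is again a Beta."""
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures

a_post, b_post = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1])
print(a_post, b_post)
```

The same pattern, prior parameters plus accumulated sufficient statistics, is what makes a conjugate pair convenient for clustering, since a cluster's posterior can be updated without re-fitting.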
-
class dpcluster.distributions.ExponentialFamilyDistribution
Models a distribution in the exponential family of the form:
\(f(x | \nu) = h(x) \exp( \nu \cdot T(x) - A(\nu) )\)
Parameters to be defined in subclasses:
- h is the base measure
- nu (\(\nu\)) are the parameters
- T(x) are the sufficient statistics of the data
- A is the log partition function
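As a check on the four ingredients listed above, the sketch below writes a 1-D Gaussian in exponential family form and compares it with the usual density. It is independent of dpcluster. Placing the constant \((2\pi)^{-1/2}\) in h and the remaining normalization in A is one possible convention and is an assumption here, as are all function names.

```python
import math

def natural_params(mu, var):
    """nu = [mu / var, -1 / (2 var)] for a 1-D Gaussian."""
    return (mu / var, -0.5 / var)

def suff_stats(x):
    """T(x) = [x, x^2]."""
    return (x, x * x)

def log_partition(nu):
    """A(nu) = -nu1^2 / (4 nu2) - log(-2 nu2) / 2."""
    return -nu[0] * nu[0] / (4.0 * nu[1]) - 0.5 * math.log(-2.0 * nu[1])

def ll_expfam(x, nu):
    """log f(x | nu) = log h(x) + nu . T(x) - A(nu), with h(x) = (2 pi)^(-1/2)."""
    t = suff_stats(x)
    log_h = -0.5 * math.log(2.0 * math.pi)
    return log_h + nu[0] * t[0] + nu[1] * t[1] - log_partition(nu)

def ll_direct(x, mu, var):
    """Textbook log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

print(ll_expfam(0.3, natural_params(1.0, 2.0)), ll_direct(0.3, 1.0, 2.0))
```

The two printed values agree up to floating-point rounding, which confirms that the natural parameters and log partition function are consistent with each other.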
-
ll(xs, nus, ret_ll_gr_hs=(True, False, False))
Log likelihood (and derivatives, optionally) of data under distribution.
Parameters: - xs – sufficient statistics of data
- nus – parameters of distribution
-
class dpcluster.distributions.Gaussian(d)
Bases: dpcluster.distributions.ExponentialFamilyDistribution
Multivariate Gaussian distribution with density:
\(f(x | \mu, \Sigma) = |2 \pi \Sigma|^{-1/2} \exp(-(x-\mu)^T \Sigma^{-1} (x - \mu)/2)\)
Natural parameters:
\(\nu = [\Sigma^{-1} \mu, -\Sigma^{-1}/2]\)
Sufficient statistics of data:
\(T(x) = [x, x \cdot x^T]\)
Parameters: d – dimension.
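The sufficient statistics \(T(x) = [x, x \cdot x^T]\) carry everything needed to recover a mean and a covariance once they are summed over the data. The sketch below shows this with plain lists. It does not reproduce the array layout that dpcluster uses for its sufficient statistics, and both function names are hypothetical.

```python
def gaussian_sufficient_stats(x):
    """T(x) = [x, x x^T] for a single data point given as a list of floats."""
    return list(x), [[a * b for b in x] for a in x]

def mean_and_cov(data):
    """Recover the maximum-likelihood mean and covariance from summed statistics."""
    n = float(len(data))
    d = len(data[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for x in data:
        t1, t2 = gaussian_sufficient_stats(x)
        for i in range(d):
            s1[i] += t1[i]
            for j in range(d):
                s2[i][j] += t2[i][j]
    mean = [s / n for s in s1]
    # E[x x^T] - mean mean^T gives the (biased) covariance estimate.
    cov = [[s2[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov

mean, cov = mean_and_cov([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
print(mean, cov)
```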
-
class dpcluster.distributions.GaussianNIW(d)
Bases: dpcluster.distributions.ConjugatePair
Gaussian, Normal-Inverse-Wishart conjugate pair.
The predictive posterior is a multivariate t-distribution.
Parameters: d – dimension
-
conditional(*args)
-
conditional_expectation(*args)
-
conditionals_cache(*args)
-
conditionals_cache_bare(*args)
-
posterior_ll(*args)
-
posterior_ll_cache(*args)
-
-
class dpcluster.distributions.NIW(d)
Bases: dpcluster.distributions.ExponentialFamilyDistribution
Normal Inverse Wishart distribution defined by:
\(f(\mu,\Sigma|\mu_0,\Psi,k,\nu) = \text{Gaussian}(\mu|\mu_0,\Sigma/k) \cdot \text{Inverse-Wishart}(\Sigma|\Psi,\nu-d-2)\)
where \(\mu, \mu_0 \in R^d\), \(\Sigma, \Psi \in R^{d \times d}\), \(k \in R\), and \(\nu \in R\) with \(\nu > 2d+1\).
This is an exponential family conjugate prior for the Gaussian.
Parameters: d – dimension
-
nat2usual(*args)
-