clustviz package¶
Subpackages¶
Submodules¶
clustviz.agglomerative module¶
-
agg_clust(X, linkage, plotting=True)[source]¶ Perform hierarchical agglomerative clustering with the provided linkage method, plotting every step of cluster aggregation.
- Parameters
X (
ndarray) – input data array.linkage (
str) – linkage method; can be single, complete, average or ward.plotting (
bool) – if True, execute plots.
- Return type
None
-
avg_dist(a, b)[source]¶ Distance for average_linkage method, i.e. mean[dist(x, y)] for x in a & y in b.
- Return type
float
-
cl_dist(a, b)[source]¶ Distance for complete_linkage method, i.e. max[dist(x,y)] for x in a & y in b.
- Return type
float
-
compute_var(X, df)[source]¶ Compute total intra-cluster variance of the cluster configuration inferred from df.
- Parameters
X (
ndarray) – input data as array.df (
DataFrame) – input dataframe built by agg_clust, listing the cluster and the x and y coordinates of each point.
- Return type
Tuple[DataFrame,float]- Returns
centroids dataframe with their coordinates and the single variances of the corresponding clusters, and the total intra-cluster variance.
-
compute_var_sing(df, centroids)[source]¶ Compute every internal variance in clusters; clusters are found in df, whereas centroids are saved in centroids.
- Parameters
df (
DataFrame) – input dataframe built by agg_clust, listing the cluster and the x and y coordinates of each point.centroids (
DataFrame) – dataframe of the centroids of clusters, with their x and y coordinates.
- Return type
list- Returns
list of intra-cluster variances.
-
compute_ward_ij(data, df)[source]¶ Compute the difference in total within-cluster variance, using squared Euclidean distance, and find the best cluster according to the Ward criterion.
- Parameters
data (
ndarray) – input data array.df (
DataFrame) – input dataframe built by agg_clust, listing the cluster and the x and y coordinates of each point.
- Return type
Tuple[Tuple,float,float]
- Returns
(i, j): indices of the best cluster (the one for which the increase in intra-cluster variance is minimum); new_summ: new total intra-cluster variance; par_var: increment in total intra-cluster variance, i.e. the minimum increase in total intra-cluster variance.
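The Ward criterion described above can be illustrated with a small, self-contained sketch. It is not the clustviz implementation: the helper names `sse`, `ward_increase` and `best_ward_pair` are hypothetical, clusters are plain lists of coordinate tuples, and the "variance" is taken to be the sum of squared Euclidean distances to the centroid.

```python
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_increase(a, b):
    """Increase in total within-cluster SSE if clusters a and b are merged."""
    return sse(a + b) - sse(a) - sse(b)


def best_ward_pair(clusters):
    """Return ((i, j), increase) for the pair with the smallest increase."""
    pairs = combinations(range(len(clusters)), 2)
    return min(
        (((i, j), ward_increase(clusters[i], clusters[j])) for i, j in pairs),
        key=lambda item: item[1],
    )


clusters = [[(0, 0), (0, 1)], [(5, 5)], [(0, 2)]]
pair, increase = best_ward_pair(clusters)
```

Here the pair with the smallest increase is the first and third cluster, which are the two closest groups of points.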
-
dist_mat(df, linkage)[source]¶ Take as input the dataframe created by agg_clust and output the distance matrix. It is actually an upper triangular matrix: the symmetrical values are replaced with np.inf.
- Parameters
df (
DataFrame) – input dataframe, with the first column corresponding to x-coordinates and the second column corresponding to y-coordinates of data points.linkage (
str) – linkage method; can be single, complete, average.
- Return type
DataFrame- Returns
distance matrix.
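A minimal sketch of such an upper triangular distance matrix follows. It uses plain lists and `math.inf` instead of a DataFrame and `np.inf`, the function names are hypothetical, and placing infinity on the diagonal as well is an assumption, not something the description above confirms.

```python
import math


def upper_triangular_distances(points):
    """Pairwise Euclidean distances; entries on or below the diagonal are inf."""
    n = len(points)
    mat = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mat[i][j] = math.dist(points[i], points[j])
    return mat


def closest_pair(mat):
    """Indices (i, j) of the smallest entry, i.e. the next pair to merge."""
    n = len(mat)
    return min(
        ((i, j) for i in range(n) for j in range(n)),
        key=lambda ij: mat[ij[0]][ij[1]],
    )


mat = upper_triangular_distances([(0, 0), (3, 4), (0, 1)])
```

Filling the redundant half with infinity means a plain minimum search over the whole matrix never picks a point paired with itself or a duplicate of an already considered pair.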
-
dist_mat_gen(df)[source]¶ Variation of dist_mat, uses only single_linkage method.
- Return type
DataFrame
-
point_plot_mod(X, distance_matrix, level_txt, level2_txt=None)[source]¶ Scatter plot of data points, colored according to the cluster they belong to. The most recently merged cluster is enclosed in a rectangle of the same color as its points, with red borders. In the top right corner, the total distance is shown, along with the current number of clusters. When using Ward linkage, the increment in distance is also shown.
- Parameters
X (
ndarray) – input data as array.distance_matrix (
DataFrame) – distance matrix built by agg_clust.level_txt (
float) – dist_tot displayed.level2_txt (
Optional[float]) – dist_incr displayed.
- Return type
None
-
sl_dist(a, b)[source]¶ Distance for single_linkage method, i.e. min[dist(x,y)] for x in a & y in b.
- Return type
float
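The three linkage distances documented in this module (avg_dist, cl_dist and sl_dist) differ only in how they aggregate the point-to-point distances between two clusters. The sketch below shows that idea with hypothetical function names and clusters given as lists of coordinate tuples; Euclidean distance is assumed.

```python
import math
from itertools import product


def pairwise(a, b):
    """All Euclidean distances between points of cluster a and cluster b."""
    return [math.dist(x, y) for x, y in product(a, b)]


def single_linkage(a, b):
    """min[dist(x, y)] for x in a and y in b."""
    return min(pairwise(a, b))


def complete_linkage(a, b):
    """max[dist(x, y)] for x in a and y in b."""
    return max(pairwise(a, b))


def average_linkage(a, b):
    """mean[dist(x, y)] for x in a and y in b."""
    d = pairwise(a, b)
    return sum(d) / len(d)


a = [(0, 0), (0, 1)]
b = [(0, 3), (0, 5)]
```

For these two clusters the pairwise distances are 3, 5, 2 and 4, so the single, complete and average linkage distances are 2, 5 and 3.5 respectively.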
-
update_mat(mat, i, j, linkage)[source]¶ Update the input distance matrix in the position (i, j), according to the provided linkage method.
- Parameters
mat (
DataFrame) – distance dataframe.i (
int) – row index.j (
int) – column index.linkage (
str) – linkage method; can be single, complete, average.
- Return type
DataFrame- Returns
updated distance dataframe.
clustviz.birch module¶
-
class
birch(data, number_clusters, branching_factor=50, max_node_entries=200, diameter=0.5, type_measurement=<measurement_type.CENTROID_EUCLIDEAN_DISTANCE: 0>, entry_size_limit=500, diameter_multiplier=1.5, ccore=True)[source]¶ Bases:
pyclustering.cluster.birch.birch
-
class
cftree(branch_factor, max_entries, threshold, type_measurement=<measurement_type.CENTROID_EUCLIDEAN_DISTANCE: 0>)[source]¶ Bases:
pyclustering.container.cftree.cftree
-
plot_birch_leaves(tree, data)[source]¶ Scatter plot of data points, with colors according to the leaf they belong to. Points in the same entry in a leaf are represented by a cross, with the number of points over it.
- Parameters
tree – tree built during BIRCH algorithm execution.
data – input data as array of list of list
-
plot_tree_fin(tree, info=False)[source]¶ Plot the final CFtree built by BIRCH. Leaves are colored, and every node displays the total number of elements in its child nodes.
- Parameters
tree – tree built during BIRCH algorithm execution.
info – if True, tree height, number of nodes, leaves and entries are printed.
clustviz.clara module¶
-
class
ClaraClustering(max_iter=100000)[source]¶ Bases:
object
The CLARA clustering algorithm. Basically an iterative guessing version of k-medoids that makes things a lot faster for bigger data sets.
-
average_cost(_df, _fn, _cur_choice)[source]¶ A function to compute the average cost.
- Parameters
_df (
DataFrame) – The input data frame._fn (
str) – The distance function._cur_choice (
list) – The current medoid candidates.
- Returns
The average cost, the new medoids.
-
cheat_at_sampling(_df, _k, _fn, _nsamp)[source]¶ A function to cheat at sampling for speed ups.
- Parameters
_df (
DataFrame) – The input dataframe._k (
int) – The number of medoids._fn (
str) – The distance function._nsamp (
int) – The number of samples.
- Return type
Tuple[float,list]- Returns
The best score, the medoids.
-
clara(_df, _k, _fn)[source]¶ The main clara clustering iterative algorithm.
- Parameters
_df (
DataFrame) – Input dataframe._k (
int) – Number of medoids._fn (
str) – The distance function to use.
- Return type
Tuple[float,list,Union[dict,Dict[Any,list]]]- Returns
The minimized cost, the best medoid choices and the final configuration.
-
compute_cost(_df, _fn, _cur_choice, cache_on=False)[source]¶ A function to compute the configuration cost.
- Parameters
_df (DataFrame) – The input dataframe.
_fn (str) – The distance function.
_cur_choice (list) – The current set of medoid choices.
cache_on (bool) – Binary flag to turn caching on.
- Return type
Tuple[float,Dict[Any,list]]- Returns
The total configuration cost, the medoids.
-
static
cosine_distance(v1, v2)[source]¶ Function for cosine distance.
- Parameters
v1 (
Iterable) – The first vector.v2 (
Iterable) – The second vector.
- Return type
float- Returns
The cosine distance between v1 and v2.
-
static
euclidean_distance(v1, v2)[source]¶ Slow function for computing euclidean distance.
- Parameters
v1 (
Iterable) – The first vector.v2 (
Iterable) – The second vector.
- Return type
float- Returns
The euclidean distance between v1 and v2.
-
static
fast_euclidean(v1, v2)[source]¶ Faster function for euclidean distance.
- Parameters
v1 (
ndarray) – The first vector.v2 (
ndarray) – The second vector.
- Return type
float- Returns
The euclidean distance between v1 and v2.
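For reference, the two kinds of distance documented above can be sketched in pure Python as follows. This is an illustration under assumptions, not the library code: cosine distance is taken to be one minus the cosine similarity, and inputs are assumed to be non-zero sequences of equal length.

```python
import math


def euclidean_distance(v1, v2):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def cosine_distance(v1, v2):
    """One minus the cosine of the angle between v1 and v2 (assumed form)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / (norm1 * norm2)
```

Orthogonal vectors such as (1, 0) and (0, 1) have cosine distance 1, while parallel vectors have a cosine distance of approximately 0 regardless of their lengths.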
-
k_medoids(_df, _k, _fn, _niter)[source]¶ The original k-medoids algorithm.
- Parameters
_df (
DataFrame) – Input data frame._k (
int) – Number of medoids._fn (
str) – The distance function to use._niter (
int) – The number of iterations.
- Return type
Tuple[float,list,Union[dict,Dict[Any,list]]]
- Returns
Cost of configuration, the medoids (list) and the clusters (dictionary).
Pseudo-code for the k-medoids algorithm:
1. Sample k of the n data points as the medoids.
2. Associate each data point to the closest medoid.
3. While the cost of the data point space configuration is decreasing:
   - For each medoid m and each non-medoid point o:
     – Swap m and o, recompute cost.
     – If global cost increased, swap back.
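The pseudo-code can be turned into a short runnable sketch. The version below is only an illustration: it works on a list of coordinate tuples rather than a DataFrame, always uses Euclidean distance, and the names `total_cost` and `k_medoids_sketch` as well as the `seed` argument are hypothetical.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def k_medoids_sketch(points, k, seed=0):
    """Swap-based k-medoids; returns (cost, medoid indices, clusters)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # step 1: random medoids
    cost = total_cost(points, medoids)           # step 2 is implicit in the cost
    improved = True
    while improved:                              # step 3: loop while cost drops
        improved = False
        for pos in range(k):
            for o in range(len(points)):
                if o in medoids:
                    continue
                candidate = medoids[:pos] + [o] + medoids[pos + 1:]
                new_cost = total_cost(points, candidate)
                if new_cost < cost:              # keep the swap
                    medoids, cost = candidate, new_cost
                    improved = True
                # otherwise the candidate is dropped, i.e. "swap back"
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        closest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[closest].append(i)
    return cost, medoids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cost, medoids, clusters = k_medoids_sketch(points, k=2)
```

Because a swap is only kept when it strictly lowers the cost, the loop terminates. On this toy data it ends with the corner point of each group as medoid.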
-
-
plot_pam_mod(data, cl, full, equal_axis_scale=False)[source]¶ Scatterplot of data points, with colors according to cluster labels. Only sampled points are plotted, the others are only displayed with their indexes; moreover, centers of mass of the clusters are marked with an X.
- Parameters
data (DataFrame) – input data sample.
cl (dict) – cluster dictionary.
full (DataFrame) – full input dataframe.
equal_axis_scale (bool) – if True, axes are plotted with the same scaling.
- Return type
None
clustviz.clarans module¶
-
class
clarans(data, number_clusters, numlocal, maxneighbor)[source]¶ Bases:
pyclustering.cluster.clarans.clarans
-
compute_cost_clarans(data, _cur_choice)[source]¶ A function to compute the configuration cost. (modified from that of CLARA)
- Parameters
data (
DataFrame) – The input dataframe._cur_choice (
list) – The current set of medoid choices.
- Return type
Tuple[float,Dict[Any,list]]- Returns
The total configuration cost, the medoids.
clustviz.cure module¶
-
assignment_phase_large_cure(X, CURE_df, diz, initial_ind, last_reps, not_sampled, not_sampled_ind, n_rep_fin, xwidth)[source]¶ In the last phase of the CURE algorithm variation for large datasets, arrows are displayed from every point that was not sampled to its closest representative point; moreover, representative points are surrounded by small circles, to make them more visible. Representative points of different clusters are plotted in different shades of red.
- Parameters
X – input data array.
diz – indexes of data points, to take shuffling into account.
CURE_df – input dataframe built by CURE algorithm, listing the cluster and the x and y coordinates of each point.
initial_ind – initial partial index.
last_reps – dictionary of last representative points.
not_sampled – coordinates of points that have not been initially sampled, in the large dataset version.
not_sampled_ind – indexes of not_sampled points.
n_rep_fin – number of representatives to use for each cluster in the final assignment phase in the large dataset version.
xwidth – plot width.
-
cure(X, k, c=3, alpha=0.1, plotting=True, preprocessed_data=None, partial_index=None, n_rep_finalclust=None, not_sampled=None, not_sampled_ind=None)[source]¶ CURE algorithm: hierarchical agglomerative clustering using representatives. The parameters which default to None are used for the large dataset variation of CURE.
- Parameters
X (
ndarray) – input data array.k (
int) – desired number of clusters.c (
int) – number of representatives for each cluster.alpha (
float) – parameter that regulates the shrinking of representative points toward the centroid.plotting (
bool) – if True, plots all intermediate steps.preprocessed_data – if not None, must be of the form (clusters,representatives,matrix_a,X_dist1), which is used to perform a warm start.
partial_index – if not None, it is used as index of the matrix_a, of cluster points and of representatives.
n_rep_finalclust – the final representative points used to classify the not_sampled points.
not_sampled – points not sampled in the initial phase.
not_sampled_ind – indexes of not_sampled points.
- Returns
the clusters dictionary, the dictionary of representatives, and the matrix a.
-
cure_sample_part(X, k, c=3, alpha=0.3, u_min=None, f=0.3, d=0.02, p=None, q=None, n_rep_finalclust=None, plotting=True)[source]¶ CURE algorithm variation for large datasets. Partition the sample space into p partitions, each of size len(X)/p, then partially cluster each partition until the final number of clusters in each partition reduces to n/(pq). Then run a second clustering pass on the n/q partial clusters for all the partitions.
- Parameters
X (ndarray) – input data array.
k (int) – desired number of clusters.
c (int) – number of representatives for each cluster.
alpha (float) – parameter that regulates the shrinking of representative points toward the centroid.
u_min (Optional[int]) – size of the smallest cluster u.
f (float) – fraction of cluster points (0 <= f <= 1) we would like to have in the sample.
d (float) – (0 <= d <= 1) the probability that the sample contains fewer than f*|u| points of cluster u is less than d.
p (Optional[int]) – the number of partitions.
q (Optional[int]) – the number >1 such that each partition reduces to n/(pq) clusters.
n_rep_finalclust (Optional[int]) – number of representatives to use in the final assignment phase.
plotting (bool) – if True, plots all intermediate steps.
- Returns
the clusters dictionary, the dictionary of representatives, and the matrix a.
-
dist_clust_cure(rep_u, rep_v)[source]¶ Compute the distance of two clusters based on the minimum distance found between the representatives of one cluster and the ones of the other.
- Parameters
rep_u (
list) – representatives of the first cluster.rep_v (
list) – representatives of the second cluster.
- Return type
float- Returns
distance between two clusters.
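The cluster distance described above reduces to a minimum over all pairs of representatives. A minimal sketch, assuming Euclidean distance and representatives given as coordinate tuples (the function name is hypothetical):

```python
import math


def min_representative_distance(rep_u, rep_v):
    """Minimum Euclidean distance between any representative of u and of v."""
    return min(math.dist(p, q) for p in rep_u for q in rep_v)


rep_u = [(0, 0), (1, 0)]
rep_v = [(4, 0), (9, 9)]
```

For these representatives the closest pair is (1, 0) and (4, 0), so the distance between the two clusters is 3.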
-
dist_mat_gen_cure(reps)[source]¶ Build distance matrix for CURE algorithm, using the dictionary of representatives.
- Parameters
reps (dict) – dictionary of representative points, the only ones used to compute distances between clusters.
- Return type
DataFrame
- Returns
distance matrix as dataframe.
-
form_new_cluster(clusters, u, u_cl)[source]¶ Form a new cluster from the input ones.
- Parameters
clusters (
dict) – existing clusters.u (
str) – first cluster.u_cl (
str) – second cluster.
- Return type
list- Returns
new cluster obtained by merging the first and second cluster.
-
plot_results_cure(clust)[source]¶ Scatter plot of data points, colored according to the cluster they belong to, after performing CURE algorithm.
- Parameters
clust (dict) – output of CURE algorithm, dictionary of the form cluster_labels+point_indices: coords of points.
- Return type
None
-
point_plot_mod2(X, CURE_df, reps, level_txt, level2_txt=None, par_index=None, u=None, u_cl=None, initial_ind=None, last_reps=None, not_sampled=None, not_sampled_ind=None, n_rep_fin=None)[source]¶ Scatter plot of input data points, colored according to the cluster they belong to. A rectangle with red borders is displayed around the last merged cluster; representative points of the last merged cluster are also plotted in red, along with the center of mass, plotted as a red cross. The current number of clusters and the current distance are also displayed in the upper right corner.
- Parameters
X (
ndarray) – input data array.CURE_df (
DataFrame) – input dataframe built by CURE algorithm, listing the cluster and the x and y coordinates of each point.reps (
list) – list of the coordinates of representative points.level_txt (
float) – distance at which current merging occurs displayed in the upper right corner.level2_txt (
Optional[float]) – incremental distance (not used).par_index – partial index to take the shuffling of indexes into account.
u – first cluster to be merged.
u_cl – second cluster to be merged.
initial_ind – initial partial index.
last_reps (
Optional[dict]) – dictionary of last representative points.not_sampled – coordinates of points that have not been initially sampled, in the large dataset version.
not_sampled_ind – indexes of not_sampled points.
n_rep_fin – number of representatives to use for each cluster in the final assignment phase in the large dataset version.
- Returns
if par_index is not None, returns the new indexes of par_index.
-
sel_rep(clusters, name, c, alpha)[source]¶ Select c representatives of the cluster: the first one is the farthest from the centroid, the other c-1 are the farthest from the already selected representatives. It doesn’t use the old representatives, so it is slower than sel_rep_fast.
- Parameters
clusters (
dict) – dictionary of clusters.name (
str) – name of the cluster we want to select representatives from.c (
int) – number of representatives we want to extract.alpha (
float) – 0<=float<=1, it determines how much the representative points are moved toward the centroid: 0 means they aren’t modified, 1 means that all points collapse to the centroid.
- Return type
list- Returns
list of representative points.
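The selection and shrinking of representatives can be sketched as below. This is a hedged illustration, not the clustviz code: the function name is hypothetical, the cluster is a plain list of coordinate tuples, and "farthest from the already selected representatives" is interpreted as maximizing the minimum distance to them.

```python
import math


def select_representatives(points, c, alpha):
    """Pick up to c scattered points, then shrink them toward the centroid."""
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    # First representative: the point farthest from the centroid.
    chosen = [max(points, key=lambda p: math.dist(p, centroid))]
    # Remaining ones: farthest from the representatives already chosen.
    while len(chosen) < min(c, len(points)):
        remaining = [p for p in points if p not in chosen]
        if not remaining:  # only duplicates of chosen points are left
            break
        chosen.append(
            max(remaining, key=lambda p: min(math.dist(p, r) for r in chosen))
        )
    # alpha = 0 leaves the points unchanged, alpha = 1 collapses them
    # onto the centroid.
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, centroid)) for p in chosen
    ]


cluster = [(0, 0), (4, 0), (0, 2), (1, 1)]
unshrunk = select_representatives(cluster, c=2, alpha=0.0)
collapsed = select_representatives(cluster, c=2, alpha=1.0)
```

With `alpha=0.0` the two representatives are the original points (4, 0) and (0, 2); with `alpha=1.0` both coincide with the centroid (1.25, 0.75).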
-
sel_rep_fast(prec_reps, clusters, name, c, alpha)[source]¶ Select c representatives of the clusters from the previously computed representatives, so it is faster than sel_rep.
- Parameters
prec_reps (
list) – list of previously computed representatives.clusters (
dict) – dictionary of clusters.name (
str) – name of the cluster we want to select representatives from.c (
int) – number of representatives we want to extract.alpha (
float) – 0<=float<=1, it determines how much the representative points are moved toward the centroid: 0 means they aren’t modified, 1 means that all points collapse to the centroid.
- Return type
list- Returns
list of representative points.
-
update_mat_cure(mat, i, j, rep_new, name)[source]¶ Update distance matrix of CURE, by computing the new distances from the new representatives.
- Parameters
mat (
DataFrame) – input dataframe built by CURE algorithm, listing the cluster and the x and y coordinates of each point.i (
int) – row index of cluster to be merged.j (
int) – column index of cluster to be merged.rep_new (
dict) – dictionary of new representatives.name (
str) – string of the form “(” + u + “)-(” + u_cl + “)”, containing the new name of the newly merged cluster.
- Return type
DataFrame- Returns
updated matrix with new distances
clustviz.dbscan module¶
-
DBSCAN(data, eps, minPTS, plotting=False, print_details=False)[source]¶ DBSCAN algorithm.
- Parameters
data (ndarray) – input array.
eps (float) – radius of a point within which to search for minPTS points.
minPTS (int) – minimum number of neighbors for a point to be considered a core point.
plotting (bool) – if True, executes point_plot_mod, plotting every time a point is added to a cluster.
print_details (bool) – if True, prints the length of the “external” NearestNeighborhood and of the “internal” one (in the while loop).
- Return type
Dict[str,int]- Returns
dictionary of the form point_index:cluster_label.
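To make the roles of eps and minPTS concrete, here is a compact DBSCAN sketch. It is not the clustviz implementation: it uses integer point indices as dictionary keys, labels noise as -1, counts a point as its own neighbor, and performs no plotting. All of these choices are assumptions made for the example.

```python
import math


def dbscan_sketch(points, eps, min_pts):
    """Return {point_index: cluster_label}, with -1 marking noise."""
    NOISE, UNSEEN = -1, None
    labels = {i: UNSEEN for i in range(len(points))}

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: expand
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan_sketch(points, eps=1.5, min_pts=3)
```

The two tight groups become clusters 0 and 1, and the isolated point (50, 50) is labeled as noise.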
-
plot_clust_DB(X, ClustDict, eps, circle_class=None, noise_circle=True)[source]¶ Scatter plot of the data points, colored according to the cluster they belong to; circle_class plots circles around some or all points, with a radius of eps; if noise_circle is True, circles are also plotted around noise points.
- Parameters
X (
ndarray) – input array.ClustDict (
Dict[str,int]) – dictionary of the form point_index:cluster_label, built by DBSCAN.eps (
float) – radius of the circles to plot around the points.circle_class (
Optional[str]) – if == “all”, plots circles around every non-noise point, else plots circles only around points belonging to certain clusters, e.g. circle_class = [1,2] will plot circles around points belonging to clusters 1 and 2.noise_circle (
bool) – if True, plots circles around noise points
- Return type
None
-
point_plot_mod(X, X_dict, point, eps, ClustDict)[source]¶ Plot a scatter plot of points, where the point (x,y) is light black and surrounded by a red circle of radius eps, already processed points are plotted according to ClustDict and without edgecolor, and still-to-process points are green with black edgecolor.
- Parameters
X (
ndarray) – input array.X_dict (
Dict[str,ndarray]) – input dictionary version of X.point – coordinates of the point that is currently inspected.
eps (
float) – radius of the circle to plot around the point (x,y).ClustDict (
dict) – dictionary of the form point_index:cluster_label, built by DBSCAN
- Return type
None
-
scan_neigh1_mod(data, point, eps)[source]¶ Neighborhood search for a point of a given dataset-dictionary (data) with a fixed eps; unlike scan_neigh1 of OPTICS, it also returns the point itself.
- Parameters
data (
Dict[str,ndarray]) – input dataset.point (
ndarray) – point whose neighborhood is to be examined.eps (
float) – radius of search.
- Return type
Dict[str,ndarray]- Returns
neighborhood points.
clustviz.denclue module¶
-
CubeInfo¶ Info of a cube (rectangle), made up of the number of points contained in the cube, the linear sum of the x and y coordinates of the points in the cube, and the coordinates of the points themselves. For example
{'num_points': 2, 'linear_sum': np.array([3, 4]), 'points_coords': np.array([[1, 1], [2, 3]])}.
alias of Dict[str, Union[int, List[float], List[list]]]
-
Cubes¶ Cubes (rectangles), having as keys the
(row, column) tuple, and as values CubeInfo.
alias of Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]
-
CubesCoords¶ Coordinates of the cubes (rectangles), having as keys the (row, column) tuple, and as values the minimum x, the minimum y, the maximum x and maximum y of the cube. For example
{ (0, 0): (-0.05, -0.05, 1.95, 1.95), (1, 0): (1.95, -0.05, 3.95, 1.95), ... }.
alias of Dict[Tuple[int, int], Tuple[float, float, float, float]]
-
DENCLUE(data, s, xi=3, xi_c=3, tol=2, dist='euclidean', prec=20, plotting=True)[source]¶ Execute the DENCLUE algorithm, whose basic idea is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors.
- Parameters
data (ndarray) – input dataset.
s (float) – sigma, determines the influence of a point in its neighborhood.
xi (float) – xi, determines whether a density attractor is significant.
xi_c (float) – xi/2d, where d=2 is the dimension of input data.
tol (float) – tolerance for determining if two density attractors coincide.
dist (str) – distance to use.
prec (int) – precision used to compute density function.
plotting (bool) – if True, show plots.
- Return type
list- Returns
list of cluster labels.
-
FindPoint(x1, y1, x2, y2, x, y)[source]¶ Check if the point
(x, y) is inside the rectangle determined by x1, y1, x2, y2.
- Parameters
x1 (float) – minimum x coordinate of the rectangle vertices.
y1 (float) – minimum y coordinate of the rectangle vertices.
x2 (float) – maximum x coordinate of the rectangle vertices.
y2 (float) – maximum y coordinate of the rectangle vertices.
x (float) – x coordinate of the point to be examined.
y (float) – y coordinate of the point to be examined.
- Return type
bool
- Returns
True if the point (x, y) lies inside the rectangle, False otherwise.
-
FindRect(point, coord_dict)[source]¶ Find the key of the cube (rectangle) containing the point (if any).
- Parameters
point (
ndarray) – point whose cube (rectangle) is to be found.coord_dict (
Dict[Tuple[int,int],Tuple[float,float,float,float]]) – dictionary of the rectangles’ coordinates.
- Return type
Optional[Tuple[int,int]]
- Returns
key of the cube, e.g.
(1, 3), containing the point; if it does not exist, return None.
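The two lookups above (point in rectangle, then rectangle containing a point) can be sketched as follows. The function names are hypothetical, and treating the rectangle borders as outside (strict inequalities) is an assumption; the library may handle border points differently.

```python
def point_in_rectangle(x1, y1, x2, y2, x, y):
    """True if (x, y) lies strictly inside the rectangle (assumed strict)."""
    return x1 < x < x2 and y1 < y < y2


def find_rectangle(point, coord_dict):
    """Key of the first rectangle containing the point, or None."""
    for key, (x1, y1, x2, y2) in coord_dict.items():
        if point_in_rectangle(x1, y1, x2, y2, point[0], point[1]):
            return key
    return None


# Rectangle coordinates in the same layout as the CubesCoords example:
# (row, column) -> (min x, min y, max x, max y).
coords = {
    (0, 0): (-0.05, -0.05, 1.95, 1.95),
    (1, 0): (1.95, -0.05, 3.95, 1.95),
}
```

With these coordinates the point (2.5, 1.0) falls into cube (1, 0), while (10, 10) is in no cube and yields None.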
-
assign_cluster(data, others, attractor, clust_dict, processed)[source]¶ Assign a density attractor to (a) point(s) or mark it/them as outlier(s).
- Parameters
data (
ndarray) – input dataset.others (
ndarray) – coordinates of the point(s) whose clusters have to be assigned.attractor (
Optional[Tuple[ndarray,bool]]) – coordinates of the point and flag to indicate if it is an outlier, i.e. if the density attractor is significant.clust_dict (
Dict[int,ndarray]) – dictionary of points with the coordinates of their density attractor.processed (
List[int]) – points that have been processed.
- Return type
Tuple[Dict[int,ndarray],List[int]]- Returns
dictionary of clusters, i.e. points with their density attractor, and list of processed points.
-
check_border_points_rectangles(data, populated_cubes)[source]¶ Check if any of the points lie on the borders of the cubes, by checking if the sum of the number of points contained in each populated cube is equal to the total number of points of the dataset.
- Parameters
data (
ndarray) – input dataset.populated_cubes (
Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]) – populated cubes.
- Return type
None
-
check_connection(cube1, cube2, s, dist='euclidean')[source]¶ Check if two cubes are connected (the distance between their centers of mass is not greater than
4*s).
- Parameters
cube1 (Dict[str,Union[int,List[float],List[list]]]) – first cube.
cube2 (Dict[str,Union[int,List[float],List[list]]]) – second cube.
s (float) – sigma, determines the influence of a point in its neighborhood.
dist (str) – distance to use.
- Return type
bool
- Returns
True if the cubes are connected, False otherwise.
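A small sketch of the connection test, assuming the CubeInfo layout described earlier (keys `num_points` and `linear_sum`) and Euclidean distance. The helper names are hypothetical, and the center of mass is assumed to be the linear sum divided by the number of points.

```python
import math


def center_of_mass(cube):
    """Linear sum of the coordinates divided by the number of points."""
    return [total / cube["num_points"] for total in cube["linear_sum"]]


def cubes_connected(cube1, cube2, s):
    """True if the centers of mass are at most 4*s apart."""
    return math.dist(center_of_mass(cube1), center_of_mass(cube2)) <= 4 * s


cube_a = {"num_points": 2, "linear_sum": [3, 4]}      # center of mass (1.5, 2.0)
cube_b = {"num_points": 1, "linear_sum": [4.5, 6.0]}  # center of mass (4.5, 6.0)
```

The two centers of mass are 5 units apart, so the cubes count as connected for s = 1.25 (threshold 5) but not for s = 1 (threshold 4).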
-
density_attractor(data, x, coord_dict, tot_cubes, s, xi, delta=0.05, max_iter=100, dist='euclidean')[source]¶ Find the density attractor for point x with a hill-climbing procedure. To speed up computations, during the procedure, store all the points y such that dist(x, y) <= s/2: they will belong to the same cluster as x.
- Parameters
data (ndarray) – input dataset.
x (ndarray) – point whose density attractor is to be found.
coord_dict (Dict[Tuple[int,int],Tuple[float,float,float,float]]) – dictionary of the rectangles’ coordinates.
tot_cubes (Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]) – the final cubes (highly populated + connected).
s (float) – sigma, determines the influence of a point in its neighborhood.
xi (float) – xi, determines whether a density attractor is significant.
delta (float) – delta of gradient descent.
max_iter (int) – maximum number of iterations for finding a density attractor.
dist (str) – distance to use.
- Return type
Union[Tuple[Tuple[float,bool],ndarray],Tuple[None,None]]- Returns
the coordinates of the density attractor, a flag to indicate its significance, and the list of the coordinates of the point(s) attracted by that density attractor.
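The hill-climbing idea can be sketched without the cube machinery. The code below is a simplified illustration, not the library routine: it sums the Gaussian influence over the whole dataset instead of only nearby cubes, uses the standard DENCLUE Gaussian kernel exp(-d²/(2s²)) as an assumption, and does not collect the points within s/2 of the path. All function names are hypothetical.

```python
import math


def gaussian_density_sketch(x, data, s):
    """Sum of Gaussian influences of all data points at x."""
    return sum(math.exp(-math.dist(x, p) ** 2 / (2 * s ** 2)) for p in data)


def density_gradient_sketch(x, data, s):
    """Gradient of the Gaussian density: sum of (p - x) * influence(x, p)."""
    grad = [0.0] * len(x)
    for p in data:
        w = math.exp(-math.dist(x, p) ** 2 / (2 * s ** 2))
        for d in range(len(x)):
            grad[d] += (p[d] - x[d]) * w
    return grad


def hill_climb(x, data, s, xi, delta=0.05, max_iter=100):
    """Follow the normalized gradient in steps of delta until density drops."""
    current = list(x)
    for _ in range(max_iter):
        grad = density_gradient_sketch(current, data, s)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0:
            break
        step = [c + delta * g / norm for c, g in zip(current, grad)]
        if gaussian_density_sketch(step, data, s) < gaussian_density_sketch(current, data, s):
            break  # the last step overshot the local maximum
        current = step
    density = gaussian_density_sketch(current, data, s)
    return current, density >= xi  # attractor and its significance flag


data = [(0, 0), (0.2, 0), (0, 0.2), (0.2, 0.2)]
attractor, significant = hill_climb((1.0, 1.0), data, s=1.0, xi=3.0)
```

Starting from (1, 1), the climb ends within roughly one step length of the center of the four points, where the density is close to 4 and therefore above the threshold xi = 3.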
-
extract_cluster_labels(data, cld, tol=2)[source]¶ Extract the labels from the dictionary of points with the coordinates of their density attractors.
- Parameters
data (
ndarray) – input dataset.cld (
Dict[int,ndarray]) – dictionary of points with the coordinates of their density attractors.tol (
float) – tolerance to merge points with the same density attractors.
- Return type
DataFrame- Returns
dataframe of points with cluster labels and coordinates of density attractors.
-
find_connected_cubes(hp_cubes, cubes, s, dist='euclidean')[source]¶ Return connected cubes, i.e. cubes whose center of mass lies at a distance of less than
4*s from the center of mass of a highly populated cube.
- Parameters
hp_cubes (
Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]) – highly populated cubes.cubes (
Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]) – cubes.s (
float) – sigma, determines the influence of a point in its neighborhood.dist (
str) – distance to use.
- Return type
Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]- Returns
new dictionary of cubes.
-
form_populated_cubes(a, b, c, d, data)[source]¶ For the given input cube (rectangle), compute how many points of the input dataset lie in it, store their coordinates and compute the sum of their
x and y coordinates.
- Parameters
a (float) – minimum x coordinate of the rectangle.
b (float) – minimum y coordinate of the rectangle.
c (float) – maximum x coordinate of the rectangle.
d (float) – maximum y coordinate of the rectangle.
data (ndarray) – input dataset.
- Return type
Dict[str,Union[int,List[float],List[list]]]
- Returns
dictionary of number of points lying in the cube, the linear sum of their
x and y coordinates, their coordinates.
-
gaussian_density(x, D, s, dist='euclidean')[source]¶ Compute the Gaussian density of a point with respect to a dataset.
- Parameters
x (
ndarray) – point whose density is to be computed.D (
ndarray) – dataset.s (
float) – standard deviation of the Gaussian.dist (
str) – distance to use in the Gaussian.
- Return type
float
- Returns
Gaussian density at point
x with respect to dataset D, using a Gaussian function with distance dist and standard deviation s.
-
gaussian_influence(x, y, s, dist='euclidean')[source]¶ Return the value of the Gaussian influence function in
(x, y) with standard deviation s.
- Parameters
x (
ndarray) – first point.y (
ndarray) – second point.s (
float) – standard deviation of the Gaussian.dist (
str) – distance to use in the Gaussian.
- Return type
float
- Returns
value for the Gaussian in
(x, y) with standard deviation s.
-
gradient_gaussian_density(x, D, s, dist='euclidean')[source]¶ Compute the gradient of the Gaussian density function, used to find the density-attractors, at a point.
- Parameters
x (
ndarray) – point.D (
ndarray) – dataset.s (
float) – standard deviation of the Gaussian.dist (
str) – distance to use in the Gaussian.
- Return type
ndarray
- Returns
gradient of the Gaussian density at point
x with respect to dataset D, using a Gaussian function with distance dist and standard deviation s.
-
highly_pop_cubes(pop_cub, xi_c)[source]¶ Find highly populated cubes, i.e. cubes containing at least
xi_c points.
- Parameters
pop_cub (Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]) – populated cubes.
xi_c (float) – xi_c = xi/2d, where xi determines whether a density attractor is significant.
- Return type
Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]
- Returns
highly populated cubes.
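Selecting the highly populated cubes amounts to a filter on the point count. A minimal sketch, assuming cubes are stored in the Cubes layout described earlier, with a `num_points` entry per cube (the function name is hypothetical):

```python
def filter_highly_populated(pop_cubes, xi_c):
    """Keep only the cubes that contain at least xi_c points."""
    return {
        key: info for key, info in pop_cubes.items() if info["num_points"] >= xi_c
    }


pop_cubes = {
    (0, 0): {"num_points": 5},
    (0, 1): {"num_points": 1},
    (1, 1): {"num_points": 3},
}
dense = filter_highly_populated(pop_cubes, xi_c=3)
```

With a threshold of 3, cubes (0, 0) and (1, 1) are kept and cube (0, 1) is dropped.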
-
near_with_cube(x, cube_x, tot_cubes, s)[source]¶ Find points of cubes that are connected with cube_x and whose center of mass’ distance from
x is less than or equal to 4*s. The point itself is included.
- Parameters
x (ndarray) – examined point.
cube_x (Dict[str,Union[int,List[float],List[list]]]) – cube which x belongs to.
tot_cubes (Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]) – the final cubes (highly populated + connected).
s (float) – sigma, determines the influence of a point in its neighborhood.
- Return type
ndarray
- Returns
list of points belonging to cubes connected to
cube_x and whose center of mass’ distance from x is less than or equal to 4*s.
-
near_without_cube(x, coord_dict, tot_cubes, s)[source]¶ Find the cube that
x belongs to, and then find the points of cubes that are connected with it and whose center of mass’ distance from x is less than or equal to 4*s. The point itself is included.
- Parameters
x (ndarray) – examined point.
coord_dict (Dict[Tuple[int,int],Tuple[float,float,float,float]]) – dictionary of the rectangles’ coordinates.
tot_cubes (Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]]) – the final cubes (highly populated + connected).
s (float) – sigma, determines the influence of a point in its neighborhood.
- Return type
ndarray
- Returns
list of points belonging to cubes connected to
the cube that x belongs to and whose center of mass’ distance from x is less than or equal to 4*s.
-
plot_3d_both(data, s, xi=None, prec=3)[source]¶ Show a 3D plot of the density function, with a horizontal plane cutting it at height xi, above which points can be considered significant density attractors. Below this, a scatter plot and a contour plot show the actual points of the dataset, colored by significance of their ‘density-attractiveness’.
- Parameters
data (ndarray) – input dataset.
s (float) – sigma, determines the influence of a point in its neighborhood.
xi (Optional[float]) – xi, determines whether a density attractor is significant.
prec (int) – precision used to compute density function.
- Return type
None
-
plot_3d_or_contour(data, s, three=False, scatter=False, prec=3)[source]¶ Plot the density function for the input dataset, either in 3D or 2D, using a contour plot.
- Parameters
data (ndarray) – input dataset.
s (float) – sigma, determines the influence of a point in its neighborhood.
three (bool) – if True, execute 3D plot and do not plot 2D contour plot.
scatter (bool) – if True, and if three is False, draw a scatter plot on top of the contour plot.
prec (int) – precision used to compute density function.
- Return type
None
-
plot_clust_dict(data, coord_df)[source]¶ Draw a scatter plot of the dataset, highlighting the clusters, the outliers and the density attractors, marked with a cross.
- Parameters
data (
ndarray) – input dataset.coord_df (
DataFrame) – dataframe of points with cluster labels and coordinates of density attractors.
- Return type
None
-
plot_grid_rect(data, s, cube_kind='populated')[source]¶ Plot the cubes, with colors highlighting populated and highly populated cubes.
- Parameters
data (ndarray) – input dataset.
s (float) – sigma, determines the influence of a point in its neighborhood.
cube_kind (str) – option to consider populated cubes or highly populated cubes.
- Return type
None
-
plot_infl(data, s, xi)[source]¶ Plot points of the dataset, showing which of them could be density attractors.
- Parameters
data (ndarray) – input dataset.
s (float) – sigma, determines the influence of a point in its neighborhood.
xi (float) – xi, determines whether a density attractor is significant.
- Return type
None
-
plot_min_bound_rect(data)[source]¶ Plot the minimal bounding rectangle of the input dataset.
- Parameters
data (ndarray) – input dataset.
- Return type
tuple
-
pop_cubes(data, s)[source]¶ Find the populated cubes (rectangles containing at least one point).
- Parameters
data (ndarray) – input dataset.
s (float) – sigma, determines the influence of a point in its neighborhood.
- Return type
Tuple[Dict[Tuple[int,int],Dict[str,Union[int,List[float],List[list]]]],Dict[Tuple[int,int],Tuple[float,float,float,float]]]
- Returns
the (x,y) coordinates of the populated cubes, each with the number of points it contains, the coordinates of its center of mass, and the coordinates of the points belonging to it; and the coordinates of each cube (rectangle) itself.
-
square_wave_density(x, D, s, dist='euclidean')[source]¶ Compute the square-wave density of a point with respect to a dataset.
- Parameters
x (ndarray) – point whose density is to be computed.
D (ndarray) – dataset.
s (float) – cut-off.
dist (str) – distance to use.
- Return type
int
- Returns
square-wave density at point x with respect to dataset D, using a square-wave function with distance dist and cut-off s.
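As an illustration, the square-wave density can be read as the number of points of D that lie within the cut-off s of x. The block below is a minimal sketch of that idea, not the clustviz source: the `_sketch` function names are hypothetical, only the Euclidean distance is handled, and plain tuples stand in for ndarray so that the block needs nothing beyond the standard library.

```python
import math


def square_wave_influence_sketch(x, y, s):
    """Return 1 if the Euclidean distance between x and y is at most s, else 0."""
    return 1 if math.dist(x, y) <= s else 0


def square_wave_density_sketch(x, D, s):
    """Sum the square-wave influence of every point of D on x."""
    return sum(square_wave_influence_sketch(x, p, s) for p in D)


# Hypothetical toy dataset: three points near the origin and one far away.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
density = square_wave_density_sketch((0.0, 0.0), D, s=2.0)
print(density)
```

In this sketch a point of D that coincides with x contributes to its own density; whether the library counts it the same way is an assumption.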
-
square_wave_gradient(x, D, s, dist='euclidean')[source]¶ Compute the gradient of the square-wave density function of a point with respect to a dataset.
- Parameters
x (ndarray) – point whose density is to be computed.
D (ndarray) – dataset.
s (float) – cut-off.
dist (str) – distance to use.
- Return type
ndarray
- Returns
gradient of the square-wave density function at point x with respect to dataset D, using a square-wave function with distance dist and cut-off s.
-
square_wave_influence(x, y, s, dist='euclidean')[source]¶ Compute the square-wave influence function in (x,y) with cut-off s.
- Parameters
x (ndarray) – first point.
y (ndarray) – second point.
s (float) – cut-off.
dist (str) – distance to use.
- Return type
int
- Returns
1 if dist(x, y) <= s, else 0.
clustviz.optics module¶
-
ExtractDBSCANclust(ClustDist, CoreDist, eps_db)[source]¶ Extract clusters in a DBSCAN fashion; any eps_db <= eps of OPTICS can be used.
- Parameters
ClustDist (Dict[str,float]) – ClustDist of OPTICS, a dictionary of the form point_index:reach_dist.
CoreDist (Dict[str,float]) – CoreDist of OPTICS, a dictionary of the form point_index:core_dist.
eps_db (float) – the eps to choose for DBSCAN.
- Return type
Dict[str,int]
- Returns
dictionary of clusters, of the form point_index:cluster_label.
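The extraction step can be sketched along the lines of the procedure commonly described for OPTICS: walk through the points in processing order, open a new cluster at each core point that is not reachable within eps_db, and mark unreachable non-core points as noise. The block below is a hedged sketch, not the clustviz implementation. It assumes that the dictionaries preserve the OPTICS ordering, that undefined distances are stored as infinity, and that noise is labelled -1; the function name is hypothetical.

```python
import math


def extract_dbscan_sketch(clust_dist, core_dist, eps_db):
    """Label points in OPTICS order; -1 marks noise (an assumption of this sketch)."""
    labels = {}
    current = -1  # label of the cluster being filled; -1 until one is opened
    for point, reach in clust_dist.items():  # dicts keep insertion order
        if reach > eps_db:
            if core_dist[point] <= eps_db:
                current += 1              # unreachable core point: new cluster
                labels[point] = current
            else:
                labels[point] = -1        # unreachable and not core: noise
        else:
            labels[point] = current       # reachable: joins the current cluster
    return labels


# Hypothetical OPTICS output for six points.
inf = math.inf
clust_dist = {"0": inf, "1": 0.5, "2": 0.6, "3": 3.0, "4": 0.4, "5": inf}
core_dist = {"0": 0.5, "1": 0.5, "2": 0.6, "3": 0.4, "4": 0.4, "5": inf}
labels = extract_dbscan_sketch(clust_dist, core_dist, eps_db=1.0)
print(labels)
```

Lowering eps_db splits clusters at every point whose reachability distance exceeds it, which is why one OPTICS run can yield several DBSCAN-like clusterings.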
-
OPTICS(X, eps, minPTS, plot=True, plot_reach=False)[source]¶ Execute the OPTICS algorithm. Similar to DBSCAN, but uses a priority queue.
- Parameters
X (ndarray) – input array.
eps (float) – radius of a point within which to search for minPTS points.
minPTS (int) – minimum number of neighbors for a point to be considered a core point.
plot (bool) – if True, the scatter plot of the function point_plot is displayed at each step.
plot_reach (bool) – if True, the reachability plot is displayed at each step.
- Return type
Tuple[Dict[str,float],Dict[str,float]]
- Returns
ClustDist, a dictionary of the form point_index:reach_dist, and CoreDist, a dictionary of the form point_index:core_dist.
-
minPTSdist(data, o, minPTS, eps)[source]¶ Return the minPTS-distance of a point if it is a core point, else return np.inf.
- Parameters
data (Dict[str,ndarray]) – input dictionary.
o (str) – key of point of interest.
minPTS (int) – minimum number of neighbors for a point to be considered a core point.
eps (float) – radius of a point within which to search for minPTS points.
- Return type
Union[float,Any]
- Returns
minPTS-distance of data[o] or np.inf.
-
plot_clust(X, ClustDist, CoreDist, eps, eps_db)[source]¶ Plot a scatter plot on the left, where points are colored according to the cluster they belong to, and a reachability plot on the right, where colors correspond to the clusters, and the two horizontal lines represent eps and eps_db.
- Parameters
X (ndarray) – input array.
ClustDist (Dict[str,float]) – ClustDist of OPTICS, a dictionary of the form point_index:reach_dist.
CoreDist (Dict[str,float]) – CoreDist of OPTICS, a dictionary of the form point_index:core_dist.
eps (float) – the eps used to run OPTICS.
eps_db (float) – the eps to choose for DBSCAN.
- Return type
None
-
point_plot(X, X_dict, o, eps, processed=None, col='yellow')[source]¶ Plot a scatter plot of points: the point (x,y) is light black and surrounded by a red circle of radius eps, processed points are plotted in col (yellow by default) without edgecolor, and still-to-process points are green with black edgecolor.
- Parameters
X (ndarray) – input array.
X_dict (Dict[str,ndarray]) – input dictionary version of X.
o (str) – point that is currently inspected.
eps (float) – radius of the circle to plot around the point (x,y).
processed (Optional[Iterable]) – already processed points, to plot in col.
col (str) – color to use for processed points, yellow by default.
- Return type
None
-
reach_dist(data, x, y, minPTS, eps)[source]¶ Reachability distance (even if it is not a distance because it isn’t symmetrical).
- Parameters
data (Dict[str,ndarray]) – input dictionary.
x (str) – first point.
y (str) – second point.
minPTS (int) – minimum number of neighbors for a point to be considered a core point.
eps (float) – radius of a point within which to search for minPTS points.
- Return type
Union[float,Any]
- Returns
reachability distance of x and y.
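To see why the reachability distance is not symmetric, it helps to write it out in its usual OPTICS form: the larger of the core distance of the reference point and the plain distance between the two points. The block below is a minimal sketch under stated assumptions, not the library code. The `_sketch` names are hypothetical, a point is not counted among its own neighbors, y is taken as the reference point, and tuples replace ndarray to keep the block dependency-free.

```python
import math


def core_dist_sketch(data, o, min_pts, eps):
    """Distance from data[o] to its min_pts-th nearest other point within eps, else inf."""
    dists = sorted(math.dist(data[o], p) for key, p in data.items() if key != o)
    within_eps = [d for d in dists if d <= eps]
    if len(within_eps) < min_pts:
        return math.inf  # o is not a core point
    return within_eps[min_pts - 1]


def reach_dist_sketch(data, x, y, min_pts, eps):
    """Reachability distance of x with respect to y."""
    return max(core_dist_sketch(data, y, min_pts, eps), math.dist(data[x], data[y]))


# Hypothetical points on a line, keyed by string index as in the OPTICS dictionaries.
data = {"0": (0.0, 0.0), "1": (1.0, 0.0), "2": (2.0, 0.0), "3": (5.0, 0.0)}
forward = reach_dist_sketch(data, "0", "1", min_pts=2, eps=3.5)
backward = reach_dist_sketch(data, "1", "0", min_pts=2, eps=3.5)
print(forward, backward)
```

The two values differ because points "0" and "1" have different core distances, even though the distance between them is the same in both directions.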
-
reach_plot(data, ClustDist, eps)[source]¶ Plot the reachability plot, along with a horizontal line denoting eps, from the ClustDist produced by OPTICS.
- Parameters
data (Dict[str,ndarray]) – input dictionary.
ClustDist (Dict[str,float]) – output of OPTICS function, dictionary of the form point_index:reach_dist.
eps (float) – radius of a point within which to search for minPTS points.
- Return type
None
-
scan_neigh1(data, point, eps)[source]¶ Neighborhood search for a point of a given dataset-dictionary (data) with a fixed eps.
- Parameters
data (Dict[str,ndarray]) – input dictionary.
point (ndarray) – point whose neighborhood is to be examined.
eps (float) – radius of search.
- Return type
Dict[str,ndarray]
- Returns
dictionary of neighborhood points.
-
scan_neigh2(data, point, eps)[source]¶ Variation of scan_neigh1 that returns only the keys of the input dictionary with the euclidean distances <= eps from the point.
- Parameters
data (Dict[str,ndarray]) – input dictionary.
point (ndarray) – point whose neighborhood is to be examined.
eps (float) – radius of search.
- Return type
List[str]
- Returns
keys of dictionary of neighborhood points, ordered by distance.
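A neighborhood search of this kind can be sketched in a few lines: compute the Euclidean distance from every stored point to the query point, keep the keys whose distance is at most eps, and sort them by that distance. This is an illustrative sketch only; the function name is hypothetical and tuples are used in place of ndarray.

```python
import math


def scan_neigh_keys_sketch(data, point, eps):
    """Return the keys of data lying within eps of point, nearest first."""
    dists = {key: math.dist(p, point) for key, p in data.items()}
    return sorted((key for key, d in dists.items() if d <= eps), key=dists.get)


# Hypothetical dataset-dictionary keyed by string index.
data = {"0": (0.0, 0.0), "1": (3.0, 0.0), "2": (1.0, 0.0), "3": (0.0, 2.0)}
neighbors = scan_neigh_keys_sketch(data, (0.0, 0.0), eps=2.0)
print(neighbors)
```

In this sketch the query point's own key is returned when the point belongs to data, since its distance is zero.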
clustviz.pam module¶
-
class KMedoids(n_cluster=2, max_iter=10, tol=0.1, start_prob=0.8, end_prob=0.99, random_state=42)[source]¶ Bases: object
-
calculate_distance_of_clusters(cluster_dist=None)[source]¶ If no argument is provided, just sum the distances of the existing cluster_distances, else sum the distances of the input cluster_distances.
- Parameters
cluster_dist – if not None, cluster distances.
- Returns
sum of cluster distances.
-
-
plot_pam(data, cl, equal_axis_scale=False)[source]¶ Scatterplot of data points, with colors according to cluster labels. Centers of mass of the clusters are marked with an X.
- Parameters
data (DataFrame) – input data sample as dataframe.
cl (dict) – cluster dictionary.
equal_axis_scale (bool) – if True, axes are plotted with the same scaling.
- Return type
None
clustviz.utils module¶
-
annotate_points(annotations, points, ax)[source]¶ Annotate the points of the axis with their name (number).
- Parameters
annotations (Iterable) – names of the points (their numbers).
points (ndarray) – array of their positions.
ax – axis of the plot.
- Return type
None
-
build_initial_matrices(X, partial_index=None)[source]¶ Build the initial dataframe, adding as many columns of the type 0x,0y,1x,1y,2x,2y,… as there are points in the input dataset, filling them with NaNs, and build the same dataframe without NaNs columns.
- Parameters
X (ndarray) – input data array.
partial_index (Optional[list]) – if not None, it is used as index of the matrix_a, of cluster points and of representatives, for CURE algorithm.
- Return type
Tuple[DataFrame,DataFrame]
- Returns
initial cluster dataframe, where each row represents a cluster and each pair of columns represents the x and y coordinates of each point belonging to that cluster, and its version without NaNs.
-
chernoffBounds(u_min, f, N, d, k)[source]¶
- Parameters
u_min (int) – size of the smallest cluster u.
f (float) – fraction of cluster points (0 <= f <= 1).
N (int) – total size.
d (float) – the probability that the sample contains less than f*|u| points of cluster u is less than d.
k (int) – number of clusters.
If one uses as dim(u) the minimum cluster size we are interested in, the result is the minimum sample size that guarantees that for k clusters the probability of selecting fewer than f*dim(u) points from any one of the clusters u is less than k*d.
- Return type
float
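For orientation, the Chernoff-bound sample size is usually stated for CURE-style sampling as s_min = f*N + (N/|u|)*log(1/d) + (N/|u|)*sqrt(log(1/d)^2 + 2*f*|u|*log(1/d)). The block below is a sketch of that textbook expression; that chernoffBounds evaluates exactly this formula is an assumption, and the function name is hypothetical. The parameter k is left out because, in this form, it only enters the interpretation of the result (the k*d bound over all clusters), not the expression itself.

```python
import math


def chernoff_sample_size_sketch(u_min, f, N, d):
    """Minimum sample size so that fewer than f*u_min points of a cluster
    of size u_min are sampled with probability below d."""
    log_term = math.log(1.0 / d)
    ratio = N / u_min
    return (
        f * N
        + ratio * log_term
        + ratio * math.sqrt(log_term ** 2 + 2.0 * f * u_min * log_term)
    )


# Hypothetical setting: 1000 points, smallest cluster of interest has 100 points.
s_min = chernoff_sample_size_sketch(u_min=100, f=0.1, N=1000, d=0.01)
print(round(s_min, 1))
```

The first term, f*N, is the sample size one would pick naively; the remaining terms are the safety margin, which grows as d shrinks or as the smallest cluster gets smaller relative to N.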
-
cluster_points(cluster_name)[source]¶ Return points composing the cluster, removing brackets and hyphen from cluster name, e.g. ((a)-(b))-(c) becomes [a, b, c].
- Parameters
cluster_name (str) – name of the cluster.
- Return type
list
- Returns
points forming the cluster.
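The name-to-points conversion described above can be sketched with plain string operations: drop the brackets, then split on the hyphen. This is a hypothetical illustration of the documented example, not the library code, and it assumes that point names themselves contain neither brackets nor hyphens.

```python
def cluster_points_sketch(cluster_name):
    """Split a nested cluster name such as '((a)-(b))-(c)' into its point names."""
    stripped = cluster_name.replace("(", "").replace(")", "")
    return stripped.split("-")


points = cluster_points_sketch("((a)-(b))-(c)")
print(points)
```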
-
convert_colors(dict_colors, alpha=0.5)[source]¶ Modify the transparency of each color of a dictionary of colors to the desired alpha.
- Return type
Dict[int,Tuple[float, …]]
-
dist2(data, x, y)[source]¶ Euclidean distance which takes keys of a dictionary (X_dict) as inputs.
- Return type
float
-
draw_rectangle_or_encircle(X, points, X_clust, Y_clust, ax, ind)[source]¶ Draw a rectangle or link the points forming the cluster, if the number of points exceeds two.
- Parameters
X (ndarray) – input data array.
points (list) – points forming the cluster.
X_clust (list) – x coordinates of points forming the cluster.
Y_clust (list) – y coordinates of points forming the cluster.
ax – axis of the plot.
ind (int) – index for coloring the cluster.
- Return type
None