clustviz package

Submodules

clustviz.agglomerative module

agg_clust(X, linkage, plotting=True)[source]

Perform hierarchical agglomerative clustering with the provided linkage method, plotting every step of cluster aggregation.

Parameters
  • X (ndarray) – input data array.

  • linkage (str) – linkage method; can be single, complete, average or ward.

  • plotting (bool) – if True, execute plots.

Return type

None

avg_dist(a, b)[source]

Distance for average_linkage method, i.e. mean[dist(x, y)] for x in a & y in b.

Return type

float

cl_dist(a, b)[source]

Distance for complete_linkage method, i.e. max[dist(x,y)] for x in a & y in b.

Return type

float

compute_var(X, df)[source]

Compute total intra-cluster variance of the cluster configuration inferred from df.

Parameters
  • X (ndarray) – input data as array.

  • df (DataFrame) – input dataframe built by agg_clust, listing the cluster and the x and y coordinates of each point.

Return type

Tuple[DataFrame, float]

Returns

centroids dataframe with their coordinates and the single variances of the corresponding clusters, and the total intra-cluster variance.

compute_var_sing(df, centroids)[source]

Compute every internal variance in clusters; clusters are found in df, whereas centroids are saved in centroids.

Parameters
  • df (DataFrame) – input dataframe built by agg_clust, listing the cluster and the x and y coordinates of each point.

  • centroids (DataFrame) – dataframe of the centroids of clusters, with their x and y coordinates.

Return type

list

Returns

list of intra-cluster variances.

compute_ward_ij(data, df)[source]

Compute the difference in total within-cluster variance, using squared Euclidean distance, and find the best cluster according to the Ward criterion.

Parameters
  • data (ndarray) – input data array.

  • df (DataFrame) – input dataframe built by agg_clust, listing the cluster and the x and y coordinates of each point.

Return type

Tuple[Tuple, float, float]

Returns

(i, j): indices of the best cluster (the one for which the increase in intra-cluster variance is minimum); new_summ: new total intra-cluster variance; par_var: increment in total intra-cluster variance, i.e. the minimum increase in total intra-cluster variance.
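
As a rough illustration of the Ward criterion, the sketch below computes, for every pair of clusters, the increase in the total within-cluster sum of squared Euclidean distances that a merge would cause, and picks the pair with the smallest increase. It relies on the textbook identity increase = |A||B| / (|A| + |B|) * ||c_A - c_B||^2. The helper names (centroid, ward_increase, best_ward_pair) are hypothetical and not part of clustviz; compute_ward_ij may well compute the quantity differently.

```python
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def ward_increase(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def best_ward_pair(clusters):
    """Return ((i, j), increase) for the pair with the minimum increase."""
    pairs = combinations(range(len(clusters)), 2)
    return min(
        (((i, j), ward_increase(clusters[i], clusters[j])) for i, j in pairs),
        key=lambda item: item[1],
    )

# Hypothetical toy data: two singletons close together and one distant cluster.
clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
pair, increase = best_ward_pair(clusters)
```

Here the two nearby singletons are selected, because merging them adds the least variance.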

dist_mat(df, linkage)[source]

Take as input the dataframe created by agg_clust and output the distance matrix. The result is an upper triangular matrix: the symmetrical values are replaced with np.inf.

Parameters
  • df (DataFrame) – input dataframe, with the first column corresponding to x-coordinates and the second column corresponding to y-coordinates of data points.

  • linkage (str) – linkage method; can be single, complete, average.

Return type

DataFrame

Returns

distance matrix.
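
A minimal sketch of the idea, using plain nested lists instead of a DataFrame: only entries above the diagonal hold real distances, and everything else is infinite so that a global minimum search never picks a mirrored entry. The function names are hypothetical, and treating the diagonal as infinite too is an assumption made here, not something the documentation states.

```python
import math

def upper_triangular_distances(points):
    """Pairwise Euclidean distances; entries on or below the diagonal are inf."""
    n = len(points)
    mat = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mat[i][j] = math.dist(points[i], points[j])
    return mat

def closest_pair(mat):
    """Indices (i, j) of the smallest entry, i.e. the next candidate merge."""
    n = len(mat)
    cells = ((i, j) for i in range(n) for j in range(n))
    return min(cells, key=lambda ij: mat[ij[0]][ij[1]])

points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
mat = upper_triangular_distances(points)
merge = closest_pair(mat)
```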

dist_mat_gen(df)[source]

Variation of dist_mat that uses only the single_linkage method.

Return type

DataFrame

point_plot_mod(X, distance_matrix, level_txt, level2_txt=None)[source]

Scatter plot of data points, colored according to the cluster they belong to. The most recently merged cluster is enclosed in a rectangle of the same color as its points, with red borders. In the top right corner, the total distance is shown, along with the current number of clusters. When using Ward linkage, the increment in distance is also shown.

Parameters
  • X (ndarray) – input data as array.

  • distance_matrix (DataFrame) – distance matrix built by agg_clust.

  • level_txt (float) – dist_tot displayed.

  • level2_txt (Optional[float]) – dist_incr displayed.

Return type

None

sl_dist(a, b)[source]

Distance for single_linkage method, i.e. min[dist(x,y)] for x in a & y in b.

Return type

float
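
The three linkage distances described for sl_dist, cl_dist and avg_dist can be sketched side by side as follows. This is only an illustration with Euclidean distance on lists of points; the function names are hypothetical and the actual clustviz functions may expect different input types.

```python
import math
from itertools import product

def pairwise(a, b):
    """All Euclidean distances between a point of a and a point of b."""
    return [math.dist(x, y) for x, y in product(a, b)]

def single_link(a, b):
    return min(pairwise(a, b))      # min[dist(x, y)]

def complete_link(a, b):
    return max(pairwise(a, b))      # max[dist(x, y)]

def average_link(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)          # mean[dist(x, y)]

a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
```

For any two clusters the average linkage distance lies between the single and the complete linkage distances.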

update_mat(mat, i, j, linkage)[source]

Update the input distance matrix in the position (i, j), according to the provided linkage method.

Parameters
  • mat (DataFrame) – distance dataframe.

  • i (int) – row index.

  • j (int) – column index.

  • linkage (str) – linkage method; can be single, complete, average.

Return type

DataFrame

Returns

updated distance dataframe.

clustviz.birch module

class birch(data, number_clusters, branching_factor=50, max_node_entries=200, diameter=0.5, type_measurement=<measurement_type.CENTROID_EUCLIDEAN_DISTANCE: 0>, entry_size_limit=500, diameter_multiplier=1.5, ccore=True)[source]

Bases: pyclustering.cluster.birch.birch

process(plotting=False)[source]

Perform cluster analysis according to the rules of the BIRCH algorithm.

Returns

the instance itself (birch).

See also get_clusters().

return_tree()[source]

Return the tree built by the algorithm.

class cftree(branch_factor, max_entries, threshold, type_measurement=<measurement_type.CENTROID_EUCLIDEAN_DISTANCE: 0>)[source]

Bases: pyclustering.container.cftree.cftree

insert(entry)[source]

Insert clustering feature to the tree.

Parameters

entry – clustering feature that should be inserted.

show_feature_distribution(data=None)[source]

Show the feature distribution. Only features in 1D, 2D or 3D space can be visualized.

Parameters

data (list) – list of points used for visualization; if it is not specified, only the features are displayed.

plot_birch_leaves(tree, data)[source]

Scatter plot of data points, with colors according to the leaf they belong to. Points in the same entry in a leaf are represented by a cross, with the number of points over it.

Parameters
  • tree – tree built during BIRCH algorithm execution.

  • data – input data as an array or as a list of lists.

plot_tree_fin(tree, info=False)[source]

Plot the final CFtree built by BIRCH. Leaves are colored, and every node displays the total number of elements in its child nodes.

Parameters
  • tree – tree built during BIRCH algorithm execution.

  • info – if True, tree height, number of nodes, leaves and entries are printed.

clustviz.clara module

class ClaraClustering(max_iter=100000)[source]

Bases: object

The CLARA clustering algorithm: essentially k-medoids run repeatedly on random samples of the data, which makes it much faster on larger data sets.

average_cost(_df, _fn, _cur_choice)[source]

A function to compute the average cost.

Parameters
  • _df (DataFrame) – The input data frame.

  • _fn (str) – The distance function.

  • _cur_choice (list) – The current medoid candidates.

Returns

The average cost, the new medoids.

cheat_at_sampling(_df, _k, _fn, _nsamp)[source]

A function to cheat at sampling for speed ups.

Parameters
  • _df (DataFrame) – The input dataframe.

  • _k (int) – The number of medoids.

  • _fn (str) – The distance function.

  • _nsamp (int) – The number of samples.

Return type

Tuple[float, list]

Returns

The best score, the medoids.

clara(_df, _k, _fn)[source]

The main clara clustering iterative algorithm.

Parameters
  • _df (DataFrame) – Input dataframe.

  • _k (int) – Number of medoids.

  • _fn (str) – The distance function to use.

Return type

Tuple[float, list, Union[dict, Dict[Any, list]]]

Returns

The minimized cost, the best medoid choices and the final configuration.

compute_cost(_df, _fn, _cur_choice, cache_on=False)[source]

A function to compute the configuration cost.

Parameters
  • _df (DataFrame) – The input dataframe.

  • _fn (str) – The distance function.

  • _cur_choice (list) – The current set of medoid choices.

  • cache_on (bool) – Boolean flag to turn caching on.

Return type

Tuple[float, Dict[Any, list]]

Returns

The total configuration cost, the medoids.
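
A configuration cost of this kind can be sketched as the sum of the distances from each point to its nearest medoid, returned together with the resulting assignment. The version below is a simplified stand-in: it works on a list of points with Euclidean distance and has no caching, and the name configuration_cost is hypothetical.

```python
import math

def configuration_cost(points, medoid_idx):
    """Assign each point to its nearest medoid; return (total cost, clusters)."""
    clusters = {m: [] for m in medoid_idx}
    total = 0.0
    for i, p in enumerate(points):
        nearest = min(medoid_idx, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(i)
        total += math.dist(p, points[nearest])
    return total, clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
cost, clusters = configuration_cost(points, [0, 2])
```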

static cosine_distance(v1, v2)[source]

Function for cosine distance.

Parameters
  • v1 (Iterable) – The first vector.

  • v2 (Iterable) – The second vector.

Return type

float

Returns

The cosine distance between v1 and v2.

static euclidean_distance(v1, v2)[source]

Slow function for computing euclidean distance.

Parameters
  • v1 (Iterable) – The first vector.

  • v2 (Iterable) – The second vector.

Return type

float

Returns

The euclidean distance between v1 and v2.

static fast_euclidean(v1, v2)[source]

Faster function for euclidean distance.

Parameters
  • v1 (ndarray) – The first vector.

  • v2 (ndarray) – The second vector.

Return type

float

Returns

The euclidean distance between v1 and v2.

k_medoids(_df, _k, _fn, _niter)[source]

The original k-medoids algorithm.

Parameters
  • _df (DataFrame) – Input data frame.

  • _k (int) – Number of medoids.

  • _fn (str) – The distance function to use.

  • _niter (int) – The number of iterations.

Return type

Tuple[float, list, Union[dict, Dict[Any, list]]]

Returns

Cost of configuration, the medoids (list) and the clusters (dictionary).

Pseudo-code for the k-medoids algorithm:

  1. Sample k of the n data points as the medoids.
  2. Associate each data point to the closest medoid.
  3. While the cost of the data point space configuration is decreasing:
       - For each medoid m and each non-medoid point o:
           - Swap m and o, recompute cost.
           - If global cost increased, swap back.
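
The swap-based procedure in the pseudo-code above can be sketched in a few lines. This is a simplified illustration on a list of points with Euclidean distance, not the ClaraClustering.k_medoids method itself; the names total_cost and k_medoids_sketch and the seed parameter are assumptions introduced here.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid (step 2)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def k_medoids_sketch(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)      # step 1: random medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                  # step 3: loop while cost drops
        improved = False
        for pos in range(k):
            for o in range(len(points)):
                if o in medoids:
                    continue
                # Tentatively swap the medoid at position pos with point o.
                candidate = medoids[:pos] + [o] + medoids[pos + 1:]
                cost = total_cost(points, candidate)
                if cost < best:
                    medoids, best = candidate, cost
                    improved = True
                # Otherwise the candidate is discarded, i.e. "swap back".
    return best, sorted(medoids)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
cost, medoids = k_medoids_sketch(points, k=2)
```

On this toy data the search ends with one medoid in each group of three points.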

static manhattan_distance(v1, v2)[source]

Function for manhattan distance.

Parameters
  • v1 (Iterable) – The first vector.

  • v2 (Iterable) – The second vector.

Return type

float

Returns

The manhattan distance between v1 and v2.
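
For reference, the three distance functions can be sketched in pure Python as below. The cosine distance is taken here as 1 minus the cosine similarity, which is the usual convention but an assumption with respect to this package; the function names are hypothetical.

```python
import math

def euclidean(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def manhattan(v1, v2):
    return sum(abs(a - b) for a, b in zip(v1, v2))

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norms   # assumes neither vector is all zeros
```

Orthogonal vectors have cosine distance 1, and parallel vectors have cosine distance 0 regardless of their lengths.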

plot_pam_mod(data, cl, full, equal_axis_scale=False)[source]

Scatterplot of data points, with colors according to cluster labels. Only sampled points are plotted, the others are only displayed with their indexes; moreover, centers of mass of the clusters are marked with an X.

Parameters
  • data (DataFrame) – input data sample.

  • cl (dict) – cluster dictionary.

  • full (DataFrame) – full input dataframe.

  • equal_axis_scale (bool) – if True, axes are plotted with the same scaling.

Return type

None

clustviz.clarans module

class clarans(data, number_clusters, numlocal, maxneighbor)[source]

Bases: pyclustering.cluster.clarans.clarans

process(plotting=False)[source]

Perform cluster analysis according to the rules of the CLARANS algorithm.

Returns

the instance itself (clarans).

See also get_clusters(), get_medoids().

compute_cost_clarans(data, _cur_choice)[source]

A function to compute the configuration cost. (modified from that of CLARA)

Parameters
  • data (DataFrame) – The input dataframe.

  • _cur_choice (list) – The current set of medoid choices.

Return type

Tuple[float, Dict[Any, list]]

Returns

The total configuration cost, the medoids.

plot_tree_clarans(data, k)[source]

Plot G_{k,n} as in the CLARANS paper; use only with small input data.

Parameters
  • data (DataFrame) – input DataFrame.

  • k (int) – number of points in each combination (possible set of medoids).

Return type

None
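
To see why the graph is only practical for small inputs, the sketch below builds G_{k,n} explicitly: each node is a set of k point indices (a candidate set of medoids), and two nodes are neighbors when they differ in exactly one element. The function name clarans_graph is hypothetical; the plotting itself is not reproduced.

```python
from itertools import combinations

def clarans_graph(n, k):
    """Nodes are k-subsets of range(n); edges join subsets sharing k-1 items."""
    nodes = list(combinations(range(n), k))
    edges = [
        (u, v)
        for u, v in combinations(nodes, 2)
        if len(set(u) & set(v)) == k - 1
    ]
    return nodes, edges

nodes, edges = clarans_graph(n=5, k=2)
```

The graph has C(n, k) nodes and each node has k(n - k) neighbors, so its size grows very quickly with n.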

clustviz.cure module

assignment_phase_large_cure(X, CURE_df, diz, initial_ind, last_reps, not_sampled, not_sampled_ind, n_rep_fin, xwidth)[source]

In the last phase of the CURE algorithm variation for large datasets, arrows are displayed from every non-sampled point to its closest representative point; moreover, representative points are surrounded by small circles, to make them more visible. Representative points of different clusters are plotted in different shades of red.

Parameters
  • X – input data array.

  • diz – indexes of data points, to take shuffling into account.

  • CURE_df – input dataframe built by CURE algorithm, listing the cluster and the x and y coordinates of each point.

  • initial_ind – initial partial index.

  • last_reps – dictionary of last representative points.

  • not_sampled – coordinates of points that have not been initially sampled, in the large dataset version.

  • not_sampled_ind – indexes of the not_sampled points.

  • n_rep_fin – number of representatives to use for each cluster in the final assignment phase in the large dataset version.

  • xwidth – plot width.

cure(X, k, c=3, alpha=0.1, plotting=True, preprocessed_data=None, partial_index=None, n_rep_finalclust=None, not_sampled=None, not_sampled_ind=None)[source]

CURE algorithm: hierarchical agglomerative clustering using representatives. The parameters which default to None are used for the large dataset variation of CURE.

Parameters
  • X (ndarray) – input data array.

  • k (int) – desired number of clusters.

  • c (int) – number of representatives for each cluster.

  • alpha (float) – parameter that regulates the shrinking of representative points toward the centroid.

  • plotting (bool) – if True, plots all intermediate steps.

  • preprocessed_data – if not None, must be of the form (clusters,representatives,matrix_a,X_dist1), which is used to perform a warm start.

  • partial_index – if not None, it is used as index of the matrix_a, of cluster points and of representatives.

  • n_rep_finalclust – the final representative points used to classify the not_sampled points.

  • not_sampled – points not sampled in the initial phase.

  • not_sampled_ind – indexes of not_sampled points.

Returns

the clusters dictionary, the dictionary of representatives, and the matrix a.

cure_sample_part(X, k, c=3, alpha=0.3, u_min=None, f=0.3, d=0.02, p=None, q=None, n_rep_finalclust=None, plotting=True)[source]

CURE algorithm variation for large datasets. Partition the sample space into p partitions, each of size len(X)/p, then partially cluster each partition until the final number of clusters in each partition reduces to n/(pq). Then run a second clustering pass on the n/q partial clusters for all the partitions.

Parameters
  • X (ndarray) – input data array.

  • k (int) – desired number of clusters.

  • c (int) – number of representatives for each cluster.

  • alpha (float) – parameter that regulates the shrinking of representative points toward the centroid.

  • u_min (Optional[int]) – size of the smallest cluster u.

  • f (float) – fraction of cluster points (0 <= f <= 1) we would like to have in the sample.

  • d (float) – (0 <= d <= 1) the probability that the sample contains less than f*|u| points of cluster u is less than d.

  • p (Optional[int]) – the number of partitions.

  • q (Optional[int]) – the number >1 such that each partition reduces to n/(pq) clusters.

  • n_rep_finalclust (Optional[int]) – number of representatives to use in the final assignment phase.

  • plotting (bool) – if True, plots all intermediate steps.

Returns

the clusters dictionary, the dictionary of representatives, and the matrix a.
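
The partitioning arithmetic described for cure_sample_part can be checked with a small helper. The function below is purely illustrative (partition_plan is a hypothetical name) and uses integer division, which is an assumption; n stands for the number of points being partitioned.

```python
def partition_plan(n, p, q):
    """Sizes implied by splitting n points into p partitions with reduction q."""
    partition_size = n // p                    # points per partition
    clusters_per_partition = n // (p * q)      # partial clusters per partition
    second_pass_clusters = p * clusters_per_partition   # about n/q in total
    return partition_size, clusters_per_partition, second_pass_clusters

plan = partition_plan(n=1000, p=5, q=4)
```

With 1000 points, 5 partitions and q = 4, each partition holds 200 points and is reduced to 50 partial clusters, so the second pass starts from 250 clusters.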

demo_parameters()[source]

Four plots showing the effects on the sample size of various parameters.

dist_clust_cure(rep_u, rep_v)[source]

Compute the distance of two clusters based on the minimum distance found between the representatives of one cluster and the ones of the other.

Parameters
  • rep_u (list) – representatives of the first cluster.

  • rep_v (list) – representatives of the second cluster.

Return type

float

Returns

distance between two clusters.
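
A minimal sketch of this cluster distance, assuming Euclidean distance between representatives given as coordinate tuples (rep_distance is a hypothetical name):

```python
import math

def rep_distance(rep_u, rep_v):
    """Minimum distance between any representative of u and any of v."""
    return min(math.dist(a, b) for a in rep_u for b in rep_v)

d = rep_distance([(0.0, 0.0), (1.0, 1.0)], [(4.0, 5.0), (1.0, 3.0)])
```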

dist_mat_gen_cure(reps)[source]

Build distance matrix for CURE algorithm, using the dictionary of representatives.

Parameters

reps (dict) – dictionary of representative points, the only ones used to compute distances between clusters.

Return type

DataFrame

Returns

distance matrix as dataframe

form_new_cluster(clusters, u, u_cl)[source]

Form a new cluster from the input ones.

Parameters
  • clusters (dict) – existing clusters.

  • u (str) – first cluster.

  • u_cl (str) – second cluster.

Return type

list

Returns

new cluster obtained by merging the first and second cluster.

plot_results_cure(clust)[source]

Scatter plot of data points, colored according to the cluster they belong to, after performing CURE algorithm.

Parameters

clust (dict) – output of CURE algorithm, dictionary of the form cluster_labels+point_indices: coords of points

Return type

None

point_plot_mod2(X, CURE_df, reps, level_txt, level2_txt=None, par_index=None, u=None, u_cl=None, initial_ind=None, last_reps=None, not_sampled=None, not_sampled_ind=None, n_rep_fin=None)[source]

Scatter plot of input data points, colored according to the cluster they belong to. A rectangle with red borders is displayed around the last merged cluster; representative points of the last merged cluster are also plotted in red, along with the center of mass, plotted as a red cross. The current number of clusters and the current distance are also displayed in the upper right corner.

Parameters
  • X (ndarray) – input data array.

  • CURE_df (DataFrame) – input dataframe built by CURE algorithm, listing the cluster and the x and y coordinates of each point.

  • reps (list) – list of the coordinates of representative points.

  • level_txt (float) – distance at which current merging occurs displayed in the upper right corner.

  • level2_txt (Optional[float]) – incremental distance (not used).

  • par_index – partial index to take the shuffling of indexes into account.

  • u – first cluster to be merged.

  • u_cl – second cluster to be merged.

  • initial_ind – initial partial index.

  • last_reps (Optional[dict]) – dictionary of last representative points.

  • not_sampled – coordinates of points that have not been initially sampled, in the large dataset version.

  • not_sampled_ind – indexes of the not_sampled points.

  • n_rep_fin – number of representatives to use for each cluster in the final assignment phase in the large dataset version.

Returns

if par_index is not None, returns the new indexes of par_index.

sel_rep(clusters, name, c, alpha)[source]

Select c representatives of the cluster: the first one is the farthest from the centroid, the other c-1 are the farthest from the already selected representatives. It doesn’t use the old representatives, so it is slower than sel_rep_fast.

Parameters
  • clusters (dict) – dictionary of clusters.

  • name (str) – name of the cluster we want to select representatives from.

  • c (int) – number of representatives we want to extract.

  • alpha (float) – 0<=float<=1, it determines how much the representative points are moved toward the centroid: 0 means they aren’t modified, 1 means that all points collapse to the centroid.

Return type

list

Returns

list of representative points.
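
The selection and shrinking steps can be sketched as follows. This is an illustrative reading of the description, not the sel_rep implementation: "farthest from the already selected representatives" is interpreted as maximizing the distance to the nearest chosen representative, and the function works on a plain list of points rather than on the clusters dictionary. The name select_representatives is hypothetical.

```python
import math

def select_representatives(points, c, alpha):
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    # First representative: the point farthest from the centroid.
    reps = [max(points, key=lambda p: math.dist(p, centroid))]
    # Remaining ones: the point farthest from its nearest chosen representative.
    while len(reps) < min(c, len(points)):
        reps.append(max(
            (p for p in points if p not in reps),
            key=lambda p: min(math.dist(p, r) for r in reps),
        ))
    # Shrink each representative toward the centroid by a factor alpha.
    return [
        tuple(r[d] + alpha * (centroid[d] - r[d]) for d in range(dim))
        for r in reps
    ]

points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
half_shrunk = select_representatives(points, c=2, alpha=0.5)
```

With alpha = 0 the representatives stay on the original points, and with alpha = 1 they all collapse onto the centroid, matching the description of the alpha parameter.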

sel_rep_fast(prec_reps, clusters, name, c, alpha)[source]

Select c representatives of the clusters from the previously computed representatives, so it is faster than sel_rep.

Parameters
  • prec_reps (list) – list of previously computed representatives.

  • clusters (dict) – dictionary of clusters.

  • name (str) – name of the cluster we want to select representatives from.

  • c (int) – number of representatives we want to extract.

  • alpha (float) – 0<=float<=1, it determines how much the representative points are moved toward the centroid: 0 means they aren’t modified, 1 means that all points collapse to the centroid.

Return type

list

Returns

list of representative points.

update_mat_cure(mat, i, j, rep_new, name)[source]

Update distance matrix of CURE, by computing the new distances from the new representatives.

Parameters
  • mat (DataFrame) – input dataframe built by CURE algorithm, listing the cluster and the x and y coordinates of each point.

  • i (int) – row index of cluster to be merged.

  • j (int) – column index of cluster to be merged.

  • rep_new (dict) – dictionary of new representatives.

  • name (str) – string of the form “(” + u + “)-(” + u_cl + “)”, containing the new name of the newly merged cluster.

Return type

DataFrame

Returns

updated matrix with new distances

clustviz.dbscan module

DBSCAN(data, eps, minPTS, plotting=False, print_details=False)[source]

DBSCAN algorithm.

Parameters
  • data (ndarray) – input array.

  • eps (float) – radius of a point within which to search for minPTS points.

  • minPTS (int) – minimum number of neighbors for a point to be considered a core point.

  • plotting (bool) – if True, executes point_plot_mod, plotting every time a point is added to a cluster.

  • print_details (bool) – if True, prints the length of the “external” NearestNeighborhood and of the “internal” one (in the while loop).

Return type

Dict[str, int]

Returns

dictionary of the form point_index:cluster_label.
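
For orientation, a compact version of the DBSCAN procedure is sketched below. It is a simplified stand-in, not the clustviz function: it uses integer point indices as keys (the documented return type uses string keys), labels noise as -1, counts a point as part of its own neighborhood, and does no plotting. All names are hypothetical.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan_sketch(points, eps, min_pts):
    labels = {}
    cluster = -1
    for i in range(len(points)):
        if i in labels:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == NOISE:
                labels[j] = cluster      # noise reached from a core point
            if j in labels:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)    # j is also core: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan_sketch(points, eps=1.5, min_pts=3)
```

The two tight groups become clusters 0 and 1, while the isolated point is left as noise.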

plot_clust_DB(X, ClustDict, eps, circle_class=None, noise_circle=True)[source]

Scatter plot of the data points, colored according to the cluster they belong to; circle_class plots circles around some or all points, with a radius of eps; if noise_circle is True, circles are also plotted around noise points.

Parameters
  • X (ndarray) – input array.

  • ClustDict (Dict[str, int]) – dictionary of the form point_index:cluster_label, built by DBSCAN.

  • eps (float) – radius of the circles to plot around the points.

  • circle_class (Optional[str]) – if == “all”, plots circles around every non-noise point, else plots circles only around points belonging to certain clusters, e.g. circle_class = [1,2] will plot circles around points belonging to clusters 1 and 2.

  • noise_circle (bool) – if True, plots circles around noise points

Return type

None

point_plot_mod(X, X_dict, point, eps, ClustDict)[source]

Scatter plot of the points: the point currently inspected, (x,y), is light black and surrounded by a red circle of radius eps; already processed points are colored according to ClustDict, without edgecolor; still-to-process points are green with black edgecolor.

Parameters
  • X (ndarray) – input array.

  • X_dict (Dict[str, ndarray]) – input dictionary version of X.

  • point – coordinates of the point that is currently inspected.

  • eps (float) – radius of the circle to plot around the point (x,y).

  • ClustDict (dict) – dictionary of the form point_index:cluster_label, built by DBSCAN

Return type

None

scan_neigh1_mod(data, point, eps)[source]

Neighborhood search for a point of a given dataset-dictionary (data) with a fixed eps; unlike scan_neigh1 of OPTICS, it also returns the point itself.

Parameters
  • data (Dict[str, ndarray]) – input dataset.

  • point (ndarray) – point whose neighborhood is to be examined.

  • eps (float) – radius of search.

Return type

Dict[str, ndarray]

Returns

neighborhood points.

clustviz.denclue module

CubeInfo

Info of a cube (rectangle): the number of points contained in the cube, the linear sum of the x and y coordinates of the points in the cube, and the coordinates of the points themselves. For example {'num_points': 2, 'linear_sum': np.array([3, 4]), 'points_coords': np.array([[1, 1], [2, 3]])}.

alias of Dict[str, Union[int, List[float], List[list]]]

Cubes

Cubes (rectangles), having as keys the (row, column) tuple, and as values CubeInfo.

alias of Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]

CubesCoords

Coordinates of the cubes (rectangles), having as keys the (row, column) tuple, and as values the minimum x, the minimum y, the maximum x and maximum y of the cube. For example { (0, 0): (-0.05, -0.05, 1.95, 1.95), (1, 0): (1.95, -0.05, 3.95, 1.95), ... }.

alias of Dict[Tuple[int, int], Tuple[float, float, float, float]]

DENCLUE(data, s, xi=3, xi_c=3, tol=2, dist='euclidean', prec=20, plotting=True)[source]

Execute the DENCLUE algorithm, whose basic idea is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors.

Parameters
  • data (ndarray) – input dataset.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • xi (float) – xi, determines whether a density attractor is significant.

  • xi_c (float) – xi/2d, where d=2 is the dimension of input data.

  • tol (float) – tolerance for determining if two density attractors coincide.

  • dist (str) – distance to use.

  • prec (int) – precision used to compute density function.

  • plotting (bool) – if True, show plots.

Return type

list

Returns

list of cluster labels.

FindPoint(x1, y1, x2, y2, x, y)[source]

Check if the point (x,y) is inside the rectangle determined by x1, y1, x2, y2.

Parameters
  • x1 (float) – minimum x coordinate of the rectangle vertices.

  • y1 (float) – minimum y coordinate of the rectangle vertices.

  • x2 (float) – maximum x coordinate of the rectangle vertices.

  • y2 (float) – maximum y coordinate of the rectangle vertices.

  • x (float) – x coordinate of the point to be examined.

  • y (float) – y coordinate of the point to be examined.

Return type

bool

Returns

True if the point (x, y) lies inside the rectangle, False otherwise.

FindRect(point, coord_dict)[source]

Find the key of the cube (rectangle) containing the point (if any).

Parameters
  • point (ndarray) – point whose cube (rectangle) is to be found.

  • coord_dict (Dict[Tuple[int, int], Tuple[float, float, float, float]]) – dictionary of the rectangles’ coordinates.

Return type

Optional[Tuple[int, int]]

Returns

key of the cube, e.g. (1, 3), containing the point; if it does not exist, return None.
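
The two lookups can be sketched together. Whether points lying exactly on a border count as inside is not stated in the documentation; strict inequalities are assumed here, and the lowercase names find_point and find_rect are hypothetical stand-ins for FindPoint and FindRect.

```python
def find_point(x1, y1, x2, y2, x, y):
    """True if (x, y) lies strictly inside the rectangle (assumed convention)."""
    return x1 < x < x2 and y1 < y < y2

def find_rect(point, coord_dict):
    """Key of the first rectangle containing the point, or None."""
    for key, (x1, y1, x2, y2) in coord_dict.items():
        if find_point(x1, y1, x2, y2, point[0], point[1]):
            return key
    return None

# Same layout as the CubesCoords example: key -> (min x, min y, max x, max y).
coords = {
    (0, 0): (-0.05, -0.05, 1.95, 1.95),
    (1, 0): (1.95, -0.05, 3.95, 1.95),
}
```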

assign_cluster(data, others, attractor, clust_dict, processed)[source]

Assign a density attractor to (a) point(s) or mark it/them as outlier(s).

Parameters
  • data (ndarray) – input dataset.

  • others (ndarray) – coordinates of the point(s) whose clusters have to be assigned.

  • attractor (Optional[Tuple[ndarray, bool]]) – coordinates of the density attractor and a flag indicating whether it is significant; points attracted by a non-significant attractor are treated as outliers.

  • clust_dict (Dict[int, ndarray]) – dictionary of points with the coordinates of their density attractor.

  • processed (List[int]) – points that have been processed.

Return type

Tuple[Dict[int, ndarray], List[int]]

Returns

dictionary of clusters, i.e. points with their density attractor, and list of processed points.

center_of_mass(cube)[source]

Compute the center of mass of a cube (rectangle).

check_border_points_rectangles(data, populated_cubes)[source]

Check if any of the points lie on the borders of the cubes, by checking if the sum of the number of points contained in each populated cube is equal to the total number of points of the dataset.

Parameters
  • data (ndarray) – input dataset.

  • populated_cubes (Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]) – populated cubes.

Return type

None

check_connection(cube1, cube2, s, dist='euclidean')[source]

Check if two cubes are connected (the distance between their centers of mass is not greater than 4*s).

Parameters
  • cube1 (Dict[str, Union[int, List[float], List[list]]]) – first cube.

  • cube2 (Dict[str, Union[int, List[float], List[list]]]) – second cube.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • dist (str) – distance to use.

Return type

bool

Returns

True if the cubes are connected, False otherwise.
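
Because each cube stores the number of points and the linear sum of their coordinates, the center of mass and the connection test need no access to the individual points. The sketch below assumes Euclidean distance and plain Python lists in the CubeInfo dictionaries; cube_center and cubes_connected are hypothetical names.

```python
import math

def cube_center(cube):
    """Center of mass = linear sum of coordinates / number of points."""
    return [c / cube["num_points"] for c in cube["linear_sum"]]

def cubes_connected(cube1, cube2, s):
    """Connected if the centers of mass are at most 4*s apart."""
    return math.dist(cube_center(cube1), cube_center(cube2)) <= 4 * s

cube_a = {"num_points": 2, "linear_sum": [3.0, 4.0],
          "points_coords": [[1.0, 1.0], [2.0, 3.0]]}
cube_b = {"num_points": 1, "linear_sum": [4.5, 2.0],
          "points_coords": [[4.5, 2.0]]}
```

The two centers are 3 units apart, so the cubes are connected for s = 1 but not for s = 0.5.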

density_attractor(data, x, coord_dict, tot_cubes, s, xi, delta=0.05, max_iter=100, dist='euclidean')[source]

Find the density attractor for point x with a hill-climbing procedure. To speed up computations, during the procedure, store all the points y such that dist(x, y) <= s/2: they will belong to the same cluster as x.

Parameters
  • data (ndarray) – input dataset.

  • x (ndarray) – point whose density attractor is to be found.

  • coord_dict (Dict[Tuple[int, int], Tuple[float, float, float, float]]) – dictionary of the rectangles’ coordinates.

  • tot_cubes (Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]) – the final cubes (highly populated + connected).

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • xi (float) – xi, determines whether a density attractor is significant.

  • delta (float) – delta of gradient descent.

  • max_iter (int) – maximum number of iterations for finding a density attractor.

  • dist (str) – distance to use.

Return type

Union[Tuple[Tuple[float, bool], ndarray], Tuple[None, None]]

Returns

the coordinates of the density attractor, a flag to indicate its significance, and the list of the coordinates of the point(s) attracted by that density attractor.

extract_cluster_labels(data, cld, tol=2)[source]

Extract the labels from the dictionary of points with the coordinates of their density attractors.

Parameters
  • data (ndarray) – input dataset.

  • cld (Dict[int, ndarray]) – dictionary of points with the coordinates of their density attractors.

  • tol (float) – tolerance to merge points with the same density attractors.

Return type

DataFrame

Returns

dataframe of points with cluster labels and coordinates of density attractors.

find_connected_cubes(hp_cubes, cubes, s, dist='euclidean')[source]

Return connected cubes, i.e. cubes whose center of mass lies at a distance of less than 4*s from the center of mass of a highly populated cube.

Parameters
  • hp_cubes (Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]) – highly populated cubes.

  • cubes (Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]) – cubes.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • dist (str) – distance to use.

Return type

Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]

Returns

new dictionary of cubes.

form_populated_cubes(a, b, c, d, data)[source]

For the given input cube (rectangle), compute how many points of the input dataset lie in it, store their coordinates and compute the sum of their x and y coordinates.

Parameters
  • a (float) – minimum x coordinate of the rectangle.

  • b (float) – minimum y coordinate of the rectangle.

  • c (float) – maximum x coordinate of the rectangle.

  • d (float) – maximum y coordinate of the rectangle.

  • data (ndarray) – input dataset.

Return type

Dict[str, Union[int, List[float], List[list]]]

Returns

dictionary of number of points lying in the cube, the linear sum of their x and y coordinates, their coordinates.
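
A minimal sketch of how such a cube summary could be built from a list of 2D points. Strict inequalities on the borders are an assumption, plain lists replace NumPy arrays, and cube_info is a hypothetical name.

```python
def cube_info(a, b, c, d, data):
    """Summarize the points of data that fall inside the rectangle (a, b, c, d)."""
    inside = [list(p) for p in data if a < p[0] < c and b < p[1] < d]
    linear_sum = [sum(p[0] for p in inside), sum(p[1] for p in inside)]
    return {
        "num_points": len(inside),
        "linear_sum": linear_sum,
        "points_coords": inside,
    }

data = [(1.0, 1.0), (2.0, 3.0), (5.0, 5.0)]
info = cube_info(0.0, 0.0, 4.0, 4.0, data)
```

The result has the same shape as the CubeInfo example: two points, linear sum [3, 4], and their coordinates.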

gaussian_density(x, D, s, dist='euclidean')[source]

Compute the Gaussian density of a point with respect to a dataset.

Parameters
  • x (ndarray) – point whose density is to be computed.

  • D (ndarray) – dataset.

  • s (float) – standard deviation of the Gaussian.

  • dist (str) – distance to use in the Gaussian.

Return type

float

Returns

Gaussian density at point x with respect to dataset D, using a Gaussian function with distance dist and standard deviation s.

gaussian_influence(x, y, s, dist='euclidean')[source]

Return the value of the Gaussian influence function in (x,y) with standard deviation s.

Parameters
  • x (ndarray) – first point.

  • y (ndarray) – second point.

  • s (float) – standard deviation of the Gaussian.

  • dist (str) – distance to use in the Gaussian.

Return type

float

Returns

value for the Gaussian in (x,y) with standard deviation s.
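
In the usual DENCLUE formulation, the Gaussian influence of y on x is exp(-dist(x, y)^2 / (2 s^2)), and the density at x is the sum of the influences of all data points. The sketch below follows that formulation with Euclidean distance and no normalization constant; whether clustviz applies any additional scaling is not stated here, and the short names influence and density are hypothetical.

```python
import math

def influence(x, y, s):
    """Gaussian influence of y on x with standard deviation s."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * s ** 2))

def density(x, data, s):
    """Sum of the influences of all data points on x."""
    return sum(influence(x, p, s) for p in data)

data = [(0.0, 0.0), (1.0, 0.0)]
value = density((0.0, 0.0), data, s=1.0)
```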

gradient_gaussian_density(x, D, s, dist='euclidean')[source]

Compute the gradient of the Gaussian density function, used to find the density-attractors, at a point.

Parameters
  • x (ndarray) – point.

  • D (ndarray) – dataset.

  • s (float) – standard deviation of the Gaussian.

  • dist (str) – distance to use in the Gaussian.

Return type

ndarray

Returns

gradient of the Gaussian density at point x with respect to dataset D, using a Gaussian function with distance dist and standard deviation s.
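
The gradient is what drives the hill-climbing search for density attractors. The sketch below uses the gradient expression from the DENCLUE formulation, the sum over data points of (x_i - x) times the Gaussian influence, without the 1/s^2 factor (only the direction matters for a normalized step), and then repeatedly moves a fixed step delta along the normalized gradient until the density stops increasing. It is an illustration only: the cube-based speed-ups, the significance check against xi and the collection of attracted points are omitted, and all function names are hypothetical.

```python
import math

def influence(x, y, s):
    return math.exp(-math.dist(x, y) ** 2 / (2 * s ** 2))

def density(x, data, s):
    return sum(influence(x, p, s) for p in data)

def gradient(x, data, s):
    """Sum over data points of (p - x) weighted by the Gaussian influence."""
    grad = [0.0] * len(x)
    for p in data:
        w = influence(x, p, s)
        for d in range(len(x)):
            grad[d] += (p[d] - x[d]) * w
    return grad

def hill_climb(x, data, s, delta=0.05, max_iter=100):
    """Follow the normalized gradient until the density no longer increases."""
    current = list(x)
    for _ in range(max_iter):
        grad = gradient(current, data, s)
        norm = math.hypot(*grad)
        if norm == 0.0:
            break
        candidate = [c + delta * g / norm for c, g in zip(current, grad)]
        if density(candidate, data, s) < density(current, data, s):
            break                      # overshoot: keep the previous point
        current = candidate
    return current

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
attractor = hill_climb((3.0, 3.0), data, s=1.0)
```

Starting from (3, 3), the climb ends close to the middle of the four data points, where the density of this toy dataset peaks.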

highly_pop_cubes(pop_cub, xi_c)[source]

Find highly populated cubes, i.e. cubes containing at least xi_c points.

Parameters
  • pop_cub (Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]) – populated cubes.

  • xi_c (float) – xi_c = xi/2d, where xi determines whether a density attractor is significant.

Return type

Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]

Returns

highly populated cubes.

near_with_cube(x, cube_x, tot_cubes, s)[source]

Find points of cubes that are connected with cube_x and whose center of mass’ distance from x is less than or equal to 4*s. The point itself is included.

Parameters
  • x (ndarray) – examined point.

  • cube_x (Dict[str, Union[int, List[float], List[list]]]) – cube which x belongs to.

  • tot_cubes (Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]) – the final cubes (highly populated + connected).

  • s (float) – sigma, determines the influence of a point in its neighborhood.

Return type

ndarray

Returns

array of the points belonging to cubes that are connected to cube_x and whose center of mass lies at a distance of at most 4*s from x.

near_without_cube(x, coord_dict, tot_cubes, s)[source]

Find the cube that x belongs to, then find the points of the cubes that are connected to it and whose center of mass lies at a distance of at most 4*s from x. The point x itself is included.

Parameters
  • x (ndarray) – examined point.

  • coord_dict (Dict[Tuple[int, int], Tuple[float, float, float, float]]) – dictionary of the rectangles’ coordinates.

  • tot_cubes (Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]]) – the final cubes (highly populated + connected).

  • s (float) – sigma, determines the influence of a point in its neighborhood.

Return type

ndarray

Returns

array of the points belonging to cubes that are connected to the cube containing x and whose center of mass lies at a distance of at most 4*s from x.

plot_3d_both(data, s, xi=None, prec=3)[source]

Show a 3D plot of the density function, with a horizontal plane cutting it at height xi, above which points can be considered significant density attractors. Below this, a scatter plot and a contour plot show the actual points of the dataset, colored by significance of their ‘density-attractiveness’.

Parameters
  • data (ndarray) – input dataset.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • xi (Optional[float]) – xi, determines whether a density attractor is significant.

  • prec (int) – precision used to compute density function.

Return type

None

plot_3d_or_contour(data, s, three=False, scatter=False, prec=3)[source]

Plot the density function for the input dataset, either in 3D or 2D, using a contour plot.

Parameters
  • data (ndarray) – input dataset.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • three (bool) – if True, draw the 3D plot instead of the 2D contour plot.

  • scatter (bool) – if True, and if three is False, draw a scatter plot on top of the contour plot.

  • prec (int) – precision used to compute density function.

Return type

None

plot_clust_dict(data, coord_df)[source]

Draw a scatter plot of the dataset, highlighting the clusters, the outliers and the density attractors, marked with a cross.

Parameters
  • data (ndarray) – input dataset.

  • coord_df (DataFrame) – dataframe of points with cluster labels and coordinates of density attractors.

Return type

None

plot_grid_rect(data, s, cube_kind='populated')[source]

Plot the cubes, with colors highlighting populated and highly populated cubes.

Parameters
  • data (ndarray) – input dataset.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • cube_kind (str) – option to consider populated cubes or highly populated cubes.

Return type

None

plot_infl(data, s, xi)[source]

Plot points of the dataset, showing which of them could be density attractors.

Parameters
  • data (ndarray) – input dataset.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

  • xi (float) – xi, determines whether a density attractor is significant.

Return type

None

plot_min_bound_rect(data)[source]

Plot the minimal bounding rectangle of the input dataset.

Parameters

data (ndarray) – input dataset.

Return type

tuple

pop_cubes(data, s)[source]

Find the populated cubes (rectangles containing at least one point).

Parameters
  • data (ndarray) – input dataset.

  • s (float) – sigma, determines the influence of a point in its neighborhood.

Return type

Tuple[Dict[Tuple[int, int], Dict[str, Union[int, List[float], List[list]]]], Dict[Tuple[int, int], Tuple[float, float, float, float]]]

Returns

a dictionary of the populated cubes, keyed by their (x,y) coordinates and storing, for each cube, the number of points it contains, the coordinates of its center of mass and the coordinates of the points belonging to it; and a dictionary with the coordinates of each cube (rectangle) itself.

square_wave_density(x, D, s, dist='euclidean')[source]

Compute the square-wave density of a point with respect to a dataset.

Parameters
  • x (ndarray) – point whose density is to be computed.

  • D (ndarray) – dataset.

  • s (float) – cut-off.

  • dist (str) – distance to use.

Return type

int

Returns

square-wave density at point x with respect to dataset D, using a square-wave function with distance dist and cut-off s.

square_wave_gradient(x, D, s, dist='euclidean')[source]

Compute the gradient of the square-wave density function of a point with respect to a dataset.

Parameters
  • x (ndarray) – point whose density is to be computed.

  • D (ndarray) – dataset.

  • s (float) – cut-off.

  • dist (str) – distance to use.

Return type

ndarray

Returns

gradient of the square-wave density function at point x with respect to dataset D, using a square-wave function with distance dist and cut-off s.

square_wave_influence(x, y, s, dist='euclidean')[source]

Compute the square-wave influence function in (x,y) with cut-off s.

Parameters
  • x (ndarray) – first point.

  • y (ndarray) – second point.

  • s (float) – cut-off.

  • dist (str) – distance to use.

Return type

int

Returns

if dist(x, y) <= s, return 1, else 0.
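
The square-wave functions can be sketched directly from the return value described above: the influence is an indicator of closeness, and the density is taken here to be the sum of the influences, i.e. the number of dataset points within the cut-off. The `_sketch` names are hypothetical, only the Euclidean distance is covered, and the density-as-sum definition is the usual DENCLUE one, assumed to match the library.

```python
import math


def square_wave_influence_sketch(x, y, s):
    """Return 1 if the Euclidean distance between x and y is <= s, else 0."""
    return 1 if math.dist(x, y) <= s else 0


def square_wave_density_sketch(x, D, s):
    """Number of points of D lying within distance s of x."""
    return sum(square_wave_influence_sketch(x, p, s) for p in D)


dataset = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
count = square_wave_density_sketch((0.0, 0.0), dataset, 5.0)
```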

clustviz.optics module

ExtractDBSCANclust(ClustDist, CoreDist, eps_db)[source]

Extract clusters in a DBSCAN fashion; one can use any eps_db <= eps of OPTICS.

Parameters
  • ClustDist (Dict[str, float]) – ClustDist of OPTICS, a dictionary of the form point_index:reach_dist.

  • CoreDist (Dict[str, float]) – CoreDist of OPTICS, a dictionary of the form point_index:core_dist.

  • eps_db (float) – the eps to choose for DBSCAN.

Return type

Dict[str, int]

Returns

dictionary of clusters, of the form point_index:cluster_label.
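
The extraction can be sketched along the lines of the ExtractDBSCAN-Clustering procedure from the OPTICS literature. This is an illustration under stated assumptions, not the clustviz implementation: the dictionaries are assumed to be in OPTICS processing order, noise is labeled -1, cluster labels start at 0, and the name `extract_dbscan_sketch` is hypothetical.

```python
import math


def extract_dbscan_sketch(clust_dist, core_dist, eps_db, noise=-1):
    """Walk the OPTICS ordering and assign DBSCAN-like labels for eps_db."""
    labels = {}
    current = noise  # label of the cluster currently being filled
    next_id = 0      # label for the next cluster to be opened
    for key, reach in clust_dist.items():
        if reach > eps_db:
            # Not reachable from the previous points at this radius.
            if core_dist[key] <= eps_db:
                current = next_id  # a core point opens a new cluster
                next_id += 1
                labels[key] = current
            else:
                labels[key] = noise
        else:
            labels[key] = current  # reachable: same cluster as its predecessors
    return labels


reach = {"0": math.inf, "1": 0.5, "2": 0.6, "3": 3.0, "4": 0.4, "5": 5.0}
core = {"0": 0.5, "1": 0.5, "2": 0.6, "3": 0.4, "4": 0.4, "5": math.inf}
labels = extract_dbscan_sketch(reach, core, eps_db=1.0)
```

In this toy ordering the jump in reachability at point "3" starts a second cluster, while point "5" is neither reachable nor a core point and is therefore labeled as noise.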

OPTICS(X, eps, minPTS, plot=True, plot_reach=False)[source]

Execute the OPTICS algorithm. Similar to DBSCAN, but uses a priority queue.

Parameters
  • X (ndarray) – input array.

  • eps (float) – radius of a point within which to search for minPTS points.

  • minPTS (int) – minimum number of neighbors for a point to be considered a core point.

  • plot (bool) – if True, the scatter plot of the function point_plot is displayed at each step.

  • plot_reach (bool) – if True, the reachability plot is displayed at each step.

Return type

Tuple[Dict[str, float], Dict[str, float]]

Returns

ClustDist, a dictionary of the form point_index:reach_dist, and CoreDist, a dictionary of the form point_index:core_dist.

minPTSdist(data, o, minPTS, eps)[source]

Return the minPTS-distance of a point if it is a core point; otherwise return np.inf.

Parameters
  • data (Dict[str, ndarray]) – input dictionary.

  • o (str) – key of point of interest.

  • minPTS (int) – minimum number of neighbors for a point to be considered a core point.

  • eps (float) – radius of a point within which to search for minPTS points.

Return type

Union[float, Any]

Returns

minPTS-distance of data[o] or np.inf.

plot_clust(X, ClustDist, CoreDist, eps, eps_db)[source]

Plot a scatter plot on the left, where points are colored according to the cluster they belong to, and a reachability plot on the right, where colors correspond to the clusters, and the two horizontal lines represent eps and eps_db.

Parameters
  • X (ndarray) – input array.

  • ClustDist (Dict[str, float]) – ClustDist of OPTICS, a dictionary of the form point_index:reach_dist.

  • CoreDist (Dict[str, float]) – CoreDist of OPTICS, a dictionary of the form point_index:core_dist.

  • eps (float) – the eps used to run OPTICS.

  • eps_db (float) – the eps to choose for DBSCAN.

Return type

None

point_plot(X, X_dict, o, eps, processed=None, col='yellow')[source]

Plot a scatter plot of points, where the point (x,y) is light black and surrounded by a red circle of radius eps, processed points are plotted in col (yellow by default) without edgecolor, and still-to-process points are green with black edgecolor.

Parameters
  • X (ndarray) – input array.

  • X_dict (Dict[str, ndarray]) – input dictionary version of X.

  • o (str) – point that is currently inspected.

  • eps (float) – radius of the circle to plot around the point (x,y).

  • processed (Optional[Iterable]) – already processed points, to plot in col.

  • col (str) – color to use for processed points, yellow by default.

Return type

None

reach_dist(data, x, y, minPTS, eps)[source]

Compute the reachability distance (which is not a true distance, because it is not symmetric).

Parameters
  • data (Dict[str, ndarray]) – input dictionary.

  • x (str) – first point.

  • y (str) – second point.

  • minPTS (int) – minimum number of neighbors for a point to be considered a core point.

  • eps (float) – radius of a point within which to search for minPTS points.

Return type

Union[float, Any]

Returns

reachability distance of x and y.
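
To make the asymmetry concrete, here is a sketch based on the standard OPTICS definitions: the core distance of o is the distance to its minPTS-th nearest neighbor within eps (infinite if o is not a core point), and the reachability distance of p from o is the larger of that core distance and the plain distance between them. The `_sketch` names are hypothetical; counting the point itself among its neighbors, and which argument plays the role of the core point, are assumptions that may differ from clustviz.

```python
import math


def core_dist_sketch(data, o, min_pts, eps):
    """Distance from data[o] to its min_pts-th neighbor within eps, else inf.

    Assumption: the point itself (distance 0) counts as a neighbor.
    """
    all_dists = (math.dist(data[o], v) for v in data.values())
    within = sorted(d for d in all_dists if d <= eps)
    if len(within) < min_pts:
        return math.inf
    return within[min_pts - 1]


def reach_dist_sketch(data, p, o, min_pts, eps):
    """Reachability distance of p from o: max(core_dist(o), dist(o, p))."""
    return max(core_dist_sketch(data, o, min_pts, eps), math.dist(data[o], data[p]))


data = {"0": (0.0, 0.0), "1": (1.0, 0.0), "2": (2.0, 0.0), "3": (10.0, 0.0)}
from_core = reach_dist_sketch(data, "3", "2", min_pts=2, eps=1.5)
from_isolated = reach_dist_sketch(data, "2", "3", min_pts=2, eps=1.5)
```

Swapping the two arguments changes the result whenever only one of the points is a core point, which is why the quantity is not symmetric.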

reach_plot(data, ClustDist, eps)[source]

Plot the reachability plot, along with a horizontal line denoting eps, from the ClustDist produced by OPTICS.

Parameters
  • data (Dict[str, ndarray]) – input dictionary.

  • ClustDist (Dict[str, float]) – output of OPTICS function, dictionary of the form point_index:reach_dist.

  • eps (float) – radius of a point within which to search for minPTS points.

Return type

None

scan_neigh1(data, point, eps)[source]

Neighborhood search for a point of a given dataset-dictionary (data) with a fixed eps.

Parameters
  • data (Dict[str, ndarray]) – input dictionary.

  • point (ndarray) – point whose neighborhood is to be examined.

  • eps (float) – radius of search.

Return type

Dict[str, ndarray]

Returns

dictionary of neighborhood points.

scan_neigh2(data, point, eps)[source]

Variation of scan_neigh1 that returns only the keys of the points of the input dictionary whose Euclidean distance from the point is <= eps.

Parameters
  • data (Dict[str, ndarray]) – input dictionary.

  • point (ndarray) – point whose neighborhood is to be examined.

  • eps (float) – radius of search.

Return type

List[str]

Returns

keys of dictionary of neighborhood points, ordered by distance.
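
A neighborhood search of this kind can be sketched in a few lines. The name `scan_neigh2_sketch` is hypothetical and plain tuples stand in for ndarrays; the tie-breaking between points at equal distance is arbitrary in this sketch.

```python
import math


def scan_neigh2_sketch(data, point, eps):
    """Keys of the points within eps of point, ordered by increasing distance."""
    dists = ((math.dist(coords, point), key) for key, coords in data.items())
    within = [(d, key) for d, key in dists if d <= eps]
    return [key for _, key in sorted(within)]


data = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (1.0, 0.0), "d": (5.0, 0.0)}
neighbors = scan_neigh2_sketch(data, (0.0, 0.0), eps=2.0)
```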

clustviz.pam module

class KMedoids(n_cluster=2, max_iter=10, tol=0.1, start_prob=0.8, end_prob=0.99, random_state=42)[source]

Bases: object

calculate_distance_of_clusters(cluster_dist=None)[source]

If no argument is provided, just sum the distances of the existing cluster_distances, else sum the distances of the input cluster_distances.

Parameters

cluster_dist – if not None, cluster distances.

Returns

sum of cluster distances.

calculate_inter_cluster_distance(medoid, cluster_list)[source]

Compute the average distance of points in a cluster from their medoid.
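
As an illustration only, the average distance from a medoid can be computed as below. The name `mean_distance_to_medoid_sketch` is hypothetical, and passing coordinates directly (rather than whatever representation the KMedoids class uses internally) is an assumption.

```python
import math


def mean_distance_to_medoid_sketch(medoid, cluster_points):
    """Average Euclidean distance of the cluster points from the medoid."""
    if not cluster_points:
        return 0.0
    total = sum(math.dist(medoid, p) for p in cluster_points)
    return total / len(cluster_points)


avg = mean_distance_to_medoid_sketch((0.0, 0.0), [(0.0, 0.0), (3.0, 4.0), (0.0, 2.0)])
```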

fit(data)[source]

plot_pam(data, cl, equal_axis_scale=False)[source]

Scatterplot of data points, with colors according to cluster labels. Centers of mass of the clusters are marked with an X.

Parameters
  • data (DataFrame) – input data sample as dataframe.

  • cl (dict) – cluster dictionary.

  • equal_axis_scale (bool) – if True, axes are plotted with the same scaling.

Return type

None

clustviz.utils module

annotate_points(annotations, points, ax)[source]

Annotate the points of the axis with their name (number).

Parameters
  • annotations (Iterable) – names of the points (their numbers).

  • points (ndarray) – array of their positions.

  • ax – axis of the plot.

Return type

None

build_initial_matrices(X, partial_index=None)[source]

Build the initial dataframe, adding as many columns of the type 0x,0y,1x,1y,2x,2y,… as there are points in the input dataset and filling them with NaNs, and build the same dataframe without the NaN columns.

Parameters
  • X (ndarray) – input data array.

  • partial_index (Optional[list]) – if not None, it is used as index of the matrix_a, of cluster points and of representatives, for CURE algorithm.

Return type

Tuple[DataFrame, DataFrame]

Returns

initial cluster dataframe, where each row represents a cluster and each pair of columns represents the x and y coordinates of each point belonging to that cluster, and its version without NaNs.

chernoffBounds(u_min, f, N, d, k)[source]

Parameters
  • u_min (int) – size of the smallest cluster u.

  • f (float) – fraction of cluster points (0 <= f <= 1).

  • N (int) – total size.

  • d (float) – the probability that the sample contains less than f*|u| points of cluster u is less than d.

  • k (int) – number of clusters.

If u_min is the minimum cluster size of interest, the result is the minimum sample size that guarantees that, for k clusters, the probability of selecting fewer than f*|u| points from any one of the clusters u is less than k*d.

Return type

float
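
For orientation, the Chernoff-bound sample size given in the CURE literature is s_min = f*N + (N/|u|)*log(1/d) + (N/|u|)*sqrt(log(1/d)^2 + 2*f*|u|*log(1/d)). The sketch below evaluates this expression with natural logarithms; the name `chernoff_bounds_sketch` is hypothetical, and it is an assumption that chernoffBounds computes exactly this quantity (k does not enter the formula, so it is omitted here).

```python
import math


def chernoff_bounds_sketch(u_min, f, N, d):
    """Minimum sample size from the Chernoff bound used in the CURE paper."""
    log_term = math.log(1 / d)
    ratio = N / u_min
    return (
        f * N
        + ratio * log_term
        + ratio * math.sqrt(log_term ** 2 + 2 * f * u_min * log_term)
    )


# 1000 points, smallest cluster of 100 points, at least 10% of it sampled
# with failure probability below 5%.
sample_size = chernoff_bounds_sketch(u_min=100, f=0.1, N=1000, d=0.05)
```

The bound always exceeds f*N, since a plain f-fraction sample can under-represent a small cluster by chance.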

cluster_points(cluster_name)[source]

Return points composing the cluster, removing brackets and hyphen from cluster name, e.g. ((a)-(b))-(c) becomes [a, b, c].

Parameters

cluster_name (str) – name of the cluster.

Return type

list

Returns

points forming the cluster.
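
The name parsing described above can be sketched with plain string operations. The name `cluster_points_sketch` is hypothetical, and the sketch assumes that point names themselves contain neither brackets nor hyphens.

```python
def cluster_points_sketch(cluster_name):
    """Strip the brackets from a cluster name and split it on hyphens."""
    stripped = cluster_name.replace("(", "").replace(")", "")
    return stripped.split("-")


members = cluster_points_sketch("((a)-(b))-(c)")
```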

convert_colors(dict_colors, alpha=0.5)[source]

Modify the transparency of each color of a dictionary of colors to the desired alpha.

Return type

Dict[int, Tuple[float, …]]

dist1(x, y)[source]

Original euclidean distance.

Return type

float

dist2(data, x, y)[source]

Euclidean distance which takes keys of a dictionary (X_dict) as inputs.

Return type

float

draw_rectangle_or_encircle(X, points, X_clust, Y_clust, ax, ind)[source]

Draw a rectangle around the points forming the cluster or, if the number of points exceeds two, encircle them.

Parameters
  • X (ndarray) – input data array.

  • points (list) – points forming the cluster.

  • X_clust (list) – x coordinates of points forming the cluster.

  • Y_clust (list) – y coordinates of points forming the cluster.

  • ax – axis of the plot.

  • ind (int) – index for coloring the cluster.

Return type

None

encircle(x, y, ax, **kwargs)[source]

Plot a line-boundary around a cluster (at least 3 points are required).

Return type

None

euclidean_distance(a, b)[source]

Return Euclidean distance of two arrays.

Return type

float

flatten_list(input_list)[source]

Return type

list

Module contents