clustviz.chameleon package

Submodules

clustviz.chameleon.chameleon module

cluster(df, k, knn=10, m=30, alpha=2.0, verbose0=False, verbose1=False, verbose2=True, plot=True)[source]

Chameleon clustering: build the K-NN graph, partition it into m clusters

Parameters
  • df (DataFrame) – input dataframe.

  • k (int) – desired number of clusters.

  • knn (int) – parameter k of K-nearest_neighbors.

  • m (int) – number of clusters to reach in the initial clustering phase.

  • alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.

  • verbose0 (bool) – if True, print general infos.

  • verbose1 (bool) – if True, print infos about the prepartitioning phase.

  • verbose2 (bool) – if True, print labels of merging clusters and their scores in the merging phase.

  • plot (bool) – if True, show plots.

Return type

Tuple[DataFrame, OrderedDict]

Returns

dataframe with cluster labels and dictionary of merging scores (similarities).

internal_closeness(graph, cluster)[source]

Compute the internal closeness of the input cluster, i.e. the sum of weights of the edges fully contained in it.

Parameters
  • graph (Graph) – kNN graph.

  • cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

float

Returns

internal closeness of the input cluster.

internal_interconnectivity(graph, cluster)[source]

Compute the weighted sum of edges that partition the graph into two roughly equal parts.

Parameters
  • graph – kNN graph.

  • cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

float

Returns

sum of the bisection weights.

len_edges(graph, cluster)[source]

Compute the number of edges that interconnect the nodes of the input cluster.

Parameters
  • graph (Graph) – kNN graph.

  • cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

int

Returns

number of edges interconnecting the nodes of the input graph.

merge_best(graph, df, alpha, k, verbose=False, verbose2=True)[source]

Find the two clusters with the highest score and merge them.

Parameters
  • graph – kNN graph.

  • df – input dataframe.

  • alpha – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.

  • k – desired number of clusters.

  • verbose – if True, print additional infos.

  • verbose2 – if True, print labels of merging clusters and their score.

Return type

Union[Tuple[DataFrame, float, int], bool]

Returns

input dataframe with clustering label column, maximum merging score and newly merged cluster label.

merge_score(graph, cluster_i, cluster_j, alpha)[source]

Compute the score associated with the merging of the two clusters.

Parameters
  • graph (Graph) – kNN graph.

  • cluster_i (List[int]) – first cluster.

  • cluster_j (List[int]) – second cluster.

  • alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.

Return type

float

Returns

merging score.

rebuild_labels(df)[source]

Clean the clustering labels of the input dataframe, i.e. bring them into the range starting from 1. For example, if they range from 6 to 10, bring them to the range 1 to 5.

Parameters

df (DataFrame) – dataframe obtained from the merging phase of the algorithm.

Return type

DataFrame

Returns

dataframe with cleaned labels.

relative_closeness(graph, cluster_i, cluster_j)[source]

Compute the relative closeness of two clusters of a graph, measured on the connecting edges.

Parameters
  • graph (Graph) – kNN graph.

  • cluster_i (List[int]) – first cluster.

  • cluster_j (List[int]) – second cluster.

Return type

float

Returns

relative closeness of the two input clusters.

relative_interconnectivity(graph, cluster_i, cluster_j)[source]

Compute the relative interconnectivity of two clusters of a graph, measured on the connecting edges.

Parameters
  • graph (Graph) – kNN graph.

  • cluster_i (List[int]) – first cluster.

  • cluster_j (List[int]) – second cluster.

Return type

float

Returns

relative interconnectivity of the two input clusters.

clustviz.chameleon.chameleon2 module

cluster2(df, k=None, knn=None, m=30, alpha=2.0, beta=1, m_fact=1000.0, verbose=False, verbose1=True, verbose2=True, plot=True, auto_extract=False)[source]
Parameters
  • df (DataFrame) – input dataframe.

  • k (Optional[int]) – desired number of clusters.

  • knn (Optional[int]) – parameter k of K-nearest_neighbors.

  • m (int) – number of clusters to reach in the initial clustering phase.

  • alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.

  • beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.

  • m_fact (float) – multiplicative factor for clusters composed of a single node.

  • verbose (bool) – if True, print general infos.

  • verbose1 (bool) – if True, print infos about the prepartitioning phase.

  • verbose2 (bool) – if True, print labels of merging clusters and their scores in the merging phase.

  • plot (bool) – if True, show plots.

  • auto_extract (bool) – if True, try to extract the optimal number of clusters and print it.

Return type

Tuple[DataFrame, Dict[int, float]]

Returns

dataframe with cluster labels and dictionary of merging scores (similarities).

connected_components(connected_points)[source]

Find connected components from a dictionary of connected nodes.

Parameters

connected_points (dict) – (symmetrically) connected points.

Return type

Generator

Returns

connected components.

dendrogram_height(merging_similarities, m)[source]

Find dendrogram height, defined with a recursive sum of the reciprocal of the merging scores.

Parameters
  • merging_similarities (Dict[int, float]) – merging scores of the algorithm.

  • m (int) – initial number of clusters.

Return type

Dict[int, float]

Returns

dendrogram height.

extract_optimal_n_clust(merging_similarities, m, f=1000, eta=2)[source]

Extract the optimal number of clusters using the dendrogram.

Parameters
  • merging_similarities (Dict[int, float]) – merging scores of the algorithm.

  • m (int) – initial number of clusters.

  • f (float) – threshold parameter to determine if jump is large enough.

  • eta (float) – decrease coefficient.

Return type

None

find_bigger_jump(dh, jump)[source]

Find a bigger jump in the dendrogram levels.

Parameters
  • dh (Dict[int, float]) – dendrogram height.

  • jump (float) – threshold to exceed.

Return type

float

Returns

best level where to cut off th dendrogram if found, else 0.

find_nearest_height(dh, value)[source]

Find nearest height to cutoff value.

Parameters
  • dh (Dict[int, float]) – dendrogram height.

  • value (float) – first_jump cutoff value.

Return type

int

Returns

nearest dendrogram height to cutoff value.

first_jump_cutoff(dh, mult, eta, m)[source]

Find the first large gap between tree level, which heuristically is the best level where clusters should be divided.

Parameters
  • dh (Dict[int, float]) – dendrogram height.

  • mult (float) – additional factor.

  • eta (float) – decrease coefficient.

  • m (int) – initial number of clusters.

Return type

float

Returns

best level where to cut off dendrogram.

flood_fill(preprocessed_graph, knn_graph, df)[source]

Find clusters composed by more than one connected component and divide them accordingly. Adjust the parameter m, which indicates the number of clusters to reach in the initial phase.

Parameters
  • preprocessed_graph (Graph) – clustered kNN graph.

  • knn_graph (Graph) – kNN graph.

  • df (DataFrame) – input dataframe.

Return type

Tuple[Graph, int]

Returns

preprocessed graph with updated cluster labels, new m parameter.

merge_best2(graph, df, alpha, beta, m_fact, k, verbose=False, verbose2=True)[source]

Find the two clusters with the highest score and merge them.

Parameters
  • graph (Graph) – kNN graph.

  • df (DataFrame) – input dataframe.

  • alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.

  • beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.

  • m_fact (float) – multiplicative factor for clusters composed of a single node.

  • k (int) – desired number of clusters.

  • verbose (bool) – if True, print additional infos.

  • verbose2 (bool) – if True, print labels of merging clusters and their score.

Return type

Union[Tuple[DataFrame, float, int], bool]

Returns

input dataframe with clustering label column, maximum merging score and newly merged cluster label.

merge_score2(graph, ci, cj, alpha, beta, m_fact)[source]

Compute the score associated with the merging of the two clusters.

Parameters
  • graph (Graph) – kNN graph.

  • ci (List[int]) – first cluster.

  • cj (List[int]) – second cluster.

  • alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.

  • beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.

  • m_fact (float) – multiplicative factor for clusters composed of a single node.

Return type

float

Returns

merging score

prepro_edge(graph)[source]

Build a dictionary having points as keys and all the points that are symmetrically connected through edges to the key point as values, i.e. 0: [5, 7] means that there are edges 0->5 and 0->7, but also 5->0 and 7->0.

Parameters

graph (Graph) – kNN graph.

Return type

OrderedDict

Returns

dictionary of symmetrically connected points.

relative_closeness2(graph, cluster_i, cluster_j, m_fact)[source]

Compute the relative closeness of the two input clusters.

Parameters
  • graph (Graph) – kNN graph.

  • cluster_i (List[int]) – first cluster.

  • cluster_j (List[int]) – second cluster.

  • m_fact (float) – multiplicative factor for clusters composed of a single node.

Return type

float

Returns

relative closeness of the two input clusters.

relative_interconnectivity2(graph, cluster_i, cluster_j, beta)[source]

Compute the relative interconnectivity of the two input clusters.

Parameters
  • graph (Graph) – kNN graph.

  • cluster_i (List[int]) – first cluster.

  • cluster_j (List[int]) – second cluster.

  • beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.

Return type

float

Returns

relative interconnectivity of the two input clusters.

rho(graph, cluster_i, cluster_j)[source]

Compute the rho factor, which discourages the algorithm from merging clusters with different densities.

Parameters
  • graph (Graph) – kNN graph.

  • cluster_i (List[int]) – first cluster.

  • cluster_j (List[int]) – second cluster.

Return type

float

Returns

rho factor.

w_int_closeness(graph, cluster)[source]

Compute the internal closeness of the input cluster weighted by the number of its internal edges.

Parameters
  • graph (Graph) – kNN graph.

  • cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

float

Returns

weighted internal closeness.

clustviz.chameleon.graphtools module

bisection_weights(graph, cluster)[source]

Return the weights of the edges that connect nodes belonging to the two new partitions obtained through min-cut bisection.

Parameters
  • graph (Graph) – kNN graph.

  • cluster (List[int]) – list containing the label of the cluster whose nodes are to be found.

Return type

List[float]

Returns

weights of the edges connecting nodes belonging to the two new partitions.

connecting_edges(partitions, graph)[source]

Return only the edges that connect nodes of the first cluster with nodes of the second cluster, in the form of a list of tuples [(0, 5), (3, 5)] (e.g. the only connecting-edges are the ones connecting node_0 to node_5 and node_3 to node_5.

Parameters
  • partitions (Tuple[list, list]) – tuple composed of two lists containing the nodes of the two clusters to be examined.

  • graph (Graph) – kNN graph.

Return type

List[Tuple]

Returns

edges connecting the two input clusters.

get_cluster(graph, cluster_label)[source]

Find nodes belonging to a specific cluster.

Parameters
  • graph (Graph) – kNN graph.

  • cluster_label (int) – label of the cluster whose nodes are to be found.

Return type

List[int]

Returns

nodes of the graph belonging to the input cluster.

get_weights(graph, edges)[source]

Get the weights associated to the input edges.

Parameters
  • graph (Graph) – kNN graph.

  • edges (List[Tuple]) – edges whose weight is to computed.

Return type

List[float]

Returns

weights of the input edges.

knn_graph(df, k, symmetrical, verbose=False)[source]

Return the weighted (symmetrical) K-NearestNeighbor graph of input dataframe.

Parameters
  • df (DataFrame) – input dataset.

  • k (int) – k of kNN, i.e. number of nearest neighbors to consider.

  • symmetrical (bool) – if True, return the weighted symmetrical K-NearestNeighbor graph.

  • verbose (bool) – if True, print infos.

Return type

Graph

Returns

weighted (symmetrical) K-NearestNeighbor graph.

min_cut_bisector(graph)[source]

Return the edges that connect nodes belonging to the two new partitions obtained through min-cut bisection.

Parameters

graph (Graph) – kNN graph.

Return type

List[Tuple]

Returns

edges that connect nodes belonging to the two newly formed partitions.

part_graph(graph, df=None)[source]

Return the input graph with the clustering obtained through mincut-bisection.

Parameters
  • graph (Graph) – kNN graph.

  • df (Optional[DataFrame]) – if not None, input dataframe, containing points coordinates and clustering labels.

Return type

Graph

Returns

partitioned graph.

plot2d_data(df, col_i=None)[source]

Scatter plot data points, colored according to the clusters they belong to, highlighting the last formed cluster.

Parameters
  • df (DataFrame) – dataframe of the points containing a column indicating clusters.

  • col_i (Optional[int]) – if not None, label of the last formed cluster.

Return type

None

plot2d_graph(graph, print_clust=True)[source]

Draw the graph of the input dataset, colored according to the clusters.

Parameters
  • graph (Graph) – kNN graph.

  • print_clust (bool) – if True, print the cardinality of each cluster.

Return type

None

pre_part_graph(graph, k, df=None, verbose=True, plotting=False)[source]

Partition the input graph into k clusters.

Parameters
  • graph (Graph) – kNN graph.

  • k (int) – final number of partitions.

  • df (Optional[DataFrame]) – if not None, input dataframe, containing points coordinates and clustering labels.

  • verbose (bool) – if True, print infos.

  • plotting (bool) – if True, plot the points colored by cluster.

Return type

Graph

Returns

partitioned graph.

Module contents