clustviz.chameleon package¶
Submodules¶
clustviz.chameleon.chameleon module¶
-
cluster(df, k, knn=10, m=30, alpha=2.0, verbose0=False, verbose1=False, verbose2=True, plot=True)[source]¶ Chameleon clustering: build the K-NN graph, partition it into m clusters
- Parameters
df (
DataFrame) – input dataframe.k (
int) – desired number of clusters.knn (
int) – parameter k of K-nearest_neighbors.m (
int) – number of clusters to reach in the initial clustering phase.alpha (
float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.verbose0 (
bool) – if True, print general infos.verbose1 (
bool) – if True, print infos about the prepartitioning phase.verbose2 (
bool) – if True, print labels of merging clusters and their scores in the merging phase.plot (
bool) – if True, show plots.
- Return type
Tuple[DataFrame,OrderedDict]- Returns
dataframe with cluster labels and dictionary of merging scores (similarities).
-
internal_closeness(graph, cluster)[source]¶ Compute the internal closeness of the input cluster, i.e. the sum of weights of the edges fully contained in it.
- Parameters
graph (
Graph) – kNN graph.cluster (
List[int]) – cluster represented by a list of nodes belonging to it.
- Return type
float- Returns
internal closeness of the input cluster.
-
internal_interconnectivity(graph, cluster)[source]¶ Compute the weighted sum of edges that partition the graph into two roughly equal parts.
- Parameters
graph – kNN graph.
cluster (
List[int]) – cluster represented by a list of nodes belonging to it.
- Return type
float- Returns
sum of the bisection weights.
-
len_edges(graph, cluster)[source]¶ Compute the number of edges that interconnect the nodes of the input cluster.
- Parameters
graph (
Graph) – kNN graph.cluster (
List[int]) – cluster represented by a list of nodes belonging to it.
- Return type
int- Returns
number of edges interconnecting the nodes of the input graph.
-
merge_best(graph, df, alpha, k, verbose=False, verbose2=True)[source]¶ Find the two clusters with the highest score and merge them.
- Parameters
graph – kNN graph.
df – input dataframe.
alpha – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.
k – desired number of clusters.
verbose – if True, print additional infos.
verbose2 – if True, print labels of merging clusters and their score.
- Return type
Union[Tuple[DataFrame,float,int],bool]- Returns
input dataframe with clustering label column, maximum merging score and newly merged cluster label.
-
merge_score(graph, cluster_i, cluster_j, alpha)[source]¶ Compute the score associated with the merging of the two clusters.
- Parameters
graph (
Graph) – kNN graph.cluster_i (
List[int]) – first cluster.cluster_j (
List[int]) – second cluster.alpha (
float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.
- Return type
float- Returns
merging score.
-
rebuild_labels(df)[source]¶ Clean the clustering labels of the input dataframe, i.e. bring them into the range starting from 1. For example, if they range from 6 to 10, bring them to the range 1 to 5.
- Parameters
df (
DataFrame) – dataframe obtained from the merging phase of the algorithm.- Return type
DataFrame- Returns
dataframe with cleaned labels.
-
relative_closeness(graph, cluster_i, cluster_j)[source]¶ Compute the relative closeness of two clusters of a graph, measured on the connecting edges.
- Parameters
graph (
Graph) – kNN graph.cluster_i (
List[int]) – first cluster.cluster_j (
List[int]) – second cluster.
- Return type
float- Returns
relative closeness of the two input clusters.
-
relative_interconnectivity(graph, cluster_i, cluster_j)[source]¶ Compute the relative interconnectivity of two clusters of a graph, measured on the connecting edges.
- Parameters
graph (
Graph) – kNN graph.cluster_i (
List[int]) – first cluster.cluster_j (
List[int]) – second cluster.
- Return type
float- Returns
relative interconnectivity of the two input clusters.
clustviz.chameleon.chameleon2 module¶
-
cluster2(df, k=None, knn=None, m=30, alpha=2.0, beta=1, m_fact=1000.0, verbose=False, verbose1=True, verbose2=True, plot=True, auto_extract=False)[source]¶ - Parameters
df (
DataFrame) – input dataframe.k (
Optional[int]) – desired number of clusters.knn (
Optional[int]) – parameter k of K-nearest_neighbors.m (
int) – number of clusters to reach in the initial clustering phase.alpha (
float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.beta (
float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.m_fact (
float) – multiplicative factor for clusters composed of a single node.verbose (
bool) – if True, print general infos.verbose1 (
bool) – if True, print infos about the prepartitioning phase.verbose2 (
bool) – if True, print labels of merging clusters and their scores in the merging phase.plot (
bool) – if True, show plots.auto_extract (
bool) – if True, try to extract the optimal number of clusters and print it.
- Return type
Tuple[DataFrame,Dict[int,float]]- Returns
dataframe with cluster labels and dictionary of merging scores (similarities).
-
connected_components(connected_points)[source]¶ Find connected components from a dictionary of connected nodes.
- Parameters
connected_points (
dict) – (symmetrically) connected points.- Return type
Generator- Returns
connected components.
-
dendrogram_height(merging_similarities, m)[source]¶ Find dendrogram height, defined with a recursive sum of the reciprocal of the merging scores.
- Parameters
merging_similarities (
Dict[int,float]) – merging scores of the algorithm.m (
int) – initial number of clusters.
- Return type
Dict[int,float]- Returns
dendrogram height.
-
extract_optimal_n_clust(merging_similarities, m, f=1000, eta=2)[source]¶ Extract the optimal number of clusters using the dendrogram.
- Parameters
merging_similarities (
Dict[int,float]) – merging scores of the algorithm.m (
int) – initial number of clusters.f (
float) – threshold parameter to determine if jump is large enough.eta (
float) – decrease coefficient.
- Return type
None
-
find_bigger_jump(dh, jump)[source]¶ Find a bigger jump in the dendrogram levels.
- Parameters
dh (
Dict[int,float]) – dendrogram height.jump (
float) – threshold to exceed.
- Return type
float- Returns
best level where to cut off th dendrogram if found, else 0.
-
find_nearest_height(dh, value)[source]¶ Find nearest height to cutoff value.
- Parameters
dh (
Dict[int,float]) – dendrogram height.value (
float) – first_jump cutoff value.
- Return type
int- Returns
nearest dendrogram height to cutoff value.
-
first_jump_cutoff(dh, mult, eta, m)[source]¶ Find the first large gap between tree level, which heuristically is the best level where clusters should be divided.
- Parameters
dh (
Dict[int,float]) – dendrogram height.mult (
float) – additional factor.eta (
float) – decrease coefficient.m (
int) – initial number of clusters.
- Return type
float- Returns
best level where to cut off dendrogram.
-
flood_fill(preprocessed_graph, knn_graph, df)[source]¶ Find clusters composed by more than one connected component and divide them accordingly. Adjust the parameter m, which indicates the number of clusters to reach in the initial phase.
- Parameters
preprocessed_graph (
Graph) – clustered kNN graph.knn_graph (
Graph) – kNN graph.df (
DataFrame) – input dataframe.
- Return type
Tuple[Graph,int]- Returns
preprocessed graph with updated cluster labels, new m parameter.
-
merge_best2(graph, df, alpha, beta, m_fact, k, verbose=False, verbose2=True)[source]¶ Find the two clusters with the highest score and merge them.
- Parameters
graph (
Graph) – kNN graph.df (
DataFrame) – input dataframe.alpha (
float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.beta (
float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.m_fact (
float) – multiplicative factor for clusters composed of a single node.k (
int) – desired number of clusters.verbose (
bool) – if True, print additional infos.verbose2 (
bool) – if True, print labels of merging clusters and their score.
- Return type
Union[Tuple[DataFrame,float,int],bool]- Returns
input dataframe with clustering label column, maximum merging score and newly merged cluster label.
-
merge_score2(graph, ci, cj, alpha, beta, m_fact)[source]¶ Compute the score associated with the merging of the two clusters.
- Parameters
graph (
Graph) – kNN graph.ci (
List[int]) – first cluster.cj (
List[int]) – second cluster.alpha (
float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.beta (
float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.m_fact (
float) – multiplicative factor for clusters composed of a single node.
- Return type
float- Returns
merging score
-
prepro_edge(graph)[source]¶ Build a dictionary having points as keys and all the points that are symmetrically connected through edges to the key point as values, i.e. 0: [5, 7] means that there are edges 0->5 and 0->7, but also 5->0 and 7->0.
- Parameters
graph (
Graph) – kNN graph.- Return type
OrderedDict- Returns
dictionary of symmetrically connected points.
-
relative_closeness2(graph, cluster_i, cluster_j, m_fact)[source]¶ Compute the relative closeness of the two input clusters.
- Parameters
graph (
Graph) – kNN graph.cluster_i (
List[int]) – first cluster.cluster_j (
List[int]) – second cluster.m_fact (
float) – multiplicative factor for clusters composed of a single node.
- Return type
float- Returns
relative closeness of the two input clusters.
-
relative_interconnectivity2(graph, cluster_i, cluster_j, beta)[source]¶ Compute the relative interconnectivity of the two input clusters.
- Parameters
graph (
Graph) – kNN graph.cluster_i (
List[int]) – first cluster.cluster_j (
List[int]) – second cluster.beta (
float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.
- Return type
float- Returns
relative interconnectivity of the two input clusters.
-
rho(graph, cluster_i, cluster_j)[source]¶ Compute the rho factor, which discourages the algorithm from merging clusters with different densities.
- Parameters
graph (
Graph) – kNN graph.cluster_i (
List[int]) – first cluster.cluster_j (
List[int]) – second cluster.
- Return type
float- Returns
rho factor.
-
w_int_closeness(graph, cluster)[source]¶ Compute the internal closeness of the input cluster weighted by the number of its internal edges.
- Parameters
graph (
Graph) – kNN graph.cluster (
List[int]) – cluster represented by a list of nodes belonging to it.
- Return type
float- Returns
weighted internal closeness.
clustviz.chameleon.graphtools module¶
-
bisection_weights(graph, cluster)[source]¶ Return the weights of the edges that connect nodes belonging to the two new partitions obtained through min-cut bisection.
- Parameters
graph (
Graph) – kNN graph.cluster (
List[int]) – list containing the label of the cluster whose nodes are to be found.
- Return type
List[float]- Returns
weights of the edges connecting nodes belonging to the two new partitions.
-
connecting_edges(partitions, graph)[source]¶ Return only the edges that connect nodes of the first cluster with nodes of the second cluster, in the form of a list of tuples [(0, 5), (3, 5)] (e.g. the only connecting-edges are the ones connecting node_0 to node_5 and node_3 to node_5.
- Parameters
partitions (
Tuple[list,list]) – tuple composed of two lists containing the nodes of the two clusters to be examined.graph (
Graph) – kNN graph.
- Return type
List[Tuple]- Returns
edges connecting the two input clusters.
-
get_cluster(graph, cluster_label)[source]¶ Find nodes belonging to a specific cluster.
- Parameters
graph (
Graph) – kNN graph.cluster_label (
int) – label of the cluster whose nodes are to be found.
- Return type
List[int]- Returns
nodes of the graph belonging to the input cluster.
-
get_weights(graph, edges)[source]¶ Get the weights associated to the input edges.
- Parameters
graph (
Graph) – kNN graph.edges (
List[Tuple]) – edges whose weight is to computed.
- Return type
List[float]- Returns
weights of the input edges.
-
knn_graph(df, k, symmetrical, verbose=False)[source]¶ Return the weighted (symmetrical) K-NearestNeighbor graph of input dataframe.
- Parameters
df (
DataFrame) – input dataset.k (
int) – k of kNN, i.e. number of nearest neighbors to consider.symmetrical (
bool) – if True, return the weighted symmetrical K-NearestNeighbor graph.verbose (
bool) – if True, print infos.
- Return type
Graph- Returns
weighted (symmetrical) K-NearestNeighbor graph.
-
min_cut_bisector(graph)[source]¶ Return the edges that connect nodes belonging to the two new partitions obtained through min-cut bisection.
- Parameters
graph (
Graph) – kNN graph.- Return type
List[Tuple]- Returns
edges that connect nodes belonging to the two newly formed partitions.
-
part_graph(graph, df=None)[source]¶ Return the input graph with the clustering obtained through mincut-bisection.
- Parameters
graph (
Graph) – kNN graph.df (
Optional[DataFrame]) – if not None, input dataframe, containing points coordinates and clustering labels.
- Return type
Graph- Returns
partitioned graph.
-
plot2d_data(df, col_i=None)[source]¶ Scatter plot data points, colored according to the clusters they belong to, highlighting the last formed cluster.
- Parameters
df (
DataFrame) – dataframe of the points containing a column indicating clusters.col_i (
Optional[int]) – if not None, label of the last formed cluster.
- Return type
None
-
plot2d_graph(graph, print_clust=True)[source]¶ Draw the graph of the input dataset, colored according to the clusters.
- Parameters
graph (
Graph) – kNN graph.print_clust (
bool) – if True, print the cardinality of each cluster.
- Return type
None
-
pre_part_graph(graph, k, df=None, verbose=True, plotting=False)[source]¶ Partition the input graph into k clusters.
- Parameters
graph (
Graph) – kNN graph.k (
int) – final number of partitions.df (
Optional[DataFrame]) – if not None, input dataframe, containing points coordinates and clustering labels.verbose (
bool) – if True, print infos.plotting (
bool) – if True, plot the points colored by cluster.
- Return type
Graph- Returns
partitioned graph.