clustviz.chameleon package¶

Submodules¶

clustviz.chameleon.chameleon module¶

cluster(df, k, knn=10, m=30, alpha=2.0, verbose0=False, verbose1=False, verbose2=True, plot=True)[source]¶

Chameleon clustering: build the K-NN graph, partition it into m clusters

Parameters

df (DataFrame) – input dataframe.
k (int) – desired number of clusters.
knn (int) – parameter k of K-nearest_neighbors.
m (int) – number of clusters to reach in the initial clustering phase.
alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.
verbose0 (bool) – if True, print general infos.
verbose1 (bool) – if True, print infos about the prepartitioning phase.
verbose2 (bool) – if True, print labels of merging clusters and their scores in the merging phase.
plot (bool) – if True, show plots.

Return type

Tuple[DataFrame, OrderedDict]

Returns

dataframe with cluster labels and dictionary of merging scores (similarities).

internal_closeness(graph, cluster)[source]¶

Compute the internal closeness of the input cluster, i.e. the sum of weights of the edges fully contained in it.

Parameters

graph (Graph) – kNN graph.
cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

float

Returns

internal closeness of the input cluster.

internal_interconnectivity(graph, cluster)[source]¶

Compute the weighted sum of edges that partition the graph into two roughly equal parts.

Parameters

graph – kNN graph.
cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

float

Returns

sum of the bisection weights.

len_edges(graph, cluster)[source]¶

Compute the number of edges that interconnect the nodes of the input cluster.

Parameters

graph (Graph) – kNN graph.
cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

int

Returns

number of edges interconnecting the nodes of the input graph.

merge_best(graph, df, alpha, k, verbose=False, verbose2=True)[source]¶

Find the two clusters with the highest score and merge them.

Parameters

graph – kNN graph.
df – input dataframe.
alpha – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.
k – desired number of clusters.
verbose – if True, print additional infos.
verbose2 – if True, print labels of merging clusters and their score.

Return type

Union[Tuple[DataFrame, float, int], bool]

Returns

input dataframe with clustering label column, maximum merging score and newly merged cluster label.

merge_score(graph, cluster_i, cluster_j, alpha)[source]¶

Compute the score associated with the merging of the two clusters.

Parameters

graph (Graph) – kNN graph.
cluster_i (List[int]) – first cluster.
cluster_j (List[int]) – second cluster.
alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.

Return type

float

Returns

merging score.

rebuild_labels(df)[source]¶

Clean the clustering labels of the input dataframe, i.e. bring them into the range starting from 1. For example, if they range from 6 to 10, bring them to the range 1 to 5.

Parameters: df (DataFrame) – dataframe obtained from the merging phase of the algorithm.
Return type: DataFrame
Returns: dataframe with cleaned labels.

relative_closeness(graph, cluster_i, cluster_j)[source]¶

Compute the relative closeness of two clusters of a graph, measured on the connecting edges.

Parameters

graph (Graph) – kNN graph.
cluster_i (List[int]) – first cluster.
cluster_j (List[int]) – second cluster.

Return type

float

Returns

relative closeness of the two input clusters.

relative_interconnectivity(graph, cluster_i, cluster_j)[source]¶

Compute the relative interconnectivity of two clusters of a graph, measured on the connecting edges.

Parameters

graph (Graph) – kNN graph.
cluster_i (List[int]) – first cluster.
cluster_j (List[int]) – second cluster.

Return type

float

Returns

relative interconnectivity of the two input clusters.

clustviz.chameleon.chameleon2 module¶

cluster2(df, k=None, knn=None, m=30, alpha=2.0, beta=1, m_fact=1000.0, verbose=False, verbose1=True, verbose2=True, plot=True, auto_extract=False)[source]¶

Parameters

df (DataFrame) – input dataframe.
k (Optional[int]) – desired number of clusters.
knn (Optional[int]) – parameter k of K-nearest_neighbors.
m (int) – number of clusters to reach in the initial clustering phase.
alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.
beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.
m_fact (float) – multiplicative factor for clusters composed of a single node.
verbose (bool) – if True, print general infos.
verbose1 (bool) – if True, print infos about the prepartitioning phase.
verbose2 (bool) – if True, print labels of merging clusters and their scores in the merging phase.
plot (bool) – if True, show plots.
auto_extract (bool) – if True, try to extract the optimal number of clusters and print it.

Return type

Tuple[DataFrame, Dict[int, float]]

Returns

dataframe with cluster labels and dictionary of merging scores (similarities).

connected_components(connected_points)[source]¶

Find connected components from a dictionary of connected nodes.

Parameters: connected_points (dict) – (symmetrically) connected points.
Return type: Generator
Returns: connected components.

dendrogram_height(merging_similarities, m)[source]¶

Find dendrogram height, defined with a recursive sum of the reciprocal of the merging scores.

Parameters

merging_similarities (Dict[int, float]) – merging scores of the algorithm.
m (int) – initial number of clusters.

Return type

Dict[int, float]

Returns

dendrogram height.

extract_optimal_n_clust(merging_similarities, m, f=1000, eta=2)[source]¶

Extract the optimal number of clusters using the dendrogram.

Parameters

merging_similarities (Dict[int, float]) – merging scores of the algorithm.
m (int) – initial number of clusters.
f (float) – threshold parameter to determine if jump is large enough.
eta (float) – decrease coefficient.

Return type

None

find_bigger_jump(dh, jump)[source]¶

Find a bigger jump in the dendrogram levels.

Parameters

dh (Dict[int, float]) – dendrogram height.
jump (float) – threshold to exceed.

Return type

float

Returns

best level where to cut off th dendrogram if found, else 0.

find_nearest_height(dh, value)[source]¶

Find nearest height to cutoff value.

Parameters

dh (Dict[int, float]) – dendrogram height.
value (float) – first_jump cutoff value.

Return type

int

Returns

nearest dendrogram height to cutoff value.

first_jump_cutoff(dh, mult, eta, m)[source]¶

Find the first large gap between tree level, which heuristically is the best level where clusters should be divided.

Parameters

dh (Dict[int, float]) – dendrogram height.
mult (float) – additional factor.
eta (float) – decrease coefficient.
m (int) – initial number of clusters.

Return type

float

Returns

best level where to cut off dendrogram.

flood_fill(preprocessed_graph, knn_graph, df)[source]¶

Find clusters composed by more than one connected component and divide them accordingly. Adjust the parameter m, which indicates the number of clusters to reach in the initial phase.

Parameters

preprocessed_graph (Graph) – clustered kNN graph.
knn_graph (Graph) – kNN graph.
df (DataFrame) – input dataframe.

Return type

Tuple[Graph, int]

Returns

preprocessed graph with updated cluster labels, new m parameter.

merge_best2(graph, df, alpha, beta, m_fact, k, verbose=False, verbose2=True)[source]¶

Find the two clusters with the highest score and merge them.

Parameters

graph (Graph) – kNN graph.
df (DataFrame) – input dataframe.
alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.
beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.
m_fact (float) – multiplicative factor for clusters composed of a single node.
k (int) – desired number of clusters.
verbose (bool) – if True, print additional infos.
verbose2 (bool) – if True, print labels of merging clusters and their score.

Return type

Union[Tuple[DataFrame, float, int], bool]

Returns

input dataframe with clustering label column, maximum merging score and newly merged cluster label.

merge_score2(graph, ci, cj, alpha, beta, m_fact)[source]¶

Compute the score associated with the merging of the two clusters.

Parameters

graph (Graph) – kNN graph.
ci (List[int]) – first cluster.
cj (List[int]) – second cluster.
alpha (float) – exponent of relative closeness; the larger, the more important relative closeness is than relative interconnectivity.
beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.
m_fact (float) – multiplicative factor for clusters composed of a single node.

Return type

float

Returns

merging score

prepro_edge(graph)[source]¶

Build a dictionary having points as keys and all the points that are symmetrically connected through edges to the key point as values, i.e. 0: [5, 7] means that there are edges 0->5 and 0->7, but also 5->0 and 7->0.

Parameters: graph (Graph) – kNN graph.
Return type: OrderedDict
Returns: dictionary of symmetrically connected points.

relative_closeness2(graph, cluster_i, cluster_j, m_fact)[source]¶

Compute the relative closeness of the two input clusters.

Parameters

graph (Graph) – kNN graph.
cluster_i (List[int]) – first cluster.
cluster_j (List[int]) – second cluster.
m_fact (float) – multiplicative factor for clusters composed of a single node.

Return type

float

Returns

relative closeness of the two input clusters.

relative_interconnectivity2(graph, cluster_i, cluster_j, beta)[source]¶

Compute the relative interconnectivity of the two input clusters.

Parameters

graph (Graph) – kNN graph.
cluster_i (List[int]) – first cluster.
cluster_j (List[int]) – second cluster.
beta (float) – exponent of the rho factor; the larger, the less encouraged the merging of clusters connected by a large number of edges relative to the number of edges inside the cluster.

Return type

float

Returns

relative interconnectivity of the two input clusters.

rho(graph, cluster_i, cluster_j)[source]¶

Compute the rho factor, which discourages the algorithm from merging clusters with different densities.

Parameters

graph (Graph) – kNN graph.
cluster_i (List[int]) – first cluster.
cluster_j (List[int]) – second cluster.

Return type

float

Returns

rho factor.

w_int_closeness(graph, cluster)[source]¶

Compute the internal closeness of the input cluster weighted by the number of its internal edges.

Parameters

graph (Graph) – kNN graph.
cluster (List[int]) – cluster represented by a list of nodes belonging to it.

Return type

float

Returns

weighted internal closeness.

clustviz.chameleon.graphtools module¶

bisection_weights(graph, cluster)[source]¶

Return the weights of the edges that connect nodes belonging to the two new partitions obtained through min-cut bisection.

Parameters

graph (Graph) – kNN graph.
cluster (List[int]) – list containing the label of the cluster whose nodes are to be found.

Return type

List[float]

Returns

weights of the edges connecting nodes belonging to the two new partitions.

connecting_edges(partitions, graph)[source]¶

Return only the edges that connect nodes of the first cluster with nodes of the second cluster, in the form of a list of tuples [(0, 5), (3, 5)] (e.g. the only connecting-edges are the ones connecting node_0 to node_5 and node_3 to node_5.

Parameters

partitions (Tuple[list, list]) – tuple composed of two lists containing the nodes of the two clusters to be examined.
graph (Graph) – kNN graph.

Return type

List[Tuple]

Returns

edges connecting the two input clusters.

get_cluster(graph, cluster_label)[source]¶

Find nodes belonging to a specific cluster.

Parameters

graph (Graph) – kNN graph.
cluster_label (int) – label of the cluster whose nodes are to be found.

Return type

List[int]

Returns

nodes of the graph belonging to the input cluster.

get_weights(graph, edges)[source]¶

Get the weights associated to the input edges.

Parameters

graph (Graph) – kNN graph.
edges (List[Tuple]) – edges whose weight is to computed.

Return type

List[float]

Returns

weights of the input edges.

knn_graph(df, k, symmetrical, verbose=False)[source]¶

Return the weighted (symmetrical) K-NearestNeighbor graph of input dataframe.

Parameters

df (DataFrame) – input dataset.
k (int) – k of kNN, i.e. number of nearest neighbors to consider.
symmetrical (bool) – if True, return the weighted symmetrical K-NearestNeighbor graph.
verbose (bool) – if True, print infos.

Return type

Graph

Returns

weighted (symmetrical) K-NearestNeighbor graph.

min_cut_bisector(graph)[source]¶

Return the edges that connect nodes belonging to the two new partitions obtained through min-cut bisection.

Parameters: graph (Graph) – kNN graph.
Return type: List[Tuple]
Returns: edges that connect nodes belonging to the two newly formed partitions.

part_graph(graph, df=None)[source]¶

Return the input graph with the clustering obtained through mincut-bisection.

Parameters

graph (Graph) – kNN graph.
df (Optional[DataFrame]) – if not None, input dataframe, containing points coordinates and clustering labels.

Return type

Graph

Returns

partitioned graph.

plot2d_data(df, col_i=None)[source]¶

Scatter plot data points, colored according to the clusters they belong to, highlighting the last formed cluster.

Parameters

df (DataFrame) – dataframe of the points containing a column indicating clusters.
col_i (Optional[int]) – if not None, label of the last formed cluster.

Return type

None

plot2d_graph(graph, print_clust=True)[source]¶

Draw the graph of the input dataset, colored according to the clusters.

Parameters

graph (Graph) – kNN graph.
print_clust (bool) – if True, print the cardinality of each cluster.

Return type

None

pre_part_graph(graph, k, df=None, verbose=True, plotting=False)[source]¶

Partition the input graph into k clusters.

Parameters

graph (Graph) – kNN graph.
k (int) – final number of partitions.
df (Optional[DataFrame]) – if not None, input dataframe, containing points coordinates and clustering labels.
verbose (bool) – if True, print infos.
plotting (bool) – if True, plot the points colored by cluster.

Return type

Graph

Returns

partitioned graph.

clustviz.chameleon package¶

Submodules¶

clustviz.chameleon.chameleon module¶

clustviz.chameleon.chameleon2 module¶

clustviz.chameleon.graphtools module¶

Module contents¶