java
I am trying to implement the above community detection algorithm in Java, and while I have access to C++ code, and the original paper - I can't make it work at all. My major issue is that I don't understand the purpose of the code - ie how the algorithm works. In practical terms, my code gets stuck in what seems to be an infinite loop at mergeBestQ
, the list heap
seems to be getting larger on each iteration (as I would expect from the code), but the value of topQ
is always returning the same value.
The graph I am testing this on is quite large (300,000 nodes, 650,000 edges). The original code I am using for my implementation is from the SNAP library (https://github.com/snap-stanford/snap/blob/master/snap-core/cmty.cpp). What would be great is if someone could explain to me the intuition of the algorithm, it seems to be initially setting each node to be in it's own community, then recording the modularity value (whatever that is) of each pair of connected nodes in the graph, then finding the pair of nodes which have the highest modularity and moving them to the same community. In addition, if someone could provide some mid level pseudo code, that would be great. Here is my implementation thus far, I have tried to keep it in one file for the sake of brevity, however CommunityGraph and CommunityNode are elsewhere (should not be required). Graph maintain a list of all nodes and each node maintains a list of its connections to other nodes. When running it never gets past the line while(this.mergeBestQ()){}
UPDATE - found several bugs in my code after a thorough review. The code now completes VERY quickly, but doesnt fully implement the algorithm, for example of the 300,000 nodes in the graph, it states there are approximately 299,000 communities (ie roughly 1 node per community). I have listed the updated code below. /// Clauset-Newman-Moore community detection method. /// At every step two communities that contribute maximum positive value to global modularity are merged. /// See: Finding community structure in very large networks, A. Clauset, MEJ Newman, C. Moore, 2004 public class CNMMCommunityMetric implements CommunityMetric{ private static class DoubleIntInt implements Comparable{ public double val1; public int val2; public int val3; DoubleIntInt(double val1, int val2, int val3){ this.val1 = val1; this.val2 = val2; this.val3 = val3; }
@Override
public int compareTo(DoubleIntInt o) {
//int this_sum = this.val2 + this.val3;
//int oth_sum = o.val2 + o.val3;
if(this.equals(o)){
return 0;
}
else if(val1 < o.val1 || (val1 == o.val1 && val2 < o.val2) || (val1 == o.val1 && val2 == o.val2 && val3 < o.val3)){
return 1;
}
else{
return -1;
}
//return this.val1 < o.val1 ? 1 : (this.val1 > o.val1 ? -1 : this_sum - oth_sum);
}
@Override
public boolean equals(Object o){
return this.val2 == ((DoubleIntInt)o).val2 && this.val3 == ((DoubleIntInt)o).val3;
}
@Override
public int hashCode() {
int hash = 3;
hash = 79 * hash + this.val2;
hash = 79 * hash + this.val3;
return hash;
}
}
private static class CommunityData {
double DegFrac;
TIntDoubleHashMap nodeToQ = new TIntDoubleHashMap();
int maxQId;
CommunityData(){
maxQId = -1;
}
CommunityData(double nodeDegFrac, int outDeg){
DegFrac = nodeDegFrac;
maxQId = -1;
}
void addQ(int NId, double Q) {
nodeToQ.put(NId, Q);
if (maxQId == -1 || nodeToQ.get(maxQId) < Q) {
maxQId = NId;
}
}
void updateMaxQ() {
maxQId=-1;
int[] nodeIDs = nodeToQ.keys();
double maxQ = nodeToQ.get(maxQId);
for(int i = 0; i < nodeIDs.length; i++){
int id = nodeIDs[i];
if(maxQId == -1 || maxQ < nodeToQ.get(id)){
maxQId = id;
maxQ = nodeToQ.get(maxQId);
}
}
}
void delLink(int K) {
int NId=getMxQNId();
nodeToQ.remove(K);
if (NId == K) {
updateMaxQ();
}
}
int getMxQNId() {
return maxQId;
}
double getMxQ() {
return nodeToQ.get(maxQId);
}
};
private TIntObjectHashMap<CommunityData> communityData = new TIntObjectHashMap<CommunityData>();
private TreeSet<DoubleIntInt> heap = new TreeSet<DoubleIntInt>();
private HashMap<DoubleIntInt,DoubleIntInt> set = new HashMap<DoubleIntInt,DoubleIntInt>();
private double Q = 0.0;
private UnionFind uf = new UnionFind();
@Override
public double getCommunities(CommunityGraph graph) {
init(graph);
//CNMMCommunityMetric metric = new CNMMCommunityMetric();
//metric.getCommunities(graph);
// maximize modularity
while (this.mergeBestQ(graph)) {
}
// reconstruct communities
HashMap<Integer, ArrayList<Integer>> IdCmtyH = new HashMap<Integer, ArrayList<Integer>>();
Iterator<CommunityNode> ns = graph.getNodes();
int community = 0;
TIntIntHashMap communities = new TIntIntHashMap();
while(ns.hasNext()){
CommunityNode n = ns.next();
int r = uf.find(n);
if(!communities.contains(r)){
communities.put(r, community++);
}
n.setCommunity(communities.get(r));
}
System.exit(0);
return this.Q;
}
private void init(Graph graph) {
double M = 0.5/graph.getEdgesList().size();
Iterator<Node> ns = graph.getNodes();
while(ns.hasNext()){
Node n = ns.next();
uf.add(n);
int edges = n.getEdgesList().size();
if(edges == 0){
continue;
}
CommunityData dat = new CommunityData(M * edges, edges);
communityData.put(n.getId(), dat);
Iterator<Edge> es = n.getConnections();
while(es.hasNext()){
Edge e = es.next();
Node dest = e.getStart() == n ? e.getEnd() : e.getStart();
double dstMod = 2 * M * (1.0 - edges * dest.getEdgesList().size() * M);//(1 / (2 * M)) - ((n.getEdgesList().size() * dest.getEdgesList().size()) / ((2 * M) * (2 * M)));// * (1.0 - edges * dest.getEdgesList().size() * M);
dat.addQ(dest.getId(), dstMod);
}
Q += -1.0 * (edges*M) * (edges*M);
if(n.getId() < dat.getMxQNId()){
addToHeap(createEdge(dat.getMxQ(), n.getId(), dat.getMxQNId()));
}
}
}
void addToHeap(DoubleIntInt o){
heap.add(o);
}
DoubleIntInt createEdge(double val1, int val2, int val3){
DoubleIntInt n = new DoubleIntInt(val1, val2, val3);
if(set.containsKey(n)){
DoubleIntInt n1 = set.get(n);
heap.remove(n1);
if(n1.val1 < val1){
n1.val1 = val1;
}
n = n1;
}
else{
set.put(n, n);
}
return n;
}
void removeFromHeap(Collection<DoubleIntInt> col, DoubleIntInt o){
//set.remove(o);
col.remove(o);
}
DoubleIntInt findMxQEdge() {
while (true) {
if (heap.isEmpty()) {
break;
}
DoubleIntInt topQ = heap.first();
removeFromHeap(heap, topQ);
//heap.remove(topQ);
if (!communityData.containsKey(topQ.val2) || ! communityData.containsKey(topQ.val3)) {
continue;
}
if (topQ.val1 != communityData.get(topQ.val2).getMxQ() && topQ.val1 != communityData.get(topQ.val3).getMxQ()) {
continue;
}
return topQ;
}
return new DoubleIntInt(-1.0, -1, -1);
}
boolean mergeBestQ(Graph graph) {
DoubleIntInt topQ = findMxQEdge();
if (topQ.val1 <= 0.0) {
return false;
}
// joint communities
int i = topQ.val3;
int j = topQ.val2;
uf.union(i, j);
Q += topQ.val1;
CommunityData datJ = communityData.get(j);
CommunityData datI = communityData.get(i);
datI.delLink(j);
datJ.delLink(i);
int[] datJData = datJ.nodeToQ.keys();
for(int _k = 0; _k < datJData.length; _k++){
int k = datJData[_k];
CommunityData datK = communityData.get(k);
double newQ = datJ.nodeToQ.get(k);
//if(datJ.nodeToQ.containsKey(i)){
// newQ = datJ.nodeToQ.get(i);
//}
if (datI.nodeToQ.containsKey(k)) {
newQ = newQ + datI.nodeToQ.get(k);
datK.delLink(i);
} // K connected to I and J
else {
newQ = newQ - 2 * datI.DegFrac * datK.DegFrac;
} // K connected to J not I
datJ.addQ(k, newQ);
datK.addQ(j, newQ);
addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
}
int[] datIData = datI.nodeToQ.keys();
for(int _k = 0; _k < datIData.length; _k++){
int k = datIData[_k];
if (!datJ.nodeToQ.containsKey(k)) { // K connected to I not J
CommunityData datK = communityData.get(k);
double newQ = datI.nodeToQ.get(k) - 2 * datJ.DegFrac * datK.DegFrac;
datJ.addQ(k, newQ);
datK.delLink(i);
datK.addQ(j, newQ);
addToHeap(createEdge(newQ, Math.min(j, k), Math.max(j, k)));
}
}
datJ.DegFrac += datI.DegFrac;
if (datJ.nodeToQ.isEmpty()) {
communityData.remove(j);
} // isolated community (done)
communityData.remove(i);
return true;
}
}
UPDATE:the currently listed code is fairly quick, and has half the memory usage compared to the "quickest" solution, while only being ~5% slower. the difference is in the use of hashmap + treest vs priority queue, and ensuring only a single object for a given i, j pair exists at any time.
So here's the original paper, a neat lil' six pages, only two of which are about the design & implementation. Here's a cliffnotes:
Q
, of the partition to be the ratio of the number of edges within each community to the number of edges between each community, minus the ratio you'd expect from a completely random partition. i
and j
of a partition, they then define deltaQ_ij
to be how much the modularity of the partition would change if communities i
and j
were merged. So if deltaQ_ij > 0
, merging i
and j
will improve the modularity of the partition. deltaQ_ij
for every pair of communities. Whichever two communities i, j
have the largest deltaQ_ij
, merge those two. Repeat. deltaQ_ij
all turn negative, but in the paper the authors let the algorithm run until there's only one community left. That's pretty much it for understanding the algorithm. The details are in how to compute deltaQ_ij
quickly and store the information efficiently.
Edit: Data structure time!
So first off, I think the implementation you're referencing does things a different way to the paper. I'm not quite sure how, because the code is impenetrable, but it seems to use union-find and hashsets in place of the author's binary trees and multiple heaps. Not a clue why they do it a different way. You might want to email the guy who wrote it and ask.
Anyway, the algorithm in the paper needs several things from the format deltaQ
is stored in:
dQ
quickly. deltaQ_ik
and deltaQ_ki
for a fixed i
quickly. deltaQ_kj
and deltaQ_jk
for a fixed j
quickly. The solution the authors come up to for this is as follows:
i
, each non-zero deltaQ_ik
is stored in a balanced binary tree, indexed by k
(so elements can be found easily), and in a heap (so the maximum for that community can be found easily). deltaQ_ik
from each community i
's heap is then stored in another heap, so that the overall maximums can be found easily. When community i
is merged with community j
, several things happen to the binary trees:
i
th community is added to the j
th community's binary tree. If an element with the same index k
already exists, you sum the old and new values. j
th community's binary tree to reflect the fact that the j
th community has just increased in size. k
, we update any deltaQ_kj
. i
is thrown away. And similarly, several things must happen to the heaps:
i
is thrown away. j
is rebuilt from scratch using the elements from the community's balanced binary tree. k
's heap, the position of entry deltaQ_kj
is updated. i
in the overall heap is thrown away (causing bubbling) and the entries for community j
and each community k
connected to i
or j
are updated. Strangely, when two communities are merged there's no reference in the paper as to removing deltaQ_ki
values from the k
th community's heap or tree. I think this might be dealt with by the setting of a_i = 0
, but I don't understand the algorithm well enough to be sure.
Edit: Trying to decipher the implementation you linked. Their primary datastructures are
CmtyIdUF
, a union-find structure that keeps track of which nodes are in which community (something that's neglected in the paper, but seems necessary unless you want to reconstruct community membership from a trace of the merge or something), MxQHeap
, a heap to keep track of which deltaQ_ij
is largest overall. Strangely, when they update the value of a TFltIntIntTr
in the heap, they don't ask the heap to re-heapify itself. This is worrying. Does it do it automatically or something? CmtyQH
, a hashmap that maps a community ID i
to a structure TCmtyDat
which holds what looks heap of the deltaQ_ik
for that community. I think. Strangely though, the UpdateMaxQ
of the TCmtyDat
structure takes linear time, obviating any need for a heap. What's more, the UpdateMaxQ
method only appears to be called when an element of the heap is deleted. It should definitely also be getting called when the value of any element in the heap is updated. 上一篇: 度量评估两个社区/子图之间的互联性
下一篇: java的