Storing data about undirected graph edges in a database
I have a data structure that is an undirected graph (probably dense) with nodes and edges. The edge between nodes I and J will have additional data associated with it and I want to be able to uniquely identify that edge when querying and to be able to quickly determine if an edge between I and J exists.
I have decided to accomplish this using two tables:
Table Nodes
-----------
node_id PK
...
(additional fields)
Table Edges
-----------
nodes_hash(node_id, node_id) PK
edge_thickness
...
(additional fields)
where the primary key for every edge will be computed by a hash function nodes_hash(node_id, node_id)
taking two node IDs.
My questions:
There's no reason why you should need to encode edges as hashes: ensuring that there are no duplicates is a simple matter of a unique constraint.
While there are ways of calculating hashes (creating a path enumeration string, and calculating its MD5 hash or something along those lines), there's no value in storing the hash and not use a path enumeration scheme for storing the graph data itself. Path enumeration works exactly like it does in the filesystem, storing something like /a/b/c
(if enumerating by node id) or 1.2.1.5
(if enumerating by edge ordinal).
For your specific usecase I would use a common adjacency list table setup (unless other operations in the tree call for more specialized data structures such as node sets or path/edge enumeration). In this structure, the top-level node has PARENT_ID = NULL
:
CREATE TABLE NODES(
NODE_ID INT NOT NULL,
-- OTHER NODE ATTRIBUTES
PRIMARY KEY (NODE_ID)
);
CREATE TABLE EDGES(
NODE_ID INT NOT NULL,
PARENT_ID INT,
EDGE_WEIGHT INT NOT NULL,
-- OTHER EDGE ATTRIBUTES
FOREIGN KEY (NODE_ID) REFERENCES NODES(NODE_ID),
FOREIGN KEY (PARENT_ID) REFERENCES NODES(NODE_ID),
CHECK (PARENT_ID <> NODE_ID), -- This avoids simple cycles
CONSTRAINT UNIQ_EDGE UNIQUE (NODE_ID,PARENT_ID) -- This avoids duplicate edges
);
Why would use use a hash function to generate such a key?
I would have an integer primary key (node id) for the nodes.
I would have an integer primary key (edge id) for the edges.
Then I would have two columns in the edge table for each node, with foreign key relationships to back to the node table.
I can readily think of two drawbacks to using a hash function. First, you can have collisions. In fact, if everything is stored as integers, then you will have collisions just by the pigeon hole principle.
Second, you need to store the node ids anyway. Almost any graph problem that I can think of needs to know the nodes connected by edges.
You would handle an undirected graph is the following constraints/triggers:
(1) Add a contraint/trigger that node1 < node 2 (or <= if self-connection is allowed).
(2) Add a unique index on node1, node2.
(3) Add a before-insert trigger to be sure that for arbitrary values, node1 gets the lesser value and node2 gets the larger value.
In Oracle, you can combine these by having a function-based index on least(node1, node2) and the greatest(node1, node).
链接地址: http://www.djcxy.com/p/52376.html上一篇: 具有恒定时间复杂度的无向图边的数据结构
下一篇: 将关于无向图边的数据存储在数据库中