Previous Up Next

This HTML version of Think Complexity, 2nd Edition is provided for convenience, but it is not the best format of the book. In particular, some of the symbols are not rendered correctly.

You might prefer to read the PDF version.

Chapter 2  Graphs

The first three chapters of this book are about models that describe systems that are made up of components and connections between components. For example, in a social network, the components are people and connections represent friendships, business relationships, etc. In an ecological food web, the components are species and the connections represent predator-prey relationships.

In this chapter, I introduce NetworkX, a Python package for building and studying these models. We start with the Erdős-Rényi model, which has interesting mathematical properties. In the next chapter we move on to models that are more useful for explaining real-world systems.

The code for this chapter is in chap02.ipynb in the repository for this book. More information about working with the code is in Section ??.

2.1  What is a graph?


Figure 2.1: A directed graph that represents a social network.

To most people a “graph" is a visual representation of data, like a bar chart or a plot of stock prices over time. That’s not what this chapter is about.

In this chapter, a graph is a representation of a system that contains discrete, interconnected elements. The elements are represented by nodes and the interconnections are represented by edges.

For example, you could represent a road map with a node for each city and an edge for each road between cities. Or you could represent a social network using a node for each person, with an edge between two people if they are friends and no edge otherwise.

In some graphs, edges have attributes like length, cost, or weight. For example, in a road map, the length of an edge might represent the distance between two cities, or the travel time. In a social network there might be different kinds of edges to represent different kinds of relationships: friends, business associates, etc.

Edges may be directed or undirected, depending on whether the relationships they represent are asymmetric or symmetric. In a road map, you might represent a one-way street with a directed edge and a two-way street with an undirected edge. In some social networks, like Facebook, friendship is symmetric: if A is friends with B then B is friends with A. But on Twitter, for example, the “follows” relationship is not symmetric; if A follows B, that doesn’t imply that B follows A. So you might use undirected edges to represent a Facebook network and directed edges for Twitter.

Graphs have interesting mathematical properties, and there is a branch of mathematics called graph theory that studies them.

Graphs are also useful, because there are many real world problems that can be solved using graph algorithms. For example, Dijkstra’s shortest path algorithm is an efficient way to find the shortest path from a node to all other nodes in a graph. A path is a sequence of nodes with an edge between each consecutive pair.

Graphs are usually drawn with squares or circles for nodes and lines for edges. For example, the directed graph in Figure ?? might represent three people who follow each other on Twitter. The thick part of the line indicates edge direction. In this example, Alice and Bob follow each other and both follow Chuck, but Chuck follows no one.

The undirected graph in Figure ?? shows four cities in the north-east United States; the labels on the edges indicate driving time in hours. In this example the placement of the nodes corresponds roughly to the geography of the cities, but in general the layout of a graph is arbitrary.

2.2  NetworkX


Figure 2.2: An undirected graph that represents cities and highways.

To represent graphs, we’ll use a package called NetworkX, which is the most commonly used network library in Python. You can read more about it at https://networkx.github.io/, but I’ll explain it as we go along.

We can create a directed graph by importing NetworkX (usually imported as nx) and instantiating nx.DiGraph:

import networkx as nx G = nx.DiGraph()

At this point, G is a DiGraph object that contains no nodes and no edges. We can add nodes using the add_node method:

G.add_node('Alice') G.add_node('Bob') G.add_node('Chuck')

Now we can use the nodes method to get the list of nodes:

>>> G.nodes() ['Alice', 'Bob', 'Chuck']

Adding edges works pretty much the same way:

G.add_edge('Alice', 'Bob') G.add_edge('Alice', 'Chuck') G.add_edge('Bob', 'Alice') G.add_edge('Bob', 'Chuck')

And we can use edges to get the list of edges:

>>> G.edges() [('Alice', 'Bob'), ('Alice', 'Chuck'), ('Bob', 'Alice'), ('Bob', 'Chuck')]

NetworkX provides several functions for drawing graphs; draw_circular arranges the nodes in a circle and connects them with edges:

nx.draw_circular(G, node_color=COLORS[0], node_size=2000, with_labels=True)

And that’s the code I use to generate Figure ??. The option with_labels causes the nodes to be labeled; in the next example we’ll see how to label the edges.

To generate Figure ??, I start with a dictionary that maps from each city name to its approximate longitude and latitude:

pos = dict(Albany=(-74, 43), Boston=(-71, 42), NYC=(-74, 41), Philly=(-75, 40))

Since this is an undirected graph, I instantiate nx.Graph:

G = nx.Graph()

Then I can use add_nodes_from to iterate the keys of pos and add them as nodes:

G.add_nodes_from(pos)

Next I’ll make a dictionary that maps from each edge to the corresponding driving time:

drive_times = {('Albany', 'Boston'): 3, ('Albany', 'NYC'): 4, ('Boston', 'NYC'): 4, ('NYC', 'Philly'): 2}

Now I can use add_edges_from, which iterates the keys of drive_times and adds them as edges:

G.add_edges_from(drive_times)

Now instead of draw_circular, which arranges the nodes in a circle, I’ll use draw, which takes pos as the second parameter:

nx.draw(G, pos, node_color=COLORS[1], node_shape='s', node_size=2500, with_labels=True)

pos is a dictionary that maps from each city to its coordinates; draw uses it to determine the locations of the nodes.

To add the edge labels, we use draw_networkx_edge_labels:

nx.draw_networkx_edge_labels(G, pos, edge_labels=drive_times)

drive_times is a dictionary that maps from each edge, represented by a pair of city names, to the driving distance between them. And that’s how I generated Figure ??.

In both of these examples, the nodes are strings, but in general they can be any hashable type.

2.3  Random graphs

A random graph is just what it sounds like: a graph with nodes and edges generated at random. Of course, there are many random processes that can generate graphs, so there are many kinds of random graphs.

One of the more interesting kinds is the Erdős-Rényi model, studied by Paul Erdős and Alfréd Rényi in the 1960s.

An Erdős-Rényi graph (ER graph) is characterized by two parameters: n is the number of nodes and p is the probability that there is an edge between any two nodes. See http://en.wikipedia.org/wiki/Erdos-Renyi_model.

Erdős and Rényi studied the properties of these random graphs; one of their surprising results is the existence of abrupt changes in the properties of random graphs as random edges are added.

One of the properties that displays this kind of transition is connectivity. An undirected graph is connected if there is a path from every node to every other node.

In an ER graph, the probability that the graph is connected is very low when p is small and nearly 1 when p is large. Between these two regimes, there is a rapid transition at a particular value of p, denoted p*.

Erdős and Rényi showed that this critical value is p* = lnn / n, where n is the number of nodes. A random graph, G(n, p), is unlikely to be connected if p < p* and very likely to be connected if p > p*.

To test this claim, we’ll develop algorithms to generate random graphs and check whether they are connected.

2.4  Generating graphs


Figure 2.3: A complete graph with 10 nodes.

I’ll start by generating a complete graph, which is a graph where every node is connected to every other.

Here’s a generator function that takes a list of nodes and enumerates all distinct pairs. If you are not familiar with generator functions, you can read about them at http://intermediatepythonista.com/python-generators.

def all_pairs(nodes): for i, u in enumerate(nodes): for j, v in enumerate(nodes): if i>j: yield u, v

We can use all_pairs to construct a complete graph:

def make_complete_graph(n): G = nx.Graph() nodes = range(n) G.add_nodes_from(nodes) G.add_edges_from(all_pairs(nodes)) return G

make_complete_graph takes the number of nodes, n, and returns a new Graph with n nodes and edges between all pairs of nodes.

The following code makes a complete graph with 10 nodes and draws it.

complete = make_complete_graph(10) nx.draw_circular(complete, node_color=COLORS[2], node_size=1000, with_labels=True)

Figure ?? shows the result. Soon we will modify this code to generate ER graphs, but first we’ll develop functions to check whether a graph is connected.

2.5  Connected graphs

A graph is connected if there is a path from every node to every other node (see http://en.wikipedia.org/wiki/Connectivity_(graph_theory)).

For many applications involving graphs, it is useful to check whether a graph is connected. Fortunately, there is a simple algorithm that does it.

You can start at any node and check whether you can reach all other nodes. If you can reach a node, v, you can reach any of the neighbors of v, which is any node connected by v by an edge.

The Graph class provides a method called neighbors that returns a list of neighbors for a given node. For example, in the complete graph we generated in the previous section:

>>> complete.neighbors(0) [1, 2, 3, 4, 5, 6, 7, 8, 9]

Suppose we start at node s. We can mark s as “seen”, then we can mark its neighbors. Then we mark the neighbor’s neighbors, and so on, until we can’t reach any more nodes. If all nodes are seen, the graph is connected.

Here’s what that looks like in Python:

def reachable_nodes(G, start): seen = set() stack = [start] while stack: node = stack.pop() if node not in seen: seen.add(node) stack.extend(G.neighbors(node)) return seen

reachable_nodes takes a Graph and a starting node, start, and returns the set of nodes that can be reached from start.

Initially the set, seen, is empty, and we create a list called stack that keeps track of nodes we have discovered but not yet processed. Initially the stack contains a single node, start.

Now, each time through the loop, we

  1. Remove one node from the stack.
  2. If the node is already in seen, we go back to Step 1.
  3. Otherwise, we add the node to seen and add its neighbors to the stack.

When the stack is empty, we can’t reach any more nodes, so we break out of the loop and return seen.

As an example, we can find all nodes in the complete graph that are reachable from node 0:

>>> reachable_nodes(complete, 0) {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Initially, the stack contains node 0 and seen is empty. The first time through the loop, node 0 is added to seen and all the other nodes are added to the stack (since they are all neighbors of node 0).

The next time through the loop, pop returns the last element in the stack, which is node 9. So node 9 gets added to seen and its neighbors get added to the stack.

Notice that the same node can appear more than once in the stack; in fact, a node with k neighbors will be added to the stack k times. Later we will look for ways to make this algorithm more efficient.

We can use reachable_nodes to write is_connected:

def is_connected(G): start = next(G.nodes_iter()) reachable = reachable_nodes(G, start) return len(reachable) == len(G)

is_connected chooses a starting node by calling nodes_iter, which returns an iterator object, and passing the result to next, which returns the first node.

reachable gets the set of nodes that can be reached from start. If the size of this set is the same as the size of the graph, that means we can reach all nodes, which means the graph is connected.

A complete graph is, not surprisingly, connected:

>>> is_connected(complete) True

In the next section we will generate ER graphs and check whether they are connected.

2.6  Generating ER graphs


Figure 2.4: An ER graph with n=10 and p=0.3.

The ER graph G(n, p) contains n nodes, and each pair of nodes is connected by an edge with probability p. Generating an ER graph is similar to generating a complete graph.

The following generator function enumerates all possible edges and uses a helper function, flip, to choose which ones should be added to the graph:

def random_pairs(nodes, p): for i, u in enumerate(nodes): for j, v in enumerate(nodes): if i>j and flip(p): yield u, v

flip returns True with the given probability, p, and False with the complementary probability 1-p:

from numpy.random import random def flip(p): return random() < p

Finally, make_random_graph generates and returns the ER graph G(n, p).

def make_random_graph(n, p): G = nx.Graph() nodes = range(n) G.add_nodes_from(nodes) G.add_edges_from(random_pairs(nodes, p)) return G

make_random_graph is almost identical to make_complete_graph; the only difference is that it uses random_pairs instead of all_pairs.

Here’s an example with p=0.3:

random_graph = make_random_graph(10, 0.3)

Figure ?? shows the result. This graph turns out to be connected; in fact, most ER graphs with n=10 and p=0.3 are connected. In the next section, we’ll see how many.

2.7  Probability of connectivity


Figure 2.5: Probability of connectivity with n=10 and a range of p. The vertical line shows the predicted critical value.


Figure 2.6: Probability of connectivity for several values of n and a range of p.

For given values of n and p, we would like to know the probability that G(n, p) is connected. We can estimate it by generating a large number of random graphs and counting how many are connected. Here’s how:

def prob_connected(n, p, iters=100): count = 0 for i in range(iters): random_graph = make_random_graph(n, p) if is_connected(random_graph): count += 1 return count/iters

iters is the number of random graphs we generate. As we increase iters, the estimated probability gets more precise1.

>>> prob_connected(10, 0.3, iters=10000) 0.6454

Out of 10000 ER graphs with these parameters, 6498 are connected, so we estimate that 65% of them are connected. I chose 0.3 because it is close to the critical value where the probability of connectivity goes from near 0 to near 1. According to Erdős and Rényi, p* = lnn / n = 0.23.

We can get a clearer view of the transition by estimating the probability of connectivity for a range of values of p:

import numpy as np n = 10 ps = np.logspace(-2.5, 0, 11) ys = [prob_connected(n, p) for p in ps]

This is the first example we’ve seen using NumPy. Following convention, I import NumPy as np. The function logspace returns an array of 11 values from 10−2.5 to 100 = 1, equally spaced on a logarithmic scale.

To compute ys, I use a list comprehension that iterates the elements of ps and computes the probability that a random graph with each value of p is connected.

Figure ?? shows the results, with a vertical line at p*. The transition from 0 to 1 occurs near the predicted critical value, 0.23. With p on a log scale, the transition is roughly symmetric.

Figure ?? shows similar results for larger values of n. As n increases, the critical value gets smaller and the transition gets more abrupt.

These experiments are consistent with the results Erdős and Rényi proved in their papers.

2.8  Analysis of graph algorithms

In this chapter I presented an algorithm for checking whether a graph is connected; in the next few chapters, we will see other graph algorithms. And we will analyze the performance of those algorithms, figuring out how their run times grow as the size of the graphs increases.

If you are not already familiar with analysis of algorithms, you should read Appendix ?? before you continue.

The order of growth for graph algorithms is usually expressed as a function of n, the number of vertices, and m, the number of edges.

As an example, let’s analyze reachable_nodes from Section ??:

def reachable_nodes(G, start): seen = set() stack = [start] while stack: node = stack.pop() if node not in seen: seen.add(node) stack.extend(G.neighbors(node)) return seen

Each time through the loop, we pop a node off the stack; by default, pop removes and returns the last element of a list, which is a constant time operation.

Next we check whether the node is in seen, which is a set, so checking membership is constant time.

If the node is not already in seen, we add it, which is constant time, and then add the neighbors to the stack, which is linear in the number of neighbors.

To express the run time in terms of n and m, we can add up the total number of times each node is added seen and stack.

Each node is only added to seen once, so the total number of additions is n.

But nodes might be added to stack many times, depending on how many neighbors they have. If a node has k neighbors, it is added to stack k times. Of course, if it has k neighbors, that means it is connected to k edges.

So the total number of additions to stack is the total number of edges, m, doubled because we consider every edge twice.

Therefore, the order of growth for this function is O(n + m), which is a convenient way to say that the run time grows in proportion to either n or m, whichever is bigger.

If we know the relationship between n and m, we can simplify this expression. For example, in a complete graph the number of edges is n(n−1)/2, which is in O(n2). So for a complete graph, reachable_nodes is quadratic in n.

2.9  Exercises

The code for this chapter is in chap02.ipynb, which is a Jupyter notebook in the repository for this book. For more information about working with this code, see Section ??.

Exercise 1   Launch chap02.ipynb and run the code. There are a few short exercises embedded in the notebook that you might want to try.
Exercise 2   In Section ?? we analyzed the performance of reachable_nodes and classified it in O(n + m), where n is the number of nodes and m is the number of edges. Continuing the analysis, what is the order of growth for is_connected?
def is_connected(G): start = next(G.nodes_iter()) reachable = reachable_nodes(G, start) return len(reachable) == len(G)
Exercise 3   In my implementation of reachable_nodes, you might be bothered by the apparent inefficiency of adding all neighbors to the stack without checking whether they are already in seen. Write a version of this function that checks the neighbors before adding them to the stack. Does this “optimization” change the order of growth? Does it make the function faster?
Exercise 4  

There are actually two kinds of ER graphs. The one we generated in this chapter, G(n, p), is characterized by two parameters, the number of nodes and the probability of an edge between nodes.

An alternative definition, denoted G(n, m), is also characterized by two parameters: the number of nodes, n, and the number of edges, m. Under this definition, the number of edges is fixed, but their location is random.

Repeat the experiments we did in this chapter using this alternative definition. Here are a few suggestions for how to proceed:

1. Write a function called m_pairs that takes a list of nodes and the number of edges, m, and returns a random selection of m edges. A simple way to do that is to generate a list of all possible edges and use random.sample.

2. Write a function called make_m_graph that takes n and m and returns a random graph with n nodes and m edges.

3. Make a version of prob_connected that uses make_m_graph instead of make_random_graph.

4. Compute the probability of connectivity for a range of values of m.

How do the results of this experiment compare to the results using the first type of ER graph?


1
Since count and iters are integers, this function won’t work correctly in Python 2 unless you import division from __future__. See https://www.python.org/dev/peps/pep-0238/

Are you using one of our books in a class?

We'd like to know about it. Please consider filling out this short survey.


Think DSP

Think Java

Think Bayes

Think Python 2e

Think Stats 2e

Think Complexity


Previous Up Next