blog | news | projects | notebooks | politics | github

Visualizing Hubway Data

16 Sep 2014

Last updated: 24 Jan 2016, 11:25AM

import pandas as pd
import numpy as np
import networkx as nx

There's a whole bunch of hubway data available over at the Hubway Data Challenge website.

There's already been a bunch of visualizations submitted, especially since the submission date was way back in 2012 (I think).

I decided I'd try my hand at some network visualization of this data, especially since sigma.js seems like such a useful visualization tool.

I also wanted to see if a community detection algorithm on the graph would discover anything interesting (or otherwise obvious) about the communiting patterns of hubway users.

The pipeline looks like this:

  1. Read dataset from the hubway site into pandas

  2. Create a networkx directed graph representation

  3. Run community detection

  4. Export it to .gexf

  5. Play around with the graph in gephi until something nice appears

  6. Export THAT graph to .gexf, then play around with sigma.js until something else nice appears.

Okay, here we go:

First, import all the station and trip data from hubway.

Here's what they look like after pulling them into pandas:

stations = pd.read_csv('../data/stations_10_12_to_11_13.csv', 

trips = pd.read_csv('../data/hubwaydata_10_12_to_11_13.csv',
                    parse_dates=['start_date', 'end_date'])
/Users/dan/attic/tools/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
# sigma.js requres the node IDs to be strings.
stations = stations.set_index(stations.index.map(str))
trips.start_station = trips.start_station.map(str)
trips.end_station = trips.end_station.map(str)
terminal station municipality lat lng nb_docks install_date last_day
3 B32006 Colleges of the Fenway Boston 42.340021 -71.100812 15 4/8/2013 11/27/2013
4 C32000 Tremont St. at Berkeley St. Boston 42.345392 -71.069616 15 4/8/2013 11/27/2013

2 rows × 8 columns

status duration start_date start_station end_date end_station bike_nr subscription_type zip_code birth_date gender
708191 Closed 240 2012-10-31 23:57:00 53 2012-11-01 00:01:00 67 B00412 Registered 02139 NaN Male
708190 Closed 840 2012-10-31 23:54:00 36 2012-11-01 00:08:00 115 B00491 Registered 02127 NaN Female

2 rows × 11 columns

Create the graph

Next, we've gotta create the graph.

To do this, I iterate through all the rows of the stations dataframe, adding nodes to the graph as I go.

I weight an edge between two stations $n_1$ and $n_2$ by the number of trips taken from $n_1$ to $n_2$, $t_{1,2}$.

(now that I'm writing this, I'm thinking I probably should have normalized these weights by the total number of trips involving some station in general. Maybe I'll update the notebook later.)

DG = nx.DiGraph()

# add station nodes
# sigma.js requires a string ID, an X, and a Y for each node.
for n, r in stations.iterrows():
# add edges for trips.
# There's an edge between two stations if there was a trip between them.
# Edges are weighted by the number of trips
for n, r in trips.iterrows():
    if DG.has_edge(r.start_station, r.end_station):
        DG[r.start_station][r.end_station]['weight'] += 1
        DG.add_edge(r.start_station, r.end_station, weight=1)

Here's what the gross networkx visualization looks like (with the Fruchterman-Reingold layout):

pos = nx.fruchterman_reingold_layout(DG)
nx.draw_networkx(DG, pos)

Community detection

I ran the Louvain community detection algorithm on the resulting network.

I'm pretty ignorant to best practices here, so I just ran it a few times until I saw a good-looking induced community graph (second image).

The first image is the network visualized using a spring layout. Nodes are colored (grayed) according to the communities discovered by the algorithm.

import community

DG = nx.Graph(DG)
partition = community.best_partition(DG)

size = float(len(set(partition.values())))
pos = nx.spring_layout(DG)
count = 0
for com in set(partition.values()) :
    count = count + 1.
    list_nodes = [nodes for nodes in partition.keys()
                                if partition[nodes] == com]
    nx.draw_networkx_nodes(DG, pos, list_nodes, node_size = 20,
                                node_color = str(count / size))

nx.draw_networkx_edges(DG,pos, alpha=0.8)

# let's see if it makes sense:
ind_graph = community.induced_graph(partition, DG)

#pos = nx.spring_layout(ind_graph)
pos = nx.fruchterman_reingold_layout(ind_graph)

nx.draw_networkx(ind_graph, pos)

Great. Now that we've run the community detection, we can then add a 'community' attribute to each node, and then export it to .gexf for use in gephi.

# add the communities to the graph
for n, d in DG.nodes_iter(data=True):
    DG.node[n]['community'] = partition[n]
nx.write_gexf(DG, '../sigma/trips.gexf')

Messing around with the network in gephi is beyond the scope of this post/notebook.

In Gephi, I did the following things:

  1. Color-coded the communities

  2. removed edges with a weight less than 50, representing routes that had fewer than fifty total trips in the dataset.

  3. applied a geo-layout to keep the geographic data of the stations intact.

You can see the finished product produced with sigma.js here

Show Code Hide Code

About Me

I'm interested in building technological platforms that leverage what we know about social dynamics to help people live their lives better.

I'm currently working at the Human Dynamics Group at the MIT Media Lab, creating systems that attempt to measure and impact human social and health behaviors.

I've also worked with the Lazer Lab, inferring partisan dynamics from congressional public statements.

You can e-mail me at dan@dcalacci.net

Send me encrypted messages using my PGP key. (via keybase)

Resume here.

see what music I listen to