Analyzing and Visualizing Social Network Data Using Gephi

Analyzing social network data can be a challenging task. The amount of data to analyze is often enormous, making it impossible to analyze most data sets by hand or even through basic statistical analyses. One of the best ways to gain an intuitive sense of any dataset is to visualize it using a graph, but a lot of the software available is difficult to use or understand for a beginner.

In this post, I’ll go through a step-by-step process you can take to statistically and graphically analyze a dataset. I use my Facebook data as an example set because it’s easy to obtain and intuitive to understand.

Some Questions

For many people, Facebook is a cornerstone of social interaction. Because it is so integral to so many people’s lives, an individual’s Facebook network can often contain an accurate picture of their social connections. Prior to ubiquitous technologies like Facebook, it would have been impossible for a normal person to collect quantitative data about their personal social network - picture yourself calling everyone you know and interrogating them about their friends and relationships.

With the data provided by normal usage of Facebook, it’s possible to make an accurate and enlightening graph of all the people in your social circle - or, at least, the people who use Facebook. With this type of data, you can answer interesting questions about your social connections:

People often have discrete groups of friends - from different times in their lives, places they’ve lived, or simply from shared interests. Most college students, for example, have at least two distinct groups: friends from high school, and friends from college. Take a look at the following questions and see which you can answer without looking through Facebook:

Getting the Data

netvizz is a tool that allows anyone with a Facebook account to gather machine-readable data about their Facebook network. When exporting your data, there’s an option to include your friends’ like and post counts in the data. If you don’t check this box, you won’t be able to see that data in the graph. Once you’re ready, click the link that will allow you to “create a gdf file from your personal network”, and wait for it to finish. Grab a cup of coffee, etc. It takes a while.

The raw data follows a certain format. It has a number of node definitions, which look like:

502385489,Jane Doe,female,en_US,291,388,392,3800,0,3800

Here's the key:

NODE_ID,NAME,SEX,LOCALE,AGERANK,LIKE_COUNT,POST_COUNT,POST_LIKE_COUNT,POST_COMMENT_COUNT,POST_ENGAGEMENT_COUNT

These node definitions describe your friends’ attributes. So, Jane Doe has a NODE_ID of 502385489, a POST_COUNT of 392, and a POST_LIKE_COUNT of 3800. The POST_COMMENT_COUNT field is 0 for every node I have information on; I’d chalk this up to a bug in the software I used. The POST_ENGAGEMENT_COUNT is the same as the POST_LIKE_COUNT because the comment count is zero.
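If you want to poke at the node definitions outside of Gephi, a few lines of Python are enough. This is a minimal sketch that assumes each node line has exactly the ten comma-separated fields in the key above; `parse_node` is a name I made up for illustration, not part of netvizz or Gephi.

```python
import csv

# Field names copied from the key above, in order.
NODE_FIELDS = [
    "NODE_ID", "NAME", "SEX", "LOCALE", "AGERANK", "LIKE_COUNT",
    "POST_COUNT", "POST_LIKE_COUNT", "POST_COMMENT_COUNT",
    "POST_ENGAGEMENT_COUNT",
]
# Everything from AGERANK onward holds a number rather than text.
NUMERIC_FIELDS = NODE_FIELDS[4:]


def parse_node(line):
    """Turn one comma-separated node definition into a dict."""
    values = next(csv.reader([line]))
    node = dict(zip(NODE_FIELDS, values))
    for field in NUMERIC_FIELDS:
        node[field] = int(node[field])
    return node


jane = parse_node("502385489,Jane Doe,female,en_US,291,388,392,3800,0,3800")
print(jane["NAME"], jane["POST_COUNT"], jane["POST_LIKE_COUNT"])
# prints: Jane Doe 392 3800
```

The NODE_ID is kept as a string on purpose, since it is a label rather than a quantity.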

There’s another part to the data file that dictates the connections between nodes. It’s a list of pairs of node IDs that looks like:

520532134,522242215
520532134,525824086
520532134,537430147
522242215,525824086
525824086,537430147

Each pair defines a connection between the two nodes with those IDs.
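To make the idea of a connection list concrete, here is a small sketch that turns the five sample pairs above into a lookup of who is connected to whom. It assumes one pair per line with no header; `build_adjacency` is a hypothetical helper name.

```python
from collections import defaultdict

# The five sample pairs shown above.
EDGE_LINES = """\
520532134,522242215
520532134,525824086
520532134,537430147
522242215,525824086
525824086,537430147
"""


def build_adjacency(text):
    """Map each node ID to the set of node IDs it is connected to."""
    neighbors = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        a, b = line.strip().split(",")
        # Friendship is mutual, so record the edge in both directions.
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)


adjacency = build_adjacency(EDGE_LINES)
for node_id, friends in sorted(adjacency.items()):
    print(node_id, "degree:", len(friends))
```

The number of neighbors a node has is its degree, which comes up again when ranking and filtering nodes.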

This data is hard to read for a human - it would take hours to answer the questions above if all you had was a list of hundreds of entries like the one above.

Instead of digging through the raw data, we’re going to make a pretty graph.

Where We're Headed

Gephi is a network visualization tool. You can import network data (from Facebook, for example), produce digestible visualizations, and run statistical analyses on that data.

I recently spent some time importing my Facebook data into Gephi and manipulating the results a bit. Here’s the almost-readable image that I was able to make:

facebook connections

I curate my Facebook friends pretty regularly, so I have fewer than 200 connections. This means that for me, the graph is pretty readable.

Each dot represents a person in my social network, and each line between two dots means that those two people are friends. The larger the dot, the higher that node’s betweenness centrality is.

It’s clear that I have at least three distinct social groups - one pink, one green, one orange. The pink group, in the bottom half of the graph, are my friends in college. The orange/red group are my friends from high school, and the green group are my friends from middle school.

This kind of visualization makes it easy to answer some of the questions I posed earlier - Andrea Abarca connects both the orange and pink groups, and because those are the two largest groups of friends I have, she has a large betweenness centrality.
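Betweenness centrality counts how often a node sits on the shortest path between two other nodes, which is why someone who bridges two groups scores so highly. The sketch below computes an unnormalized version using Brandes' algorithm on a tiny made-up graph; the node names are hypothetical, and Gephi's reported numbers may be scaled differently.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from source.
        order = []
        preds = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected path was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Two groups of friends that only share the node "bridge".
graph = {
    "a": {"b", "bridge"},
    "b": {"a", "bridge"},
    "x": {"y", "bridge"},
    "y": {"x", "bridge"},
    "bridge": {"a", "b", "x", "y"},
}
scores = betweenness(graph)
print(scores["bridge"], scores["a"])
# prints: 4.0 0.0
```

The bridge node scores 4 because all four cross-group pairs (a-x, a-y, b-x, b-y) have to pass through it, while everyone else scores 0.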

Getting the Graph

Open Gephi, and import the file you just downloaded. You should see something like this:

initial data import in Gephi

This is an unreadable blob generated from the node data you just exported.

To separate the nodes, we can run one of several layout algorithms that ship with Gephi. To run one of the algorithms, navigate to the Layout tab on the bottom left of the window, and select one of the options. I liked the results of Force Atlas, an algorithm produced in-house by Gephi:

the graph, after running force atlas
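I don't know the internals of Force Atlas, so the following is not that algorithm. It is a rough sketch of the general force-directed idea, in the style of Fruchterman-Reingold: every pair of nodes pushes apart, connected nodes pull together, and the movement allowed per step shrinks until things settle. The function name, constants, and sample graph are all my own assumptions.

```python
import math
import random


def spring_layout(adjacency, iterations=200, seed=0):
    """Return {node: (x, y)} after a simple force-directed relaxation."""
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    nodes = list(adjacency)
    pos = {v: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for v in nodes}
    ideal = 1.0 / math.sqrt(max(len(nodes), 1))  # preferred edge length
    for step in range(iterations):
        # Cap on how far a node may move; it shrinks so the layout settles.
        max_move = 0.1 * (1 - step / iterations)
        push = {v: [0.0, 0.0] for v in nodes}
        for i, v in enumerate(nodes):
            for w in nodes[i + 1:]:
                dx = pos[v][0] - pos[w][0]
                dy = pos[v][1] - pos[w][1]
                dist = math.hypot(dx, dy) or 1e-9
                # Every pair repels; connected pairs also attract.
                force = ideal * ideal / dist
                if w in adjacency[v]:
                    force -= dist * dist / ideal
                fx, fy = dx / dist * force, dy / dist * force
                push[v][0] += fx
                push[v][1] += fy
                push[w][0] -= fx
                push[w][1] -= fy
        for v in nodes:
            length = math.hypot(push[v][0], push[v][1]) or 1e-9
            scale = min(length, max_move) / length
            pos[v][0] += push[v][0] * scale
            pos[v][1] += push[v][1] * scale
    return {v: (x, y) for v, (x, y) in pos.items()}


# Two triangles joined by a single edge (hypothetical node names).
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
layout = spring_layout(graph)
for node, (x, y) in sorted(layout.items()):
    print(node, round(x, 2), round(y, 2))
```

Comparing every pair of nodes on every step is slow for big networks, which is one reason to leave the real work to Gephi.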

To color-code nodes by degree, go to the Ranking tab, and select the Degree parameter. It will color-code the nodes according to their degree value. You can change the color of the gradient and the range of the degree values to use here. For me, it defaulted to a gradient between deep blue and orange: the higher the degree of the node, the more orange it was. You can click the icon that looks like a gem to change the size of the nodes instead of color: the higher the degree of the node, the larger it will be.
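Ranking by degree boils down to mapping each node's degree onto a position along a color gradient. Here is a small sketch of that mapping; the RGB values for deep blue and orange are my guesses rather than Gephi's actual defaults, and the function names are hypothetical.

```python
def blend(low, high, t):
    """Linearly interpolate between two RGB colors, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(low, high))


def color_by_degree(degrees, low=(0, 0, 139), high=(255, 165, 0)):
    """Map {node: degree} to {node: (r, g, b)} along a low-to-high gradient."""
    if not degrees:
        return {}
    smallest = min(degrees.values())
    largest = max(degrees.values())
    span = (largest - smallest) or 1  # avoid dividing by zero
    return {
        node: blend(low, high, (degree - smallest) / span)
        for node, degree in degrees.items()
    }


# Hypothetical degrees for three nodes.
colors = color_by_degree({"a": 1, "b": 4, "c": 9})
print(colors["a"], colors["c"])
# prints: (0, 0, 139) (255, 165, 0)
```

Sizing nodes instead of coloring them works the same way, with a minimum and maximum radius in place of the two colors.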

This is what the largest group in my graph looks like after running Force Atlas and coloring the nodes:

the graph, after running force atlas and color-coding

To dig deeper into your data, you can run statistical analyses on the graph. Some of the more interesting statistics are in the Edge Overview section of the Statistics tab. If you run the Average Path Length analysis, it will give you information about the Betweenness Centrality, the Closeness Centrality, and the Eccentricity of nodes in your graph. You can then use these statistics as ranking parameters, as we did with degree earlier. In this graph, I'm using Betweenness Centrality as a ranking parameter.
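These path-based statistics all come from the same ingredient: the shortest distance from each node to every other node. The sketch below computes eccentricity, closeness, and the average path length with a breadth-first search. Closeness is defined here as the inverse of a node's average distance to the nodes it can reach, which is one common convention; Gephi's exact definition may differ between versions, so treat this as an assumption.

```python
from collections import deque


def distances_from(adjacency, source):
    """Shortest hop count from source to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def path_statistics(adjacency):
    """Return (eccentricity, closeness, average path length)."""
    eccentricity, closeness = {}, {}
    total, pairs = 0, 0
    for v in adjacency:
        others = [d for n, d in distances_from(adjacency, v).items() if n != v]
        if not others:
            continue  # isolated node: no paths to measure
        eccentricity[v] = max(others)            # farthest reachable node
        closeness[v] = len(others) / sum(others)  # inverse average distance
        total += sum(others)
        pairs += len(others)
    average = total / pairs if pairs else 0.0
    return eccentricity, closeness, average


# A chain of three hypothetical friends: a - b - c.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
ecc, close, avg = path_statistics(chain)
print(ecc["b"], close["b"])
# prints: 1 1.0
```

The middle node has the smallest eccentricity and the largest closeness, which matches the intuition that it is the most central person in the chain.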

If you’re changing the size of nodes based on their ranking, you’ll want to make sure that no nodes are overlapping one another. You can do this in the Layout pane. Below your selection of a layout algorithm, there should be a list of parameters that you can change. There’s a box there, labeled Adjust by sizes. If you check that box, and then re-apply the layout algorithm, the algorithm will take the node size into account, and space things properly.

Adjusting the layout by sizes

You can already start to see some interesting properties of my network. For example, there is one node in particular that clearly has the highest betweenness centrality (it’s the largest node in the graph), but its degree seems to be somewhere in the middle of the range (its color is right between blue and orange).

Gephi also allows you to examine the modularity of your graph. This lets you examine the structure of the separate communities in your graph a little more easily.

To run this analysis, click Run next to the Modularity algorithm under Network Overview in the Statistics tab.
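Modularity is a score for how well a grouping of nodes matches the structure of the graph: it is high when most edges fall inside groups rather than between them. The sketch below only scores a grouping you hand it, using the standard formula (fraction of edges inside each group, minus the fraction you would expect by chance). It does not search for the best grouping the way Gephi does, and the names are hypothetical.

```python
def modularity(adjacency, community_of):
    """Score a partition of an undirected graph; higher means tighter groups."""
    edge_count = sum(len(nbrs) for nbrs in adjacency.values()) / 2
    if edge_count == 0:
        return 0.0
    internal = {}    # edges with both ends inside the community
    degree_sum = {}  # total degree of the community's nodes
    for v, nbrs in adjacency.items():
        c = community_of[v]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        for w in nbrs:
            if community_of[w] == c:
                # Each internal edge is seen from both ends, so add half.
                internal[c] = internal.get(c, 0) + 0.5
    return sum(
        internal.get(c, 0) / edge_count
        - (degree_sum[c] / (2 * edge_count)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single edge between c and d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
everyone_together = {v: 0 for v in graph}
print(round(modularity(graph, by_triangle), 3))
print(modularity(graph, everyone_together))
```

Splitting the graph along the two triangles scores about 0.357, while lumping everyone into one group scores 0, so the split is the better description of the network.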

To show modularity graphically, you can partition the nodes based on modularity. To do this, navigate to the Partition tab in the top-left corner of the screen, and click the refresh button to refresh the parameter list. Then, select Modularity Class from the list of parameters, and then click apply. It will create random colors for each group the algorithm found in your graph, and assign the colors accordingly:

Modularity in my graph

This is looking pretty good, but there are a few more things we can do to clean it up.

I have a few “lonely” nodes with very low degree values scattered throughout my graph. These clutter the space, making the graph harder to interpret and less aesthetically pleasing. Gephi allows you to filter out nodes based on their various parameters. To filter out nodes with low degree values, click on the Filter tab on the right side of the window. Then, under Topology, select In Degree Range to filter out nodes by their degree values. After playing around with this value a bit, I decided to filter out nodes with a degree value less than 6:

After filtering nodes with small degree values
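The same kind of filter is easy to express in code. This sketch drops every node whose degree is below a threshold and removes the edges that pointed at it. It makes a single pass using the original degrees, which is my assumption about how such a filter behaves, so a surviving node can end up with fewer connections than the threshold.

```python
def filter_by_degree(adjacency, minimum):
    """Keep only nodes whose original degree is at least `minimum`."""
    keep = {v for v, nbrs in adjacency.items() if len(nbrs) >= minimum}
    # Drop edges that lead to removed nodes.
    return {v: adjacency[v] & keep for v in keep}


# Hypothetical network: "hub" knows everyone, "c" only knows "hub".
graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
}
filtered = filter_by_degree(graph, minimum=2)
print(sorted(filtered))
# prints: ['a', 'b', 'hub']
```

For my graph the threshold was 6, but the right value depends on how cluttered your own network looks.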

If you’d like to add labels to the graph, click the T button on the bottom of the window. You can then adjust the font size for the labels.

To change the layout to make sure that the labels don’t overlap, you can run another layout adjustment, this time with the “Label Adjust” algorithm. Once this is done, click on the Preview button at the top of the window:

Previewing the graph

You can then export the final graph as a PDF or image file.

To view specific statistical analyses on your data, navigate to the Statistics tab again, and run the corresponding algorithm. The Avg. Path Length analysis, for example, produces distribution graphs for attributes like Closeness Centrality and Eccentricity.

To view the statistical attributes of an individual node, select the Edit button on the left side of the viewing pane. The icon is a cursor with a question mark next to it.

You can then look at the attributes of that node, which include any statistical attributes you computed in the Statistics tab.

Some Analysis

With this visualization and the accompanying statistical software in Gephi, it's possible to answer the questions I posed at the beginning of this post.

I can now say with certainty that no, none of my friends know all the same people as me.

After examining my graph and looking at the degree distribution data for my friends, I found that I have three people in my network who have no connections. This question is easy to answer if you rank nodes by their degree values.
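You can also find unconnected people straight from the raw data. A node with no connections never shows up in the list of pairs, so it is enough to compare the node IDs against the IDs mentioned in any pair. The IDs below are made up for illustration.

```python
def isolated_nodes(node_ids, edges):
    """Return the node IDs that never appear in any edge."""
    connected = set()
    for a, b in edges:
        connected.update((a, b))
    return [node_id for node_id in node_ids if node_id not in connected]


# Hypothetical IDs: 4 and 5 have no friends in common with me.
node_ids = ["1", "2", "3", "4", "5"]
edges = [("1", "2"), ("2", "3")]
print(isolated_nodes(node_ids, edges))
# prints: ['4', '5']
```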

This question is easy to answer when nodes are ranked by their betweenness centrality. The largest nodes on your graph are likely to be the nodes that span multiple groups.

If you rank your nodes by degree value, this question is also pretty easy to answer - look for the largest node within some module.

This question is somewhat specific to the kind of data I analyzed, since I have information on the number of posts my friends have made on Facebook. To answer this question, I ranked my nodes by their POST_ENGAGEMENT_COUNT, and examined the node sizes.
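If you would rather have a ranked list than compare node sizes by eye, sorting the node definitions by a field does the same job. In this sketch the nodes are plain dicts keyed by the field names from the data file; apart from Jane Doe's count, the names and numbers are invented for illustration.

```python
def top_by(nodes, field, count=3):
    """Return the `count` nodes with the largest value for `field`."""
    return sorted(nodes, key=lambda node: node[field], reverse=True)[:count]


# Jane Doe's count comes from the sample line; the rest are made up.
nodes = [
    {"NAME": "Jane Doe", "POST_ENGAGEMENT_COUNT": 3800},
    {"NAME": "Example Friend A", "POST_ENGAGEMENT_COUNT": 120},
    {"NAME": "Example Friend B", "POST_ENGAGEMENT_COUNT": 5400},
    {"NAME": "Example Friend C", "POST_ENGAGEMENT_COUNT": 0},
]
for node in top_by(nodes, "POST_ENGAGEMENT_COUNT", count=2):
    print(node["NAME"], node["POST_ENGAGEMENT_COUNT"])
# prints:
# Example Friend B 5400
# Jane Doe 3800
```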

These questions were all trivial to answer by looking at the graph and changing its properties. By visualizing a social network using a graph, it's possible to quickly gain intuition about that network. Although these questions could be answered more precisely by more rigorous numerical analysis, being able to reason about them quickly is a valuable asset.


About Me

I'm interested in building technological platforms that leverage what we know about social dynamics to help people live their lives better.

I'm currently working at the Human Dynamics Group at the MIT Media Lab, creating systems that attempt to measure and impact human social and health behaviors.

I've also worked with the Lazer Lab, inferring partisan dynamics from congressional public statements.

You can e-mail me at dan@dcalacci.net