DevRel Is Becoming Developer GTM. No — It Always Was.

Published by Ryan Boyd on April 25, 2026

There’s a take making the rounds right now: “DevRel is shifting into developer GTM.”

No, it’s not. It always was.

Your CFO didn’t approve that headcount for funsies. Good DevRel has always been marketing and selling through education and inspiration. That’s the whole job. That’s why companies fund it.

What’s actually changing is the timeline leadership is willing to wait on.

The work didn’t change. The patience did.

Stressed companies — the ones worried about revenue, runway, or missing a market window — want this-quarter impact. They want a DevRel hire to move a number you can point to before the next board deck.

Healthier companies have the patience to wait for larger impact next year, or the year after. Pipeline takes time. Trust takes time. A developer audience is a compounding asset, not a campaign.

That’s the whole shift. Leadership is just saying the quiet part out loud.

When the macro tightens, the runway shortens, and the question “how does this ladder to revenue?” goes from a polite annual review topic to a Monday morning email. The activities don’t have to change. The narrative around them does.

Every DevRel activity ladders to the business

If you can’t trace your work to one of these, somebody’s doing it wrong. Probably the person who hired you. Possibly you.

Content (blog posts, talks, videos, podcasts)

Top-of-funnel SEO and brand → leads
Inspiration → developers pulling teammates in
Trust → shorter sales cycles when they finally talk to an AE

Code (libraries, samples, SDKs, integrations, “skills”)

Developer productivity → faster ramp → higher ACV at the same price point
SEO surface area (sample repos rank, integrations show up in partner directories) → leads
A new wedge into LLM-mediated discovery, where the agent is the developer

Community (Slack, Discord, forums, events, user groups)

Net revenue retention through more committed users
Stronger brand and word-of-mouth → cheaper CAC
Free customer support that’s better than your paid customer support

Docs

Faster ramp → faster time-to-value → revenue expansion
Feature adoption → NRR
Lower support cost per customer → better gross margin

That’s the entire scoreboard: faster ramping, expansion revenue, higher retention, lower CAC. Pick one. Probably more than one. If a DevRel activity can’t be traced to any of them, you should be able to explain why anyway — and “developer love” is not the answer your CFO is grading.

“But what about community management? Or technical evangelism?”

The common pushback: community management is really field marketing, advocacy and education are really technical marketing, and the new “build for agents” work is a different beast entirely. Every company is different. Every role mutates with the company stage and the industry.

All true. But the test isn’t “what bucket does the role belong in?” The test is “does it ladder to a business goal a CFO would fund?”

Take community travel. The lazy version is “show up, give the talk, post the photos, fly home.” The version that actually pays for itself is lining up two to five customer meetings with the AEs in that city while you’re already there. Influence a buying decision. Unblock an expansion deal. Inspire a champion to escalate internally.

That’s not field marketing. That’s not “community.” That’s level-2 sales engineering — the company brings in the big guns to influence a deal, and the DevRel person is the big guns. The next level up is product/eng execs, the C-suite, or founders.

If you’re doing community travel and not lining up customer meetings, you’re leaving money on the table. If you’re doing customer meetings and not telling your CFO, you’re leaving budget on the table.

The three activities of any company

In our world (B2B-ish software, mostly), every company does exactly three things:

Build the product
Take that product to market
Run the company internally (ops, finance, HR)

Some DevRel work touches product — feedback loops, design partnerships, pre-release shaping. Real, valuable, and rarely the primary reason a company hires a DevRel team.

Almost everything else is GTM. And honestly, you can argue everything is GTM, because the only reason a company builds a product is to take it to a market. Leadership has a fiduciary duty to maximize returns to shareholders. If your work doesn’t show up somewhere on the path between “build” and “money in the bank,” it’s a hobby the company is paying for.

That isn’t cynical. It’s the deal.

What this means if you’re a DevRel IC

Know which of the four levers (ramp, expansion revenue, retention, CAC) your current work moves. If you can’t name it, your manager probably can’t either, and you’re both exposed.
Bring receipts to your reviews. “Influenced X new logos, Y in expansion, Z in pipeline” beats “wrote 12 blog posts and spoke at 4 conferences” every time, even if the blog posts and talks are how you got there.
Make peace with attribution being messy. It is messy. Do it anyway. Bad attribution is better than no attribution, and the spreadsheet will sharpen over time.
When the macro tightens, get closer to revenue, not further from it. The instinct to retreat into “pure community work” when budgets get cut is exactly backwards.

What this means if you lead a DevRel team

Stop pretending the GTM frame is new. It’s not new — your patience is just shorter than it used to be. Be honest with the team about that.
Defend the long-timeline activities by tying them to long-timeline metrics (NRR, brand share-of-voice, organic pipeline a year out). Don’t try to defend a podcast on a quarterly attribution model.
Make sure every IC has at least one this-quarter lever and at least one next-year lever. People who only do long-cycle work get cut first.
Build the muscle of bringing your team into customer conversations. The DevRel folks moving real revenue aren’t doing it by accident — they have a system, and they tell their CFO about it.

So what am I missing?

Genuinely — what am I missing?

I keep hearing “DevRel is shifting.” I keep looking at the work and seeing the same four things ladder to the same four metrics they always did. The pressure changed. The label changed. The job didn’t.

If you think the job actually changed, I want to hear why. Hit me up.

Thanks to Adi Polak for the chat and inspiration to write this post.

Finding the Perfect DevRel Metric

Published by Ryan Boyd on March 9, 2023

Want to figure out the perfect metric by which to measure the success of your DevRel team? A metric which you report up to your leadership that’s clear and concise?

Those who have worked with me know that I am a fan of metrics, constantly iterating and testing new approaches to DevRel to try and push those metrics. This is especially important to align developer relations teams around common goals and drive improvements, but is also a valuable way of obtaining buy-in on your priorities and communicating them throughout the organization.

“We need a single metric”

DevRel teams are frequently asked for a single top-line metric to represent the team’s goal and all of their work — both short-term and long-term. This can be valuable for clear and concise internal communication, though over-simplification can be detrimental for many of the other motivations behind goal setting.

Let’s talk through a few metrics I’ve used in the past, their value and their challenges. If your developer product is a modern cloud service with authenticated developers and no open source component, you can skip this section, but it may still be helpful.

1 million developers building with Google. Google DevRel (~2007).

Google’s nascent developer relations team took form under Online Sales and Operations (OSO), a highly metrics-driven organization led by Sheryl Sandberg [more on that in a future post]. Our Director was asked for a visionary metric to drive the team by and we settled on a large, bold number.

This metric seemed perfect. It represented the community of developers we were trying to build and many of the activities of the team (documentation/tutorials/guides, blog posts, support, DevEx improvements, etc) can theoretically be traced up to impacting the metric.

Measurement methodology is the crux of any good metric, and it was very challenging in this case to identify an individual developer. Development of a unified Google Account for all products was underway by a four-person engineering team IIRC, but it was not adopted by all Google services. We simply had no foolproof way to identify specific humans (developers) building with Google.

Nonetheless, we settled upon Google Accounts for those products which supported them. However, one of the most active developer communities, Google Maps, didn’t even require authentication for developers. We needed a proxy to represent a developer. We used unique websites for Google Maps. We, of course, knew that a single developer could be behind hundreds or thousands of auto-generated content sites on the web, but had no way to account for that, just like we had no way to account for a single developer having multiple Google accounts.

Perhaps the biggest concern with this calculation was that 95% of all “developers building with Google” were accounted for because of Google Maps. Teams working on other products had no meaningful way to move the number, making the metric worthless for aligning and encouraging these teams internally.

Perhaps we could have had a metric like X million developers on mature products, and Y developers on labs products? (though the Google Code Labs program I co-founded is no longer in existence)

X monthly active machines running Neo4j. Neo4j DevRel (~2015).

Neo4j is an open source graph database, distributed as a Community Edition (GPL) and a commercial Enterprise Edition (then AGPL). It is also now available as a cloud service – Neo4j Aura, but was not at the time we used this metric.

Monthly active machines was based on unique MAC addresses pinging back to report that they were running Neo4j. This telemetry was on by default in both the free and commercial editions of the database, but could be disabled by the user [or blocked by a good firewall config].

Good metrics are not easily manipulated by a single actor. This metric was.

Problems:

No meaningful measurement of actual usage / value achieved for a specific human, just that the machine was running
MAC addresses are not always unique (VMs, etc)
Some of the highest value users disabled this telemetry
Runaway CI jobs caused spikes in the metric
A large financial services firm chose to install Neo4j on all client instances for a specific feature which was not enabled by default. Neo4j was running, but not in use.

Solutions explored:

Only count machines running > X queries / hour
Cap the number of new MAC addresses added per month/day/etc

I’m sure there were ways to use a mixture of data science and pattern matching to ensure we counted only “real” machines and smooth out some of the charts, but we were a small team without any dedicated data folk.
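
As a rough illustration of the two ideas under “Solutions explored,” here is a minimal Python sketch. The `Ping` record, its `queries_per_hour` field, and the threshold values are assumptions made up for the example; the real telemetry format is not described in this post.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    mac: str                  # MAC address reported by the instance
    day: str                  # ISO date of the ping, e.g. "2015-03-02"
    queries_per_hour: float   # hypothetical usage field, not in the real telemetry


def monthly_active_machines(pings, min_queries_per_hour=1.0, max_new_per_day=100):
    """Count unique MAC addresses in one month of pings, with two filters applied."""
    first_seen = {}
    for ping in sorted(pings, key=lambda p: p.day):
        # Filter 1: only count machines running more than X queries per hour.
        if ping.queries_per_hour > min_queries_per_hour:
            first_seen.setdefault(ping.mac, ping.day)

    # Filter 2: cap how many new MAC addresses a single day may contribute,
    # which dampens spikes from runaway CI jobs.
    new_per_day = Counter(first_seen.values())
    return sum(min(count, max_new_per_day) for count in new_per_day.values())
```

Neither filter fixes the missing high-value users who disabled telemetry; the sketch only shows how the two proposed rules would change the count.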

Y data people educated on Delta Lake. Databricks DevRel (~2020)

The goal of this metric was to motivate the right type of scalable DevRel programs, while communicating up a single top-line number instead of a list of program metrics.

I think this metric worked well internally for prioritization of other goals, but failed to succeed as a metric for managing up to the CMO and CEO.

Why? Nuances. How the hell can we say someone is educated?

We used a composite metric to infer that some fractional data person was educated ~ 0.2 of the way if they read a blog post, 0.5 of the way if they watched an on-demand video, 0.5 of the way if they downloaded a book, 1.0 of the way if they watched a live video or attended a live event, etc.

This mechanism worked in many cases. However, the CEO was easily able to come up with examples where this failed to motivate the right activities: “If someone read 4 blog posts, were they just as educated as someone attending a talk (by the creator of Delta Lake)?” The variables of the specific speakers, blog post authors and content quality were simply not accounted for, yet had an impact on the effectiveness of the education.
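
To make the weighting concrete, here is a minimal Python sketch of such a composite score. The weights are the ones quoted above; the activity names are hypothetical labels, and the sketch assumes a plain weighted sum with no per-person cap, which may not match how the real metric was computed.

```python
# Weights quoted in the post; the dictionary keys are hypothetical labels.
WEIGHTS = {
    "blog_post": 0.2,
    "on_demand_video": 0.5,
    "book_download": 0.5,
    "live_video": 1.0,
    "live_event": 1.0,
}


def educated_people(activity_counts):
    """Turn raw activity counts into 'fractional data people educated'."""
    return sum(WEIGHTS[name] * count for name, count in activity_counts.items())
```

Under these weights, four blog post reads score almost as high as one live talk, regardless of who gave the talk or how good the posts were, which is exactly the kind of example the CEO raised.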

Sure, if we owned the training and certification programs, we could have assessments to grade our target audience [and such programs were under discussion for the OSS project]. However, our team could be very successful at building wide awareness and adoption even if not a single person got “certified.”

So, we could be successful and not get credit, and we could get credit while not being successful.

The other big issue with this metric is that there were people on the team who were responsible for other product areas (MLflow, Spark OSS, Databricks product, etc). While they helped move this metric at times when “co-marketing” the products, they didn’t move it significantly enough to feel ownership over the number.

What metrics are valuable?

The only perfect metrics are those that point up and to the right at an increasing rate.

Okay, in all seriousness, metrics for DevRel on open source projects are super challenging, but modern day cloud APIs/services are much easier to measure when developers are authenticated.

If your developer accounts are unified across many services (single sign-on to things like product, docs, forums, support, etc), then you’re in a very good place to create DevRel metrics around adoption. You’ll have a great way of knowing whether developers are building with your product, how quickly they can be successful (aha moments), and whether they continue to build. You’ll also be able to measure their impact on the rest of the community.

Depending on your priorities any given quarter, you then choose metrics like:

Number of developers with active applications
Time for new developers to achieve X
Number of new developers creating their first applications
Number of developers actively building in the last X days
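
With authenticated developers, metrics like these fall out of a simple event log. The Python sketch below is one possible shape, not a description of any real pipeline: the event names `signup`, `first_app`, and `api_call`, the tuple layout, and the 30-day window are all assumptions for illustration.

```python
from datetime import date, timedelta
from statistics import median


def adoption_metrics(events, today, window_days=30):
    """Compute a few adoption metrics from (developer_id, event_name, day) tuples.

    The event names 'signup', 'first_app' and 'api_call' are hypothetical.
    """
    signup, first_app, last_active = {}, {}, {}
    for dev, name, day in events:
        if name == "signup":
            signup[dev] = min(day, signup.get(dev, day))
        elif name == "first_app":
            first_app[dev] = min(day, first_app.get(dev, day))
        # Any event counts as activity for the "actively building" metric.
        last_active[dev] = max(day, last_active.get(dev, day))

    cutoff = today - timedelta(days=window_days)
    days_to_first_app = [
        (first_app[dev] - signup[dev]).days for dev in first_app if dev in signup
    ]
    return {
        "new_developers_with_first_app": sum(1 for d in first_app.values() if d > cutoff),
        "active_developers": sum(1 for d in last_active.values() if d > cutoff),
        "median_days_to_first_app": median(days_to_first_app) if days_to_first_app else None,
    }
```

The same log answers several of the questions above, which is why unified accounts make this kind of measurement so much easier than the open source cases.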

Do you need to report up a single metric?

While I have fought tooth-and-nail over this with past leaders, I do actually think it is valuable to place a single goal that everyone on the team is running towards and communicate it up. However, I would only do this when that metric is simple, represents the work of the team well, and is not easily manipulated. Otherwise, I would choose up to 3 metrics and regularly report on those.

DevRel should NEVER be in Sales

Published by Ryan Boyd on June 13, 2022

I was one of the first employees in Google’s developer relations team, joining in the fall of 2006. Did you know that when Google started doing developer relations, the group was called API Support and under TechOps which was in the Online Sales and Operations (OSO) org with @sherylsandberg as the org leader? What led to its prominence within Google?

DevOps. No, not that DevOps

The name changed, of course, to Developer Operations (DevOps, which means a very different thing today!) and we had very capable leadership under Mike Winton. But I think a singular decision had the greatest effect: the move into the Engineering org. Why?

Google has an engineering-driven culture

Many of you have heard me argue that DevRel is marketing, so why was DevRel being in the engineering org at Google so important? It’s contextual. Google is an engineering-led culture. To influence eng teams to effect change on behalf of the developer communities requires their respect.

Respect comes with understanding and helping with their challenges, but also knowing their language, systems, code, tooling and operations. Access to these at Google required being in the engineering org. Also, it greatly helped that an Eng VP had to sign off on the quality of all hires.

I’d argue that DevRel can be super successful in a marketing group, but it requires the right culture and leadership that truly gets the role it plays in product and community success. Without that, a home in engineering is much better for all.

How about having DevRel in a sales org?

No, DevRel never belongs in sales. DevRel is a long game, and sales is driven by short-term lead/opp/closing targets — all sales leaders will be tempted to reach these targets by involving some of their most talented technologists in DevRel, and that comes at the expense of our real long-term goals.

Lakehouse isn’t an architecture; it’s a way of life

Published by Ryan Boyd on April 12, 2021

Recently a tweet of mine was revealed to have been included in a Twilio board of directors presentation from 2011. The tweet was about the simplicity of the developer experience for both Twilio and Google App Engine. What’s this have to do with Lakehouses? Everything.

Apparently screen caps in 2011 weren’t high res

My entire career has been about enabling simplified experiences for technologists so they can focus on what matters — the key differentiators of their business or application.

Google App Engine, though released before its time, made it a lot easier to launch and maintain production-ready web applications.
Google Apps Marketplace made it easier to market B2B apps to a large audience.
Neo4j and the property graph data model make it easier to understand and query the relationships between data.

The same is true with Databricks and the Lakehouse architecture.

Nearly all large enterprises today have two-tier data architectures — combining a data lake and a data warehouse to power their data science, machine learning, data analytics, and business intelligence (BI).

Data Lake, storing all the fresh enterprise data, populated directly from key business applications and sources. By using popular open data formats, the data lake is great for compatibility with popular distributed computing frameworks and data science tools. However, a traditional data lake is typically slow to query and lacks schema validation, transactions, and other features needed to ensure data integrity.
Data Warehouse, with a subset of the enterprise data ETLd from the data lake, stores mission critical data “cleanly” in proprietary data formats. It’s fast to query this subset, but the underlying data is often stale due to the complex ETL processes used to move the data from the business apps -> lake -> warehouse. The proprietary data formats also make it difficult to work with the data in other systems and keep users locked in to their warehouse engine.

Simplicity is king

Why have a two-tier data architecture when a single tier will satisfy the performance and data integrity requirements, improve data freshness and reduce cost?

It simply wasn’t possible before the advent of data technologies like Delta Lake, enabling highly-performant access to data stored in open data formats (like Parquet) with the data integrity constraints and ACID transactions only previously possible in data warehouses.

bestOf(DataLake) + bestOf(DataWarehouse) => (DataLakehouse)

Reduce your (mental, financial, ops) overhead

I’d encourage y’all to invest in simplicity and reduce the complexity of your data architecture. The first step is reading up on new technologies like Delta Lake. There is a great VLDB paper on the technology as well as a Getting Started with Delta Lake tech talk series by some of my esteemed colleagues.

DevRel @ Databricks: Year 1

Published by Ryan Boyd on January 27, 2021

I joined Databricks in October 2019. My first day was also the first day of the Spark + AI Summit in Amsterdam – a heck of an exciting introduction to a new team, new company and new community.

Why Databricks? I was very happy at Neo4j and certainly loved working with the Neo4j community. Databricks brought with it an exciting opportunity though – build a talented team at one of the fastest growing cloud data startups in history. I also have the opportunity to work directly with the founders and senior leadership at the company who understand the value of developer relations and the importance of building a great community of data scientists, data engineers and data analysts.

Our Team

The team is now six people, located in San Francisco CA, Seattle WA, Boulder CO, Santa Fe NM, and Blacksburg VA. We have Developer Advocates and Program Managers all working together to grow awareness and adoption of Databricks and the open source projects which we support.

First-Year Team Accomplishments

Here are some of our accomplishments from the first year, working along with amazing collaborators across the company and community.

Launched the Databricks University Alliance, a community of professors at some of the world’s top universities sharing best practices and using Databricks to help teach data science, data engineering, data analytics and more.
Built and executed (along with the broader company) two of the largest virtual events for the Data + AI Community, with the June Spark + AI Summit and the fall Data + AI Summit Europe.
Published the Learning Spark 2nd Edition book with O’Reilly and made it available for free.
Hosted 44 Data + AI Online meetups, bringing together thousands of members in the community to learn. Published these on our YouTube channel, growing subscribers by 50%.
Published a few great tech talk series, including Getting Started with Delta Lake, Diving into Delta Lake, Managing the Machine Learning Lifecycle, and Introduction to Data Analysis for Aspiring Data Scientists.
Published a COVID-19 Hub, with data, notebooks and dashboards.
Ran a Hackathon for Social Good with hundreds of participants, resulting in $35k in donations to charity.
Released our first season of the Data Brew podcast/vidcast, focused on lakehouses, interviewing experts from across the data + AI community.

Keep in mind: a lot of work our team has done in 2020 hasn’t yet been released – stay tuned! And, of course, this is all within the context of the broader accomplishments of the company and community in 2020.

Looking Forward

We have a really exciting 2021 planned as the team continues many of the initiatives above and takes on new challenges. We’ll be focused on making it easier to learn data science, data engineering and data analytics, as well as making it simple to apply these learnings using Databricks. An important part of this mission will be growing and strengthening the community so we can all learn from each other.

Are you a data geek and want to join the adventure? We have data engineering/analytics advocate roles and developer (online) experience advocate roles open in the US as well as a regional advocate role in Europe. Reach out to me at (firstname).(lastname)@databricks.com if you want to learn more!

Moving RDBMS data into a Graph Database

Published by Ryan Boyd on March 13, 2017

One of the most common questions we get at Neo4j is how to move from a SQL database to a Graph Database like Neo4j. The previous solution for accomplishing this was to export the SQL tables into CSV files and then import the CSV files with neo4j-import or LOAD CSV. There’s a much better way: JDBC!

Neo4j JDBC Support

There are two distinct ways you can use JDBC within Neo4j:

Access Neo4j Data via JDBC. Do you have existing code that accesses your SQL database using JDBC, and you want to move that code to access Neo4j instead? Neo4j has a JDBC Driver. Just update your code to use the awesome power of the Cypher query language instead of SQL, and switch over the JDBC driver you’re using, and you’re off to the races!
Import SQL Databases into Neo4j. Do you have data in your SQL database that you want to move into a Graph? The APOC library for Neo4j has a set of procedures in apoc.load.jdbc to make this simple. This blog post will cover this use case.

Loading Sample Northwind SQL tables into MySQL

In order to run the code snippets in the following sections, you’ll need to have the Northwind SQL tables in a MySQL database accessible from your Neo4j server. I’ve published a GitHub Gist of the SQL script which you can execute in MySQL Workbench or using the command-line client.

In order to run this, I created a blank MySQL database in Docker:

docker run -P -e MYSQL_ROOT_PASSWORD=my-secret-pw -e MYSQL_DATABASE=northwind -e MYSQL_USER=northwind -e MYSQL_PASSWORD=my-secret-pw mysql

Loading data from RDBMS into Neo4j using JDBC

With the APOC JDBC support, you can load data from any type of database which supports JDBC. In this post, we’ll talk about moving data from a MySQL database to Neo4j, but you can apply this concept to any other type of database: PostgreSQL, Oracle, Hive, etc. You can use it for other NoSQL databases too, but APOC has direct support for MongoDB, Couchbase and more.

1. Install APOC and JDBC Driver into Neo4j `plugins` directory

Note: This step is not necessary if you’re using the Neo4j Sandbox and MySQL or PostgreSQL. Each Sandbox comes with APOC and the JDBC drivers for these database systems.

All JAR files placed in the Neo4j plugins directory are made available for use by Neo4j. We need to copy the APOC library and JDBC drivers into this directory.

First, download APOC. Be sure to grab the download that is for your version of Neo4j.

Next, download the JDBC driver. Then, copy the file into your plugins directory:

cp mysql-connector-java-5.1.36.jar ~/neo4j/plugins/

Finally, restart Neo4j on your system.

2. Register the JDBC Driver with APOC

Open up the Neo4j Browser web interface.

In the Neo4j Browser, enter a Cypher statement to load the required JDBC driver:

CALL apoc.load.driver("com.mysql.jdbc.Driver");

3. Start pulling Northwind SQL tables into Neo4j with JDBC and Cypher

Run the following Cypher queries, courtesy of William Lyon, separately in the Neo4j Browser:

// Create Product nodes based on each row of the Products table
CALL apoc.load.jdbc("jdbc:mysql://:3306/northwind?user=northwind&password=my-secret-pw","Products") YIELD row
CREATE (p:Product {ProductID: row.ProductID})
SET p.ProductName = row.ProductName,
    p.CategoryID  = row.CategoryID,
    p.SupplierID  = row.SupplierID;

// Create Order nodes based on each row of the Orders table
CALL apoc.load.jdbc("jdbc:mysql://:3306/northwind?user=northwind&password=my-secret-pw","Orders") YIELD row
CREATE (o:Order {OrderID: row.OrderID})
SET o.CustomerID = row.CustomerID,
    o.EmployeeID = row.EmployeeID;

// Create CONTAINS relationships from the OrderDetails table
CALL apoc.load.jdbc("jdbc:mysql://:3306/northwind?user=northwind&password=my-secret-pw","OrderDetails") YIELD row
MATCH (p:Product {ProductID: row.ProductID})
MATCH (o:Order {OrderID: row.OrderID})
CREATE (o)-[r:CONTAINS]->(p)
SET r.UnitPrice = row.UnitPrice,
    r.Quantity  = row.Quantity,
    r.Discount  = row.Discount;

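// Create Customer nodes (assumed step, not shown in the original queries: the PLACED query below needs them; assumes the standard Northwind Customers table)
CALL apoc.load.jdbc("jdbc:mysql://:3306/northwind?user=northwind&password=my-secret-pw","Customers") YIELD row
CREATE (c:Customer {CustomerID: row.CustomerID, ContactName: row.ContactName});

// Create PLACED relationships
MATCH (o:Order)
MATCH (c:Customer {CustomerID: o.CustomerID})
CREATE (c)-[:PLACED]->(o);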
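
The import statements above repeat the same JDBC URL with the credentials inline. If you generate these statements from a script, a small helper can build the URL once. The Python sketch below is only an illustration: the function names are made up, and it assumes the JDBC driver decodes percent-encoded values in the query string, which matters once a password contains characters like `&`.

```python
from urllib.parse import quote


def jdbc_url(host, database, user, password, port=3306):
    """Build a MySQL JDBC URL, percent-encoding the credentials."""
    return (
        f"jdbc:mysql://{host}:{port}/{database}"
        f"?user={quote(user, safe='')}&password={quote(password, safe='')}"
    )


def load_call(url, table):
    """Return the Cypher clause that streams the rows of one SQL table."""
    return f'CALL apoc.load.jdbc("{url}","{table}") YIELD row'
```

For example, `load_call(jdbc_url("localhost", "northwind", "northwind", "my-secret-pw"), "Products")` produces the first line of the Product import, with `localhost` standing in for wherever your MySQL server runs.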

Running Cypher Queries on Imported Data

Here’s a simple Cypher query for collaborative filtering product recommendations:

// simple collaborative filtering product recommendations
MATCH (c:Customer) WHERE c.ContactName = "Roland Mendel"
MATCH (c)-[:PLACED]->(o:Order)-[:CONTAINS]->(p:Product)
// find customers whose orders contain p, then the products in their other orders
MATCH (p)<-[:CONTAINS]-(:Order)<-[:PLACED]-(:Customer)-[:PLACED]->(:Order)-[:CONTAINS]->(p2:Product)
RETURN p2.ProductName, count(*) AS weight ORDER BY weight DESC LIMIT 10;

Results: [screenshot of the query results in the Neo4j Browser]
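
If graph patterns are new to you, it can help to see the same idea on plain rows. The Python sketch below is a simplified analogue of the recommendation query, not an exact translation: it works on hypothetical `(customer, order_id, product)` tuples, skips products the customer already bought, and weights each suggestion by how many products the other customer shares with them.

```python
from collections import Counter, defaultdict


def recommend(order_lines, customer, limit=10):
    """Suggest products bought by customers who share purchases with `customer`.

    order_lines: iterable of (customer, order_id, product) tuples.
    """
    products_by_customer = defaultdict(set)
    for buyer, _order_id, product in order_lines:
        products_by_customer[buyer].add(product)

    bought = products_by_customer.get(customer, set())
    weights = Counter()
    for other, products in products_by_customer.items():
        if other == customer:
            continue
        shared = bought & products
        if not shared:
            continue  # no overlap, so this customer tells us nothing
        for candidate in products - bought:
            weights[candidate] += len(shared)
    return weights.most_common(limit)
```

In the graph version, the joins through orders and customers are expressed as a single path pattern, which is the main readability win of Cypher for this kind of question.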

Next Steps

If this was your first experience with Neo4j, you probably want to learn more about Neo4j’s Cypher query language. Neo4j has some great (free) online training you can take to learn more. You can also use the Cypher Refcard to power your journey to becoming a Graphista.

Graphing Hillary Clinton’s E-mails in Neo4j

Published by Ryan Boyd on November 6, 2015

Technologies: Neo4j, OpenRefine, Prismatic Topics API, Python, Py2neo

Bernie is sick and tired of hearing about Hillary’s e-mails and so am I. So, why am I writing about them? Well, they can possibly provide an interesting insight into how our government works (or doesn’t work) — if only they were in a better format than PDFs!! They represent a perfect graph!

I started off by downloading the CSV files created by Ben Hammer. Some of the from/to information in that dataset isn’t well normalized, so I used the OpenRefine faceting feature to clean it up and created emails-refined.csv.

I imported these into Neo4j:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://s3-us-west-2.amazonaws.com/neo4j-datasets-public/Emails-refined.csv" AS line
MERGE (fr:Person {alias: COALESCE(line.MetadataFrom, line.ExtractedFrom, '')})
MERGE (to:Person {alias: COALESCE(line.MetadataTo, line.ExtractedTo, '')})
MERGE (em:Email { id: line.Id })
ON CREATE SET em.foia_doc=line.DocNumber, em.subject=line.MetadataSubject, em.to=line.MetadataTo, em.from=line.MetadataFrom, em.text=line.RawText, em.ex_to=line.ExtractedTo, em.ex_from=line.ExtractedFrom
// The relationship types TO and FROM are assumed; the pattern between (to) and (fr) did not survive in the published snippet
MERGE (to)<-[:TO]-(em)-[:FROM]->(fr)

With the data in Neo4j, I got to explore the Person nodes Hillary sent the most Email nodes to.

// The relationship types TO and FROM are assumed; the pattern between the two Person nodes did not survive in the published snippet
MATCH (p:Person)<-[:TO]-(:Email)-[:FROM]->(h:Person {alias: "Clinton, Hillary"})
RETURN p.alias AS name, COUNT(*) AS count
ORDER BY count DESC;

Knowing the e-mails and senders+receivers is interesting, but I wanted to see what the e-mails are about! While the subject lines are included with the e-mails, they’re often opaque, like the meaningful subject “HEY” used in an e-mail from Jake Sullivan to Hillary Clinton. Natural language processing to the rescue!

I built a small Python script that uses Py2neo to query all e-mails without attached topics. It goes through each e-mail and sends the raw body text and subject to the Prismatic Topics API. The API returns a set of topics, which the script uses to create REFERENCES relationships between the e-mails and topics. This code is based on the excellent post on the topic by Mark Needham.
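
The script itself isn’t reproduced here, but its core loop can be sketched roughly as follows. Treat this as an illustration under stated assumptions rather than the original code: `fetch_topics` is a hypothetical stand-in for the topics API call, the `Topic` label and `name` property are made up, and the `$param` syntax is the one used by current Neo4j versions.

```python
def topic_statements(emails, fetch_topics):
    """Yield (cypher, params) pairs that link each e-mail to its topics.

    emails: iterable of dicts with 'id', 'subject' and 'text' keys.
    fetch_topics: hypothetical callable(subject, text) -> list of topic names,
    standing in for the request to the topics API.
    """
    cypher = (
        "MATCH (em:Email {id: $id}) "
        "MERGE (t:Topic {name: $topic}) "   # Topic label and name property are assumptions
        "MERGE (em)-[:REFERENCES]->(t)"
    )
    for email in emails:
        for topic in fetch_topics(email["subject"], email["text"]):
            yield cypher, {"id": email["id"], "topic": topic}
```

Each pair can then be handed to whatever client you use to run parameterized Cypher; keeping the API call behind a plain function also makes the loop easy to try out with canned topics.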

Now I can explore e-mails by topic, like the graph below where I see e-mails related to David Cameron. When I double-click on the e-mail with subject ‘GUARDIAN’ in the Neo4j Browser, I can see all the other topics that e-mail references, including Sinn Féin, Northern Ireland, Ireland, and Peace.

With this additional topic information, I can start to understand more context around Hillary’s e-mails.

What fun things can you find in her e-mails?

I’ve opened up the Neo4j instance with this data for the world to explore. Check it out at http://ec2-54-209-65-47.compute-1.amazonaws.com:7474/browser/. The dataset is open to the public, but I’ve marked it as read-only. Mention me on Twitter with @ryguyrg if you discover any interesting nuggets of knowledge in Hillary’s e-mails!

Hello world!

Published by Ryan Boyd on July 11, 2009

Welcome to WordPress.com. This is your first post. Edit or delete it and start blogging!