Technologies: Neo4j, OpenRefine, Prismatic Topics API, Python, Py2neo
Bernie is sick and tired of hearing about Hillary’s e-mails and so am I. So, why am I writing about them? Well, they could provide interesting insight into how our government works (or doesn’t work), if only they were in a better format than PDFs! They represent a perfect graph!
I started off by downloading the CSV files created by Ben Hamner. The sender and recipient fields in that dataset aren’t well normalized, so I used the OpenRefine faceting feature to clean them up and created emails-refined.csv.
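To give a feel for what this kind of clean-up does, here is a minimal Python sketch loosely modeled on OpenRefine's "fingerprint" key-collision clustering. It is not what OpenRefine runs internally and not the exact steps I used; the function names `fingerprint` and `cluster` are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: ASCII-fold, lowercase, drop punctuation,
    then sort the unique tokens so word order no longer matters."""
    folded = unicodedata.normalize("NFKD", value)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", folded.lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group raw values that share a fingerprint; keep only real clusters
    (keys that more than one distinct spelling maps to)."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

Feeding it spellings such as "Clinton, Hillary" and "Hillary Clinton" puts both under one key, which is the kind of candidate merge a faceting or clustering pass surfaces for review.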
I imported these into Neo4j:
    USING PERIODIC COMMIT
    LOAD CSV WITH HEADERS FROM "https://s3-us-west-2.amazonaws.com/neo4j-datasets-public/Emails-refined.csv" AS line
    MERGE (fr:Person {alias: COALESCE(line.MetadataFrom, line.ExtractedFrom, '')})
    MERGE (to:Person {alias: COALESCE(line.MetadataTo, line.ExtractedTo, '')})
    MERGE (em:Email { id: line.Id })
    ON CREATE SET em.foia_doc=line.DocNumber, em.subject=line.MetadataSubject, em.to=line.MetadataTo, em.from=line.MetadataFrom, em.text=line.RawText, em.ex_to=line.ExtractedTo, em.ex_from=line.ExtractedFrom
    // Link the e-mail to its recipient and sender (the relationship types TO and FROM are assumed names)
    MERGE (to)<-[:TO]-(em)-[:FROM]->(fr)
With the data in Neo4j, I could find the Person nodes Hillary sent the most Email nodes to.
    // Relationship types TO and FROM are assumed names
    MATCH (p:Person)<-[:TO]-(e:Email)-[:FROM]->(h:Person {alias: "Clinton, Hillary"})
    RETURN p.alias AS name, COUNT(*) AS count
    ORDER BY count DESC;
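If you want to sanity-check that kind of count without a database, the same aggregation can be approximated straight from the CSV. This is only a sketch: `first_present` and `top_recipients` are hypothetical helpers, the column names are the ones used in the import above, and treating blank fields as missing is my assumption about how the data behaves.

```python
import csv
from collections import Counter


def first_present(*values):
    """Return the first non-empty value, or '' if none is present
    (a rough stand-in for COALESCE with a '' fallback)."""
    for value in values:
        if value:
            return value
    return ""


def top_recipients(csv_file, sender, limit=10):
    """Count how many e-mails `sender` sent to each recipient alias."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        from_alias = first_present(row.get("MetadataFrom"), row.get("ExtractedFrom"))
        to_alias = first_present(row.get("MetadataTo"), row.get("ExtractedTo"))
        if from_alias == sender:
            counts[to_alias] += 1
    # most_common sorts by descending count, like ORDER BY count DESC.
    return counts.most_common(limit)
```

Calling `top_recipients(open("emails-refined.csv", newline=""), "Clinton, Hillary")` should give a list of (alias, count) pairs that roughly lines up with the graph query, as long as the aliases were normalized the same way.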
Knowing the e-mails and their senders and receivers is interesting, but I wanted to see what the e-mails are about! While the subject lines are included with the e-mails, they’re often opaque, like the meaningful subject “HEY” used in an e-mail from Jake Sullivan to Hillary Clinton. Natural language processing to the rescue!
I built a small Python script that uses Py2neo to query all e-mails without attached topics. It goes through each e-mail and sends the subject and raw body text to the Prismatic Topics API. The API returns a set of topics, which the script then uses to create REFERENCES relationships between the e-mails and topics. This code is based on the excellent post on the topic by Mark Needham.
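The overall loop can be sketched roughly like this. It is not my actual script: the `Topic` label, its `name` property, the `$param` syntax, and the `run` and `fetch_topics` callables are all assumptions made for illustration, and the HTTP call to the topics API is deliberately left out.

```python
# Cypher to find e-mails that have no topic attached yet.
# The Topic label and name property are assumed, not taken from the real script.
FIND_UNTAGGED = (
    "MATCH (e:Email) WHERE NOT (e)-[:REFERENCES]->(:Topic) "
    "RETURN e.id AS id, e.subject AS subject, e.text AS text"
)

# Cypher to attach one topic to one e-mail.
LINK_TOPIC = (
    "MATCH (e:Email {id: $id}) "
    "MERGE (t:Topic {name: $topic}) "
    "MERGE (e)-[:REFERENCES]->(t)"
)


def tag_emails(run, fetch_topics):
    """Attach topics to every untagged e-mail.

    run(cypher, params) executes a query and returns an iterable of rows
    that support row["column"] access.
    fetch_topics(subject, body) returns an iterable of topic names.
    Returns the number of REFERENCES links requested.
    """
    linked = 0
    # Materialize the rows first so later writes do not disturb the read.
    rows = list(run(FIND_UNTAGGED, {}))
    for row in rows:
        topics = fetch_topics(row["subject"] or "", row["text"] or "")
        for topic in topics:
            run(LINK_TOPIC, {"id": row["id"], "topic": topic})
            linked += 1
    return linked
```

With a recent Py2neo, something like `graph.run` could be passed as `run`, and `fetch_topics` would wrap whatever topic-extraction service you use. Because both MERGE clauses are idempotent, rerunning the script should not create duplicate topics or relationships.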
Now I can explore e-mails by topic, like the graph below, which shows e-mails related to David Cameron. When I double-click the e-mail with the subject ‘GUARDIAN’ in the Neo4j Browser, I can see all the other topics that e-mail references, including Sinn Fein, Northern Ireland, Ireland, and Peace.
With this additional topic information, I can start to understand more context around Hillary’s e-mails.
What fun things can you find in her e-mails?
I’ve opened up the Neo4j instance with this data for the world to explore. Check it out at http://ec2-54-209-65-47.compute-1.amazonaws.com:7474/browser/. The dataset is open to the public, but I’ve marked it as read-only. Mention me on Twitter with @ryguyrg if you discover any interesting nuggets of knowledge in Hillary’s e-mails!