This article introduces you to an application that combines multiple microservices with a graph processing platform to rank communities of users on Twitter. We’re going to use a collection of popular tools as a part of this article’s sample application. The tools we’ll use, in the order of importance, will be:
Ranking Twitter Profiles
Let’s do an overview of the problem we will solve as a part of our sample application. The problem we’re going to solve is how to discover communities of influencers on Twitter using a set of seed profiles as inputs. To solve this problem without a background in machine learning or social network analytics might be a bit of a stretch, but we’re going to take a stab at it using a little bit of computer science history.
I dug up the original research paper on PageRank from Stanford for some inspiration. In the paper, the authors talk about the notion of approximating the "importance" of an academic publication by weighting the value of its citations.
The reason that PageRank is interesting is that there are many cases where simple citation counting does not correspond to our common sense notion of importance. For example, if a webpage has a link to the Yahoo home page, it may be just one link but it is a very important one. This page should be ranked higher than many pages with more links but from obscure places. PageRank is an attempt to see how good an approximation to "importance" can be obtained just from the link structure.
— Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry (1999)
The PageRank Citation Ranking: Bringing Order to the Web
Now let’s take the same definition that is described in the paper and apply it to our problem of discovering important profiles on Twitter. Twitter users typically follow other users to track their updates as a part of their stream. We can use the same reasoning behind using PageRank on citations to approximate the "importance" of profiles on Twitter. This reasoning would tell us that it’s not the number of followers that make a profile important, it is measured by how important those followers are.
That’s exactly what we’re going to build in this article, and we’ll end up with something that looks like the following table.
The first thing we’re going to need to worry about when building this solution is how we’re going to calculate PageRank on potentially millions of users and links. To do this, we’re going to use something called a graph processing platform
What is a graph processing platform?
A graph processing platform is an application architecture that provides a general-purpose job scheduling interface for analyzing graphs. The application we’ll build will make use of a graph processing platform to analyze and rank communities of users on Twitter.
The diagram below illustrates a graph processing platform
Submitting PageRank Jobs
The graph processing platform I’ve described will provide us with a general purpose API for submitting PageRank jobs to Apache Spark’s GraphX module from Neo4j. The PageRank results from GraphX will be automatically applied back to Neo4j without any additional work to manually handle data loading. The workflow for this is extremely simple for our purposes. From a backend service we will only need to make a simple HTTP request to Neo4j to begin a PageRank job.
I’ve also taken care of making sure that the graph processing platform is easily deployable to a cloud provider using Docker containers.
By default, Docker Compose will orchestrate containers on a single virtual machine. If we were to build a truly fault tolerant and resilient cloud-based application, we’d need to be sure to scale our system to multiple virtual machines using a cloud platform. This is the subject of a later article.
System Architecture Diagram
The diagram below shows each component and microservice that we will create as a part of this sample application. Notice how we’re connecting the Spring Boot applications to the graph processing platform we looked at earlier. Also, notice the connections between the services, these connections define communication points between each service and what protocol is used.
The three applications that are colored in blue are stateless services. Stateless services will not attach a persistent backing service or need to worry about managing state locally. The application that is colored in green is the Twitter Crawler service. Components that are colored in green will typically have an attached backing service. These backing services are responsible for managing state locally, and will either persist state to disk or in-memory.
Thats it for now! Stay tuned in the next series of posts where I will walkthrough each of the following concerns as a part of building this service.
- Creating Spring Data Neo4j repositories
- Exposing repository APIs using Spring Data REST
- Connecting to the Twitter API
- Using AMQP to import profiles from the Twitter API
- Scheduling new PageRank jobs
- Discovering new users
- Registering as a discovery client with Eureka