One of the most renowned publishers in the EU and USA reached out to us with the intention of building a new value-added service for their subscribers: predicting the outcome of bill votes in the US Congress. The service was meant to offer an innovative, impartial and accurate approach to political analytics, as opposed to the classic “expert” approach used by other players in the publishing market.
Solution
The journey began with a deep analysis of the problem and of the data available to solve it. The main challenge was that, to reach high accuracy, the machine learning model needed attributes of each bill and the voting habits of senators gathered from multiple disparate sources. An added complexity was that both “bill” and “voter” are complex entities: a specific voter's decision on a specific bill depends on multiple attributes and their combinations, which called for advanced prediction approaches.
The initial development team consisted of two data scientists and one DevOps engineer. First, we built an API-like connector for fetching and processing the data. A lot of work went into cleaning up the existing data, because the success of the model hinged on proper data cleansing and enrichment. Moreover, since the number of external sources was expected to grow, the data processing module needed an easily extensible architecture so that it could scale up quickly whenever a new source came on board.
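An extensible connector of this kind can be sketched as a registry of source adapters feeding one shared cleaning step. This is a minimal illustration, not the project's actual code: the names `register_source`, `VoteRecord`, `clean`, `load_all` and the `example_feed` data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# A source adapter is any function that returns raw records as dicts.
SourceFn = Callable[[], Iterable[dict]]

# Hypothetical registry: maps a source name to its adapter.
SOURCES: Dict[str, SourceFn] = {}


def register_source(name: str) -> Callable[[SourceFn], SourceFn]:
    """Decorator that plugs a new source into the pipeline."""
    def decorator(fn: SourceFn) -> SourceFn:
        SOURCES[name] = fn
        return fn
    return decorator


@dataclass(frozen=True)
class VoteRecord:
    bill_id: str
    voter_id: str
    vote: str    # normalized to "yea" or "nay"
    source: str


def clean(raw: dict, source: str) -> VoteRecord:
    """Normalize one raw record; raise if it is unusable."""
    vote = str(raw.get("vote", "")).strip().lower()
    vote = {"yes": "yea", "aye": "yea", "no": "nay"}.get(vote, vote)
    if vote not in ("yea", "nay"):
        raise ValueError(f"unrecognized vote value: {raw.get('vote')!r}")
    return VoteRecord(
        bill_id=str(raw["bill_id"]).strip().upper(),
        voter_id=str(raw["voter_id"]).strip(),
        vote=vote,
        source=source,
    )


def load_all() -> List[VoteRecord]:
    """Pull every registered source, skipping records that fail cleaning."""
    records = []
    for name, fetch in SOURCES.items():
        for raw in fetch():
            try:
                records.append(clean(raw, name))
            except (KeyError, ValueError):
                continue  # a real pipeline would log the rejected record
    return records


# Example adapter with made-up data; a real one would call an external API.
@register_source("example_feed")
def example_feed() -> Iterable[dict]:
    return [
        {"bill_id": " hr-1 ", "voter_id": "S001", "vote": "Yes"},
        {"bill_id": "hr-1", "voter_id": "S002", "vote": "absent"},
    ]
```

With this layout, adding a source means writing one decorated function; the cleaning and loading code stays untouched.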
The next step was to get a full understanding of the data, so we computed descriptive statistics and ran a set of exploratory models. With that completed, we ran a Principal Component Analysis (PCA) to understand the feature weights and how the key features affect the outcome. After that, we devised a model testing plan: starting from 10 “competitors”, we narrowed the list down to four key models and combined them into an ensemble.
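The selection step can be illustrated with a small sketch: rank candidate models on held-out data, keep the best few, and combine them. The text does not say how the candidates were ranked or how the ensemble combined its members, so ranking by accuracy and majority voting are assumptions here, and the toy models and data are invented for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# A model maps a feature dict to a predicted vote, "yea" or "nay".
Model = Callable[[dict], str]


def accuracy(model: Model, rows: Sequence[dict], labels: Sequence[str]) -> float:
    """Share of rows for which the model predicts the true label."""
    if not labels:
        raise ValueError("need at least one labeled row")
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def select_top(candidates: Dict[str, Model], rows: Sequence[dict],
               labels: Sequence[str], k: int) -> List[Tuple[str, Model]]:
    """Rank candidates by validation accuracy and keep the best k."""
    ranked = sorted(candidates.items(),
                    key=lambda item: accuracy(item[1], rows, labels),
                    reverse=True)
    return ranked[:k]


def ensemble_predict(models: List[Tuple[str, Model]], row: dict) -> str:
    """Majority vote; a tie goes to the best-ranked model among the tied labels."""
    votes = [model(row) for _, model in models]
    counts = Counter(votes)
    best = max(counts.values())
    return next(vote for vote in votes if counts[vote] == best)


# Toy candidates and validation data, invented for illustration.
candidates: Dict[str, Model] = {
    "party_line": lambda r: "yea" if r["same_party_as_sponsor"] else "nay",
    "always_yea": lambda r: "yea",
    "always_nay": lambda r: "nay",
    "contrarian": lambda r: "nay" if r["same_party_as_sponsor"] else "yea",
}
rows = [{"same_party_as_sponsor": flag} for flag in (True, True, False, False, False)]
labels = ["yea", "yea", "nay", "nay", "yea"]

top = select_top(candidates, rows, labels, k=3)
print([name for name, _ in top])
print(ensemble_predict(top, {"same_party_as_sponsor": True}))
```

In practice the ensemble should be scored on data that was not used for the selection itself, otherwise the reported accuracy will be optimistic.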
Technology and skills
Skills
Predictive Analytics
Technology Stack
Python
SQL & NoSQL
R
MLlib
Spark
Version control & development tools
Git
Jupyter
Jira/Confluence
Slack
Result
The model reached an accuracy of 84%, proving the business case. This successful proof of concept led to a series of new initiatives and value-added services that build on the organization's existing data and use data science and machine learning to gain unique insights from it.