Blog Archives

Social Media Data - The kind of new frontier.

4/28/2016

INTRO:
Riot games is responsible for making one of the worlds most popular games, "League Of Legends".

I would bet my bottom dollar that the team at Riot have a fairly large data warehouse, probably something like Hadoop and/or a flavour of MPP database. What's interesting about "transactional" data warehousing is it can monitor when an event starts or stops happening, but it struggles to tell you why.

Within the last 10 years most games have an online component whether you notice it or not. With the advancement of connected consoles and PC's, companies are able to fine tune their games years after the initial release date using updates (referred to as patches).

In League of Legends you can play as a certain "Champion" . Champions have different attributes, some may be better at defence while others are better at offence. Reading through the patch notes supplied by Riot they are continually tweaking each champion.

Although I don't have the transactional data from League of Legends, I do have a good slice of social media data for late 2015, all in all about 5,000,000 comments.

I'm going to show why tracking your customers feedback is almost as important as tracking transactional event based data.

THE SETUP:

I have a bunch of software we have written that collects social data in real-time with my company needles.io .
We are going to focus on two updates (patches): 5.23 and 5.22.
We have a machine learning frame work that we have trained to understand sentiment and other things (this also works in real-time).
Do some analysis.

THE DEEP-ISH DIVE:

Note: A Champion is character you can play as in League Of Legends

1.0 This graph below is showing the count of comments that mention any "Champion" for the month of November.

1.1 Next we look at specific group of Champions that were tweaked in the patches. The graph below shows what percentage of all Champion comments these dictated over the month of November.

Two date ranges we will look into with noticeable peaks: (10-Nov to 15-Nov) and (21-Nov to 24-Nov), which strangely enough are around the release date of our patches!

1.2 - Here we look at the count of Champion comments between the dates (10-Nov to 15-Nov) and (21-Nov to 24-Nov). Orange means the champion was updated in the patch and Purple means it wasn't.

Patch 5.22

Patch 5.23

You might have noticed that I have annotated a couple of Champions in each of the graphs. I'm going to take a closer look at these.

Some champions dominate comments threads more than others by being more popular or more controversial, we should take this into account when digging in deeper.

1.3 - We attempt to weed out the dominating Champions from our selected list

Percentage Of Comments Grouped by Champions - All Time

Percentage of Comments Grouped by Champions - (10-15 Nov)

Percentage Comments Grouped by Champions - (21-24 Nov)

Using the an example of the Champion Bard, you can summarize the above three graphs:

Over all time Bard gets an average comment percentage by Champion of 1.21%
Between 10-15 November Bard gets an average comment percentage by Champion of 1.8%
Between 21-24 November Bard gets an average comment percentage by Champion of 2.0%

The most prominent mover in % gain is the Champion "Graves", I think Graves warrants some special attention (especially during the Patch 5.22 period).

1.4 - Count of Champion "Graves" Comments Over Time

All the tedious work above of filtering things out has paid off, we can see two very obvious spikes for Graves one of which covers patch 5.22.

Next we will use an algorithm that detects noun phrases in comments for the period 10th to 15th of November and determine which appear most frequently.

1.5 Noun Phrase Frequency For Champion Graves between 10-15 November.

Looking at the data we can see that there are some obvious hot spots in noun frequency, particularly "jungle"

Next we will have a look at comments between the same time period that also talk about other Champions.

1.6 Count of Comments Mentioning Graves Between 10-16 November that also talk about other Champions.

Now we will also cross relate any comments that mention graves for the same period and mention any other entity in the League of Legends Universe

1.7 Count of Comments Mentioning Graves Between 10-15 November that also talk about other League Of Legends Entities

This visualisation shows:

ItemTypes :Jungle and Lane show strong comment co-relations.
Champions: Lucian, Vayne and Yasuo show strong comment co-relations.

I'm still on the fence regarding sentiment analysis and its accuracy, you can find plenty of articles covering this topic online. For the sake of doing it, I decided to run a sentiment algorithm on all comments that mention Graves and also other League of Legends entities.

1.8 - Sentiment Analysis Of Comments Mentioning Graves by All League Of Legend Entities.

This graph is fairly large and hard to consume, lets split it out by the top Positive and Negative counts

Negative

Positive

We are starting to tell a story without even looking at the comments (yet), below are a few examples:

People who talk about Graves and the item types Lane and Jungle high forms of negativity.
People who talk about Graves and the Champion Caitlyn seem to be more positive.

(there are some entities I have omitted due to their generic names, eg: "magic")

Next we will look at the key words people are saying when they talk about Graves, Jungle and Lane

1.9 Count of Key Words In Comments

Graves and Jungle

Graves and Lane

There are two clear themes between both Lane and Jungle - people are talking about damage.
Lets look at some of the raw comments that we filtered down to people who talk about "Graves", "Lane" or Jungle" with the key word of Damage (there are a lot, so I will track down only a few key ones that resemble the story of the rest)

Comment #1:
His Q does a ton of damage and has a pretty flexible mana cost. if nothing else, his Q is too strong already in terms of waveclear potential. on top of that, a lot of powerful graves play is coming in the solo lanes and jungle

Comment #2:
They were weak because the meta was very heavily against them. TP was very strong so you often saw the two solo lanes run it and its low cd allowed them to go bot just about any time they pleased.
The adcs that are meta rn, Graves, MF, Luc, Cait were lane bullies that werent viable due to how much they would get punished for pushing. Now if they get ahead, they snowball the game and become an issue, thats what lane bullies do, thats what Renekton does if he gets ahead, thats what LB does if she gets ahead. If the scaling ADCs are allowed to scale (Vayne, Jinx, Trist) they become very strong, but games often end before they have their items, so the lane bullies are the ones that seem broken. The ones that are a little strong are being toned down, but this is fine. The meta was not adcs carrying last season, this is why Sivir has a Victorious Skin, because they were basically damage supports, now they carry if they get ahead, its not any different than any other lane.

Comment #3
So I just played my first game of the new preseason and it was interesting.I played a quick blind pick game as Warwick jungle. I quickly realized you cannot 1v1 the rift herald even as Warwick, not even at level 9.
The thing does crazy damage it felt like at least.So I faced 4 ADCs on the other team, Im guessing people really wanted to try out the new items and Marksmen update, I just hope thats not a regular thing I run into. Someone on my team had Graves and was wrecking, really cant wait to try him out. Seems really good right now.Almost everyone was running Warlords Bloodlust as their keystone, but 3 people didnt even have a keystone set. I think thats about it. I went 11/6/2 and we lost at 30 minutes.

Comment #4
Main support here. So its kind of hard too say when you want to take Ignite or exhaust depending on your match-up, so lets create an example. You said who you play, so lets create a random I dont know, Bard, Vayne. Dont know why that came to mind anyway. Depending on the other match-up youll want to decide quickly, as Bard Vayne you have a lot of potential to rip someone in lane very quickly, I tend to see Bards take Ignite, if Im playing Bard I normally take ignite too, thus doesnt answer your question however, so Ill take it into a more easier break down, you should take exhaust if you dont want your ADC to get ripped apart by an all in team (this is where supports make mistakes) Your ADC should be your main form of damage so if your ADC is squishy (Vayne, Ez I might say put Graves in there with changes) Youll want to take exhaust I think I missed some but you should be seeing the point, if your ADC is something like; Cait, Lucian, Draven, Trist you probably want to take Ignite just because of the power you have in lane but it also depends on the ADC and there support if you see something like Braum, Lucian you instantly want to take Exhaust as theyre very bursty and can destroy your lane again as Annie Kalista, this probably makes no sense but if you have anymore questions you can respond and Ill answer them the best I can :P I kinda hoped this helps although it probably didnt make much sense .

I could go on, unless you're a League Of Legends player or work at Riot games this might not make any sense to you.

CONCLUSION:
If you got this far, well done! It can seem a little bit boring if you don't know anything about gaming or even League of Legends. In summary this is what we did:

Collect 5 million plus comments.
Look for peaks in comments (filter based on those dates).
Looked at the comment count based on what champion was mentioned.
Filter out the noisy champions, identify Champion "Graves" as a potential subject.
Look at what other entities that people are talking about as well as graves. Started to discover that people were interested in Jungle and Lane.
Filter again based on Graves, Jungle and Lane. Keyword match and discover that people are talk about "damage"
Look at the raw data!

You might be asking yourself "I can look at forums, Reddit or twitter on my own and figure this out!". Well what you're not taking into consideration as humans we are attracted to emotional statements, for example something like: "OMG MY PING ON LEAGUE OF LEGENDS SUCKS", because it's all in caps and contains strong sentiment.. we fall into the trap of thinking this is important, but there is a chance this player lives in a remote village and on dial-up. Unless you can aggregate these themes it's hard to trust them or yourself.

It's also an extremely easy way for people working on a specific problem to access a specific entity and drill down on key words to help them design and build better features.

This analysis would have been really fun to marry together with transactional data, imagine having access to information like Champion churn or Effectiveness from game-play (one can only wish!).

PS - If you're interested in the tech used: Java, Python, SQL, Postgres were the main ones.

0 Comments

MetaBase Data Visualisation Review - It's Open Source Yall

4/24/2016

5 Comments

INTRO:
I'm going on a little bit of an open source visualisation review stint. Last week when I reviewed Zeppelin I came across metabase. When I looked at the UI and visualisation screenshots I was a little blown away on how gorgeous it looked, usually with open-source anything you're faced with engineer lead designs and to be fair.. engineers aren't generally good at making things look good (gives engineers a hug).

WHAT IS METABASE?
The one liner from the github account says: "Metabase is the easy, open source way for everyone in your company to ask questions and learn from data."

After getting under the hood I have found that it's a great replacement for tool suites like Tableau Server or Microstategy. The whole focus is about enabling a data centric organisation, you can tell the contributors have focused deeply on enabling the average user with the ability to deep dive into data without even being close to a SQL power user (easier than excel!).

THE INSTALL:
My setup is the usual 14.04 ubuntu VM, I also have hosted postgres, mysql and neo4j. So if you're following these steps and you're using a different OS flavour you will need to alter the commands.

They say it's a 5 minute install, so I'm hoping this section is pretty small!

Download the package (always check for new versions)

wget http://downloads.metabase.com/v0.16.1/metabase.jar

Run it
You can tell from the command that metabase is Java, so you're gonna need to have that installed. I'm currently running "Java(TM) SE Runtime Environment (build 1.8.0_51-b16)". You can see your version of Java by running "java -version" on the command line. Ok let's crank it up:

java -jar metabase.jar

I think that's the first time I have started something open source without a failure on the first attempt. You get a terminal full of messages like this one:

LETS GO EXPLORING!
By default metabase starts on port 3000, I navigated to "http://192.168.1.21:3000" (obviously change the IP address to whatever your host is on):

After I select "Let's get started", it's already asking my for account and login creation information. This makes me very hopeful that there is good integrated account management.

After the login information is complete, it's asking me to setup data sources. You can see it supports a good amount of standard data sources (sql server, postgres etc) and big data tech (redshift, mongo, big query) which is awesome. I'm going to go ahead and connect up my postgres instance.

After I setup my postgres db, it's giving me the option of allowing metabase to collect information on tool usage, you can opt out if you wish.

I'm presented with some nice messages helping me on my MetaBase journey (again.. kudos to whoever is doing the UI/UX)

MetaBase has a sample dataset to get started, so if you don't have a DB to connect to you can still evaluate MetaBase. When you first login to Metabase it will take you on your first query tutorial.

MetaBase uses the terminology of "Questions" which I assume they are attempting to take away technical terms like "Query" to open to the door for non technical people to feel comfortable with exploring datasets.

I'm going to have a go at running a query against my stock database on postgres.

There's nothing fancy going on with this, it's as simple as looking at the raw data. Being a SQL guy for 15 years I have to get my head around the simplified terms to query the data (But I can see that someone without experience would pick it up quick).

If you're an excel power user, you can download the data directly from this page by selecting the download icon towards the upper right of the above screen shot.

Once I selected the download option, it warns me that the dataset is large and the max I can retrieve is 1 million rows (which will probably crash excel anyway).

Above I'm doing a simple average aggregation of stock prices. I was super impressed on how easy it was to group by time by providing you with the options of hour, day, week etc, much faster than writing it in SQL.

You can save each "Question" providing a useful Description.

What Visualisations Are Available?:
Pie, Line, Bar, Data Charts, Bar, Area and World US Maps

Dashboards
I saved my visualisation above and created a dashboard. To add a visualisation to a dashboard simply hit the (+) button as seen on the image below:

Once a visualisation is selected, you're presented with a grid like system for positioning and aesthetics. This is great to see as some graphs require more audience focus than others.

Above you can see a really quick dashboard I setup, it was completely intuitive and took me about 5 minutes to do.

PULSES
Another cool feature; Pulses allow you to send data on a schedule either via email or everyones favourite messaging app "slack"

MetaBase provides you with a historical view of changes or activity in your organisation on the splash screen.

What About Power Users?
For people who are SQL experts and want to write some killer SQL Scripts: In the new 'Question" option select "Open Editor" and go crazy!

What I loved:

Almost everything, seriously this is cool
The easy setup.
The thought of enabling otherwise complex queries for everyday business users, it seems that this is their main goal and I'm a total advocate of that.
It looks dead sexy.

What I would like to see more of:

Some more graph options, plot charts, regression etc.
Locking down of datasets, example: groups to manage read / write access.
LDAP and or Active Directory integration for accounts.

CONCLUSION
This is the best open source visualisation suite I have laid my eyes on. It's extremely well thought out, it's free and is an awesome alternative to paid tools suites like Tableau or Microstrategy. This should be on every startups (people with a budget) list of technology to implement internally.

5 Comments

Apache Zeppelin

4/12/2016

1 Comment

I haven't done a review on a big data toolset for a few months, instead I have been boring you with traffic and stock data. Well fear no more! Today I review Apache Zeppelin.

It's good to note that the project is under incubation with Apache right now, which means that it's going to be a little rough around the edges. The project looks very healthy with 110 contributors on github which gives me faith that it's here for the long term.

WHAT IS ZEPPELIN?
Do you work in data analysis? Do you have multiple complex simple and big data environments? Do you work in a team? Do you like Sharing? This is a great platform for analysts to share, document and report data.

It supports a plethora of systems and languages: Python, R, Spark, Hadoop, Postgres, ElasticSearch, JDBC, Flat Files, Ignite, Flink, Cassandra, Hbase (and I'm sure many more to come).

In their words :
A web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

HOW TO INSTALL IT:
I'm installing Zeppelin on an Ubuntu 14.04 VM host, so you may need to adjust any code snippets based on your operating system of choice.

Download the package, unpack and remove tgz file.
At the time of this post, the latest stable version is 0.5.6. You might want to check to see if there's a newer version if you're following my steps.

wget http://www-us.apache.org/dist/incubator/zeppelin/0.5.6-incubating/zeppelin-0.5.6-incubating.tgz
tar -zxvf zeppelin-0.5.6-incubating.tgz
rm zeppelin-0.5.6-incubating.tgz

There's a bunch of dependencies, make sure you have the following:

Maven (not maven2) - Minimum Version 3.1.x
Git - "sudo apt-get install git"
NPM (node package manager) - "sudo apt-get install npm"
libfontconfig - "sudo apt-get install libfontconfig"
Java - They say Java 7 but I used Java 8 no problems.

For me I had an old version of Maven, I had to uninstall the old version and run the following to get it all up to date:

wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/bin/mvn

Configure Maven to get some more memory:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=1024m"

Build the package, this takes some time! (make sure you're in the zeppelin dir)

mvn clean install -DskipTests

Start it up!

bin/zeppelin-daemon.sh start

EXPLORING ZEPPELIN
In your browser, go to "192.168.1.21:8080", replace my own IP with your host IP.

So where to from here? I have a couple of databases (postgres and mysql) setup on the same VM so we will now connect the postgres database.

In your browser, select the "Interpreter" tab and click the create button. (An interpreter is like setting up a new connection to a new environment")

Make sure you select the correct "interpreter" in the drop down, I selected psql for postgres. Add in the usual connection details. You will see that my JDBC URL is localhost, as it's installed on the same VM. One thing I'm not too keen on is the fact is once you save the "interpreter" it's happy to show your password to the world.

One thing which took me awhile to get the connection working with Postgres was the connection URL, if you don't specify the database you're gonna have a bad time. My connection URL ending up being "jdbc:postgresql://192.168.1.21:5432/needles"

RUNNING OUR FIRST QUERY:

You need to specify the interpreter in the first line in my example I used %psql then my query.

GRAPHS
Below I ran a simple aggregate to try out the graphing options.

The first result I got didn't show this! Both axis were showing numerical values, and you can see from my query that I am grouping by a date. To fix this I selected the "Settings" tab, re-arranged the values and keys (similar to excel pivot charts) to reflect what I wanted. Displayed below are more graphing options using the same query:

It's important to note that each notebook at the time of the blogpost can only have one interpreter, it's a known limitation in Zeppelin.

OTHER COOL THINGS

You can change each workbook to display either "default, simple or report". Report will display only the graphs and none of the setup (for our executive friends). Default and simple seem to be the same?
You can "schedule" workbooks, this will refresh graphs throughout the day.
You can link graphs and put them on other sites (think confluence etc).
It has SSL if you want it.
There now seems to be a notion of workbook locking and user permissions, I didn't have time to set this up and with my brief search on the net I couldn't find anything to help me out. Chances are it uses local accounts and you have to configure this: zeppelin.ssl.client.auth.
Version control!

THINGS I WOULD LOVE

A contributor to focus on the UI, provide colour options and nicer looking graphs.
The ability to add "Notes", this tool is a data discovery tool! It's important for analysts to know the data journey without guessing via queries alone.
Documentation on setting up user accounts and groups.
Write some better logic to interpret the difference between keys and values.

CONCLUSION!
This is cool! $0 cost, connects to almost everything and it's only going to get better. Analytical teams need shared platforms that enable story telling and data discovery.

1 Comment

Social Media Data - The kind of new frontier.

MetaBase Data Visualisation Review - It's Open Source Yall

Apache Zeppelin

Author

Archives

Categories