Ed's nerdy data rants

Social Media Data - The kind of new frontier.

4/28/2016

INTRO:
Riot games is responsible for making one of the worlds most popular games, "League Of Legends".

I would bet my bottom dollar that the team at Riot have a fairly large data warehouse, probably something like Hadoop and/or a flavour of MPP database. What's interesting about "transactional" data warehousing is it can monitor when an event starts or stops happening, but it struggles to tell you why.

Within the last 10 years most games have an online component whether you notice it or not. With the advancement of connected consoles and PC's, companies are able to fine tune their games years after the initial release date using updates (referred to as patches).

In League of Legends you can play as a certain "Champion" . Champions have different attributes, some may be better at defence while others are better at offence. Reading through the patch notes supplied by Riot they are continually tweaking each champion.

Although I don't have the transactional data from League of Legends, I do have a good slice of social media data for late 2015, all in all about 5,000,000 comments.

I'm going to show why tracking your customers feedback is almost as important as tracking transactional event based data.

THE SETUP:

I have a bunch of software we have written that collects social data in real-time with my company needles.io .
We are going to focus on two updates (patches): 5.23 and 5.22.
We have a machine learning frame work that we have trained to understand sentiment and other things (this also works in real-time).
Do some analysis.

THE DEEP-ISH DIVE:

Note: A Champion is character you can play as in League Of Legends

1.0 This graph below is showing the count of comments that mention any "Champion" for the month of November.

1.1 Next we look at specific group of Champions that were tweaked in the patches. The graph below shows what percentage of all Champion comments these dictated over the month of November.

Two date ranges we will look into with noticeable peaks: (10-Nov to 15-Nov) and (21-Nov to 24-Nov), which strangely enough are around the release date of our patches!

1.2 - Here we look at the count of Champion comments between the dates (10-Nov to 15-Nov) and (21-Nov to 24-Nov). Orange means the champion was updated in the patch and Purple means it wasn't.

Patch 5.22

Patch 5.23

You might have noticed that I have annotated a couple of Champions in each of the graphs. I'm going to take a closer look at these.

Some champions dominate comments threads more than others by being more popular or more controversial, we should take this into account when digging in deeper.

1.3 - We attempt to weed out the dominating Champions from our selected list

Percentage Of Comments Grouped by Champions - All Time

Percentage of Comments Grouped by Champions - (10-15 Nov)

Percentage Comments Grouped by Champions - (21-24 Nov)

Using the an example of the Champion Bard, you can summarize the above three graphs:

Over all time Bard gets an average comment percentage by Champion of 1.21%
Between 10-15 November Bard gets an average comment percentage by Champion of 1.8%
Between 21-24 November Bard gets an average comment percentage by Champion of 2.0%

The most prominent mover in % gain is the Champion "Graves", I think Graves warrants some special attention (especially during the Patch 5.22 period).

1.4 - Count of Champion "Graves" Comments Over Time

All the tedious work above of filtering things out has paid off, we can see two very obvious spikes for Graves one of which covers patch 5.22.

Next we will use an algorithm that detects noun phrases in comments for the period 10th to 15th of November and determine which appear most frequently.

1.5 Noun Phrase Frequency For Champion Graves between 10-15 November.

Looking at the data we can see that there are some obvious hot spots in noun frequency, particularly "jungle"

Next we will have a look at comments between the same time period that also talk about other Champions.

1.6 Count of Comments Mentioning Graves Between 10-16 November that also talk about other Champions.

Now we will also cross relate any comments that mention graves for the same period and mention any other entity in the League of Legends Universe

1.7 Count of Comments Mentioning Graves Between 10-15 November that also talk about other League Of Legends Entities

This visualisation shows:

ItemTypes :Jungle and Lane show strong comment co-relations.
Champions: Lucian, Vayne and Yasuo show strong comment co-relations.

I'm still on the fence regarding sentiment analysis and its accuracy, you can find plenty of articles covering this topic online. For the sake of doing it, I decided to run a sentiment algorithm on all comments that mention Graves and also other League of Legends entities.

1.8 - Sentiment Analysis Of Comments Mentioning Graves by All League Of Legend Entities.

This graph is fairly large and hard to consume, lets split it out by the top Positive and Negative counts

Negative

Positive

We are starting to tell a story without even looking at the comments (yet), below are a few examples:

People who talk about Graves and the item types Lane and Jungle high forms of negativity.
People who talk about Graves and the Champion Caitlyn seem to be more positive.

(there are some entities I have omitted due to their generic names, eg: "magic")

Next we will look at the key words people are saying when they talk about Graves, Jungle and Lane

1.9 Count of Key Words In Comments

Graves and Jungle

Graves and Lane

There are two clear themes between both Lane and Jungle - people are talking about damage.
Lets look at some of the raw comments that we filtered down to people who talk about "Graves", "Lane" or Jungle" with the key word of Damage (there are a lot, so I will track down only a few key ones that resemble the story of the rest)

Comment #1:
His Q does a ton of damage and has a pretty flexible mana cost. if nothing else, his Q is too strong already in terms of waveclear potential. on top of that, a lot of powerful graves play is coming in the solo lanes and jungle

Comment #2:
They were weak because the meta was very heavily against them. TP was very strong so you often saw the two solo lanes run it and its low cd allowed them to go bot just about any time they pleased.
The adcs that are meta rn, Graves, MF, Luc, Cait were lane bullies that werent viable due to how much they would get punished for pushing. Now if they get ahead, they snowball the game and become an issue, thats what lane bullies do, thats what Renekton does if he gets ahead, thats what LB does if she gets ahead. If the scaling ADCs are allowed to scale (Vayne, Jinx, Trist) they become very strong, but games often end before they have their items, so the lane bullies are the ones that seem broken. The ones that are a little strong are being toned down, but this is fine. The meta was not adcs carrying last season, this is why Sivir has a Victorious Skin, because they were basically damage supports, now they carry if they get ahead, its not any different than any other lane.

Comment #3
So I just played my first game of the new preseason and it was interesting.I played a quick blind pick game as Warwick jungle. I quickly realized you cannot 1v1 the rift herald even as Warwick, not even at level 9.
The thing does crazy damage it felt like at least.So I faced 4 ADCs on the other team, Im guessing people really wanted to try out the new items and Marksmen update, I just hope thats not a regular thing I run into. Someone on my team had Graves and was wrecking, really cant wait to try him out. Seems really good right now.Almost everyone was running Warlords Bloodlust as their keystone, but 3 people didnt even have a keystone set. I think thats about it. I went 11/6/2 and we lost at 30 minutes.

Comment #4
Main support here. So its kind of hard too say when you want to take Ignite or exhaust depending on your match-up, so lets create an example. You said who you play, so lets create a random I dont know, Bard, Vayne. Dont know why that came to mind anyway. Depending on the other match-up youll want to decide quickly, as Bard Vayne you have a lot of potential to rip someone in lane very quickly, I tend to see Bards take Ignite, if Im playing Bard I normally take ignite too, thus doesnt answer your question however, so Ill take it into a more easier break down, you should take exhaust if you dont want your ADC to get ripped apart by an all in team (this is where supports make mistakes) Your ADC should be your main form of damage so if your ADC is squishy (Vayne, Ez I might say put Graves in there with changes) Youll want to take exhaust I think I missed some but you should be seeing the point, if your ADC is something like; Cait, Lucian, Draven, Trist you probably want to take Ignite just because of the power you have in lane but it also depends on the ADC and there support if you see something like Braum, Lucian you instantly want to take Exhaust as theyre very bursty and can destroy your lane again as Annie Kalista, this probably makes no sense but if you have anymore questions you can respond and Ill answer them the best I can :P I kinda hoped this helps although it probably didnt make much sense .

I could go on, unless you're a League Of Legends player or work at Riot games this might not make any sense to you.

CONCLUSION:
If you got this far, well done! It can seem a little bit boring if you don't know anything about gaming or even League of Legends. In summary this is what we did:

Collect 5 million plus comments.
Look for peaks in comments (filter based on those dates).
Looked at the comment count based on what champion was mentioned.
Filter out the noisy champions, identify Champion "Graves" as a potential subject.
Look at what other entities that people are talking about as well as graves. Started to discover that people were interested in Jungle and Lane.
Filter again based on Graves, Jungle and Lane. Keyword match and discover that people are talk about "damage"
Look at the raw data!

You might be asking yourself "I can look at forums, Reddit or twitter on my own and figure this out!". Well what you're not taking into consideration as humans we are attracted to emotional statements, for example something like: "OMG MY PING ON LEAGUE OF LEGENDS SUCKS", because it's all in caps and contains strong sentiment.. we fall into the trap of thinking this is important, but there is a chance this player lives in a remote village and on dial-up. Unless you can aggregate these themes it's hard to trust them or yourself.

It's also an extremely easy way for people working on a specific problem to access a specific entity and drill down on key words to help them design and build better features.

This analysis would have been really fun to marry together with transactional data, imagine having access to information like Champion churn or Effectiveness from game-play (one can only wish!).

PS - If you're interested in the tech used: Java, Python, SQL, Postgres were the main ones.

0 Comments

MetaBase Data Visualisation Review - It's Open Source Yall

4/24/2016

5 Comments

INTRO:
I'm going on a little bit of an open source visualisation review stint. Last week when I reviewed Zeppelin I came across metabase. When I looked at the UI and visualisation screenshots I was a little blown away on how gorgeous it looked, usually with open-source anything you're faced with engineer lead designs and to be fair.. engineers aren't generally good at making things look good (gives engineers a hug).

WHAT IS METABASE?
The one liner from the github account says: "Metabase is the easy, open source way for everyone in your company to ask questions and learn from data."

After getting under the hood I have found that it's a great replacement for tool suites like Tableau Server or Microstategy. The whole focus is about enabling a data centric organisation, you can tell the contributors have focused deeply on enabling the average user with the ability to deep dive into data without even being close to a SQL power user (easier than excel!).

THE INSTALL:
My setup is the usual 14.04 ubuntu VM, I also have hosted postgres, mysql and neo4j. So if you're following these steps and you're using a different OS flavour you will need to alter the commands.

They say it's a 5 minute install, so I'm hoping this section is pretty small!

Download the package (always check for new versions)

wget http://downloads.metabase.com/v0.16.1/metabase.jar

Run it
You can tell from the command that metabase is Java, so you're gonna need to have that installed. I'm currently running "Java(TM) SE Runtime Environment (build 1.8.0_51-b16)". You can see your version of Java by running "java -version" on the command line. Ok let's crank it up:

java -jar metabase.jar

I think that's the first time I have started something open source without a failure on the first attempt. You get a terminal full of messages like this one:

LETS GO EXPLORING!
By default metabase starts on port 3000, I navigated to "http://192.168.1.21:3000" (obviously change the IP address to whatever your host is on):

After I select "Let's get started", it's already asking my for account and login creation information. This makes me very hopeful that there is good integrated account management.

After the login information is complete, it's asking me to setup data sources. You can see it supports a good amount of standard data sources (sql server, postgres etc) and big data tech (redshift, mongo, big query) which is awesome. I'm going to go ahead and connect up my postgres instance.

After I setup my postgres db, it's giving me the option of allowing metabase to collect information on tool usage, you can opt out if you wish.

I'm presented with some nice messages helping me on my MetaBase journey (again.. kudos to whoever is doing the UI/UX)

MetaBase has a sample dataset to get started, so if you don't have a DB to connect to you can still evaluate MetaBase. When you first login to Metabase it will take you on your first query tutorial.

MetaBase uses the terminology of "Questions" which I assume they are attempting to take away technical terms like "Query" to open to the door for non technical people to feel comfortable with exploring datasets.

I'm going to have a go at running a query against my stock database on postgres.

There's nothing fancy going on with this, it's as simple as looking at the raw data. Being a SQL guy for 15 years I have to get my head around the simplified terms to query the data (But I can see that someone without experience would pick it up quick).

If you're an excel power user, you can download the data directly from this page by selecting the download icon towards the upper right of the above screen shot.

Once I selected the download option, it warns me that the dataset is large and the max I can retrieve is 1 million rows (which will probably crash excel anyway).

Above I'm doing a simple average aggregation of stock prices. I was super impressed on how easy it was to group by time by providing you with the options of hour, day, week etc, much faster than writing it in SQL.

You can save each "Question" providing a useful Description.

What Visualisations Are Available?:
Pie, Line, Bar, Data Charts, Bar, Area and World US Maps

Dashboards
I saved my visualisation above and created a dashboard. To add a visualisation to a dashboard simply hit the (+) button as seen on the image below:

Once a visualisation is selected, you're presented with a grid like system for positioning and aesthetics. This is great to see as some graphs require more audience focus than others.

Above you can see a really quick dashboard I setup, it was completely intuitive and took me about 5 minutes to do.

PULSES
Another cool feature; Pulses allow you to send data on a schedule either via email or everyones favourite messaging app "slack"

MetaBase provides you with a historical view of changes or activity in your organisation on the splash screen.

What About Power Users?
For people who are SQL experts and want to write some killer SQL Scripts: In the new 'Question" option select "Open Editor" and go crazy!

What I loved:

Almost everything, seriously this is cool
The easy setup.
The thought of enabling otherwise complex queries for everyday business users, it seems that this is their main goal and I'm a total advocate of that.
It looks dead sexy.

What I would like to see more of:

Some more graph options, plot charts, regression etc.
Locking down of datasets, example: groups to manage read / write access.
LDAP and or Active Directory integration for accounts.

CONCLUSION
This is the best open source visualisation suite I have laid my eyes on. It's extremely well thought out, it's free and is an awesome alternative to paid tools suites like Tableau or Microstrategy. This should be on every startups (people with a budget) list of technology to implement internally.

5 Comments

Apache Zeppelin

4/12/2016

1 Comment

I haven't done a review on a big data toolset for a few months, instead I have been boring you with traffic and stock data. Well fear no more! Today I review Apache Zeppelin.

It's good to note that the project is under incubation with Apache right now, which means that it's going to be a little rough around the edges. The project looks very healthy with 110 contributors on github which gives me faith that it's here for the long term.

WHAT IS ZEPPELIN?
Do you work in data analysis? Do you have multiple complex simple and big data environments? Do you work in a team? Do you like Sharing? This is a great platform for analysts to share, document and report data.

It supports a plethora of systems and languages: Python, R, Spark, Hadoop, Postgres, ElasticSearch, JDBC, Flat Files, Ignite, Flink, Cassandra, Hbase (and I'm sure many more to come).

In their words :
A web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

HOW TO INSTALL IT:
I'm installing Zeppelin on an Ubuntu 14.04 VM host, so you may need to adjust any code snippets based on your operating system of choice.

Download the package, unpack and remove tgz file.
At the time of this post, the latest stable version is 0.5.6. You might want to check to see if there's a newer version if you're following my steps.

wget http://www-us.apache.org/dist/incubator/zeppelin/0.5.6-incubating/zeppelin-0.5.6-incubating.tgz
tar -zxvf zeppelin-0.5.6-incubating.tgz
rm zeppelin-0.5.6-incubating.tgz

There's a bunch of dependencies, make sure you have the following:

Maven (not maven2) - Minimum Version 3.1.x
Git - "sudo apt-get install git"
NPM (node package manager) - "sudo apt-get install npm"
libfontconfig - "sudo apt-get install libfontconfig"
Java - They say Java 7 but I used Java 8 no problems.

For me I had an old version of Maven, I had to uninstall the old version and run the following to get it all up to date:

wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/bin/mvn

Configure Maven to get some more memory:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=1024m"

Build the package, this takes some time! (make sure you're in the zeppelin dir)

mvn clean install -DskipTests

Start it up!

bin/zeppelin-daemon.sh start

EXPLORING ZEPPELIN
In your browser, go to "192.168.1.21:8080", replace my own IP with your host IP.

So where to from here? I have a couple of databases (postgres and mysql) setup on the same VM so we will now connect the postgres database.

In your browser, select the "Interpreter" tab and click the create button. (An interpreter is like setting up a new connection to a new environment")

Make sure you select the correct "interpreter" in the drop down, I selected psql for postgres. Add in the usual connection details. You will see that my JDBC URL is localhost, as it's installed on the same VM. One thing I'm not too keen on is the fact is once you save the "interpreter" it's happy to show your password to the world.

One thing which took me awhile to get the connection working with Postgres was the connection URL, if you don't specify the database you're gonna have a bad time. My connection URL ending up being "jdbc:postgresql://192.168.1.21:5432/needles"

RUNNING OUR FIRST QUERY:

You need to specify the interpreter in the first line in my example I used %psql then my query.

GRAPHS
Below I ran a simple aggregate to try out the graphing options.

The first result I got didn't show this! Both axis were showing numerical values, and you can see from my query that I am grouping by a date. To fix this I selected the "Settings" tab, re-arranged the values and keys (similar to excel pivot charts) to reflect what I wanted. Displayed below are more graphing options using the same query:

It's important to note that each notebook at the time of the blogpost can only have one interpreter, it's a known limitation in Zeppelin.

OTHER COOL THINGS

You can change each workbook to display either "default, simple or report". Report will display only the graphs and none of the setup (for our executive friends). Default and simple seem to be the same?
You can "schedule" workbooks, this will refresh graphs throughout the day.
You can link graphs and put them on other sites (think confluence etc).
It has SSL if you want it.
There now seems to be a notion of workbook locking and user permissions, I didn't have time to set this up and with my brief search on the net I couldn't find anything to help me out. Chances are it uses local accounts and you have to configure this: zeppelin.ssl.client.auth.
Version control!

THINGS I WOULD LOVE

A contributor to focus on the UI, provide colour options and nicer looking graphs.
The ability to add "Notes", this tool is a data discovery tool! It's important for analysts to know the data journey without guessing via queries alone.
Documentation on setting up user accounts and groups.
Write some better logic to interpret the difference between keys and values.

CONCLUSION!
This is cool! $0 cost, connects to almost everything and it's only going to get better. Analytical teams need shared platforms that enable story telling and data discovery.

1 Comment

Analyzing The Stock Analyst (Jim Cramer)

3/30/2016

0 Comments

INTRO:

I have dabbled twice in the stock market with very little success.

One take away I got from my experience was that the stock market is not for the faint hearted, much like a day at the race track.. I figured that the stock market was gambling for rich people. I went in very naive and bet on companies I liked, no more analysis than that. My biggest win was a 10% increase in 24 hours and my biggest loss was 20% in 24 hours.

There are loads of stock market celebrities out there, while living in Canada I frequently came across a guy on TV called "Jim Cramer", his show is called "Mad Money".

On his show Jim makes the stock-market kinda fun, he bounces around the studio with what seems like a supply of endless comedic props, whilst people call him up for advice on what to buy and sell.

So why am I talking about Jim? Well.. since I'm terrible of the stock market but pretty good with data, why not actually look at how successful a real stock market expert is?

THE SETUP:
They say most data analysis is 90% collection and preparation, that was 100% percent true in this case, but here's what I did:

Took a slice of Jim Cramer buys and sells from his website for 2011 and added them to a spreadsheet. (I covered a 3 month period roughly of which I gathered around 120 stock recommendations).
Got all historical data from Yahoo for every Jim recommendation covering 2011-2016. (you can grab a copy here if you want to play around yourself.. I did all the hardwork).
Put it all in a database.
Did some analysis .

Note: I'm no financial expert, I'm merely seeing if Jim has the chops! (is that a NZ saying?)

THE TLDR:
Summarized up, this is what I found from a high level:

Jim Cramer is smarter at the stock-market than I am.
Diversification rules.
Don't be greedy looking for the "Apple 1998" stock win.
If you buy based on a Jim Cramer recommendation, YOU need to keep up to date on that company to know when to sell (otherwise you're gonna have a bad time).

THE DEEPER DIVE:
It's very useful to note that some of these analyses are based on "The stupid", "The Realistic" and "The Perfect" stock-market scenarios, please..please don't think that you as an everyday person can achieve "The Perfect" based on what I'm saying (otherwise I wouldn't write this blog and keep all the information to myself!)

Jims Recommended Stock Value Over Time

In the above scenario I have fictitiously purchased $1,000 worth of stock from each Bull (means buy) recommendation. Our highest point of earnings is the 1st of July, 2014 where our portfolio has gained ~35k in ~3 years. You might also see some points that dip to the zero line, this is where yahoo data still reports stock prices on American holidays but zeros them all out.

The Perfect

Buying $1000 worth of each recommended buy (bull), then selling them at the highest point price between 2011 and 2016. (again.. I have to point out that no one on this planet can do this). We have earned ourselves a handsome $79,138 with 2014 and 2015 showing the most gains, that's earning almost 80% over 5 years!

The Realistic Sell at 10% gain (or max sell if 10% isn't met)

This is probably the closest to real home investors, once we reach a max gain of 10% we sell (or max sell if 10% isn't met). We made a profit of $8,210. You might be thinking that that's not much compared to the ~80% perfect sell, but the important thing to note here is we made $8,210 in around 1-3 months.

Stock Industries Over Time

This one shows us the gains made via industry. Now before we get too far ahead of ourselves Jim Cramer may have suggested more of a particular industry which shows far more swing in a particular area. We'll clarify that below. (Remember this is the same as the very first graph, it assumes we never sell a share).

Average Percent Gain By Industry

We are getting closer to what the numbers reflect here, be aware that those two heavy hitters (Entertainment and Travel) only contain 1 company in those sectors, Cramer did a great pick on them but it's not really fair to include in a "by Industry" analysis.

Average Earned by Count Of Companies in Industry

The above scenario is based on "The Perfect" sell. This graph sheds more light on what I was previously talking about above, Travel and Entertainment showing massive earning increases, but there is only 1 company in each of those categories. Jim Cramer has numerous sections in his show where he talks about diversification, violent percent swings in either direction are mainly based around the fact we have a smaller amount of companies in those industries.

Segment Specials

This graph shows the average sell price for a share over time split by "Specials". A recommendation with a special of 1 means it was a special and 0 was on a lightening round.

A segment "Special" is one that Cramer may interview a CEO or just plain old informs us what he feels is a buy and wants to let us in on it.

It's interesting to note that Jims "Specials" are really propping up the figures with non-special recommendations falling flat and lower in 2014.

Compare to the S&P and DOW
Below are the graphs for the S&P and the DOW over the same period.

S&P

DOW

What are the S&P or DOW? To quote google:

S&P
"The Standard & Poor's 500, often abbreviated as the S&P 500, or just "the S&P", is an American stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE or NASDAQ. The S&P 500 index components and their weightings are determined by S&P Dow Jones Indices."

DOW
The Dow Jones Industrial Average (DJIA) is a price-weighted average of 30 significant stocks traded on the New York Stock Exchange and the Nasdaq. The DJIA was invented by Charles Dow back in 1896.

Why are these relevant? Well if you have been following the stock market since 2009, there has been some massive investment, we want to make sure that what Jim is saying isn't just riding off of the coat-tails of a massive stock climb.

FINISH!
That's it folks! I had fun looking throughout the data, less so collecting and cleaning. I'm kinda tempted to cut more time segments for further analysis, especially since the market is in a little bit of turmoil in 2016, but there is a baby on the way and there is much to do!

0 Comments

IIS, Logstash, Elasticsearch & nxlog

3/10/2016

0 Comments

I'm going to cover how to implement IIS web logs into ElasticSearch via Logstash and nxlog.

There are a lot of ways to do this, but in short we need to implement what's called a "log shipper" on a windows server to fire events through to logstash. I see a lot of people online using an install of logstash on each machine, I try and avoid this as logstash is quite 'heavy' in terms of system resource (I have seen many cases where it uses up to 4gb of memory just to send logs).

I'm going be using the following versions of Elastic stuff for this post:

ElasticSearch 2.2.0
Logstash 2.2.2

Setting up Nxlog
First you need to go ahead and download the executable from nxlogs website. The community edition will be all that you need.

Nxlog installs in "C:\Program Files (x86)\nxlog" by default. There are two folders that you will become familiar with:

C:\Program Files (x86)\nxlog\data - (this is where the application logs to)
C:\Program Files (x86)\nxlog\conf - (the configuration for shipping IIS logs).

In the conf directory, backup the conf file by copy and pasting, rename to something like "nxlog_backup". In the original conf file remove any default log confs and add the following lines for IIS:

<Input iis_1>
Module im_file
File "C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log"
ReadFromLast False
SavePos False
Exec if $raw_event =~ /^#/ drop();
</Input>
<Output out_iis>
Module om_tcp
Host put.logstash.ip.address.here
Port 3516
OutputType LineBased
</Output>
<Route 1>
Path iis_1 => out_iis
</Route>

Two bits in BOLD need to be altered: "File" needs to be the location of where IIS is throwing logs to, and "Host" needs to be the IP address of your logstash server. We're not quite ready to start nxlog yet, but that's the windows side mainly done (fairly easy huh).

Setup a Config in Logstash
Below is my config for logstash, you will notice the GROK pattern is based on how IIS has configured the logs (we will get onto that soon)

##### LOGSTASH CONFIG #####

input {
tcp {
type => "iis"
port => 3516
host => "your.logstash.host.ip.here"
}
}
filter {
if [type] == "iis" {
grok {
match => {"message" => "%{TIMESTAMP_ISO8601:timeofevent} %{IPORHOST:hostip} %{WORD:method} %{URIPATH:page} %{NOTSPACE:query} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientip} %{NOTSPACE:useragent} %{NUMBER:status} %{NUMBER:response} %{NUMBER:win32status} %{NUMBER:timetaken}"}
}
}
}
output {
elasticsearch {
user => "user"
password => "password"
hosts => ["elastic.node.1","elastic.node.2","elastic.node.3"]
index => "iis"
document_type => "main"
}
}

A few things you're going to need to alter: Change the logstash.host.ip.here with your logstash host ip. You also may notice that I have a user and password in my output section, that's because I have shield enabled on the cluster, if you don't use shield just remove those entries.

Another important thing to note is the GROK pattern I'm using, if you copy and paste my example you might have a bad time due to your configuration of logstash having more or less log attributes.

You can find out your IIS log config by going into your windows server, opening "Internet Information Services" and select the "logging" icon.

Once you have selected the logging option, click the "Select Fields" button, this shows us what IIS has configured for web-server logs.

Alternatively you can see the order and fields included in each log file at the top of each IIS log. Check the screen shot below.

ELASTIC SEARCH INDEX N STUFF
Before you fire up logstash, you will need to have an index to write to, first you need to create the index, in my case I have called it "iis":

curl -XPUT 'http://your.elastic.server.ip:9200/iis/'

I have created a fairly simplistic mapping to go with our index:

curl -XPUT 'http://your.ip.address:9200/iis/_mapping/main' -d '
{
"main": {
"properties": {
"timeofevent": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
"hostip": {"type": "string","index" : "not_analyzed"},
"method": {"type": "string","index" : "not_analyzed"},
"page": {"type": "string","index" : "not_analyzed"},
"query": {"type": "string","index" : "not_analyzed"},
"port": {"type": "integer"},
"username": {"type": "string","index" : "not_analyzed"},
"clientip": {"type": "string","index" : "not_analyzed"},
"useragent": {"type": "string","index" : "not_analyzed"},
"status": {"type": "integer"},
"response": {"type": "integer"},
"win32status": {"type": "integer"},
"timetaken": {"type": "integer"}
}
}
}'

FIRE IT UP!

LOGSTASH
Before you fire up logstash, test your newly created config file by running the following command:
sudo /etc/init.d/logstash configtest

If all passes, you can start up logstash by running:
sudo /etc/init.d/logstash start

Remember, if things start to go awol, you can check the logs for logstash by running the following command:
sudo tail /var/log/logstash/logstash.log

NXLOG
You can startup nxlog through powershell or through the services manager.

See if everything is working by checking the size of our index in Elastic by running:
curl 'your.elastic.ip.address:9200/_cat/indices?v'

That's it folks:
If you're using Kibana you can make a real pretty dashboard to display what's happening in real time on your web-server (here's one below I made in about 20 minutes):

0 Comments

Windows and logstash (quick n dirty)

2/24/2016

0 Comments

Yes, even Windows needs love... I understand that most startups or cool kids use linux, but there are a heck of a lot of businesses that run IIS / C# / SQL Server to get things done. I must admit I have previously setup logstash monitoring on Centos and Ubuntu and it's pretty easy... so easy that I don't need to write about it.

Installing logstash on every server to collect and send logs through to elastic can cause a bunch of unnecessary overhead. You might come across the term "logstash shipper", which effectively sends logs to a centralized logstash server, so... that's what we plan to do, or better yet skip logstash and go straight to Elastic.

There are 5 millions ways to skin a cat (send logs to Elastic), but I'm going to go straight Elastics direct offerings and have a play around with Beats and winlogbeat, not to be confused with the overly expensive headphones by Dr Dre that sound like poop and only 16 year olds and recruiters wear.

#1 DOWNLOAD

https://www.elastic.co/downloads/beats/winlogbeat
Some reference install documentation
https://www.elastic.co/guide/en/beats/winlogbeat/current/index.html

#2 INSTALLING

I'm going to be the first to admit it, I don't know what powershell is (I have been too linuxed for too long).
Google tells me:

Alright, now all the windows fan boys can scoff at my lack of knowledge of powershell (I'm actually MCP certified for Win server 2003, but that was over a decade ago):

Download the current winlogbeat zip from elastic (URL above).
Extract the contents to C:\Program Files\
Rename the folder in C:\Program Files\ to "winlogbeat" removing any version references.
Open powershell as administrator (right click and select "run as administrator").
Navigate in powershell to your new dir " cd 'C:\Program Files\Winlogbeat' "
In powershell run the following ".\install-service-winlogbeat.ps1", you should get some output like below:

EDIT! - If you're having issues in powershell about something like "not digitally signed" run this in powershell "Set-ExecutionPolicy Unrestricted"

#3 CONFIGURATION

So it seems to me that the service is now installed. Before we kick this thing off we need to configure some stuff, eg: like where to send all of the glorious logs. (I'm assuming you already have an elastic instance running some where).

Most of your configuration for your server is located in "C:\Program Files\winlogbeat\winlogbeat.yml".. if you don't know what YAML is, it stands for "yet another markup language" which is slightly confusing as it's more data oriented than markup. Let's move on..

The great thing about the config YAML file is that the developers at Elastic have thankfully made some very clear comments in the file, so it shouldn't be too easy to screw up.

I cut all the comments out of the winlogbeat.yml file so it was easier to make sense of for this blog. Below is a small explanation for each section:

Registry_file: We are using windows, we installed something, a registry file was created. This was pre-populated and when I installed the service, it generated the registry file.
event_logs: What logs to capture from windows, more information to follow below.
output: Where to send the logs (in my case an elastic server on my network).
logging: Where to store the log files and how often we rotate the files so that we don't bloat.

Now to figure out what "event_logs" are available to you, in power shell run "Get-EventLog *" this will return a list of what can be gathered from this server, see below (I'm running this on my win 7 desktop initially):

Pro-tip: Before you save your own config file, make a backup of the old one. If you're not used to YAML, chances are you'll format something wrong. Try www.yamllint.com to validate your formatting.

If you're happy with the config file, save it!

#4 CREATE INDEX AND MAPPING IN ELASTIC
In Elastics documentation they ask me to run the following command in power shell (replacing the localhost section with your hostname:

PS C:\Program Files\Winlogbeat> Invoke-WebRequest -Method Put -InFile winlogbeat.template.json -Uri http://localhost:9200/_template/winlogbeat?pretty

I get the following error:

When I look at the file winlogbeat.template.json it's a standard mapping file. If you want more information to fix this then check out this article: https://matt40k.uk/2014/07/the-term-invoke-webrequest-is-not-recognized/ .

I will just grab the winlogbeat.template.json file and run it a linux terminal (the same way I created all my other mappings) to get passed the errors above. (insert lazyness here).

#5 START WINLOGBEAT SERVICE

In powershell, startup winlogbeat by running: "Start-Service winlogbeat". You can see in my screenshots below that winlogbeat has started and collecting files.

I added a new index-pattern in kibana called "winlogbeat-*" and as you can see below it has started ingesting data.

#6 FINISHING UP
A couple of notes to tidy up.

You might want to double check your service for winlogbeat, and make any changes based on your requirements. (example: setting startup type to "Automatic" or ensuring any failure scenarios are covered in the "recovery tab"
Setting the logging level: If you're pretty scarce on elastic space, you might want to adjust the logging setting in "Winlogbeat.yml" to something like "level: warning" to keep the chatter down.

That's it! We're all done and sending windows event logs through to Elastic. Doing all this enables those kind folks in operations to fix your terrible code and the issues it creates on their infrastructure much faster (because OP's are perfect).

0 Comments

My travel woes

2/24/2016

7 Comments

Over a year ago I moved from Canada back to New Zealand. My wife didn't want to directly live in Wellington due to the weather being sub-par, so we moved up the coast to a small town called Raumati.

Unfortunately for me there isn't a lot of data work in Raumati, so I commute into Wellington 3 days a week to help out my clients. On the weekend you can travel this stretch of road in around 40-50 minutes, in 2015 during peak traffic I was averaging 60-70 minutes, 2016 struck and some days I'm up to 120 minutes.

Now there have a been a lot of road works on my journey that started in 2016 and it's easy to blame the NZ traffic authority, but I have another hypothesis (that isn't ground breaking) but I think it's due to the petrol prices being so low.

The New Zealand government is pretty open with its data, so I started to collect traffic stats on State highway 1 (SH1) and historical petrol prices to see if there is a co-relation between the two.

So to stop my babbling, I will share my findings:

The data you're looking covers 2012 through to 2015 (4 years). NZTA hasn't provided data for 2016 yet. The Y Axis represents the cost of gas & X covers the amount of vehicles. NZTA provided data based on many points on SH1, I used the collection from Pukerua bay as that's pretty close to my place.

Using the same data (2012-2015), you can see a much stronger co-relation between gas prices and number of vehicles on the road (graph below). Since I don't have vehicle data for 2016, I can only make assumptions on the volumes, you can see that we are still quite a few cents down from December 2015.

So what drives people to drop public transport and take the petrol powered horse and carriage? I drive a honda fit to work (don't laugh) and it averages around 14km per litre (which is pretty good). My total costs for driving into work 5 days a week is:

$55 Gas
$50 Parking

Monthly my total costs would be around $200 parking and $220 gas totalling $420 based on 4 weeks a month. If I took the train a monthly pass would cost me $271, so if I took the train I would be saving money right? One thing we haven't taken into consideration is couples who both work or people that car pool. If we had a couple driving into work they would save a total of $122 per month just by driving in.

I will let you come to your on conclusions on the analysis, I feel like my original hypothesis is correct and more people are driving in based on the price of fuel. Personally I value time more than money and losing 2 hours a day isn't worth it which is causing me to get pretty close to getting that train pass.

Side note: It's also valid to point out the December has the highest traffic numbers of any month for every year (christmas and holidays etc). If you're hell bent on driving into work, you might want to pray that gas prices go back up.. otherwise suck it up.

Data Sources:
Train costs
Vehicle Data
Petrol Data

7 Comments

BUILD A THING #5 - NEO4J INGESTION cont..

7/14/2015

1 Comment

8th-9th June
I have been busy creating the entity relationships, you can see a screen shot of the data for Destiny Activities and the associated relationships. You can see already it's not very clean, two entries for the same activity (inherited from the dubious website I ripped it from).

Another way ensure we only have unique entity nodes is to issue a statement in Neo4j like so:
"CREATE CONSTRAINT ON (activity:Activity) ASSERT activity.isbn IS UNIQUE;"

If you try and create the same node, Neo4J will issue you something back like this:
"Neo.ClientError.Schema.ConstraintViolation"

I am also tackling these duplicate entries by writing each entry into a hashset and if it doesn't exist, write it to a JSON Array:

if (!hs.contains(finalNodes.toString())) {
hs.add(finalNodes.toString());
newArray.put(finalNodes);
}

12th June
I finished off the entity JSON create scripts and now I'm working back on what I call the NEO4J utils library and it's going really well!

I built a class with a couple of functions which create nodes and relationships (using the restful APIs), tested it end to end to ensure it all worked, as you can see below it's only two nodes but it's great to see it working:

More progress: Things are starting to get exciting, I just did a test load of all the medal entities, you can see below I haven't added the relationships but in an hour (probably a day) or so I plan to have that running.

Some time later: I made some pretty good progress with the entities and the relationships, after 2 weeks of automating I have the Destiny medals and their relationships all stored in NEO4J. Below you can see a video where I did a "MATCH ALL" type query to see all my hard work in action!

What you're seeing: "Medals" in pink, "Medal Types" in yellow and "Medal Weight" in red. All the lines joining the nodes are relationships

It's fair to say I'm getting pretty amped about this, I need to spend some time mapping out what's next, I'm assuming that I'm going to be stepping into document classification using Reddit posts.

Go to my next post

1 Comment

BUILD A THING #4 - NEO4j INGESTION

7/8/2015

0 Comments

It's been a weird week. I resigned from Vend, I have no bathroom (renovation) and it's pretty cold. Enough about my sob stories!

In my last post in this series I was talking about the API ingestion, thinking about how stupid I was to have multiple calls for different attributes / relationships on a single node. Well, as I have mentioned before this is my first time with Neo4j for some reason I skipped over the very first part of the Java API example which shows you how to use the Cypher language in an API call. So erase most of my last post from your memory.

So, today (8th July) I have been running over in my brain how to create the entities and it's fair to say I have been fairly consumed by the whole thing.

Below is a screen shot of a RAW entity detailing 3 activities.

In a traditional database and if I was lazy, I would insert this into a table called destiny_activities, but because we are using Neo4j all that goes out the window.

If we look at theNeo4J best practices, you can see that the model is extremely normalized. Below is a screen shot of how the Neo guys would datamodel book reviews

So, my preliminary data model based on the activity excel file above might look like:

This blog is a saving grace, as it forces me through an iterative design process. When you're a sole developer working on something generally we just smash code until it works and the design aspect goes some what out the window.

The next question I need to answer is how to ingest the data into Neo4J, as I create relationships to other nodes I will need to create all entity nodes first then the relationships.

My thought at the moment is to create a JSON structure that looks like this:

{ "class": "activity", "name": "A guardian rises..", "desc": "Rise in the light", "relationships": [ { "class": "activity_levels", "name": "activity_level_1" }, { "class": "game_mode", "name": "story" }, { "class": "destination", "name": "old_russia, earth" } ] }

I would infer all of this information in a one off script I would write in Java. Once I'm done I would then write a one off script that will first run through all the entities and create them, then when that is done I would create the relationships.

I'm being a little stubborn and trying not to use theCSV importer and solely use the API's, my reason behind this is I want to get used to these interfaces as when I'm done I want this all to run automatically with no hand holding.

Click here to see my next post

0 Comments

BUILD A THING #3 - NEO4J STUFF

7/6/2015

3 Comments

I'm busy digging into the Neo4J implementation in Java, there seems to be two ways of writing / reading from the database: The first is pretty similar to connecting to any database, the other is using the restful API's.

When you fire up Neo4J you do get a nice and pretty interface/console to use, one thing I have learned is that the web console utilizes the rest API's behind the scene (makes sense).

I created a Neo4J DB externally on a server at home, working with DB's for so long I assumed that I could create a connection client sending along the usual "host/port/pass/user" combo. The setup is a little weird as, if you have physical separation between the application and database it somewhat forces you to use the APIs, which in the long run isn't a bad thing if you think about the way that most modern application development architecture is done these days.

I'm going to have a go at writing information from the "Destiny Entities" I created in my last post through the API's.

(Monday 29th June)
It's been a day or so fiddling with Neo4j and I only just figured out how it all fits together, I was searching high and low with the rest API on how to make a new node with labels in one API call. I was assuming that it would have been a single object containing all the relationships and labels, I had my relational DB hat on and it took me awhile to understand how it worked (duh). This is all a discovery process for me so there are still some things I'm going to get wrong.

So I understand that you create a Node, then apply labels, properties and all that good stuff in different API calls.. my brain was thinking it was all one.. when it is not! You know the old CREATE TABLE statement with PK/FK stuff all hanging off one statement, it's easy to get confused if you're old school. In saying that it would be nice if it create API call and Neo did the decoupling behind the scenes so I didn't have to make network latent rest calls for one node, label, properties creation.

(Thursday / Friday 2 July)
Firstly I manually loaded a couple of Destiny Entity nodes, learned a bunch made a couple of mistakes (see below) but it's still pretty cool getting visual feedback right away.

But overall things are going okey dokey! If you have read this blog series I had previously ripped destiny entity information, and before I start parsing community post information I need to get the structure and relationships between the entities correct so I can map it all out.

On Thursday I spent the time making sense of the entity data and creating the relationships & classes in Java, it probably needs another day or so before it's ready to push 4000+ of these nodes in Neo4j. By the time it's finished the output will be classes with relationship information to other classes all stored in JSON ready to consume in Neo4j.

Hopefully next week I can post a visualisation of the entities all mapped out, keep posted.

Check out my next post

3 Comments

<<Previous

Social Media Data - The kind of new frontier.

MetaBase Data Visualisation Review - It's Open Source Yall

Apache Zeppelin

Analyzing The Stock Analyst (Jim Cramer)

IIS, Logstash, Elasticsearch & nxlog

Windows and logstash (quick n dirty)

My travel woes

BUILD A THING #5 - NEO4J INGESTION cont..

BUILD A THING #4 - NEO4j INGESTION

BUILD A THING #3 - NEO4J STUFF

Author

Archives

Categories