The more I look at this project, the more work I have on my plate (typical developer underestimating the work involved). For the proof of concept I have decided to stick to Destiny and Reddit as my sources of community information; both can be extended later to other games and community sites. Things I have done so far:
I didn't mention anything about "entity information" in my previous post, but the more I learn about graph databases, the more I realise I need to classify the documents. What this means is that we have to label text in a way that lets us analyse it efficiently later. For example, if I looked at this reddit post and split out the comment: "Etheric light allows you to upgrade old gear. It drops from nightfall, trials of osiris, and "other endgame activities." Pulls your guns up to 365 and keeps them fully upgraded", we could build a classifier that creates labels for "Etheric Light", "Nightfall" and "Trials of Osiris". Because I ripped what I call the entity information (#5), I can infer this relationship pretty easily. We can also use standard text algorithms to measure sentiment, objectivity and subjectivity.

Graph database people always say the whiteboard is your data model; it's no longer an ERD showing PK/FK relationships. Below is a quick example of a whiteboard-based data model from the reddit post. So you can tell from the above that, when I'm processing my reddit comment and putting it into a graph database, I'm going to create the relationships as:
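(As a rough illustration only: the node labels and the MENTIONS relationship type below are my own guesses rather than the exact relationships from the whiteboard model, and it assumes the official neo4j Python driver.)

```python
from neo4j import GraphDatabase

# Illustrative output of the (hypothetical) entity classifier for the comment above.
comment_text = "Etheric light allows you to upgrade old gear..."
entities = ["Etheric Light", "Nightfall", "Trials of Osiris"]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create the comment node once, then link each detected entity to it.
    session.run(
        "MERGE (c:Comment {id: $id}) SET c.text = $text",
        id="reddit_abc123", text=comment_text,
    )
    for name in entities:
        session.run(
            "MATCH (c:Comment {id: $id}) "
            "MERGE (e:Entity {name: $name}) "
            "MERGE (c)-[:MENTIONS]->(e)",
            id="reddit_abc123", name=name,
        )

driver.close()
```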
All of these relationships are going to have to be determined outside of Neo4j. As you can see, I have a long way to go and by no means have I designed a finished product; I'm attempting to show the way in which I create this monster, starting core development after the initial analysis and design. Freebie: here is the repo that I ripped for the Destiny entities. Go to my next post to see how I went.
INTRO

Let's cut to the chase: what's the difference between vertical and horizontal big data?

Vertical

A platform for hundreds of thousands of customers to run data analysis on their own slice of the service. Generally this would be a feature of an existing product, for example:
Horizontal

I put this in the more traditional data warehouse category. A team collects data from a transactional, event-based system and munges it into a centralized store. A team of analysts / scientists then provides insights back to the business to help drive data-informed decisions in the core product (yes, I know there are other use cases!). Not only are businesses using data to steer the ship, people want it in their everyday lives too. Look at the adoption of fitness stats, for example: people depend on this information to make better decisions and set goals. Companies are being pushed to provide insights on all kinds of subject matter back to their customers; it's becoming an expectation for most new products.

THE TRICKY BIT

Now, I could write an article all about Hadoop and how it has transformed an industry, but I have been working on Hadoop for 6+ years and you can read about that practically anywhere. What I'm genuinely interested in is the challenges in vertical big data services. Let me try and draw up a crappy Google image to help illustrate the challenge:

Horizontal (generalizations):
Vertical (generalizations):
HOW NOT TO DO VERTICAL

You're using a product, let's say an auction site. You're a user looking for a 2008 Honda Fit. Because you're tech savvy and like to save money, this auction site lets you see the average selling price of this particular model since its introduction in 2008 (time series). This lets you see the model's depreciation over time and gives you an idea of the selling price in 5 years, when you buy a new Nissan GTR. The auction site simply has a bit of SQL like this that returns the result:

SELECT AVG(price), date
FROM auctions
WHERE make = 'honda' AND model = 'fit' AND year = '2008'
GROUP BY date;

Now, because the auction house is "smart", Bill the DBA put some indexes in and it works just fine. When the feature is released it all seems to go OK; after a few weeks the DB is slowly getting hammered as people find this new feature awesome and use it to scan entire catalogues of makes and models, searching for the cars that will give the best return when they sell. What's worse, those lovely executives up on the top floor have found their magic sauce versus their competitors: "giving people reporting". They want more reports and fast, and they don't care about stability, as that's Bill's job; in Canadian, they're saying "Git 'er done". Bill has three options:
Don't mix and match OLTP and OLAP to make some bastardized child, "OLTPAP". Generally, only use your transactional core product database for singular CRUD activities. If you're designing something that has any level of aggregation, your transactional database is working much harder than it should. Modular, separated services always win: if you design a new stack and it goes down under load, it doesn't bring down the core product and customers are less angry; sure, they can't see the history of Honda Fits, but they can still buy one.

A high-level view of the wrong stack could look like this:

Separating out the customer reporting could look like this:

As illustrated above, we could utilize technologies like Kafka and event buses (API ingestion) which can work in near real time. The big question is: what can go into that pink box to serve hundreds of aggregated queries per second? There are a few options, which I will run through.

THE OLD TRIED AND TESTED

You have been working with MySQL for 10 years, you probably wrote your own sharding application, and you flip master / slave roles and servers in your sleep. The developers can keep using their existing ORM. The only issue is that you're basically replicating your transactional DB, and it's not very cost efficient when you do the math. I won't go further into this one.

DRUID.IO

This lil guy popped onto the scene a few years ago and is used quite widely internally at companies like Netflix and eBay. In their words, Druid lets you: query timeseries data as it is being ingested for both immediate and historical insights; aggregate, drill down, and slice-n-dice N-dimensional data with consistent, fast response times. The way it works, illustrated:

It can look a little daunting, and there are a lot of moving parts (not as simple as a master/slave topology), but it is powerful in its own right.
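To give a feel for the application-facing side, here is a minimal sketch of issuing a Druid native timeseries query over HTTP from Python for the auction example; the broker URL, datasource and column names are my assumptions for illustration, not part of any real deployment.

```python
import requests

# Hypothetical Druid broker endpoint and datasource; adjust for a real cluster.
BROKER_URL = "http://druid-broker:8082/druid/v2/"

# Daily selling price for 2008 Honda Fits, expressed as sum + row count;
# the application divides the two to get an average. With roll-up enabled
# you would sum a pre-aggregated count metric instead of using "count".
query = {
    "queryType": "timeseries",
    "dataSource": "auctions",
    "granularity": "day",
    "intervals": ["2015-01-01/2016-01-01"],
    "filter": {
        "type": "and",
        "fields": [
            {"type": "selector", "dimension": "make", "value": "honda"},
            {"type": "selector", "dimension": "model", "value": "fit"},
            {"type": "selector", "dimension": "year", "value": "2008"},
        ],
    },
    "aggregations": [
        {"type": "doubleSum", "name": "price_sum", "fieldName": "price"},
        {"type": "count", "name": "rows"},
    ],
}

resp = requests.post(BROKER_URL, json=query, timeout=10)
for row in resp.json():
    result = row["result"]
    avg_price = result["price_sum"] / result["rows"] if result["rows"] else 0
    print(row["timestamp"], avg_price)
```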
How does it work? It kind of cheats behind the scenes: once you send through an event, it rolls it up based on the schema you provide. So three raw events might look like:

customer_id, store_id, total_purchase, date
123,         1,        $12,            2015/01/01
123,         1,        $15,            2015/01/01
123,         2,        $5,             2015/01/01

It would roll them up like so:

customer_id, store_id, total_purchase, date
123,         1,        $27,            2015/01/01
123,         2,        $5,             2015/01/01

In the Druid schema, your dimensions would be store_id and customer_id, the metric would be total_purchase, and the granularity spec would be set to "day". If your dimensions don't change very often, Druid can be really good at rolling data up into a handful of segment types.

Pros
One thing I'm unsure of (I didn't want to put it as a pro or a con) is whether Druid could handle hundreds of queries a second. There might need to be a cache in front of Druid which periodically pulls in fresh data, adding more complexity.

NOSQL

Take your pick! I'm going to be platform agnostic on this one (e.g. Cassandra, Riak, Mongo). Storing the data inside an object per customer could be viable, especially if you have a low number of events per dimension. Storing the data but getting your application stack to do the aggregation at this level could be worth it. Now, this only works if the aggregations you want to provide are siloed to a particular customer. If we use the example of the auctions over time (as above), NoSQL could turn into a bit of a mess (aggregation over thousands of objects). But again, there are several workarounds to this; you could run the aggregation in Hadoop, which populates new service-wide data views.

SPARK STREAMING or STORM

(My preferred option.) Storm is the more mature platform, but Spark is being adopted at an alarming rate! Both require you to hold your aggregations in memory (don't believe the lies). To reduce memory usage, writing completed aggregation segments out to a NoSQL solution like Cassandra can save you in the long run in terms of memory costs; a rough sketch of this pattern follows the conclusion below. Spark and Storm can also run on Hadoop, so Hadoop can leverage the data as well for horizontal-type analysis.

CONCLUSION

Don't use your OLTP for aggregations, not even once. Do a bit of a brainstorm on what types of reporting you plan to deliver to your customers now and in 2 years' time; this kind of planning will save you all kinds of technical debt in the future. Is your reporting siloed to one customer's data (e.g. financial)? Or are you going to share events from your entire customer base into the reports (e.g. auction house)? This will dictate the direction you need to go in terms of picking the right tools.
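As referenced in the Spark section above, here is a minimal sketch of that streaming-aggregation pattern. It assumes events arrive as CSV lines on a socket stream (a stand-in for Kafka), and the Cassandra write is stubbed as a print, since connector setup is environment-specific; none of the names here come from a real deployment.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Micro-batches every 10 seconds; a real deployment would tune this.
sc = SparkContext(appName="auction-aggregates")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("/tmp/auction-checkpoints")  # required for stateful ops

# Assumed input: "make,model,year,price" lines on a socket (stand-in for Kafka).
lines = ssc.socketTextStream("localhost", 9999)

def parse(line):
    make, model, year, price = line.split(",")
    return ((make, model, year), (float(price), 1))

# Keep a running (sum, count) per make/model/year across batches (in memory).
def update(new_values, state):
    total, count = state or (0.0, 0)
    for s, c in new_values:
        total += s
        count += c
    return (total, count)

averages = (
    lines.map(parse)
         .updateStateByKey(update)
         .map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))
)

# In a real stack, each completed segment would be written to Cassandra here.
averages.foreachRDD(lambda rdd: print(rdd.take(10)))

ssc.start()
ssc.awaitTermination()
```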