## Posts Tagged ‘**Data Compression**’

## Decoding State

In a previous article covering the concept of using decision trees for Robocode I posed the question:

Finally, how do we best encode the information regarding the state of the game itself? For example, does it make sense to store the distance between the two tanks even though the same information could be determined via enough splits? (I say yes, if just to speed things up) If so, what other attributes would be relevant?

I’m still not certain of the answer, but I’ve got some additional insight to further clarify the problem.

So we know that given a table of the state of the game encoded in any form (ie: x and y positions of both bots, hit points, and size of the battlefield) we could, given enough data points, build out a tree that very well approximates the correct firing angles for every situation. This make encoding something like “distance to target” technically redundant. However, the distance to the target seems to be a very relevant attribute to capture- something clearly not easily captured via the x and y positions of two bots. I’m not sure how it hit me, but that uncapturable essence of distance is unique strictly because it describes the *relationship* between the two variables.

That said, we now run into a problem- there are an infinite number of relationships between any two variables. Distance, obviously, is the relationship defined by (x -a) ** 2 + (y – b) ** 2 = c. But countless other relationships exist- for example any of the lines defined by any polynomials (ie: a * x ** 2 + b * x + c = y) or another example– any of the lines defined by exponential relations (ie: y = a * n ** x + b). These relationships suddenly become impossible to handle and were the main reason I decided to investigate other options when I realized multiple regressions suffer the same fault. (Remember, that what I’ve just described is the relationship between only two dimensions. With each added dimension, the simple relationships grow at a rate of O(n!).)

Fortunately, it may very well be the case that it’s not TOO important that we capture all of them. Take for example the concept of curve fitting. While for any n points, a polynomial of degree (n – 1) is guaranteed to exist, such polynomials are often so ridiculous in their fit that they are not useful for analyzing data. Thus, many statisticians will deliberately use only simple models to analyze data. It’s more or less the same with us- if we capture only relationships we expect to be useful, we’ll probably be good enough for most cases (especially given the size of the tanks and the relatively small size of the battlefield).

All told, while I’m still interested in figuring out definitively what is the right way to handle decoding the data, I think for now just capturing a few relationships that describe circles, straight lines, and hyperbolas, a very reasonable tank could be constructed.

## Robocode, Decision Trees (Dynamic Segmentation), and Information Science

So some time ago, I discovered that when I was talking about what is known to the Robocode community as “Dynamic Segmentation“, what I actually was talking about is known to the machine learning community as a decision tree. To those who are unfamiliar, a decision tree is a tool used to classify information by identifying relevant clusters. That’s still a little vague, so I’ll explain in the context of trying to compress an image.

Let’s imagine that we have an image of black and white pixels that we’re trying to compress. To start, we first translate everything into a bit matrix. Now things get interesting. To compress the image optimally, we need to recursively divide the image along the x or y axis until each bit is in a box of identical numbers. There are several ways to do this, the most obvious being to perform a depth first search of all possible combinations of splits that satisfy the end condition until the tree with the least splits is found. That’s a terrible solution though and fortunately information science offers some help.

Rather than doing a recursive depth first search, a better solution is to calculate the entropy of the original image and for all possible splits calculate the reduction in entropy and then selecting the one that reduces entropy the most. (For those unfamiliar, entropy is more or less the misclassification rate– not exactly so, but they’re closely related.) Doing so repeatedly ought to give us not necessarily the best solution, but one that is good enough.

The resulting tree structure used to compress the image is called a decision tree. The reason being that we now have a series of rules that can tell us what color a pixel ought to be at any x-y coordinate, not so different than a min-heap really. The code for a more general tree is below.

class DecisionTree(): def __init__(self, table, target_column, fitness_func): self.table = table self.target_component = target_column self.fitness_func = fitness_func self.tree = None def build(self): '''Build the decision tree.''' target_values = self.table.select(self.target_column) if not self.table: return None elif len(set(target_values)) == 1: return target_values[0] else: splitting_column = self.choose_column() if splitting_column is None: return sample_analysis.mode(target_values) #Could be average or median for continuous data else: self.tree = {'splitting_column': splitting_column} #Could be a problem on big data splits = {} for row in self.table.get_rows(): try: splits[row[splitting_column]].append(row) except: splits[row[splitting_column]] = [row] for split in splits: subtree = DecisionTree(Table(splits[split]), self.target_column, self.fitness_func) self.tree[split] = subtree def choose_column(self, significance=0.0): '''Get the attribute with the highest information gain.''' best_gain = 0.0 best_column = None for column in self.table.columns: gain = fitness_func(vectors, attribute, target_attribute) if gain > best_gain: best_gain = gain best_column = column if best_gain > significance: #Chosen for significance return best_column else: return None

tldr: A decision tree algorithm takes a table of data, a target column, and a fitness function (used to figure out how to split the data) and constructs an n-dimensional tree which can be used to explain the data.

Now, back to Robocode- The decision tree is an incredibly formidable tool. Not only can it be used to compress and explain data, but it can be and is used for forecasting. Since the algorithm identifies ultimately identifies clusters, then we can reasonably expect future data to simply fit in to our rules. So simply load it up with a table of wave data, identify the correct firing angle as the column to split on, select a reasonable fitness function (these get complex) and let the code figure out the clusters. Then just before firing, apply the rules you’ve discovered to find the relevant cluster, and fire!

All said, there are a few interesting technical problems that arise that I want to raise, partly because I don’t yet know the answer, and just for you, the reader, to consider before getting to coding.

- First off, what fitness functions do we use to categorize the data at each split? It gets tricky when you start mixing in categorical data with numerical data. Even barring that, fitness functions often can be biased.
- A tree is a static data structure. Therefore, if it can be determined that the tree is not performing well, the entire tree needs to be recalculated (in opposition to what I had believed earlier about simply splitting leaves). While this isn’t so much a problem for small amounts of data, which huge amount it can be a problem. Are there ways to make the tree more flexible so that it never has to fully be recalculated?
- Done incorrectly, a tree can overclassify the data, for example, making each row id its own cluster. Even if that can be fixed (and it can), where is the correct spot to draw the line on classifying the data, taking into the account that the more splits we make, the less likely it is that we split correctly? (That is to say, at the most extreme, you can be 100% sure you correctly classified the data as ‘data’ if you make no splits.) My intuition says some genetic algorithms can figure this out, but it’s hard to say for sure.
- Finally, how do we best encode the information regarding the state of the game itself? For example, does it make sense to store the distance between the two tanks even though the same information could be determined via enough splits? (I say yes, if just to speed things up) If so, what other attributes would be relevant?

In any case, there is a LOT of research on this topic as I’ve recently discovered. (Claude Shannon is the man!) Let me know of any thoughts / answers and good luck!