## Posts Tagged ‘**statistics**’

## Decoding State

In a previous article covering the concept of using decision trees for Robocode I posed the question:

> Finally, how do we best encode the information regarding the state of the game itself? For example, does it make sense to store the distance between the two tanks even though the same information could be determined via enough splits? (I say yes, if just to speed things up) If so, what other attributes would be relevant?

I’m still not certain of the answer, but I’ve got some additional insight to further clarify the problem.

So we know that given a table of the state of the game encoded in any form (i.e. x and y positions of both bots, hit points, and size of the battlefield) we could, given enough data points, build out a tree that very well approximates the correct firing angles for every situation. This makes encoding something like “distance to target” technically redundant. However, the distance to the target seems to be a very relevant attribute to capture, yet one clearly not easily expressed via the x and y positions of the two bots alone. I’m not sure how it hit me, but that hard-to-capture essence of distance is unique strictly because it describes the *relationship* between the two variables.

That said, we now run into a problem- there are an infinite number of relationships between any two variables. Distance, obviously, is the relationship defined by (x - a) ** 2 + (y - b) ** 2 = c. But countless other relationships exist- for example, any of the curves defined by polynomials (i.e. a * x ** 2 + b * x + c = y), or any of the curves defined by exponential relations (i.e. y = a * n ** x + b). These relationships quickly become impossible to handle exhaustively, and they were the main reason I decided to investigate other options when I realized multiple regressions suffer the same fault. (Remember that what I’ve just described is the relationship between only two dimensions. With each added dimension, even the simple relationships grow at a rate of O(n!).)

Fortunately, it may very well be the case that it’s not TOO important that we capture all of them. Take, for example, the concept of curve fitting. While for any n points a polynomial of degree (n - 1) passing through all of them is guaranteed to exist, such polynomials are often so ridiculous in their fit that they are not useful for analyzing data. Thus, many statisticians will deliberately use only simple models to analyze data. It’s more or less the same with us- if we capture only the relationships we expect to be useful, we’ll probably be good enough for most cases (especially given the size of the tanks and the relatively small size of the battlefield).
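To make the idea concrete, here is a minimal sketch (the field names are made up for illustration, not from any real Robocode API) of precomputing a few relationship features so a tree can split on them directly rather than rediscovering them through many axis-aligned splits:

```python
import math

def add_derived_features(row):
    """Augment a raw state row with a few hand-picked relationship
    features (circular, linear, hyperbolic) so a decision tree can
    split on them directly."""
    dx = row["enemy_x"] - row["my_x"]
    dy = row["enemy_y"] - row["my_y"]
    derived = dict(row)
    derived["distance"] = math.hypot(dx, dy)   # circular relationship
    derived["dx_plus_dy"] = dx + dy            # linear relationship
    derived["dx_times_dy"] = dx * dy           # hyperbolic relationship
    return derived

state = {"my_x": 100.0, "my_y": 100.0, "enemy_x": 400.0, "enemy_y": 500.0}
print(add_derived_features(state)["distance"])  # 500.0
```

A single split on `distance` here stands in for the many x/y splits the tree would otherwise need to approximate a circle.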

All told, while I’m still interested in definitively figuring out the right way to handle decoding the data, I think for now, by capturing just a few relationships that describe circles, straight lines, and hyperbolas, a very reasonable tank could be constructed.

## Robocode, Decision Trees (Dynamic Segmentation), and Information Science

So some time ago, I discovered that when I was talking about what is known to the Robocode community as “Dynamic Segmentation”, what I was actually talking about is known to the machine learning community as a decision tree. To those who are unfamiliar, a decision tree is a tool used to classify information by identifying relevant clusters. That’s still a little vague, so I’ll explain in the context of trying to compress an image.

Let’s imagine that we have an image of black and white pixels that we’re trying to compress. To start, we first translate everything into a bit matrix. Now things get interesting. To compress the image optimally, we need to recursively divide the image along the x or y axis until each bit is in a box of identical numbers. There are several ways to do this, the most obvious being to perform a depth-first search of all possible combinations of splits that satisfy the end condition until the tree with the fewest splits is found. That’s a terrible solution though, and fortunately information science offers some help.

Rather than doing a recursive depth-first search, a better solution is to calculate the entropy of the original image and, for all possible splits, calculate the reduction in entropy, then select the one that reduces entropy the most. (For those unfamiliar, entropy is more or less the misclassification rate- not exactly so, but they’re closely related.) Doing so repeatedly ought to give us not necessarily the best solution, but one that is good enough.
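As a concrete sketch of the idea, here is a minimal entropy and entropy-reduction calculation for a one-dimensional strip of pixels (plain Python, not tied to any particular library):

```python
import math
from collections import Counter

def entropy(pixels):
    """Shannon entropy (in bits) of a list of 0/1 pixels."""
    n = len(pixels)
    counts = Counter(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_of_split(pixels, left, right):
    """Reduction in weighted entropy from splitting pixels into two halves."""
    n = len(pixels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(pixels) - weighted

# A 1-D strip of pixels: splitting between index 3 and 4 is perfect,
# so the gain equals the full entropy of the strip (1 bit).
strip = [0, 0, 0, 0, 1, 1, 1, 1]
print(gain_of_split(strip, strip[:4], strip[4:]))  # 1.0
```

In practice we would evaluate `gain_of_split` for every candidate split position and keep the best one, exactly as described above.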

The resulting tree structure used to compress the image is called a decision tree. The name comes from the fact that we now have a series of rules that can tell us what color a pixel ought to be at any x-y coordinate- not so different from a min-heap, really. The code for a more general tree is below.

```python
class DecisionTree():
    # Table and sample_analysis are external helpers assumed by this code.
    def __init__(self, table, target_column, fitness_func):
        self.table = table
        self.target_column = target_column
        self.fitness_func = fitness_func
        self.tree = None

    def build(self):
        '''Build the decision tree.'''
        if not self.table:
            return None
        target_values = self.table.select(self.target_column)
        if len(set(target_values)) == 1:
            return target_values[0]
        else:
            splitting_column = self.choose_column()
            if splitting_column is None:
                # Could be average or median for continuous data
                return sample_analysis.mode(target_values)
            else:
                # Could be a problem on big data
                self.tree = {'splitting_column': splitting_column}
                splits = {}
                for row in self.table.get_rows():
                    splits.setdefault(row[splitting_column], []).append(row)
                for split in splits:
                    subtree = DecisionTree(Table(splits[split]),
                                           self.target_column,
                                           self.fitness_func)
                    subtree.build()
                    self.tree[split] = subtree

    def choose_column(self, significance=0.0):
        '''Get the column with the highest information gain.'''
        best_gain = 0.0
        best_column = None
        for column in self.table.columns:
            gain = self.fitness_func(self.table, column, self.target_column)
            if gain > best_gain:
                best_gain = gain
                best_column = column
        if best_gain > significance:  # chosen for significance
            return best_column
        else:
            return None
```

tldr: A decision tree algorithm takes a table of data, a target column, and a fitness function (used to figure out how to split the data) and constructs an n-dimensional tree which can be used to explain the data.

Now, back to Robocode- The decision tree is an incredibly formidable tool. Not only can it be used to compress and explain data, but it can be and is used for forecasting. Since the algorithm ultimately identifies clusters, we can reasonably expect future data to simply fit into our rules. So simply load it up with a table of wave data, select the correct firing angle as the target column, pick a reasonable fitness function (these get complex), and let the code figure out the clusters. Then, just before firing, apply the rules you’ve discovered to find the relevant cluster, and fire!

All said, there are a few interesting technical problems that arise that I want to raise, partly because I don’t yet know the answers, and partly for you, the reader, to consider before getting to coding.

- First off, what fitness functions do we use to categorize the data at each split? It gets tricky when you start mixing in categorical data with numerical data. Even barring that, fitness functions often can be biased.
- A tree is a static data structure. Therefore, if it can be determined that the tree is not performing well, the entire tree needs to be recalculated (as opposed to what I had believed earlier about simply splitting leaves). While this isn’t so much a problem for small amounts of data, with huge amounts it can be. Are there ways to make the tree more flexible so that it never has to be fully recalculated?
- Done incorrectly, a tree can overclassify the data, for example by making each row id its own cluster. Even if that can be fixed (and it can), where is the correct spot to draw the line on classifying the data, taking into account that the more splits we make, the less likely it is that we split correctly? (That is to say, at the most extreme, you can be 100% sure you correctly classified the data as ‘data’ if you make no splits.) My intuition says some genetic algorithms can figure this out, but it’s hard to say for sure.
- Finally, how do we best encode the information regarding the state of the game itself? For example, does it make sense to store the distance between the two tanks even though the same information could be determined via enough splits? (I say yes, if just to speed things up) If so, what other attributes would be relevant?

In any case, there is a LOT of research on this topic as I’ve recently discovered. (Claude Shannon is the man!) Let me know of any thoughts / answers and good luck!

## Dynamic Segmentation

In order to properly parse the input data gathered by a bot in Robocode, it’s insufficient to statically declare a number of bins beforehand and simply insert the data into the structure.

The main problem with this approach is that the data gets subdivided far too many times to be useful. For example, if we wanted to increase accuracy, we would want to eliminate any confounding information, and therefore segment on every possible independent variable and more- the reasoning being that if we assume bots are imperfect, then our goal is to actually find their flaws. This means we want to segment on a large number of variables: distance to target, time since start of battle, bullet power, target velocity, target acceleration, target angular acceleration, target heading, target x-distance from center, target y-distance from center, target hit points, shooter velocity, shooter acceleration, shooter angular acceleration, shooter heading, shooter x-distance from center, shooter y-distance from center, shooter hit points, and possibly more. That’s a lot, and if we segment on all of it, it’s going to take a very long time to acquire meaningful data that we can act on.

Another problem is that by dividing the information, data near the divisions loses value. This can be remedied to an extent through some information blurring, but generally it’s not a trivial problem to solve. Static divisions are particularly worrisome, since there is no good way to know in advance where to divide the data.

Fortunately, there is a solution- dynamic segmentation. The idea is, rather than statically declaring the segmentation divisions, to simply declare along which axes the data can be segmented, and then build a tree structure which splits the data when it makes sense to. This is not as simple as it seems, but to illustrate the general idea: if our enemy moves in a clockwise spiral relative to our position, randomly moving toward or away from us for sizable stretches of time, then while our first few shots will all be clumped together in one node, eventually the bot should realize that by splitting the data into ‘target moving closer’ and ‘target moving away’ its shots will become more accurate. This is very cool, because the bot will generally have some pretty good guesses most of the time, and will only improve with more data. Furthermore, it reduces the need to worry about any kind of covariance, since the bot, being able to split anywhere, will automatically detect it- for example, rather than tracking distance to corners, the bot will eventually learn where the corners are (provided the enemy acts unusually in them) by first segmenting on target x-distance from center and then segmenting on target y-distance from center.

The problem now is to determine when it is appropriate to split. This is much like the problem of determining whether or not a die is loaded. Immediately, it’s apparent that small amounts of data won’t suffice. Additionally, it’s fairly clear that variance has something to do with it, since if we’re consistently hitting our target it would be rather foolish to split the data. The question is how much variance we allow.

Unfortunately, to that I’m not exactly sure. I think a chi-squared test is the solution, although from my research it seems it can get pretty involved. (Even determining whether or not a coin is biased is fairly complicated.) For now though, I just want to throw out the idea of utilizing a tree structure.
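As a rough sketch of how such a test might gate splitting, here is a minimal Pearson chi-squared check of a node’s hit counts against a uniform expectation. The critical value and the minimum-data rule of thumb are conventional choices I’m assuming, not something derived here:

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared statistic over matched observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def worth_splitting(hits_by_bin, critical_value):
    """Decide whether hit counts deviate enough from uniform to justify a
    split.  critical_value must come from a chi-squared table for
    len(hits_by_bin) - 1 degrees of freedom at the chosen significance."""
    total = sum(hits_by_bin)
    if total < 5 * len(hits_by_bin):  # rule of thumb: ~5+ expected per bin
        return False                  # too little data to conclude anything
    expected = [total / len(hits_by_bin)] * len(hits_by_bin)
    return chi_squared_statistic(hits_by_bin, expected) > critical_value

# 4 bins; 7.815 is the standard critical value for df=3 at p=0.05.
print(worth_splitting([5, 5, 5, 5], 7.815))   # False: perfectly uniform
print(worth_splitting([2, 3, 25, 2], 7.815))  # True: heavily skewed
```

The early return for small samples mirrors the point above: with little data, the honest answer is to not split yet.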

## Jellybean Jars Done Right

It’s fairly close to Easter, so I figure a few of these will be popping up in the future, and I thought I’d just write a post about how to game the system properly and guess the correct number with a good deal of precision.

So back in senior year of high school, we had a week long contest known as Spirit Week where the school was divided into two teams to compete in a series of events for no reason other than bragging rights. As you can imagine, most of the stuff was rather silly and I didn’t bother. However, one event was more suited to my tastes: The moderators filled up two decent sized jars with Jolly Ranchers and gave each of us one guess. The person who submitted the closest number to the real count would win some points for their team as well as the two jars of candy.

My friend Vinh and I decided to take up the challenge for some kicks. In order to do this, I took a bunch of 2×4 Lego bricks from my house, a small jar, and then bought some Jolly Ranchers from a nearby convenience store. (I thought the Lego bricks were a good approximation, but it turned out later I didn’t need them.) I then proceeded to fill up the jar with all the Jolly Ranchers I had, measured the height, and calculated the jar volume to Jolly Rancher ratio (measuring Jolly Rancher volume is impractical because of the weirdness of the shape and the chaotic way that they stack, so this is the only reasonable way). It came out to be that 66% of the jar was empty space! (And for those interested, a jar of 2×4 lego bricks is 50% empty space. In other words, if you don’t know what to do with all of your bricks and hate how much space they take up, you might compress them by pressing them together rather than just throwing them all into a box for a 50% reduction!)

Now typically they don’t let you measure the jar, and in that case your guess is as good as mine. (I would suppose taking a picture of the jar next to a meterstick would be useful in that case.) But in this case they let people measure. Unfortunately, one of the jars was kind of round, so a simple height x width x length was insufficient. Fortunately, my friend Vinh knew how to do curve fitting on our TI-83s, so I simply recorded a few widths at various heights and we took a few integrals to get the volume. Multiplying the volume by the space-to-Jolly-Rancher ratio gave us an approximation, and it being a team event, Vinh and I printed out a bunch of numbers for our friends to guess that gave us plenty of room to be off by.
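For the curious, here is a rough sketch of the volume calculation with invented measurements, using the trapezoidal rule in place of the TI-83 curve fit and integral; the calibration density is also made up for illustration:

```python
import math

def jar_volume(heights, widths):
    """Approximate the volume of a roughly round jar from width
    measurements taken at several heights, treating it as a solid of
    revolution: V = pi * integral of r(h)^2 dh, computed here with the
    trapezoidal rule instead of a fitted curve and exact integral."""
    radii_sq = [(w / 2.0) ** 2 for w in widths]
    volume = 0.0
    for i in range(len(heights) - 1):
        dh = heights[i + 1] - heights[i]
        volume += math.pi * (radii_sq[i] + radii_sq[i + 1]) / 2.0 * dh
    return volume

# Made-up measurements in cm for a jar that bulges in the middle.
heights = [0.0, 5.0, 10.0, 15.0, 20.0]
widths = [8.0, 10.0, 11.0, 10.0, 8.0]

# Calibration from the small test jar (invented numbers): candies per
# cm^3 of jar space, which already folds in the empty-space fraction.
density = 40 / 500.0
print(round(jar_volume(heights, widths) * density))
```

With real width measurements and a real calibration jar, this gives the same estimate the curve-fit-and-integrate approach does.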

Subsequently, one of my friends, Max, won, with a guess off by three from our original approximation. Max kept one of the jars, I got one (I offered to split it with Vinh but he didn’t care. Me either, really- it’s now preserved as a trophy in my basement.), and the team scored a bunch of points. (We still lost though, unfortunately.)

And that’s how you do it the nerd way.

## Some Segmentation Ideas

Continuing with my earlier post on Robocode, I’d like to describe some thoughts on segmentation strategies.

So the basic idea is, we want to minimize our bins at all times. In effect, this means calculating the Maximum Escape Angle, calculating the arc formed by the intersection of the target and the circle centered on our tank with radius equal to the distance to the target (that is, the orbital path our opponent could take), and then taking the ceiling of one half of the ratio (one half because a single shot will hit our opponent just as well if it hits the leftmost part of our opponent as the rightmost).

Interestingly, we can adjust the Maximum Escape Angle by adjusting the speed of our bullet. Given that, we can actually reduce or expand the number of bins as we desire- a useful ability when trying to maximize hit probability (since things are discrete, a change from N to N-1 bins can be fairly significant, assuming the probability to hit is a uniform distribution at both ranges).
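As a sketch of the bin arithmetic, using standard Robocode physics (bullet speed = 20 - 3 * firepower, top bot speed of 8 px/turn, bots 36 px wide) and the halved-ratio rule described above:

```python
import math

# Standard Robocode physics: bullet speed = 20 - 3 * firepower, a bot's
# top speed is 8 px/turn, and bots are 36 px wide.
def max_escape_angle(firepower):
    return math.asin(8.0 / (20.0 - 3.0 * firepower))

def bin_count(firepower, distance):
    """Bins needed at a given range: the escape arc (2 * MEA * distance)
    divided by the target width, then halved per the rule above, since a
    shot that grazes either edge of the target still hits."""
    escape_arc = 2.0 * max_escape_angle(firepower) * distance
    return max(1, math.ceil(escape_arc / 36.0 / 2.0))

print(bin_count(3.0, 100))  # close range: a handful of bins
print(bin_count(3.0, 600))  # long range: many more
```

Lowering firepower raises bullet speed, which shrinks the escape angle and hence the bin count- exactly the adjustment knob described above.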

Furthermore, we can supplement our segmentation data by adjusting the range at which we fight, spiraling toward and away from our opponent as necessary, in order to keep our target at optimal distances.

## Wave Surfing Explained

I’ve recently reignited my interest in Robocode but this time armed with everything I’ve learned since September.

For the uninitiated, Robocode is an educational programming game, originally written with Java in mind, where players code up algorithms to guide virtual tanks in destroying each other. It’s a unique challenge, combining fields like programming, statistics, linear algebra, and game theory all into one.

Originally intended for students, the game was quickly picked up by experienced programmers and is now rather competitive. Many players use sophisticated algorithms in order to dodge enemy bullets and make sure their own find their mark.

Anyway, let me continue to explain wave surfing- but first I need to explain segmentation.

**Segmentation**

Since in Robocode neither bot can see any of the bullets, developers have to find ways to track their enemy through alternative means. Segmentation is a targeting technique that involves finding which firing angles are most likely to hit the target by dividing the continuous range of firing angles into discrete bins, and finding which bins are most popular. To implement segmentation, one needs to track each bullet with “waves”.

A wave is basically analogous to a ripple in a pond after tossing a stone into the water. We track the bullets until the “wave” hits our opponent. At that point, we can determine what the correct firing angle was at the time of firing, and subsequently increment the value in the bin which contains the correct firing angle. We then proceed to fire at the center of the most popular bin whenever we get the chance.
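A minimal sketch of a wave in Python (the field names and structure are my own illustration, not the real Robocode API):

```python
import math

class Wave:
    """Records the firing situation, expands at bullet speed, and reports
    the firing angle that WOULD have hit once it reaches the enemy."""
    def __init__(self, origin_x, origin_y, fire_time, bullet_speed, base_angle):
        self.x, self.y = origin_x, origin_y
        self.fire_time = fire_time
        self.speed = bullet_speed
        self.base_angle = base_angle  # angle to the enemy when we fired

    def radius(self, now):
        return self.speed * (now - self.fire_time)

    def has_broken(self, now, enemy_x, enemy_y):
        """True once the wave front has passed over the enemy."""
        return self.radius(now) >= math.hypot(enemy_x - self.x,
                                              enemy_y - self.y)

    def correct_offset(self, enemy_x, enemy_y):
        """Offset from the original angle that would have hit; this is the
        value we bin when the wave breaks."""
        actual = math.atan2(enemy_y - self.y, enemy_x - self.x)
        return actual - self.base_angle

wave = Wave(0, 0, fire_time=0, bullet_speed=11.0, base_angle=0.0)
print(wave.has_broken(now=10, enemy_x=100, enemy_y=20))  # True
```

Each tick, the bot checks its live waves; when one breaks, it increments the bin containing `correct_offset` and discards the wave.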

A primitive approach to segmentation might only segment by firing angle; more sophisticated approaches first segment by distance, and then angle. (Even more sophisticated approaches segment along still more dimensions.) These improvements come at the cost of slowing down the rate of learning, but the accuracy gain is typically well worth the price.

Optimal segmentation (I think anyway) ought to reduce the number of bins down to the minimum needed to hit every possible position of the opponent. For example, an opponent at point blank range would only require one bin, since all firing angles within the escape angle would be guaranteed to hit the target. As the distance increases however, more and more bins become necessary (I believe at an exponential rate). By reducing the number of bins in this fashion, we increase learning speed, and reduce memory costs.

**Wave Surfing**

Wave Surfing is an evasive technique first developed and implemented by ABC in mid-2004. To put it succinctly, wave surfing is anti-segmentation. Every time the enemy fires a shot, we create a wave, see whether or not the bullet hit when the wave reaches our bot, and subsequently increment our own bin. In this way, we know what our opponent believes to be the most popular firing angle, and can therefore make sure to avoid it, producing a near uniform distribution at all ranges.

**Why this is optimal**

Refer back to my earlier post on the game theoretical solution to Rock-Paper-Scissors. Basically, to do anything else is to tip our opponent off to a flaw in our strategy. If you know for example that your opponent will play Rock 2/3 of the time and Paper 1/3 of the time, your optimal strategy becomes to keep playing Paper.

**Implementation Issues**

A literal implementation of the above is still susceptible to intelligent algorithms. For example, at the start of a game of Rock-Paper-Scissors, if your opponent first plays Scissors, it would be illogical to conclude he will do so again simply because Scissors is all he has played so far. This illustrates the problem of determining when the opponent is genuinely playing sub-optimally with an unequal probability distribution. Thankfully, statistics provides an answer. Using statistical methods to detect statistically significant variance will lead to a solution- typically, with a low amount of data, statisticians say nothing can be concluded, and with more data, more confidence can be placed on inferences. A simple implementation of confidence intervals ought to be sufficient.
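As a sketch of that last idea, here is a simple normal-approximation confidence interval on a single bin’s hit proportion; the z value and data thresholds are conventional choices, not derived here:

```python
import math

def bin_is_significant(bin_hits, total_hits, n_bins, z=1.96):
    """Normal-approximation confidence interval for one bin's hit
    proportion; the bin is 'significant' if the interval excludes the
    uniform proportion 1/n_bins.  z=1.96 gives ~95% confidence."""
    if total_hits == 0:
        return False
    p = bin_hits / total_hits
    margin = z * math.sqrt(p * (1 - p) / total_hits)
    uniform = 1.0 / n_bins
    return uniform < p - margin or uniform > p + margin

print(bin_is_significant(3, 9, 3))      # False: too little data to conclude
print(bin_is_significant(200, 300, 3))  # True: strong evidence of a bias
```

This is the Rock-Paper-Scissors situation in miniature: one Scissors out of one throw proves nothing, but 200 out of 300 is worth exploiting.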

## Curve Fitting using Linear Algebra

My initial interest in curve fitting came a while ago when programming tanks for Robocode, but my plans came to a screeching halt when I realized the complexity involved given my limited knowledge of calculus. However, I recently got into the concept of hacking, and subsequently found HackThisSite, which poses training puzzles to the hackers of the future. One required the player to reverse engineer an encryption algorithm (an extremely simple one at that), but the puzzle made me realize that I could probably use what I was learning in my math classes to automate the process and tackle any simple encryption algorithm. And then I realized that actually, this is how targeting algorithms are built.

Linear algebra is cool because it lets us compose functions that map an arbitrary number of arguments to a real. This is useful in a lot of cases. For one, I could map angular velocity, angular acceleration, and distance of a target to firing angles. Alternatively, I could map the value of a pot, the number of opponents, and my current winnings to an amount to bet.

Let’s start simple. Let’s say we wanted to fit a curve of degree 2 through the points (1,2), (3,4), and (5,6). To do this, we want to find the general solution to y=ax^2+bx+c where a, b, c are unknown. So let’s write our equations out.

a*1 + b*1 + c*1 = 2

a*9 + b*3 + c*1 = 4

a*25 + b*5 + c*1 = 6

Using linear algebra, we can turn all of this into a single matrix equation of the form Ax=b. In this case, A=((1,1,1), (9,3,1), (25,5,1)), x=((a), (b), (c)), and b=((2), (4), (6)). In this form, the solution unfolds very clearly: x=A^-1*b. The solution is a=0, b=1, c=1.
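For instance, the same computation in Python with NumPy (using `numpy.linalg.solve`, which is numerically safer than forming A^-1 explicitly):

```python
import numpy as np

# Fit y = a*x^2 + b*x + c through (1,2), (3,4), (5,6).
A = np.array([[1.0, 1.0, 1.0],
              [9.0, 3.0, 1.0],
              [25.0, 5.0, 1.0]])
b = np.array([2.0, 4.0, 6.0])

coeffs = np.linalg.solve(A, b)          # solves Ax = b directly
print(np.allclose(coeffs, [0, 1, 1]))   # True: a=0, b=1, c=1
```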

Now let’s get a little more complex. The next question is what if we have more rows than columns? In this case, A has no inverse. The solution: we turn to projection. For those who are unfamiliar, the details are messy, but basically, we come up with (A^t)Ax=(A^t)b. Since (A^t)A is square, we can invert it, and therefore we get x=((A^t)(A))^(-1)(A^t)b. It’s a bit messy, so I won’t present an example, but it works.

What that equation gives us is actually the least-squares approximation. So basically, if we were to take A as a column of the x’s (plus a column of ones) and b as a column vector of the y’s, and if A had a bunch of rows with some noise in the data, we would end up with a line that goes through them all pretty nicely.
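A quick sketch with NumPy and made-up noisy data, letting `numpy.linalg.lstsq` handle the normal equations from the previous paragraph:

```python
import numpy as np

# Five noisy points roughly on y = 2x + 1 (made-up data).
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix: one column for x, one column of ones for the intercept.
A = np.column_stack([xs, np.ones_like(xs)])

# lstsq solves the normal equations (A^t A)x = (A^t)b, but more stably.
(slope, intercept), residuals, rank, sv = np.linalg.lstsq(A, ys, rcond=None)
print(slope, intercept)  # roughly 2 and 1
```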

The next question is, how do we fit a plane through a set of data points- and subsequently, how do we fit an n-dimensional object through a set of data points?

The solution is surprisingly simple. Just add a column for the new input! So let’s say we wanted to fit a flat plane through some data points- well, the equation for a plane is ax + by + cz = d. Or more relevantly, ax + by + d = z (the signs don’t really matter since a, b, and d will absorb them). Now it looks like something we can handle. Simply write out each row in A to be (x, y, 1), let x be the column vector (a, b, d), and let b be the corresponding z’s. Use the equation presented above, and you’ve got your solution!

This is awesome, because now we can add as many variables as we want, and add dependencies as well. Say the enemy tanks are factoring in some amount of distance*velocity into their equations in order to trip us up a little- we simply add in a distance*velocity vector into A, another variable in x, and now we’ve got it covered.
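A sketch of that design matrix with invented wave data, where the distance*velocity interaction is just one more column as far as the math is concerned:

```python
import numpy as np

# Hypothetical wave data: firing-angle offsets as a function of the
# target's distance and velocity (all numbers invented).
distance = np.array([100.0, 200.0, 300.0, 400.0])
velocity = np.array([8.0, -4.0, 6.0, -8.0])
offset = np.array([0.5, -0.3, 0.6, -0.9])

# Each row of A: (distance, velocity, distance*velocity, 1).
A = np.column_stack([distance, velocity, distance * velocity,
                     np.ones_like(distance)])
coeffs, *_ = np.linalg.lstsq(A, offset, rcond=None)
print(np.allclose(A @ coeffs, offset))  # True: 4 points, 4 columns
```

With only as many points as columns the fit is exact; with more rows of noisy data, the same code produces the least-squares compromise.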

The only problem as I see it is the enormous scale we are looking at here. Basically, if we look to include all dependencies, the width of our matrix A is equal to n^v, where n is the degree of the polynomial that we’re trying to fit through the data, and v is the number of variables we have. Not the worst, especially considering we generally want small polynomials (since large polynomials tend to behave oddly), but with a bunch of variables, even initially small amounts of computation could become costly.

Next objective: Build a program that updates a curve with new data without recalculating the whole thing.