Author Archives: David Goldberg

Congruent Numbers Part I

Usually my posts have a connection to eBay, but this time I’m writing about a recreational math problem that caught my attention.


One of the most famous facts of mathematics is that a triangle with sides of length 3, 4, and 5 is a right triangle, which means it can be drawn with a horizontal base and a vertical side, forming a right angle. Since the area of a triangle is one-half base times height, this triangle has area \frac{1}{2}(3 \times 4) = 6.
Is there a right triangle with area 5? Of course there is, since a right triangle with a base of length 1 and height 10 has area 5. But the third side (the hypotenuse) has length \sqrt{101}, which is not an integer.

That the side is {\scriptstyle \sqrt{101}} follows from the famous Pythagorean theorem of high-school geometry, which says that if two sides of a right triangle have lengths {\scriptstyle \alpha} and {\scriptstyle \beta}, then the third side has length {\scriptstyle \gamma} with {\scriptstyle \gamma^2 = \alpha^2 + \beta^2}. So when {\scriptstyle \alpha = 1} and {\scriptstyle \beta = 10}, the third side has length {\scriptstyle \gamma = \sqrt{101}}. The converse of the Pythagorean theorem is also true. It says that if {\scriptstyle \gamma^2 = \alpha^2 + \beta^2}, then the triangle with side lengths = {\scriptstyle \alpha, \beta, \gamma} is a right triangle. In particular, the triangle with sides of length 3, 4, and 5 is a right triangle since {\scriptstyle 3^2 + 4^2 = 5^2 = 25}.

If you require all three sides of a triangle to be integers, then there is no right triangle with area 5. A brute-force enumeration verifies this, and easily shows that 6 is the smallest, and the next few N after 6 that are the area of a right triangle are 24, 30, 54, 60. So asking if there is a right triangle with integer area N appears to be of little interest. But what if you allow rational sides? It’s possible for the height and width to be non-integer fractions, but have integer area. And in fact there is such a triangle with area N=5. The equation

    \[ \left(\frac{3}{2}\right) ^2 + \left(\frac{20}{3}\right) ^2 = \left(\frac{41}{6}\right) ^2 \]

shows that a triangle with sides \alpha = 3/2, \beta = 20/3 and \gamma = 41/6 is a right triangle. The area is \frac{1}{2} \alpha \beta = 5.

An integer N that is the area of a right triangle having all sides of rational length is called congruent. What numbers are congruent? Detecting them has something in common with detecting primes. There is a fast test to see if a number is prime or composite, but it’s a lot harder to obtain direct evidence that a number is composite by producing its factors. Similarly, there is a fast test to see if a number is congruent, but it is hard to find the side lengths \alpha, \beta, \gamma that demonstrate directly that N is congruent.

The analogy isn’t exact, because the direct test for congruent numbers isn’t guaranteed to work. More precisely, there is a proof that it works, but the proof depends on an unproven hypothesis called the Birch and Swinnerton-Dyer conjecture.

The reason it can be hard to find the sides is that they can be fractions with a lot of digits. For example, the sides of a right triangle with area N=79 are

    \[ \left(\frac{335946000}{2950969}\right)^2 + \left(\frac{233126551}{167973000}\right)^2 = \left(\frac{56434050774922081}{495683115837000}\right)^2 \]

I found tables on several web sites containing the side lengths for all congruent numbers less than 1000, but I believe they are all copies of the same table, because they all have an identical glitch at N = 559 which I will explain shortly.

The table appears on multiple web sites, but these variants are identical except for a misprint in the entry for {\scriptstyle N=199}. It has the incorrect value {\scriptstyle 27368486201} on all but the site, which has the correct value {\scriptstyle 2736848\underline{7}6201}. However, that site has a different misprint at {\scriptstyle N=31} which says that {\scriptstyle D=103020} but it should be {\scriptstyle 103\underline{3}20}.


In this posting and the next I will explain my efforts to compute the sides of a triangle with area N. I got interested in this problem because

  • I wondered how it was done.
  • I wanted to compute sides for N > 1000.
  • I wanted to check if the existing table can be improved.

I found the existing table can indeed be improved, as I now explain. The rational sides demonstrating that N is congruent are not unique. For example, 6 is the area of a triangle with sides (3,4,5) and also (120/7, 7/10, 1201/70).

Elliptic curve theory explains why there are infinitely many triangles for each area. It gives a formula that takes any set of sides and produces a new set of sides.

I’ll always try to find the example with the smallest denominators. So I prefer (3,4,5) (denominator of 1) to (120/7, 7/10, 1201/70) (denominator of 70).

After I wrote my own code to generate a table of side lengths, I discovered that the web table doesn’t always give sides with the smallest denominator. Two examples are below, where the triangle sides are \alpha, \beta, \gamma.

Rendered by


How big (how many digits) are the numerator and denominator of the sides? I used the web table to plot the number of digits needed for the denominator D of the length of the hypotenuse \gamma against the area N. The plot is below. The vertical axis is \log_{10}D which is the number of base-10 digits in the denominator D. Note that prime values of N (blue) generally require more digits than composite numbers do.

The straight line suggests that the number of digits needed for the denominator increases linearly with N. In other words, the size of D is exponential in N.


Elementary method of computing the triangles

In this section I explain in detail a search method for finding the sides for the triangle of area N that can easily find the fractions for N=79, which are

    \[ \left(\frac{335946000}{2950969}\right)^2 + \left(\frac{233126551}{167973000}\right)^2 = \left(\frac{56434050774922081}{495683115837000}\right)^2 \]

A brute-force exhaustive search for N=79 would find the fractions \alpha = 335946000/2950969 and \beta = 233126551/167973000 by trying all \alpha = a/d with integers a, d having up to k=9 digits (and similarly for \beta). That is n^4 possibilities, where n = 10^k. This search would take a long time, n^4 = 10^{36}. There are some simple tricks that make it much faster to find the fraction by brute force.

Going back to sides for the congruent number N=5,

    \[ \left(\frac{3}{2}\right) ^2 + \left(\frac{20}{3}\right) ^2 = \left(\frac{41}{6}\right) ^2 \]

Rewrite the equation to have a common denominator.

    \[ \left(\frac{9}{6}\right) ^2 + \left(\frac{40}{6}\right) ^2 = \left(\frac{41}{6}\right) ^2 \]

The numerators satisfy

    \begin{eqnarray*} 9^2 + 40^2 & = & 41^2 \\ 81 + 1600 & = & 1681 \end{eqnarray*}

A triple of such integers is called a pythagorean triple. This link between congruent numbers and Pythagorean triples holds in general, as summarized below:

Rendered by

So from rational numbers \alpha, \beta, and \gamma you get integers A and B that are Pythagorean, meaning A^2 + B^2 is a square. I’ll always assume that any factor common to all of A, B, C, and D has been removed, in other words that \gcd(A,B,C,D) = 1. Also note that if N is congruent, then multiplying the sides of its triangle by the integer k shows that k^2N is congruent. And conversely, if you have a triangle showing that k^2N is congruent, divide the sides by k to get a triangle for N. So I only need to consider square-free N.

Here are the first few square-free congruent numbers.

Rendered by

Note that in each case the denominator of \gamma is the product of the denominators of \alpha and \beta. A proof that this is always so is in the final post of this three-part series.

Pythagorean triples (A, B, C) can be written in the form

(1)   \begin{eqnarray*} A &=& 2PQ \\ B &=& P^2 - Q^2 \qquad (\mbox{with } P > Q) \nonumber \\ C &=& P^2 + Q^2 \nonumber \end{eqnarray*}

The proof of this can be found on the web and in most number theory textbooks under the name Euclid’s Formula, but I also give a proof in the final post of this series. The formulas can be rewritten as

(2)   \begin{eqnarray*} P &=& \sqrt{\frac{C+B}{2}} \\ Q &=& \sqrt{\frac{C-B}{2}} \nonumber \\ \end{eqnarray*}

So there is a 1-1 correspondence between \alpha, \beta, \gamma and P, Q. Specifically, starting with \alpha, \beta, \gamma you rewrite with a common denominator to get A, B, C and then use (2) to get P, Q. Conversely, given P and Q use (1) to compute A, B, C, and then get D using

(3)   \begin{equation*}  % \frac{AB}{2D^2} = N \qquad D = \sqrt{\frac{AB}{2N}} = \sqrt{\frac{PQ(P^2 - Q^2)}{N}} \frac{AB}{2D^2} = N \qquad D = \sqrt{\frac{AB}{2N}} \end{equation*}

Finally, reduce A/D to get \alpha = a/d, similarly for \beta = b/e.

This means that a brute-force search can be done using only two integers P and Q (with P > Q) instead of four integers a, b, d, and e. What is their size? If a, b etc. have k digits, then A, B etc. have 2k digits and P, Q have k digits, so brute force is now n^2 = 10^{2k} \approx 10^{18}. Huge but smaller than 10^{36}.

Here is a table of the first few square-free congruent numbers represented as P and Q:

Rendered by

Note that P and Q are always of opposite parity (that is, one is odd and the other even). As I’ll explain in the final post, once you remove any common factors from (A,B,C,D) then P and Q must be of opposite parity. This is the glitch in the web table I mentioned earlier. For N=559, the web table gives P=2608225, Q=4489, which are both odd. If you use them to generate (A,B,C,D) you will find they all have a common factor of 2. If you remove this factor and recompute P, Q you get P' = 1306357, Q'= 1301868 with P' < P.

A major improvement in exhaustive search can be made by noting the special form of P and Q. You can see the pattern in the following table of P and Q, where they are written in partially factored form.
Rendered by
For entries in the table, P is a square, or a square times a divisor of N, and similarly for Q. As N=30 shows, both P and Q can have divisors. Stated more formally, P = sP_0^2 and Q = tQ_0^2 where st|N. Since this is true for all numbers in the table above, it’s plausible that it’s true in general, and I’ll show that in the final post.

This property means you only need to search up to the square root of P and Q. In other words, you only need to check n = 10^{k/2} values of P and similarly for Q, so exhaustive search takes n^2 = 10^k \approx 10^9 steps when N=79. This  can be done in less than a minute, at least on my laptop.

I won’t bother to give the code, because although it easily computes the sides for N=79 it can’t get the side lengths for more challenging areas such as N=157. In the next post I’ll explain an improved algorithm using elliptic curves and give code for the SageMath system that can solve N=157.

Powered by QuickLaTeX

Peer Groups in Empirical Bayes

In a post from February, I sang the praises of Empirical Bayes, and showed how eBay uses it to judge the popularity of an item. This post discusses an important practical issue in using Empirical Bayes which I call “Peer Groups”.

(Update, August 29, 2017:  I just discovered that the book, Computer Age Statistical Inference by Efron & Hastie, also discusses peer groups in section 15.6, although they call it “relevance” rather than “peer groups.”)

First, a quick summary of the February post. The popularity of an item with multiple copies available for sale can be measured by the number sold divided by the number of times the item has been viewed, sales/impressions for short. The problem was how to interpret the ratio sales/impressions when the number of impressions is small and there might not be any sales yet. The solution was to think of the ratio as a proxy for the probability of sale (call it \pi), and use Bayes theorem to estimate \pi. Bayes theorem requires a prior probability, which I estimated using some of eBay’s voluminous sales data. This method is called Empirical Bayes because the prior is determined empirically, using the data itself.

That brings me to peer groups. When computing the prior probability that an item gets a sale, I want to base it on sales/impressions data from similar items, which I call a peer group. For example if the item is a piece of jewelry, the peer group might be all items of jewelry listed on eBay in the past month. I can get more specific. If the item is new, then the peer group might be restricted to new items. If the list price is $138, the peer group might be further restricted to items whose price was between $130 and $140, and so on.

Once you have identified a peer group and used it to estimate the prior probability, you use Bayes theorem to combine that with the observed count of sales and impressions to compute the probability of sale. This is the number you want—the probability that the next impression will result in a sale. It is called the posterior probability, to distinguish it from the prior probability.

There’s a tension in selecting the peer group. You might think that a peer group more strongly constrained to be similar to the item under consideration will result in a better prior and therefore a better estimate of the probability of a sale. But as the peer group gets smaller and smaller, the estimate of the prior based on the group becomes noisier and less reliable.

Which finally brings me to the subject of this post. In the case where the peer group is specified by a continuous variable like price, you can get the best of both worlds—a narrowly defined peer group and a lot of data (hence low noise) to estimate the prior parameters.

The idea

The idea is modeling. If the prior depends on the price p, and if there is a model for the dependence, the same data used to compute the prior can be used to find the model. Then an item of price p is assigned the prior given by the model at p, which is essentially the peer group of all items with exactly price p. Since this prior is a prediction of the model, it indirectly uses all the data, since the model depends on the entire data set.

Dependence on price

What is needed to apply Bayes theorem is not a single probability \pi, but rather a probability distribution on \pi. I assume the distribution of \pi is a Beta distribution B(\alpha, \beta), which has two parameters. Specifying the prior means providing values of \alpha and \beta.

So our idea is to see if there is a simple parametrized function that explains the dependence of \alpha(p) and \beta(p) on the price p. The beta distribution B(\alpha, \beta) has a mean of \mu = \alpha/(\alpha + \beta). As a first step, I examine the dependence of \mu (rather than \alpha and \beta) on price.


The fit to the power law \mu \propto \beta^{-0.67} is very good. The values of \alpha and \beta are noisier than \mu. But I do know one thing: sales/impressions is small, so that \mu is small, and therefore \alpha \ll \beta so \mu \approx \alpha/\beta. It follows that if \alpha and \beta fit a power law, so would \mu. Thus a power law for \alpha and \beta is consistent with the plot above.

Here are plots of \alpha(p) and \beta(p). Although somewhat noisy, their fits to power laws are reasonable. And the exponents add as expected: the exponent for \alpha is -0.32, for \beta is 0.35, and for \mu = \alpha/(\alpha + \beta) \approx \alpha/\beta is -0.32 - 0.35 = -0.67.




Once the form of the dependence of \alpha and \beta on price is known, the Empirical Bayes computations proceed as usual. Instead of having to determine two constants \alpha and \beta, I use Empirical Bayes to determine four constants c_1, c_2, c_3, and c_4, where

    \[ \alpha(p) = c_1 p^{c_2} \qquad \beta(p) = c_3 p^{c_4} \]

The details are in the February posting, so I just summarize them here. The c_i are computed using maximum likelihood as follows. The probability of seeing a sales/impressions ratio of k_i/n_i is

    \[ q_i(\alpha, \beta) = \binom{n_i}{k_i} \frac{B(\alpha + k_i, n_i + \beta - k_i)}{B(\alpha, \beta)} \]

and max likelihood maximizes the product \prod_i q_i(\alpha, \beta) or equivalently the log

    \[ l(\alpha, \beta) = \sum_i \log q_i(\alpha, \beta) \]

Instead of maximizing a function of two variables \alpha, \beta maximize

    \[ l(c_1, c_2, c_3, c_4) = \sum_i \log q_i(c_1 p_i^{c_2}, c_3 p_i^{c_4}) \]

Once you have computed c_1, c_2, c_3, c_4, then an item with k sales out of n impressions at price p has a posterior probability of (\alpha + k)/(\alpha + \beta + n) =(c_1 p^{c_2} + k)/(c_1 p^{c_2} + c_3 p^{c_4} + n).

Beta regression

When people hear about the peer group problem with a beta distribution prior, they sometimes suggest using beta regression. This suggestion turns out not to be as promising as it first seems. In this section I will dig into beta regression, but it is somewhat of a detour so feel free to skip over it.

When we first learn about linear regression, we think of points on the (x,y) plane and drawing the line that best fits them. For example the x-coordinate might be a person’s height, the y coordinate is the person’s weight, and the line shows how (on the average) weight varies with height.

A more sophisticated way to think about linear regression is that each point represents a random variable Y_i. In the example above, x_i is a height, and Y_i represents the distribution of weights for people of height x_i. The height of the line at x_i represents the mean of Y_i. If the line is y = ax + b, then Y_i has a normal distribution with mean ax_i + b.

Beta regression is a variation when Y_i has a beta distribution instead of a normal distribution. If the y_i satisfy 0 \leq y_i \leq 1 they are clearly not from a normal distribution, but might be from a beta distribution. In beta regression you assume that Y_i is distributed like B(\alpha_i, \beta_i) where the mean \mu_i = \alpha_i/(\alpha_i + \beta_i) is ax_i + b, or perhaps a function of ax_i + b. The theory of beta regression tells you how to take a set of (x_i, y_i) and compute the coefficients a and b.

But in our situation we are not given (x_i, y_i) from a beta distribution. The beta distribution is the unknown (latent) prior distribution. So it’s not obvious how to apply beta regression to get \alpha(p) and \beta(p).


Empirical Bayes is a technique that can work well for problems with large amounts of data. It uses a Bayesian approach that combines a prior and a particular item’s data to get a posterior probability for that item. For us the data is sales/impressions for the item, and the posterior is the probability that an impression for that item results in a sale.

The prior is based on a set of parameters, which for a beta distribution is \{\alpha, \beta\}. The parameters are chosen to make the best possible fit of the posterior to a subset of the data. But what subset?

There’s a tradeoff. If the subset is too large, it won’t be representative of the item under study. If it is too small, the estimates of the parameters will be very noisy.

If the subset is parametrized by a continuous variable (price in our example), you don’t need to decide how to make the tradeoff. You use the entire data set to build a model of how the parameters vary with the variable. And then when computing the posterior of an item, you use the parameters given by the model. In our example, I use the data to compute constants c_1, \ldots c_4. If the item has price p, k sales and n impressions, then the parameters of the prior are \alpha = c_1p^{c_2} and \beta = c_3p^{c_4} and the estimated probablity of a sale (the posterior) is (c_1 p^{c_2} + k)/(c_1 p^{c_2} + c_3 p^{c_4} + n).

Powered by QuickLaTeX

Statistical Anomaly Detection

Complex systems can fail in many ways and I find it useful to divide failures into two classes. The first consists of failures that can be anticipated (e.g. disk failures), will happen again, and can be directly checked for. The second class is failures that are unexpected.  This post is about that second class.

The tool for unexpected failures is statistics, hence the name for this post. A statistical anomaly detector hunts for things that seem off and then sends an alert. The power of statistical anomalies is that they can detect novel problems.  The downside is that they must be followed up to track down the root cause.  All this might seem abstract, but I will give a concrete example below.

To avoid confusion, I find it helps to carefully define some terminology:

  • Monitored Signals:  Results of continuous scans that look for trouble.
  • Disruption: Behavior indicating that the site is not working as designed.  A human will need to investigate to find the root cause.
  • Anomaly: Unusual behavior in the monitored signals, suggesting a disruption.
  • Alert: Automatic signal that an anomaly has occurred. It is common to send an alert after detecting an anomaly.

The obstacle to building useful statistical detectors  is false positives:  an alert that doesn’t  correspond to a disruption.   If the anomaly detector has too many false positives, users will turn off alerts or direct them to a spam folder or just generally ignore them.  An ignored detector is not very useful.

In this post I will describe a statistical anomaly detector in use at eBay; my paper at the recent USENIX HotCloud 2015 workshop has more details.  The monitored signals in our search system are based on the search results from a set of reference queries that are repeatedly issued.  A statistical anomaly detector needs a way to encode the results of the monitoring as a set of numbers.  We do this by computing metrics on the results.  For each reference query, we compute about 50 metrics summarizing the items returned by the query.  Two examples of metrics are the number of items returned and the median price of the returned items.   There are 3000 reference queries and 50 metrics, therefore 150,000 numbers total.  Currently,  reference queries are issued every 4 hours, or 6 times a day, so there are 900,000 numbers per day.   In these days of terabyte databases, this is very small.  But the problem of sifting through those numbers to find anomalies, and do it with a low false-positive rate, is challenging.

I’ll outline our approach using diagrams, and refer you to my HotCloud paper for details.  First, here is a picture of the monitored signals—that is, the metrics collected:


Each (query, metric) pair is a number that can be tracked over time, resulting in a time series. That’s 150,000 times series, and it’s reasonable to expect that during every 4-hour collection window, at least one of them will appear anomalous just by chance.  So alerting on each time series is no good, because it will result in many false positives.

Our approach is to aggregate, and it starts with something very simple:  examining each time series and computing the deviation between the most recent value and what you might have expected by extrapolating previous values.  I call this the surprise, where the bigger the deviation the more the surprise. Here’s a figure  illustrating that there is a surprise for each (query, metric, time) triple.


The idea for our anomaly detector is this: at each collection time T we expect a few (query,metric, T) triples to have a large surprise. We signal an alert if an unusually high number of triples have a large surprise.   To make this idea quantitative,  fix a metric,  look at the surprise for all 3000 queries, and compute the 90th percentile of surprise at time TS90(T).


This gives a new time series in T, one for each metric.  Hypothetical time series for the first two metrics are illustrated below.


We’ve gone from 150,000 time series down to 50.  Aggregation like this is  a very useful technique in anomaly detection.

Our final anomaly detector uses a  simple test on this aggregated time series. We define an anomaly to occur when the current value of any of the 50 series is more than 3σ from the median of that series.  Here’s an example using eBay data with the metric median sale price. There is clearly an anomaly at time T=88.


For more details, see my previously cited HotCloud paper, The Importance of Features for Statistical Anomaly Detection.