In this post I finally get to the punch line and construct a set of approximations to log (actually log2) that cover a range of speeds and accuracies. As execution time decreases, so does the accuracy. You can choose an approximation with a tradeoff that meets your needs.

When making a fast but approximate function, the design parameter is the form of the approximating function. Is it a polynomial of degree 3? Or the ratio of two linear functions? The table below has a systematic list of possibilities: the top half of the table uses only multiplication, the bottom half uses division.

Table 1: Approximating Functions

The *rewritten* column rewrites the expression in a more efficient form. The expressions are used in the approximation procedure as follows: to get the approximation to log2(x), x is first reduced to the interval [0.75, 1.5), and then x − 1 is substituted into the expression. As I explained in Part I and Part II, this procedure gives good accuracy and avoids floating-point cancellation problems.

For each form, the minimax values of the coefficients (a, b, c, …) are determined, that is, the values that minimize the maximum relative error. The *bits* column gives the bits of accuracy, computed as −log2(ε), where ε is the maximum relative error. This value is computed using the evaluation program from my Part II post.

The *“cost”* column is the execution time. In olden days, floating-point operations were a lot more expensive than integer instructions, and they were executed sequentially. So cost could be estimated by counting the number of floating-point additions and multiplications. This is no longer true, so I estimate cost using the evaluation program. I put cost in quotes since the numbers apply only to my MacBook. The numbers are normalized so that the cost of log2f in the standard C library is 1.

*a*, *b*, *c*, and *d*. Using division, you get similar accuracy (11.6) with line 8, which has only three free parameters. Similarly, line 6 has about the same accuracy as line 10, and again line 10 has one fewer free parameter than line 6.

It’s hard to grasp the cost and accuracy numbers in a table. The scatter plot below is a visualization, with lines 2–10 each represented by a dot.

You can see one anomaly: the two blue dots that are (roughly) vertically stacked one above the other. They have similar cost but very different accuracy. One is the blue dot from line 9, with cost 0.43 and 7.5 bits of accuracy. The other is from line 3, which has about the same cost (0.42) but 8.5 bits of accuracy. So clearly, line 9 is a lousy choice. I’m not sure I would have guessed that without doing these calculations.

which means that the denominators must be roughly equal. The only way this can be true is if the coefficients of the extra term in line 9 are very large, so that it’s as if that term doesn’t exist and the approximation reduces to the quotient of two linear functions. And that is what happens: the optimal coefficients for line 9 are very large, and the resulting rational function is almost identical to the one in line 7 (whose coefficients are shown in Table 2 later in this post), which explains why the bit accuracy is the same.

In the next plot I remove line 9, and add lines showing the trend.

The lines show that formulas using division outperform the multiplication-only formulas, and that the gain gets greater as the formulas become more complex (more costly to execute).

You might wonder: if one division is good, are two divisions even better? Division adds new power because a formula using division can’t be rewritten using just multiplication and addition. But a formula with two divisions can be rewritten to have only a single division, for example

a + b/(x + c + d/(x + e)) = a + b(x + e)/(x^2 + (c + e)x + ce + d).

Two divisions add no new functionality, but could be more efficient. In the example above, a division is traded for two multiplications. In fact, using two divisions gives an alternative way to write line 10 of Table 1. On my Mac, that’s a bad tradeoff: the execution time of the form with 2 divisions increases by 40%.
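The rewriting can be checked numerically. Below is a small sketch with made-up coefficients: `two_div` is a continued fraction with two divisions, and `one_div` is the same rational function multiplied out so that a single division remains (neither is from the post).

```c
#include <assert.h>
#include <math.h>

// Continued fraction: two divisions, no multiplications.
double two_div(double x) {
    return 1.0 + 2.0/(x + 3.0 + 4.0/(x + 5.0));
}

// The same function with the inner fraction multiplied out:
// x + 3 + 4/(x+5) = (x^2 + 8x + 19)/(x + 5), so one division remains.
double one_div(double x) {
    return 1.0 + 2.0*(x + 5.0)/(x*x + 8.0*x + 19.0);
}
```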

Up till now I’ve been completely silent about how I computed the minimax coefficients of the functions. Or to put it another way, how I computed the values of *a*, *b*, *c*, etc. in Table 1. This computation used to be done using the Remez algorithm, but now there is a simpler way: reduce it to a convex optimization problem, which in turn can be solved using (for example) CVX, a Matlab-based modeling system.

Here’s how it works for line 8. I want to find the minimax approximation to log2(x) of the form (x − 1)(a(x − 1) + b)/((x − 1) + c). As discussed in the first post of this series, it’s the relative error that should be minimized. This means solving

min over a, b, c of max over x of |((x − 1)(a(x − 1) + b)/((x − 1) + c) − log2(x)) / log2(x)|

This is equivalent to finding the smallest λ for which there are a, b, and c satisfying

|a(x − 1)^2 + b(x − 1) − log2(x)·((x − 1) + c)| ≤ λ · log2(x)·((x − 1) + c)

for all x in the reduced interval (with the right-hand side negated where log2(x) is negative, as in the code below).

For the last inequality, pick a finite set of grid points x_1, …, x_m and require it to hold at each x_i. Of course this is not exactly equivalent to it being true for all x, but it is an excellent approximation. The notation may be a little confusing, because the x_i and y_i = log2(x_i) are constants, and a, b, and c are the variables. Now all I need is a package that will report if

|a(x_i − 1)^2 + b(x_i − 1) − y_i·((x_i − 1) + c)| ≤ λ · y_i·((x_i − 1) + c),  i = 1, …, m

has a solution in a, b, c. Because then binary search can be used to find the minimal λ. Start with an l that you know has no solution and a u that is large enough to guarantee a solution. Then ask if the above has a solution for γ = (l + u)/2. If it does, replace u ← γ; otherwise l ← γ. Continue until u − l has the desired precision.
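The binary search itself is only a few lines. Here is a sketch in C, with the CVX feasibility call replaced by a stub oracle; `feasible` and its threshold of 0.125 are invented purely for illustration.

```c
#include <assert.h>
#include <math.h>

// Stand-in for the CVX feasibility problem: returns 1 if the constraint
// system has a solution for this gamma. Here the (made-up) answer is
// simply gamma >= 0.125.
static int feasible(double gamma) {
    return gamma >= 0.125;
}

// Bisect on gamma: l is known infeasible, u is known feasible.
double bisect(double l, double u, double tol) {
    while (u - l >= tol) {
        double gamma = 0.5*(l + u);
        if (feasible(gamma))
            u = gamma;   // a solution exists: tighten the upper bound
        else
            l = gamma;   // no solution: raise the lower bound
    }
    return u;            // smallest feasible gamma, to within tol
}
```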

The set of a, b, c satisfying the inequalities above is convex (this is easy to check: for fixed λ, each constraint is a pair of linear inequalities in a, b, and c), and so you can use a package like CVX for Matlab to quickly tell you if it has a solution. Below is code for computing the coefficients of line 8 in Table 1. This Matlab/CVX code is modified from http://see.stanford.edu/materials/lsocoee364a/hw6sol.pdf.

It is a peculiarity of CVX that it can report *infeasible* for a value of γ, but then report *solved* for a smaller value of γ. So when seeing *solved* I presume there is a solution; I then decrease the upper bound and also record the corresponding values of a, b, and c. I do this for each step of the binary search, and I decide which set to use by making an independent calculation of the minimax error for each recorded (a, b, c).

You might think that using a finer grid (that is, increasing m so that there are more grid points x_i) would give a better answer, but it is another peculiarity of CVX that this is not always the case. So in the independent calculation, I evaluate the minimax error on a very fine grid that is independent of the grid size given to CVX. This gives a better estimate of the error, and also lets me compare the answers I get using different values of m. Here is the CVX code:

```matlab
format long
format compact
verbose = true;
bisection_tol = 1e-6;
m = 500;
lo = 0.70;    % check values a little bit below 0.75
hi = 1.5;
xi = linspace(lo, hi, m)';
yi = log2(xi);
Xi = linspace(lo, hi, 10000);   % pick large number so you can compare different m
Xi = Xi(Xi ~= 1);
Yi = log2(Xi);
xip = xi(xi >= 1);              % those xi for which y = x-1 is positive
xin = xi(xi < 1 & xi >= 0.75);
xinn = xi(xi < 0.75);
yip = yi(xi >= 1);
yin = yi(xi < 1 & xi >= 0.75);
yinn = yi(xi < 0.75);
Xip = Xi(Xi >= 0.75);
Xin = Xi(Xi < 0.75);
l = 0;
u = 1;
k = 1;
while u - l >= bisection_tol
    gamma = (l+u)/2;
    cvx_begin   % solve the feasibility problem
        cvx_quiet(true);
        variable A; variable B; variable C;
        subject to
            abs(A*(xip - 1).^2 + B*(xip - 1) - yip .* (xip - 1 + C)) <= ...
                gamma * yip .* (xip - 1 + C)
            abs(A*(xin - 1).^2 + B*(xin - 1) - yin .* (xin - 1 + C)) <= ...
                -gamma * yin .* (xin - 1 + C)
            abs(A*(2*xinn - 1).^2 + B*(2*xinn - 1) - (1 + yinn) .* (2*xinn - 1 + C)) <= ...
                -gamma * yinn .* (2*xinn - 1 + C)
    cvx_end
    if verbose
        fprintf('l=%7.5f u=%7.5f cvx_status=%s\n', l, u, cvx_status)
    end
    if strcmp(cvx_status, 'Solved') | strcmp(cvx_status, 'Inaccurate/Solved')
        u = gamma;
        A_opt(k) = A; B_opt(k) = B; C_opt(k) = C;
        lo = (A*(2*Xin - 1).^2 + B*(2*Xin - 1)) ./ (2*Xin - 1 + C) - 1;
        hi = (A*(Xip - 1).^2 + B*(Xip - 1)) ./ (Xip - 1 + C);
        fx = [lo, hi];
        [maxRelErr(k), maxInd(k)] = max(abs((fx - Yi)./Yi));
        k = k + 1;
    else
        l = gamma;
    end
end
[lambda_opt, k] = min(maxRelErr);
A = A_opt(k)
B = B_opt(k)
C = C_opt(k)
lambda_opt
-log2(lambda_opt)
```

Here are the results of running the above code for the expressions in the first table. I don’t bother giving all the digits for line 9, since it is outperformed by line 7.

Table 2: Coefficients for Table 1

So, what’s the bottom line? If you don’t have specific speed or accuracy requirements, I recommend choosing either line 3 or line 7. Run both through the evaluation program to get the cost for your machine and choose the one with the lowest cost. On the other hand, if you have specific accuracy/speed tradeoffs, recompute the cost column of Table 1 for your machine, and pick the appropriate line. The bits column is machine independent as long as the machine uses IEEE arithmetic.

*A* = 0.1501692, *B* = 3.4226132, *C* = 5.0225057, *D* = 4.1130283, *E* = 3.4813372.

Finally, I’ll close by giving the C code for line 8 (almost a repeat of code from the first posting). This is bare code with no sanity checking on the input parameter x. I’ve marked the lines that need to be modified if you want to use it for a different approximating expression.

```c
float fastlog2(float x)  // compute log2(x) by reducing x to [0.75, 1.5)
{
    /** MODIFY THIS SECTION **/
    // (x-1)*(a*(x-1) + b)/((x-1) + c)   (line 8 of Table 2)
    const float a = 0.338953;
    const float b = 2.198599;
    const float c = 1.523692;
    #define FN fexp + signif*(a*signif + b)/(signif + c)
    /** END SECTION **/

    float signif, fexp;
    int exp;
    float lg2;
    union { float f; unsigned int i; } ux1, ux2;
    int greater;  // really a boolean
    /*
     * Assume IEEE representation, which is sgn(1):exp(8):frac(23)
     * representing (1+frac)*2^(exp-127). Call 1+frac the significand
     */

    // get exponent
    ux1.f = x;
    exp = (ux1.i & 0x7F800000) >> 23;
    // actual exponent is exp-127, will subtract 127 later

    greater = ux1.i & 0x00400000;  // true if signif > 1.5
    if (greater) {
        // signif >= 1.5 so need to divide by 2. Accomplish this by
        // stuffing exp = 126 which corresponds to an exponent of -1
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f000000;
        signif = ux2.f;
        fexp = exp - 126;  // 126 instead of 127 compensates for division by 2
        signif = signif - 1.0;
        lg2 = FN;
    } else {
        // get signif by stuffing exp = 127 which corresponds to an exponent of 0
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f800000;
        signif = ux2.f;
        fexp = exp - 127;
        signif = signif - 1.0;
        lg2 = FN;
    }
    // last two lines of each branch are common code, but optimize better
    // when duplicated, at least when using gcc
    return(lg2);
}
```

## Addendum

In response to comments, here is some sample code to compute logarithms via table lookup. A single-precision fraction has only 23 bits, so if you are willing to have a table of 2^23 floats (2^25 bytes) you can write a logarithm that is very accurate and very fast. The one thing to watch out for is floating-point cancellation, so you need to split the table into two parts (see the `log_biglookup()` code block below).

The `log_lookup()` sample uses a smaller table. It uses linear interpolation, because ordinary table lookup results in a step function. When *x* is near 0, log(1+*x*) ≈ *x* is linear, and any step-function approximation will have a very large relative error, but linear interpolation has a small relative error. In yet another variation, `log_lookup_2()` uses a second lookup table to speed up the linear interpolation.

```c
float log_lookup(float x)  // lookup using NDIGITS_LOOKUP bits of the fraction part
{
    int exp;
    float lg2, interp;
    union { float f; unsigned int i; } ux1, ux2;
    unsigned int frac, frac_rnd;
    /*
     * Assume IEEE representation, which is sgn(1):exp(8):frac(23)
     * representing (1+frac)*2^(exp-127). Call 1+frac the significand
     */

    // get exponent
    ux1.f = x;
    exp = ((ux1.i & 0x7F800000) >> 23);  // -127 done later

    // top NDIGITS
    frac = (ux1.i & 0x007FFFFF);
    frac_rnd = frac >> (23 - NDIGITS_LOOKUP);

    // for interpolating between two table values
    ux2.i = (frac & REMAIN_MASK) << NDIGITS_LOOKUP;
    ux2.i = ux2.i | 0x3f800000;
    interp = ux2.f - 1.0f;

    if (frac_rnd < LOOKUP_TBL_LN/2) {
        lg2 = tbl_lookup_lo[frac_rnd] +
            interp*(tbl_lookup_lo[frac_rnd+1] - tbl_lookup_lo[frac_rnd]);
        return(lg2 + (exp - 127));
    } else {
        lg2 = tbl_lookup_hi[frac_rnd] +
            interp*(tbl_lookup_hi[frac_rnd+1] - tbl_lookup_hi[frac_rnd]);
        return(-lg2 + (exp - 126));
    }
}

static float log_lookup_2(float x)  // use a second table, tbl_interp[]
{
    int exp;
    float lg2;
    union { float f; unsigned int i; } ux1;
    unsigned int frac, frac_rnd, ind;
    /*
     * Assume IEEE representation, which is sgn(1):exp(8):frac(23)
     * representing (1+frac)*2^(exp-127). Call 1+frac the significand
     */

    // get exponent
    ux1.f = x;
    exp = ((ux1.i & 0x7F800000) >> 23);  // -127 done later

    // top NDIGITS
    frac = (ux1.i & 0x007FFFFF);
    frac_rnd = frac >> (23 - NDIGITS_LOOKUP_2);

    // for interpolating between two table values
    ind = frac & REMAIN_MASK_2;  // interp = tbl_interp[ind]

    if (frac_rnd < LOOKUP_TBL_LN_2/2) {
        lg2 = tbl_lookup_lo[frac_rnd] +
            tbl_interp[ind]*(tbl_lookup_lo[frac_rnd+1] - tbl_lookup_lo[frac_rnd]);
        return(lg2 + (exp - 127));
    } else {
        lg2 = tbl_lookup_hi[frac_rnd] +
            tbl_interp[ind]*(tbl_lookup_hi[frac_rnd+1] - tbl_lookup_hi[frac_rnd]);
        return(-lg2 + (exp - 126));
    }
}

static float log_biglookup(float x)  // full lookup table with 2^23 entries
{
    int exp;
    float lg2;
    union { float f; unsigned int i; } ux1;
    unsigned int frac;
    /*
     * Assume IEEE representation, which is sgn(1):exp(8):frac(23)
     * representing (1+frac)*2^(exp-127). Call 1+frac the significand
     */

    // get exponent
    ux1.f = x;
    exp = ((ux1.i & 0x7F800000) >> 23);  // -127 done later

    frac = (ux1.i & 0x007FFFFF);
    if (frac < TWO_23/2) {
        lg2 = tbl_lookup_big_lo[frac];
        return(lg2 + (exp - 127));
    } else {
        lg2 = tbl_lookup_big_hi[frac - TWO_23/2];
        return(-lg2 + (exp - 126));
    }
}

#define NDIGITS_LOOKUP 14
#define LOOKUP_TBL_LN  16384   // 2^NDIGITS
#define REMAIN_MASK    0x1FF   // mask with REMAIN bits, where REMAIN = 23 - NDIGITS

static float *tbl_lookup_lo;
static float *tbl_lookup_hi;

static void init_lookup()
{
    int i;
    tbl_lookup_lo = (float *)malloc((LOOKUP_TBL_LN/2 + 1)*sizeof(float));
    tbl_lookup_hi = (float *)malloc((LOOKUP_TBL_LN + 1)*sizeof(float));
    tbl_lookup_hi = tbl_lookup_hi - LOOKUP_TBL_LN/2;
    for (i = 0; i <= LOOKUP_TBL_LN/2; i++)               // <= not <
        tbl_lookup_lo[i] = log2f(1 + i/(float)LOOKUP_TBL_LN);
    for (i = LOOKUP_TBL_LN/2; i < LOOKUP_TBL_LN; i++)    // log2, not log2f
        tbl_lookup_hi[i] = 1.0 - log2(1 + i/(float)LOOKUP_TBL_LN);
    tbl_lookup_hi[LOOKUP_TBL_LN] = 0.0f;
}

/* two tables */
#define NDIGITS_LOOKUP_2 12
#define LOOKUP_TBL_LN_2  4096   // 2^NDIGITS
#define TWO_REMAIN_2     2048   // 2^REMAIN, where REMAIN = 23 - NDIGITS
#define REMAIN_MASK_2    0x7FF  // mask with REMAIN bits

static float *tbl_interp;

static void init_lookup_2()
{
    int i;
    tbl_lookup_lo = (float *)malloc((LOOKUP_TBL_LN_2/2 + 1)*sizeof(float));
    tbl_lookup_hi = (float *)malloc((LOOKUP_TBL_LN_2/2 + 1)*sizeof(float));
    tbl_lookup_hi = tbl_lookup_hi - LOOKUP_TBL_LN_2/2;
    tbl_interp = (float *)malloc(TWO_REMAIN_2*sizeof(float));
    // lookup
    for (i = 0; i <= LOOKUP_TBL_LN_2/2; i++)
        tbl_lookup_lo[i] = log2f(1 + i/(float)LOOKUP_TBL_LN_2);
    for (i = LOOKUP_TBL_LN_2/2; i < LOOKUP_TBL_LN_2; i++)  // log2, not log2f
        tbl_lookup_hi[i] = 1.0 - log2(1 + i/(float)LOOKUP_TBL_LN_2);
    tbl_lookup_hi[LOOKUP_TBL_LN_2] = 0.0f;
    for (i = 0; i < TWO_REMAIN_2; i++)
        tbl_interp[i] = i/(float)TWO_REMAIN_2;
}

#define TWO_23 8388608  // 2^23

static float *tbl_lookup_big_lo;
static float *tbl_lookup_big_hi;

static void init_biglookup()
{
    int i;
    tbl_lookup_big_lo = (float *)malloc((TWO_23/2) * sizeof(float));
    tbl_lookup_big_hi = (float *)malloc((TWO_23/2) * sizeof(float));
    for (i = 0; i < TWO_23/2; i++) {
        tbl_lookup_big_lo[i] = log2f(1 + i/(float)TWO_23);
        // log2, not log2f
        tbl_lookup_big_hi[i] = 1.0 - log2(1 + (i + TWO_23/2)/(float)TWO_23);
    }
}
```


**opcvx**: Hello.

Your code is missing a `signif = signif - 1.0;` here:

```c
} else {
    // get signif by stuffing exp = 127 which corresponds to an exponent of 0
    ux2.i = (ux1.i & 0x007FFFFF) | 0x3f800000;
    signif = ux2.f;
    fexp = exp - 127;
                       // <---- missing line goes here
    lg2 = FN;
}
```

**David Goldberg** (post author): It’s been corrected. Thanks for catching this error.

**Kevin Hall**: This is an interesting read. I have a good background in math and physics and have previously done a lot of similar numerical work. The solution I immediately jumped to isn’t too different from yours in principle (quadratic fitting of a curve over a range, reducing the floating-point number to fit in that range, and adjusting at the end). The thing I realized, though, is that fitting the curve on the interval [2.0, 4.0) is a bit easier and doesn’t require the branching. Here’s my very short solution:

```cpp
/* Variant A: Minimize the maximum difference from the true values over the interval */
float fastlog2_a(float f)
{
    unsigned* up = (unsigned*)&f;
    signed exp = signed(*up >> 23) - 128;
    *up &= 0x7FFFFF;
    *up |= 0x40000000;
    f = (-0.6392553165f) + f*(0.9888857503f + f*(-0.0827401982f));
    return f + (float)exp;
}
```

According to my tests using MSVC 2010 in release mode optimizing for speed, it’s not only more accurate for most values (and leads to a smaller maximum error), but over twice as fast. I didn’t spend time trying to figure out if changing the structure of the functions could improve performance at all. This was just my first attempt. It’s possible that some performance could be gained by using a union instead of pointers or rearranging the quadratic formula.

If mathematically I missed some important detail or edge case, I’d be very interested in knowing what I missed.

You can massage the values for different purposes. The next function is identical in structure, but the coefficients are chosen to minimize the squared difference between returned values and true values.

```cpp
/* Variant B: Minimize the squared difference between returned values
   and true values over the interval */
float fastlog2_b(float f)
{
    unsigned* up = (unsigned*)&f;
    signed exp = signed(*up >> 23) - 128;
    *up &= 0x7FFFFF;
    *up |= 0x40000000;
    f = (-0.6493571041f) + f*(0.9976358628f + f*(-0.0842411241f));
    return f + (float)exp;
}
```

**Kevin Hall**: Sorry, I was comparing to the function in Part I. I’ll have to compare to the function in Part III.

**David Goldberg** (post author): My goal is to write a fast log2 library routine with a spec. The spec gives a number N, and guarantees the answer is always accurate to N correct bits. So using (for example) the approximation in line 2 of the table, N = 5.5 and my routine will always have 5.5 bits correct. Your routine varies in its accuracy. For example, when x = 1.001 it produces 0.0088, whereas the correct answer is 0.0014. For this value of x no bits are correct.

Naturally, a routine satisfying a spec has a speed penalty. I think my implementation of fastlog2 shows that you can satisfy a spec and still have a big performance gain over the built-in full-precision library log routine.

**Kevin Hall**: Yeah, sorry. I was comparing against code that I saw on the blog at one point (but can’t find now).

I did try to improve on your final function a bit. I will post a different reply on that.

Good work. Interesting read. Good exercise for me trying to do this stuff again (keeps me in practice).

**Kevin Hall**: So, I wanted to see how I could improve your final implementation. There are two things I tried:

(1) Reduce Minimax Error: I noticed the constant coefficients didn’t exhaust the precision of single-precision floating-point numbers, so I tried adding precision to see how much that could improve things. To my surprise, your results from Matlab were not optimal even given the precision listed (only slightly, though, and the best coefficients I could find were only slightly different). [0.75, 1.5) remained the optimal range. The optimal coefficients I found were: a = 0.3387930000, b = 2.198637307, c = 1.523692000. This reduced the minimax error from 0.000129699707 to 0.000114440918, a 12% reduction.

(2) Avoid the Branch: I thought branching might be expensive. Using the fact that log(ab) = log(a)+log(b), I figured after reducing the argument to [1,2), the number can be multiplied by .75 then the final answer can have log2(4/3) added back in. Using VS2010 in release mode optimized for speed, the code listed below appears to run in about 2/3 the time of the code you listed. Standard disclaimer: your mileage may vary depending on CPU, compiler, etc…. (Using a union instead of pointers may also change performance).

In the end, I do believe that my function maintains the same properties as yours does mathematically. (Again, please correct me if I missed something.) I just wanted to share back my experiments and what I learned with the community. I appreciate after all that you shared with the community. Thanks again.

Anyway, here’s my final function:

```cpp
float myfastlog2_withdiv(float x)
{
    unsigned* up = (unsigned*)&x;
    signed exp = signed(*up >> 23) - 127;
    *up &= 0x7FFFFF;
    *up |= 0x3F800000;
    x *= .75f;
    x -= 1.0f;
    x = x*(0.3387930000f * x + 2.198637307f)/(x + 1.523692000f);
    return x + (float)exp + 0.4150374998f;
}
```

**David Goldberg** (post author): Your new routine still doesn’t consistently deliver 7 correct bits. For example, myfastlog2_withdiv(0.9999) = -0.000030 = -3e-5, but the correct answer is -0.000144 = -14e-5. After adjusting the exponents to be equal, it returns -3 instead of -14.

**Kevin Hall**: Ahh… I see. Another thing I didn’t catch. You’re minimizing the ratio of error to actual value, whereas I was trying to minimize actual error. Different goals, and probably useful in different applications too. I also see why you require the branching now. The fitting polynomial is closest to the actual value at the .25 and .75 offsets from the beginning of the domain, so you want the point x = 1.0 to be close to one of these points.

Cool work. It’s been a learning experience. Thanks.

**Mark Troll**: Is it possible to use the representation of a floating-point number in the computer to extract an approximation of the log2 of that number?

**David Goldberg** (post author): I’ll interpret your question as asking whether you could extract bits and use them to do a table lookup. You can, and I’ve added an addendum to this post with some sample code.