Fast Approximate Logarithms, Part III: The Formulas

In this post I finally get to the punch line and construct a set of approximations to \log (actually \log_2) that cover a range of speeds and accuracies. As execution time decreases, so does the accuracy. You can choose an approximation with a tradeoff that meets your needs.

When making a fast but approximate \log_2 function, the design parameter is the form of the approximating function. Is it a polynomial of degree 3? Or the ratio of two linear functions? The table below has a systematic list of possibilities: the top half of the table uses only multiplication, the bottom half uses division.

Table 1: Approximating Functions

The rewritten column gives each expression in a more efficient form. The expressions are used in the approximation procedure as follows: to get the approximation to \log_2 x, first x is reduced to the interval 0.75 \leq x < 1.5, then y = x-1 is substituted into the expression. As I explained in Part I and Part II, this procedure gives good accuracy and avoids floating-point cancellation problems.

For each form, the minimax values of a, b, \ldots are determined—that is, the values that minimize the maximum relative error. The bits column gives the bits of accuracy, computed as -\log_2(\epsilon) where \epsilon is the maximum relative error. This value is computed using the evaluation program from my Part II post, invoked as \mathit{eval\: 10\: 20\: 4194304}.

The “cost” column is the execution time. In olden days, floating-point operations were a lot more expensive than integer instructions, and they were executed sequentially. So cost could be estimated by counting the number of floating-point additions and multiplications. This is no longer true, so I estimate cost using the evaluation program. I put cost in quotes since the numbers apply only to my MacBook. The numbers are normalized so that the cost of log2f in the standard C library is 1.

The table shows that using division increases accuracy about the same amount as does adding an additional free parameter. For example, line 4 has 11.3 bits of accuracy with four free parameters a, b, c, and d. Using division, you get similar accuracy (11.6) with line 8, which has only three free parameters. Similarly, line 6 has about the same accuracy as line 10, and again line 10 has one less free parameter than line 6.

It’s hard to grasp the cost and accuracy numbers in a table. The scatter plot below is a visualization with lines 2–10 each represented by a dot.


You can see one anomaly—the two blue dots that are (roughly) vertically stacked one above the other. They have similar cost but very different accuracy. One is the blue dot from line 9 with cost .43 and 7.5 bits of accuracy. The other is from line 3 which has about the same cost (0.42) but 8.5 bits of accuracy. So clearly, line 9 is a lousy choice. I’m not sure I would have guessed that without doing these calculations.

Having seen the problem with line 9, it is easy to explain it using the identity {\scriptstyle \log(1/x) = -\log(x)}. If {\scriptstyle \log_2 x} is approximated by {\scriptstyle f(x) = (Ax - A)/(x^2 + Bx + C)} on {\scriptstyle 0.75 \leq x \leq 1.5}, then {\scriptstyle f(1/x) \approx -f(x)\:} in {\scriptstyle \:0.75 \leq x \leq 1.333\:}, and this works out to be

    \[ \textstyle \frac{Ax - A}{x^2 + Bx + C} = \frac{Ax - A}{Cx + B + 1/x} \]

which means that the denominators must be roughly equal. Multiplying each by {\scriptstyle x}, this becomes {\scriptstyle x^3 + Bx^2 + Cx \approx Cx^2 + Bx + 1}. The only way this can be true is if {\scriptstyle B} and {\scriptstyle C} are very large, so that it’s as if the {\scriptstyle x^3} term doesn’t exist and the approximation {\scriptstyle f} reduces to the quotient of two linear functions. And that is what happens. The optimal coefficients are on the order of {\scriptstyle 10^7}, which makes (now writing in terms of {\scriptstyle y = x-1}) {\scriptstyle ay/(y^2 + by + c) \,\approx\, ay/(by + c) \,=\, (a/b)y/(y + c/b)}. Plugging in the optimal values {\scriptstyle a = 1.523872 \times 10^8}, {\scriptstyle b = 5.127964 \times 10^7}, {\scriptstyle c = 1.051129 \times 10^8} gives {\scriptstyle a/b = 2.97169}, {\scriptstyle c/b = 2.049798}, which are very similar to the coefficients in line 7 shown in Table 2 later in this post. In other words, the optimal rational function in line 9 is almost identical to the one in line 7. Which explains why the bit accuracy is the same.

In the next plot I remove line 9, and add lines showing the trend.


The lines show that formulas using division outperform the multiplication-only formulas, and that the gain gets greater as the formulas become more complex (more costly to execute).

You might wonder: if one division is good, are two divisions even better? Division adds new power because a formula using division can’t be rewritten using just multiplication and addition. But a formula with two divisions can be rewritten to have only a single division, for example

    \[ \frac{ay + b}{y + c + \frac{d}{y}} = \frac{(ay + b)y}{y(y+c) + d} \]

Two divisions add no new functionality, but could be more efficient. In the example above, a division is traded for two multiplications. In fact, using two divisions gives an alternative way to write line 10 of Table 1. On my Mac, that’s a bad tradeoff: the execution time of the form with 2 divisions increases by 40%.

Up till now I’ve been completely silent about how I computed the minimax coefficients of the functions. Or to put it another way, how I computed the values of a, b, c, etc. in Table 1. This computation used to be done using the Remez algorithm, but now there is a simpler way that reduces it to solving a convex optimization problem. That in turn can be solved using (for example) CVX, a Matlab-based modeling system.

Here’s how it works for line 8. I want to find the minimax approximation to \log_2. As discussed in the first post of this series, it’s the relative error that should be minimized. This means solving

    \[ \min_{A,B,C}\; \max_{0.75 \leq x \leq 1.5} \left | \frac{\frac{A(x-1)^2 + B(x-1)}{(x-1)+C} - \log_2 x}{\log_2 x} \right | \]

This is equivalent to finding the smallest \lambda for which there are A, B, and C satisfying

    \begin{alignat*}{2} {\scriptstyle \left | \frac{\frac{A(x-1)^2 + B(x-1)}{(x-1)+C} - \log_2 x}{\log_2 x} \right |} & {\scriptstyle  \leq \lambda} &  {\scriptstyle  \mbox{\footnotesize for all } 0.75 \leq x \leq 1.5} \\ {\scriptstyle  \left | \frac{A(x-1)^2+B(x-1) }{(x-1)+C} - \log_2 x \right | } & {\scriptstyle  \leq \lambda \log_2 x \;\;\;} &  {\scriptstyle  0.75 \leq x \leq 1.5} \\ {\scriptstyle  \left | A(x-1)^2 + B(x-1) - ((x-1)+C) \log_2 x \right | } & {\scriptstyle  \leq \lambda ((x-1)+C) \log_2 x \;\;\;} &  {\scriptstyle  0.75 \leq x \leq 1.5} \\ {\scriptstyle \left | A(x_i-1)^2 + B(x_i-1) - ((x_i-1) + C) \log_2 x_i \right | } & {\scriptstyle  \leq \lambda ((x_i-1) + C) \log_2 x_i \;\;\;} &  {\scriptstyle 1 \leq i \leq m} \\ \end{alignat*}

For the last equation, pick 0.75 \leq x_1 < x_2 < \cdots < x_m \leq 1.5. Of course this is not exactly equivalent to the inequality being true for all 0.75 \leq x \leq 1.5, but it is an excellent approximation. (Note also that \log_2 x is negative for x < 1, so multiplying through by ((x-1)+C)\log_2 x flips the inequality there; the CVX code below handles the positive and negative points with separate constraints.) The notation may be a little confusing, because the x_i are constants, and A, B and C are the variables. Now all I need is a package that will report if

    \[ {\scriptstyle \left | A(x_i-1)^2 + B(x_i-1) - ((x_i-1) + C) \log_2 x_i \right | \leq \lambda ((x_i-1) + C) \log_2 x_i \;\;\; 1 \leq i \leq m } \]

has a solution in A, B, C. Because then binary search can be used to find the minimal \lambda. Start with \lambda_{\scriptsize{\text{lo}}} = 0 that you know has no solution and \lambda_{\scriptsize{\text{hi}}} = 1 that is large enough to guarantee a solution. Then ask if the above has a solution for \lambda = (\lambda_{\scriptsize{\text{lo}}} + \lambda_{\scriptsize{\text{hi}}})/2. If it does, replace \lambda_{\scriptsize{\text{hi}}} \leftarrow \lambda; otherwise, \lambda_{\scriptsize{\text{lo}}} \leftarrow \lambda. Continue until \lambda has the desired precision.

The set of (A, B, C) satisfying the equation above is convex (this is easy to check), and so you can use a package like CVX for Matlab to quickly tell you if it has a solution. Below is Matlab/CVX code for computing the coefficients of line 8 in Table 1.

It is a peculiarity of CVX that it can report \mathit{ Inaccurate/Solved} for a value of \lambda, but then report \mathit{Solved} for a smaller value of \lambda. So when seeing \mathit{ Inaccurate/Solved} I presume there is a solution, then decrease the upper bound and also record the corresponding values of A_k, B_k, and C_k. I do this for each step k of the binary search. I decide which k to use by making an independent calculation of the minimax error for each set (A_k, B_k, C_k).

You might think that using a finer grid (that is, increasing m so that there are more x_i) would give a better answer, but it is another peculiarity of CVX that this is not always the case. So in the independent calculation, I evaluate the minimax error on a very fine grid X_i that is independent of the grid size m given to CVX. This gives a better estimate of the error, and also lets me compare the answers I get using different values of m. Here is the CVX code:

format long
format compact
verbose = true;
bisection_tol = 1e-6;
m = 500;
lo = 0.70;  % check values a little bit below 0.75
hi = 1.5;
xi = linspace(lo, hi, m)';
yi = log2(xi);
Xi = linspace(lo, hi, 10000); % pick large number so you can compare different m
Xi = Xi(Xi ~= 1);
Yi = log2(Xi);
xip = xi(xi >= 1);              % those xi for which y = x-1 is positive
xin = xi(xi < 1 & xi >= 0.75);  % log2 is negative here
xinn = xi(xi < 0.75);           % handled as log2(2x) - 1
yip = yi(xi >= 1);
yin = yi(xi < 1 & xi >= 0.75);
yinn = yi(xi < 0.75);
Xip = Xi(Xi >= 0.75);
Xin = Xi(Xi < 0.75);
l = 0; u = 1; k = 1;
while u - l >= bisection_tol
    gamma = (l+u)/2;
    cvx_begin % solve the feasibility problem
        cvx_quiet(true);
        variable A; variable B; variable C;
        subject to
            abs(A*(xip - 1).^2 + B*(xip - 1) - yip .* (xip - 1 + C)) <= ...
                gamma * yip .* (xip-1 + C)
            abs(A*(xin - 1).^2 + B*(xin - 1) - yin .* (xin - 1 + C)) <= ...
                -gamma * yin .* (xin-1 + C)
            abs(A*(2*xinn - 1).^2 + B*(2*xinn - 1) - (1 + yinn) .* (2*xinn - 1 + C)) <= ...
                -gamma * yinn .* (2*xinn - 1 + C)
    cvx_end
    if verbose
        fprintf('l=%7.5f u=%7.5f cvx_status=%s\n', l, u, cvx_status)
    end
    if strcmp(cvx_status,'Solved') | strcmp(cvx_status, 'Inaccurate/Solved')
        u = gamma;
        A_opt(k) = A; B_opt(k) = B; C_opt(k) = C;
        % note: lo and hi are reused here as arrays of approximation values
        lo = (A*(2*Xin - 1).^2 + B*(2*Xin - 1)) ./ (2*Xin - 1 + C) - 1;
        hi = (A*(Xip - 1).^2 + B*(Xip - 1)) ./ (Xip - 1 + C);
        fx = [lo, hi];
        [maxRelErr(k), maxInd(k)] = max(abs( (fx - Yi)./Yi ));
        k = k + 1;
    else
        l = gamma;
    end
end
[lambda_opt, k] = min(maxRelErr);
A = A_opt(k)
B = B_opt(k)
C = C_opt(k)
lambda_opt
-log2(lambda_opt)

Here are the results of running the above code for the expressions in the first table. I don’t bother giving all the digits for line 9, since it is outperformed by line 7.

Table 2: Coefficients for Table 1

So, what’s the bottom line? If you don’t have specific speed or accuracy requirements, I recommend choosing either line 3 or line 7. Run both through the evaluation program to get the cost for your machine and choose the one with the lowest cost. On the other hand, if you have specific accuracy/speed tradeoffs, recompute the cost column of Table 1 for your machine, and pick the appropriate line. The bits column is machine independent as long as the machine uses IEEE arithmetic.

If you want a rational function with more accuracy than line 10, the next choice is cubic/quadratic, which gives 20.7 bits of accuracy. That would be {\scriptstyle \left(A(x-1)^3 + B(x-1)^2 + C(x-1)\right) / \left( (x-1)^2 + D(x-1) + E \right)} with coefficients A = 0.1501692, B = 3.4226132, C = 5.0225057, D = 4.1130283, E = 3.4813372.

Finally, I’ll close by giving the C code for line 8 (almost a repeat of code from the first posting). This is bare code with no sanity checking on the input parameter x. I’ve marked the lines that need to be modified if you want to use it for a different approximating expression.

float fastlog2(float x)  // compute log2(x) by reducing x to [0.75, 1.5)
{
    /** MODIFY THIS SECTION **/
    // (x-1)*(a*(x-1) + b)/((x-1) + c)      (line 8 of table 2)
    const float a = 0.338953;
    const float b = 2.198599;
    const float c = 1.523692;
#   define FN fexp + signif*(a*signif + b)/(signif + c)
    /** END SECTION **/

    float signif, fexp;
    int exp;
    float lg2;
    union { float f; unsigned int i; } ux1, ux2;
    int greater;  // really a boolean
    /*
     * Assume IEEE representation, which is sgn(1):exp(8):frac(23)
     * representing (1+frac)*2^(exp-127). Call 1+frac the significand
     */

    // get exponent
    ux1.f = x;
    exp = (ux1.i & 0x7F800000) >> 23;
    // actual exponent is exp-127, will subtract 127 later

    greater = ux1.i & 0x00400000;  // true if signif > 1.5
    if (greater) {
        // signif >= 1.5 so need to divide by 2.  Accomplish this by
        // stuffing exp = 126 which corresponds to an exponent of -1
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f000000;
        signif = ux2.f;
        fexp = exp - 126;  // 126 instead of 127 compensates for division by 2
        signif = signif - 1.0;
        lg2 = FN;
    } else {
        // get signif by stuffing exp = 127 which corresponds to an exponent of 0
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f800000;
        signif = ux2.f;
        fexp = exp - 127;
        signif = signif - 1.0;
        lg2 = FN;
    }
    // last two lines of each branch are common code, but optimize better
    // when duplicated, at least when using gcc
    return(lg2);
}


Fast Approximate Logarithms, Part II: Rounding Error

In Part I, I assumed all computations were done exactly. It’s time to bring in floating-point rounding error. There’s a key fact about rounding error that might seem surprising: If x \approx 1, then the computation of x-1 in floating-point has no rounding error. The surprise comes for those who’ve been taught to beware of rounding error when subtracting two similar quantities. For example, the computation in floating-point of x^2-1 might be far from the true value of x^2-1 when x \approx 1. That’s because x^2 can have rounding error, so the subtraction might cancel most of the good bits leaving mainly the incorrect (due to rounding) bits. But in x-1, neither x nor 1 have rounding error, so the computation of x-1 is exactly equal to x-1 when x \approx 1.

Here’s the explanation. Rounding error in x+y can happen in two ways. First, if x > y and x has a different exponent from y, then y will have its fractional part shifted right to make the exponents match, and so y might drop some bits. Second, even if the exponents are the same, there may be rounding error if the addition of the fractional parts has a carry-out from the high order bit. In the case of x-1, the exponents are the same if 1 \leq x < 2, and the subtraction produces no carry-out. And if 1/2 \leq x < 1, the exponents differ by only one, and the single bit shifted off is recovered because the result needs one less bit; this is Sterbenz’s lemma, which says x-y is exact whenever y/2 \leq x \leq 2y. So there is no rounding error in x-1 if 1/2 \leq x < 2.

The rule of thumb is that when approximating a function f with f(a) = 0, severe rounding error can be reduced if the approximation is written in terms of x-a. For us, \log 1 = 0 so a=1, and the rule suggests polynomials written in terms of x-1 rather than x. By the key fact above, there is no rounding error in x-1 when x \approx 1.

Let me apply that to two different forms of the quadratic polynomials used in Part I: the polynomial can be written in terms of x or x-1.

    \begin{eqnarray*} \log_2x & \approx & s_2 x^2 + s_1x + s_0 \\ \log_2x & \approx & \widehat{s}_2 (x-1)^2 + \widehat{s}_1 (x-1) + \widehat{s}_0 \end{eqnarray*}

If they are to be used on the interval [0.75, 1.5) and I want to minimize relative error, it is crucial that the polynomial be 0 when x=1, so they become

    \begin{eqnarray*} \log_2x & \approx & s_2 x^2 + s_1x + (-s_2 - s_1) \\ \log_2x & \approx & \widehat{s}_2 (x-1)^2 + \widehat{s}_1 (x-1) \end{eqnarray*}

The second equation has no constant term, so they both cost the same amount to evaluate, in that they involve the same number of additions and multiplications.

But one is much more accurate. You can see that empirically using an evaluation program (code below) that I will be using throughout to compare different approximations. I invoked the program as \mathit{eval\: 10\: 40960\: 1024} and \mathit{eval\: 10\: 10\: 4194304} and got the following:

using x    bits  5.5 at x=0.750000  2.06 nsecs nsec/bit=0.372  bits/nsec=2.69
using x-1  bits  5.5 at x=0.750000  2.10 nsecs nsec/bit=0.380  bits/nsec=2.63

SPACING IS 1/4194304 = 2^-22
using x    bits  1.7 at x=1-2.4e-07 2.08 nsecs nsec/bit=1.222  bits/nsec=0.82
using x-1  bits  5.5 at x=0.750000  2.12 nsecs nsec/bit=0.384  bits/nsec=2.61

When the approximating polynomials are evaluated at points spaced 1/1024 apart, they have similar performance. The accuracy of both is 5.5 bits, and the one using x-1 is slightly slower. But when they are evaluated at points spaced 2^{-22} apart, the polynomial using x has poor accuracy when x is slightly below 1. Specifically, the accuracy is only 1.7 bits when x \approx 1 - 2.4 \times 10^{-7}.

To see why, note that when x \approx 1, \widehat{s}_2 (x-1)^2 + \widehat{s}_1 (x-1) is summing two numbers that have rounding error but are of very different sizes, since (x-1)^2 \ll x-1, so the smaller term barely perturbs the larger one. But s_2 x^2 + s_1x + s_0 is summing two numbers of similar size, since s_0 = -s_1 - s_2 and the sum of the first two terms is about s_1 + s_2. This is the bad case of subtracting two nearby numbers (cancellation), because they both have rounding error.

I suppose it is an arguable point whether full accuracy for all x is worth a time performance hit of about 2%. I will offer this argument: you can reason about your program if you know it has (in this case) 5.5 bits of accuracy on every input. You don’t want to spend a lot of time tracking down unexpectedly low accuracy in your code that came about because you used a log library function with poor precision on a small set of inputs.

Here’s some more information on the output of the evaluation program displayed above. The first number is accuracy in bits, measured in the usual way as -\log_2 E where E is the maximum relative error. Following \mathit{x=} is the value of x where the max error E occurred. The execution time (e.g. 2.06 nsecs for the first line) is an estimate of the time it takes to do a single approximation to \log_2, including reducing the argument x to the interval [0.75, 1.5). The last two numbers are self-explanatory.

Estimating execution time is tricky. For example on my MacBook, if the argument x must be brought into the cache, it will significantly affect the timings. That’s why the evaluation program brings \mathit{xarr} and \mathit{yarr} into the cache before beginning the timing runs.

For polynomials, using x-1 has almost the same cost and better accuracy, so there is a good argument that it is superior to using x. Things are not so clear when the approximation is a rational function rather than a polynomial. For example, q(x) =\frac{a(x-1)+b}{c(x-1)+d}. Because q(1) = 0, the numerator is actually a(x-1). And because you can multiply numerator and denominator by anything (as long as it’s the same anything), it further simplifies to q(x) = \frac{a(x-1)}{(x-1)+e}. This will have no floating-point cancellation, and will have good accuracy even when x \approx 1. But there’s a rewrite of this expression that is faster:

    \[ \frac{a(x-1)}{(x-1)+e} = a - \frac{ae}{(x-1) + e} \]

Unfortunately, this brings back cancellation, because when x \approx 1 there will be cancellation between a and the fraction. Because there’s cancellation anyway, you might as well make a further performance improvement eliminating the need to compute x-1, namely

    \[ \frac{a(x-1)}{(x-1)+e} = a - \frac{ae}{x + (e-1)} \]

Both sides have a division. In addition, the left hand side has a multiplication and 2 additions. The right hand side has no multiplications and 2 additions (ae is a constant and doesn’t involve a run-time multiplication, similarly for e-1). So there is one less multiplication, which should be faster. But at the cost of a rounding error problem when x \approx 1.

using x-1  bits  7.5 at x=0.750000  2.17 nsecs nsec/bit=0.289  bits/nsec=3.46
using x    bits  7.5 at x=0.750000  2.07 nsecs nsec/bit=0.275  bits/nsec=3.64

SPACING IS 1/4194304 = 2^-22
using x-1  bits  7.5 at x=0.750000  2.24 nsecs nsec/bit=0.298  bits/nsec=3.36
using x    bits  1.4 at x=1+2.4e-07 2.09 nsecs nsec/bit=1.522  bits/nsec=0.66

As expected, the rational function that has one less multiplication (the line marked using x) is faster, but has poor accuracy when x is near 1. There’s a simple idea for a fix. When |x-1| is small, use the Taylor series, \log_2 x \approx \log_2 e \left( (x-1) - \frac{1}{2} (x-1)^2 + \cdots\right). Using (\log_2 e) (x-1) is a subtraction and a multiplication, which is most likely cheaper than a division and two additions. What is the size cutoff? The error in the Taylor series is easy to compute: it is the next term in the series, (x-1)^2/2, so the relative error is about (x-1)/2. And I want to maintain an accuracy of 7.5 bits, or 2^{-7.5} \approx 0.0055. So the cutoff is 1 \pm \delta, where \delta/2 = 0.0055 or \delta = 0.011.

On my MacBook, the most efficient way to implement |x-1| < \delta appears to be x < 1 + \delta \small{\mathsf{\;AND\;}} x > 1 - \delta. In the evaluation program, most of the x are greater than 1, so only the first of the inequalities is executed. Despite this, adding the check still has a high cost, but no more accuracy than using x-1.

using x-1  bits  7.5 at x=0.750000  2.17 nsecs nsec/bit=0.289  bits/nsec=3.46
using x    bits  7.5 at x=0.750000  2.07 nsecs nsec/bit=0.275  bits/nsec=3.64
cutoff     bits  7.5 at x=0.750000  2.58 nsecs nsec/bit=0.343  bits/nsec=2.91

SPACING IS 1/4194304 = 2^-22
using x-1  bits  7.5 at x=0.750000  2.24 nsecs nsec/bit=0.298  bits/nsec=3.36
using x    bits  1.4 at x=1+2.4e-07 2.09 nsecs nsec/bit=1.522  bits/nsec=0.66
cutoff     bits  7.5 at x=0.989000  2.60 nsecs nsec/bit=0.347  bits/nsec=2.88

In Part I of this series, I noted that testing whether x was in the range 0.75 \leq x < 1.5 can be done with a bit operation rather than a floating-point one. The same idea could be used here. Instead of using the Taylor series when |x-1| < 0.011 or 0.989 \leq x \leq 1.011, use it in a slightly smaller interval

    \begin{alignat*}{3} & 0.992 && < x && < 1.007812 \\ & 1 - 2^{-7} && < x && < 1 + 2^{-7} \end{alignat*}

The latter can be converted to bit operations on f, the fraction part of x, as follows:

    \begin{eqnarray*} 1 \leq x < 1 + 2^{-7} & \small{\mathsf{\;\; OR \;\;}} & 1 - 2^{-7} < x < 1 \\ 1 \leq x < 2 \;\small{\mathsf{\; AND \;}} f < 2^{-7} & \small{\mathsf{\;\; OR \;\;}} & \frac{1}{2} \leq x < 1 \;\small{\mathsf{\; AND \;}} 1 - 2^{-6} < 2x-1 = f\\ \mathit{\; exp \;} = 0 \;\small{\mathsf{\; AND \;}} f < 2^{-7} & \small{\mathsf{\;\; OR \;\;}} & \mathit{\; exp \;} = -1 \;\small{\mathsf{\; AND \;}} f > 1 - 2^{-6} \end{eqnarray*}

As bit operations, this is

(exp == 0 && (f & 111111100...) == 0) 
   OR (exp == -1 && (f & 11111100...) == 111111000...)

When I tested this improvement (\mathit{fcutoff} in the table below), it was faster, but still slower than using x-1, at least on my MacBook.

using x-1  bits  7.5 at x=0.750000  2.17 nsecs nsec/bit=0.289  bits/nsec=3.46
using x    bits  7.5 at x=0.750000  2.07 nsecs nsec/bit=0.275  bits/nsec=3.64
cutoff     bits  7.5 at x=0.750000  2.58 nsecs nsec/bit=0.343  bits/nsec=2.91
fcutoff    bits  7.5 at x=0.750000  2.46 nsecs nsec/bit=0.327  bits/nsec=3.06

SPACING IS 1/4194304 = 2^-22
using x-1  bits  7.5 at x=0.750000  2.24 nsecs nsec/bit=0.298  bits/nsec=3.36
using x    bits  1.4 at x=1+2.4e-07 2.09 nsecs nsec/bit=1.522  bits/nsec=0.66
cutoff     bits  7.5 at x=0.989000  2.60 nsecs nsec/bit=0.347  bits/nsec=2.88
fcutoff    bits  7.5 at x=0.750001  2.44 nsecs nsec/bit=0.325  bits/nsec=3.08

Bottom line: having special case code when x \approx 1 appears to significantly underperform computing in terms of x-1.

In the first post, I recommended reducing to [0.75, 1.5) instead of [1, 2) because you get one extra degree of freedom, which in turns gives greater accuracy. Rounding error gives another reason for preferring [0.75, 1.5). When x = 1 - \delta, reduction to [1, 2) will have cancellation problems. Recall the g function that was optimal for the interval [1, 2), g(x) = -(1/3)(x-1)^2 + (4/3)(x-1). When x = 1 - \delta, x must be multiplied by two to move into [1, 2), and then to compensate, the result is g(2x) - 1. When x \approx 1, g(2x) \approx 1, and so you get cancellation. Below are the results of running the evaluation program on g. If there was no rounding error, g would be accurate to 3.7 bits. As you get closer to 1 (\delta \rightarrow 0) the accuracy drops.

g    bits  3.5 at x=1-3.8e-06  2.05 nsecs nsec/bit=0.592  bits/nsec=1.69

g    bits  2.9 at x=1-9.5e-07  2.05 nsecs nsec/bit=0.706  bits/nsec=1.42

g    bits  2.9 at x=1-9.5e-07  2.05 nsecs nsec/bit=0.706  bits/nsec=1.42

g    bits  1.7 at x=1-2.4e-07  2.05 nsecs nsec/bit=1.203  bits/nsec=0.83

The goal of this series of posts is to show that you can create logarithm routines that are much faster than the library versions and have a minimum guaranteed accuracy for all x. To do this requires paying attention to rounding error. Summarizing what I’ve said so far, my method for minimizing rounding error problems is to reduce x to the interval [0.75, 1.5) and write the approximating expression using x-1, for example a(x-1)^2 + b(x-1). More generally, the approximating expression would be a polynomial

    \[ a_n(x-1)^n + a_{n-1}(x-1)^{n-1} + \cdots + a_1(x-1) \]

or rational function

    \[ \frac{ b_n(x-1)^n + b_{n-1}(x-1)^{n-1} + \cdots + b_1(x-1)}{ (x-1)^m + c_{m-1}(x-1)^{m-1} + \cdots + c_1(x-1) + c_0} \]

I close by giving the code for the evaluation program that was used to compare the time and accuracy of the different approximations:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h> 

/*
 *   Usage: eval [hi reps spacing]
 *       Evaluates an approximation to log2 in the interval [0.125, hi]
 *       For timing purposes, repeats the evaluation reps times.
 *       The evaluation is done on points spaced 1/spacing apart.
 */

float approx_fn(float x);   // the approximation under test, defined elsewhere
char *format(float x);
void finish(struct timeval *start, struct timeval *stop, char *str, int n,
            int repetitions, float *xarr, float *yarr, float *lg2arr);

int main(int argc, char **argv)
{
    float x;
    struct timeval start, stop;
    float lo, hi, delta;
    int i, j, n, repetitions, one_over_delta;
    double xd;
    float *xarr, *lg2arr, *yarr;

    // parameters
    lo = 0.125;
    hi = 10.0;
    one_over_delta = 4194304; // 2^22
    repetitions = 1;
    if (argc > 1) {
        hi = atof(argv[1]);
        repetitions = atoi(argv[2]);
        one_over_delta = atoi(argv[3]);
    }
    delta = 1.0/one_over_delta;

    // setup
    n = ceil((hi - lo)/delta) + 1;
    xarr = (float *)malloc(n*sizeof(float));
    yarr = (float *)malloc(n*sizeof(float));
    lg2arr = (float *)malloc(n*sizeof(float));
    i = 0;
    for (xd = lo; xd <= hi; xd += delta) {
        x = xd;
        if (x == 1.0) // relative error would be infinity
            continue;
        xarr[i] = x;
        lg2arr[i++] = log2(x);
    }
    if (i >= n)  // assert (i < n)
        fprintf(stderr, "Help!!!\n");
    n = i;

    /* cache-in xarr[i], yarr[i] */
    yarr[0] = 0.0;
    for (i = 1; i < n; i++)
        yarr[i] = xarr[i] + yarr[i-1];
    fprintf(stderr, "cache-in: %f\n\n", yarr[n-1]); // to foil optimizer

    gettimeofday(&start, 0);
    for (j = 0; j < repetitions; j++)
        for (i = 0; i < n; i++)
            yarr[i] = approx_fn(xarr[i]);
    gettimeofday(&stop, 0);
    finish(&start, &stop, "name ", n, repetitions, xarr, yarr, lg2arr);
    return 0;
}

// convert x to string, with special attention when x is near 1
char *format(float x)
{
    static char buf[64];
    float y;
    if (fabs(x - 1) > 0.0001)
        sprintf(buf, "%f", x);
    else {
        y = x - 1;
        if (y < 0)
            sprintf(buf, "1%.1e", y);
        else
            sprintf(buf, "1+%.1e", y);
    }
    return buf;
}

void finish(struct timeval *start, struct timeval *stop, char *str, int n,
            int repetitions, float *xarr, float *yarr, float *lg2arr)
{
    double elapsed; // nanosecs
    float max, rel;
    int maxi, i;
    double bits;
    elapsed = 1e9*(stop->tv_sec - start->tv_sec) + 1000.0*(stop->tv_usec - start->tv_usec);
    max = 0.0;
    maxi = 0;
    for (i = 0; i < n; i++) {
        rel = fabs( (yarr[i] - lg2arr[i])/lg2arr[i] );
        if (rel > max) {
            max = rel;
            maxi = i;
        }
    }
    bits = -log2(max);
    elapsed = elapsed/(n*repetitions);
    printf("%s bits %4.1f at x=%s  %.2f nsecs nsec/bit=%.3f  bits/nsec=%.2f\n",
           str, bits, format(xarr[maxi]), elapsed, elapsed/bits, bits/elapsed);
}


Mobile First – A Retrospective

Most engineers would agree that simply having a mobile-first mindset is not enough to build a high-quality responsive website — we also need relevant mobile-first guidelines and principles. We recently explored the challenges of scaling a mobile-oriented site to tablet and desktop. What we found was not always pretty, but the lessons learned were valuable. In this article, we will explore those lessons, in hopes that we can all improve how we build across mobile, tablet, and desktop.

Building a fully scalable website requires a strong focus on code quality. Concepts such as modularity, encapsulation, and testability become extremely important as you move across domains. Whether we are scaling up to desktop or down to mobile, we need the code to stay consistent and maintainable. Every hacked, poorly planned, or rushed piece of code we might add reduces our ability to write elegant, scalable, responsive code.

Perhaps creating a responsive app is not high on your team’s priority list right now. But one day it will be — and the conversion time frame might be very tight when that day comes.

Ideally, all you need to do is add media query CSS and everything just works. But the only way that can happen is if the code readily adapts to responsive changes.

Below are some suggestions and fixes that will make conversion to responsive easier. Some are specific to responsive design while others are general good practices.

Media queries

Yes, we all know about media queries. How hard can they be? Sprinkle some on any page and you have a responsive website, right?

Using media queries on your pages is essential; they allow you to overwrite CSS values based on screen size. This technique might sound simple, but in a larger project it can quickly get out of hand. A few major problems can get in the way of using media queries properly:

  • Colliding media queries: It is easy to make the mistake of writing media queries that overwrite each other if you do not stick to a common pattern. We recommend using the same boilerplate throughout all projects, and have created one here.
  • Setting element styles from JS: This is a tempting, but inferior, approach to building responsive websites. When an element relies on JS logic to set its width, it is unable to properly use media queries. If the JS logic is setting width as an inline property, the width cannot be overwritten in CSS without using !important. In addition, you have to now maintain an ever-growing set of JS logic.
  • Media queries not at the bottom: If your queries are not loaded last, they will not override their intended targets. Every module might have its own CSS file, and the overall ordering might not place it at the bottom, which leads us to our next point.
  • CSS namespacing for encapsulation: If you are writing a module, its CSS selectors should be properly encapsulated via namespace. We recommend prefixing class names with the module name, such as navbar-parent. Following this pattern will prevent conflicts with other modules, and will ensure that media queries at the bottom of your module’s CSS file override their intended targets.
  • Too many CSS selectors: CSS specificity rules require media queries to use the same specificity in order to override. It is easy to get carried away in LESS, which allows you to nest CSS multiple levels deep. While it can be useful to go one or two levels deep for encapsulation, usually this is unnecessarily complicating your code. We recommend favoring namespacing over nested specifiers as it is cleaner and easier to maintain.
  • Using !important to override styles: Adding !important to your styles reduces maintainability. It is better to avoid relying on !important overrides and instead use CSS namespacing to prevent sharing between modules.
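To make the "common pattern" point concrete, here is a minimal sketch of a mobile-first media query boilerplate. The breakpoint values and class name are illustrative assumptions, not the ones from our repository:

```css
/* Base (mobile-first) styles come first, unqualified by any query. */
.navbar-parent {
    width: 100%;
}

/* Media queries sit at the bottom of the file, ordered smallest to
   largest, so each later rule overrides the one before it and no
   query is skipped or contradicted. */
@media (min-width: 600px) {
    .navbar-parent {
        width: 50%;
    }
}

@media (min-width: 992px) {
    .navbar-parent {
        width: 33%;
    }
}
```

Because every rule targets the same selector at the same specificity, ordering alone decides which one wins at each screen size.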

Responsive or adaptive?

Both responsive and adaptive web design techniques contain powerful tools, but it is important to understand the differences between the two. Responsive techniques usually include media queries, fluid grids, and CSS percentage values. Adaptive techniques, on the other hand, are focused more on JavaScript logic, and the adding or removing of features based on device detection or screen size.

So, which should you use? Responsive or adaptive? The answer depends on the feature you are trying to implement. It can be tempting to jump straight into applying adaptive techniques to your feature, but in many cases it may not be required. Worse, applying adaptive techniques can quickly over-complicate your design. An example of this that we saw in many places is the use of JavaScript logic to set CSS style attributes.

Use JavaScript sparingly

When styling your UI, JavaScript should be avoided whenever possible. Dynamic sizing, for example, is better done through media queries. For most UI designs, you will be deciding on layouts based on screen size, not on device type. Conflating device detection with screen size can lead us to apply adaptive techniques where responsive ones would be superior.
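As a sketch of what this looks like in practice (the `.sidebar` class and the 767px breakpoint are hypothetical), the width logic that might otherwise live in a JS resize handler can be expressed entirely in CSS:

```css
/* Hypothetical sidebar: instead of computing its width in a JS
   resize handler and setting it inline, declare it in CSS. */
.sidebar {
    width: 25%;
}

@media (max-width: 767px) {
    .sidebar {
        width: 100%;  /* stack full-width on small screens */
    }
}
```

Because nothing is set inline from JS, these rules remain easy to override later in the stylesheet, with no `!important` required.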

Rethink any design that requires CSS attributes to change based on device detection; in almost all cases it will be better to rely on screen size alone, via media queries. So, when should we use adaptive JavaScript techniques?

When to use adaptive

Adaptive web design techniques are powerful, as they allow for selective loading of resources based on user agent or screen size. Logic that checks for desktop browsers, for example, can load high-resolution images instead of their mobile-optimized counterparts. Loading additional resources and features for larger screens can also be useful. Desktop browsers, for example, could show more functionality due to the increased screen size, browser capability, or bandwidth.

Ideally, additional resources will be lazy-loaded for their intended platforms. Lazily loading modules helps with site speed for mobile web, while still allowing for a full set of functionality for desktop and tablet web. This technique can be applied by checking the user agent on the client or server. If done on the server, only resources supported by the user’s platform should be returned. Alternatively, client-based lazy loading can use Ajax requests to load additional resources if they are supported. This effect can be achieved using client-side JavaScript, based on browser support or user agent. Client-side detection is generally preferred, as it allows feature detection based on actual browser functionality instead of potentially complicated user agent checks.
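The client-side variant of this idea can be sketched in a few lines of JavaScript. The function names, the `data-base` attribute, and the 768px breakpoint below are illustrative assumptions, not code from our repository:

```javascript
// Sketch of client-side adaptive resource selection: pick an image
// variant by viewport width, then lazy-load it.
function pickImageSrc(viewportWidth, basePath) {
  // Serve the high-resolution asset only at tablet width and above;
  // smaller screens get the mobile-optimized counterpart.
  return viewportWidth >= 768
    ? basePath + '-high-res.jpg'
    : basePath + '-mobile.jpg';
}

// In the browser, this drives the lazy load: fill in the real src
// only once we know which variant the platform should get.
function lazyLoadImage(img) {
  img.src = pickImageSrc(window.innerWidth, img.getAttribute('data-base'));
}
```

A fuller version would prefer feature detection (for example via `window.matchMedia`) over raw width or user agent checks, in line with the point above about detecting actual browser functionality.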

Simple flex grid example

A responsive flex grid doesn’t have to be complicated. In our live demo page, we show a simple implementation that creates a horizontally scrolling section of image containers. The images are centered, allowed to expand up to 100% of their container, and will maintain their original aspect ratio. In addition, the container height values are set to 100%, allowing us to adjust the height in the parent wrapper only, and keeping our media query overrides simple and easy to read.
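The pattern can be sketched roughly as follows (class names here are illustrative; see the live demo page for the actual source):

```css
/* Horizontally scrolling strip of image containers. Adjusting the
   height here is the only change needed to resize the whole strip. */
.image-strip {
    display: flex;
    overflow-x: auto;
    height: 200px;
}

.image-strip .image-container {
    display: flex;
    align-items: center;      /* center images within each cell */
    justify-content: center;
    height: 100%;             /* inherit height from the wrapper */
}

.image-strip img {
    max-width: 100%;          /* expand up to the container... */
    max-height: 100%;         /* ...while preserving aspect ratio */
}
```

Because the inner heights are all 100%, a single media query on `.image-strip` is enough to resize everything at a given breakpoint.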

The HTML and CSS source code uses the concepts mentioned above. We plan to add more boilerplate patterns; please don’t hesitate to contribute your own as well. Pull requests are welcome!

Best practices

We hope that the information above will come in handy when you are working on your next mobile-first web project. Below is a summary of what we mentioned above and other helpful tips.

On using media queries

  • Most responsive layouts can and should be built with media queries. JS manipulation of CSS (with the possible exception of adding/removing classes) should be avoided. Setting width in JS is neither as maintainable nor as dynamic as doing so in CSS.
  • Use a media query boilerplate to ensure you do not have contradicting media queries, or queries that are always skipped.
  • Put media queries at the bottom. Media queries override CSS and should be the final overrides, whether page level or module level.
  • If your regular CSS rules have many selectors, your media query CSS rules will have to as well, due to CSS specificity rules. Use as few selectors as possible when defining CSS rules.

On writing CSS

  • Use CSS classes, not CSS IDs, to avoid CSS specificity issues.
  • Use the fewest number of selectors possible to define your selector.
  • Reuse classes. If an element has the same look on different parts of the page, do not create two different classes. Make a generic class and reuse it.
  • Encapsulate your CSS selectors by using proper namespacing to prevent conflicts.

e.g., class="module-name-parent"

  • It is very rare that you need to use !important. Before you use it, ask yourself whether you can instead add another class (at the parent or the same level), and whether the rule you are trying to override has unnecessary selectors.

On writing LESS

  • Use LESS nesting only where needed. Nesting is good for organization, but it is also a recipe for CSS specificity issues.
  • Check that you do not have a CSS rule that looks like this:
    #wrapper #body-content #content #left-side #text {
        border: 1px solid #000;
    }
  • Work with the design team and define LESS variables using good names. Then, use these LESS variables everywhere possible.
  • If you are using a set of CSS rules repeatedly, make it a LESS mixin.
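The last two points together might look like this in LESS (variable and mixin names here are illustrative; the real names should come out of the conversation with the design team):

```less
// Shared design values, named with the design team.
@brand-color: #0074de;
@content-max-width: 960px;

// A repeated set of rules captured as a mixin.
.truncate-text() {
    overflow: hidden;
    text-overflow: ellipsis;
    white-space: nowrap;
}

.navbar-title {
    color: @brand-color;
    .truncate-text();
}
```

Note that the selectors stay flat: the mixin gives us reuse without any of the deep nesting warned about above.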

On adding element wrappers

  • Most DOM structures are more complex than they need to be.
  • Add a wrapper only when needed. Do not add a wrapper when proper CSS can do the same thing.
  • If you remove the wrapper and the layout does not change, you do not need it. Now, do a global search for this wrapper’s references (JS, CSS, rhtml, jsp, tag) and remove them.

On lazy loading

  • Add a placeholder to your component for lazy loading.
  • Lazy-loaded sections will start off empty, so make sure you reserve the correct amount of space for this behavior. Otherwise, you will see the page shift as modules load in.
  • Use media queries for the empty section so that it closely matches the filled size.
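A placeholder along these lines keeps the page from shifting as modules load in. The class name and heights below are hypothetical; the real values should match the filled module at each breakpoint:

```css
/* Reserve roughly the space the lazy-loaded module will occupy. */
.recommendations-placeholder {
    min-height: 320px;
}

@media (min-width: 768px) {
    .recommendations-placeholder {
        min-height: 480px;  /* the filled module is taller on desktop */
    }
}
```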

On cleaning up

  • If you have been experimenting with CSS to get a layout working, go back and remove the CSS rules that turned out to be unnecessary; many of them are probably no longer needed. Remove unnecessary wrappers as well.
