
When making a fast but approximate function, the design parameter is the form of the approximating function. Is it a polynomial of degree 3? Or the ratio of two linear functions? The table below has a systematic list of possibilities: the top half of the table uses only multiplication, the bottom half uses division.

Table 1: Approximating Functions

The *rewritten* column rewrites the expression in a more efficient form. The expressions are used in the approximation procedure as follows: to get the approximation to log2(x), x is first reduced to the interval [0.75, 1.5), and then x − 1 is substituted into the expression. As I explained in Part I and Part II, this procedure gives good accuracy and avoids floating-point cancellation problems.

For each form, the minimax values of the coefficients a, b, c, … are determined—that is, the values that minimize the maximum relative error. The *bits* column gives the bits of accuracy, computed as -log2(ε), where ε is the maximum relative error. This value is computed using the evaluation program from my Part II post.

The *“cost”* column is the execution time. In olden days, floating-point operations were a lot more expensive than integer instructions, and they were executed sequentially. So cost could be estimated by counting the number of floating-point additions and multiplications. This is no longer true, so I estimate cost using the evaluation program. I put cost in quotes since the numbers apply only to my MacBook. The numbers are normalized so that the cost of log2f in the standard C library is 1.

The table shows that using division increases accuracy about the same amount as does adding an additional free parameter. For example, line 4 has 11.3 bits of accuracy with four free parameters *a*, *b*, *c*, and *d*. Using division, you get similar accuracy (11.6) with line 8, which has only three free parameters. Similarly, line 6 has about the same accuracy as line 10, and again line 10 has one less free parameter than line 6.

It’s hard to grasp the accuracy and cost numbers in a table. The scatter plot below is a visualization with lines 2–10 each represented by a dot.

You can see one anomaly—the two blue dots that are (roughly) vertically stacked one above the other. They have similar cost but very different accuracy. One is the blue dot from line 9, with cost 0.43 and 7.5 bits of accuracy. The other is from line 3, which has about the same cost (0.42) but 8.5 bits of accuracy. So clearly, line 9 is a lousy choice. I’m not sure I would have guessed that without doing these calculations.

Having seen the problem with line 9, it is easy to explain it using the identity . If is approximated by on , then in , and this works out to be

which means that the denominators must be roughly equal. Multiplying each by , this becomes . The only way this can be true is if and are very large, so that it’s as if the term doesn’t exist and the approximation reduces to the quotient of two linear functions. And that is what happens. The optimal coefficients are on the order of , which makes (now writing in terms of ) . Plugging in the optimal values , , gives , , which are very similar to the coefficients in line 7 shown in Table 2 later in this post. In other words, the optimal rational function in line 9 is almost identical to the one in line 7, which explains why the bit accuracy is the same.

In the next plot I remove line 9, and add lines showing the trend.

The lines show that formulas using division outperform the multiplication-only formulas, and that the gain gets greater as the formulas become more complex (more costly to execute).

You might wonder: if one division is good, are two divisions even better? Division adds new power because a formula using division can’t be rewritten using just multiplication and addition. But a formula with two divisions can be rewritten to have only a single division, for example

Two divisions add no new functionality, but could be more efficient. In the example above, a division is traded for two multiplications. In fact, using two divisions gives an alternative way to write line 10 of Table 1. On my Mac, that’s a bad tradeoff: the execution time of the form with 2 divisions increases by 40%.
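The post's worked example of the rewrite did not survive formatting, so here is a generic instance (the coefficients are made up; the algebra is simply combining the two fractions over a common denominator, trading one division for extra multiplications):

```c
#include <math.h>

// Two-division form: a/(z+b) + c/(z+d)
static double two_div(double z, double a, double b, double c, double d)
{
    return a/(z + b) + c/(z + d);
}

// The same function with a single division:
//   ((a+c)*z + (a*d + c*b)) / (z^2 + (b+d)*z + b*d)
// One division has been traded for several multiplications.
static double one_div(double z, double a, double b, double c, double d)
{
    double num = (a + c)*z + (a*d + c*b);
    double den = z*z + (b + d)*z + b*d;
    return num/den;
}
```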

Up till now I’ve been completely silent about how I computed the minimax coefficients of the functions. Or to put it another way, how I computed the values of a, b, c, etc. in Table 1. This computation used to be done using the Remez algorithm, but now there is a simpler way that reduces to solving a convex optimization problem. That in turn can be solved using (for example) CVX, a Matlab-based modeling system.

Here’s how it works for line 8. I want to find the minimax approximation to log2(x). As discussed in the first post of this series, it’s the relative error that should be minimized: I want the coefficients that minimize the maximum relative error over the interval.

This is equivalent to finding the smallest γ for which there are coefficients A, B, and C satisfying

For the last equation, pick a finite grid of points x1, …, xm spanning the interval. Of course this is not exactly equivalent to the inequality being true for all x, but it is an excellent approximation. The notation may be a little confusing, because the xi are constants, and A, B, and C are the variables. Now all I need is a package that will report if

has a solution in A, B, C, because then binary search can be used to find the minimal γ. Start with an l that you know admits no solution and a u large enough to guarantee a solution. Then ask if the above has a solution for γ = (l + u)/2. If it does, replace u with γ; otherwise, replace l with γ. Continue until u − l has the desired precision.
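The binary search can be sketched as follows. Here the CVX feasibility call is replaced by a stand-in predicate; in reality `feasible` would run the CVX problem above and inspect `cvx_status`:

```c
#include <math.h>

// Stand-in for the CVX feasibility problem: in this toy example the
// problem is "feasible" exactly when gamma >= 0.005, so the search
// should converge to 0.005.
static int feasible(double gamma)
{
    return gamma >= 0.005;
}

// Bisect for the smallest feasible gamma, given l infeasible and
// u feasible.
static double min_feasible_gamma(double l, double u, double tol)
{
    while (u - l >= tol) {
        double gamma = (l + u) / 2;
        if (feasible(gamma))
            u = gamma;      // gamma works; try something smaller
        else
            l = gamma;      // gamma too small
    }
    return u;
}
```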

The set of (A, B, C) satisfying the equation above is convex (this is easy to check), and so you can use a package like CVX for Matlab to quickly tell you if it has a solution. Below is code for computing the coefficients of line 8 in Table 1. This Matlab/CVX code is modified from http://see.stanford.edu/materials/lsocoee364a/hw6sol.pdf.

It is a peculiarity of CVX that it can report infeasibility for a value of γ, but then report a solution for a smaller value of γ. So whenever CVX reports Solved, I presume there is a solution, decrease the upper bound, and also record the corresponding values of A, B, and C. I do this for each step of the binary search. I decide which (A, B, C) to use by making an independent calculation of the minimax error for each recorded set.

You might think that using a finer grid (that is, increasing m so that there are more xi) would give a better answer, but it is another peculiarity of CVX that this is not always the case. So in the independent calculation, I evaluate the minimax error on a very fine grid that is independent of the grid size given to CVX. This gives a better estimate of the error, and also lets me compare the answers I get using different values of m. Here is the CVX code:

format long
format compact
verbose = true;
bisection_tol = 1e-6;
m = 500;
lo = 0.70;    % check values a little bit below 0.75
hi = 1.5;
xi = linspace(lo, hi, m)';
yi = log2(xi);
Xi = linspace(lo, hi, 10000);  % pick large number so you can compare different m
Xi = Xi(Xi ~= 1);
Yi = log2(Xi);
% NOTE: the comparison operators in the next block were garbled in the
% original posting and have been reconstructed from how the variables
% are used below.
xip = xi(xi >= 1);               % those xi for which y = x-1 is positive
xin = xi(xi < 1 & xi >= 0.75);
xinn = xi(xi < 0.75);            % these get doubled: the 2x-1 terms below
yip = yi(xi >= 1);
yin = yi(xi < 1 & xi >= 0.75);
yinn = yi(xi < 0.75);
Xip = Xi(Xi >= 0.75);
Xin = Xi(Xi < 0.75);
l = 0; u = 0.1; k = 1;           % initial bracket (reconstructed)
while u - l >= bisection_tol
    gamma = (l+u)/2;
    cvx_begin  % solve the feasibility problem
        cvx_quiet(true);
        variable A; variable B; variable C;
        subject to
            abs(A*(xip - 1).^2 + B*(xip - 1) - yip .* (xip - 1 + C)) <= ...
                gamma * yip .* (xip - 1 + C)
            abs(A*(xin - 1).^2 + B*(xin - 1) - yin .* (xin - 1 + C)) <= ...
                -gamma * yin .* (xin - 1 + C)
            abs(A*(2*xinn - 1).^2 + B*(2*xinn - 1) - (1 + yinn) .* (2*xinn - 1 + C)) <= ...
                -gamma * yinn .* (2*xinn - 1 + C)
    cvx_end
    if verbose
        fprintf('l=%7.5f u=%7.5f cvx_status=%s\n', l, u, cvx_status)
    end
    if strcmp(cvx_status, 'Solved') | strcmp(cvx_status, 'Inaccurate/Solved')
        u = gamma;
        A_opt(k) = A; B_opt(k) = B; C_opt(k) = C;
        % independent fine-grid estimate of the minimax error
        lo = (A*(2*Xin - 1).^2 + B*(2*Xin - 1)) ./ (2*Xin - 1 + C) - 1;
        hi = (A*(Xip - 1).^2 + B*(Xip - 1)) ./ (Xip - 1 + C);
        fx = [lo, hi];
        [maxRelErr(k), maxInd(k)] = max(abs((fx - Yi) ./ Yi));
        k = k + 1;
    else
        l = gamma;
    end
end
[lambda_opt, k] = min(maxRelErr);
A = A_opt(k)
B = B_opt(k)
C = C_opt(k)
lambda_opt
-log2(lambda_opt)

Here are the results of running the above code for the expressions in the first table. I don’t bother giving all the digits for line 9, since it is outperformed by line 7.

Table 2: Coefficients for Table 1

So, what’s the bottom line? If you don’t have specific speed or accuracy requirements, I recommend choosing either line 3 or line 7. Run both through the evaluation program to get the cost for your machine and choose the one with the lowest cost. On the other hand, if you have specific accuracy/speed tradeoffs, recompute the cost column of Table 1 for your machine, and pick the appropriate line. The bits column is machine independent as long as the machine uses IEEE arithmetic.

If you want a rational function with more accuracy than line 10, the next choice is a cubic over a quadratic, which gives 20.7 bits of accuracy, with coefficients *A* = 0.1501692, *B* = 3.4226132, *C* = 5.0225057, *D* = 4.1130283, *E* = 3.4813372.
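The exact cubic/quadratic form did not survive formatting, but the natural one, consistent with the pattern of line 8, is z(Az² + Bz + C)/(z² + Dz + E) with z = x − 1. (Supporting evidence for this assumption: C/E ≈ 1.4427 ≈ log2(e), which is exactly what is needed for small relative error as x → 1.) A sketch:

```c
#include <math.h>

// Assumed cubic-over-quadratic form for x in [0.75, 1.5), z = x - 1:
//     log2(x) ~ z*(A*z^2 + B*z + C) / (z^2 + D*z + E)
// The factor of z in the numerator forces the approximation to 0 at
// x = 1, which is required to keep the relative error small there.
static double cubic_quadratic(double x)
{
    const double A = 0.1501692, B = 3.4226132, C = 5.0225057;
    const double D = 4.1130283, E = 3.4813372;
    double z = x - 1.0;
    return z*(A*z*z + B*z + C) / (z*z + D*z + E);
}
```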

Finally, I’ll close by giving the C code for line 8 (almost a repeat of code from the first posting). This is bare code with no sanity checking on the input parameter x. I’ve marked the lines that need to be modified if you want to use it for a different approximating expression.

float fastlog2(float x)  // compute log2(x) by reducing x to [0.75, 1.5)
{
    /** MODIFY THIS SECTION **/
    // (x-1)*(a*(x-1) + b)/((x-1) + c) (line 8 of table 2)
    const float a = 0.338953;
    const float b = 2.198599;
    const float c = 1.523692;
#define FN fexp + signif*(a*signif + b)/(signif + c)
    /** END SECTION **/

    float signif, fexp;
    int exp;
    float lg2;
    union { float f; unsigned int i; } ux1, ux2;
    int greater;  // really a boolean
    /*
     * Assume IEEE representation, which is sgn(1):exp(8):frac(23)
     * representing (1+frac)*2^(exp-127).  Call 1+frac the significand
     */

    // get exponent
    ux1.f = x;
    exp = (ux1.i & 0x7F800000) >> 23;
    // actual exponent is exp-127, will subtract 127 later

    greater = ux1.i & 0x00400000;  // true if signif > 1.5
    if (greater) {
        // signif >= 1.5 so need to divide by 2.  Accomplish this by
        // stuffing exp = 126 which corresponds to an exponent of -1
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f000000;
        signif = ux2.f;
        fexp = exp - 126;  // 126 instead of 127 compensates for division by 2
        signif = signif - 1.0;
        lg2 = FN;
    } else {
        // get signif by stuffing exp = 127 which corresponds to an exponent of 0
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f800000;
        signif = ux2.f;
        fexp = exp - 127;
        signif = signif - 1.0;
        lg2 = FN;
    }
    // last two lines of each branch are common code, but optimize better
    // when duplicated, at least when using gcc
    return(lg2);
}



Here’s the explanation. Rounding error in *x + y* can happen in two ways. First, if *x* and *y* have different exponents, then the operand with the smaller exponent has its fractional part shifted right to make the exponents match, and so it might drop some bits. Second, even if the exponents are the same, there may be rounding error if the addition of the fractional parts has a carry-out from the high-order bit. In the case of *x − 1*, the exponents are the same if 1 ≤ *x* < 2. And if 1/2 ≤ *x* < 1, the shift is by a single position and the difference is exactly representable, so no bits are dropped. So there is no rounding error in *x − 1* if 1/2 < *x* < 2.

The rule of thumb is that when approximating a function *f* with *f(a) = 0*, severe rounding error can be reduced if the approximation is written in terms of *x − a*. For us, *f* = log2 and *a* = 1, so the rule suggests polynomials written in terms of *x − 1* rather than *x*. By the key fact above, there is no rounding error in *x − 1* when 1/2 < *x* < 2.

Let me apply that to two different forms of the quadratic polynomials used in Part I: the polynomial can be written in terms of *x* or in terms of *x − 1*.

If they are to be used on the reduction interval and I want to minimize relative error, it is crucial that the polynomials be 0 when *x* = 1, so they become

The second equation has no constant term, so they both cost the same amount to evaluate, in that they involve the same number of additions and multiplications.
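Here is a small demonstration of the effect. The quadratic's coefficients below are arbitrary stand-ins, not the minimax ones (any a and b show the same behavior). Both expressions are algebraically identical and zero at x = 1, but in float arithmetic the form written in x loses most of its accuracy near x = 1:

```c
#include <math.h>

// p(x) = a*x^2 + b*x - (a+b)  ==  z*(a*z + (2a+b)) with z = x-1.
// Both are 0 at x = 1.  Near x = 1 the x form subtracts two nearly
// equal numbers that each carry rounding error (cancellation); the
// x-1 form does not.
static void compare(float x, double *err_x_form, double *err_z_form)
{
    const float a = -0.3f, b = 2.0f;   // arbitrary example coefficients
    float z = x - 1.0f;                // exact for 1/2 < x < 2

    // form in x, with each intermediate rounded to float
    float t = a * x;
    t = t * x;
    float s = t + b * x;
    float px = s - (a + b);

    // form in x-1
    float c2 = 2.0f * a + b;
    float pz = z * (a * z + c2);

    // double-precision reference using the same (float) coefficients
    double zd = (double)x - 1.0;
    double ref = zd * ((double)a * zd + (2.0 * (double)a + (double)b));

    *err_x_form = fabs((px - ref) / ref);
    *err_z_form = fabs((pz - ref) / ref);
}
```

At x = 1 − 2^-22 the x form is off by several percent while the x − 1 form is accurate to float precision.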

But one is much more accurate. You can see that empirically using an evaluation program (code below) that I will be using throughout to compare different approximations. I ran the program on both forms, at two spacings, and got the following:

SPACING IS 1/1024
using x    bits  5.5 at x=0.750000 2.06 nsecs nsec/bit=0.372 bits/nsec=2.69
using x-1  bits  5.5 at x=0.750000 2.10 nsecs nsec/bit=0.380 bits/nsec=2.63

SPACING IS 1/4194304 = 2^-22
using x    bits  1.7 at x=1-2.4e-07 2.08 nsecs nsec/bit=1.222 bits/nsec=0.82
using x-1  bits  5.5 at x=0.750000 2.12 nsecs nsec/bit=0.384 bits/nsec=2.61

When the approximating polynomials are evaluated at points spaced 1/1024 apart, they have similar performance. The accuracy of both is 5.5 bits, and the one using *x − 1* is slightly slower. But when they are evaluated at points spaced 2^-22 apart, the polynomial using *x* has poor accuracy when *x* is slightly below 1. Specifically, the accuracy is only 1.7 bits when *x* ≈ 1 − 2.4 × 10^-7.

To see why, note that when *x* ≈ 1, the form in *x − 1* sums two numbers that have rounding error but are of very different sizes, since the quadratic term is tiny compared to the linear term. But the form in *x* sums two numbers of similar size: the sum of the first two terms is about *a + b*, which is then added to −(*a + b*). This is the bad case of subtracting two nearby numbers (cancellation), because they both have rounding error.

I suppose it is an arguable point whether full accuracy for all *x* is worth a time performance hit of about 2%. I will offer this argument: you can reason about your program if you know it has (in this case) 5.5 bits of accuracy on *every* input. You don’t want to spend a lot of time tracking down unexpectedly low accuracy in your code that came about because you used a log library function with poor precision on a small set of inputs.

Here’s some more information on the output of the evaluation program displayed above. The first number is accuracy in bits, measured in the usual way as -log2(ε), where ε is the maximum relative error. Following it is the value of *x* where the max error occurred. The execution time (e.g. 2.06 nsecs for the first line) is an estimate of the time it takes to do a single approximation to log2(*x*), including reducing the argument to the reduction interval. The last two numbers are self-explanatory.

Estimating execution time is tricky. For example, on my MacBook, if the arguments must be brought into the cache, it will significantly affect the timings. That’s why the evaluation program brings the xarr and yarr arrays into the cache before beginning the timing runs.

For polynomials, using *x − 1* has almost the same cost and better accuracy, so there is a good argument that it is superior to using *x*. Things are not so clear when the approximation is a rational function rather than a polynomial. Take the quotient of two linear functions. Because the approximation must be 0 at *x* = 1, the numerator is actually a multiple of *x − 1*. And because you can multiply numerator and denominator by anything (as long as it’s the same anything), the expression simplifies further. This form will have no floating-point cancellation, and will have good accuracy even when *x* ≈ 1. But there’s a rewrite of this expression that is faster:

Unfortunately, this brings back cancellation, because when *x* ≈ 1 there will be cancellation between the constant term and the fraction. Because there’s cancellation anyway, you might as well make a further performance improvement that eliminates the need to compute *x − 1*, namely

Both sides have a division. In addition, the left-hand side has a multiplication and 2 additions. The right-hand side has no multiplications and 2 additions (the constants are precomputed, so they don’t involve a run-time multiplication). So there is one less multiplication, which should be faster. But at the cost of a rounding error problem when *x* ≈ 1.

SPACING IS 1/1024
using x-1  bits  7.5 at x=0.750000 2.17 nsecs nsec/bit=0.289 bits/nsec=3.46
using x    bits  7.5 at x=0.750000 2.07 nsecs nsec/bit=0.275 bits/nsec=3.64

SPACING IS 1/4194304 = 2^-22
using x-1  bits  7.5 at x=0.750000 2.24 nsecs nsec/bit=0.298 bits/nsec=3.36
using x    bits  1.4 at x=1+2.4e-07 2.09 nsecs nsec/bit=1.522 bits/nsec=0.66

As expected, the rational function that has one less multiplication (the line marked *using x*) is faster, but has poor accuracy when *x* is near 1. There’s a simple idea for a fix. When *x* is near 1, use the Taylor series: log2(x) ≈ (x − 1) · log2(e). This is a subtraction and a multiplication, which is most likely cheaper than a division and two additions. What is the size cutoff? The error in the Taylor series is easy to compute: it is the next term in the series, so the relative error is about |x − 1|/2. And I want to maintain an accuracy of 7.5 bits, or 2^-7.5. So the cutoff is |x − 1| < d, where d/2 = 2^-7.5, or d ≈ 0.011.
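The fix can be sketched like this, with the threshold 0.011 coming from the derivation above (function names are mine; in the real routine the Taylor branch would sit in front of the rational approximation):

```c
#include <math.h>

#define LOG2E 1.4426950408889634   // log2(e)

// Near x = 1, log2(x) ~ (x-1)*log2(e), with relative error about
// |x-1|/2.  Keeping 7.5 bits (2^-7.5 ~ 0.0055) therefore allows
// |x-1| < d where d/2 = 2^-7.5, i.e. d ~ 0.011.
static int use_taylor(double x)
{
    return fabs(x - 1.0) < 0.011;
}

static double log2_near_one(double x)
{
    return (x - 1.0) * LOG2E;
}
```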

On my MacBook, the most efficient way to implement the cutoff test appears to be a pair of comparisons against 1 − d and 1 + d. In the evaluation program, most of the *x* are well above 1, so only the first of the two inequalities is executed. Despite this, adding the check still has a high cost, but no more accuracy than using *x − 1*.

SPACING IS 1/1024
using x-1  bits  7.5 at x=0.750000 2.17 nsecs nsec/bit=0.289 bits/nsec=3.46
using x    bits  7.5 at x=0.750000 2.07 nsecs nsec/bit=0.275 bits/nsec=3.64
cutoff     bits  7.5 at x=0.750000 2.58 nsecs nsec/bit=0.343 bits/nsec=2.91

SPACING IS 1/4194304 = 2^-22
using x-1  bits  7.5 at x=0.750000 2.24 nsecs nsec/bit=0.298 bits/nsec=3.36
using x    bits  1.4 at x=1+2.4e-07 2.09 nsecs nsec/bit=1.522 bits/nsec=0.66
cutoff     bits  7.5 at x=0.989000 2.60 nsecs nsec/bit=0.347 bits/nsec=2.88

In Part I of this series, I noted that testing whether *x* was in the range [0.75, 1.5) can be done with a bit operation rather than a floating-point one. The same idea could be used here. Instead of using the Taylor series on the full interval |x − 1| < d, use it on a slightly smaller interval whose endpoints can be tested with bit operations.

The latter can be converted to bit operations on f, the fraction part of x, as follows:

As bit operations, this is

(exp == 0 && (f & 1111111 00...0) == 0) || (exp == -1 && (f & 1111111 00...0) == 1111111 00...0)

When I tested this improvement (*fcutoff* in the table below), it was faster, but still slower than using *x − 1*, at least on my MacBook.

SPACING IS 1/1024
using x-1  bits  7.5 at x=0.750000 2.17 nsecs nsec/bit=0.289 bits/nsec=3.46
using x    bits  7.5 at x=0.750000 2.07 nsecs nsec/bit=0.275 bits/nsec=3.64
cutoff     bits  7.5 at x=0.750000 2.58 nsecs nsec/bit=0.343 bits/nsec=2.91
fcutoff    bits  7.5 at x=0.750000 2.46 nsecs nsec/bit=0.327 bits/nsec=3.06

SPACING IS 1/4194304 = 2^-22
using x-1  bits  7.5 at x=0.750000 2.24 nsecs nsec/bit=0.298 bits/nsec=3.36
using x    bits  1.4 at x=1+2.4e-07 2.09 nsecs nsec/bit=1.522 bits/nsec=0.66
cutoff     bits  7.5 at x=0.989000 2.60 nsecs nsec/bit=0.347 bits/nsec=2.88
fcutoff    bits  7.5 at x=0.750001 2.44 nsecs nsec/bit=0.325 bits/nsec=3.08

Bottom line: having special-case code for *x* ≈ 1 appears to significantly underperform computing in terms of *x − 1*.

In the first post, I recommended reducing to [0.75, 1.5) instead of [1, 2) because you get one extra degree of freedom, which in turn gives greater accuracy. Rounding error gives another reason for preferring [0.75, 1.5). When *x* is slightly below 1, reduction to [1, 2) has cancellation problems. Recall the function *g* that was optimal for the interval [1, 2). When 1/2 ≤ *x* < 1, *x* must be multiplied by two to move it into [1, 2), and then to compensate, the result is *g*(2*x*) − 1. When *x* ≈ 1, *g*(2*x*) ≈ 1, and so you get cancellation. Below are the results of running the evaluation program on *g*(2*x*) − 1. If there were no rounding error, *g* would be accurate to 3.7 bits. As the sample points get closer to 1 (finer spacing), the accuracy drops.

SPACING IS 1/2^19
g  bits  3.5 at x=1-3.8e-06 2.05 nsecs nsec/bit=0.592 bits/nsec=1.69
SPACING IS 1/2^20
g  bits  2.9 at x=1-9.5e-07 2.05 nsecs nsec/bit=0.706 bits/nsec=1.42
SPACING IS 1/2^21
g  bits  2.9 at x=1-9.5e-07 2.05 nsecs nsec/bit=0.706 bits/nsec=1.42
SPACING IS 1/2^22
g  bits  1.7 at x=1-2.4e-07 2.05 nsecs nsec/bit=1.203 bits/nsec=0.83

The goal of this series of posts is to show that you can create logarithm routines that are much faster than the library versions and have a minimum guaranteed accuracy for all *x*. To do this requires paying attention to rounding error. Summarizing what I’ve said so far, my method for minimizing rounding-error problems is to reduce *x* to the interval [0.75, 1.5) and write the approximating expression in terms of *x − 1*. More generally, the approximating expression would be a polynomial

or rational function

I close by giving the code for the evaluation program that was used to compare the time and accuracy of the different approximations:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>

/*
 * Usage: eval [hi reps spacing]
 * Evaluates an approximation to log2 in the interval [0.125, hi]
 * For timing purposes, repeats the evaluation reps times.
 * The evaluation is done on points spaced 1/spacing apart.
 */
int main(argc, argv)
int argc;
char **argv;
{
    float x;
    struct timeval start, stop;
    float lo, hi, delta;
    int i, j, n, repetitions, one_over_delta;
    double xd;
    float *xarr, *lg2arr, *yarr;

    // parameters
    lo = 0.125;
    hi = 10.0;
    one_over_delta = 4194304;  // 2^22
    repetitions = 1;
    if (argc > 1) {
        hi = atof(argv[1]);
        repetitions = atoi(argv[2]);
        one_over_delta = atoi(argv[3]);
    }
    delta = 1.0/one_over_delta;

    // setup
    n = ceil((hi - lo)/delta) + 1;
    xarr = (float *)malloc(n*sizeof(float));
    yarr = (float *)malloc(n*sizeof(float));
    lg2arr = (float *)malloc(n*sizeof(float));
    i = 0;
    for (xd = lo; xd <= hi; xd += delta) {
        x = xd;
        if (x == 1.0)  // relative error would be infinity
            continue;
        xarr[i] = x;
        lg2arr[i++] = log2(x);
    }
    if (i >= n)  // assert (i < n)
        fprintf(stderr, "Help!!!\n");
    n = i;

    /* cache-in xarr[i], yarr[i] */
    yarr[0] = 0.0;
    for (i = 1; i < n; i++) {
        yarr[i] = xarr[i] + yarr[i-1];
    }
    fprintf(stderr, "cache-in: %f\n\n", yarr[n-1]);  // to foil optimizer

    gettimeofday(&start, 0);
    for (j = 0; j < repetitions; j++) {
        for (i = 0; i < n; i++) {
            yarr[i] = approx_fn(xarr[i]);  // the approximation under test
        }
    }
    gettimeofday(&stop, 0);
    finish(&start, &stop, "name    ", n, repetitions, xarr, yarr, lg2arr);
    exit(0);
}

// convert x to string, with special attention when x is near 1
char *format(float x)
{
    static char buf[64];
    float y;

    if (fabs(x - 1) > 0.0001)
        sprintf(buf, "%f", x);
    else {
        y = x - 1;
        if (y < 0)
            sprintf(buf, "1%.1e", y);
        else
            sprintf(buf, "1+%.1e", y);
    }
    return(buf);
}

void finish(struct timeval *start, struct timeval *stop, char *str, int n,
            int repetitions, float *xarr, float *yarr, float *lg2arr)
{
    double elapsed;  // nanosecs
    float max, rel;
    int maxi, i;
    double bits;

    elapsed = 1e9*(stop->tv_sec - start->tv_sec) +
              1000.0*(stop->tv_usec - start->tv_usec);
    max = 0.0;
    for (i = 0; i < n; i++) {
        rel = fabs((yarr[i] - lg2arr[i])/lg2arr[i]);
        if (rel > max) {
            max = rel;
            maxi = i;
        }
    }
    bits = -log2(max);
    elapsed = elapsed/(n*repetitions);
    printf("%s bits %4.1f at x=%s %.2f nsecs nsec/bit=%.3f bits/nsec=%.2f\n",
           str, bits, format(xarr[maxi]), elapsed, elapsed/bits, bits/elapsed);
}


Building a fully scalable website requires a strong focus on code quality. Concepts such as modularity, encapsulation, and testability become extremely important as you move across domains. Whether we are scaling up to desktop or down to mobile, we need the code to stay consistent and maintainable. Every hacked, poorly planned, or rushed piece of code we might add reduces our ability to write elegant, scalable, responsive code.

Perhaps creating a responsive app is not high on your team’s priority list right now. But one day it will be — and the conversion time frame might be very tight when that day comes.

Ideally, all you need to do is add media query CSS and everything just works. But the only way that can happen is if the code readily adapts to responsive changes.

Below are some suggestions and fixes that will make conversion to responsive easier. Some are specific to responsive design while others are general good practices.

Yes, we all know about media queries. How hard can they be? Sprinkle some on any page and you have a responsive website, right?

Using media queries on your pages is essential; they allow you to overwrite CSS values based on screen size. This technique might sound simple, but in a larger project it can quickly get out of hand. A few major problems can get in the way of using media queries properly:

- **Colliding media queries:** It is easy to make the mistake of writing media queries that overwrite each other if you do not stick to a common pattern. We recommend using the same boilerplate throughout all projects, and have created one here.
- **Setting element styles from JS:** This is a tempting, but inferior, approach to building responsive websites. When an element relies on JS logic to set its width, it is unable to properly use media queries. If the JS logic is setting width as an inline property, the width cannot be overwritten in CSS without using `!important`. In addition, you now have to maintain an ever-growing set of JS logic.
- **Media queries not at the bottom:** If your queries are not loaded last, they will not override their intended targets. Every module might have its own CSS file, and the overall ordering might not place it at the bottom, which leads us to our next point.
- **CSS namespacing for encapsulation:** If you are writing a module, its CSS selectors should be properly encapsulated via namespace. We recommend prefixing class names with the module name, such as *navbar-parent*. Following this pattern will prevent conflicts with other modules, and will ensure that media queries at the bottom of your module’s CSS file override their intended targets.
- **Too many CSS selectors:** CSS specificity rules require media queries to use the same specificity in order to override. It is easy to get carried away in LESS, which allows you to nest CSS multiple levels deep. While it can be useful to go one or two levels deep for encapsulation, usually this is unnecessarily complicating your code. We recommend favoring namespacing over nested specifiers as it is cleaner and easier to maintain.
- **Using `!important` to override styles:** Adding `!important` to your styles reduces maintainability. It is better to avoid relying on `!important` overrides and instead use CSS namespacing to prevent sharing between modules.

Both responsive and adaptive web design techniques contain powerful tools, but it is important to understand the differences between the two. Responsive techniques usually include media queries, fluid grids, and CSS percentage values. Adaptive techniques, on the other hand, are focused more on JavaScript logic, and the adding or removing of features based on device detection or screen size.

So, which should you use? Responsive or adaptive? The answer depends on the feature you are trying to implement. It can be tempting to jump straight into applying adaptive techniques to your feature, but in many cases it may not be required. Worse, applying adaptive techniques can quickly over-complicate your design. An example of this that we saw in many places is the use of JavaScript logic to set CSS style attributes.

When styling your UI, JavaScript should be avoided whenever possible. Dynamic sizing, for example, is better done through media queries. For most UI designs, you will be deciding on layouts based on *screen size*, not on device type. Confusing the need for device detection with screen size can lead us to apply adaptive where responsive would be superior.

Rethink any design that requires CSS attributes to change based on device detection; in almost all cases it will be better to rely on screen size alone, via media queries. So, when should we use adaptive JavaScript techniques?

Adaptive web design techniques are powerful, as they allow for selective loading of resources based on user agent or screen size. Logic that checks for desktop browsers, for example, can load high-resolution images instead of their mobile-optimized counterparts. Loading additional resources and features for larger screens can also be useful. Desktop browsers, for example, could show more functionality due to the increased screen size, browser capability, or bandwidth.

Ideally, additional resources will be lazy-loaded for their intended platforms. Lazily loading modules helps with site speed for mobile web, while still allowing for a full set of functionality for desktop and tablet web. This technique can be applied by checking the user agent on the client or server. If done on the server, only resources supported by the user’s platform should be returned. Alternatively, client-based lazy loading can use Ajax requests to load additional resources if they are supported. This effect can be achieved using client-side JavaScript, based on browser support or user agent. Client-side detection is generally preferred, as it allows feature detection based on actual browser functionality instead of potentially complicated user agent checks.

A responsive flex grid doesn’t have to be complicated. In our live demo page, we show a simple implementation that creates a horizontally scrolling section of image containers. The images are centered, allowed to expand up to 100% of their container, and will maintain their original aspect ratio. In addition, the container height values are set to 100%, allowing us to adjust the height in the parent wrapper only, and keeping our media query overrides simple and easy to read.

The HTML and CSS source code use the concepts mentioned above. We plan to add more boilerplate patterns; please don’t hesitate to add your own as well. Pull requests are welcomed!

We hope that the information above will come in handy when you are working on your next mobile-first web project. Below is a summary of what we mentioned above and other helpful tips.

- Most responsive layout can and should be done with media queries. JS manipulation of CSS (maybe with the exception of adding/removing classes) should be avoided. Setting width in JS is not as maintainable or dynamic compared to CSS.
- Use media query boilerplate to ensure you do not have contradicting media queries or have media queries that are always skipped.
- Put media queries at the bottom. Media queries override CSS and should be the final overrides, whether page level or module level.
- If your regular CSS rules have many selectors, your media query CSS rules will have to as well, due to CSS specificity rules. Use as few selectors as possible when defining CSS rules.

- Use CSS classes, not CSS IDs, to avoid CSS specificity issues.
- Use the fewest number of selectors possible to define your selector.
- Reuse classes. If an element has the same look on different parts of the page, do not create two different classes. Make a generic class and reuse it.
- Encapsulate your CSS selectors by using proper namespacing to prevent conflicts.

e.g., `class="module-name-parent"`

- It is very rare that you need to use `!important`. Before you use it, ask yourself whether you can instead add another class (parent or same level). And then ask yourself whether the rule you are trying to override has unnecessary selectors.

- Use LESS nesting only where needed. Nesting is good for organization, but it is also a recipe for CSS specificity issues.
- Check that you do not have a CSS rule that looks like this:

#wrapper #body-content #content #left-side #text { border: 1px solid #000; }

- Work with the design team and define LESS variables using good names. Then, use these LESS variables everywhere possible.
- If you are using a set of CSS rules repeatedly, make it a LESS mixin.

- Most DOM structures are more complex than necessary.
- Add a wrapper only when needed. Do not add a wrapper when proper CSS can do the same thing.
- If you remove the wrapper and the layout does not change, you do not need it. Now, do a global search for this wrapper’s references (JS, CSS, rhtml, jsp, tag) and remove them.

- Add a placeholder to your component for lazy loading.
- Lazy-loaded sections will start off empty, so make sure you reserve the correct amount of space for this behavior. Otherwise, you will see the page shift as modules load in.
- Use media queries for the empty section so that it closely matches the filled size.

- If you are playing around with CSS to attempt a layout and it starts working, remember to remove the unnecessary CSS rules. Many of them are probably not needed anymore. Remove the unnecessary wrappers as well.

Image source: http://upload.wikimedia.org/wikipedia/commons/e/e2/Responsive_Web_Design.png


You can find code for approximate logs on the web, but it rarely comes with an evaluation of how it compares to the alternatives, or in what sense it might be optimal. That is the gap I’m trying to fill here. The first post in this series covers the basics, but even if you are familiar with this subject I think you will find some interesting nuggets. The second post considers rounding error, and the final post gives the code for a family of fast log functions.

A very common way to compute log in some other base (log_e, say) is to use the identity log_b(x) = log2(x) / log2(b) to reduce the problem to computing log2. The reason is that log2(x) for arbitrary x is easily reduced to the computation of log2(x) for x in the interval [1, 2); details below. So for the rest of this series I will focus exclusively on computing log2. The red curve in the plot below shows log2(x) on [1, 2]. For comparison, I also plot the straight line y = x − 1.

If you’ve taken a calculus course, you know that log(x) has a Taylor series about x = 1: log(x) = (x − 1) − (x − 1)^2/2 + ⋯. Truncating the series and combining it with log2(x) = log(x)/log 2 gives a quadratic approximation I’ll call t (t for **T**aylor).
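To make this concrete, here is a minimal sketch of the truncated-Taylor approximation. This is my reconstruction from the series above, not the post’s code; the name `t_approx` and the exact truncation point are assumptions.

```
#include <math.h>

/* Quadratic Taylor approximation to log2(x) about x = 1 (a reconstruction):
 * log(x) ~ (x-1) - (x-1)^2/2, divided by log(2) to convert to base 2. */
double t_approx(double x) {
    double h = x - 1.0;
    return (h - h * h / 2.0) / log(2.0);
}
```

Near x = 1 this agrees with log2(x) to several digits, but at x = 2 it returns about 0.72 instead of 1.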

How well does t approximate log2(x)?

The plot shows that the approximation is very good when x ≈ 1, but is lousy for x near 2—so t is a flop over the whole interval from 1 to 2. But there is a quadratic polynomial that does very well over the whole interval; I call it b for better. It is shown below in red (log2 in blue). The plot makes it look like a very good approximation.

A better way to see the quality of the approximation is to plot the error log2(x) − b(x); the error plot shows where the largest errors occur.

Now that I’ve shown you an example, let me get to the first main topic: how do you evaluate different approximations? The conventional answer is *minimax*. Minimax is very conservative—it only cares about the worst (max) error. It judges the approximation over the entire range by its error on the worst point. In the example above, the worst error occurs at the two points where the error plot peaks, since the two have very similar errors. The term minimax means you want to minimize the max error, in other words find the function with the minimum max error. The max error here is very close to 0.0050, and it is the smallest you can get with a quadratic polynomial. In other words, b solves the minimax problem.

Now, onto the first of the nuggets mentioned at the opening. One of the most basic facts about log is that log 1 = 0, whether it’s log2, loge, or log10. This means there’s a big difference between ordinary and relative error when x ≈ 1.

As an example, take an x very close to 1. The ordinary error log2(x) − b(x) is quite small there. But most likely you care much more about the relative error, (log2(x) − b(x)) / log2(x), which is huge, because the denominator log2(x) is nearly 0. It’s relative error that tells you how many bits are correct. If an approximation and the true value agree to k bits, then their relative difference is about 2^(−k). Or putting it another way, if the relative error is 2^(−k), then the approximation is good to about k bits.

The function b that solved the minimax problem solved it for ordinary error. But it is a lousy choice for relative error. The reason is that its ordinary error is about 0.005 near x = 1. As x → 1, log2(x) → 0, and so the relative error will be roughly 0.005/log2(x), which blows up. But no problem: I can compute a minimax polynomial with respect to relative error instead; I’ll call it r for relative. The following table compares the coefficients of the Taylor method t, minimax for ordinary error b, and minimax for relative error r:

The coefficients of r and b are similar, at least compared to t, but r is a function that is always good to at least 5 bits, as the plot of relative error (below) shows.

Here’s a justification for my claim that r is good to 5 bits. The max relative error for r occurs at the points where the error plot peaks, and the error at each of those points is essentially the same size.

If you’re a nitpicker, you might question whether this is good to 5 bits as claimed. But if you round each expression to 5 bits, they agree.

Unfortunately, there’s a big problem we’ve overlooked. What happens outside the interval [1,2)? Floating-point numbers are represented as x = 2^e × m with 1 ≤ m < 2. This leads to the fact mentioned above: log2(x) = e + log2(m). So you only need to compute log2 on the interval [1, 2). When you use r on [1, 2) and reduce to this range for other x, you get
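The reduction itself can be sketched with the standard library’s `frexpf` (the post writes its own faster reduction later; this sketch is just to illustrate the identity log2(x) = e + log2(m)):

```
#include <math.h>

/* Write x = 2^e * m with 1 <= m < 2, so that log2(x) = e + log2(m).
 * frexpf() returns a mantissa in [0.5, 1), so double it and adjust e. */
float reduce_to_1_2(float x, int *e) {
    float m = frexpf(x, e);  /* x = m * 2^e, with 0.5 <= m < 1 */
    *e -= 1;                 /* compensate for doubling m */
    return 2.0f * m;         /* now 1 <= m < 2 */
}
```

For x = 12 this yields m = 1.5 and e = 3, and indeed 3 + log2(1.5) = log2(12).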

The results are awful for x just below 1. After seeing this plot, you can easily figure out the problem. The relative error of r near x = 2 is about 0.02, and there it is almost the same as the ordinary error (since the denominator log2(x) is close to log2(2) = 1). Now take an x just below 1. Such an x is multiplied by 2 to move it into [1,2), and the approximation to log2(x) is r(2x) − 1, where the −1 compensates for changing x to 2x. The ordinary error is still about 0.02. But log2(x) is very small for x just below 1, so the ordinary error of 0.02 is transformed to a relative error of roughly 0.02/log2(x), which is enormous. At the very least, a candidate for small relative error must satisfy r(2) = 1. But r(2) ≠ 1. This can be fixed by finding the polynomial that solves the minimax problem for relative error over all x. The result is a polynomial g for global.

One surprise about g is that its coefficients appear to be simple rational numbers, suggesting there might be a simple proof that this polynomial is optimal. And there is an easy argument that it is *locally* optimal. Since g(x) = Cx^2 + Bx + A must satisfy g(1) = 0 and g(2) = 1, it is of the form g_C(x) = C(x − 1)^2 + (1 − C)(x − 1). When x > 1 the relative error is ε(x) = (g_C(x) − log2(x)) / log2(x), and lim_{x→1+} ε(x) = (1 − C) log 2 − 1. When x < 1, then ε(x) = (g_C(2x) − 1 − log2(x)) / log2(x), and lim_{x→1−} ε(x) = 2(1 + C) log 2 − 1. The optimal g_C has these two limits equal, that is (1 − C) log 2 − 1 = 2(1 + C) log 2 − 1, which has the solution C = −1/3.

Globally (over all x), the blue curve g does dramatically better, but of course it comes at a cost. Its relative error is not as good as r’s over the interval [1, 2). That’s because g is required to satisfy g(1) = 0 and g(2) = 1 in order to have a small relative error at x = 1. The extra requirement reduces the degrees of freedom, and so g does less well on [1, 2).

Finally, I come to the second nugget. The discussion so far suggests rethinking the basic strategy. Why reduce to the interval [1,2)? Any interval of the form [t, 2t) will do. What about using [0.75, 1.5)? It is easy to reduce to this interval (as I show below), and it imposes only a single requirement: that s(1) = 0. This gives an extra degree of freedom that can be used to do a better job of approximating log2. I call the function based on reduction to [0.75, 1.5) s for shift, since the interval has been shifted.

The result is a thing of beauty! The error of s is significantly less than the error in g. But you might wonder about the cost: isn’t it more expensive to reduce to [0.75, 1.5) instead of [1.0, 2.0)? The answer is that the cost is small. A floating-point number is represented as x = 2^e (1 + f) with 0 ≤ f < 1, and f is stored in the right-most 23 bits. To reduce to [0.75, 1.5) requires knowing when the significand 1 + f is at least 1.5, and that is true exactly when the left-most of the 23 bits is one. In other words, it can be done with a simple bit check, not a floating-point operation.
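That bit check can be sketched in isolation (my code; the fastlog2 listing below does the same thing with a union):

```
#include <stdint.h>
#include <string.h>

/* For a positive normal float x = 2^e * (1 + f), f occupies the low 23
 * bits.  The significand 1 + f is >= 1.5 exactly when the top fraction
 * bit (mask 0x00400000) is set -- a pure integer test. */
int significand_ge_1_5(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* type-pun without a union */
    return (bits & 0x00400000u) != 0;
}
```

For example, 1.5 and 3.0 have the bit set, while 1.25 and 2.5 do not.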

Here is more detail. To reduce to [0.75, 1.5), I first need code to reduce to the interval [1, 2). There are library routines for this of course. But since I’m doing this whole project for speed, I want to be sure I have an efficient reduction, so I write my own. That code, combined with the further reduction to [0.75, 1.5), is below. Naturally, everything is written in single-precision floating-point. You can see that the extra cost of reducing to [0.75, 1.5) is a bit-wise operation to compute the value greater, and a test to see if greater is nonzero. Both are integer operations.

The code does not check that x > 0, much less check for infinities or NaNs. This may be appropriate for a fast version of log.

```
float fastlog2(float x)  // compute log2(x) by reducing x to [0.75, 1.5)
{
    // a*(x-1)^2 + b*(x-1) approximates log2(x) when 0.75 <= x < 1.5
    const float a = -.6296735;
    const float b = 1.466967;
    float signif, fexp;
    int exp;
    float lg2;
    union { float f; unsigned int i; } ux1, ux2;
    int greater;  // really a boolean
    /*
     * Assume IEEE representation, which is sgn(1):exp(8):frac(23)
     * representing (1+frac)*2^(exp-127).  Call 1+frac the significand.
     */

    // get exponent
    ux1.f = x;
    exp = (ux1.i & 0x7F800000) >> 23;
    // actual exponent is exp-127, will subtract 127 later

    greater = ux1.i & 0x00400000;  // true if signif > 1.5
    if (greater) {
        // signif >= 1.5 so need to divide by 2.  Accomplish this by
        // stuffing exp = 126 which corresponds to an exponent of -1
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f000000;
        signif = ux2.f;
        fexp = exp - 126;  // 126 instead of 127 compensates for division by 2
        signif = signif - 1.0;                    // <<--
        lg2 = fexp + a*signif*signif + b*signif;  // <<--
    } else {
        // get signif by stuffing exp = 127 which corresponds to an exponent of 0
        ux2.i = (ux1.i & 0x007FFFFF) | 0x3f800000;
        signif = ux2.f;
        fexp = exp - 127;
        signif = signif - 1.0;                    // <<--
        lg2 = fexp + a*signif*signif + b*signif;  // <<--
    }
    // lines marked <<-- are common code, but optimize better
    // when duplicated, at least when using gcc
    return(lg2);
}
```

You might worry that the conditional test *if (greater)* will slow things down. The test can be replaced with an array lookup. Instead of OR-ing with 0x3f000000 in one branch and 0x3f800000 in the other, you can have a single code path that ORs with an element of a two-entry array indexed by *greater*. Similarly for the other difference between the branches, *exp − 126* versus *exp − 127*. This was not faster on my MacBook.
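For the curious, here is a sketch of that table-indexed variant. It is my reconstruction (reusing the constants from the fastlog2 listing), not the post’s code, and as noted it was not faster in practice:

```
#include <stdint.h>
#include <string.h>

/* fastlog2 with the branch replaced by 2-element tables indexed by the
 * "greater" bit: stuff[] supplies the exponent bits to OR in, and
 * bias[] supplies 126 vs. 127. */
float fastlog2_table(float x) {
    const float a = -0.6296735f, b = 1.466967f;  /* same as fastlog2 */
    static const uint32_t stuff[2] = { 0x3f800000u, 0x3f000000u };
    static const int bias[2] = { 127, 126 };
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    int e = (u & 0x7F800000u) >> 23;       /* raw exponent field */
    int greater = (u & 0x00400000u) != 0;  /* significand >= 1.5? */
    uint32_t v = (u & 0x007FFFFFu) | stuff[greater];
    float signif;
    memcpy(&signif, &v, sizeof signif);
    signif -= 1.0f;
    return (float)(e - bias[greater]) + a * signif * signif + b * signif;
}
```

Whether this beats the branch depends on how well the hardware predicts it; on a well-predicted branch the table lookup just adds memory traffic.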

In summary:

- To study an approximation to f, don’t plot f and the approximation directly; instead plot their difference.
- The measure of goodness for an approximation is its maximum error.
- The best approximation is the one with the smallest max error (minimax).
- For a function like log that has a zero, ordinary and relative error are quite different. The proper yardstick for the quality of an approximation to log is the number of correct bits, which is given by the relative error.
- Computing log2(x) requires reducing x to an interval [t, 2t), but you don’t need t = 1. There are advantages to picking t = 0.75 instead.

In the next post, I’ll examine rounding error, and how that affects good approximations.



One of the projects we are currently working on at Shutl involves building an iOS application. The application is essentially quite simple; it acts as a client for our API, adding animations, visuals, and notifications.

Testing is a key part of our development process, so when we started developing the application, one of the first steps was to find a testing framework that suited our needs. Xcode provides XCTest, a testing framework that works well for unit testing. Unfortunately, if you want to test the behavior of your app from a user’s perspective, XCTest’s abilities are very limited.

Because we are mostly a Ruby shop, we’re familiar with using Cucumber.

That’s how we came across Frank, a handy framework that enables you to write functional tests for your iOS applications using Cucumber.

The way Frank works is that you “frankify” your iOS app, which then lets you use the accessibility features of iOS to emulate a user using an iOS device. You can launch the app, rotate the device, and interact with the screen in most of the ways a real user can.

If you’re familiar with CSS selectors, interacting with elements on the screen should look very familiar, albeit with a slightly different syntax. Frank also provides custom selectors and predefined steps for some of the most common interactions.

For instance, if you want to select a label with the content “I am a label”, you could use this:

check_element_exists('label marked:"I am a label"')

There are also predefined steps provided for more complex instructions like clicking a button with the content “Click me”:

When I touch the button marked "Click me"

At first we considered testing against a live QA server but soon experienced problems with this setup. We needed predictable data for our tests, and this is difficult to achieve as the data stored in a live QA environment changes all the time. Combine this with availability issues and you’ve got yourself an unworkable solution.

After some thought, the route we decided to take was to mock these services and return fixtures.

The idea is to keep all the logic that directly interacts with the server inside a single class or struct. It provides the necessary functions, such as `fetchUser` and `updateResource`, that can be invoked from wherever they’re needed. This allows us to easily implement alternate versions of these functions without affecting the rest of the code.

In the example code below, we have two different implementations. The first one, shown here, uses our remote API to retrieve data from the server.

```
static func requestSuperHeroName(name: String, gender: String, completionHandler: (String) -> Void) {
    let url = NSURL(string: "http://localhost:4567?name=\(name)&gender=\(gender)")
    let request = NSURLRequest(URL: url!)
    NSURLConnection.sendAsynchronousRequest(request, queue: NSOperationQueue.mainQueue()) { (response, data, error) in
        if let name = NSString(data: data, encoding: NSUTF8StringEncoding) {
            completionHandler(name)
        }
    }
}
```

The second implementation – our test mock – simply returns a hard-coded value with the same structure as the ones returned by the server.

```
static func requestSuperHeroName(name: String, gender: String, completionHandler: (String) -> Void) {
    completionHandler("Super \(name)")
}
```

Next we’ll define two different targets, one using the real client and another one using the test mock client, and we’ll use the second – mocked – target to create our frankified app.

Here is a walk-through of the approach, using our sample app “What superhero are you?” as an example. You provide the app with your name and gender, and it uses a highly advanced algorithm to determine which superhero you are.

- Set your app up with two targets. One will be using the real backend, and the other one will be using the mocked backend.

- Frankify your app.

- Write your first test. Our first feature looks like this:
```
Feature: As a user I want to use the app
  So I can determine which superhero I am

  Scenario: Put in my name and gender and have it return which superhero I am
    Given I launch the app
    When I enter my name
    And I choose my gender
    And I touch the button marked "Which superhero am I?"
    Then I want to see which superhero I am
```

And the related steps:

```
When(/^I enter my name$/) do
  fill_in('Name', with: 'Jon')
end

When(/^I choose my gender$/) do
  touch("view:'UISegmentLabel' marked:'Male'")
end

Then(/^I want to see which superhero I am$/) do
  sleep 1
  check_element_exists("view:'UILabel' marked:'Super Jon'")
end
```

- Make the tests pass!

- Done!

## Conclusions

This solution works, but it’s not without its limitations. The most significant is that you need to return some sensible data in your mocks. In our test app we work with very simple logic, and it did the trick. We return fixed responses, which means that there is no way of testing more complex interactions. These can and should be covered by unit and integration tests, which come with their own problems.

It can also be hard to test certain user actions, like swiping something on the screen. The more customized your app’s interface is, the harder it will be to test it with Frank. Almost anything can be done, but the solution will most likely feel hacky. Also, we have yet to find a way of testing web UIs.

Frank is not a magic bullet for functional testing in Swift, but so far we’ve found it a useful addition to our codebase, and we’re liking it!

### Links

What Superhero are you? on Github

Testing with Frank

*(CC image by Chris Harrison)*

(CC image by Esther Vargas)


The podcast, How eBay’s Search Technology Helps Users Find Your Listings, touches on how eBay’s evolving machine learning search technology helps users find the listings they are looking for. Dan describes the parts of each listing that eBay’s algorithm searches to contextually find the best listings for users’ searches, as well as recommend additional listings users might be interested in.

The discussion covers the following topics:

- Whether eBay search utilizes users’ prior search behavior to influence search results
- The top areas in a product listing that are crawled first
- Why search is so important to eBay
- What eBay is looking forward to in the future of our search technology
- How eBay is treating mobile search


I’ve written in the past that I believe that retrospectives should be a creative process, and I like to engage the brain using interesting visuals and ideas. I’ve attempted to employ this philosophy at Shutl (an eBay Inc. company) by trying to use a different theme for every retrospective I’ve run. (A recent example of a theme I found through funretrospectives.com is the catapult retro.)

Then a few weeks ago, I made a comment to one of our engineers, Volker, that you could pretty much take any situation you can think of and turn it into a retrospective idea; thus the challenge of a Zombie Apocalypse-themed retro was born!

I was first introduced to retrospectives in 2007. Back then, a typical retro would follow the starfish format (or some variation). However, over the past few years I’ve started to see some limitations with such formats. In an attempt to address the more common anti-patterns, I’ve been moving towards a slightly adapted format. I now try to incorporate action items into the brainstorming section, both to streamline the time taken and to focus the group on constructive conversation. This format achieves a few things:

- Shortens the overall time taken by having the group identify not only what’s helping/hindering the team, but also what they can carry forward to improve their performance in the future
- Ensures a more constructive mindset by increasing focus, during the brainstorming itself, on suggestions that address hindrances
- Helps create more achievable solutions by modifying the typical “action item” phase of the retro to instead be a refinement phase, where previously suggested actions are analyzed and prioritized

With the above goals in mind, I started by scribbling and sketching out some ideas in my notepad; after a short while I had come up with a basic draft for the structure of the retro:

I bandied the idea around in my head for a day or so. The finished product looked like this:

The picture above was drawn on a large whiteboard and divided into three color-coded columns (with a fourth column for action items, complete with a reminder that our final actions require a “what,” a “who,” and a “when”).

*This is you, huddled in the corner, with your stockpile of weaponry at the ready, bravely fighting off the ravenous horde crashing through your doorway.*

What’s your ammo? On green stickies, write down all those things that are fueling your team’s successes and working in your favor.

*This is the zombie horde—a relentless army of endless undead marching towards your destruction.*

Use pink stickies to identify the problems that you are facing (including potential future problems).

*This is your perimeter—the security measures you’ve installed to resist the horde and ensure your survival.*

As you’re identifying the issues you face and the current behaviors that are fueling your success, think about what actions you can take today to either address these issues or ensure continued success. The idea is to try to come up with a solution or suggestion for every problem that you can see on a pink sticky.

I tried out the format on the team. I gave them about seven minutes for the brainstorming, with the usual guidelines around collaboration: encouraging people to talk to each other and to look at each other’s suggestions. As a countdown timer, I personally use the 3-2-1 dashboard widget, but there are plenty of others you can use.

We then had a round of grouping and voting (each team member got three votes), with a reminder to vote on things you want to **discuss**, not just things you agree with (e.g., you could strongly disagree with a point on the board, and vote for it to start a discussion). Due to the nature of the board (if things go well), groups of pink stickies should have corresponding orange ones to direct the discussion towards action items.

I wrote down all action items that came up, and gave the team a caveat that we’d have five minutes at the end to review the actions, prioritize them, and pick the ones that we actually wanted to address; this keeps the discussions flowing. We ended up with some conflicting action items—which was fine; the idea was to get all the potential actions down, and then at the end decide which we felt were the most valuable. During this final review of the actions, we also assigned owners and deadlines. Then we were done!

Here’s what the final board looked like after our 45-minute retro was complete:

Next challenge: what crazy (yet **effective**) retrospective formats can you come up with?


**Have you ever imagined what would happen if you let software developers work on what they want? Well, we did it. For one day. And here are the results…**

“OK, listen: there is no backlog today.”

When we first heard these words from Megan instead of the usual beginning of standup, we didn’t know what to expect. Further explanation wasn’t elaborate either. There was only one rule: you need to demonstrate something at the end of the day.

We had different reactions. We were happy (“Great! A break from the day-to-day tasks!”), shocked (“What did they do with my safe and predictable to-do column! Help!”), and insecure (“Can I really finish something to show in just one day, with no planning, estimating, or design?”).

So that was it. For one full (work) day, all developers in our team at Shutl (an eBay Inc. company) were supposed to forget about ongoing projects, deadlines, and unfinished tasks from the day before. We could work on whatever we wanted. We could pair or work individually. We could work on a DevOps task, on a working feature, or just on a prototype. We could develop a feature on an existing application or create a brand new project.

The first thing we did was a small brainstorm where we described our ideas. It was not obligatory, but it helped in forming pairs and getting some encouragement for our little projects. Then we just started coding.

Now, let me give some background behind this idea. You may have heard about “programmer anarchy” in the context of development processes and company culture. In a few words: letting engineers make decisions on what and how they develop in order to meet critical requirements, and getting rid of “managers of programmers” in your development process. Fred George, the inventor of the idea, implemented it at a couple of companies. There was also a big buzz about how GitHub works with no managers (or rather with everyone being a manager).

These are great examples to read and think about. There are different opinions about this philosophy. Certainly, developing a culture and process that leaves all decisions to developers requires courage, time, money, and a certain kind of people in your team. You have to think very carefully before applying developer anarchy as a day-to-day rule.

We asked ourselves: could we gain inspiration from the concepts of developer anarchy without changing our processes or getting rid of our managers? We reckoned we could, and Developer Anarchy Days were born!

Introducing Developer Anarchy Days required very little preparation or changes in our organization. No planning or product management was required before it began. We did have some discussions prior to the event on whether it should be a spontaneously picked day or a planned and scheduled action. We decided on a mix of both. Team members would get a ‘warning’ email some days in advance so that they could start thinking about it, but the actual day was a surprise.

The concept is very lightweight and open to interpretation. The premise is simple. Give your developers a day without a backlog or predefined tasks and let their creativity take over. This method has benefits to whatever team composition you may have. Less experienced developers get a chance to expand their skills and their self-confidence as they gain experience in owning and delivering something in a short time frame. More experienced developers get a chance to try out some new technologies they’ve been itching to experiment with. Pairing is always an option (and encouraged), so that there is someone to help and learn from.

What if the team is not an agile team at all? Well, that’s actually a great opportunity to taste a bit of agility. What can be more agile than delivering in just one day?

It depends on how you define wasted time. If you see it as any time not spent directly on delivering pre-defined business requirements/stories, then yes, it is wasted time. You could say the same about avoiding technical debt, working on chores, organizing meetings, or playing ping-pong after lunch. As with any other culture-related thing, it is hard to say. You may waste one day on building software no one will ever look at again. On the other hand, you may learn something, make developers more motivated, invent internal tools that improve efficiency, and even develop some great new innovations to help achieve business goals.

Yes and no. It’s probably not enough time to develop something production-ready, but that’s not the intention. It’s more about trying something new, developing a prototype, creating a small internal tool, or just presenting new ideas to the team. For that, we’ve found that one day is enough.

You can make it longer and spend a couple of days on building small projects in small teams. This may be more effective in complex and usable projects, but also requires more preparation, such as some planning considering ongoing project roadmaps and probably announcing the event earlier so everyone can prepare potential ideas for the projects.

Developer Anarchy Days have a lot in common with hackathons, hackfests, codefests, or hack-days. They’re all about intensively developing and presenting creative ideas. The main difference is that hackathons are usually bigger events in the form of competition, very often involving participants from outside of the company. They require proper event organization, including marketing, proper venue, food, and infrastructure. Usually, the promotional aspect of it is very important. You don’t need all this to organize a Developer Anarchy Day.

- Developers show that they are able to make decisions and explore creative ideas
- Engineers get a chance to come up with ideas from a technological perspective – something that businesses may sometimes miss
- Developers feel more motivated, because they are doing something of their own
- Developers experience how it is when they have to not only deliver something on time but also limit the project to something they can show and sell to others
- Developers can feel like a product manager and understand their job better
- The event breaks the routine of everyday (sometimes monotonous) deliveries
- The event gives everyone an opportunity to finally do stuff that we thought would be nice, but doesn’t bring any direct or indirect business value (e.g. internal tools)
- Finally, the event allows time to try some new technology or crazy idea!

OK, let’s go back to Shutl and our very first Developer Anarchy Day. It was a busy day, but definitely a fun one. Everyone felt responsible for finishing what they began on time. After all, we all had to present something. Some of us were pairing; some decided to give it a go by themselves. Although we love pairing, it is good to get away from it from time to time.

First thing the next morning, we presented our work. The variety and creativity of our little projects was beyond all expectations! Here are a couple of examples.

As Shutl has a service-oriented architecture, our everyday work (as everyone’s DevOps) involves logging into multiple boxes. One of our engineers spent Developer Anarchy Day building a super useful command line tool that automates the process of logging in to specific environments without having to ssh into multiple boxes and remember server names. We’ve used it every day since, making our lives easier.

Every day we gather lots of feedback from our customers. The stars they give in their reviews though are a bit impersonal. You can learn much more by analyzing the language of the feedback comments. A pair of Shutl developers spent a day building a language sentiment analyzer that allowed us to get a sense of the general mood of our customers, based on the words they used.

Another Shutl engineer decided to be more DevOps for that day. He experimented with some new tools and demonstrated immutable deployments with CloudFormation and Chef.

Looking for common or possible use cases of our services, we realized that it would be really convenient to use Shutl to pick up and deliver items sent by private sellers on Gumtree or eBay. We have Shutl.it, which allows customers to deliver items from point A to B. The idea was to create a shareable link that pre-fills Shutl.it with pick-up information so any retailer or private seller can offer Shutl as an easy delivery option.

We definitely had fun and learned something. Actually, we now use “Easy login” every day and “Predefined orders” inspired some things on our roadmap.

No surprise here. It was genuinely positive. What can be better for us nine-to-five workers than a little bit of anarchy, especially when it lasts only one day, after which we quickly revert to the comfort and security of a prioritized backlog and product management. We all agreed that we want to repeat the anarchy on a regular basis. And we do. It has become an important part of our work culture.


eBay provides a platform that enables millions of buyers and sellers to conduct commerce transactions. To help optimize eBay end users’ experience, we perform analysis of user interactions and behaviors. Over the past years, batch-oriented data platforms like Hadoop have been used successfully for user behavior analytics. More recently, we have newer use cases that demand collection and processing of vast numbers of events in near real time (within seconds), in order to derive actionable insights and generate signals for immediate action. Here are examples of such use cases:

- Real-time reporting and dashboards
- Business activity monitoring
- Personalization
- Marketing and advertising
- Fraud and bot detection

We identified a set of systemic qualities that are important to support these large-scale, real-time analytics use cases:

- Scalability – Scaling to millions of events per second
- Latency – Sub-second event processing and delivery
- Availability – No cluster downtime during software upgrade, stream processing rule updates, and topology changes
- Flexibility – Ease in defining and changing processing logic, event routing, and pipeline topology
- Productivity – Support for complex event processing (CEP) and a 4GL language for data filtering, mutation, aggregation, and stateful processing
- Data accuracy – 99.9% data delivery
- Cloud deployability – Node distribution across data centers using standard cloud infrastructure

Given our unique set of requirements, we decided to develop our own distributed CEP framework. Pulsar CEP provides a Java-based framework as well as tooling to build, deploy, and manage CEP applications in a cloud environment. Pulsar CEP includes the following capabilities:

- Declarative definition of processing logic in SQL
- Hot deployment of SQL without restarting applications
- Annotation plugin framework to extend SQL functionality
- Pipeline flow routing using SQL
- Dynamic creation of stream affinity using SQL
- Declarative pipeline stitching using Spring IOC, thereby enabling dynamic topology changes at runtime
- Clustering with elastic scaling
- Cloud deployment
- Publish-subscribe messaging with both push and pull models
- Additional CEP capabilities through Esper integration

On top of this CEP framework, we implemented a real-time analytics data pipeline.

Pulsar’s real-time analytics data pipeline consists of loosely coupled stages. Each stage is functionally separate from its neighboring stage. Events are transported asynchronously across a pipeline of these loosely coupled stages. This model provides higher reliability and scalability. Each stage can be built and operated independently from its neighboring stages, and can adopt its own deployment and release cycles. The topology can be changed without restarting the cluster.

Here is some of the processing we perform in our real-time analytics pipeline:

- Enrichment – Decorate events with additional attributes. For example, we can add geo location information to user interaction events based on the IP address range.
- Filtering and mutation – Filter out irrelevant attributes and events, or transform the content of an event.
- Aggregation – Count the number of events, or add up metrics along a set of dimensions over a time window.
- Stateful processing – Group multiple events into one, or generate a new event based on a sequence of events and processing rules. An example is our sessionization stage, which tracks user session-based metrics by grouping a sequence of user interaction events into web sessions.

The Pulsar pipeline can be integrated with different systems. For example, summarized events can be sent to a persistent metrics store to support ad-hoc queries. Events can also be sent to some form of visualization dashboard for real-time reporting, or to backend systems that can react to event signals.

In Pulsar, our approach is to treat the event stream like a database table. We apply SQL queries and annotations on live streams to extract summary data as events are moving.

The following are a few examples of how common processing can be expressed in Pulsar.

**Event filtering and routing**

```sql
insert into SUBSTREAM select D1, D2, D3, D4
from RAWSTREAM where D1 = 2045573 or D2 = 2047936
or D3 = 2051457 or D4 = 2053742; // filtering

@PublishOn(topics="TOPIC1")       // publish sub stream at TOPIC1
@OutputTo("OutboundMessageChannel")
@ClusterAffinityTag(column = D1); // partition key based on column D1
select * FROM SUBSTREAM;
```

**Aggregate computation**

```sql
// create 10-second time window context
create context MCContext start @now end pattern [timer:interval(10)];

// aggregate event count along dimensions D1 and D2
// within specified time window
context MCContext insert into AGGREGATE select count(*) as METRIC1, D1, D2
FROM RAWSTREAM group by D1, D2 output snapshot when terminated;
select * from AGGREGATE;
```

**TopN computation**

```sql
// create 60-second time window context
create context MCContext start @now end pattern [timer:interval(60)];

// sort to find top 10 event counts along dimensions D1, D2, and D3
// within specified time window
context MCContext insert into TOPITEMS select count(*) as totalCount, D1, D2, D3
from RawEventStream group by D1, D2, D3 order by count(*) limit 10;
select * from TOPITEMS;
```

Pulsar CEP processing logic is deployed on many nodes (CEP cells) across data centers. Each CEP cell is configured with an inbound channel, outbound channel, and processing logic. Events are typically partitioned based on a key such as user id. All events with the same partitioned key are routed to the same CEP cell. In each stage, events can be partitioned based on a different key, enabling aggregation across multiple dimensions. To scale to more events, we just need to add more CEP cells into the pipeline. Using Apache ZooKeeper, Pulsar CEP automatically detects the new cell and rebalances the event traffic. Similarly, if a CEP cell goes down, Pulsar CEP will reroute traffic to other nodes.
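
The key-based routing described above can be sketched in a few lines of Python. This is a toy illustration, not Pulsar's actual code — the cell names and hash choice are made up:

```python
import hashlib

def route(partition_key: str, cells: list[str]) -> str:
    """Pick a CEP cell for an event by hashing its partition key."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return cells[int(digest, 16) % len(cells)]

cells = ["cell-1", "cell-2", "cell-3"]

# All events carrying the same key (e.g. a user id) reach the same cell.
assert route("user-42", cells) == route("user-42", cells)

# When a new cell joins (detected via ZooKeeper in Pulsar CEP), traffic
# is rebalanced across the larger cluster.  A production system would use
# consistent hashing so that only a fraction of keys move on rebalance.
cells.append("cell-4")
target = route("user-42", cells)
assert target in cells
```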

Pulsar CEP supports multiple messaging models to move events between stages. For low delivery latency, we recommend the push model when events are sent from a producer to a consumer with at-most-once delivery semantics. If a consumer goes down or cannot keep up with the event traffic, it can signal the producer to temporarily push the event into a persistent queue like Kafka; subsequently, the events can be replayed. Pulsar CEP can also be configured to support the pull model with at-least-once delivery semantics. In this case, all events will be written into Kafka, and a consumer will pull from Kafka.

Pulsar has been deployed in production at eBay and is processing all user behavior events. We have open-sourced the Pulsar code, we plan to continue to develop the code in the open, and we welcome everyone’s contributions. Below are some features we are working on. We would love to get your help and suggestions.

- Real-time reporting API and dashboard
- Integration with Druid or other metrics stores
- Persistent session store integration
- Support for long rolling-window aggregation

Please visit http://gopulsar.io for source code, documentation, and more information.


Empirical Bayes seems like the wave of the future to me, but it seemed that way 25 years ago and the wave still hasn’t washed in, despite the fact that it is an area of enormous potential importance.

Hopefully this post will be one small step in helping Empirical Bayes to wash in! The case study I’ll present comes from ranking the items that result from a search query. One feature that is useful for ranking items is their historical popularity. On eBay, some items are available in multiple quantities. For these, popularity can be measured by the number of times an item is sold divided by the number of times it is displayed, which I will call sales/impressions (S/I). By the way, everything I say applies to any ratio of counts, not just sales and impressions.

The problem I want to discuss is what to do if the denominator is small. Suppose that items typically have 1 sale per 100 impressions. Now suppose that a particular item gets a sale just after being listed. This is a typical item that has a long-term S/I of about 0.01, but by chance it got its sale early, say after the 3rd impression. So S/I is 1/3, which is huge. It looks like an enormously popular item, until you realize that the denominator I is small: it has received only 3 impressions. One solution is to pass the problem downstream, and give the ranker both S/I and I. Let the ranker figure out how much to discount S/I when I is small. Passing the buck might make sense in some situations, but I will show that it’s not necessary, and that it’s possible to pass a meaningful value even when I is small.

How to do that? Informally, I want a default value of S/I, and I want to gradually move from that default to the actual S/I as I increases. Your first reaction is probably to do this by picking a number (say 100), and if I < 100 use the default, otherwise S/I. But once you start to wonder whether 100 is the right number, you might as well go all the way and do things in a principled way using probabilities.

Jumping to the bottom line: the formula will be (S + α)/(I + γ). This clearly satisfies the desire to be near S/I when S and I are large. It also implies that the default value is α/γ, since that’s what you get when S = I = 0. In the rest of this post I will explain two things. First, how to pick α and γ (there is a right way and a wrong way). And second, where the shape of the formula (S + α)/(I + γ) comes from. If you’re familiar with Laplace smoothing then you might think of using (S + 1)/(I + 1), and our formula is a generalization of that. But it still raises the question — why a formula of this form, rather than, for example, a weighted sum of S/I and the default value α/γ?
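
The bottom-line formula is a few lines of Python. The α and γ values below are placeholders for illustration; the right way to choose them is what the rest of the post works out:

```python
def smoothed_rate(S: int, I: int, alpha: float, gamma: float) -> float:
    """Shrunken sales/impressions estimate (S + alpha) / (I + gamma).

    Returns the prior mean alpha/gamma when there is no data, and
    approaches the raw ratio S/I as I grows.
    """
    return (S + alpha) / (I + gamma)

# Hypothetical prior: alpha/gamma = 0.01, i.e. 1 sale per 100 impressions.
alpha, gamma = 1.0, 100.0

print(smoothed_rate(0, 0, alpha, gamma))        # 0.01 -- no data: the default
print(smoothed_rate(1, 3, alpha, gamma))        # ~0.019 -- not the alarming 1/3
print(smoothed_rate(100, 10000, alpha, gamma))  # 0.01 -- matches raw S/I
```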

The formula (S + α)/(I + γ) comes by imagining that at each impression, there is a probability p of an associated sale, and then returning the best estimate of that probability instead of returning S/I. I’ll start with the simplest way of implementing this idea (although it is too simple to work well).

Suppose the probability of a sale has a fixed universal value p, so that whenever a user is shown an item, there is a probability p that the item is sold. This is a hypothetical model of how users behave, and it’s straightforward to test if it fits the data. Simply pick a set of items, each with an observed sale count and impression count. If the simple model is correct, then an item with I impressions will receive S sales with probability given by the binomial formula:

$$\Pr(S \mid I) = \binom{I}{S} p^S (1-p)^{I-S}$$

Here I is the number of impressions and S the number of sales. As mentioned earlier, this whole discussion also works for other meanings of S and I, such as S is clicks and I is impressions. To test the simple model, I can compare two sets of data. The first is the observed pairs (S_i, I_i). In other words, I retrieve historical info for each item, and record impressions and sales. I construct the second set by following the simple model: I take the actual number of impressions I_i, and randomly generate the number of sales S_i according to the formula above. Below is a histogram of the two data sets. Red is simulated (the model), and blue is actual. The match is terrible.

Here is some more detail on the plot: only items with a nonzero sale count are shown. In the simulation, 21% of items have S = 0, but the actual data has 47%.
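
The mismatch can be reproduced with a small simulation. The parameters below are illustrative only — the Beta prior here is made up, not fitted to eBay data:

```python
import random

random.seed(0)
n, impressions = 5000, 100
alpha, beta = 0.1, 10.0     # hypothetical prior with mean ~0.01

def frac_zero_sales(sales):
    return sum(s == 0 for s in sales) / len(sales)

# Model 1: one universal value of p for every item.
p_fixed = alpha / (alpha + beta)
fixed = [sum(random.random() < p_fixed for _ in range(impressions))
         for _ in range(n)]

# Model 2: each item first draws its own p from Beta(alpha, beta),
# then its sales are binomial with that p.
varied = []
for _ in range(n):
    p = random.betavariate(alpha, beta)
    varied.append(sum(random.random() < p for _ in range(impressions)))

# The varying-p model produces far more zero-sale items, matching the
# overdispersion seen in the real data.
print(frac_zero_sales(fixed), frac_zero_sales(varied))
```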

So we need to go to a more sophisticated model. Instead of a fixed value of p, imagine drawing p from a probability distribution and plugging it into the inset equation, which is then used to get the random S. As you can see in the plot below, the two histograms have a much more similar shape than in the previous plot, and so this model does a better job of matching the actual data.

Now it all boils down to finding the distribution for p. Since 0 ≤ p ≤ 1, that means finding a probability distribution on the interval [0, 1]. The most common such distribution is the Beta distribution, which has two parameters, α and β. By assuming a Beta distribution, I reduce the problem to finding α and β (and yes, this α is the same one as in the formula (S + α)/(I + γ)). This I will do by finding the values of α and β that best explain the observed values of S and I. Being more precise, associated to each of n historical items is a sale count S_i and an impression count I_i, with 1 ≤ i ≤ n.

I was perhaps a bit flip in suggesting the Beta distribution because it is commonly used. The real reason for selecting Beta is that it makes the computations presented in the Details section below much simpler. In the language of Bayesian statistics, the Beta distribution is conjugate to the binomial.

At this point you can fall into a very tempting trap. Each S_i/I_i is a number between 0 and 1, so all the S_i/I_i values form a histogram on [0, 1]. The possible values of p follow the density function for the Beta distribution and so also form a histogram on [0, 1]. Thus you might think you could simply pick the values of α and β that make the two histograms match as closely as possible. This is wrong, wrong, wrong. The S_i/I_i values are from a discrete distribution and often take on the value 0. The values of p come from a continuous distribution (Beta) and are never 0, or more precisely, the probability that p = 0 is 0. The distributions of S_i/I_i and of p are incompatible.

In my model, I’m given I and I spit out S by drawing p from a Beta distribution. The Beta is invisible (latent) and *indirectly* defines the model. I’ll give a name to the output of the model: X. Restating, fix an I and make X a random variable that produces value S with the probability controlled indirectly by the Beta distribution. I need to match the observed (empirical) values of S_i to X, not to Beta. This is the empirical Bayes part. I’ll give an algorithm that computes α and β later.

But first let me close the loop, and explain how all this relates to (S + α)/(I + γ). Instead of reporting S/I, I will report the probability of a sale. Think of the probability as a random variable — call it P. I will report the mean value of the random variable P. How to compute that? I heard a story about a math department that was at the top of a tall building whose windows faced the concrete wall of an adjacent structure. Someone had spray-painted on the wall “don’t jump, integrate by parts.” If it had been a statistics department, it might have said “don’t jump, use Bayes’ rule.”

Bayes’ rule implies a conditional probability. I want not the expected value of P, but the expected value of P conditional on I impressions and S sales. I can compute that from the conditional distribution Pr(p | S, I). To compute this, flip the two sides of the | to get Pr(S | I, p). This is the binomial probability, which is just the inset equation at the beginning of this post!

Now you probably know that in Bayes’ rule you can’t just flip the two sides, you also have to include the prior. The formula is really

$$\Pr(p \mid S, I) \propto \Pr(S \mid I, p)\,\Pr(p)$$

And Pr(p) is what we decided to model using the Beta distribution with parameters α and β. These are all the ingredients for Empirical Bayes. I need Pr(p | S, I), I evaluate it using Bayes’ rule, the rule requires a prior, and I use empirical data to pick the prior. In empirical Bayes, I select the prior that best explains the empirical data. For us, the empirical data is the observed values of S_i and I_i. When you do the calculations (below) using the Beta distribution as the prior, you get that the mean of P is (S + α)/(I + γ) where γ = α + β.

How does this compare with the simplistic method of using S/I when I > δ, and η otherwise? The simplistic formula involves two constants δ and η, just as the principled formula involves two constants α and γ. But the principled method comes with an algorithm for computing α and γ, given below. The algorithm is a few lines of R code (using the `optimx` package).

I’ll close by filling in the details. First I’ll explain how to compute α and β.

I have empirical data on n items. Associated with the i-th item (1 ≤ i ≤ n) is a pair (S_i, I_i), where S_i might be the number of sales and I_i the number of impressions, but the same reasoning works for clicks instead of sales. A model for generating the S_i is that for each impression there is a probability p_i that the impression results in a sale. Then I add in that the probability p_i is itself random, drawn from a parametrized prior distribution with density function f_θ. I generate the S_i in a series of independent steps. At step i, I draw p_i from f_θ, and then generate S_i according to the binomial probability distribution on I_i impressions:

$$\Pr(S_i \mid I_i, p_i) = \binom{I_i}{S_i} p_i^{S_i} (1 - p_i)^{I_i - S_i}$$

Using this model, the probability of seeing S_i given I_i is computed by averaging over the different possible values of p, giving

$$\Pr(S_i \mid I_i) = \int_0^1 \binom{I_i}{S_i} p^{S_i} (1 - p)^{I_i - S_i} f_\theta(p)\,dp$$

I’d like to find the parameter θ that best explains the observed S_i, and I can do that by maximizing the probability of seeing all those S_i. The probability of seeing S_i is Pr(S_i | I_i), the probability of seeing the whole set is the product ∏_i Pr(S_i | I_i), and the log of that probability is ∑_i log Pr(S_i | I_i). This is a function of θ, and I want to find the value of θ that maximizes it. This log probability is conventionally called the *log-likelihood*.

Since I’m assuming f_θ is a beta distribution, with θ = (α, β) and

$$f_\theta(p) = \frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}$$

then Pr(S_i | I_i) becomes

$$\Pr(S_i \mid I_i) = \binom{I_i}{S_i} \frac{B(S_i + \alpha,\, I_i - S_i + \beta)}{B(\alpha, \beta)}$$

The calculation above uses the definition of the beta function B and the formula for the beta integral

$$\int_0^1 p^{a-1}(1-p)^{b-1}\,dp = B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$$

If you don’t want to check my calculations, note that Pr(S_i | I_i) is just the beta-binomial distribution, and you can find its formula in many books and web pages.

Restating, to find α and β is to maximize the log-likelihood ∑_i log Pr(S_i | I_i), specifically

$$\sum_{i=1}^{n} \left[ \log \binom{I_i}{S_i} + \log B(S_i + \alpha,\, I_i - S_i + \beta) - \log B(\alpha, \beta) \right]$$

And since the first term doesn’t involve α or β, you only need to maximize

$$\sum_{i=1}^{n} \left[ \log B(S_i + \alpha,\, I_i - S_i + \beta) - \log B(\alpha, \beta) \right]$$

The method I used to maximize that expression was the `optimx` package in R.
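
The same maximization can be sketched in Python with scipy instead of R's `optimx`. The data here is synthetic, generated from a made-up "true" prior, since the real S_i and I_i aren't available:

```python
# Maximize the beta-binomial log-likelihood
#   sum_i [log B(S_i + a, I_i - S_i + b) - log B(a, b)]
# over a, b > 0 (the binomial-coefficient term is dropped since it does
# not involve a or b).
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

rng = np.random.default_rng(1)
true_alpha, true_beta = 1.0, 99.0          # made-up "true" prior
I = rng.integers(1, 500, size=2000)        # impressions per item
p = rng.beta(true_alpha, true_beta, size=2000)
S = rng.binomial(I, p)                     # sales per item

def neg_log_likelihood(log_params):
    a, b = np.exp(log_params)              # optimize on log scale: a, b > 0
    return -np.sum(betaln(S + a, I - S + b) - betaln(a, b))

res = minimize(neg_log_likelihood, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(res.x)
gamma_hat = alpha_hat + beta_hat
# Smoothed estimate for an item with S sales and I impressions:
#   (S + alpha_hat) / (I + gamma_hat)
print(alpha_hat, gamma_hat)
```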

The final missing piece is why, when I replace S/I with the probability that an impression leads to a sale, the formula becomes (S + α)/(I + γ).

I have an item with an unknown probability of sale p. All that I do know is that it got S sales out of I impressions. If P is the random variable representing the sale probability of an item, and X is a random variable representing the sales and impressions of an item, I want Pr(P = p | X = (S, I)), which I write as Pr(p | S, I) for short. Evaluate this using Bayes’ rule,

$$\Pr(p \mid S, I) = \frac{\Pr(S, I \mid p)\,\Pr(p)}{\Pr(S, I)}$$

The term Pr(S, I) can be ignored. This is not deep, but can be confusing. In fact, any factor involving only S and I (like the binomial coefficient) can be ignored. That’s because the conditional density must integrate to 1,

$$\int_0^1 f(p \mid S, I)\,dp = 1$$

so if f(p | S, I) = c·g(p), where c involves only S and I, it follows that c can be recovered from g using c = 1/∫ g(p) dp. In other words, I can simply ignore a constant factor and reconstruct it at the very end by making sure that the final expression integrates to 1.

I know that

$$f(p \mid S, I) \propto \Pr(S \mid I, p)\, f(p) = \binom{I}{S} p^{S} (1 - p)^{I - S} f(p)$$

For us, the prior is a beta distribution, f(p) = p^(α−1)(1 − p)^(β−1)/B(α, β). Some algebra then gives

$$f(p \mid S, I) \propto p^{S + \alpha - 1} (1 - p)^{I - S + \beta - 1}$$

The symbol ∝ ignores constants involving only S, I, α, and β. Writing the right-hand side as a Beta density, which integrates to 1, turns the proportionality into an equality:

$$f(p \mid S, I) = \frac{p^{S + \alpha - 1} (1 - p)^{I - S + \beta - 1}}{B(S + \alpha,\, I - S + \beta)}$$

For an item with S sales and I impressions, I want to know the value of p, but this formula gives the probability density for p. To get a single value, I take the mean, using the fact that the mean of a Beta(a, b) distribution is a/(a + b). So the estimate for p is

$$\frac{S + \alpha}{(S + \alpha) + (I - S + \beta)} = \frac{S + \alpha}{I + \alpha + \beta}$$

This is just (S + α)/(I + γ) with γ = α + β.
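
As a sanity check on the algebra, the posterior density can be integrated numerically. The prior parameters and counts below are arbitrary example values:

```python
import math

alpha, beta, S, I = 1.0, 99.0, 1, 3   # made-up prior and counts

def betaf(a, b):
    # Beta function B(a, b) computed via log-gamma for numerical stability.
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

B = betaf(S + alpha, I - S + beta)

def posterior(p):
    # The density p^(S+a-1) (1-p)^(I-S+b-1) / B(S+a, I-S+b) derived above.
    return p ** (S + alpha - 1) * (1 - p) ** (I - S + beta - 1) / B

# Midpoint-rule integration on [0, 1].
N = 100000
grid = [(k + 0.5) / N for k in range(N)]
total = sum(posterior(p) for p in grid) / N
mean = sum(p * posterior(p) for p in grid) / N

assert abs(total - 1.0) < 1e-3                              # integrates to 1
assert abs(mean - (S + alpha) / (I + alpha + beta)) < 1e-3  # mean (S+α)/(I+γ)
print(mean)
```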

There’s room for significant improvement. For each item on eBay, you have extra information like the price. The price has a big effect on S/I, and so you might account for that by dividing items into a small number of groups (perhaps low-price, medium-price, and high-price), and computing α and γ for each. There’s a better way, which I will discuss in a future post.
