Luca Passani
New WURFL API and why it is important to upgrade
by Luca Passani
passani at eunet dot no
I have mentioned the new WURFL API on WMLProgramming many times over the past year. Now it is here.
We are launching the new API for Java and PHP.
Let me say straight out that the basic API you get "out of the box" has not changed much.
What has changed is the
engine under the hood, i.e. the heuristics which will perform the magic of finding the best
possible WURFL ID for every UA string thrown at it.
In addition, the Java API leverages the power of the Spring framework to allow for advanced
usage of the API by enabling you to plug in different implementations of some basic components,
should the need arise.
Two-Step UA String Analysis
The algorithm used
in the old API for matching incoming HTTP requests against the list of WURFL User-Agents
is very simple: if a direct match isn't found, the incoming UA string is
progressively shortened one character at a time, starting from the end.
At each iteration, a substring match against the existing UAs in WURFL is attempted.
The first successful match decrees the winner: a WURFL ID is found. I'll call this algorithm RIS
(for "Reduction to Initial String").
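To make the procedure concrete, here is a minimal Python sketch of RIS as described above. It is not the code of the old API: the function name, the toy table, and the IDs in it are illustrative, and it assumes that the "substring match" means "some WURFL UA starts with the shortened string". The `min_length` cut-off is a parameter of the sketch.

```python
def ris_match(user_agent, known_uas, min_length=9):
    """Reduction to Initial String: return a WURFL ID, or None if nothing matches.

    known_uas maps UA strings to WURFL IDs (a toy stand-in for the WURFL data).
    min_length is the shortest shortened string we still accept a match on.
    """
    # A direct match wins immediately.
    if user_agent in known_uas:
        return known_uas[user_agent]
    # Otherwise drop one character at a time from the end and retry.
    candidate = user_agent[:-1]
    while len(candidate) >= min_length:
        # Sorted iteration keeps the result deterministic when several
        # known UAs share the same initial string.
        for known in sorted(known_uas):
            if known.startswith(candidate):
                return known_uas[known]
        candidate = candidate[:-1]
    return None


# Toy table: the UA strings and IDs are made up for illustration.
KNOWN = {
    "NokiaN70-1/5.0609.2.0.1 Series60/2.8": "nokia_n70_ver1",
    "NokiaN73-1/2.0628.0.0.1 Series60/3.0": "nokia_n73_ver1",
}

print(ris_match("NokiaN70-1/9.9", KNOWN))
print(ris_match("NokiaXYZ123", KNOWN))
print(ris_match("NokiaXYZ123", KNOWN, min_length=5))
```

With the toy table, the first call settles on the N70 entry, the second gives up, and the third shows what happens when the cut-off is too permissive: "Nokia" alone is enough to "match" an unrelated device.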
RIS has worked reasonably well for the WURFL community
all this time, but it had a few shortcomings.
First, RIS was not very efficient. We could have lived with that.
After all, caching the result solved the problem.
Secondly, RIS had to rely on a fixed threshold. When would one stop reducing the
string while looking for a match? After all, matching "NokiaN70" would probably
be good enough for most uses, but matching "Nokia" alone would mean nothing.
Worse than nothing, actually: it would be better to let the service
owner know that a given UA had not been recognized. Nine characters was chosen as a
catch-all threshold for RIS, but it goes without saying that this threshold turned out to be
too short in some cases and too long in others for the many UAs which had no direct counterpart
in WURFL.
But there is one extra reason which has made RIS less and less ideal as the general-purpose
matcher: the introduction of
"Mozilla-ite" user agents (i.e. user agents that disguise themselves as web browsers) made RIS a
poor matching algorithm in many cases (admittedly, not all, but definitely some).
It was clear to me that RIS's ability to cover all UA strings was deteriorating and it was time
to think of something new. Discussion on WMLProgramming only confirmed this, but also
brought me to the conclusion that no single algorithm could be used optimally to match
arbitrary User-Agent strings. That's when I started thinking about two-step matching, i.e.
a matching algorithm that:
- will do a first pass and try to understand what kind of UA string
we are dealing with (Mozilla-ite? iPhonite? Microsoftite?) and, subsequently,
- will pull the right UA "handler" from a pool of handlers, each of which
will possess the specific knowledge to handle the string belonging to
a given family of UA strings.
As usual, a specific example is worth one thousand words.
Example of Two-Step Analysis
Let's consider BlackBerries. Those devices are normally very well-behaved and introduce themselves as something like:
BlackBerry8800/4.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/134
which is, obviously, an optimal target for RIS matching. Unfortunately, RIM engineers
(who, like engineers anywhere on the planet, should be prohibited from messing
with the design of UIs meant for consumers) had the brilliant idea of allowing users to change
the UA string in the browser settings. This is a nuisance because, now and again,
people will hit your service with funny UA strings such as the following:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) BlackBerry8800/4.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/134
or with one which disguises itself as Microsoft Mobile Explorer.
It goes without saying that straight RIS would behave poorly in this case and would typically come up with
the WURFL ID of either a web browser or Microsoft Mobile Explorer. This is not what we, the brethren
of mobile development, want.
Can two-step analysis make a difference? It can. And rather elegantly too, if you ask me.
Sticking with the same example (a BlackBerry impersonating MS mobile), the first
step will realize that the UA string contains the "BlackBerry" substring.
This is enough to declare that the "BlackBerry handler"
will need to take care of this baby.
In the second step it won't be hard for the BlackBerry handler to recognize the undue addition
at the beginning of the string, normalize it by simply removing the initial part of the string up
to the BlackBerry token, and proceed to feed the clean string to RIS for regular
handling. Bingo! blackberry8800_ver1_sub421102 is returned!
The Prodigal Son has come home.
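The BlackBerry walk-through above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Java or PHP implementation: the class and function names are invented for the sketch, the table is a toy one, and both the UA stored for blackberry8800_ver1_sub421102 and the desktop ID are hypothetical. RIS still uses the old fixed threshold of nine characters here, for simplicity.

```python
def ris_match(user_agent, known_uas, min_length):
    """Plain RIS: shorten from the end until some known UA starts with the string."""
    for end in range(len(user_agent), min_length - 1, -1):
        candidate = user_agent[:end]
        for known in sorted(known_uas):
            if known.startswith(candidate):
                return known_uas[known]
    return None


class BlackBerryHandler:
    """Second step for the BlackBerry family (names are illustrative)."""
    token = "BlackBerry"

    def can_handle(self, user_agent):
        # First step: a cheap test that decides which family the UA belongs to.
        return self.token in user_agent

    def normalize(self, user_agent):
        # Drop whatever was prepended before the BlackBerry token.
        return user_agent[user_agent.index(self.token):]

    def match(self, user_agent, known_uas):
        return ris_match(self.normalize(user_agent), known_uas, min_length=9)


class CatchAllHandler:
    """Fallback: straight RIS on the untouched string."""

    def can_handle(self, user_agent):
        return True

    def match(self, user_agent, known_uas):
        return ris_match(user_agent, known_uas, min_length=9)


HANDLERS = [BlackBerryHandler(), CatchAllHandler()]


def detect(user_agent, known_uas):
    """Pick the first handler that claims the UA and let it do the matching."""
    for handler in HANDLERS:
        if handler.can_handle(user_agent):
            return handler.match(user_agent, known_uas)
    return None


# Toy data: the stored BlackBerry UA and the desktop ID are hypothetical.
KNOWN = {
    "BlackBerry8800/4.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/102":
        "blackberry8800_ver1_sub421102",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)":
        "hypothetical_desktop_msie6",
}
DISGUISED = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) "
             "BlackBerry8800/4.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/134")

print(detect(DISGUISED, KNOWN))        # two-step: the BlackBerry handler takes over
print(ris_match(DISGUISED, KNOWN, 9))  # straight RIS on the same string
```

On this toy table the two-step path lands on the BlackBerry entry, while straight RIS stops at the desktop browser entry, which is exactly the failure described above.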
RIS On Steroids
I just mentioned RIS, but do not be fooled. It's not the same RIS we know from before.
The RIS we know from before had a fixed threshold of 9 chars. The new RIS used in the new API
will have the threshold calculated dynamically for each UA. The rules according to which the
threshold is calculated are not fixed. Each and every handler which uses RIS will know how
to calculate the threshold for the family of devices it is handling.
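As an illustration of a per-handler threshold, here is a small Python sketch. The rule used by the BlackBerry handler (require at least everything before the first "/") is an assumption made up for this example, not necessarily the rule the new API applies. The class names and the toy table are likewise hypothetical.

```python
def ris_match(user_agent, known_uas, min_length):
    """RIS with the threshold supplied by the caller."""
    for end in range(len(user_agent), min_length - 1, -1):
        candidate = user_agent[:end]
        for known in sorted(known_uas):
            if known.startswith(candidate):
                return known_uas[known]
    return None


class Handler:
    """Default behaviour: the old catch-all threshold of nine characters."""

    def ris_threshold(self, user_agent):
        return 9

    def match(self, user_agent, known_uas):
        return ris_match(user_agent, known_uas, self.ris_threshold(user_agent))


class BlackBerryHandler(Handler):
    """Hypothetical rule: never reduce past the model token before the first '/'."""

    def ris_threshold(self, user_agent):
        slash = user_agent.find("/")
        return slash if slash > 0 else len(user_agent)


# Toy table; the stored UA string is hypothetical.
KNOWN = {
    "BlackBerry8800/4.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/102":
        "blackberry8800_ver1_sub421102",
}

unknown_model = "BlackBerry9999/1.0"
print(Handler().match(unknown_model, KNOWN))            # fixed threshold
print(BlackBerryHandler().match(unknown_model, KNOWN))  # dynamic threshold
```

With the fixed threshold the made-up model "9999" is reduced all the way down to "BlackBerry" and wrongly picks up the 8800 entry. With the dynamic threshold the handler reports no match instead.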
Not Too Distant Strings
As mentioned above, RIS may behave very poorly with Mozilla-ites, i.e. devices which
advertise themselves with UA-Strings starting with "Mozilla/". In particular, if you need to
identify web browsers in order to direct them somewhere else, RIS is likely to produce many
false positives.
How can this problem be solved? The main observation is that many Mozilla-ites carry very distinctive
substrings somewhere in the middle of the UA string. While RIS will erase those substrings
in a desperate attempt to find a match, thus getting trapped in a blind alley,
a smarter algorithm will realize that the two strings, taken in their entirety, are actually pretty similar
to one another. Such smarter algorithms exist. One of them is called "Levenshtein Distance"
(LD, for friends).
According to Levenshtein, the distance between "Luca" (Passani) and "Luciano" (Pavarotti) is 3.
This fact is probably not very interesting to you. What you may find more interesting is the fact
that the following UA (not in WURFL at the time of this writing):
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; de-de) AppleWebKit/420.1 (KHTML, like Gecko)
Version/3.0 Mobile/3B48b Safari/419.3
will be found to be very close to this (in WURFL):
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko)
Version/3.0 Mobile/3B48b Safari/419.3
This will let the API conclude that apple_iphone_ver1_sub3b48a is a good enough match
for the German iPhone. An excellent choice, if you ask me.
Just for the record, there's a concept of threshold for the Levenshtein Distance algorithm too.
After all, every string is at some distance from every other string, so there must be a cut-off level
at some point (i.e. choose the closest, but only as long as it is close enough to make sense).
Just as with RIS, the LD threshold may be calculated dynamically by each handler.
While the threshold may vary, 3 to 4 chars are typical values for reasonable LD thresholds.
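For readers who have never met the algorithm, here is a minimal Python sketch of Levenshtein Distance with a cut-off, using the examples above. The function names and the one-entry table are illustrative, and the sketch assumes that the two iPhone UAs are single-line strings whose halves are joined by a space.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def ld_match(user_agent, known_uas, max_distance):
    """Return the ID of the closest known UA, or None if none is close enough."""
    best_id, best_distance = None, max_distance + 1
    for known in sorted(known_uas):
        distance = levenshtein(user_agent, known)
        if distance < best_distance:
            best_id, best_distance = known_uas[known], distance
    return best_id


IPHONE_EN = ("Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420.1 "
             "(KHTML, like Gecko) Version/3.0 Mobile/3B48b Safari/419.3")
IPHONE_DE = ("Mozilla/5.0 (iPhone; U; CPU like Mac OS X; de-de) AppleWebKit/420.1 "
             "(KHTML, like Gecko) Version/3.0 Mobile/3B48b Safari/419.3")
KNOWN = {IPHONE_EN: "apple_iphone_ver1_sub3b48a"}

print(levenshtein("Luca", "Luciano"))      # 3
print(levenshtein(IPHONE_DE, IPHONE_EN))   # 4
print(ld_match(IPHONE_DE, KNOWN, max_distance=4))
```

Under that assumption the German and English strings are 4 edits apart ("de-de" versus "en"), so a threshold of 4 accepts the match and a threshold of 3 rejects it.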
Conclusive vs. Recovery Heuristics
The heuristics introduced with the new API can be grouped in two kinds: Conclusive and Recovery.
Conclusive heuristics will return the WURFL ID of an actual device root. If this is not possible (and
"generic" is all the API has when the handler returns from evaluating a UA string),
recovery heuristics will kick in.
Recovery heuristics will try to figure something out about some basic device capabilities.
Browser name and version may mean a lot. Similarly, an unrecognized UA that starts with "Mozilla/"
is stating that XHTML of some kind is supported. A recovery heuristic will make sure that
"generic_xhtml" is returned in place of "generic" for that device (not that it makes a whole lot of difference, now that XHTML has become the default mark-up for unrecognised devices, but you get the spirit).
Matching Web Browsers
When WURFL was created, my only goal was to match mobile browsers. After all,
anyone foolish enough to hit a mobile site with a web browser did not deserve
much of our time.
As years went by, though, more and more developers wanted to use WURFL to match web browsers too
(typical scenario: their companies wanted a single entry point for their web and mobile presence):
people would show up on WMLProgramming and ask why their Opera browser
was being detected as a Nintendo Wii (nintendo_wii_ver1, UA: Opera/9.10 (Nintendo Wii; U; ; 1621; en)). HTC vs. MS Internet Explorer was another classic query.
Of course, this was a consequence of using RIS exclusively in the old API.
The new API helps here. It has extra logic to handle web browsers. Just make sure that
the new Web Patch is loaded, and
those web browsers will be given the place they deserve on the right side of the
"mobile vs. web" fence.
Transcoders
If you have been around for some time, you already know what transcoders are
and why many think transcoders are
disgusting. The new API does what it can to detect transcoders and find the original UA-string.
Do Yourself A Favor and Upgrade
After all these words, I hope you got the message. Upgrading is up to you, but hopefully
you have collected enough data to believe that doing it is a great favor you can do
for your customers and yourself!
If you would like to update your own copy of WURFL and would like to do it in
a way that lets RIS and LD perform optimally, I warmly recommend that you become
intimately familiar with the
guidelines for WURFL contributors available at:
http://www.wurflpro.com/static/top.htm.
As usual, you can direct your questions and comments to
WMLProgramming at YahooGroups.
Enjoy! Luca Passani