URL Parsing With WSGI And Paste
- author
Ian Bicking <ianb@colorstudy.com>
Introduction and Audience
This document is intended for web framework authors and integrators, and people who want to understand the internal architecture of Paste.
If you have questions about this document, please contact the paste mailing list or try IRC (#pythonpaste on freenode.net). If there’s something that confused you and you want to give feedback, please submit an issue.
URL Parsing
Note
Sometimes people use “URL”, and sometimes “URI”. I think URLs are a subset of URIs. But in practice you’ll almost never see URIs that aren’t URLs, and certainly not in Paste. URIs that aren’t URLs are abstract Identifiers, that cannot necessarily be used to Locate the resource. This document is all about locating.
Most generally, URL parsing is about taking a URL and determining what “resource” the URL refers to. “Resource” is a rather vague term, intentionally. It’s really just a metaphor – in reality there aren’t any “resources” in HTTP; there are only requests and responses.
In Paste, everything is about WSGI. But that can seem too fancy. There are four core things involved: the request (personified in the WSGI environment), the response (personified in the start_response callback and the return iterator), the WSGI application, and the server that calls that application. The application and request are objects, while the server and response are really more like actions than concrete objects.
In this context, URL parsing is about mapping a URL to an application and a request. The request actually gets modified as it moves through different parts of the system. Two dictionary keys in particular relate to URLs – SCRIPT_NAME and PATH_INFO – but any part of the environment can be modified as it is passed through the system.
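As a rough illustration of how these two keys move together, the sketch below shifts one path segment from PATH_INFO onto SCRIPT_NAME. The name shift_segment is made up for this example; it is similar in spirit to the wsgilib.path_info_pop helper used later in this document, but it relies only on the standard library and ignores edge cases such as repeated slashes.

```python
def shift_segment(environ):
    """Move the next path segment from PATH_INFO onto SCRIPT_NAME.

    Returns the segment, '' for a bare trailing slash, or None when
    PATH_INFO is empty (nothing left to parse).
    """
    path_info = environ.get('PATH_INFO', '')
    if not path_info:
        return None
    # PATH_INFO starts with '/', so drop it and split off one segment.
    segment, sep, rest = path_info[1:].partition('/')
    environ['SCRIPT_NAME'] = environ.get('SCRIPT_NAME', '') + '/' + segment
    environ['PATH_INFO'] = sep + rest
    return segment


environ = {'SCRIPT_NAME': '', 'PATH_INFO': '/blog/edit/285'}
first = shift_segment(environ)
# environ now has SCRIPT_NAME '/blog' and PATH_INFO '/edit/285'
```

Concatenating SCRIPT_NAME and PATH_INFO gives the same path before and after the call; only the boundary between "already parsed" and "still to parse" moves.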
Dispatching
Note
WSGI isn’t object oriented? Well, if you look at it, you’ll notice there are no objects except built-in types, so it shouldn’t be a surprise. Additionally, the interface and promises of the objects we do see are very minimal. An application doesn’t have any interface except one method – __call__ – and that method does things; it doesn’t give any other information.
Because WSGI is action-oriented, rather than object-oriented, it’s more important what we do. “Finding” an application is probably an intermediate step, but “running” the application is our ultimate goal, and the only real judge of success. An application that isn’t run is useless to us, because it doesn’t have any other useful methods.
So what we’re really doing is dispatching – we’re handing the request and responsibility for the response off to another object (another actor, really). In the process we can actually retain some control – we can capture and transform the response, and we can modify the request – but that’s not what the typical URL resolver will do.
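To make the idea of "handing off" concrete, here is a minimal sketch of a dispatcher, assuming a simple prefix-to-application mapping. The names make_dispatcher, blog_app and not_found are invented for this illustration and are not part of Paste.

```python
def make_dispatcher(apps, default):
    """Return a WSGI app that hands each request to one of ``apps``.

    ``apps`` maps a URL prefix (like '/blog') to a WSGI application.
    """
    def dispatcher(environ, start_response):
        path = environ.get('PATH_INFO', '')
        for prefix, app in apps.items():
            if path == prefix or path.startswith(prefix + '/'):
                # Record what has been parsed, then hand everything off.
                environ['SCRIPT_NAME'] = environ.get('SCRIPT_NAME', '') + prefix
                environ['PATH_INFO'] = path[len(prefix):]
                return app(environ, start_response)
        return default(environ, start_response)
    return dispatcher


def blog_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('blog saw ' + environ['PATH_INFO']).encode('utf-8')]


def not_found(environ, start_response):
    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return [b'Not Found']


app = make_dispatcher({'/blog': blog_app}, not_found)
```

The dispatcher never calls start_response itself for a matched prefix; the response is entirely the responsibility of the application it delegates to.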
Motivations
The most obvious kind of URL parsing is finding a WSGI application.
Typically when a framework first supports WSGI or is integrated into Paste, it is “monolithic” with respect to URLs. That is, you define (in Paste, or maybe in Apache) a “root” URL, and everything under that goes into the framework. What the framework does internally, Paste does not know – it probably finds internal objects to dispatch to, but the framework is opaque to Paste. Not just to Paste, but to any code that isn’t in that framework.
That means that we can’t mix code from multiple frameworks, or as easily share services, or use WSGI middleware that doesn’t apply to the entire framework/application.
An example of someplace we might want to use an “application” that isn’t part of the framework would be uploading large files. It’s possible to keep track of upload progress, and report that back to the user, but frameworks typically aren’t capable of this. This is usually because the POST request is completely read and parsed before the framework invokes any application code.
This is resolvable in WSGI – a WSGI application can provide its own code to read and parse the POST request, and simultaneously report progress (usually in a way that another WSGI application/request can read and report to the user on that progress). This is an example where you want to allow “foreign” applications to be intermingled with framework application code.
Finding Applications
OK, enough theory. How does a URL parser work? Well, it is a WSGI application, and a WSGI server, in the typical “WSGI middleware” style. Except that it determines which application it will serve for each request.
Let’s consider Paste’s URLParser (in paste.urlparser). This class takes a directory name as its only required argument, and instances are WSGI applications.
When a request comes in, the parser looks at PATH_INFO to see what’s left to parse. SCRIPT_NAME represents where we are now; it’s the part of the URL that has been parsed.
There are a couple of special cases:
- The empty string: URLParser serves directories. When PATH_INFO is empty, that means we got a request with no trailing /, like say /blog. If URLParser serves the blog directory, then this won’t do – the user is requesting the blog page. We have to redirect them to /blog/.
- A single /: So, we got a trailing /. This means we need to serve the “index” page. In URLParser, this is some file named index, though that’s really an implementation detail. You could create an index dynamically (like Apache’s file listings), or whatever.
Otherwise we get a string like /path.... Note that PATH_INFO must start with a /, or it must be empty.
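The three cases above can be sketched as a small classification function. This is only an illustration using the standard library; classify_path_info and its return values are names invented here, not anything URLParser exposes.

```python
def classify_path_info(path_info):
    """Return ('redirect' | 'index' | 'segment', value) for a PATH_INFO."""
    if path_info == '':
        # No trailing slash: the caller should redirect to SCRIPT_NAME + '/'.
        return ('redirect', None)
    if not path_info.startswith('/'):
        raise ValueError('PATH_INFO must be empty or start with "/"')
    if path_info == '/':
        # Trailing slash only: serve the index page.
        return ('index', None)
    # Otherwise the first segment is what we look up next.
    return ('segment', path_info[1:].split('/', 1)[0])
```

For example, classify_path_info('/blog/edit/285') yields the 'segment' case with the value 'blog', which is the name URLParser would go on to look up.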
URLParser pulls off the first part of the path. E.g., if PATH_INFO is /blog/edit/285, then the first part is blog. It appends this to SCRIPT_NAME, and strips it off PATH_INFO (which becomes /edit/285).
It then searches for a file that matches “blog”. In URLParser, this means it looks for a filename which matches that name (ignoring the extension). It then uses the type of that file (determined by extension) to create a WSGI application.
One case is that the file is a directory. In that case, the application is another URLParser instance, this time with the new directory.
URLParser actually allows per-extension “plugins” – these are just functions that get a filename, and produce a WSGI application. One of these is make_py – this function imports the module, and looks for special symbols; if it finds a symbol application, it assumes this is a WSGI application that is ready to accept the request. If it finds a symbol that matches the name of the module (e.g., edit), then it assumes that is an application factory, meaning that when you call it with no arguments you get a WSGI application.
Another function takes “unknown” files (files for which no better constructor exists) and creates an application that simply responds with the contents of that file (and the appropriate Content-Type).
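A constructor for such “unknown” files could be sketched like this, assuming the Content-Type is guessed from the file extension with the standard mimetypes module. The name make_static_app is invented for this example, and a real implementation would also want to stream large files and handle caching headers.

```python
import mimetypes


def make_static_app(filename):
    """Return a WSGI application that serves one file as-is."""
    content_type, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    content_type = content_type or 'application/octet-stream'

    def static_app(environ, start_response):
        with open(filename, 'rb') as f:
            body = f.read()
        start_response('200 OK', [
            ('Content-Type', content_type),
            ('Content-Length', str(len(body))),
        ])
        return [body]
    return static_app
```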
In any case, URLParser delegates as soon as it can. It doesn’t parse the entire path – it just finds the next application, which in turn may delegate to yet another application.
Here’s a very simple implementation of URLParser:
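    import importlib.util
    import os
    from paste import wsgilib

    def import_module(filename):
        # Helper assumed by this sketch: load a module from a file path
        name = os.path.splitext(os.path.basename(filename))[0]
        spec = importlib.util.spec_from_file_location(name, filename)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    class URLParser(object):

        def __init__(self, dir):
            self.dir = dir

        def __call__(self, environ, start_response):
            segment = wsgilib.path_info_pop(environ)
            if segment is None:  # No trailing /
                # do a redirect to the same path plus a trailing /
                location = environ.get('SCRIPT_NAME', '') + '/'
                start_response('301 Moved Permanently',
                               [('Location', location)])
                return [b'']
            for filename in os.listdir(self.dir):
                if os.path.splitext(filename)[0] == segment:
                    return self.serve_application(
                        environ, start_response, filename)
            # do a 404 Not Found
            start_response('404 Not Found', [('Content-Type', 'text/plain')])
            return [b'Not Found']

        def serve_application(self, environ, start_response, filename):
            basename, ext = os.path.splitext(filename)
            filename = os.path.join(self.dir, filename)
            if os.path.isdir(filename):
                return URLParser(filename)(environ, start_response)
            elif ext == '.py':
                module = import_module(filename)
                if hasattr(module, 'application'):
                    return module.application(environ, start_response)
                elif hasattr(module, basename):
                    # an application factory: call it to get the application
                    app = getattr(module, basename)()
                    return app(environ, start_response)
            else:
                # send_file returns a WSGI application, so call it
                return wsgilib.send_file(filename)(environ, start_response)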
Modifying The Request
Well, URLParser is one kind of parser. But others are possible, and aren’t too hard to write.
Let’s imagine a URL like /2004/05/01/edit. It’s likely that /2004/05/01 doesn’t point to anything on file, but is really more of a “variable” that gets passed to edit. So we can pull them off and put them somewhere. This is a good place for a WSGI extension. Let’s put them in environ["app.date_parts"].
We’ll pass one other application in – once we get the date (if any) we need to pass the request on to an application that can actually handle it. This “application” might be a URLParser or similar system (that figures out what /edit means).
    from paste import wsgilib

    class GrabDate(object):

        def __init__(self, subapp):
            self.subapp = subapp

        def __call__(self, environ, start_response):
            date_parts = []
            while len(date_parts) < 3:
                # Peek at the next segment without consuming it yet
                first, rest = wsgilib.path_info_split(environ['PATH_INFO'])
                try:
                    date_parts.append(int(first))
                    # It was a number, so move it onto SCRIPT_NAME
                    wsgilib.path_info_pop(environ)
                except (ValueError, TypeError):
                    # Not a number (or nothing left): stop collecting
                    break
            environ['app.date_parts'] = date_parts
            return self.subapp(environ, start_response)
This is really like traditional “middleware”, in that it sits between the server and just one application.
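On the other side of the middleware, the wrapped application can read the parsed values back out of the environment. The sketch below shows a hypothetical sub-application, edit_app, that looks for the app.date_parts key set by GrabDate; the application and its output format are assumptions made purely for illustration.

```python
def edit_app(environ, start_response):
    """A hypothetical sub-application that uses the extracted date."""
    date_parts = environ.get('app.date_parts', [])
    if len(date_parts) == 3:
        label = '%04d-%02d-%02d' % tuple(date_parts)
    else:
        # GrabDate stops early when a segment is not a number.
        label = 'no complete date in URL'
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('edit for: ' + label).encode('utf-8')]
```

Because the key is just an entry in the WSGI environment, edit_app keeps working (with the fallback label) even when it is mounted without the middleware in front of it.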
Assuming you put this class in the myapp.grabdate module, you could install it by adding this to your configuration:
    middleware.append('myapp.grabdate.GrabDate')
Object Publishing
Besides looking in the filesystem, “object publishing” is another popular way to do URL parsing. This is pretty easy to implement as well – it usually just means using getattr with the popped segments. But we’ll implement a rough approximation of Quixote’s URL parsing:
    from paste import wsgilib

    class ObjectApp(object):

        def __init__(self, obj):
            self.obj = obj

        def __call__(self, environ, start_response):
            obj = self.obj
            next = wsgilib.path_info_pop(environ)
            if next is None:
                # This is the object, let's serve it...
                return self.publish(obj, environ, start_response)
            next = next or '_q_index'  # the default index method
            if next in obj._q_exports and getattr(obj, next, None):
                return ObjectApp(getattr(obj, next))(
                    environ, start_response)
            next_obj = obj._q_traverse(next)
            if not next_obj:
                # Do a 404
                start_response('404 Not Found',
                               [('Content-type', 'text/plain')])
                return [b'Not Found']
            return ObjectApp(next_obj)(environ, start_response)

        def publish(self, obj, environ, start_response):
            if callable(obj):
                output = str(obj())
            else:
                output = str(obj)
            start_response('200 OK', [('Content-type', 'text/html')])
            # WSGI bodies are bytes, so encode the rendered text
            return [output.encode('utf-8')]
The publish method is a little weak, and functions like _q_traverse aren’t passed interesting information about the request, but this is only a rough approximation of the framework.
Things to note:
- The object has standard attributes and methods – _q_exports (attributes that are public to the web) and _q_traverse (a way of overriding the traversal without having an attribute for each possible path segment).
- The object isn’t rendered until the path is completely consumed (when next is None). This means _q_traverse has to consume extra segments of the path. In this version _q_traverse is only given the next piece of the path; Quixote gives it the entire path (as a list of segments).
- publish is really a small and lame way to turn a Quixote object into a WSGI application. For any serious framework you’d want to do a better job than what I do here.
- It would be even better if you used something like Adaptation to convert objects into applications. This would include removing the explicit creation of new ObjectApp instances, which could also be a kind of fall-back adaptation.
Anyway, this example is less complete, but maybe it will get you thinking.