Putting it all together
It’s time to put everything we’ve learned so far to good use. I hope you were paying attention.
def translate(url, dialectName="chef"):
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()
|The translate function has an optional argument dialectName, which is a string that specifies the dialect we’ll be using. We’ll see how this is used in a minute.|
|Hey, wait a minute, there’s an import statement in this function! That’s perfectly legal in Python. You’re used to seeing import statements at the top of a program, which means that the imported module is available anywhere in the program. But you can also import modules within a function, which means that the imported module is only available within the function. If you have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you’ll appreciate this.)|
|Now we get the source of the given URL.|
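The scoping claim about function-level imports can be checked with a minimal sketch. It is written for Python 3, and the function name shout and the choice of the string module are illustrative, not part of the chapter's code. The import binds the module name as a local variable of the function, just like any other assignment inside it.

```python
def shout(text):
    # The import runs when the function is called and binds the
    # name `string` as a local variable of shout.
    import string
    # Snapshot the local names: the parameter plus the imported module.
    local_names = sorted(locals())
    return string.capwords(text), local_names


capitalized, names = shout("hello world")
print(capitalized)  # Hello World
print(names)        # ['string', 'text']
```

The module itself is still cached by the interpreter after the first call, so repeated calls do not reload it; only the name binding is local.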
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    parser = parserClass()
|capitalize is a string method we haven’t seen before; it simply capitalizes the first letter of a string and forces everything else to lowercase. Combined with some string formatting, we’ve taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If dialectName is the string 'chef', parserName will be the string 'ChefDialectizer'.|
|We have the name of a class as a string (parserName), and we have the global namespace as a dictionary (globals()). Combined, we can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer.|
|Finally, we have a class object (parserClass), and we want an instance of the class. Well, we already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; we just call the local variable like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer.|
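The three steps above (build the name, look up the class, call it) can be tried in isolation. This is a sketch for Python 3 under stated assumptions: the two classes are empty stubs standing in for real Dialectizer classes, and make_parser is a hypothetical helper that keeps only the lookup part of translate.

```python
class ChefDialectizer:
    """Hypothetical stub; a real class would transform HTML."""
    dialect = "chef"


class FuddDialectizer:
    """Hypothetical stub for a second dialect."""
    dialect = "fudd"


def make_parser(dialectName="chef"):
    # Step 1: 'chef' -> 'ChefDialectizer' (capitalize also lowercases the rest).
    parserName = "%sDialectizer" % dialectName.capitalize()
    # Step 2: look the class object up by name in the module's global namespace.
    parserClass = globals()[parserName]
    # Step 3: call the class, through the local variable, to get an instance.
    return parserClass()


parser = make_parser("chef")
print(type(parser).__name__)               # ChefDialectizer
print(type(make_parser("FUDD")).__name__)  # FuddDialectizer
```

A dialect name with no matching class makes the dictionary lookup raise KeyError, so a caller that accepts arbitrary names would probably want to catch that.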
Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there’s no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes we’ve defined. Imagine if we defined a new FooDialectizer tomorrow; translate would work without any changes, simply by passing 'foo' as the dialectName.
Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. We’ve already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file.
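The difference between the two import styles is easy to check. In this Python 3 sketch, the standard library's fractions and decimal modules stand in for a separate Dialectizer module; they are only placeholders chosen because they are always available.

```python
import decimal                  # binds only the module name `decimal`
from fractions import Fraction  # binds the class name `Fraction` itself

names = globals()
found_fraction = names.get("Fraction")  # the class object
found_decimal = names.get("Decimal")    # None: only `decimal` was bound

print(found_fraction is Fraction)  # True
print(found_decimal)               # None
```

So a lookup by class name in globals() only finds classes brought in with the from module import form; a plain import leaves the class reachable only as an attribute of the module.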
Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page.
Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven’t seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away we go.
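As a rough illustration of the plug-in idea, here is one way it might look in Python 3 using the standard importlib module. This is a sketch, not necessarily the technique the later chapter uses: load_dialectizer is a hypothetical helper, the naming scheme is the one assumed in the paragraph above, and the demo writes a throwaway foodialect.py into a temporary directory so the example can run on its own.

```python
import importlib
import os
import sys
import tempfile


def load_dialectizer(dialectName):
    # Assumed naming scheme: dialect 'foo' lives in module 'foodialect'
    # and defines class 'FooDialectizer'.
    moduleName = "%sdialect" % dialectName.lower()
    className = "%sDialectizer" % dialectName.capitalize()
    module = importlib.import_module(moduleName)
    return getattr(module, className)


# Demo setup: create a throwaway plug-ins directory with one plug-in file.
pluginDir = tempfile.mkdtemp()
with open(os.path.join(pluginDir, "foodialect.py"), "w") as pluginFile:
    pluginFile.write("class FooDialectizer:\n    dialect = 'foo'\n")

sys.path.insert(0, pluginDir)   # make the directory importable
importlib.invalidate_caches()   # the file was created after startup

parserClass = load_dialectizer("foo")
parser = parserClass()
print(type(parser).__name__)  # FooDialectizer
```

With this arrangement, the lookup no longer depends on globals(): the class is fetched as an attribute of the freshly imported module.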
    parser.feed(htmlSource)
    parser.close()
    return parser.output()
|After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation. We had the entire HTML source in a single string, so we only had to call feed once. However, you can call feed as often as you want, and the parser will just keep parsing. So if we were worried about memory usage (or we knew we were going to be dealing with very large HTML pages), we could set this up in a loop, where we read a few bytes of HTML and fed it to the parser. The result would be the same.|
|Because feed maintains an internal buffer, you should always call the parser’s close method when you’re done (even if you fed it all at once, like we did). Otherwise you may find that your output is missing the last few bytes.|
|Remember, output is the function we defined on BaseHTMLProcessor that joins all the pieces of output we’ve buffered and returns them in a single string.|
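The claim that feeding in pieces gives the same result as feeding all at once can be tested with a small stand-in parser. This sketch uses Python 3's html.parser.HTMLParser rather than the chapter's own classes, which are not reproduced here; PieceCollector and its pieces list are hypothetical names that only mimic the buffer-then-join behavior described for output.

```python
from html.parser import HTMLParser


class PieceCollector(HTMLParser):
    """Hypothetical stand-in for BaseHTMLProcessor: buffers pieces of output."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.pieces.append("<%s>" % tag)  # attributes ignored for brevity

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def output(self):
        # Join everything buffered so far into a single string.
        return "".join(self.pieces)


htmlSource = "<p>Hello, <b>world</b>!</p>"

# Feed everything at once.
whole = PieceCollector()
whole.feed(htmlSource)
whole.close()

# Feed five characters at a time, splitting tags across chunks.
chunked = PieceCollector()
for start in range(0, len(htmlSource), 5):
    chunked.feed(htmlSource[start:start + 5])
chunked.close()

print(whole.output() == chunked.output())  # True
```

The chunk boundaries here fall in the middle of tags such as </b>, which is exactly the case the parser's internal buffer exists to handle.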
And just like that, we’ve “translated” a web page, given nothing but a URL and the name of a dialect.