Wednesday, September 12, 2007

Downloading in Python?

Currently, I'm using a rather ugly way of downloading files in my Python application: executing a "wget" process and extracting progress and speed information from wget's stderr output. Of course, this is fragile and depends on many factors (did you know RedHat/Fedora's wget has a different output format?).

Ideally, I'd like to replace the current code with something completely Python-ish (using urllib2?). The ideal solution would:
  • Be completely written in Python
  • Download URLs to specified local file names
  • Be threaded (or runnable in threads)
  • Report current speed (e.g. 40kb/s) and progress (40% done)
  • Allow cancelling in-progress downloads
  • Optionally provide the "estimated time left"
  • Handle URL redirections, HTTP authentication, etc.
Does anyone know of a (lightweight) library that does this, or does anybody have some code to share so that I don't have to start writing this thing from scratch? If not, some hints on how best to do these things would be very nice.
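Roughly, the kind of interface I have in mind would look something like this (just a sketch with placeholder names, nothing that exists yet):

import threading

class Download(threading.Thread):
    """Hypothetical downloader interface (placeholder names only)."""

    def __init__(self, url, filename):
        threading.Thread.__init__(self)
        self.url = url
        self.filename = filename
        self.progress = 0.0      # fraction done, 0.0 .. 1.0
        self.speed = 0.0         # current speed in bytes per second
        self.cancelled = False

    def run(self):
        # fetch self.url into self.filename, updating self.progress
        # and self.speed as data arrives, and stopping as soon as
        # self.cancelled becomes True
        pass

    def cancel(self):
        self.cancelled = True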

8 comments:

Christian said...

urllib2 should get you pretty far, if not meet all those requirements. There is also a third-party lib out there called httplib2 that has some pretty advanced features. Good luck!

Unknown said...

Fedora/Red Hat has urlgrabber, which they currently use for yum. I haven't looked in-depth at its capabilities, but I'm fairly sure it can be extended easily enough.

Kumar McMillan said...

In theory, twisted.web.resource is designed exactly for something like this: asynchronous web resource management (i.e. file downloads with progress bars, etc.). However, good luck finding documentation. The only thing I can find with Google is this majorly outdated email with a useless response. If you don't mind digging into code without docs, it might be worth giving Twisted a try.

Unknown said...

I have had great success with Twisted for this, both for uploading and downloading.
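For the simple case, something like twisted.web.client.downloadPage should fetch a URL straight into a file. A minimal sketch (the URL and filename are just examples; progress reporting would need a custom client on top of this):

from twisted.internet import reactor
from twisted.web.client import downloadPage

def done(result):
    print "download finished"
    reactor.stop()

def failed(failure):
    print "download failed:", failure.getErrorMessage()
    reactor.stop()

# downloadPage returns a Deferred that fires once the file is written
d = downloadPage('http://example.com/file.ogg', 'file.ogg')
d.addCallbacks(done, failed)
reactor.run()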

garylinux said...

try PycURL
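pycurl exposes libcurl's progress callback, which would cover the progress and speed requirements. A rough sketch (URL and filename are just examples):

import pycurl

def progress(download_total, downloaded, upload_total, uploaded):
    if download_total:
        print "%d%% done" % (downloaded * 100 / download_total)

out = open('file.ogg', 'wb')
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/file.ogg')
c.setopt(pycurl.WRITEFUNCTION, out.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)      # follow HTTP redirects
c.setopt(pycurl.NOPROGRESS, 0)          # enable the progress callback
c.setopt(pycurl.PROGRESSFUNCTION, progress)
c.perform()
c.close()
out.close()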

Joseph said...

urllib.urlretrieve should do the trick nicely. See the "reporthook" argument for a way to update status information. It does not have a way to explicitly cancel the download, but you could always throw an exception from your "reporthook" function and that should kill the download.

If that doesn't work, here is some code which does the same thing but is more flexible. (It's adapted, untested, from an existing project): Here
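A minimal sketch of that reporthook approach (the exception class and the cancelled flag are just illustrative):

import urllib

cancelled = False                       # set to True from elsewhere to abort

class DownloadCancelled(Exception):
    pass

def reporthook(block_count, block_size, total_size):
    downloaded = block_count * block_size
    if total_size > 0:
        print "%d%% done" % (downloaded * 100 / total_size)
    if cancelled:
        # raising here makes urlretrieve bail out
        raise DownloadCancelled()

urllib.urlretrieve('http://example.com/file.ogg', 'file.ogg', reporthook)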

Jordan G said...

urllib2 will do everything you need. A poorly written, quick example:

from urllib2 import urlopen

f = urlopen('http://www.google.com')
wf = open('writefile', 'wb')
size = int(f.info()['content-length'])
count = 0
# read in small chunks so we can report progress as we go
while count < size:
    data = f.read(500)
    if not data:
        break
    count += len(data)
    wf.write(data)
    print "%d%% complete" % round((count / float(size)) * 100)
wf.close()

To cancel, just break the loop.
It's not that difficult to compute the speed or time remaining.
You could also optimize it by having the read and write in separate threads, with the read feeding a write queue.

I dunno about HTTP authentication, since I've never really played with it. Hope this gives ya some ideas.
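As a sketch of the speed/ETA part (just an illustration, separate from the example above):

import time

def speed_and_eta(start_time, downloaded, total_size):
    # returns (bytes per second, estimated seconds remaining)
    elapsed = time.time() - start_time
    if elapsed <= 0 or downloaded <= 0:
        return 0.0, None
    speed = downloaded / float(elapsed)
    eta = (total_size - downloaded) / speed
    return speed, eta

# example: 2 MB of a 10 MB file fetched, download started 10 seconds ago
speed, eta = speed_and_eta(time.time() - 10, 2 * 1024 * 1024, 10 * 1024 * 1024)
print "%.1f kB/s, about %d seconds left" % (speed / 1024.0, eta)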

thp said...

Thanks for all your helpful comments and suggestions.

Although Fedora's urlgrabber library seems neat, I couldn't find a way to easily cancel a download.

So I first had a look at how gPodder's Maemo port did the wget replacement, and found they used urllib's urlretrieve, just like game_ender suggested.

I went ahead and subclassed some of urllib's classes to add support for HTTP authentication and HTTP proxies (none, specified or from the environment), plus calculation of the current speed; even bandwidth limiting works, kind of. The code will be available in the upcoming gPodder release and can also be found here: gpodder.download module. All the code needs is urllib from Python's standard library.
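To give a rough idea of what such a subclass looks like (this is just a sketch along those lines, not the actual gpodder.download code; the class and credentials are made up):

import urllib

class AuthURLOpener(urllib.FancyURLopener):
    """Opener that answers HTTP auth challenges with fixed credentials."""

    def __init__(self, username, password, proxies=None):
        # proxies=None makes urllib pick up proxy settings from the environment
        urllib.FancyURLopener.__init__(self, proxies)
        self.username = username
        self.password = password

    def prompt_user_passwd(self, host, realm):
        # called by FancyURLopener when the server asks for authentication
        return self.username, self.password

def reporthook(block_count, block_size, total_size):
    print "downloaded %d of %d bytes" % (block_count * block_size, total_size)

opener = AuthURLOpener('user', 'secret')
opener.retrieve('http://example.com/file.ogg', 'file.ogg', reporthook)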

Thanks for your help and suggestions :)