techAdmin
Status: Site Admin | Joined: 26 Sep 2003 | Posts: 1040 | Location: East Coast, West Coast? I know it's one of them.
There's a nice generalized wget howto.
For our purposes, we won't need all this information, but I'm going to quote the main part because... well, because I'm tired of people taking their sites down and links dying:

:: Quote ::

    wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -i ~/mp3blogs.txt
And here's what this all means:

-r -H -l1 -np
These options tell wget to download recursively. That means it goes to a URL, downloads the page there, then follows every link it finds. The -H tells the app to span domains, meaning it should follow links that point away from the blog. And the -l1 (a lowercase L with a numeral one) means to only go one level deep; that is, don't follow links on the linked site. In other words, these commands work together to ensure that you don't send wget off to download the entire Web -- or at least as much as will fit on your hard drive. Rather, it will take each link from your list of blogs, and download it. The -np switch stands for "no parent", which instructs wget to never follow a link up to a parent directory.

We don't, however, want all the links -- just those that point to audio files we haven't yet seen. Including -A.mp3 tells wget to only download files that end with the .mp3 extension. And -N turns on timestamping, which means wget won't download something with the same name unless it's newer.

To keep things clean, we'll add -nd, which makes the app save everything it finds in one directory, rather than mirroring the directory structure of linked sites.

And -erobots=off tells wget to ignore the standard robots.txt files. Normally, this would be a terrible idea, since we'd want to honor the wishes of the site owner. However, since we're only grabbing one file per site, we can safely skip these and keep our directory much cleaner. Also, along the lines of good net citizenship, we'll add the -w5 to wait 5 seconds between each request so as not to pound the poor blogs.

Finally, -i ~/mp3blogs.txt is a little shortcut. Typically, I'd just add a URL to the command line with wget and start the downloading. But since I wanted to visit multiple mp3 blogs, I listed their addresses in a text file (one per line) and told wget to use that as the input.
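(Not part of the quoted howto.) The quote describes adding -w5, but the command line it shows doesn't include it. Here is a minimal sketch of the full command with the wait added. The function name blog_wget_cmd, the temp-file path, and the example.com / example.org addresses are placeholders I made up for illustration. The script only prints the command; it doesn't contact any site until you run the printed line yourself.

```shell
#!/bin/sh
# Print the quoted wget command with the -w5 wait added.
# $1 is the list file (one blog address per line).
# -t1, which the quote doesn't explain, limits wget to one try per file.
blog_wget_cmd() {
    echo "wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -w5 -i $1"
}

# Hypothetical list file with placeholder addresses, one per line.
list="${TMPDIR:-/tmp}/mp3blogs.txt"
printf '%s\n' 'http://example.com/blog/' 'http://example.org/music/' > "$list"

# Dry run: show the command so it can be checked before running it.
blog_wget_cmd "$list"
```

Once the printed line looks right, paste it into a shell to start the real download.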
Let's take a look at the core parts, though:

    wget -r -l1 -A.mp3 <url>

This downloads every .mp3 file linked from the given <url>, going one level of links down from that URL. It's a really handy technique, and it works just as well for other file types, for example .htm or .html pages.

Here's a concrete example: say you want to download all .mp3 files going down two levels of links, but you don't want wget to recreate the directory structure, just get the files:

    wget -r -l2 -nd -Nc -A.mp3 <url>

-r makes it recursive
-l2 limits the recursion to 2 levels
-nd means no directories: every file is saved into the current directory
-Nc is read by wget as -N plus -c: -N (timestamping) skips files that are not newer than the copy you already have, and -c resumes partially downloaded files. Together they keep you from re-downloading what you already have
-A.mp3 accepts only files whose names end in .mp3
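The short switches are easy to mix up (-nd vs. -N vs. -nc, for instance), so here is the same two-level command spelled out with wget's long option names. This is only a sketch: the function name mp3_wget_cmd and the example.com address are made up for illustration, and the script just prints the command rather than running it.

```shell
#!/bin/sh
# Print a recursive .mp3 download command using long option names.
# $1 = starting URL, $2 = recursion depth (defaults to 1).
mp3_wget_cmd() {
    depth="${2:-1}"
    # --recursive       same as -r
    # --level=N         same as -lN
    # --no-directories  same as -nd
    # --timestamping    same as -N
    # --continue        same as -c
    # --accept=.mp3     same as -A.mp3
    echo "wget --recursive --level=$depth --no-directories --timestamping --continue --accept=.mp3 $1"
}

# Dry run with a placeholder URL, two levels deep.
mp3_wget_cmd 'http://example.com/music/' 2
```

Swapping .mp3 for another suffix in --accept gives the same behavior for other file types.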
Hosting: Pair Networks: 0.048
Forum Software © 2001–2009 phpBB
techForum Style © 2003–2009 techpatterns.com
info