sandra_henrystocker
Unix Dweeb

Using curl and wget commands to download pages from web sites

How-To
Sep 20, 2023
5 mins
Linux

The curl and wget commands make it easy to download content from web sites.

One of the most versatile tools for collecting data from a server is curl. The “url” portion of the name aptly suggests that the command is built to locate data through the URL (uniform resource locator) that you provide. And it doesn’t just communicate with web servers. It supports a wide variety of protocols, including HTTP, HTTPS, FTP, FTPS, SCP, SFTP and more. The wget command, though similar in some ways to curl, supports a narrower set: HTTP, HTTPS and FTP.

Using the curl command

You might use the curl command to:

  • Download files from the internet
  • Run tests to ensure that the remote server is doing what is expected
  • Do some debugging on various problems
  • Log errors for later analysis
  • Back up important files from the server
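
The second item above, checking that a remote server is doing what is expected, can be sketched with curl's -s, -o and -w options. The function names http_status and expect_status are hypothetical, and the 10-second timeout is an arbitrary assumption rather than anything curl requires.

```shell
# Hypothetical helper: print the HTTP status code that a URL returns.
http_status() {
    # -s hides the progress meter, -o /dev/null discards the page body,
    # -w prints the status code once the transfer has finished.
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Hypothetical helper: succeed only when the status matches the expected one.
expect_status() {
    local url=$1 expected=$2 actual
    actual=$(http_status "$url")
    if [ "$actual" = "$expected" ]; then
        echo "OK   $url returned $actual"
    else
        echo "FAIL $url returned $actual (expected $expected)"
        return 1
    fi
}

# Example (needs network access):
#   expect_status https://example.com/ 200
```

Because expect_status returns a non-zero exit code on a mismatch, it can be dropped into a cron job or a larger script that logs failures for later analysis.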

Probably the most obvious thing to do with the curl command is to download a page from a web site for review on the command line. To do this, just enter “curl” followed by the URL of the web site like this (the content below is truncated):

$ curl https://www.networkworld.com/category/linux/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  124k    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--     0



You’ll see some timing data plus the content. To save the content to a file, redirect the output to a file using a command like this:

$ curl https://www.networkworld.com/category/linux/ > linux.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  124k  100  124k    0     0  23339      0  0:00:05  0:00:05 --:--:-- 30035

The downloaded file can then be viewed on your system using cat or more to see the HTML source, or opened in a browser to view the rendered page.
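
As an alternative to shell redirection, curl's -o option names the output file directly. The sketch below wraps it in a hypothetical function, fetch_to_file, and adds a few options that are often handy in scripts. Treat the particular combination as one reasonable choice, not the only one.

```shell
# Hypothetical wrapper: save a URL to a named file without shell redirection.
fetch_to_file() {
    local url=$1 outfile=$2
    # -f   fail on HTTP errors instead of saving the server's error page
    # -sS  hide the progress meter but still print error messages
    # -L   follow redirects to the final location
    # -o   write the response body to the named file
    curl -fsSL -o "$outfile" "$url"
}

# Example (needs network access):
#   fetch_to_file https://www.networkworld.com/category/linux/ linux.html
```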

The command below grabs a single HTML file.

$ curl https://www.networkworld.com/video/series/8559/2-minute-linux-tips > linux_tips.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 79873  100 79873    0     0  56780      0  0:00:01  0:00:01 --:--:-- 56808

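Any run of identical adjacent lines, blank lines included, can be reduced to a single line with uniq. Write the result to a new file; redirecting the output back onto the file being read would empty it before uniq could read it.

$ uniq linux_tips.html > linux_tips_trimmed.html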
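
If the goal is to clean up the file in place, one approach is to go through a temporary file, since the shell truncates a redirection target before the command reads anything. The sketch below uses cat -s, which squeezes only runs of empty lines, where uniq would also collapse repeated non-blank lines. The function name squeeze_blank_lines is hypothetical.

```shell
# Hypothetical helper: squeeze runs of blank lines in a file, in place.
squeeze_blank_lines() {
    local file=$1 tmp status=0
    tmp=$(mktemp) || return 1
    # cat -s squeezes repeated empty lines into one. Lines that contain
    # only spaces or tabs are not empty and are left untouched.
    # Copying back with cat (not mv) keeps the file's original permissions.
    cat -s "$file" > "$tmp" && cat "$tmp" > "$file" || status=1
    rm -f "$tmp"
    return "$status"
}

# Example:
#   squeeze_blank_lines linux_tips.html
```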

More information on using curl is available in this previous post of mine: The Joy of curl

You can also get some quick help on options for using curl with the curl --help command:

$ curl --help
Usage: curl [options...] <url>
 -d, --data <data>          HTTP POST data
 -f, --fail                 Fail fast with no output on HTTP errors
 -h, --help <category>      Get help for commands
 -i, --include              Include protocol response headers in the output
 -o, --output <file>        Write to file instead of stdout
 -O, --remote-name          Write output to a file named as the remote file
 -s, --silent               Silent mode
 -T, --upload-file <file>   Transfer local FILE to destination
 -u, --user <user:password> Server user and password
 -A, --user-agent <name>    Send User-Agent <name> to server
 -v, --verbose              Make the operation more talkative
 -V, --version              Show version number and quit

This is not the full help, this menu is stripped into categories.
Use "--help category" to get an overview of all categories.
For all options use the manual or "--help all".

Using wget

The wget command makes it easy to download a web site recursively. While the site used in the command below is a single-page web site, it provides a quick example of how this command works.

$ wget -r http://example.com/
--2023-09-19 13:07:12--  http://example.com/
Resolving example.com (example.com)... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to example.com (example.com)|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1256 (1.2K) [text/html]
Saving to: ‘example.com/index.html’

example.com/index.html        100%[=================================================>]   1.23K  --.-KB/s    in 0s

2023-09-19 13:07:12 (56.1 MB/s) - ‘example.com/index.html’ saved [1256/1256]

FINISHED --2023-09-19 13:07:12--
Total wall clock time: 0.1s
Downloaded: 1 files, 1.2K in 0s (56.1 MB/s)

The download creates a directory named after the host (example.com) that holds the retrieved content, in this case a single file.

$ ls example.com
index.html
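$ head example.com/index.html
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">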

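If you were to run the command below (no recursion) multiple times in the same directory, wget would not overwrite the existing file. Instead, it saves each new copy with a numeric suffix, so generations of the file build up.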

$ wget http://example.com/
$ ls -l index.html*
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.1
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.2
-rw-r--r--. 1 shs shs 1256 Oct 17  2019 index.html.3
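
When the numbered copies are not wanted, wget's -N (timestamping) and -nc (no-clobber) options are two ways to avoid them. The sketch below puts both behind a hypothetical function, fetch_once, with mode names that are invented purely for illustration.

```shell
# Hypothetical helper: download a URL without piling up index.html.1, .2, ...
fetch_once() {
    local url=$1 mode=${2:-timestamp}
    case $mode in
        # -N: re-download only when the server's copy is newer than the
        # local one, replacing the local file instead of numbering it.
        timestamp) wget -N "$url" ;;
        # -nc: leave an existing local file alone and skip the download.
        keep)      wget -nc "$url" ;;
        *)         echo "unknown mode: $mode" >&2; return 2 ;;
    esac
}

# Examples (need network access):
#   fetch_once http://example.com/
#   fetch_once http://example.com/ keep
```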

The no-parent option

The --no-parent option ensures that wget never ascends to the parent directory when retrieving content recursively, so only files at or below the starting directory are downloaded.

$ wget --no-parent -r https://uushenandoah.org/how-to-become-a-member/
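
Recursive downloads can grow quickly, since wget follows links up to five levels deep by default. A minimal sketch that combines --no-parent with a depth limit, a pause between requests and a destination directory follows. The function name mirror_section and its default values (depth 2, a one-second wait, a directory called mirror) are assumptions chosen for illustration.

```shell
# Hypothetical helper: fetch one section of a site to a bounded depth.
mirror_section() {
    local url=$1 depth=${2:-2} dest=${3:-mirror}
    # -r           recurse through links
    # --no-parent  never climb above the starting directory
    # -l DEPTH     stop after DEPTH levels of links
    # -w 1         wait one second between requests
    # -P DIR       save everything under DIR
    wget -r --no-parent -l "$depth" -w 1 -P "$dest" "$url"
}

# Example (needs network access):
#   mirror_section https://uushenandoah.org/how-to-become-a-member/ 1 member_pages
```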

Wrap-up

Both curl and wget are extremely useful commands for downloading and troubleshooting web content. Check out the man pages for information on the many options available.

Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
