
How To Use WGET to Download the Exact Same Web Page HTML as a Browser


In this article, we will explore how to use WGET to download the exact same web page HTML as a browser. This task can be a bit tricky due to the dynamic nature of modern web pages, but we will break it down step-by-step to make it as simple as possible.

Understanding WGET and Web Page Structure

WGET is a free utility for non-interactive download of files from the web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

Web pages today are often more complex than just a single HTML file. They can include CSS files for styling, JavaScript files for interactivity, and media files like images and videos. Furthermore, some web pages use JavaScript to dynamically generate content, which can make it challenging to download the exact same HTML as a browser.

The Limitations of WGET

WGET does not execute JavaScript, and it has no option to turn this on. It saves the HTML exactly as the server sends it, without any content that scripts would add later in a browser. If a web page uses JavaScript to generate a data table, for example, that table will be missing from the HTML that WGET downloads.
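A quick way to tell whether a particular page is affected is to search the HTML that WGET saved for a piece of text you can see in the browser. The sketch below is only an illustration: the helper name in_static_html, the URL, and the marker text are placeholders, not part of WGET.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given text appears in the saved HTML.
in_static_html() {
    # $1 = path to the saved HTML file
    # $2 = text to look for (treated as a fixed string, not a regex)
    grep -qF -- "$2" "$1"
}

# Example usage (placeholder URL and marker text):
#   wget -q -O page.html "https://example.com/"
#   if in_static_html page.html "Quarterly totals"; then
#       echo "present in the static HTML"
#   else
#       echo "missing, so probably generated by JavaScript"
#   fi
```

If the text is missing from the saved file but visible in the browser, the content is most likely generated by JavaScript and no WGET option will retrieve it.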

Using WGET with Page Requisites

One way to get a more complete download of a web page with WGET is to use the --page-requisites option. This option tells WGET to download all the files that are necessary to properly display a given HTML page. This includes inlined images, sounds, and referenced stylesheets.

Here’s an example command:

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

In this command:

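  • --mirror turns on options suitable for mirroring, such as infinite recursion and time-stamping. It makes WGET follow links and download the whole site, so leave it out if you only want a single page.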
  • -p or --page-requisites gets all images, etc. needed to display the HTML page.
  • --convert-links converts the links so that they are suitable for offline viewing.
  • -P ./LOCAL-DIR specifies the directory prefix where all files and directories are saved to.
  • WEBSITE-URL is the URL of the web page you want to download.
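If you only need one page and its requisites rather than the whole site, a variant without --mirror is usually enough. The following is a minimal sketch: the function name fetch_single_page and the default directory are illustrative choices, and --span-hosts is added on the assumption that some images or stylesheets are served from another host such as a CDN.

```shell
#!/bin/sh
# Hypothetical wrapper: download one page plus the files needed to display it.
fetch_single_page() {
    # $1 = URL of the page
    # $2 = directory to save into (defaults to ./LOCAL-DIR)
    wget --page-requisites \
         --convert-links \
         --adjust-extension \
         --span-hosts \
         --directory-prefix="${2:-./LOCAL-DIR}" \
         "$1"
}

# Example usage (placeholder URL):
#   fetch_single_page "https://example.com/" ./LOCAL-DIR
```

Here --adjust-extension saves HTML and CSS files with matching file extensions so that they open correctly offline. No recursion option is used, so WGET stops after the page and its requisites.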

However, even with the --page-requisites option, WGET still won’t execute JavaScript. If the web page uses JavaScript to generate content dynamically, this content may still be missing.

Mimicking a Browser with WGET

Some websites may have measures in place to detect automated requests and provide different responses. In such cases, it may be necessary to mimic a browser’s user agent string using the --user-agent option in WGET. This makes the request appear more like it’s coming from a browser.

Here’s an example command:

wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

In this command, the --user-agent option is followed by a user agent string that represents a specific browser. In this case, it’s Chrome 58 on Windows 10.
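A browser sends more than a user agent string, and some servers also vary their response based on headers such as Accept and Accept-Language. The sketch below adds two of them with WGET's --header option. The function name fetch_like_browser and the header values are assumptions chosen for illustration; to match a real browser, copy the values shown in its developer tools.

```shell
#!/bin/sh
# User agent string for Chrome 58 on Windows 10, as used in the command above.
UA='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

# Hypothetical wrapper: request a page with browser-like headers.
fetch_like_browser() {
    # $1 = URL of the page
    # $2 = file to write the HTML to
    wget --user-agent="$UA" \
         --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
         --header="Accept-Language: en-US,en;q=0.9" \
         --output-document="$2" \
         "$1"
}

# Example usage (placeholder URL):
#   fetch_like_browser "https://example.com/" page.html
```

This only changes what the server sees in the request. It does not make WGET run any scripts on the page.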

Using a Headless Browser for JavaScript Execution

If a web page uses JavaScript to generate content dynamically, you need a headless browser that executes JavaScript, such as PhantomJS or headless Chrome. These browsers run the page's scripts and can then output the complete HTML, including any dynamically generated content. Note that PhantomJS development was suspended in 2018, so headless Chrome is the better-maintained option for new setups.

Here’s an example using PhantomJS:

  1. Install PhantomJS on your system.
  2. Create a script file, let’s say save_page.js, with the following content:
var system = require('system');
var page = require('webpage').create();

page.open(system.args[1], function() {
 console.log(page.content);
 phantom.exit();
});
  3. Run the following command in your terminal:
phantomjs save_page.js http://example.com > page.html

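This saves the HTML of the page as it stands once loading has finished, including JavaScript-generated content such as the data table from the earlier example, into the page.html file. Content that scripts fetch after the page's load event may not be present yet, in which case the script needs a short delay before it reads page.content.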
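Headless Chrome can do the same job without a script file by using its --dump-dom flag, which prints the rendered DOM to standard output. The sketch below assumes the browser is installed under one of a few common binary names, which vary by distribution. The function name dump_rendered_html and the CHROME_BIN variable are illustrative, not part of Chrome.

```shell
#!/bin/sh
# Hypothetical wrapper: print the rendered HTML of a page using headless Chrome.
dump_rendered_html() {
    # $1 = URL of the page
    # CHROME_BIN may be set to name the browser binary explicitly.
    browser="${CHROME_BIN:-}"
    if [ -z "$browser" ]; then
        # Try common binary names; which one exists depends on the system.
        for candidate in google-chrome chromium chromium-browser; do
            if command -v "$candidate" >/dev/null 2>&1; then
                browser="$candidate"
                break
            fi
        done
    fi
    if [ -z "$browser" ]; then
        echo "no Chrome or Chromium binary found" >&2
        return 1
    fi
    "$browser" --headless --dump-dom "$1"
}

# Example usage (placeholder URL):
#   dump_rendered_html "http://example.com" > page.html
```

As with the PhantomJS script, the output reflects the DOM after the page's scripts have run, so it can differ from what WGET saves for the same URL.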

Conclusion

Downloading the exact same web page HTML as a browser can be a bit challenging due to the dynamic nature of modern web pages. However, with the right tools and options, it’s certainly possible. WGET with the --page-requisites and --user-agent options retrieves the HTML as the server sends it, along with the files needed to display it, which is enough for static pages. For pages that build their content with JavaScript, a headless browser like PhantomJS or headless Chrome is needed to capture the dynamically generated content as well.
