Art of Technology: How to Convert Multiple Webpages Into PDFs With Wget

There are many online tools, browser extensions, and desktop plugins to turn a webpage into a PDF. If you regularly use these tools, you might come across a situation where you need to convert multiple links in one go. Doing them one by one is tedious and wastes time.

You can automate this task with a simple command-line utility called Wget. With the help of this tool and a few scripts or apps, we’ll show you how to save multiple webpages as PDF files.

Why Wget?

Wget is a free software package for downloading files from the web. It’s also well suited to mirroring entire webpages to your computer, for these reasons:

It is non-interactive, meaning that it can work in the background and let you transfer data without requiring your presence.
Wget understands the directory structure of the website. It can follow links in an HTML page to create a local version of a remote website. In the process, it also respects the robots.txt file.
It works even over slow or unstable network connections. If the server supports regetting, it’ll keep retrying until the whole file has transferred to your computer.
Wget supports proxy servers, which can lighten the network load, speed up the retrieval process, and provide access from behind firewalls.

Installing Wget

On macOS

The quickest way to install Wget is through Homebrew, a package manager for macOS that installs useful Unix utilities and apps. Check out this article on how to install macOS apps through Homebrew. Then, type in

brew install wget

You’ll see real-time progress as Homebrew installs Wget and any tools it depends on. If you already have Homebrew installed, be sure to run brew upgrade to get the latest version of this utility.
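Before moving on, it can help to confirm that the shell can actually find Wget. The sketch below is one way to check from the macOS Terminal; require_tool is a hypothetical helper name used for illustration and is not part of Wget or Homebrew.

```shell
# require_tool: report whether a command is available on the PATH.
# (require_tool is a hypothetical helper name used for illustration.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 is not installed or not on the PATH" >&2
        return 1
    fi
}

# Usage: require_tool wget && wget --version
```

If the check fails right after installation, opening a new Terminal window is usually enough for the shell to pick up the new command.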

On Windows 10

There are many versions of Wget available for Windows 10. Head to Eternally Bored to download the latest 64-bit build. Put the executable file in a folder named wget and copy the folder to the C: drive.

Now we’ll add the Wget folder to your system’s Path environment variable so you can run the tool from any directory. Navigate to Control Panel > System and click Advanced System Settings. In the window that opens, click Environment Variables.

Select Path under System variables and click Edit. Then, click the New button in the upper-right corner of the window. Type in C:\wget and click OK.

Open Command Prompt and type wget -h to test if everything works. In PowerShell, type wget.exe -h to load the Wget help menu.

Save Your Links in a Text File

Since we’re dealing with multiple links, pasting them one by one is tedious. Thankfully, there are browser extensions to help you complete this task.

Link Klipper: It extracts all the links on a webpage as a CSV or TXT file. You can even drag a rectangular area on the webpage to selectively copy links. The file gets saved to the Downloads folder.

Snap Links Plus: This lets you lasso elements on a page and do things with them. Hold down the right mouse button and drag a selection rectangle around links. Press the Control key and copy your links to the clipboard.
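Links copied from a browser often include blank lines, duplicates, or stray text. As a rough sketch, the function below tidies a links file before you hand it to Wget. The name clean_links and the file names in the usage line are assumptions for illustration. It is written for the macOS Terminal or another Unix-like shell.

```shell
# clean_links: keep only http(s) URLs and drop duplicates, preserving order.
# $1 = raw links file, $2 = cleaned output file
# (clean_links and both file names are hypothetical, for illustration only.)
clean_links() {
    tr -d '\r' < "$1" |          # strip Windows-style carriage returns
        grep -E '^https?://' |   # keep lines that start with http:// or https://
        awk '!seen[$0]++' > "$2" # drop repeated lines, keep first occurrence
}

# Usage: clean_links ~/Downloads/raw-links.txt ~/Downloads/links.txt
```

The result has one URL per line, which is the layout Wget expects when it reads links from a file.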

Setting Up a Directory

Wget works like a web crawler by extracting web page assets from the HTML files, including logo, fonts, image thumbnails, CSS, and JavaScript files. It also tries to create a directory structure resembling the remote server. Create a separate directory for Wget downloads to save webpages and also to prevent clutter.

On your Mac Terminal or in a Command Prompt on Windows, enter

mkdir Wgetdown

This creates a new folder in the current directory, which is your Home directory by default. You can name it anything you like. Next, enter

cd Wgetdown

This changes the present working directory to Wgetdown.

Details of the Wget Commands

After creating the directory, we’ll use the actual Wget command

wget -E -H -k -K -p -i [Path to Your Text File]

Wget uses GNU getopt to process command-line arguments. Every option has a long form along with the short ones. Long options are convenient to remember but take time to type. You may also mix different option styles. Let’s go into the details of these options:

-E (--adjust-extension): If a file of type “application/xhtml+xml” or “text/html” gets downloaded and the URL does not end with .html, this option appends .html to the local filename.
-H (--span-hosts): Wget normally refuses to visit hosts other than the ones named in your links, so that it doesn’t retrieve loads of unnecessary data. This option turns on host spanning, which allows Wget to visit any host referenced by a link. That is useful when, for example, a page’s images are served from a different host.
-p (--page-requisites): This option downloads all the files that are necessary for displaying an HTML page. It includes inline images, audio, and referenced stylesheets.
-k (--convert-links): After the download completes, this option converts the links in the document to make them suitable for offline viewing. It affects embedded images, links to style sheets, hyperlinks to non-HTML content, and more.
-K (--backup-converted): When converting a file, this option backs up the original version with a .orig suffix.
-i (--input-file): Read URLs from a local or external file.
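The same command can be written with the long option names listed above, which makes a saved script easier to read later. The sketch below wraps it in a function that first checks that the links file exists and is not empty. The function name fetch_pages is an assumption for illustration. The Wget options are the ones described in this section.

```shell
# fetch_pages: download every page listed in a links file, with its assets.
# $1 = path to the text file of links (one URL per line)
# (fetch_pages is a hypothetical wrapper name, not a Wget feature.)
fetch_pages() {
    links_file=$1
    if [ ! -s "$links_file" ]; then
        echo "Links file is missing or empty: $links_file" >&2
        return 1
    fi
    # Long-form equivalent of: wget -E -H -k -K -p -i "$links_file"
    wget --adjust-extension --span-hosts --convert-links \
         --backup-converted --page-requisites --input-file="$links_file"
}

# Usage (run from inside the download folder): fetch_pages ~/Downloads/links.txt
```

Checking the file up front avoids starting a run that has nothing to download.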

Putting the Commands in Use

To demonstrate these commands in practice, consider a website manual called Writing Workflows. This manual consists of a table of contents with links to individual chapters. The goal is to create a separate PDF file for each of those chapters.

Step 1: Open Terminal and create a new folder, as discussed above.

Step 2: Use the Link Klipper extension to save your links as a text file. Save your file to the Downloads folder.

Step 3: While you’re in the Wgetdown folder, type in

wget -E -H -k -K -p -i /Users/rahulsaigal/Downloads/links.txt

Step 4: Press Enter. Wait for the process to complete.

Step 5: Navigate to the Wgetdown folder. You’ll see the primary domain processedword.net folder with all web page assets and chapter1.html.
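Because Wget recreates the site's folder structure, the HTML pages can end up several levels deep. A minimal sketch for listing them from the Terminal follows. The function name list_downloaded_pages is an assumption for illustration. The .orig backups created by -K are left out because their names no longer end in .html.

```shell
# list_downloaded_pages: print every downloaded .html file under a folder.
# $1 = folder to search (defaults to the current directory)
# (list_downloaded_pages is a hypothetical helper name.)
list_downloaded_pages() {
    find "${1:-.}" -type f -name '*.html' | sort
}

# Usage (from inside Wgetdown): list_downloaded_pages .
```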

Converting HTML to PDF

Converting a webpage to a PDF looks simple. But making the PDF look like the original webpage is a difficult task. Producing a desirable result depends on:

The system fonts available.
The integrity of the HTML markup.
Support for features such as tables or floating images.
Optionally, a table of contents.

Windows 10

PrinceXML is an app to quickly convert HTML files into PDF. It enables you to typeset, format, and print HTML content with configurable layout and web standards support. It comes with many useful fonts and also lets you customize the PDF output. The app is free for non-commercial use.

macOS

On your Mac, you can create an Automator service to convert a batch of HTML files into PDFs. Open Automator and create a Quick Action document. Set the workflow to receive files or folders in Finder. Next, drag in the Run Shell Script action and set the Pass input option to “as arguments”. Then, paste this script in the body

# Convert each selected file to a PDF saved next to the original.
for theFileToProcess in "$@"
do
    cupsfilter "$theFileToProcess" > "${theFileToProcess%.*}.pdf"
done

Save the file as HTML2PDF.

Now select all the HTML files in Finder. Right-click and choose Services > HTML2PDF. Wait for a few moments to convert all your files.
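The same loop can also be run directly in the Terminal, without Automator. The version below is a sketch under a few assumptions: html_to_pdf is a hypothetical function name, and CONVERTER is a hypothetical variable that lets you swap in another tool that writes a PDF to standard output. It defaults to cupsfilter, as in the script above. Files that do not end in .html or .htm are skipped.

```shell
# html_to_pdf: convert each HTML file given as an argument to a PDF beside it.
# CONVERTER is a hypothetical override; it defaults to cupsfilter (macOS).
html_to_pdf() {
    converter=${CONVERTER:-cupsfilter}
    for html_file in "$@"; do
        case $html_file in
            *.html|*.htm) ;;  # looks like HTML, carry on
            *) echo "Skipping non-HTML file: $html_file" >&2; continue ;;
        esac
        pdf_file="${html_file%.*}.pdf"
        # The converter's own messages on stderr are discarded here.
        if "$converter" "$html_file" > "$pdf_file" 2>/dev/null; then
            echo "Created $pdf_file"
        else
            echo "Failed to convert $html_file" >&2
            rm -f "$pdf_file"  # do not leave an empty or partial PDF behind
        fi
    done
}

# Usage (from inside Wgetdown): html_to_pdf processedword.net/*.html
```

Removing the output file on failure keeps the folder free of zero-byte PDFs that would otherwise look like successful conversions.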

Ways to Convert a Single Webpage Into PDF

At first sight, the steps involved in converting multiple webpages into PDFs look complicated. But once you understand the process, it’ll save time in the long run. You don’t have to spend money on any web subscriptions or expensive PDF converters.

If you’re looking for ways to turn a single webpage into PDF, read this piece on how to convert webpages into PDF.

from MakeUseOf https://ift.tt/34Ee097
via IFTTT

Friday, 13 September 2019