How-to: Update PDF Metadata Using pdftk

I often download scientific journal articles and read them on my Kobo eReader. Unfortunately, the PDF metadata for such articles often lacks a meaningful title. As a result, the searchable title shown on the eReader is something ugly, such as an uninterpretable string of digits. An easy way to fix this is the open-source tool pdftk. In this short article, I show how to use pdftk for this purpose.

A look at the man pages for pdftk via man pdftk tells us everything we need to know:

Figure (1): Manual for pdftk.

A closer look at the man page reveals that we need three steps: dump the PDF metadata to a text file, modify that file in a text editor, and then write a new PDF that combines the original with the edited metadata.

As an example, take Tsunami Propagation from a Finite Source (Carrier 2005). Once you’ve downloaded it, you can dump its metadata to a text file with:

pdftk ~/Downloads/cmes.2005.010.113-2.pdf dump_data_utf8 output ~/Downloads/cmes.2005.utf 

Opening ~/Downloads/cmes.2005.utf, you’ll see a number of fields, one of which looks like the following:

InfoBegin
InfoKey: Title
InfoValue: main.dvi
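
If you only want to check the current title without opening the dump in an editor, something like the following may help. This is a minimal sketch: `get_title` is a hypothetical helper name (not part of pdftk), and it assumes each record is laid out as consecutive InfoBegin / InfoKey / InfoValue lines, as in the excerpt above.

```shell
# Hypothetical helper: print the Title value from a pdftk metadata dump file.
# Assumes the InfoValue line directly follows the "InfoKey: Title" line.
get_title() {
    awk '$0 == "InfoKey: Title" {
        getline                      # advance to the InfoValue line
        sub(/^InfoValue: /, "")      # strip the field label
        print
        exit
    }' "$1"
}

# Usage, with the dump file created earlier:
#   get_title ~/Downloads/cmes.2005.utf
```

For the example article, this should print the same value that appears after InfoValue in the record above.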

If you change the InfoValue here in ~/Downloads/cmes.2005.utf to your desired title, e.g.,

InfoValue: Carrier 2005: Tsunami Propagation from a Finite Source

and then call

pdftk ~/Downloads/cmes.2005.010.113-2.pdf update_info_utf8 ~/Downloads/cmes.2005.utf output ~/Downloads/cmes_updated.pdf 

then the new file, ~/Downloads/cmes_updated.pdf, has the correct metadata and is ready for reading on an eReader!
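
The manual editing step can also be done without a text editor. The sketch below is one possible way to do it, not something pdftk provides: `set_title` is a hypothetical helper that rewrites the InfoValue line following "InfoKey: Title" and prints the modified dump to standard output. It assumes the dump already contains a Title record. If it does not, the input is passed through unchanged.

```shell
# Hypothetical helper: rewrite the Title value in a pdftk metadata dump.
# Usage: set_title DUMP_FILE NEW_TITLE > NEW_DUMP_FILE
set_title() {
    # Pass the title through the environment so that awk does not
    # interpret backslashes in it as escape sequences.
    NEW_TITLE="$2" awk '
        after_title_key && /^InfoValue: / {
            print "InfoValue: " ENVIRON["NEW_TITLE"]
            after_title_key = 0
            next
        }
        { after_title_key = ($0 == "InfoKey: Title"); print }
    ' "$1"
}

# Example with the files from this article (new.utf is an arbitrary name):
#   set_title ~/Downloads/cmes.2005.utf \
#       "Carrier 2005: Tsunami Propagation from a Finite Source" \
#       > ~/Downloads/new.utf
#   pdftk ~/Downloads/cmes.2005.010.113-2.pdf update_info_utf8 \
#       ~/Downloads/new.utf output ~/Downloads/cmes_updated.pdf
```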

Here is a permalink to a script that automates this update process for you. Here is a link to the same script on the main branch (in case I ever update it). The code for the script is also pasted below in case you don’t want to check out my GitHub :)

#!/usr/bin/bash
# Report the failing line and command on stderr
trap 'echo "Error at ${LINENO}: ${BASH_COMMAND}" >&2' ERR
usage=$(cat << EOF
usage: $0 [-h] [-i] PDF NEW_TITLE

Update the title metadata of a PDF to NEW_TITLE and write the updated PDF next
to the original as 'updated_<PDF basename>'. Pass -i to update the PDF in place.

positional arguments:
    PDF        Path to PDF to update.

    NEW_TITLE  The new metadata title for the PDF.

optional arguments:
    -h         Print a help message and exit.

    -i         Update the PDF in place, replacing the original file.
EOF
)
inplace=0
while getopts "hi" flag
do
    case "${flag}" in
        h)  echo "${usage}"
            exit 0
            ;;
        i) inplace=1;;
        *)  echo "ERROR: unrecognized flags"
            echo "try $0 -h"
            exit 1 
            ;;
    esac 
done

# shift past options to get positionals 
shift $(($OPTIND - 1))
n_positional_args=2
PDF=$1
NEW_TITLE=$2

# Script exits immediately on error and errors on uninitialized variables 
# NOTE: when using grep, grep exits 1 on failure to pattern match and therefore
# '|| true' is needed when saving grep output to use in test expressions
set -eu

# Check if all positional arguments were provided
if [ $# -lt $n_positional_args ]
then
    echo "ERROR: Missing positional arguments!"
    echo "Try '$0 -h' for more information."
    exit 1
fi

# Verify positionals
if [[ ! -f "${PDF}" ]]
then
    echo "ERROR: PDF does not exist"
    echo "Got ${PDF}"
    exit 1
fi

# update title metadata
cur_metadata=$(pdftk "${PDF}" dump_data_utf8)
# Line number of the first "InfoKey: Title" line (empty if there is none)
lineno_infokey_title=$(grep -n -m 1 -x "InfoKey: Title" <<< "${cur_metadata}" \
    | cut -d: -f1 || true)
if [[ -n "${lineno_infokey_title}" ]]
then
    # Remove the existing three-line record: InfoBegin, InfoKey, InfoValue
    lineno_infobegin=$((lineno_infokey_title - 1))
    lineno_infovalue=$((lineno_infokey_title + 1))
    cur_metadata="$(printf '%s\n' "${cur_metadata}" | \
        sed "${lineno_infobegin},${lineno_infovalue}d")"
fi

# Define the new metadata
title_metadata=$(cat << EOF
InfoBegin
InfoKey: Title
InfoValue: ${NEW_TITLE}
EOF
)
title_metadata="${title_metadata}"$'\n'
new_metadata="${title_metadata}${cur_metadata}"
# Quote every expansion so that paths containing spaces work
output_pdf_path="$(dirname "$(realpath "${PDF}")")/updated_$(basename "${PDF}")"
# Use a unique temporary file instead of a fixed path in /tmp
tmp_new_metadata="$(mktemp)"
printf '%s\n' "${new_metadata}" > "${tmp_new_metadata}"

# update the pdf
pdftk "${PDF}" update_info_utf8 "${tmp_new_metadata}" output "${output_pdf_path}"
rm "${tmp_new_metadata}"

if [[ ${inplace} -eq 1 ]]
then
    # Replace the original with the updated copy
    mv -f "${output_pdf_path}" "${PDF}"
    output_pdf_path="${PDF}"
fi

echo "output written to ${output_pdf_path}"
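
The least obvious part of the script is how it removes an existing Title record by line number before prepending the new one. The sketch below isolates that idea so it can be tried on plain text without pdftk. `strip_title_record` is a hypothetical name, and the sketch assumes a Title record is always the three consecutive lines InfoBegin / InfoKey: Title / InfoValue, which is what the dump shown earlier looks like.

```shell
# Hypothetical helper: remove the three-line Title record from metadata text.
# Usage: strip_title_record "METADATA_TEXT"
strip_title_record() {
    # Line number of the first "InfoKey: Title" line; empty if there is none.
    key_line=$(printf '%s\n' "$1" | grep -n '^InfoKey: Title$' \
        | head -n 1 | cut -d: -f1) || true
    if [ -z "$key_line" ]; then
        # No Title record: return the text unchanged.
        printf '%s\n' "$1"
        return 0
    fi
    # Delete InfoBegin (line before), InfoKey, and InfoValue (line after).
    printf '%s\n' "$1" | sed "$((key_line - 1)),$((key_line + 1))d"
}

# Example on made-up sample text:
#   strip_title_record "$(printf 'InfoBegin\nInfoKey: Title\nInfoValue: main.dvi\nNumberOfPages: 12')"
```

Working by line number keeps the logic short, but it does rely on the record layout staying fixed. A dump whose Title key is not preceded by InfoBegin would need different handling.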