Automatic Screenshots of Drupal Content
In an earlier post I recommended webkit2png for automatically screenshotting a list of URLs. A lot of time has passed since that post, and I’ve discovered a more robust tool. Pageres is incredible, and it has both a CLI and an API.
I’ll let you discover, on your own, what the Pageres tool can do. I needed to take screenshots of all of the content types on a site, at all of the important resolutions. Here’s a quick Drupal function I threw together to get N random nodes per content type:
```php
<?php
/**
 * Returns a newline-separated list of URLs for random nodes of each type.
 */
function generate_random_node_urls_by_type($num_per_type = 3, $include_type = FALSE, $alias = FALSE, $node_types = array()) {
  $output = '';
  if (empty($node_types)) {
    foreach (node_type_get_types() as $type) {
      $node_types[] = $type->type;
    }
  }
  foreach ($node_types as $node_type) {
    $result = db_query_range('SELECT n.nid as nid, ua.alias as alias
      FROM {node} n
      LEFT JOIN {url_alias} ua ON ua.source = CONCAT(\'node/\', n.nid)
      WHERE n.type = :ntype
      ORDER BY RAND()', 0, $num_per_type, array(':ntype' => $node_type));
    if ($result) {
      while ($row = $result->fetchAssoc()) {
        if ($include_type) {
          $output .= str_pad($node_type, 35);
        }
        if ($alias && $row['alias']) {
          $output .= $GLOBALS['base_url'] . '/' . $row['alias'] . "\n";
        }
        else {
          $output .= $GLOBALS['base_url'] . '/node/' . $row['nid'] . "\n";
        }
      }
    }
  }
  return $output;
}

// Example 1: 10 of each specific node type:
dpm(generate_random_node_urls_by_type(10, FALSE, TRUE, array('homepage_feature', 'page')));

// Example 2: 5 of every node type:
dpm(generate_random_node_urls_by_type(5, FALSE, TRUE));
```
The function spits out a list of URLs ready for use with pageres. Simply save the results to a text file (urls.txt in my example below).
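For example, you can dump the function’s return value straight to a file. The drush invocation below is an assumption (drush php-eval needs a bootstrapped Drupal site); pasting the dpm() output by hand works just as well:

```shell
# Hypothetical: with Drush available, dump the URL list straight to a file:
#   drush php-eval 'echo generate_random_node_urls_by_type(5, FALSE, TRUE);' > urls.txt
# Otherwise, paste the dpm() output by hand, one URL per line:
printf '%s\n' 'http://example.com/node/1' 'http://example.com/node/2' > urls.txt
cat urls.txt
```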
Here’s the pageres command I used to generate the screenshots:
```shell
pageres --delay 1 --header='Cache-Control: no-cache' --filename="<%= date %> - <%= url %> - <%= size %>" 1200x100 1024x100 768x100 520x100 320x100 < urls.txt
```
Why the 100-pixel height? The height doesn’t actually matter unless you enable cropping. I use 100 for all of them so it’s obvious the value is arbitrary. I tried 1200×1, but it breaks pageres; 1200×100 works perfectly.
How about another quick function? Here’s one to generate a list of URLs within a menu:
```php
<?php
/**
 * Returns a newline-separated list of URLs for the links in a menu.
 */
function generate_node_urls_in_menu($menu_name, $alias = FALSE) {
  $output = '';
  $result = db_query('SELECT m.link_path as link_path, ua.alias as alias
    FROM {menu_links} m
    INNER JOIN {url_alias} ua ON ua.source = m.link_path
    WHERE menu_name = :mname', array(':mname' => $menu_name));
  if ($result) {
    while ($row = $result->fetchAssoc()) {
      if ($alias) {
        $output .= $GLOBALS['base_url'] . '/' . $row['alias'] . "\n";
      }
      else {
        $output .= $GLOBALS['base_url'] . '/' . $row['link_path'] . "\n";
      }
    }
  }
  return $output;
}

dpm(generate_node_urls_in_menu('menu-for-undergraduates', TRUE));
```
Now, how does this handle many URLs? Unfortunately, not that well. Python comes to the rescue in just a few lines of simple code. The script below processes one URL at a time, generating all resolutions for each URL before moving on. I’m certain this could be better (the filename template should be an argument, for example), but it gets the job done.
```python
import subprocess

with open("urls.txt", "r") as file:
    for line in file:
        print "Generating screenshots for", line
        p = subprocess.Popen("pageres --header='Cache-Control: no-cache' "
                             "--filename='<%= date %> - <%= url %> - <%= size %>' "
                             "1200x100 1024x100 768x100 520x100 320x100",
                             shell=True, stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # Feed the URL to pageres on stdin and wait for it to exit.
        p.communicate(line)
```
UPDATE #1: Here’s a rough draft of a Python script that is a little more robust than the code above. It still lacks some niceties, but I’ll wait until the next time I need it to make improvements.
You would execute it like so:

```shell
python ~/repos/pageres_capture/pageres_capture.py urls.txt
```
```python
#!/usr/bin/env python
import argparse
import subprocess
import logging
import sys

# Example:
# sizes = "1200x100 1024x100 768x100 520x100 320x100"
sizes = "1200x100"

LOG = logging.getLogger(__name__)
LOG.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", "%Y-%m-%d %H:%M:%S")

# Console logging
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
LOG.addHandler(ch)

parser = argparse.ArgumentParser(description='Captures screenshots of URLs from a file using Pageres', version='1.0', add_help=True)
parser.add_argument('inputfile', action='store', type=file)
args = parser.parse_args()

# Loop through all of the lines in the input file and process them.
lines = args.inputfile.read().splitlines()
i = 0
for line in lines:
    # Increase the line number by one for our user messages.
    i += 1
    # Clean the line.
    lineclean = line.strip()
    if lineclean == '':
        LOG.info('Line %d - Ignoring blank line' % i)
        continue
    LOG.info('Line %d - Capturing %s' % (i, lineclean))
    p = subprocess.Popen("pageres --header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>' " + sizes,
                         shell=True, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Feed the URL to pageres on stdin and wait for it to exit.
    p.communicate(line)
```
UPDATE #2: Here’s a version that appends the URL to the top of each screenshot using ImageMagick. You can turn it off with --no-overlay. As with the code above, this is alpha code. Looking at it now, it’s clear I should make "sizes" an argument/switch; in fact, I should probably expose several of the pageres options.
This requires ImageMagick. Before running, you must be able to run mogrify successfully from the command line.
```python
#!/usr/bin/env python
import argparse
import subprocess
import logging
import sys

# Example:
# sizes = "1200x100 1024x100 768x100 520x100 320x100"
sizes = "1200x100"

LOG = logging.getLogger(__name__)
LOG.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", "%Y-%m-%d %H:%M:%S")

# Console logging
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
LOG.addHandler(ch)

parser = argparse.ArgumentParser(description='Captures screenshots of URLs from a file using Pageres', version='1.0', add_help=True)
parser.add_argument('inputfile', action='store', type=file)
parser.add_argument('--no-overlay', help='Do not add URL overlay', action='store_true')
args = parser.parse_args()

# Loop through all of the lines in the input file and process them.
lines = args.inputfile.read().splitlines()
i = 0
for line in lines:
    # Increase the line number by one for our user messages.
    i += 1
    # Clean the line.
    lineclean = line.strip()
    if lineclean == '':
        LOG.info('Line %d - Ignoring blank line' % i)
        continue
    LOG.info('Line %d - Capturing %s' % (i, lineclean))
    p = subprocess.Popen("pageres --header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>' " + sizes,
                         shell=True, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Feed the URL to pageres on stdin and wait for it to exit.
    p.communicate(line)
    if not args.no_overlay:
        # Annotate the newest file in the directory (the screenshot we just
        # captured) with the URL banner, using ImageMagick's mogrify.
        p = subprocess.Popen('OUTPUT="$(ls -Art | tail -n 1)"; mogrify -pointsize 14 -background Gold -gravity North -splice 0x18 -annotate +0+2 \'%s\' "${OUTPUT}"' % lineclean,
                             shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        p.communicate()
```
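One caveat with the overlay step: the URL is spliced directly into a shell string for mogrify, so a URL containing a quote, space, or shell metacharacter could break the command. A hedged sketch of safer quoting (my addition, not part of the original script) using the standard library's quote helper:

```python
# Safely embed a URL in the mogrify shell command.
# (shlex.quote is the Python 3 spelling; Python 2 used pipes.quote.)
try:
    from shlex import quote  # Python 3
except ImportError:
    from pipes import quote  # Python 2


def build_overlay_command(url):
    # Annotate the most recently created file with the URL banner.
    return ('OUTPUT="$(ls -Art | tail -n 1)"; '
            'mogrify -pointsize 14 -background Gold -gravity North '
            '-splice 0x18 -annotate +0+2 %s "${OUTPUT}"' % quote(url))


print(build_overlay_command("http://example.com/a page?x=1"))
```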
Update #3: Same as above, but this version handles inaccessible URLs gracefully, logging an error as it encounters them. This still works only with Python 2.7:
```python
#!/usr/bin/env python
import argparse
import subprocess
import logging
import sys
from urllib import urlopen

# Example:
# sizes = "1200x100 1024x100 768x100 520x100 320x100"
sizes = "1200x1200"

# CLI arguments from https://www.npmjs.com/package/pageres-cli
# Example:
# options = "--header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>'"
options = "--format=png --header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>'"

LOG = logging.getLogger(__name__)
LOG.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", "%Y-%m-%d %H:%M:%S")

# Console logging
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
LOG.addHandler(ch)

parser = argparse.ArgumentParser(description='Captures screenshots of URLs from a file using Pageres', version='1.0', add_help=True)
parser.add_argument('inputfile', action='store', type=file)
parser.add_argument('--no-overlay', help='Do not add URL overlay', action='store_true')
args = parser.parse_args()

# Loop through all of the lines in the input file and process them.
lines = args.inputfile.read().splitlines()
i = 0
for line in lines:
    # Increase the line number by one for our user messages.
    i += 1
    lineclean = line.strip()
    if lineclean == '':
        LOG.info('Line %d - Ignoring blank line' % i)
        continue
    # Skip URLs that are not reachable before handing them to pageres.
    try:
        urlopen(lineclean).getcode()
    except Exception:
        LOG.error('Line %d - Error capturing %s' % (i, lineclean))
        continue
    LOG.info('Line %d - Capturing %s' % (i, lineclean))
    # The URL is now passed as an argument, so no stdin pipe is needed.
    p = subprocess.Popen('pageres "' + lineclean + '" ' + options + ' ' + sizes,
                         shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    p.communicate()
    if not args.no_overlay:
        # Annotate the newest file in the directory with the URL banner.
        p = subprocess.Popen('OUTPUT="$(ls -Art | tail -n 1)"; mogrify -pointsize 14 -background Gold -gravity North -splice 0x18 -annotate +0+2 \'%s\' "${OUTPUT}"' % lineclean,
                             shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        p.communicate()
```
One Comment
adam
I had to do the following on my M1 Mac to get Pageres-cli working:
Step 1: download and install Chromium manually.
https://download-chromium.appspot.com/
Step 2: add the env var below and source ~/.zshrc
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
Step 3: Manually tweak a puppeteer file for M1 issues (must be a better way though)
https://github.com/puppeteer/puppeteer/issues/5662#issuecomment-891297815
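The commenter's steps can be condensed into a shell sketch (assumptions: zsh on macOS; the Chromium download and the puppeteer patch from the linked issue still have to be done by hand):

```shell
# 1. Download and install Chromium manually from
#    https://download-chromium.appspot.com/
# 2. Tell puppeteer to skip its own Chromium download, e.g. in ~/.zshrc:
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
# 3. Patch puppeteer per the linked GitHub issue for M1 quirks.
echo "$PUPPETEER_SKIP_CHROMIUM_DOWNLOAD"
```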