Automatic Screenshots of Drupal Content
In an earlier post I recommended webkit2png for automatically screenshotting a list of URLs. A lot of time has passed since that post, and I’ve discovered a more robust tool. Pageres is incredible, and it has both a CLI and an API.
I’ll let you discover, on your own, what the Pageres tool can do. I needed to take screenshots of all of the content types on a site, at all of the important resolutions. Here’s a quick Drupal function I threw together to get N random nodes per content type:
```php
<?php
/**
 * Returns a newline-separated list of URLs for random nodes of each type.
 */
function generate_random_node_urls_by_type($num_per_type = 3, $include_type = FALSE, $alias = FALSE, $node_types = array()) {
  $output = '';
  if (empty($node_types)) {
    foreach (node_type_get_types() as $type) {
      $node_types[] = $type->type;
    }
  }
  foreach ($node_types as $node_type) {
    $result = db_query_range('SELECT n.nid as nid, ua.alias as alias
      FROM {node} n
      LEFT JOIN {url_alias} ua ON ua.source = CONCAT(\'node/\', n.nid)
      WHERE n.type = :ntype
      ORDER BY RAND()', 0, $num_per_type, array(':ntype' => $node_type));
    if ($result) {
      while ($row = $result->fetchAssoc()) {
        if ($include_type) {
          $output .= str_pad($node_type, 35);
        }
        if ($alias && $row['alias']) {
          $output .= $GLOBALS['base_url'] . '/' . $row['alias'] . "\n";
        }
        else {
          $output .= $GLOBALS['base_url'] . '/node/' . $row['nid'] . "\n";
        }
      }
    }
  }
  return $output;
}

// Example 1: 10 of each specific node type:
dpm(generate_random_node_urls_by_type(10, FALSE, TRUE, array('homepage_feature', 'page')));

// Example 2: 5 of every node type:
dpm(generate_random_node_urls_by_type(5, FALSE, TRUE));
```
The function spits out a list of URLs ready for use with pageres. Simply save the results to a text file (urls.txt in my example below).
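For example, you can dump the function’s return value straight to a file. The drush invocation below is an assumption (drush php-eval needs a bootstrapped Drupal site); pasting the dpm() output by hand works just as well:

```shell
# Hypothetical: with Drush available, dump the URL list straight to a file:
#   drush php-eval 'echo generate_random_node_urls_by_type(5, FALSE, TRUE);' > urls.txt
# Otherwise, paste the dpm() output by hand, one URL per line:
printf '%s\n' 'http://example.com/node/1' 'http://example.com/node/2' > urls.txt
cat urls.txt
```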
Here’s the pageres command I used to generate the screenshots:
```shell
pageres --delay 1 --header='Cache-Control: no-cache' --filename="<%= date %> - <%= url %> - <%= size %>" 1200x100 1024x100 768x100 520x100 320x100 < urls.txt
```
Why the 100-pixel height? The height doesn’t actually matter unless you enable cropping. I use 100 for all of them so it’s obvious the value is arbitrary. I tried 1200×1, but it breaks pageres; 1200×100 works perfectly.
How about another quick function? Here’s one to generate a list of URLs within a menu:
```php
<?php
/**
 * Returns a newline-separated list of URLs for the links in a menu.
 */
function generate_node_urls_in_menu($menu_name, $alias = FALSE) {
  $output = '';
  $result = db_query('SELECT m.link_path as link_path, ua.alias as alias
    FROM {menu_links} m
    INNER JOIN {url_alias} ua ON ua.source = m.link_path
    WHERE menu_name = :mname', array(':mname' => $menu_name));
  if ($result) {
    while ($row = $result->fetchAssoc()) {
      if ($alias) {
        $output .= $GLOBALS['base_url'] . '/' . $row['alias'] . "\n";
      }
      else {
        $output .= $GLOBALS['base_url'] . '/' . $row['link_path'] . "\n";
      }
    }
  }
  return $output;
}

dpm(generate_node_urls_in_menu('menu-for-undergraduates', TRUE));
```
Now, how does this handle many URLs? Unfortunately, not that well. Python comes to the rescue in just a few lines of simple code. The script below processes one URL at a time, generating all resolutions for each URL before moving on. I’m certain this could be better (the filename template should be an argument, for example), but it gets the job done.
```python
import subprocess

with open("urls.txt", "r") as file:
    for line in file:
        print "Generating screenshots for", line
        p = subprocess.Popen("pageres --header='Cache-Control: no-cache' "
                             "--filename='<%= date %> - <%= url %> - <%= size %>' "
                             "1200x100 1024x100 768x100 520x100 320x100",
                             shell=True, stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # Feed the URL to pageres on stdin and wait for it to exit.
        p.communicate(line)
```
UPDATE #1: Here’s a rough draft of a Python script that is a little more robust than the code above. It still lacks some niceties, but I’ll wait until the next time I need it to make improvements.
You would execute it like so:

```shell
python ~/repos/pageres_capture/pageres_capture.py urls.txt
```
```python
#!/usr/bin/env python
import argparse
import subprocess
import logging
import sys

# Example:
# sizes = "1200x100 1024x100 768x100 520x100 320x100"
sizes = "1200x100"

LOG = logging.getLogger(__name__)
LOG.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", "%Y-%m-%d %H:%M:%S")

# Console logging
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
LOG.addHandler(ch)

parser = argparse.ArgumentParser(description='Captures screenshots of URLs from a file using Pageres', version='1.0', add_help=True)
parser.add_argument('inputfile', action='store', type=file)
args = parser.parse_args()

# Loop through all of the lines in the input file and process them.
lines = args.inputfile.read().splitlines()
i = 0
for line in lines:
    # Increase the line number by one for our user messages.
    i += 1
    # Clean the line.
    lineclean = line.strip()
    if lineclean == '':
        LOG.info('Line %d - Ignoring blank line' % i)
        continue
    LOG.info('Line %d - Capturing %s' % (i, lineclean))
    p = subprocess.Popen("pageres --header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>' " + sizes,
                         shell=True, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Feed the URL to pageres on stdin and wait for it to exit.
    p.communicate(line)
```
UPDATE #2: Here’s a version that appends the URL to the top of each screenshot using ImageMagick. You can turn it off with --no-overlay. As with the code above, this is alpha code. Looking at it now, it’s clear I should make "sizes" an argument/switch; in fact, I should probably expose several of the pageres options.
This requires ImageMagick. Before running, you must be able to run mogrify successfully from the command line.
```python
#!/usr/bin/env python
import argparse
import subprocess
import logging
import sys

# Example:
# sizes = "1200x100 1024x100 768x100 520x100 320x100"
sizes = "1200x100"

LOG = logging.getLogger(__name__)
LOG.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", "%Y-%m-%d %H:%M:%S")

# Console logging
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
LOG.addHandler(ch)

parser = argparse.ArgumentParser(description='Captures screenshots of URLs from a file using Pageres', version='1.0', add_help=True)
parser.add_argument('inputfile', action='store', type=file)
parser.add_argument('--no-overlay', help='Do not add URL overlay', action='store_true')
args = parser.parse_args()

# Loop through all of the lines in the input file and process them.
lines = args.inputfile.read().splitlines()
i = 0
for line in lines:
    # Increase the line number by one for our user messages.
    i += 1
    # Clean the line.
    lineclean = line.strip()
    if lineclean == '':
        LOG.info('Line %d - Ignoring blank line' % i)
        continue
    LOG.info('Line %d - Capturing %s' % (i, lineclean))
    p = subprocess.Popen("pageres --header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>' " + sizes,
                         shell=True, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Feed the URL to pageres on stdin and wait for it to exit.
    p.communicate(line)
    if not args.no_overlay:
        # Annotate the newest file in the directory (the screenshot we just
        # captured) with the URL banner, using ImageMagick's mogrify.
        p = subprocess.Popen('OUTPUT="$(ls -Art | tail -n 1)"; mogrify -pointsize 14 -background Gold -gravity North -splice 0x18 -annotate +0+2 \'%s\' "${OUTPUT}"' % lineclean,
                             shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        p.communicate()
```
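One caveat with the overlay step: the URL is spliced directly into a shell string for mogrify, so a URL containing a quote, space, or shell metacharacter could break the command. A hedged sketch of safer quoting (my addition, not part of the original script) using the standard library's quote helper:

```python
# Safely embed a URL in the mogrify shell command.
# (shlex.quote is the Python 3 spelling; Python 2 used pipes.quote.)
try:
    from shlex import quote  # Python 3
except ImportError:
    from pipes import quote  # Python 2


def build_overlay_command(url):
    # Annotate the most recently created file with the URL banner.
    return ('OUTPUT="$(ls -Art | tail -n 1)"; '
            'mogrify -pointsize 14 -background Gold -gravity North '
            '-splice 0x18 -annotate +0+2 %s "${OUTPUT}"' % quote(url))


print(build_overlay_command("http://example.com/a page?x=1"))
```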
Update #3: Same as above, but this version handles inaccessible URLs gracefully, logging an error as it encounters them. This still works only with Python 2.7:
```python
#!/usr/bin/env python
import argparse
import subprocess
import logging
import sys
from urllib import urlopen

# Example:
# sizes = "1200x100 1024x100 768x100 520x100 320x100"
sizes = "1200x1200"

# CLI arguments from https://www.npmjs.com/package/pageres-cli
# Example:
# options = "--header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>'"
options = "--format=png --header='Cache-Control: no-cache' --filename='<%= date %> - <%= url %> - <%= size %>'"

LOG = logging.getLogger(__name__)
LOG.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", "%Y-%m-%d %H:%M:%S")

# Console logging
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
LOG.addHandler(ch)

parser = argparse.ArgumentParser(description='Captures screenshots of URLs from a file using Pageres', version='1.0', add_help=True)
parser.add_argument('inputfile', action='store', type=file)
parser.add_argument('--no-overlay', help='Do not add URL overlay', action='store_true')
args = parser.parse_args()

# Loop through all of the lines in the input file and process them.
lines = args.inputfile.read().splitlines()
i = 0
for line in lines:
    # Increase the line number by one for our user messages.
    i += 1
    lineclean = line.strip()
    if lineclean == '':
        LOG.info('Line %d - Ignoring blank line' % i)
        continue
    # Skip URLs that are not reachable before handing them to pageres.
    try:
        urlopen(lineclean).getcode()
    except Exception:
        LOG.error('Line %d - Error capturing %s' % (i, lineclean))
        continue
    LOG.info('Line %d - Capturing %s' % (i, lineclean))
    # The URL is now passed as an argument, so no stdin pipe is needed.
    p = subprocess.Popen('pageres "' + lineclean + '" ' + options + ' ' + sizes,
                         shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    p.communicate()
    if not args.no_overlay:
        # Annotate the newest file in the directory with the URL banner.
        p = subprocess.Popen('OUTPUT="$(ls -Art | tail -n 1)"; mogrify -pointsize 14 -background Gold -gravity North -splice 0x18 -annotate +0+2 \'%s\' "${OUTPUT}"' % lineclean,
                             shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        p.communicate()
```
One Comment
adam
I had to do the following on my M1 Mac to get Pageres-cli working:
Step 1: download and install Chromium manually.
https://download-chromium.appspot.com/
Step 2: add the env var below and source ~/.zshrc
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
Step 3: Manually tweak a puppeteer file for M1 issues (must be a better way though)
https://github.com/puppeteer/puppeteer/issues/5662#issuecomment-891297815
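The commenter's steps can be condensed into a shell sketch (assumptions: zsh on macOS; the Chromium download and the puppeteer patch from the linked issue still have to be done by hand):

```shell
# 1. Download and install Chromium manually from
#    https://download-chromium.appspot.com/
# 2. Tell puppeteer to skip its own Chromium download, e.g. in ~/.zshrc:
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
# 3. Patch puppeteer per the linked GitHub issue for M1 quirks.
echo "$PUPPETEER_SKIP_CHROMIUM_DOWNLOAD"
```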