
Check availability of external links in your web pages

Original post from linux.xvx.cz

When you create web pages, in most cases you use images, external links, and videos that are not a static part of the page itself but are stored externally.

At the time you wrote your shiny page you probably checked all these external dependencies to make sure they work and keep your readers happy, because nobody likes to see errors like this:

[Image: YouTube missing video error message]

Right now the page works fine with all its external dependencies because I checked it properly - but what about in a few months / years / …?

Web pages / images / videos may disappear from the Internet, especially when you do not control them, so it's handy to check your web pages from time to time to see whether all the external links are still alive.

There are many tools you can install on your PC to check the “validity” of your web pages instead of clicking the links manually.
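
For a quick one-off check you can also run such a tool locally. A minimal sketch, assuming you have Docker available: it runs muffet (the link checker the GitHub Action later in this post is built on) against a single site. The raviqqe/muffet image name and the example URL are assumptions - adjust them to your setup:

# One-off local link check of a single site using the muffet Docker image
# (raviqqe/muffet is assumed to be the published image name - adjust if needed)
docker run --rm raviqqe/muffet --verbose https://example.com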

I would like to share how I periodically check my documents / pages using GitHub Actions.

Here is the GitHub Action I wrote for this purpose: My Broken Link Checker

In short, you can simply create a git repository on GitHub and store in it a workflow file defining which URLs should be checked / verified:

git clone git@github.com:ruzickap/check_urls.git
cd check_urls || true
mkdir -p .github/workflows

cat > .github/workflows/periodic-broken-link-checks.yml << \EOF
name: periodic-broken-link-checks

on:
  schedule:
    - cron: '0 0 * * *'
  pull_request:
    types: [opened, synchronize]
    paths:
      - .github/workflows/periodic-broken-link-checks.yml
  push:
    branches:
      - master
    paths:
      - .github/workflows/periodic-broken-link-checks.yml

jobs:
  broken-link-checker:
    runs-on: ubuntu-latest
    steps:
      - name: Broken link checker
        env:
          INPUT_URL: https://google.com
          EXCLUDE: |
            linkedin.com
            localhost
            myexample.dev
            mylabs.dev
        run: |
          export INPUT_CMD_PARAMS="--one-page-only --verbose --buffer-size=8192 --concurrency=10 --exclude=($( echo ${EXCLUDE} | tr ' ' '|' ))"
          wget -qO- https://raw.githubusercontent.com/ruzickap/action-my-broken-link-checker/v1/entrypoint.sh | bash
EOF

git add .
git commit -m "Add periodic-broken-link-checks"
git push

The code above stores the GitHub Actions workflow file in the repository and starts checking https://google.com every midnight (UTC).
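
If a daily run is more than you need, only the cron expression has to change - GitHub Actions uses standard cron syntax. For example, a weekly check every Monday at 03:00 UTC would look like this:

on:
  schedule:
    # run once a week, on Monday at 03:00 UTC, instead of every midnight
    - cron: '0 3 * * 1'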

Here is a screencast where you can see it all in action:

This URL checker is based on muffet, and you can set its parameters by changing the INPUT_CMD_PARAMS variable.
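
To check your own site instead of https://google.com, you only need to adjust the environment variables in the workflow step. A minimal sketch - mysite.example and the excluded domains are placeholders, and the flags are the same ones used above:

        env:
          # the start URL muffet will crawl - replace with your own site
          INPUT_URL: https://mysite.example
          # domains that should not be checked (e.g. sites that block crawlers)
          EXCLUDE: |
            linkedin.com
            twitter.com
        run: |
          # dropping --one-page-only would make muffet follow internal links too
          export INPUT_CMD_PARAMS="--one-page-only --verbose --buffer-size=8192 --concurrency=10 --exclude=($( echo ${EXCLUDE} | tr ' ' '|' ))"
          wget -qO- https://raw.githubusercontent.com/ruzickap/action-my-broken-link-checker/v1/entrypoint.sh | bash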

Feel free to look at more details here: https://github.com/ruzickap/action-my-broken-link-checker

I hope this helps you keep your web pages in good shape by finding external link errors quickly.

Enjoy :-)

This post is licensed under CC BY 4.0 by the author.