Ideas for a Link Checker Plugin

I’m working on creating a Link Checker plugin for WordPress. The idea came to me from a post by Anne van Kesteren about the perfect weblog system.

  • All links should be stored in a separate database table and referenced from within the post.
    • This is needed so links can be easily checked. (If you find errors in the following, please contact me and I will update it accordingly.)
      • If the link returns a 200, leave it. (You might want to check if the title is updated though.)
      • If the link returns a 301 (permanent redirect), it can be updated. (The old URI could be stored in a separate database field temporarily so that the end user can see what changes are made.)
      • If the link returns a 410, it can be removed immediately. (No need to check again, do inform the end user.)
      • If the link returns a 404, it should be checked once a day for 10 days, and after that by the end user, to see whether the file is really missing.
      • All other error codes should get the same treatment as 404.
      • All other status codes should get the same treatment as 200.
    • The title of the referenced URI should have a separate database field.
    • The language(s) of the referenced URI should be stored.
    • The MIME type(s) of the referenced URI should be stored.
    • The status/error code should be stored and checked regularly.

Here’s my rough plan:

WP Link Checker

A proposed plugin to auto-check links wherever they occur.

Table lc_link_status

  • lc_type: 0=link, 1=href in a post, 2=postmeta
  • lc_id: id of the item where the link occurs
  • lc_updated: datetime that the link was last checked
  • lc_recheck: datetime that the link should be re-checked
  • lc_status: HTTP status code of the last time it was checked.
  • lc_score: score is affected by error statuses (4xx or 5xx). When it hits 100, status is considered permanent, and link is removed or updated.
  • lc_deleted: placed in to-be-deleted queue, awaiting confirmation by user.
  • lc_redirected: placed in the to-be-updated queue, awaiting confirmation by user.
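
To make the table concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of the WordPress database. The column types, the defaults, the primary key, and the `url` column are my assumptions; the list above names only the lc_* fields.

```python
import sqlite3

# Assumption: `url`, the column types, the defaults and the primary key are
# illustrative. The plan itself lists only the lc_* fields.
SCHEMA = """
CREATE TABLE lc_link_status (
    lc_type       INTEGER NOT NULL,            -- 0=link, 1=href in a post, 2=postmeta
    lc_id         INTEGER NOT NULL,            -- id of the item where the link occurs
    url           TEXT    NOT NULL,            -- hypothetical: the link target itself
    lc_updated    TEXT,                        -- datetime of the last check
    lc_recheck    TEXT,                        -- datetime of the next scheduled check
    lc_status     INTEGER NOT NULL DEFAULT -1, -- last HTTP status; -1 = never checked
    lc_score      INTEGER NOT NULL DEFAULT 0,  -- 100 or more = considered permanent
    lc_deleted    INTEGER NOT NULL DEFAULT 0,  -- 1 = waiting in the to-be-deleted queue
    lc_redirected INTEGER NOT NULL DEFAULT 0,  -- 1 = waiting in the to-be-updated queue
    PRIMARY KEY (lc_type, lc_id, url)
)
"""

def create_table(conn):
    """Create the link status table on an open sqlite3 connection."""
    conn.execute(SCHEMA)

conn = sqlite3.connect(":memory:")
create_table(conn)
conn.execute(
    "INSERT INTO lc_link_status (lc_type, lc_id, url) VALUES (?, ?, ?)",
    (1, 42, "http://example.com/"),
)
row = conn.execute("SELECT lc_status, lc_score FROM lc_link_status").fetchone()
```

A freshly inserted link starts with status -1 and score 0, which is the "scheduled, never checked" state.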

On Save/Edit, pull out all hrefs and insert them into the lc_link_status table. Schedule them for checking by setting lc_status to -1 and lc_recheck to getdate().
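
Pulling the hrefs out of a post could look roughly like this. It is a sketch using Python's standard html.parser; the names HrefCollector and extract_hrefs are made up for illustration, and a real plugin would do the equivalent in PHP on the save hook.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_hrefs(post_html):
    """Return the list of link targets found in a post body."""
    parser = HrefCollector()
    parser.feed(post_html)
    parser.close()
    return parser.hrefs

sample = '<p><a href="http://a.example/">A</a> and <A HREF=\'http://b.example/x\'>B</A></p>'
found = extract_hrefs(sample)
```

Each returned URL would then get a row with status -1 and a recheck time of now.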

On calls to wp_link_checker.php?check=N

  1. Get the top N links where lc_recheck < getdate() and lc_score < 100 order by lc_recheck ASC, lc_updated ASC
  2. For each link, CHECK STATUS.
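
Step 1 is a single query. The sketch below runs it against an in-memory SQLite table. The `url` column and the choice to store datetimes as sortable text are my assumptions; the demo table holds only the columns the query needs.

```python
import sqlite3

def due_links(conn, now, n):
    """Return up to n URLs whose recheck time has passed and whose score is under 100."""
    rows = conn.execute(
        """
        SELECT url FROM lc_link_status
        WHERE lc_recheck < ? AND lc_score < 100
        ORDER BY lc_recheck ASC, lc_updated ASC
        LIMIT ?
        """,
        (now, n),
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
# Cut-down table for the demo: only the columns used by the query.
conn.execute(
    "CREATE TABLE lc_link_status (url TEXT, lc_updated TEXT, lc_recheck TEXT, lc_score INTEGER)"
)
conn.executemany(
    "INSERT INTO lc_link_status VALUES (?, ?, ?, ?)",
    [
        ("http://a.example/", "2005-01-01 00:00:00", "2005-01-05 00:00:00", 0),
        ("http://b.example/", "2005-01-01 00:00:00", "2005-01-03 00:00:00", 20),
        ("http://c.example/", "2005-01-01 00:00:00", "2005-01-02 00:00:00", 100),  # score at limit
        ("http://d.example/", "2005-01-01 00:00:00", "2005-02-01 00:00:00", 0),    # not due yet
    ],
)
batch = due_links(conn, "2005-01-10 00:00:00", 5)
```

The oldest overdue link comes first, so a small N still makes steady progress through the backlog.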

CHECK STATUS

  1. Make a HEAD request to the target.
  2. Set lc_updated = getdate();
  3. Switch (HTTP STATUS)
    • 200: Record status as 200, set lc_score = 0, set lc_recheck = getdate() + 6 months.
    • 404: Add 10 to lc_score. If lc_score >= 100, REMOVE LINK. Else set lc_recheck = getdate() + 3 days.
    • 410: Add 30 to lc_score. If lc_score >= 100, REMOVE LINK. Else set lc_recheck = getdate() + 3 days.
    • 301: Get LOCATION header. Follow redirects until non-3xx response. UPDATE LINK, and re-process.
    • 302: Save status as 302, and add 4 to lc_score. If lc_score >= 100, UPDATE LINK (treat as 301). Else set lc_recheck = getdate() + 1 month.

      Although this violates the HTTP recommendation (client SHOULD continue to use the Request-URI for future requests), it works this way because people frequently use the 302 status code when they really mean 301. (For example, if you use the [R] flag in a RewriteRule directive, instead of [R=301] then a 302 status code is used, when it may in fact be a permanent redirect.)

    • Other 3xx: Treat as 302, but save status appropriately.
    • Other 4xx/5xx: Treat as 404, but save status appropriately.
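
The switch above can be sketched as a pure function, with a separate helper for the HEAD request. Treat this as a sketch under assumptions: the names head_status and apply_status are made up, "6 months" and "1 month" are approximated as 182 and 30 days, a score of 100 or more counts as permanent (matching the lc_score description), and status codes not listed are treated like 200, as in Anne's list.

```python
import http.client
from datetime import datetime, timedelta
from urllib.parse import urlsplit

THRESHOLD = 100  # lc_score at which the status is considered permanent

def head_status(url, timeout=10):
    """Send one HEAD request without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    https = parts.scheme == "https"
    conn_cls = http.client.HTTPSConnection if https else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def apply_status(score, status, now):
    """Return (new_score, action, recheck_time) for one check result.

    action is "keep", "recheck", "remove" or "update"; recheck_time is None
    when the link goes to REMOVE LINK or UPDATE LINK instead of being rescheduled.
    """
    if status == 301:
        # Caller follows the Location chain, updates the link and re-processes it.
        return score, "update", None
    if 300 <= status < 400:      # 302 and other 3xx
        penalty, delay, at_limit = 4, timedelta(days=30), "update"
    elif status == 410:
        penalty, delay, at_limit = 30, timedelta(days=3), "remove"
    elif 400 <= status < 600:    # 404 and other 4xx/5xx
        penalty, delay, at_limit = 10, timedelta(days=3), "remove"
    else:                        # 200 and anything else: healthy, reset the score
        return 0, "keep", now + timedelta(days=182)
    score += penalty
    if score >= THRESHOLD:
        return score, at_limit, None
    return score, "recheck", now + delay

now = datetime(2005, 1, 1)
result = apply_status(90, 404, now)
```

With these numbers a link that keeps returning 404 reaches the limit on the tenth failed check, about a month after the first failure, while a steady 410 gets there on the fourth.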

REMOVE LINK

Need to inform the user of what is happening. Perhaps an admin page where this action gets confirmed?

Some links may work only from the user’s computer or require a login (for example, a link to my work intranet or my Hotmail inbox, which is hidden from most users anyhow).

  1. Place in Queue for user confirmation.
  2. Switch (User Selection)
    Delete
    • Remove the A tag from post contents but leave the innerHTML intact, remove postmeta field from db, make link invisible.
    • Remove record from lc_link_status
    Ignore
    • Set the status to 200, and lc_recheck to 2050-12-31.
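
For the Delete branch, stripping the A tag while keeping its contents might look like the following. This is a regex-based sketch with a made-up function name, unwrap_links. It assumes the href is quoted and matches the stored URL exactly, and that anchors are not nested. A real implementation would probably want a proper HTML parser.

```python
import re

def unwrap_links(post_html, url):
    """Remove <a href="url">...</a> wrappers for one URL, keeping the inner HTML."""
    pattern = re.compile(
        r'<a\b[^>]*\bhref\s*=\s*(["\'])' + re.escape(url) + r'\1[^>]*>(.*?)</a\s*>',
        re.IGNORECASE | re.DOTALL,
    )
    # Group 2 is everything between the opening and closing tag.
    return pattern.sub(lambda m: m.group(2), post_html)

post = (
    '<p>See <a href="http://gone.example/" title="x">this <em>page</em></a> '
    'and <a href="http://ok.example/">that</a>.</p>'
)
cleaned = unwrap_links(post, "http://gone.example/")
```

Only anchors pointing at the dead URL are unwrapped; other links in the same post are left alone.
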
UPDATE LINK

  1. User Confirmation: not required in most cases. Perhaps an option to confirm changes before they are made?
  2. Change the HREF to the new value, everywhere that urls can occur.
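
Changing the HREF inside post content could be sketched like this. The function name update_href is made up, and the sketch assumes the old URL appears in a quoted href exactly as stored and that the new URL needs no extra HTML escaping. Links and postmeta fields would be plain column updates instead.

```python
import re

def update_href(post_html, old_url, new_url):
    """Replace href="old_url" with href="new_url", keeping the original quote style."""
    pattern = re.compile(
        r'(\bhref\s*=\s*)(["\'])' + re.escape(old_url) + r'\2',
        re.IGNORECASE,
    )
    # A function replacement avoids backslash surprises if new_url contains "\".
    return pattern.sub(
        lambda m: m.group(1) + m.group(2) + new_url + m.group(2), post_html
    )

post = (
    '<a href="http://old.example/a">x</a> '
    "<a href='http://old.example/a'>y</a> "
    '<a href="http://old.example/ab">z</a>'
)
updated = update_href(post, "http://old.example/a", "http://new.example/a")
```

Because the closing quote must follow the old URL directly, a longer URL that merely starts with the same characters is not touched.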

ACTIONS

  • On wp_footer, check 5 links (if none are due, nothing happens).
  • Include a cronnable script that checks 50 links.
