Avoid changing URLs after publishing a page


30.Nov.2021

Your content is great, and you don’t want to change it. But you also know that serving two very different pages from the same URL could confuse visitors and search engines.

What should you do? Should you use a 301 redirect or change your page’s URL? Here are some guidelines to help:


1. Never change your URL after the page has been published. Doing so creates serious problems for both search engines and visitors.

2. Put this line at the top of every new page:

    /* Record the URL this page was published under. */
    $foo->url = 'http://example.com/archives/0123456789';

3. In your template code, check whether the URL has changed before including the page:

    if ($foo->url == 'http://example.com/archives/0123456789') {
        include('mypage.html');
    } else {
        include('noonecanreadthisbarf.html');
    }

4. If you change the URL of a page that has already been published, search engines will still have the old URL in their index for a while. But anyone who already linked to the old URL now points at a page that isn’t there, so that traffic is lost, and until crawlers catch up you may not rank well under either version of the URL. Search engines typically reindex fairly quickly, so they may pick up the new URL fairly soon.

5. You can create a redirect that sends visitors to your new content with a 301 Moved Permanently header, which passes most link equity along rather than losing it. But you don’t control the links other people have made, so you don’t control what happens to them. If referrers have already linked to your page under the old URL, search engines will keep seeing that URL for a while even if they reindex quickly, and they may decide that those linking pages are not very relevant to your new URL, or that you moved content without updating all of your links.

6. If you have a large site, the best thing to do is create temporary (302) redirects that send visitors from old URLs to new ones while you update your pages’ links one by one. Then you can remove the redirects when it’s convenient. (A minimal sketch of both kinds of redirect follows this list.)
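
As a concrete illustration of points 5 and 6, here is a minimal sketch of both kinds of redirect. It assumes Flask, which the post doesn’t mention; the routes and target paths are placeholders of mine:

    from flask import Flask, redirect

    app = Flask(__name__)

    # Permanent move (point 5): crawlers transfer their index entry to the new URL.
    @app.route('/archives/0123456789')
    def renamed_archive():
        return redirect('/archives/my-renamed-page', code=301)

    # Temporary move (point 6): keeps the old URL working while links are updated.
    @app.route('/archives/9876543210')
    def temporarily_moved():
        return redirect('/archives/another-renamed-page', code=302)

The only difference crawlers care about is the status code: 301 tells them to move their index entry to the new URL, while 302 says the move may still be undone and the old URL should be kept.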


If you are an SEO expert, you will know that this topic is a bit touchy and has been argued to death for some time. This article does a good job of making the case that changing URLs after they have been published is not such a big deal (or at least, not such an irreversible one). I disagree with the premise in the opening paragraphs, but like the rest of the article.


So, if changing URLs is OK, why do we need to worry about creating redirects? And how does this bot help us here at all?


The reason you want to create a temporary redirect (or even a permanent one) is that search engines will continue requesting and indexing the old URL for some time, even if your new page is already live. The default behavior of a web crawler is simply to revisit the URLs it already knows about; a 301 redirect tells it that there is a "new" version of this resource, and it will crawl and index accordingly.
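
To see what a crawler sees when it first requests the old URL, you can fetch it without following redirects. A small sketch using the requests library (my addition; the URL is the placeholder used earlier in the post):

    import requests

    # Don't follow redirects, so we see the raw response a crawler would get
    # on its first visit to the old URL.
    resp = requests.get('http://example.com/archives/0123456789', allow_redirects=False)

    print(resp.status_code)               # e.g. 301, 302, 200, or 404
    print(resp.headers.get('Location'))   # where the redirect points, if any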


The problem is that, once you create a permanent redirect, search engines will stop indexing the old URL at some point, because they no longer see it referenced anywhere. That can mean there’s eventually nothing for your users to click on: someone who picked up your link from Facebook or Google or anywhere else, clicked on it, and arrived at the old location will be stuck once that redirect goes away.

Instead of creating a redirect that may eventually point to nothing (permanent or temporary), you may want to add a check to your page like the one shown above, which will catch requests that arrive with your old URL and send them on their way (perhaps after reporting whether they are still valid). This is not a very elegant solution, but you can’t expect search engines to catch every single one of the millions of pages that might still have your old URL in them.
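
The post frames this as an in-page catch; a server-side version of the same idea, sketched here with Flask as an assumed stack (the URL map, handler, and log filename are all mine), would catch requests for retired URLs, report them, and forward the visitor:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    # Hypothetical map from retired URLs to their replacements.
    OLD_TO_NEW = {
        '/archives/0123456789': '/archives/my-renamed-page',
    }

    @app.errorhandler(404)
    def catch_old_urls(error):
        new_path = OLD_TO_NEW.get(request.path)
        if new_path:
            # Report the hit, then send the visitor on their way.
            with open('forwarded-old-urls.log', 'a') as log:
                log.write(request.path + ' -> ' + new_path + '\n')
            return redirect(new_path, code=301)
        return 'Not found', 404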


This in-page check is invisible to search engines, so only humans will ever benefit from it - and they are unlikely to report the old link, since from their point of view the page still works. It is persistent, though: if someone saved a link before, or it’s sitting in their browser history, they will still be able to reach the page through the old URL.

The bot I wrote goes one step further - it will actually follow the links from your page at its new URL and check whether they are valid. This matters less than it used to, with the rise of social media shares, but it matters all the more for private pages that aren’t shared anywhere.

Some notes on the code: it was written in Python 3. I used Beautiful Soup to scrape the HTML from each page and manipulate it, partly for its natural elegance and partly because I’ve used it before.

A lot of the code is just glue to pull things into one place, because I did a few different things and wasn’t going to post them all separately. The scraped page ends up with two lists - one for the links Beautiful Soup found on the page, and another for the original URLs that Beautiful Soup did not find (like the ones in the target URL - these are not used).

The scraper will crawl every link on your page. It’s possible I missed some, because it can be tricky to tell whether a tag has anything in it or not. If you find any it misses, let me know and I’ll fix them and send you an updated bot!

I used Beautiful Soup to look at every link on the page, skipping any that start with https (since this bot doesn’t handle secure connections). I then extracted the URL part of each tag and used a regex to check whether there was anything in it at all - otherwise it’s just blank.
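
The bot itself isn’t included in this post, so here is a minimal sketch of that extraction step as described - requests and Beautiful Soup (bs4) are assumed, and the function name is mine:

    import re
    import requests
    from bs4 import BeautifulSoup

    def extract_links(page_url):
        """Collect the href targets on a page, skipping https and blank links."""
        html = requests.get(page_url).text
        soup = BeautifulSoup(html, 'html.parser')
        links = []
        for tag in soup.find_all('a'):
            href = tag.get('href', '')
            if href.startswith('https'):
                continue  # the bot skips secure links
            if not re.search(r'\S', href):
                continue  # nothing in the target URL at all
            links.append(href)
        return links

Links with a missing or blank href are simply dropped here, which matches the validity rules described in the next few paragraphs.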

If there wasn't anything in the target URL part of the tag, the link was not considered valid - but it might be if you deleted that bit or changed it to something else.

If there are multiple links on the page with the same target URL, each is checked for validity individually. This means that if someone has copied your page and put it somewhere else, the bot will still find the copies, because their links are all still valid. If you’d rather it didn’t do this, I’ll change the code to make sure each target URL is only checked once.

If there is no target URL in the tag at all, the link is treated as invalid. This is unlikely but can happen - for example, if someone copied your page but missed out some of the tags.

There was a complication when I tried to get Beautiful Soup to follow links from other pages on the site, but I think it’s sorted out now. The bot will visit every page linked from your page(s) and check whether each link exists or not, then go back to your page(s). If there were any errors, it will write them to a file and email them to me (since I’d like to make sure nothing is broken).
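
The checking-and-reporting step, as described, might look roughly like this sketch - again a reconstruction of mine: requests is assumed, the log filename is made up, and the email step is left out:

    import requests
    from urllib.parse import urljoin

    def check_links(page_url, links, log_path='broken-links.log'):
        """Visit every link found on page_url and record the ones that fail."""
        errors = []
        for href in links:
            target = urljoin(page_url, href)  # resolve relative links
            try:
                resp = requests.get(target, timeout=10)
                if resp.status_code >= 400:
                    errors.append(f'{target} returned {resp.status_code}')
            except requests.RequestException as exc:
                errors.append(f'{target} failed: {exc}')
        if errors:
            with open(log_path, 'a') as log:
                log.write(f'Errors for {page_url}:\n')
                for line in errors:
                    log.write(line + '\n')
        return errors

Pairing this with the extract_links sketch above gives the basic crawl-and-check loop the post describes.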

If you want this for yourself or your company, let me know and I'll be happy to work on it for you. If you're interested in improving it, we can talk about that too! It was written for my job, but I'll be glad to share it with you.

You can download the script here. If there are any problems with this post or the script, just let me know and I’ll fix them - it’s a work in progress! Let me know what you think.