This is my first post here at Semetrical, and I’m very excited to guide you through the mysterious, yet easy world of Regular Expressions and .htaccess.
1 – Non www redirect – Canonical issue
2 – Redirect index to root
3 – Remove cached SSL pages from search
4 – 301 redirects of all types
5 – Compressing files / caching
6 – Bad bots
7 – Custom error pages
8 – Simple dynamic URL rewriting
9 – Removing file extension
10 – Protecting images (and other files) from leeching
Firstly, let’s explore some basic regular expressions (regex or regexp) so that you understand the basic syntax, and then I will introduce you to the .htaccess file and how it can be used to streamline technical SEO efforts on your site.
A Regular Expression (regex) is a special text string for describing a search pattern. Regex turns up across all programming; whether you’re searching strings or validating input fields, regular expressions will save you a great deal of time. Below you can see some of the more basic “Metacharacters” used in regex.
Regular Expression – Common Metacharacters
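These are the metacharacters you will come across most often:

. – matches any single character
* – matches the preceding element zero or more times
+ – matches the preceding element one or more times
? – matches the preceding element zero or one time
^ – matches the start of a string
$ – matches the end of a string
[abc] – a character class; matches any one of the enclosed characters
(xyz) – groups a series of characters and captures the match
| – alternation; matches the expression before or after it
\ – escapes the next character so it is treated literally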
.htaccess – What is it and how can we use it?
An .htaccess (hypertext access) file is a configuration file that resides on Linux Apache servers. It allows web server configuration to be managed at directory level, without touching the main server configuration. The .htaccess file has many uses – from rewriting URLs and cache control, to blocking users and unidentified IP addresses. Used correctly, this file can have dramatic effects on your SEO campaign, especially when trying to simplify a site issue.
Okay, hopefully that’s given you a little insight; I don’t want to go into huge amounts of detail on regex and .htaccess, as we could be here forever!
Let’s jump in
I have a passion for taking a problem that seems difficult and simplifying it with just a couple of lines of code. I’m going to outline the top ten issues we always encounter when auditing a site, and then I will provide you with the code to fix them. Please remember these will only work if you have .htaccess enabled on an Apache server.
Firstly, you need to access the file for editing. Some FTP clients do not show the file by default when you log into the root directory; CuteFTP is one such client. The trick below should work in any client that hides the file by default, as long as you have access to filtering.
Log into your FTP directory and, once you can see a list of your files, right click in the middle and select ‘Filter’.
Select ‘Enable server side filtering’
Type the following filter command: -L-a
Once this has been done, the .htaccess file will appear in the root directory, ready for editing. As far as I’m aware, CuteFTP is the only client that requires this change before you can see the file.
The non-www redirect is a simple yet annoying problem that almost any SEO audit will turn up. Although it’s not a major issue and can also be addressed with a canonical tag, fixing it at server level is good practice from a technical standpoint, as it eliminates URL duplication. The following code fixes this:
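A minimal sketch of the fix, assuming your domain is yourdomain.com (swap in your own):

```apache
RewriteEngine On
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]
```

This 301 redirects every non-www request to its www equivalent, so search engines only ever see one version of each URL.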
Alternatively you can use:
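A more generic version that works for any domain, without hard-coding the name:

```apache
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]
```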
When coding a new site, many developers forget about SEO. It’s no fault of their own, but it’s always a headache when performing an audit, as very basic issues can get overlooked.
One commonly missed detail is linking the “Home” button back to the absolute root of the domain; many developers link it to www.yourdomain.com/index.php (or .asp, or .html) instead. This issue mainly appears on bigger e-commerce sites, and I still come across it now and again when performing an audit.
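A simple sketch of the redirect (yourdomain.com is a placeholder for your own domain):

```apache
RewriteEngine On
RewriteRule ^index\.(php|html|asp)$ http://www.yourdomain.com/ [R=301,L]
```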
The above code works on my servers, but sometimes it causes the site to loop. If you have this issue, you need to use the code below:
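This variant checks THE_REQUEST, which only ever contains the original request line sent by the browser. The rule therefore no longer fires on the internal rewrite Apache performs when serving index.php for /, and the loop disappears:

```apache
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.(php|html|asp)\ HTTP/ [NC]
RewriteRule ^index\.(php|html|asp)$ http://www.yourdomain.com/ [R=301,L]
```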
In e-commerce websites, an SSL certificate is often installed to encrypt the checkout process. If this is not configured correctly, you can end up duplicating your entire domain on HTTPS. The problem is that HTTPS uses a different port number, which makes it difficult to block with a standard robots.txt.
The following code will solve this for you by creating a virtual robots.txt file when that specific port is accessed.
Create a text document and insert the following:
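This tells every crawler to stay away from the HTTPS version entirely:

```
User-agent: *
Disallow: /
```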
Save the file as robots-https.txt and upload to root.
Then add the following code to your htaccess file:
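Something along these lines, which serves robots-https.txt whenever robots.txt is requested over the SSL port (443):

```apache
RewriteEngine On
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots-https.txt [L]
```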
You can test whether your code is working by visiting https://www.yourdomain.com/robots.txt – this should display the Disallow: / rule, confirming the correct robots.txt is being served over HTTPS. The main URL, http://www.yourdomain.com/robots.txt, should still show your normal robots.txt.
A common problem when building a new site is how to transfer an old page onto a new page. It becomes increasingly difficult when you’re working with a bigger site as more issues can arise if you make a mistake. Below, I have demonstrated how a number of variations can help you match multiple URLs, or just the one URL.
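A few common variations, with placeholder page and folder names you would swap for your own:

```apache
# One old page to one new page
Redirect 301 /old-page.html http://www.yourdomain.com/new-page/

# An entire old directory to a new directory, keeping the rest of the path
RedirectMatch 301 ^/old-directory/(.*)$ http://www.yourdomain.com/new-directory/$1

# A pattern-based redirect using mod_rewrite
RewriteEngine On
RewriteRule ^old-section/(.*)$ http://www.yourdomain.com/new-section/$1 [R=301,L]
```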
Speed is everything! A fast-loading site is not only good for user experience, it’s also a factor in Google’s quality scoring. It can dramatically increase time on site and lower your bounce rate. The following code will “compress” the file size of certain commonly used site files:
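A typical mod_deflate setup, assuming the module is enabled on your server:

```apache
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml
  AddOutputFilterByType DEFLATE application/javascript application/x-javascript
</IfModule>
```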
Caching images and other files is very important, especially if the given file being requested does not change on a regular basis. The following code will help with increasing site speed by making commonly requested files available as a cache.
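A sketch using mod_expires – the lifetimes here are illustrative, so tune them to how often your files actually change:

```apache
<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType image/jpeg "access plus 1 month"
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType image/gif "access plus 1 month"
  ExpiresByType text/css "access plus 1 week"
  ExpiresByType application/javascript "access plus 1 week"
</IfModule>
```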
Bad bots spider your site, causing bandwidth issues, and increase the risk of your site being scraped and duplicated on proxy servers. I have seen this happen time and time again. There is an easy way to stop it: block these bots’ user agents from accessing the site. There are many lists online of problematic bots, and entries from those lists can be added to the following code in order to block them.
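The bot names below are purely illustrative – replace them with entries from whichever list you trust:

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilScraper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSucker [NC]
RewriteRule .* - [F,L]
```

Any request whose user agent matches one of the conditions gets a 403 Forbidden response.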
When building any site, it’s always good to take user experience into consideration. Most users, when faced with an error, tend to leave the site. To prevent this, you can create a custom page that helps a user get back into the site, or helps them find what they were looking for – this is a great way to reduce your bounce rate and keep your users engaged. The code below will allow you to create dedicated pages for all those nasty server errors.
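A sketch covering the most common status codes, assuming your error pages live in a folder called “Error” in the site root:

```apache
ErrorDocument 400 /Error/400.html
ErrorDocument 401 /Error/401.html
ErrorDocument 403 /Error/403.html
ErrorDocument 404 /Error/404.html
ErrorDocument 500 /Error/500.html
```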
Remember to create a folder called “Error”, and place your files in there so the server knows where to find the files.
Creating relevance throughout your site is important to maximise your overall optimisation for a search engine.
Make sure your titles, descriptions and content have all been optimised for your target keywords. ‘Calls to Action’ are key to retaining user engagement on your site and boosting your search engine performance – with the overall objective being a conversion. This goes for your URL structure as well.
URLs that are created dynamically often have query strings and other unwanted parameters, which not only look ugly but do not have any relevance to the page; e.g. http://www.mydomain.co.uk/products.php?pn=9021
Imagine this page was for a pair of Adidas shoes. The URL does not contain a descriptive keyword and has no relevance to what the page is about. To improve and re-write the URL you could use the codes below:
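One way to do it, assuming you append the numeric product ID to the end of a keyword-rich slug:

```apache
RewriteEngine On
# e.g. /products/adidas-shoes-9021/ is served by products.php?pn=9021
RewriteRule ^products/[a-z0-9-]+-([0-9]+)/?$ products.php?pn=$1 [L]
```

The descriptive part of the slug is ignored by the rule; only the trailing ID is captured and passed to the script, so you are free to put your keywords in front of it.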
The result would be a new, keyword-rich URL along the lines of http://www.mydomain.co.uk/products/adidas-shoes-9021/
Let’s say we have a range of cars where the database assigns the car in the URL as a variable. For example:
http://www.mydomain.co.uk/index.php?manufacturer=audi – We can rewrite this URL as the following: http://www.mydomain.co.uk/car-makes/audi/
We can use the following code to achieve this:
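A minimal sketch:

```apache
RewriteEngine On
RewriteRule ^car-makes/([a-z-]+)/?$ index.php?manufacturer=$1 [L]
```

The captured manufacturer name is passed straight through as the query-string variable, so one rule covers every make in the database.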
This will rewrite all of the pages that contain the variable, so, for example, if another car page URL is http://www.mydomain.co.uk/index.php?manufacturer=bmw
The URL will be rewritten to: http://www.mydomain.co.uk/car-makes/bmw/
The above can be adapted to create more in-depth rewrites matching multiple patterns. You will need to have a play around to adapt this to work for all your pages.
This is not a problem as such; it’s more a way to neaten up your URL structure, especially for smaller static HTML / PHP sites. It’s always good practice to make a URL shorter and more relevant, and removing the file extension is one way to do this.
The following code removes the extension and adds a 301 redirect from the old URL:
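A sketch for .php files (swap the extension for .html if that is what your site uses):

```apache
RewriteEngine On
# 301 redirect direct requests for a .php file to the extensionless URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.\ ]+)\.php\ HTTP/ [NC]
RewriteRule ^ /%1 [R=301,L]
# Internally map the extensionless URL back onto the real .php file
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^([^.]+)$ $1.php [L]
```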
Make sure to implement a base href tag in your code so all your CSS and images get picked up; otherwise your site will load in its most basic form.
Protecting your site’s images from being used on other people’s websites is not always easy. Nine times out of ten, people will simply download your image and re-post it, but some users hotlink instead – meaning they load your image on their site using your URL and your bandwidth. This can be prevented using the following code.
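A typical sketch – substitute yourdomain.com for your own domain:

```apache
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?yourdomain\.com [NC]
RewriteRule \.(jpg|jpeg|png|gif)$ - [F,NC]
```

The empty-referer condition lets direct requests (and browsers or privacy tools that strip the referer) through, so only requests clearly coming from another site are blocked.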
These are but a few techniques that can be implemented with .htaccess. I hope my first post was of interest to you, please keep an eye out for more. I’m glad to be back writing and sharing my knowledge with all of you fellow digital marketers out there.
Please get in touch if you have any questions or want to know anything I may have missed.