Because of this post’s past popularity, it has been reposted from my previous blog. I will be going through this post and updating some of the content soon, and I have been slowly working on a “phpBB Tips for Better SEO” post as well, so be on the lookout for it.
The problem: How can I get Google (and other major search engines) to crawl my phpBB-based message board? Or, is Google capable of indexing phpBB-generated pages?
In my several years at Teh Community, I have seen these questions asked more times than anyone has cared to answer. And every time, it seems as though a different answer is given.
It is not my intention here to go into a hugely in-depth exposition about the dynamics of session management, URL-rewriting, SEO, or anything else. I’ve learned but a few key things regarding phpBB and Google, and those are what I want to share. There is a 70+ page thread on Teh Community for those wishing to browse through the mountains of disinformation looking for the nuggets of gold which may or may not still be relevant.
Does Google accept dynamically generated pages?
Yes! According to Google:
We’re able to index dynamically generated pages. However, because our web crawler could overwhelm and crash sites that serve dynamic content, we limit the number of dynamic pages we index. In addition, our crawlers may suspect that a URL with many dynamic parameters might be the same page as another URL with different parameters. For that reason, we recommend using fewer parameters if possible. Typically, URLs with 1-2 parameters are more easily crawlable than those with many parameters.
Additionally, and also from Google:
If you decide to use dynamic pages (i.e., the URL contains a “?” character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.
What this means is that phpBB typically generates URLs which are perfectly acceptable to Google; they contain few parameters (usually t= for topics, p= for individual posts, and f= for forums) and for the most part generate content which is unique to each URL.
The biggest problem people face is the sometimes necessary SID= (session ID) parameter. In theory, billions of versions of a single page are created using that method of session tracking: one page with an otherwise perfectly acceptable URL ends up with countless different URLs pointing to it as unique SIDs are used to access it. Example:
In the eyes of Google, each of those URLs is a unique resource identifier; according to their Quality Guidelines:
Don’t create multiple pages, subdomains, or domains with substantially duplicate content.
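To make the duplication concrete, here is a minimal Python sketch (not part of phpBB or of any MOD) that strips the session parameter from a few URLs and shows that they collapse to a single address. The host example.com, the /phpBB/ path, and the sid values are all made up for illustration; only the t= and sid parameter names come from the discussion above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_sid(url):
    """Return the URL with any sid= query parameter removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the session ID (case-insensitive).
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() != "sid"]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs: the same topic reached under three different sessions.
crawled = [
    "http://example.com/phpBB/viewtopic.php?t=2647&sid=aaaa1111",
    "http://example.com/phpBB/viewtopic.php?t=2647&sid=bbbb2222",
    "http://example.com/phpBB/viewtopic.php?t=2647&sid=cccc3333",
]

# A set keeps only distinct values, so it shows how many real pages exist.
unique_pages = {strip_sid(url) for url in crawled}
print(len(crawled), "URLs crawled,", len(unique_pages), "distinct page(s)")
```

A crawler that cannot tell the session parameter apart from a meaningful one has no way to perform this collapse by itself, which is why the fix has to happen on the board's side.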
Google and SIDs
When Googlebot hits your phpBB message board, it hits as a guest with cookies turned off. So turning off cookies in my browser and surfing on over to Teh Community, I find that SIDs are tagged onto the links.
Teh Community, however, has circumvented the appearance of SIDs on Googlebot’s links by having Googlebot appear as a logged-in user. The phpBB software, however, does not have this feature by default.
So, obviously, a modification is required which will prevent SIDs from appearing when Google and the other search engines show up.
For this, I prefer CyberAlien’s excellent Guest Sessions Modification (currently version 0.04). It will give the correct URLs to guests which have cookies turned off, specifically URLs which do not contain SID variables. This will ensure that every time Google visits a certain page on your site, it will be able to do so without a SID being appended, and therefore it will not assume that it is viewing a different page just because the URL is slightly different.
Simple and effective: Install the MOD and forget it.
mod_rewrite and phpBB
It is an oft-spread myth that Google penalizes URLs which contain query string variables; as we have seen above, Google has no problem with them. However, this myth has led to quite a few methods being devised that use Apache’s mod_rewrite module to rewrite phpBB’s dynamic URLs to look more like static pages.
For example, this viewtopic.php?t=2647&postdays=0&postorder=asc&start=540 may be rewritten into this topic/2647/days/0/asc/start/540/ or any of many variants, so long as all the variables are accounted for.
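The mapping in that example can be sketched in a few lines of Python. This is only an illustration of the idea, not an Apache mod_rewrite configuration, and the function name to_dynamic is made up here; the URL shapes are the ones from the example above.

```python
import re

# Matches the static-looking form, e.g. topic/2647/days/0/asc/start/540/
PRETTY = re.compile(r"^topic/(\d+)/days/(\d+)/(asc|desc)/start/(\d+)/$")


def to_dynamic(path):
    """Translate a static-looking path back into the dynamic viewtopic URL.

    Returns None when the path does not fit the expected pattern.
    """
    match = PRETTY.match(path)
    if match is None:
        return None
    topic, days, order, start = match.groups()
    return (f"viewtopic.php?t={topic}&postdays={days}"
            f"&postorder={order}&start={start}")


print(to_dynamic("topic/2647/days/0/asc/start/540/"))
```

Every variable has to appear somewhere in the static-looking path, so each extra parameter means another pattern to maintain.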
However, this can cause problems, because the phpBB software can generate a seemingly infinite number of variable combinations. Some systems even supply entirely different URLs to users depending on whether they are logged in or not. Some use standard URLs but contain a link to an “archive” which links to all the board’s content using seemingly static URLs.
All of this is entirely unnecessary and complicates things.
Ignore mod_rewrite; you don’t need it to get phpBB indexed.
As a matter of optimization, however, search engines do look for keywords in the URL. A higher priority is given to the domain name than to directory or file names, but they are looked at. This is the reason why many blogs use permalinks which contain the title of the entry in URL-friendly form. It may be worthwhile to develop a permalink structure for a phpBB board which uses the topic title; however, there are far too many keyword-deficient topic titles out there to justify a change this big.
It may be a worthwhile venture to strengthen the keyword density of the board itself. Force as much as possible (signatures, avatars, contact information, memberlists, profiles, usergroups, etc.) to show up for logged-in users only. Cut out the clutter and give Google and the other engines exactly what they want: content-rich pages.
On a standard install of phpBB, you might consider dropping this robots.txt file into your root directory (if you access your board via /phpBB/index.php then save this file at
Doing this will prevent Googlebot and other well-behaved bots from accessing non-content areas of your site and keep them focused on what they need: the index, forum pages, and topic pages. Further, you’ll save bandwidth because the bots will be crawling fewer pages.
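As a rough illustration of the kind of rules meant here, the sketch below holds a hypothetical robots.txt in a Python string and checks it with the standard library's urllib.robotparser. The /phpBB/ prefix comes from the example path above; the list of blocked scripts is an assumption based on a stock phpBB 2 install, and example.com is a placeholder host. Adjust both to match your own board.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a board served from /phpBB/. The script names are
# assumptions based on a stock phpBB 2 install; check them against your own.
ROBOTS_TXT = """\
User-agent: *
Disallow: /phpBB/admin/
Disallow: /phpBB/login.php
Disallow: /phpBB/profile.php
Disallow: /phpBB/posting.php
Disallow: /phpBB/privmsg.php
Disallow: /phpBB/search.php
Disallow: /phpBB/memberlist.php
Disallow: /phpBB/groupcp.php
"""


def build_parser(text):
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
BASE = "http://example.com/phpBB/"  # placeholder host

# Content pages should stay crawlable; non-content pages should be blocked.
for page in ("index.php", "viewforum.php?f=1", "viewtopic.php?t=2647",
             "profile.php?mode=viewprofile&u=2", "search.php"):
    verdict = "allowed" if parser.can_fetch("Googlebot", BASE + page) else "blocked"
    print(page, "->", verdict)
```

Checking the rules this way before uploading the file is a cheap guard against accidentally blocking the index, forum, or topic pages.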
Make use of the template switches to hide aspects of the templates that are unnecessary to the bots. In any of your templates, simply wrap content meant for logged-in users only like this:
<!-- BEGIN switch_user_logged_in -->
...logged in users will see this stuff....
<!-- END switch_user_logged_in -->
There are many other things you can do to help move your site up on Google; however, technical manipulation can only do so much. There are two tips which supersede everything else I have said here today:
- Create content! The more content-rich your site is, the more Googlebot will show up and the better it will be able to understand what your pages are about. The better it understands your pages, the better it will be able to place you in the search results.
- Get inbound links! The more people link to your board, the more important Googlebot will think your board is. Focus on getting quality links from similar sites, especially those already ranked higher than yours. The more links, the better off your board will be as you get more visits from both people and bots.