Getting phpBB to Play Nice with Google

Because of this post’s past popularity, I have reposted it from my previous blog. I will be going through it and updating some of the content soon, and I have been slowly working on a “phpBB Tips for Better SEO” post, so be on the lookout for that as well.

The problem: How can I get Google (and other major search engines) to crawl my phpBB-based message board? Or, is Google capable of indexing phpBB-generated pages?

In my several years at Teh Community, I have seen these questions asked more times than anyone has cared to answer. And every time, it seems, a different answer is given.

It is not my intention here to go into a hugely in-depth exposition about the dynamics of session management, URL-rewriting, SEO, or anything else. I’ve learned but a few key things regarding phpBB and Google, and those are what I want to share. There is a 70+ page thread on Teh Community for those wishing to browse through the mountains of disinformation looking for the nuggets of gold which may or may not still be relevant.

Does Google accept dynamically generated pages?
Yes! According to Google:

We’re able to index dynamically generated pages. However, because our web crawler could overwhelm and crash sites that serve dynamic content, we limit the number of dynamic pages we index. In addition, our crawlers may suspect that a URL with many dynamic parameters might be the same page as another URL with different parameters. For that reason, we recommend using fewer parameters if possible. Typically, URLs with 1-2 parameters are more easily crawlable than those with many parameters.

Additionally, and also from Google:

If you decide to use dynamic pages (i.e., the URL contains a “?” character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.

What this means is that phpBB typically generates URLs which are perfectly acceptable to Google: they contain few parameters (usually t= for topics, p= for individual posts, and f= for forums) and, for the most part, generate content which is unique to each URL.

The biggest problem people face is the sometimes-necessary SID= (session ID) parameter. In theory, that method of session tracking creates billions of versions of a single page: a page with an otherwise perfectly acceptable URL ends up with countless different URLs pointing to it, one for each unique SID used to access it. Example:

  • www.example.com/index.php
  • www.example.com/index.php?SID=1
  • www.example.com/index.php?SID=2
  • www.example.com/index.php?SID=20058878

In the eyes of Google, each of those URLs is a separate page, even though they all serve identical content; according to their Quality Guidelines:

Don’t create multiple pages, subdomains, or domains with substantially duplicate content.
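
To make the duplication concrete, here is a minimal Python sketch of the clean-up a crawler would otherwise have to do on its own. The function name strip_sid and the http:// scheme are assumptions made for illustration; this is not phpBB or Google code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url):
    """Return the URL with any SID/sid query parameter removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the session ID (matched case-insensitively).
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != "sid"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "http://www.example.com/index.php",
    "http://www.example.com/index.php?SID=1",
    "http://www.example.com/index.php?SID=2",
    "http://www.example.com/index.php?SID=20058878",
]

# All four addresses collapse to a single page once the SID is gone.
unique_pages = {strip_sid(url) for url in urls}
print(sorted(unique_pages))
```

A crawler cannot safely assume that a parameter is meaningless, which is why it is better for the board to stop emitting the SID in the first place.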

Google and SIDs
When Googlebot hits your phpBB message board, it hits as a guest with cookies turned off. So turning off cookies in my browser and surfing on over to Teh Community, I find that SIDs are tagged onto the links.

Teh Community, however, has circumvented the appearance of SIDs on Googlebot’s links by having Googlebot appear as a logged-in user. The phpBB software by default, however, does not have this feature.

So, obviously, a modification is required which will prevent SIDs from being appended when Google and the other search engines come crawling.

For this, I prefer CyberAlien’s excellent Guest Sessions Modification (currently version 0.04). It will give the correct URL to guests who have cookies turned off, specifically URLs which do not contain SID variables. This will ensure that every time Google visits a certain page on your site, it will be able to do so without the SID being appended and therefore will not assume that a slightly different URL means a different page.

Simple and effective: Install the MOD and forget it.

mod_rewrite and phpBB
It is an oft-spread myth that Google penalizes URLs which contain query string variables; as we have seen above, Google has no problem with them. However, this myth has led people to devise quite a few ways of using Apache’s mod_rewrite module to rewrite phpBB’s dynamic URLs to look more like static pages.

For example, this viewtopic.php?t=2647&postdays=0&postorder=asc&start=540 may be rewritten into this topic/2647/days/0/asc/start/540/ or any of many variants, so long as all the variables are accounted for.
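
For the curious, this is roughly what such a rewrite rule amounts to, sketched in Python rather than Apache syntax. The pattern, the function name to_dynamic, and the acceptance of desc alongside asc are assumptions for illustration only; real rewrite schemes vary.

```python
import re

# One possible "static-looking" layout: topic/<id>/days/<n>/<order>/start/<offset>/
PRETTY_TOPIC = re.compile(r"^topic/(\d+)/days/(\d+)/(asc|desc)/start/(\d+)/$")


def to_dynamic(path):
    """Map a rewritten path back to the viewtopic.php URL that serves it."""
    match = PRETTY_TOPIC.match(path)
    if match is None:
        return None  # not a path this rule knows about
    topic, days, order, start = match.groups()
    return (
        f"viewtopic.php?t={topic}&postdays={days}"
        f"&postorder={order}&start={start}"
    )


print(to_dynamic("topic/2647/days/0/asc/start/540/"))
```

Note that every variable still has to be carried in the path, so the rewritten form is no shorter and no more unique than the original.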

However, this can cause problems, because the phpBB software can generate a seemingly infinite number of variable combinations. Some systems even supply entirely different URLs to users depending on whether they are logged in. Others keep the standard URLs for users but contain a link to an “archive” which links to all the board’s content using seemingly static URLs.

All of this is entirely unnecessary and complicates things.

Ignore mod_rewrite; you don’t need it to get phpBB indexed.

As a matter of optimization, however, search engines do look for keywords in the URL. A higher priority is given to the domain name than to directory or file names, but they are looked at. This is the reason why many blogs utilize permalinks which contain the title of the entry in URL-friendly form. It may be worthwhile to develop a permalink structure for a phpBB board which utilizes the topic title; however, there are far too many keyword-deficient topic titles out there to justify a change this big.
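
As a rough sketch of what “URL-friendly form” means, here is a small Python function that turns a title into a slug. The name slugify and the exact rules (lowercase, hyphens, ASCII letters and digits only) are assumptions; it is not taken from phpBB or any blog software.

```python
import re


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


print(slugify("Getting phpBB to Play Nice with Google"))
print(slugify("Help!!!"))
```

The second call shows the problem mentioned above: a topic titled “Help!!!” yields a slug with no useful keywords at all.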

Other Tips
It may be a worthwhile venture to strengthen the keyword density of the board itself. Force as much as possible (signatures, avatars, contact information, memberlists, profiles, usergroups, etc.) to show up for logged-in users only. Cut out the clutter and give Google and the other engines exactly what they want: content-rich pages.

Robots.txt
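On a standard install of phpBB, you might consider dropping this robots.txt file into your root directory. Crawlers only look for the file at the root of the domain, so even if you access your board via /phpBB/index.php, save it at /robots.txt; in that case, prefix each path below with /phpBB (for example, Disallow: /phpBB/admin/).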

User-Agent: *
Disallow: /admin/
Disallow: /cache/
Disallow: /db/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /templates/
Disallow: /common.php
Disallow: /config.php
Disallow: /extension.inc
Disallow: /faq.php
Disallow: /groupcp.php
Disallow: /login.php
Disallow: /memberlist.php
Disallow: /modcp.php
Disallow: /posting.php
Disallow: /privmsg.php
Disallow: /profile.php
Disallow: /search.php
Disallow: /viewonline.php

Doing this will prevent Googlebot and other well-behaved bots from accessing non-content areas of your site and keep them focused on what they need: the index, forum pages, and topic pages. Further, you’ll save bandwidth because the bots will be crawling fewer pages.
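
If you want to check such rules before uploading them, Python’s standard urllib.robotparser module can evaluate them locally. The sketch below uses an abbreviated copy of the rules above; the helper name allowed and the sample paths (including f=3) are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules above; paste the full file to test everything.
RULES = """\
User-Agent: *
Disallow: /admin/
Disallow: /login.php
Disallow: /posting.php
Disallow: /profile.php
Disallow: /search.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if the given user agent may fetch the path."""
    return parser.can_fetch(agent, path)


for path in (
    "/index.php",
    "/viewforum.php?f=3",
    "/viewtopic.php?t=2647",
    "/search.php?search_id=newposts",
    "/admin/index.php",
):
    print(path, allowed(path))
```

The content pages (index, forums, topics) should come back as allowed, while the search and admin pages should not.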

Switches
Make use of the template switches to hide parts of the templates that the bots do not need. In any of your templates, simply wrap content meant for logged-in users only like this:

<!-- BEGIN switch_user_logged_in -->
...logged in users will see this stuff....
<!-- END switch_user_logged_in -->

-=-=-=-

There are many other things you can do to help move your site up on Google; however, technical manipulation can only do so much. There are two tips which supersede everything else I have said here today:

  • Create content! The more content-rich your site is, the more Googlebot will show up and the better it will be able to understand what your pages are about. The better it understands your pages, the better it will be able to place you in the search results.
  • Get inbound links! The more people link to your board, the more important Googlebot will think your board is. Focus on getting quality links from similar sites, especially those already ranked higher than yours. The more links, the better off your board will be as you get more visitors, both people and bots.

12 thoughts on “Getting phpBB to Play Nice with Google”

  1. I fear some of my information may be out of date, and I have no idea how the impending phpBB3 will fare… I look forward to finding out, though.

    Thanks for your comment!

  2. I don’t think anything has changed greatly – especially your comment regarding the actual content. At the end of the day, google wants to offer what the searcher is looking for and real content does that.
    phpBB3 will take a while to make inroads since most board operators will resist changing their modded set ups, especially if they’re working fine unless it offers pretty hot benefits.

  3. I’ve modified my board quite a bit — including re-modding it from scratch probably a dozen or so times — but I’m ready to drop it all in a heartbeat to upgrade, just for the benefit of XHTML 1 Strict output. Everything else it will have to offer is just icing on the proverbial cake.

  4. It might just be because I have a tendency to be a tad bit obsessive compulsive. Admittedly, not as badly as many — I haven’t hacked away at my current board to make it comply, for example — but having valid code which doesn’t misuse or overuse tables and is styled purely from CSS will make modifying it a ton easier. (I.e., without touching the template files themselves, one could make thousands of styles for phpBB just by editing the CSS and optionally creating new imagesets.) In phpBB2 such flexibility wasn’t possible as several stylistic elements were either expressed in the templates or were output by the core code itself.

  5. I see where you are coming from. Personally, I’m just hoping your anti-spam mod does the business now. I manually approve and get about 10 a day of these automated registrations for dubious sites. Perhaps SEO optimising phpBB wasn’t a brilliant idea as the spammers will find me more easily (joke)
    Incidentally, I noted what you said about Google and mod_rewrite. What is your feeling about forcing http://www.domain via mod_rewrite? Is it worth the bother, or is it just adding a bit of server load?

  6. John: Let me quickly say thanks for your comments; I appreciate them!

    So far, Raven’s Antispam has been 100% successful against spammers on the two boards I have it installed on. Note that some users on phpBB.com have complained of a “Hacking Attempt” error, which I have not been able to pin down yet. It works perfectly fine for others. I’m hoping the version I have available for download now ends the “Hacking Attempt” woes, but I haven’t heard back from anyone yet on phpBB.com.

    That being said, I am 100% “for” normalizing your web address to either “domain.com” or “www.domain.com.” Via Google’s Webmaster Tools, you can set your preferred domain to either one, which I recommend.

    Further, I have a tiny li’l MOD available on my Extras page designed specifically to remove the “www.” from your site. I personally dislike using “www” as it is deprecated; it is also wholly unnecessary.

    Standardizing to either “domain.com” or “www.domain.com” will keep all search engine juice (i.e., incoming links, PageRank, etc.) consolidated under one address rather than split between the two. Such consolidation is quite important! :)

    (I also recommend enforcing the use of “index.php” when linking to your phpBB homepage, as that is the form of the link phpBB uses internally. Having “/” vs. “/index.php” creates the same kind of split that “www.domain.com” vs. “domain.com” does, albeit on a smaller scale.)

  7. Thanks for taking the trouble to reply – that’s really useful information. I’m hoping google will index more pages on our forums as they provide a great resource for vegetable growers as well as a friendly community. Sorry, not ‘plugging’ so much as explaining.
    To try and keep on topic, I’ll move to extras. Your spam prevention mod has started brilliantly. I installed it last night and this morning there were no spam registrations to delete. Seems far too good to be a coincidence. I’d normally have expected between 3 and 5 to do. I’ll send you another ‘report’ in a week or so on how it’s going.
    Deleting 10 fake user registrations a day may not seem like a big deal but it becomes a real chore, especially if you go away for a couple of days.
    Thanks
    John

  8. Update on the anti-spam modification for phpBB. It’s brilliant and working perfectly, and I have now reverted to user account activation, being confident the board won’t be overrun by these spammers.

    Running a family friendly board, the last thing we want is to let them on.

    Fantastic and thanks again.

    John

  9. Glad to hear of the success, John! If you know other phpBB webmasters, spread the word, would ya? I gain nothing by people using the MOD, but others gain a spambot-free board which is simply what everyone wants. :)

the Rick Beckman archive