Johannessen Design Bureau

2009-10-03

WordPress and duplicate content

WordPress has come a long way with regard to its default URL scheme. I needed only two minor tweaks to be completely happy with it. The first one really isn’t WordPress’s fault. I dislike publishing the same content on both http://host.example/... and http://www.host.example/..., but I still want to accept incoming links with either hostname. The solution is pretty well known, though I choose to turn it on its head: I remove the www from the hostname using Apache rewrite rules:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^db\.org$ [NC]
RewriteRule ^(.*)$ http://db.org/$1 [R=301,L]

The [NC] flag makes the condition case insensitive. I don’t think I’ve ever seen an HTTP client send the Host: header in anything except lower case, but the relevant standards certainly allow for it.

The R=301 flag tells Apache to make this a permanent redirect, which allows clients to update the links at their end. The L flag tells Apache to stop the rewriting process here and not apply any more rewrite rules. This is probably needed if you have additional rewrite rules in your .htaccess file.
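To make the logic explicit, here is a minimal Python sketch of the decision the two directives make for each request. The function name redirect_target and its arguments are made up for illustration, and the sketch assumes the rule lives in a .htaccess file, where the path is matched without its leading slash.

```python
import re

CANONICAL_HOST = "db.org"  # the hostname used in the rewrite rule


def redirect_target(host, path):
    """Return the URL to redirect to, or None if no redirect is needed.

    `host` is the value of the Host: header; `path` is the request path
    without its leading slash (hypothetical arguments for this sketch).
    """
    # RewriteCond %{HTTP_HOST} !^db\.org$ [NC]
    # The rule only fires when the host is NOT the canonical one,
    # compared without regard to case.
    if re.fullmatch(re.escape(CANONICAL_HOST), host, flags=re.IGNORECASE):
        return None
    # RewriteRule ^(.*)$ http://db.org/$1 [R=301,L]
    # Everything captured by (.*) is appended to the canonical host.
    return "http://" + CANONICAL_HOST + "/" + path


print(redirect_target("www.db.org", "demo/2003/02/17/progress-bar/"))
print(redirect_target("DB.ORG", "demo/2003/02/17/progress-bar/"))
```

Note that the condition matches any hostname other than the canonical one, not just the www variant, so every alias of the site ends up at the same place.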

The second thing bugging me is the way WordPress does paging. You get a predefined number of posts on the front page, and links to /page/2/, /page/3/ and so on. This holds true for archive, category, tag-cloud and probably other pages as well. The problem is that the content on these pages is not stable (it shifts as you post new content) and that it duplicates content found elsewhere. I’m still pondering a solution to this…
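The shifting is easy to see in a small simulation. This is only a sketch with made-up post titles and a page size of three; it assumes posts are listed newest first and says nothing about how WordPress implements paging internally.

```python
def paginate(posts, per_page):
    """Split a newest-first list of posts into pages of per_page items."""
    return [posts[i:i + per_page] for i in range(0, len(posts), per_page)]


# Hypothetical post titles, newest first, three posts per page.
posts = ["post 6", "post 5", "post 4", "post 3", "post 2", "post 1"]
before = paginate(posts, 3)

# Publishing one new post pushes every older post down by one slot.
after = paginate(["post 7"] + posts, 3)

print(before[1])  # /page/2/ before: ['post 3', 'post 2', 'post 1']
print(after[1])   # /page/2/ after:  ['post 4', 'post 3', 'post 2']
```

The same URL now serves a different set of posts, so anything that linked to or indexed /page/2/ is pointing at content that has moved.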

2003-02-17

PHP+GD Progress Bar Demo

This demonstrates how to create a dynamic progress bar image using the PHP GD functions.

http://db.org/demo/2003/02/17/progress-bar/

2003-02-14

PHP+GD Scale and Overlay Demo

Someone posted a challenge to create a PHP script that would scale and overlay an image, and I couldn’t resist.

http://db.org/demo/2003/02/14/scale-and-overlay/

2003-02-11

Accept-* Header Logging Results

Introduction

Following a discussion on the Norwegian newsgroup no.it.tjenester.www.design about language tags in different browsers and their default values, I configured db.org to log the contents of the Accept-*: headers. The following is an analysis of this log, based on traffic from Tuesday, February 4th, to Tuesday, February 11th, 2003.

Accept-Language

The list below shows the different language tags seen during the week in question. Everything looks mostly as expected, with two notable exceptions. The first is the tag no-bm, which I assume signifies Norwegian Bokmål. According to the User-Agent: header, this is sent by Opera 5.12 running under Windows 95. There is no no-nn tag in the list, but it would be reasonable to assume that a browser using no-bm for Norwegian Bokmål also uses no-nn for Norwegian Nynorsk.

The second issue is the use of the language tag pdf, coming from Mozilla 4.79 running under Windows 98. If someone has an explanation for this, or even a plausible theory, I would like to know about it.

The list was produced with the following command:

cat negotiation.log \
| sed -e 's/^\"\(.*\)\" \"\(.*\)\" \"\(.*\)\" \"\(.*\)\"/\4/' \
| sed -e 's/,/\
/g'| sed -e 's/ *\(.*\)/\1/' \
| sed -e 's/^\(.*\);.*/\1/' \
| sort -f | uniq -i
  • bg
  • da
  • de
  • de-at
  • en
  • en-au
  • en-bz
  • en-ca
  • en-gb
  • en-ie
  • en-jm
  • en-nz
  • en-ph
  • en-tt
  • en-us
  • en-za
  • en-zw
  • es
  • es-mx
  • es-pr
  • fr
  • he
  • ie-ee
  • it
  • ja
  • lt
  • nb
  • nb-no
  • nl
  • nn
  • nn-no
  • no
  • no-bm
  • no-bok
  • no-nyn
  • pdf
  • pl
  • pt-br
  • ru
  • sk
  • sr
  • sv
  • tr
  • zh-cn
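
For readers who find the sed pipeline hard to follow, the same steps can be sketched in Python. This is an approximation rather than the command actually used: the names LINE, unique_values and sample are made up for illustration, the two sample log lines are invented, and the sketch cuts each value at the first semicolon where the sed expression cuts at the last one.

```python
import re

# Four double-quoted fields separated by single spaces:
# User-Agent, Accept-Charset, Accept-Encoding, Accept-Language.
LINE = re.compile(r'^"(.*)" "(.*)" "(.*)" "(.*)"')


def unique_values(lines, field):
    """Collect the distinct values of one field (1 to 4), ignoring case."""
    seen = {}
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that do not have the expected shape
        # Split the header on commas, like the second sed expression.
        for item in match.group(field).split(","):
            # Drop parameters such as ";q=0.8" and surrounding spaces.
            value = item.split(";")[0].strip()
            if value:
                # Keep the first spelling seen for each lower-cased value.
                seen.setdefault(value.lower(), value)
    # Sort case-insensitively, like "sort -f | uniq -i".
    return [seen[key] for key in sorted(seen)]


# Invented log lines in the four-field format, for demonstration only.
sample = [
    '"ExampleBrowser/1.0" "ISO-8859-1,utf-8;q=0.7" "gzip, deflate" "no, en-US;q=0.8, en;q=0.5"',
    '"ExampleBot/0.1" "utf-8" "x-compress" "EN-us, nb"',
]

print(unique_values(sample, 4))  # ['en', 'en-US', 'nb', 'no']
```

Passing 3 or 2 as the field number gives the corresponding lists for Accept-Encoding and Accept-Charset.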

Accept-Encoding

The selection of encodings is noticeably smaller and mostly as expected. I noticed that Internet Explorer, Opera and Konqueror all send both gzip and x-gzip. x-compress, on the other hand, is used only by robots/spiders, in this case RPT-HTTPClient and NPBot. For those of you who, like me, have never heard of the identity encoding, RFC 2616 offers the following explanation:

The default (identity) encoding; the use of no transformation whatsoever. This content-coding is used only in the Accept-Encoding header, and SHOULD NOT be used in the Content-Encoding header.

The list was produced with the following command:

cat negotiation.log \
| sed -e 's/^\"\(.*\)\" \"\(.*\)\" \"\(.*\)\" \"\(.*\)\"/\3/' \
| sed -e 's/,/\
/g'| sed -e 's/ *\(.*\)/\1/' \
| sed -e 's/^\(.*\);.*/\1/' \
| sort -f | uniq -i
  • compress
  • deflate
  • gzip
  • identity
  • x-compress
  • x-gzip

Accept-Charset

Another narrow selection. The only interesting thing to note is that Opera, on both Windows and Linux, is the only browser asking for documents in the windows-1252 character set.

The list was produced with the following command:

cat negotiation.log \
| sed -e 's/^\"\(.*\)\" \"\(.*\)\" \"\(.*\)\" \"\(.*\)\"/\2/' \
| sed -e 's/,/\
/g'| sed -e 's/ *\(.*\)/\1/' \
| sed -e 's/^\(.*\);.*/\1/' \
| sort -f | uniq -i
  • ISO-8859-1
  • ISO-8859-15
  • utf-16
  • utf-8
  • windows-1252

Apache Configuration

If you would like to make a similar analysis of your own traffic, the log file can be generated using the following directives in Apache’s httpd.conf:

LogFormat "\"%{User-agent}i\" \"%{Accept-charset}i\" \"%{Accept-encoding}i\" \"%{Accept-language}i\"" negotiation
CustomLog /path/to/logdir/negotiation.log negotiation

Copyright © 1997-2010: Bård Johannessen