16th January 2004, evening time
A while back I promised I’d do a write-up on how I build my blogged people and blogged domain pages. So, here ’tis.
These are the basic steps I followed:
So, bearing those steps in mind, let’s see exactly what I did…
As soon as I sat down to think about this it became clear that I was going to need some kind of additional markup, above and beyond XHTML and its <cite> tag, to consistently define ‘a person’ (and differentiate between individuals) within my blog posts.
Simply looking for ‘Mum’ or ‘Morag’ wouldn’t be enough to identify references to my Mother. I needed to have her full name in there somewhere, and I needed to be able to say “this string of letters refers to a person”. What I needed was a new ‘people’ tag to make the job easier.
I came up with this:
<dro:person value=""></dro:person>
The dro: (my initials) designates a namespace that I’m fairly certain won’t be used by anyone else, and the value attribute is where the person’s complete name goes. So, for example:
<dro:person value="Morag Orchard">Mum</dro:person>, <dro:person value="Paul Orchard">Dad</dro:person> and I went to town today.
Or:
<dro:person value="Paul Trainer">PT</dro:person> and I looked out the window today and saw a car pull into the yard, <dro:person value="Danny Dan">the driver</dro:person> honked his horn once, then drove off again.
Note: In hindsight I should have inserted a unique ID rather than a name into value (and cross-referenced this with the person’s name in another MySQL table), because as things stand if someone marries and changes their name (or two people have the same name) it’ll mess things up. I think I’ll have to make this change at some point.
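A minimal sketch of what the ID-based variant might look like, assuming value holds a numeric ID. The $people array is a hypothetical stand-in for the lookup table mentioned in the note, and extract_person_ids is an illustrative name rather than anything from the original code:

```php
<?php
// Hypothetical stand-in for a people lookup table: id => full name.
$people = array(1 => 'Morag Orchard', 2 => 'Paul Orchard');

// Hypothetical helper: returns the unique numeric IDs found in the
// value attribute of every dro:person tag in $post_body.
function extract_person_ids($post_body)
{
    preg_match_all('/<dro:person value="(\d+)">/', $post_body, $matches);

    // Convert the captured strings to integers and drop repeats.
    return array_values(array_unique(array_map('intval', $matches[1])));
}

$post_body = '<dro:person value="1">Mum</dro:person> and '
           . '<dro:person value="2">Dad</dro:person> went to town. '
           . '<dro:person value="1">Mum</dro:person> drove.';

foreach (extract_person_ids($post_body) as $id) {
    echo $id, ': ', $people[$id], "\n";
}
```

With IDs in the markup, a change of name only touches the lookup table, and two people who share a name stay distinct.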
This bit was easier: since all domain names appear as links, I didn’t need to add any additional markup. For example:
I previously wrote about <a href="/blog/archive/2003/09/26/chickens/" title="">our chickens</a> and <a href="/blog/archive/2003/12/11/eggs/" title="">their eggs</a>, well now here's a site showing <a href="http://www.foo.com/chicken.shtml" title="">how eggs are made</a>.
Note that I use relative URLs for links to content on my blog. This makes it easy to differentiate between internal and external links later on.
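To illustrate the point about relative URLs, here is a small sketch of how that test could be written. The function name is_external_link is hypothetical, and treating https as external alongside http and ftp is an assumption:

```php
<?php
// Hypothetical helper: an href that starts with an http, https or ftp
// scheme is treated as external; anything else (such as a relative
// path) is treated as internal.
function is_external_link($href)
{
    return preg_match('/^(https?|ftp):\/\//i', $href) === 1;
}

var_dump(is_external_link('/blog/archive/2003/09/26/chickens/')); // bool(false)
var_dump(is_external_link('http://www.foo.com/chicken.shtml'));   // bool(true)
```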
I needed one table to hold the people data, and one to hold the domain data:
CREATE TABLE people (
people_name varchar(30) NOT NULL default '',
post_id int(11) NOT NULL default '0',
PRIMARY KEY (people_name,post_id)
) ENGINE=MyISAM;
CREATE TABLE domain (
domain_name varchar(70) NOT NULL default '',
post_id int(11) NOT NULL default '0',
PRIMARY KEY (domain_name,post_id)
) ENGINE=MyISAM;
Note that each table has a primary key set across both its columns; that way the same name (or domain) and post pairing can never be inserted twice, even by mistake.
Now that I knew the information I wanted was in the markup, I had to work out a method to retrieve it and add it into the appropriate database table.
To that end I simply looped through all the entries in my post table, doing the following to the post body:
To grab people’s names I used regular expressions:
// set start and end tags
$starttag = '<dro:person value="';
$endtag = '">';
// escape characters that would mess up preg_match_all
$starttagesc = preg_quote($starttag, '/');
$endtagesc = preg_quote($endtag, '/');
// match all instances and assign value to $results array
preg_match_all("/$starttagesc.*?$endtagesc/", $post_body, $results);
// example result (the whole opening tag is matched)
// <dro:person value="Morag Orchard">
// remove duplicates
$results = array_unique($results[0]);
$results now contained a list of the unique names (still wrapped in their markup) that appeared in a single blog post. I could then loop through each name and insert it into my people table.
// loop through array
foreach ($results as $key => $value)
{
    // remove the markup so we're just left with the name
    $sp = strpos($value, $starttag, 0) + strlen($starttag);
    $ep = strrpos($value, $endtag);
    $name = substr($value, $sp, $ep - $sp);
    // escape the name so quotes or apostrophes can't break the query
    $name = mysql_real_escape_string($name);
    // set query
    $query = "INSERT INTO people SET people_name = '$name', post_id = '$post_id'";
    // run query
    $result = mysql_query($query);
}
When that had finished I moved on to the next blog post and did the same thing there.
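As an aside, the match-then-trim steps above can be collapsed by letting the regular expression capture the name directly. This is only a sketch of that alternative, not the code that was actually run; extract_people is a hypothetical name, and it assumes names never contain a double quote:

```php
<?php
// Hypothetical helper: returns the unique full names found in the
// value attribute of every dro:person tag in $post_body.
function extract_people($post_body)
{
    // The parentheses capture only the attribute value, so $matches[1]
    // holds the bare names and no strpos/substr clean-up is needed.
    preg_match_all('/<dro:person value="([^"]*)">/', $post_body, $matches);

    // array_unique keeps the original keys, so re-index with array_values.
    return array_values(array_unique($matches[1]));
}

$post_body = '<dro:person value="Morag Orchard">Mum</dro:person>, '
           . '<dro:person value="Paul Orchard">Dad</dro:person> and I went to town. '
           . '<dro:person value="Morag Orchard">Mum</dro:person> drove.';

print_r(extract_people($post_body)); // lists Morag Orchard, then Paul Orchard
```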
There were seven stages to my domain names collection:
Here I’m going to demonstrate how I collected the http links:
// set start and end tags
$starttag = '<a href="http';
$starttagsub = 'http';
$endtag = '"';
// escape characters that would mess up preg_match_all
$starttagesc = preg_quote($starttag, '/');
$endtagesc = preg_quote($endtag, '/');
// match all instances and assign value to $results array
preg_match_all("/$starttagesc.*?$endtagesc/", $post_body, $results);
// example result (the closing quote is part of the match)
// <a href="http://www.php.net/download-php.php3?csel=br"
// start with an empty list for this post
$domains_http = array();
// loop through the full matches, which preg_match_all stores in $results[0]
foreach ($results[0] as $key => $value)
{
    // remove the href code so we're just left with the link
    $sp = strpos($value, $starttag, 0) + strlen($starttag) - strlen($starttagsub);
    $ep = strrpos($value, $endtag);
    $link = substr($value, $sp, $ep - $sp);
    // use PHP's parse_url function
    // eg: http://www.php.net/download-php.php3?csel=br
    // $url['scheme'] = http
    // $url['host'] = www.php.net
    // $url['path'] = /download-php.php3
    // $url['query'] = csel=br
    $linkbits = parse_url($link);
    $host = $linkbits['host'];
    // look for a leading 'www.'
    $pos = strpos($host, 'www.');
    // if the host doesn't start with 'www.'
    if ($pos !== 0)
    {
        $domains_http[] = $host;
    }
    // if it does, drop those first four characters
    else
    {
        $domains_http[] = substr($host, 4);
    }
}
// if array
if (is_array($domains_http))
{
    // remove duplicates
    $domains_http = array_unique($domains_http);
}
$domains_http then contained a list of unique domain names that appeared in a single blog post.
I could then do exactly the same thing for the ftp domains, combine the two result arrays, remove the duplicates, and finally loop through the resulting array and insert each domain name into my domain table.
When that was finished I could move on to the next blog post and do the same thing again.
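For comparison, here is a sketch that handles the http and ftp links in one pass, so there is nothing to combine afterwards. It is an alternative under stated assumptions rather than the code described above: extract_domains is a hypothetical name, hosts are lowercased before comparison, and only a leading 'www.' is stripped.

```php
<?php
// Hypothetical helper: collects the unique domain names from every
// http, https or ftp link in $post_body.
function extract_domains($post_body)
{
    // Capture the full URL of each absolute link; relative links are skipped.
    preg_match_all('/<a href="((?:https?|ftp):\/\/[^"]*)"/i', $post_body, $matches);

    $domains = array();
    foreach ($matches[1] as $link) {
        $host = parse_url($link, PHP_URL_HOST);
        if (!is_string($host) || $host === '') {
            continue; // skip links parse_url cannot make sense of
        }
        // Lowercase the host and strip a leading 'www.' only.
        $domains[] = preg_replace('/^www\./', '', strtolower($host));
    }

    // Remove duplicates and re-index the array.
    return array_values(array_unique($domains));
}

$post_body = '<a href="/blog/archive/2003/12/11/eggs/" title="">their eggs</a> '
           . '<a href="http://www.foo.com/chicken.shtml" title="">how eggs are made</a> '
           . '<a href="ftp://ftp.example.org/pub/" title="">files</a> '
           . '<a href="http://foo.com/other" title="">more</a>';

print_r(extract_domains($post_body)); // lists foo.com, then ftp.example.org
```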
This bit was simple, so I’m not going to explain it here. I had my MySQL tables full of the appropriate information, I just pulled it out and displayed it in a suitable fashion (eg. blogged people and blogged domains).
At this point you may be saying to yourself, well hang on old boy, that’s all very nice, but this new markup you’ve introduced isn’t going to validate when you display your posts to users; it’s not part of the XHTML namespace. That’s very true, so when people request the regular pages of my blog I have to strip out my extra markup before it reaches their browser:
$search = array("'<dro:[^>]*?>'si", "'</dro:[^>]*?>'si");
$replace = array('', '');
$post_body = preg_replace($search, $replace, $post_body);
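A quick way to check the stripping step is to run it over one of the example sentences. The sketch below wraps the same idea in a hypothetical strip_dro_markup function and uses a single pattern for both opening and closing tags; it assumes a value never contains a '>' character:

```php
<?php
// Hypothetical helper: removes opening (<dro:...>) and closing
// (</dro:...>) tags but leaves the text between them untouched.
function strip_dro_markup($post_body)
{
    return preg_replace('/<\/?dro:[^>]*>/i', '', $post_body);
}

$post_body = '<dro:person value="Morag Orchard">Mum</dro:person> and I went to town today.';

echo strip_dro_markup($post_body); // Mum and I went to town today.
```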
And that’s it…
I’m sure someone will step in and point out I could have used 2 lines of PHP where I used 30, but it’s the best I could manage. However, please feel free to enlighten me, I’m always keen to learn.
(I expect that rather than access the database for each post I should simply add everything to an ongoing multi-dimensional array and do one big insert at the end. Still, it works quite happily as it is.)
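A rough sketch of that "one big insert" idea, assuming the names have already been gathered into an array keyed by post ID. build_people_insert is a hypothetical helper; it only builds the statement and its values, using placeholders so the names need no manual escaping, and INSERT IGNORE so pairs already in the table are skipped:

```php
<?php
// Hypothetical helper: turns a post_id => names map into one multi-row
// INSERT statement with placeholders, plus the flat list of values to bind.
function build_people_insert(array $people_by_post)
{
    $rows = array();
    $params = array();
    foreach ($people_by_post as $post_id => $names) {
        foreach ($names as $name) {
            $rows[] = '(?, ?)';
            $params[] = $name;
            $params[] = (int) $post_id;
        }
    }
    if (count($rows) === 0) {
        return null; // nothing to insert
    }
    $sql = 'INSERT IGNORE INTO people (people_name, post_id) VALUES '
         . implode(', ', $rows);

    return array($sql, $params);
}

list($sql, $params) = build_people_insert(array(
    12 => array('Morag Orchard', 'Paul Orchard'),
    15 => array('Paul Trainer'),
));

echo $sql;
// INSERT IGNORE INTO people (people_name, post_id) VALUES (?, ?), (?, ?), (?, ?)
```

The statement could then be run once through a prepared statement, for example $pdo->prepare($sql)->execute($params), where $pdo is an assumed PDO connection and not something the code above sets up.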
At some point I’ll adapt this code so it can do two separate things:
The distinction is worth making as there’s no point in accessing every post if you’ve only altered one.
I hope that’s all clear enough. Any questions (or errors), ask below…