Post #158

Who, where, how?

16th January 2004, evening time | Comments (7)

A while back I promised I’d do a write-up on how I build my blogged people and blogged domain pages. So, here ’tis.

Overview

These are the basic steps I followed:

  1. Find a clear and infallible way to demarcate my data, so I can easily identify it amongst the rest of my code;
  2. Create a MySQL database table to store the results in;
  3. Run through each of my blog posts and strip out the information I want;
  4. Remove duplicates;
  5. Insert the information into the appropriate MySQL table;
  6. Build a page to show my new information;
  7. (Additional step: remove any proprietary demarcation markup I might have added in before sending blog posts to a user’s browser.)

So, bearing those steps in mind, let’s see exactly what I did…

Step 1: Finding a clear and infallible way to demarcate my data

Demarcating people’s names

As soon as I sat down to think about this it became clear that I was going to need some kind of additional markup, above and beyond XHTML and its <cite> tag, to consistently define ‘a person’ (and differentiate between individuals) within my blog posts.

Simply looking for ‘Mum’ or ‘Morag’ wouldn’t be enough to identify references to my Mother. I needed to have her full name in there somewhere, and I needed to be able to say “this string of letters refers to a person”. What I needed was a new ‘people’ tag to make the job easier.

I came up with this:

    <dro:person value=""></dro:person>

The dro: (my initials) designates a namespace that I’m fairly certain won’t be used by anyone else, and the value attribute is where the person’s complete name goes. So, for example:

    <dro:person value="Morag Orchard">Mum</dro:person>, <dro:person value="Paul Orchard">Dad</dro:person> and I went to town today.

Or:

    <dro:person value="Paul Trainer">PT</dro:person> and I looked out the window today and saw a car pull into the yard, <dro:person value="Danny Dan">the driver</dro:person> honked his horn once, then drove off again.

Note: In hindsight I should have inserted a unique ID rather than a name into value (and cross-referenced this with the person’s name in another MySQL table), because as things stand, if someone marries and changes their name (or two people have the same name), it’ll mess things up. I think I’ll have to make this change at some point.

Demarcating domain names

This bit was easier. Since all domain names appear as links, I didn’t need to add any additional markup. For example:

    I previously wrote about <a href="/blog/archive/2003/09/26/chickens/" title="">our chickens</a> and <a href="/blog/archive/2003/12/11/eggs/" title="">their eggs</a>, well now here's a site showing <a href="http://www.foo.com/chicken.shtml" title="">how eggs are made</a>.

Note that I use relative URLs for links to content on my blog. This makes it easy to differentiate between internal and external links later on.
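As a rough sketch of that distinction (not the code used on this site; the function name `is_external_link` is made up for illustration), a link can be treated as external whenever it carries a scheme such as http or ftp. It assumes internal links are always written without a scheme, as in the example above.

```php
<?php
// Hypothetical helper: returns true when $href points off-site.
// Assumption: internal links never include a scheme (they start with "/").
function is_external_link($href)
{
    // parse_url() returns the scheme (http, ftp, ...) when one is present,
    // null when there is none, and false for a seriously malformed URL
    $scheme = parse_url($href, PHP_URL_SCHEME);

    return is_string($scheme) && $scheme !== '';
}

var_dump(is_external_link('/blog/archive/2003/12/11/eggs/'));   // bool(false)
var_dump(is_external_link('http://www.foo.com/chicken.shtml')); // bool(true)
```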

Step 2: Create a MySQL database table to store my results in

I needed one table to hold the people data, and one to hold the domain data:

    CREATE TABLE people (
      people_name varchar(30) NOT NULL default '',
      post_id int(11) NOT NULL default '0',
      PRIMARY KEY (people_name,post_id)
    ) ENGINE=MyISAM;

    CREATE TABLE domain (
      domain_name varchar(70) NOT NULL default '',
      post_id int(11) NOT NULL default '0',
      PRIMARY KEY (domain_name,post_id)
    ) ENGINE=MyISAM;

Note that each table has a primary key set across both its columns. That way the same name (or domain) can never be inserted twice for the same post, even by mistake.

Steps 3, 4, 5: Strip out the information I want, remove duplicates and insert that info into the appropriate MySQL table

Now that I knew the information I wanted was in the markup, I had to work out a method to retrieve it and add it into the appropriate database table.

To that end I simply looped through all the entries in my post table, doing the following to the post body:

Grabbing people’s names

To grab people’s names I used regular expressions:

    // set start and end tags
    $starttag = '<dro:person value="';
    $endtag = '">';

    // escape characters that would mess up preg_match_all
    $starttagesc = preg_quote($starttag, '/');
    $endtagesc = preg_quote($endtag, '/');

    // match all instances and assign value to $results array
    preg_match_all("/$starttagesc.*?$endtagesc/", $post_body, $results);

    // example result (the full matches are held in $results[0])
    // <dro:person value="Morag Orchard">

    // remove duplicates
    $results = array_unique($results[0]);

$results now contained a list of unique names that appeared in a single blog post. I could then loop through each name and insert it into my people table.

    // loop through array
    foreach($results as $key => $value)
    {
        // remove the markup so we're just left with the name
        $sp = strpos($value, $starttag, 0) + strlen($starttag);
        $ep = strrpos($value, $endtag);
        $name = substr($value, $sp, $ep-$sp);

        // escape the name so an apostrophe (eg. O'Brien) can't break the query
        $name = mysql_real_escape_string($name);

        // set query
        $query = "INSERT INTO people SET people_name = '$name', post_id = '$post_id'";

        // run query
        $result = mysql_query($query);
    }

When that had finished I moved on to the next blog post and did the same thing there.
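For comparison, here is a shorter sketch of the same extraction. It is an alternative, not the method described above: a capture group in the regular expression pulls the name out directly, so the strpos()/substr() step isn't needed. The function name `extract_people` is hypothetical, and the sketch assumes the value attribute never contains a double quote.

```php
<?php
// Hypothetical helper: returns the unique full names marked up in a post body.
function extract_people($post_body)
{
    // the parentheses capture only the contents of the value attribute
    preg_match_all('/<dro:person value="([^"]*)">/', $post_body, $matches);

    // $matches[1] holds the captured names;
    // array_values() re-indexes the array after duplicates are removed
    return array_values(array_unique($matches[1]));
}

$post_body = '<dro:person value="Morag Orchard">Mum</dro:person> rang, then '
           . '<dro:person value="Morag Orchard">Mum</dro:person> and '
           . '<dro:person value="Paul Orchard">Dad</dro:person> came over.';

$names = extract_people($post_body);
// $names is array('Morag Orchard', 'Paul Orchard')
```

Each name in the returned array could then be inserted exactly as in the loop above.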

Grabbing domain names

There were seven stages to my domain names collection:

  1. Ignore all internal links;
  2. Collect all http:// links;
  3. Collect all ftp:// links;
  4. Combine the http and ftp results into one array;
  5. Use PHP’s parse_url() function to extract the domain name;
  6. Remove duplicate values;
  7. Insert everything into the database.

Here I’m going to demonstrate how I collected the http links:

    // set start and end tags
    $starttag = '<a href="http';
    $starttagsub = 'http';
    $endtag = '"';

    // escape characters that would mess up preg_match_all
    $starttagesc = preg_quote($starttag, '/');
    $endtagesc = preg_quote($endtag, '/');

    // match all instances and assign value to $results array
    preg_match_all("/$starttagesc.*?$endtagesc/", $post_body, $results);

    // example result (the full matches are held in $results[0])
    // <a href="http://www.php.net/download-php.php3?csel=br"

    // start with an empty array so there is always something to de-duplicate
    $domains_http = array();

    // loop through the full matches
    foreach($results[0] as $key => $value)
    {
        // remove the href code so we're just left with the link
        $sp = strpos($value, $starttag, 0) + strlen($starttag) - strlen($starttagsub);
        $ep = strrpos($value, $endtag);
        $link = substr($value, $sp, $ep-$sp);

        // use PHP's parse_url function
        // eg: http://www.php.net/download-php.php3?csel=br
        // $url['scheme'] = http
        // $url['host'] = www.php.net
        // $url['path'] = /download-php.php3
        // $url['query'] = csel=br
        $linkbits = parse_url($link);

        // skip anything parse_url couldn't find a host in
        if (!isset($linkbits['host']))
        {
            continue;
        }
        $host = $linkbits['host'];

        // if the host starts with 'www.', drop those four characters
        if (strpos($host, 'www.') === 0)
        {
            $domains_http[] = substr($host, 4);
        }

        // otherwise keep the host as it is
        else
        {
            $domains_http[] = $host;
        }
    }

    // remove duplicates
    $domains_http = array_unique($domains_http);

$domains_http then contained a list of unique domain names that appeared in a single blog post.

I could then do exactly the same thing for the ftp domains, combine both the result arrays, remove the duplicates and then finally loop through the resulting array and insert each domain name into my domain table.

When that was finished I could move on to the next blog post and do the same thing again.
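The seven stages can also be folded into one pass. The following is only a sketch under a few assumptions, not the code used on this site: the function name `extract_domains` is hypothetical, the pattern accepts https as well as http and ftp, hosts are lower-cased before de-duplication, and `\s+` tolerates extra whitespace between `<a` and `href`.

```php
<?php
// Hypothetical helper: returns the unique external domains linked from a post body.
function extract_domains($post_body)
{
    // capture any href that starts with http://, https:// or ftp://;
    // internal links have no scheme, so they never match and are ignored
    preg_match_all('/<a\s+href="((?:https?|ftp):\/\/[^"]*)"/i', $post_body, $matches);

    $domains = array();
    foreach ($matches[1] as $link) {
        $host = parse_url($link, PHP_URL_HOST);
        if (!is_string($host) || $host === '') {
            continue; // skip anything parse_url() can't find a host in
        }

        // assumption: treat hosts case-insensitively
        $host = strtolower($host);

        // drop a leading "www." only
        if (strpos($host, 'www.') === 0) {
            $host = substr($host, 4);
        }
        $domains[] = $host;
    }

    // remove duplicates and re-index
    return array_values(array_unique($domains));
}

$post_body = 'See <a href="http://www.foo.com/chicken.shtml" title="">eggs</a>, '
           . '<a href="ftp://ftp.example.org/pub/" title="">files</a>, '
           . '<a href="/blog/archive/2003/12/11/eggs/" title="">my post</a> and '
           . '<a href="http://foo.com/" title="">foo again</a>.';

$domains = extract_domains($post_body);
// $domains is array('foo.com', 'ftp.example.org')
```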

Step 6: Build a page to show my new information

This bit was simple, so I’m not going to explain it here. I had my MySQL tables full of the appropriate information, I just pulled it out and displayed it in a suitable fashion (eg. blogged people and blogged domains).

Step 7: Remove any proprietary demarcation markup

At this point you may be saying to yourself, well hang on old boy, that’s all very nice, but this new markup you’ve introduced isn’t going to validate when you display your posts to users; it’s not part of the XHTML namespace. That’s very true, so when people request the regular pages of my blog I have to strip out my extra markup before it reaches their browser:

    $search = array("'<dro:[^>]*?\">'si", "'</dro:[^>]*?>'");
    $replace = array('', '');
    $post_body = preg_replace($search, $replace, $post_body);
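To show the effect on a real line of markup, here is a small sketch that wraps the same idea in a function. The name `strip_dro_markup` is hypothetical, and it uses a single pattern for both opening and closing tags rather than the two patterns above. It assumes no attribute value contains a ">" character.

```php
<?php
// Hypothetical helper: removes every opening and closing dro: tag,
// leaving the text between them untouched.
function strip_dro_markup($post_body)
{
    // "<\/?" matches both <dro:...> and </dro:...>
    return preg_replace('/<\/?dro:[^>]*>/i', '', $post_body);
}

$post_body = '<dro:person value="Morag Orchard">Mum</dro:person> and I went to town today.';

echo strip_dro_markup($post_body);
// prints: Mum and I went to town today.
```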

And that’s it…

Conclusion

I’m sure someone will step in and point out I could have used 2 lines of PHP where I used 30, but it’s the best I could manage. However, please feel free to enlighten me, I’m always keen to learn.

(I expect that rather than access the database for each post I should simply add everything to an ongoing multi-dimensional array and do one big insert at the end. Still, it works quite happily as it is.)
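A minimal sketch of that "one big insert" idea, assuming the names have already been collected as (name, post ID) pairs: the hypothetical function `build_people_insert` builds a single multi-row statement with placeholders, which could then be run as a prepared statement through whichever database layer is in use. MySQL's INSERT IGNORE skips any row that would collide with the primary key.

```php
<?php
// Hypothetical helper: builds one multi-row INSERT for a list of
// array(name, post_id) pairs. Returns array($sql, $params),
// or null when there is nothing to insert.
function build_people_insert(array $rows)
{
    if (count($rows) === 0) {
        return null;
    }

    $placeholders = array();
    $params = array();
    foreach ($rows as $row) {
        $placeholders[] = '(?, ?)';
        $params[] = $row[0]; // people_name
        $params[] = $row[1]; // post_id
    }

    // INSERT IGNORE makes MySQL skip rows that would violate the primary key
    $sql = 'INSERT IGNORE INTO people (people_name, post_id) VALUES '
         . implode(', ', $placeholders);

    return array($sql, $params);
}

list($sql, $params) = build_people_insert(array(
    array('Morag Orchard', 158),
    array('Paul Orchard', 158),
));

echo $sql;
// prints: INSERT IGNORE INTO people (people_name, post_id) VALUES (?, ?), (?, ?)
```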

At some point I’ll adapt this code so it can do two separate things:

  1. Run through every post in the database, or
  2. Run through a single post (which has just been written or updated).

The distinction is worth making as there’s no point in accessing every post if you’ve only altered one.

I hope that’s all clear enough. Any questions (or errors), ask below…



Comments (7)


  1. Stuart:

    If it was me, instead of <dro:person> I'd have done <cite href="http://www.1976design.com/people/Paul_Orchard">Dad</cite> and used the URLs as your key. Firstly, this makes your keys nice and readable; secondly, you only have to strip href attributes from cite tags on the way out, and thirdly, Just Say No To Databases :-)

    Posted 19 minutes after the fact
    Inspired: ↓ Stuart, ↓ Dunstan
  2. Stuart:

    Oh, and fourthly, you could actually *put* something at /people/Who_Ever, and have your outgoing tag stripper rewrite the <cite href="foo"> tag to <a href="foo"><cite> or something...

    Posted 19 minutes after the fact
    Inspired by: ↑ Stuart
    Inspired: ↓ Dunstan
  3. Dunstan:

    Good ideas Stuart. But I feel the <cite> tag would have been inappropriate here - I've gone around and around thinking this one out, and I've come to the conclusion that the <cite> tag has some very definite uses... and this isn't one of them.

    Plus, this method is expandable, I can markup anything I want - books, films, pets, songs. I know you could also use <cite> for those things, but I'd prefer to put a bit more effort in and do things 'correctly', so to speak. Also, having a framework like this in place is quite comforting to me :o)

    (I really wanted to go down the whole XML, XSLT route, but that's too much for me right now, so this is kind of a halfway step.)

    p.s. I _like_ databases! :op

    Posted 35 minutes after the fact
    Inspired by: ↑ Stuart
    Inspired: ↓ Stuart
  4. Dunstan:

    Good thinking that man:

    http://www.1976design.com/blog/archive/people/Paul+Orchard/

    I do have plans to link this into 'Characters' pages if it ever becomes appropriate.

    Also worth noting is that a person's name might already be part of a link, so to add in another link would arse everything up.

    But thanks for the ideas.

    Posted 38 minutes after the fact
    Inspired by: ↑ Stuart
    Inspired: ↓ Stuart
  5. [m]:

    Nicely done ! (about the same way I would've done it, but without the whole <dro:> stuff) But you are forgetting some things:

    1. Error checking. You have to have the posting process fully automated (read: controlled by the computer), even inserting the names into the post (selecting them via a drop-down box, for example), for it not to possibly throw up some errors eventually.

    2. Spaces. What if there is an additional space between the <a and the href attribute? The whole regular expression fails. This problem would fade away as soon as the posting process is fully automated, though.

    3. Usage of dro. Personally I would not use a <dro:person value="">, but something like <person>. I wouldn't keep track of blogged dogs, christmas events, spoons or other subjects, so it's fair to say that :person won't change. But that's my opinion, I don't know what lies in the future. ;)

    Speaking of <dro:person value="">, I feel that 'value' should be replaced by 'name'. It is semantically more correct (a name attribute to specify a person's name, who would've thought of that?!) and it keeps in spirit with it being (x)html.

    Oh, and did I say how wonderful this is coded and implemented? :)

    (PS why can't I post lists? This would've been a great comment to have it in. Also, entities are automatically converted to their normal equivalent when a post is previewed. Look: & should actually read: &amp;)

    Posted 6 hours, 14 minutes after the fact
  6. [m]:

    Very much off-topic, but I don't care:

    I just switched browsers on my father's pc (my notebook's still not repaired :( ) and downloaded FireBird. I was so pleasantly surprised to see naked super dunstan technologies again! :D I was happy the whole day through.

    Posted 6 hours, 17 minutes after the fact
  7. Stuart:

    I do see your point about the extensibility of your method; you can mark up anything you want. Like you say, this is essentially inventing your own XML vocabulary, but without going the full way to doing everything in custom XML and then having to XSLT the lot into something sane and displayable.
    Me, I'm not a big fan of databases, unless you've got, like, *loads* of data (which blogs haven't). I think I might write up something about that particular subject myself, actually, rather than fill your comments section to bursting.

    Posted 7 hours, 50 minutes after the fact
    Inspired by: ↑ Dunstan, ↑ Dunstan
