The magazine of the Melbourne PC User Group
Build a "Quick and Dirty" Search Engine
with Python and PHP
Peter Dann
pdann@melbpc.org.au |
|
Occasionally I have found a need to locate items within a large collection
of HTML documents that I have been unable to search using a conventional search engine.
This article describes qdse, a "quick and dirty search engine" built to address this
need. The program provides full text searching of HTML documents that may be located locally on a user's
machine (for example, programming documentation), on a corporate intranet (for example, product information),
or on the Internet.
The version described here runs locally on the author's Windows machine, where it is used to search the
documentation for PHP, a server-side scripting language, stored on a locally installed Web server. The
program can easily be adapted to other uses.
While qdse is a useful tool in its own right, it also illustrates in one compact application, key
features of some powerful but freely available software and programming tools that could be of interest to
any Web developer, including:
- Python - the programming language qdse uses to create a
document index.
- mySQL - the database qdse uses to store the document
index.
- PHP - the server-side scripting language qdse uses to
interrogate the document index and present details of hits on a Web page.
T his article describes the overall design of qdse and explains key
sections of the underlying Python and PHP code. While this is not a full-blown tutorial for either language,
it should provide a useful insight into how these tools can be used to build a real Web application.
The Key Components
qdse consists of three components:
- A mySQL database which stores:
- the URLs and titles of all the documents to be searched
- a record for each unique word that appears in these documents
- records indicating which words appear in which documents
- An indexing program written in Python, which creates and populates the
tables in the document index database
- A Web page user interface where the user enters a list of terms for which
to search, and which displays the search result as a list of hyperlinked hits. This interface uses PHP to
interrogate the document index database and present the resulting hits to the user.
As mentioned above, mySQL, Python and PHP are all available at no cost. For
more details, see "Obtaining the core components" at the end of this article.
The Document Index Database
At the heart of qdse is the document index database that stores details of the documents to be searched.
In qdse, this database runs under mySQL, an open-source database program which is a popular choice for Web
development projects. mySQL understands standard SQL - a widely-used language for querying databases.
It is well-documented, is available in both Windows and Unix/Linux versions and integrates easily with PHP,
the language in which qdse's user interface is written.
mySQL does not implement some features associated with fully-featured databases, including support for
transactions and foreign key relationships, and is optimised instead for speed. These limitations are not a
problem in many Web applications, including qdse.
mySQL runs as a server daemon under both Windows NT and Unix/Linux. Users can communicate with the database
server using client programs such as MySqlManager that come with the application, as well as through directly
through code (for example, using Python and PHP).
Tables Structure
qdse's document index database consists of three tables.
Table "urls" lists the URL and title of each document the application has indexed, together with a unique ID
for each record. Once a site has been indexed and is ready to be searched, this table might contain entries
such as those depicted in figure 1 as follows:
| id |
url |
title |
| 1 |
http://localhost/phpmanual/appendicies.html |
Appendicies |
| 2 |
http://localhost/phpmanual/calling-user-functions.html |
Calling User Functions |
| 3 |
http://localhost/phpmanual/class.dir.html |
Untitled |
| 4 |
http://localhost/phpmanual/config-apache.html |
Apache Module |
Figure 1. urls table
Table "words" contains one entry
for each unique word found in the indexed documents. No word appears more than once, and each word has a
unique ID. Once a site has been indexed, this table might contain entries such as those in figure 2 as
follows:
| id |
word |
| 1 |
php |
| 2 |
manual |
| 3 |
appendicies |
| 4 |
table |
Figure 2. words
table
Table "words_x_url" shows which words appear at which URL. This table refers to words and URLs by
their unique IDs. Once populated, it might contain entries like the following:
| word_id |
url_id |
| 179 |
6 |
| 180 |
6 |
| 1 |
4 |
| 2 |
4 |
| 171 |
4 |
Figure 3. words_x_url table
The first entry shown above indicates that in this particular
document index database the word whose ID is 179 appears in the document whose ID is 6.
Querying the Database Manually
When developing a database application we often need to pass queries to the database manually, rather than
through a programming language. When running mySQL under Windows, an easy way to do this is use mySqlManager,
a database client program that comes with the Windows distribution.
For example, to find the word that has ID 179 we can pass an SQL query to the database. We type the query
into the Query tab of MySqlManager (Figure 4) then click the small triangular icon to execute the query. In
this particular database, the word whose ID is 179 is "virtual". Result shown in figure
5.
|

Figure 4. Entering an SQL query in MySqlManager.
The results are available in the Results tab.
|

Figure 5. Viewing the results of the query.
|
To get the IDs of all the documents this which this word
appears, we would execute an SQL query such as the following:
SELECT url_id FROM words_x_url
WHERE word_id = 179
The Indexing Program
So, how are the tables described above created and populated in the first place?
While we could enter the information manually using an application like MySqlManager, obviously this would
be time consuming and error-prone. What we need is a program that will do the work for us. This program
should:
- create the tables described
above
- identify all the HTML documents to
be indexed
- open each document in turn -
extract its title and a list of all the unique words in the
document
- insert a record in the urls table
for every document opened showing its URL and title
- insert a record in the words table
for each new word encountered
- insert a record in the words_x_url
table each time the program finds a word (in a document) not previously found in that
document.
Perl or
Python?
While there are many languages in which it would be possible to write such a program, two obvious candidates
for this task are Perl and Python.
Both of these so-called "glue languages" excel at processing large quantities of text. Both offer excellent
facilities for extracting information from one place (for example, from a file or Web resource) and placing
it somewhere else (for example, into a database table). Both languages are interpreted, rather than compiled,
which tends to speed development. Both languages are distributed under an open source licence and are
available for free.
Perl's strengths include its large user base and the extensive range of libraries available for performing
specialised tasks (for example, creating and manipulating PDF files). On the other hand, Perl's syntax is
notoriously obscure and this tends to make it difficult for anyone who is not programming in Perl fulltime to
maintain Perl code. Perl's support for object orientation is rather inelegant.
Python shares many of Perl's strengths, while offering some additional benefits, including clean support for
object-orientation. One unusual but surprisingly helpful feature of the language is that white space is
significant: the level of code indentation indicates how lines of code are organised into blocks. At first
glance this may seem like a trivial "advantage", but it is an advantage one quickly comes to appreciate when
actually working with the language. Add a variable naming convention that allows the use of plain text
variable names and the result is code that is pleasantly clean and easy to maintain.
Creating the Tables
The first task performed by qdse's indexing program 'qdse.py' is to create the database tables described
above.
Listing 1 shows the code that creates the words table:
Line 1 imports (ie. makes accessible to our program) a Python module called "MySQLdb". This module
provides us with objects and functions we need to access a mySQL database.
Line 2 declares a global variable that holds the name of our document index database.
Lines 4 to 13 define a function called "create_WORDS_table()". Assuming that we have already created
an empty database called qdse using MySqlManager, line 5 makes a connection to this database,
while
line 6 creates a cursor object we will use to pass queries to the database.
In lines 7 to 11 we construct the query that will create the words table, before executing the query
in line 12.
|

Listing 1. Python function to create a database
table.
|
It is worth noting that the table we have
created has an index for the "word" field (line 10). This dramatically improves the performance of the
indexing program when looking up or inserting records in the words table.
qdse uses code similar to that shown in Listing 1 to create the other tables the application requires.
Identifying the Documents To Be Indexed
In any search engine indexing program a key routine is the function which determines which pages are to be
indexed, and can therefore be searched by the user. Fully-fledged search engines use a range of different
approaches to this task. So-called "spiders", for example, crawl through page after page identifying links,
and then following these links to find yet more links, and so on.
The original version of qdse was designed to search 1100 documents on an intranet site, the titles of which
were listed, as hyperlinks, on a single page.
The version published here takes a similar (and equally simple) approach, identifying the pages to be
indexed, and hence made searchable, by parsing the virtual index page returned by the Web server when passed
a directory name.
On the author's machine the documentation for PHP is installed in a directory named "phpmanual", immediately
beneath the document root of a locally running version of the Apache Web server. Entering the URL
"http://localhost/phpmanual/" in a Web browser causes Apache to send a virtual index page listing all the
files in the directory "phpmanual". It is this virtual index page that "qdse.py" reads and parses to
determine which pages to index for searching. Listing 2 shows the
function responsible for this task.
|

Listing 2 Python function to extract list of
URLs from HTML page shows the function responsible for this task.
|
The square brackets at the end of
line 2 indicate that the variable "result_list" is initialised to hold an empty list. Lists are one of
Python's key intrinsic data types, and the language offers powerful list-manipulating facilities.
Line 3 constructs a path to our virtual index page, which we open in line 4. The code at
line 5 reads each line of the page into a list called "lines".
Lines 6 to 11 create a "regular expression object". Regular expressions are patterns used to find,
extract or alter content in a line of text that matches the pattern. Regular expression syntax is quite
complex, and I won't attempt to explain it in detail here. Suffice it to say that here we are looking for the
shortest section of text in any line which is preceded by a double quote mark (line 7) and ends with ".html"
(line 10).
If found, such text will be stored as a so-called "group" called "url" (line 8).
Line 12 sets up a loop allowing us to examine in turn each line of the virtual index page opened at
line 4.
Within the loop, line 13 tests whether there is any text present in the line which matches our regular
expression. If so, it stores information about the match in the variable "result". Line 15 extracts
the actual matched text and stores it in the variable "url". Line 16 converts the URL from a relative
pathname to a fully-qualified pathname.
The the function "get-urls_list()" is quite adequate for the purpose for which qdse was originally designed.
However, this is one routine it would be necessary to extend if adapting qdse to search a more complex
site.
Parsing Each Document
Once the indexing program has created a list of all the pages we wish to index, it needs to open each one in
turn and extracts:
Assuming the variable "urls_list"
contains the fully-qualified URLs of all the documents we wish to index for our search engine, the code in Listing 3, along with some helper function we will discuss shortly, does
the job.
|

Listing 3. Master routine for extracting title
and word list from HTML documents.
|
In lines 1 to 3 we open
each document in turn and read its entire contents into the variable "text". This variable contains the whole
of the document, including all HTML tags.
In line 4 we create an HTML parser object of class myParser(). This class, defined elsewhere in
"qdse.py", is derived from Python's HTMLParser class and extended to provide methods allowing us to read the
document's title and the text outside all HTML tages. (The implementation of these methods - not discussed
here - is surprisingly easy thanks to the elegant design of Python's native HTMLParser class, and requires
only a few lines of code.)
After feeding the text of the entire document to our parser object in line 5, we close the parser,
then use its "get_text()" method at line 7 to read all the text in the document outside the HTML tags.
Similarly, we call the parser's "get_title()" method at line 8 to read the document's title.
At line 11 we pass the text extracted from the document to an object of another user-defined helper
class called "text_splitter". This returns the parsed text as a list of unique words, after first performing
some filtering to exclude words in which we are not interested (including very long words, and common words
such as and and the).
Once we have all this information, we insert the relevant details into our database tables. The code to do
this is similar to the code used to create the database tables (although of course the actual SQL is
different).
The User Interface
Once the document index database is created and populated, we need some way of searching this information to
locate documents that contain particular words.
In qdse, users enter the terms they are searching for in a simple form, then click a Search button. The same
page displays a list of hits, as shown in Figure. 6.
|

Figure 6. The qdse user interface.
|
Behind the scenes, the work of constructing appropriate database queries and then interpreting and displaying
the results is done by PHP.
A Quick Introduction to PHP
PHP is a free, open-source server-side scripting language and environment. PHP is broadly similar in its
capacities to Cold Fusion and Active Server Pages.
To use PHP, an author embeds PHP-specific tags in an HTML page, as in Listing
4.
|

Listing 4. A simple Web page on the server,
including PHP tags.
|
When a PHP-enabled Web server encounters the tags "<?php" and "?>", it passes the code they enclose to
PHP for processing. Depending on the installation, PHP itself may be loaded as a Web server module (like a
.DLL), or it may exist as a stand-alone executable. PHP processes the code passed to it then typically
returns text to the Web server for inclusion in the page to be sent back to the user. PHP's syntax shares
many similarities with C.
Listing 5 shows the page from Listing 4 as it would be sent to a user's
browser. The user never sees the PHP code itself, only the results this code generates.
|

Listing 5. The same page, as received by the
browser.
|
Many different servers are capable of running PHP, including Apache, IIS4 and Netscape.
The Interface Code
qdse's user interface consists of two pages:
-
"search.php", a page of HTML code with embedded PHP tags. This page contains the form where the user enters
the terms to be searched for, and is the only page the user sees.
-
"search.php.inc", a page of PHP code which contains several helper functions called from
"search.php".
Listing 6 shows the code for the form in "search.php" where the user enters
the terms to be searched for.
|

Listing 6. The form where user enters search
terms.
|
In
line 1, the action of the form is set to the URL of the current page. This means that when
the user clicks the Search button, the current page will be redisplayed.
The interesting part of the form is line 5, which is enclosed in PHP tags. PHP makes the contents of
all form variables available as similarly-named program variables. In PHP, all variable names begin with "$".
After the user submits this form (causing the current page to be redisplayed), the variable $terms will hold
whatever value the user previously entered in the form field named "terms".
If the variable $terms does contain a value, the isset() function will return true, causing the
"echo" statement to send a stream of text to the browser populating the form's input field with whatever
values the user entered before clicking the Search button. The input field will therefore appear to retain
the text the user previously entered.
Listing 7, also part of "search.php", shows the master routine which
analyses the user's search terms, then calls a range of helper functions to interrogate the document index
database and display a resulting list of hits. (Some error-processing routines have been removed to make the
code easier to read.)
|

Listing 7. Master PHP routine for processing
search terms and displaying hits.
|
Lines 3 and 4 split the user's search terms entry into an array of separate words.
Lines 7 and 8 create a connection to mySQL and select the document index database.
Line 11 calls a helper function to create a temporary table where we will insert the IDs of all the
documents which match one or more of the user's search terms. (This function has a similar purpose to the
code in Listing 1, which created a database table from Python.)
At
line 12 we call another helper function to populate the temporary table. This function executes a
series of SQL queries, one for each search term the user has entered. For example, if the user has entered
the search terms "mysql password", this helper function issues the following
queries:
Next, the helper function "get_hits()" retrieves a result object containing the URL and title of each
document in which all of the user's search terms appears. To get this information, "get_hits()" executes a
query such as the following (adjusting the last line to match the number of search
terms):
Finally, at lines 17 and 18, we pass the result object to the helper functions "show_hit_count()" and
"show_hits()" to list the number of hits and print a hyperlinked list of document titles. Listing 8 shows the code for "show_hits()".
|

Listing 8. PHP function for displaying list of
hits retrieved from document index database.
|
PHP
offers several different ways of analysing the result of a database query. The method used here is to treat
each row in the result object as a row object, and then to use the C style object->property syntax
we see in lines 3 and 4 to access the data we are interested in.
Lines 5 to 7 ensure that if we have been unable to determine a title for any document we print its
URL instead. Line 8 actually writes the details of each hit to the page.
Obtaining the Core Components
mySQL, Python, PHP and Apache are all available in Windows and Linux versions.
To learn more and download your own versions, see the following addresses:
http://www.mysql.com/,
http://www.python.org/, http://www.php.net/, http://www.apache.org/
Source Code
The source code for qdse accompanies this article. The code consists of three files:
qdse.py, search.php, and search.php.inc which have been grouped into the one file
qdse.zip which is available for download from the link, and then the individual files can be extracted.
About the Author
Peter Dann, pdann@melbpc.org.au is a Melb PC member, Web developer and
technical writer.
Reprinted from the December 2000 issue of PC Update, the
magazine of Melbourne PC User Group, Australia
|