Statisticus, Why and How?
Part V - Data Normalization
by Hakan Eskici
5. Data Normalization
We need to find a way to minimize the storage for our data.
By analysing the raw data inserted into the database, we
can see that some field values are repeated many times. For
example, Path field values only includes the page names in
our site. But the records in the Stats table have many of
repeated path names.
So let's give unique ID's for each path.
Let index.asp has the PathID of 1, so when there is an access
to index.asp to be inserted into our Stats table, we will insert
the ID 1 instead of page name.
Now we have a PathID field instead of Path field. PathID field
will have unique numbers each corresponding to a different
page name.
Here is a question now. How are we going to know which
path name corresponds to which PathID? We will have another
table called Paths, having the fields PathID and PathName.
For our example;
Table Paths:
PathID PathName
------ --------
1 index.asp
2 page1.html
3 login.asp
. ...
By linking the Paths and Stats tables togetger, we will reduce
the number of bytes to store the same data.
Table Stats:
StatID PathID ... (other fields)
------ ------
1 1
2 3
3 1
4 2
5 1
So we see that for stat record #1, the requested path ID is #1
which corresponds to index.asp.
Now here is another question. How are we going to manage the table
Paths if we have lots of pages? We won't! We will add some more code
for automatic management of Paths table.
The idea is simple, whenever there is a request for a page, we will
first look for its name in the Paths table. If the page has not
been requested before, we will add a record for it and get the
newly insterted record's PathID for a reference. If the page has
been requested before, we already have a PathID for it.
We will write a function called GetIDPath for this purpose;
Function GetIDPath(sName)
'Construct SQL Query
sSQL = "SELECT PathID, PathName FROM Paths"
sSQL = sSQL & "WHERE PathName = '" & sName & "'"
'Open the recordset
rs.Open sSQL,,,adCmdTable
if rs.RecordCount = 0 then
'If there is no record, add one
rs.AddNew
rs("PathName") = sName
end if
'Whether it's found or added, return ID
GetIdPath = rs("PathID")
rs.close
end function
We will use similar functions for Browser, Referer, Color, Screen
resolution and Operating system. These functions are almost similar
to each other except that the tables they access differs.
But let's first make another improvement to our database structure
by adding one more field to each of the Browsers, Referers, Colors,
Resolutions and Operating Systems tables.
The field is called "Total" which will store the number of times
accessed for a specific record. Therefore, we will have a easier
way to learn the access statistics.
Although this could already be done by executing a SQL query which
has COUNT function; using Total field method presents less overhead
for execution time.
Now, our GetIdPath function will look like:
Function GetIDPath(sName)
'Construct SQL Query
sSQL = "SELECT PathID, PathName FROM Paths"
sSQL = sSQL & "WHERE PathName = '" & sName & "'"
'Open the recordset
rs.Open sSQL,,,adCmdTable
if rs.RecordCount = 0 then
'If there is no record yet, add one.
rs.AddNew
rs("PathName") = sName
end if
'Increment the total access
rs("Total") = rs("Total") + 1
rs.Update
'Whether it's found or added, return ID
GetIdPath = rs("PathID")
rs.close
end function
We are almost ready for our Statistics application. We
collected a bunch of data from the visitors, now it's
time to analyse it to extract useful information.
In the next part of the article, we will discuss the
methods to convert data into information.
Part I - Introduction
Part II - Measurement Tools
Part III - Simple Statistics
Part IV - Adding More Fields
Part V - Data Normalization
|