XML Databases Spreadsheets
Alternative title: How I stopped worrying about Databases
I have always been ambivalent towards databases. In my head I know they are great. You can store and retrieve data reliably, SQL is a standardised way to handle information and you can have all sorts of permissions to control who can access what. They can also be very reliable, reasonably fast, and yet... and yet there is just
something about them that slightly depresses me.
As an organisation dhp11 have written hundreds of applications that use databases. We've used [http://www.dbase.com dBase] (yes, it's still going!), Oracle, Informix, Ingres, Microsoft Access and more recently [http://www.postgresql.org/ Postgres], so by now we should be really comfortable with the whole idea.
But we're not! And I want to know why.
Maybe it has something to do with the type of work we do, which usually involves:
- Some (often a lot) prototyping
- Graphics
Databases are good when you know in advance what is going to go in them. I think they work less well in a prototyping environment where fields and tables are constantly being changed. Especially if you are trying to keep the database tables well organised [http://www.bkent.net/Doc/simple5.htm].
But graphics - why should that cause a problem?
The CAD package we use is called
MicroStation [http://www.bentley.com]. It is a good CAD package and very programmable. Unfortunately it has a very weak linkage to databases as illustrated in the diagram below.
To explain why I call the linkage 'weak' I'll have to digress slightly and describe how
MicroStation database linkages work.
MicroStation stores graphical elements in a
design file. Typically this has a '.dgn' extension, for example 'sample.dgn'. This binary file lists all the graphical elements. Normal element information covers stuff like the element type (line, point, ellipse etc.), symbology (colour, lineweight etc.) and of course the (x,y) coordinates - (x,y,z) if the element is 3D.
If an element is linked to a row in the database then essentially a row ID (called mslink) is added to the element. There is a bit more to it than this, but that is the principle.
There are some important consequences to using this method. For instance:
What do you do if one of the graphical elements is copied?
The example below shows the database linkage copied along with the rest of the element. Now you have two elements pointing to the same database row. Sometimes you might want to do this. But sometimes you might not. And sometimes you might forget and make a mistake.
Also, what happens if you delete a row from the database?
Now you have a graphical element with a database linkage that doesn't point to anything! And don't even ask about the problems of simoultaneous access and how you manage 'undo' in the design file and 'rollback' in the database.
None of these problems are insurmountable and there are various things you can do to configure
MicroStation to make it harder to mess things up. However you can never be totally confident that the design file and the database have not got 'out of sync' in some way. Most systems that use
MicroStation and databases devote a considerable amount of time and energy to checking the validity of the database linkage and working out what to do if the two do get out of sync.
I have to make it clear that it is perfectly possible to create a robust and useful database aware application with
MicroStation, but with the following caveats.
- The system must be as self contained as possible. Ideally every procedure that affects the graphics must also be aware of any implications on the database. And vice versa. The more procedures there are that are not totally aware of the 'other half' of the system, the more likely it is that things will go wrong.
- Unless you are working in a heavily managed environment (like the Intergraph FRAMME product) you must be very careful about allowing simultaneous access to either the design file(s) or the database.
What these restrictions mean is that you either end up with:
- A vast and hugely-complex managed system (like FRAMME) where user-led changes are possible, but difficult and time consuming
- A simpler turn-key system where user-led changes are impossible.
In both cases there is a tendency for the system to become isolated from the rest of the organisation because it is so difficult to get data in and out. This is not a problem if you have decided in advance exactly what the system is going to do, but it is a problem if you want a system that is capable of gradual evolution rather than revolution.
A lot of these points are not restricted to database-driven
MicroStation based systems. Many of these problems are evident in other systems to a greater or lesser extent. I just happen to think that the weak
MicroStation database link exacerbates some of the problems inherent in designing and using complex multi-user graphical systems.
For about ten years I struggled with my head telling me that databases were good and should be used, while my heart told me that something wasn't quite right. There was also other evidence.
Spreadsheet v Database
Many times when we have been speccing a new project we've been handed a spreadsheet similar to the one below.
My heart used to sink at this point. I knew I was about see all kinds of hideous crimes against data. I'll point out just a few of the main problems:
- We're mixing two types of data: pipes and valves. They should be in separate tables.
- Some of the fields refer to pipes, some to valves and some to both. We can guess which are which, but it's not always entirely clear.
- The values in the (valve only?) rotation column are not standardised. We have 'ACW' and 'clockwise. I would have expected 'ACW' and 'CW' or 'anti-clockwise' and 'clockwise' - not both. And what is there an 'n/a' in the rotation column for pipe 315? Should all pipes have n/a?
- The Diameter column (pipes only?) has a mixture of units. Some imperial and some metric.
- The date column has a mixture of formats. Is it OK for a valve to have a survey date accurate to a month?
- The surveyor name looks suspiciously like it refers to the same person.
- The length column has 'Unknown', 'n/a' and blank values. Is this correct?
- Just what is the 'Cap6' column and why does it contain comma separated values? Why are they not in a separate column?
OK. I'm exaggerating. But not much. Most spreadsheets are stuffed with problems like these, and yet they still get used - why? Surely it would be better to put this data into a database. Like the one below.
Database Pro's and Con's
Pro's
- The data is well defined so in theory it should be easier to work with. But see 'Default Values' below.
- All the columns have default values. So it is now impossible to 'fiddle' some awkward data in there. Is this good or bad?
Con's
- One spreadsheet has become five tables.
- I was going to say that changes to the database are more controlled, and make this a 'Pro' for the database. But when I thought about it I realised that database changes are not inherently more controlled at all. Changes are made in a controlled way, but only because they are more difficult to do and can easily cause problems if they are done badly.
- Following on from the previous point, the spreadsheet could be amended 'on the fly'. To change the database you will probably have to edit a number of tables, or worse - some SQL scripts.
- Adding data is not as easy. For example adding a new survey or surveyor is now a non-trivial task.
- Instead of the water engineer being responsible for the spreadsheet you now need a db administrator.
So what have we got in favour of databases? Well it seems to me that from an 'IT Administrator' point of view a spreadsheet is chaotic and fragile and a database is organised and safe. From an 'end user' point of view a spreadsheet is flexible and accessible and a database is complicated and remote.
I'm not suggesting that a banking system should be run on a series of spreadsheets (although historically that is exactly how it was done - on paper to boot!) but I am suggesting that for small to medium scale applications with a high probability of future change maybe a database is not a good solution.
The last database 'Con', about needing a db administrator is worth condsidering in more detail. Most reasonably bright people can pick up the concept of a spreadsheet quickly enough. It's rows and columms, a format we seem to be mentally well tuned to. It's not too big a step to then follow the logic of splitting down a single spreadsheet into tables, as in the example above. Where it all starts to go pear-shaped is when we have to stop considering the data, i.e. tables and fields, and start considering the database, i.e. permissions, views and performance.
When we process a single database table (or a single spreadsheet) that has a thousand rows, and we add up the (for example) 'length' fields to find the total length of a network, the operation takes a specific amount of time. If we double the number of rows we expect the operation to take roughly twice as long. By and large we are correct in our assumption. This is good. The results correspond with our mental model of the process. We can make informed decisions about the likely impact of changes in data volume or the processing procedure.
It is my experience that when databases become more complicated and involve 'views' and 'joins' as data is consolidated and combined from many tables, our 'mental model' is overwhelmed and we can no longer correctly anticipate the effect of our changes.
This is when you need a db administrator with a thorough knowledge of your chosen database, not to actually enhance the system, but just to minimise the performance degredation. Conflicts arise between the db administrator (tasked with maintaining data integrity and performance) and the users (who want more functionality and certainly no loss of performance) because the users are unable to anticipate the consequences of the changes they request. After all, if they
could anticpated the consequences there would probably be no need for a db administrator.
All the work the database vendors put into query optimisation and 'smart' tools for database design cannot alleviate this situation. They help, up to a point, but in the end it just makes it harder for people to anticipate the effect of changes, and distances them from the data.
Default Values
Databases have limited ways of easily dealing with missing data. You can specify the 'type' of a field (integer, double, character, date etc.) and possibly some kind of format and maybe a default value. You also get one other choice, whether or not to allow 'NULL' values. A NULL means "there is no data". If you say that a field cannot have NULL values then it must be populated.
A typical use for this would be pipe lengths. If we want to force the length of pipes to be recorded (surely a good thing?) then we specify that the 'length' field must be non-NULL.
The diagram below show the length fields in the original spreadsheet and the subsequent 'Pipe' table. In this table NULL's are allowed for pipe length.
Even if NULL's are not allowed there will be real-life situations where the pipe length is unknown. The common solution is to enter zero as the pipe length. So even though the database has been set up to force pipe lengths to be entered, we now have zeroes in the table - representing an 'unknown' length.
At least in the spreadsheet you can enter 'Unknown'
A way round this is to have an additional column that somehow qualifies the the length.
I've gone to town on this table and added a 'units' column to please the db administrator and upset the engineers.
Although this approach to data is thorough and consistent, every additional column makes the data harder to 'get at'. What we really need is a way to combine the rigour of the database and the flexibility of the spreadsheet.
The Disadvantages of Spreadsheets
So far spreadsheets have sort of come out ahead of databases. Not because they are especially good, but rather because I have concentrated on the deficiencies of databases. Now is the time to redress the balance by pointing out some of the problems with spreadsheets.
The first problem comes directly from the feature that makes them so easy to use. They are two dimensional. Rows and columns, that's all you get. Yes, spreadsheets can have multiple pages, and it is possible to link them together. You can even link spreadsheets in different files, but all these methods soon get too hard to manage and you quickly lose the advantages of simplicity and accessibility.
The next problem is maintainability. Spreadsheets are a combination of application and data. Where a database focuses on the organisation and integrity of data, a spreadsheet mixes the storage of the data with the application. This makes is quite difficult for a spreadsheet to do more than one thing, and the more macro's or programming code you add to a spreadsheet the worse it gets. They are almost impossible to document and because everything takes place behind the scenes it is hard to be sure what is going on.
From my point of view as a software developer the most fundamental problem is getting data in and out. We generally transfer data via text files of 'comma separated values' (CSV), but this is not really the same as accessing the data in the spreadsheet directly. At least with a database you can use SQL as a standard way of manipulating the data.
Oh, and they are usually slow as well.
These few reasons are serious enough to mean that for more than 99 out of 100 projects that involve significant amounts of data we use databases (with all their faults) rather than spreadsheets.
If by pointing out all the problems with databases I have previously given the impression that spreadsheets are great and databases are useless then what I really mean is that even with all their problems databases are still really the only game in town for
storing data. Manipulate it in a spreadsheet, but store it in a database. Sort of.
Database Nemesis Part I
For years I harboured thoughts of "The Ultimate Database Driven CAD System". All we needed was the right project, the right budget and someone with enough vision to see it through. As the Chinese proverb says: "Be careful what you wish for in case it comes true".
My dream came true when we started work on a system to help design the Overhead Line Electrification (OLE) for the West Coast Mainline. This was a joint venture between WS-Atkins and Balfour Beatty. It was called the WSABBJV, short for WS-Atkins Balfour Beatty Joint Venture.
We were lucky enough to have a team of people that really knew their stuff. We knew
MicroStation inside out and the WSA and BB people knew both
MicroStation and the electrification design process. There were also a number of key people keen to try and keep as much of the project as possible under the control of a database.
Like any big project there was a mass of information to gather and collate but while this was underway we used Rapid Application Development to work on various sub-projects that would eventually be integrated into the whole.
I'm not going to go into the details, it would take too long, but the project was a success. For a development cost of about £200K the completed system saved about 2 million in design time. However, we never did store the information in a database.
Even though we had two and sometimes three people working on the database definition they could not keep up with the changes pouring out of the RAD programme. The RAD tools used spreadsheets and CSV files 'for convenience' and they were so convenient that we kept on using them. When the engineers needed data from the drawings, we exported dimensions, materials etc into CSV files that they sucked into spreadsheets of their own design. If their spreadsheet calculations changed the drawing dimensions they they exported data from the spreadsheets and we manipulated the CAD drawings.
Yes, the drawings and the spreadsheets got out of step occaisionally, but as long as someone knew which was the definitive version it was a simpe job to synch the other one.
Massive project, thousands of drawings created by hundreds of people. Ideal territory for a database. But it didn't happen.
But wait, there is another project on the horizon...
Database Nemesis Part II
What happens when one company buys another similar one? If they have duplicated computer systems they usually do some kind of comparison and then dump one of them. Sometimes they use this opportunity to dump both of them and adopt an entirely new one.
When Western Power Distribution (WPD) bought SWALEC they both had
MicroStation based systems. WPD had a totally 'CAD' based system, basically a huge collection of drawings. SWALEC had a hugely expensive Intergraph FRAMME system, i.e. CAD drawings, a database, and a set of 'rules' that defined the network.
Maybe this was an opportunity for WPD to 'upgrade' their CAD system into what might be termed a Geographic Information System (GIS)?
Not a bit of it. The amount of useful information in the FRAMME database was negligible. All it did was make accessing the data slow and difficult. The conversion tools we wrote for WPD ripped into the FRAMME data and spat it out as processed CAD files that the WPD system could use directly. From an IT perspective we were the barbarians at the gates, sacking the database and destroying the imperial rule base. OK, you can carry an analogy too far (or get carried away with an analogy) but apart from hammering another nail in the database coffin, we did end up simplifying the system and making the data more accessible.
Darkest Before the Dawn
For years we used databases without really liking them. Our customers kept prototyping things and running applications using unmaintainable spreadsheets. It looked like being aware of the deficiencies of both approaches and developing work-arounds was as good as it was going to get.
But almost without us noticing a 'third way' was creeping up on us. Luckily it had nothing to do with Tony Blair. Instead it was... XML.
I've done an XML Tutorial [http://192.168.1.10/cgi-bin/site04/wiki/wiki2.pl/XML] elsewhere, so I'm not going to go into much detail here. But I've redone the spreadsheet and the database in an XML form below.
300
34.8
PVC
1
6
Clay
300
12.0
PVC
2
Robert Smith
1999-05-01
CW
Robert Smith
1999-05-01
ACW
Robert Smith
1999-05-01
If you haven't seen any 'raw' XML before this probably looks confusing. Narrow minded people that are resistant to change are often frightened by new things and take an immediate and instinctive dislike to them. That doesn't apply to anyone here does it?
Let's look at it one bit at a time.
Parents and Children
If spreadsheets are defined by rows and columns, and databases are defined by tables and key-fields, then XML is defined by parents and children.
In the example above, the 'top' element (called the 'root' element) is
. The children of this element are and elements. So straight away we've made an explicit relationship that a network contains pipes and valves.
Here is a element with four children. Look how well the diameter/units issue is solved quite neatly. And the length 'measurement quality' issue:
300
34.8
PVC
1
Missing values? No problem. See how easy it is to define the length of this pipe as 'unknown' without compromising any of the other fields:
6
Clay
It works for me. But there is a downside.
Verbose and Repetitive
Some people say XML is verbose. There is simply too much of it. "I can't see the wood for the trees" is a common complaint.
This type of fragment drives some people crazy:
2
All that text for a simple value? What a waste of space. OK, I can't disagree. In a database it would be a single field and in a spreadsheet or CSV file you would just have a single digit and a comma.
But you can't deny that it is readable.
But what about this:
Robert Smith
1999-05-01
...
Robert Smith
1999-05-01
The survey information is exactly the same and it is repeated all over the place. Normalised this data is not!
Again, I can't disagree. The data is repeated and it isn't normalised. However, I like to think of a particular XML file as a 'view' of the data. In the example above the view is 'network centric'. We are looking at the data from the point of view of the network, so we're interested in the attributes of the pipes and valves.
I could instead construct a 'surveyor centric' view:
Robert Smith
1999-05-01
300
12.0
PVC
2
CW
ACW
This last bit is an important point. XML is just a way of organising data. But the way it is organised has a big effect on you can do with it, and how easily. I think that XML allows us to organise data more flexibly and therefore makes it easier for us to do the things we want to do with it.
A work in progress. To be continued...