Chris Sacksteder, April 2006, May 2008
PALS was designed in the spring of 1995 and put into production that fall semester. It has undergone many changes over the years and has been the System Development Group's largest programming effort. Besides the core servers, there are several databases, many utilities and batch jobs, and several web-based programs. Originally written in Visual Basic 3, the server is now partly Visual Basic 6, with essential database functions in VB.Net (Framework 2.x). Client programs that send "login" and "logout" commands to Pals are written in C.
PALS provides two basic services: printing control and accounting, and session logging. Printing control includes the ability to limit printing to X pages per semester, or disallow printing altogether. Session logging means every logon and logoff to every computer with a PALS client is recorded in a central database, allowing management to review computer usage by time, room, and platform.
PALS version 3.x was put into production on March 8, 2006, and represents the largest change to the server code. Prior to this version, the two production servers maintained their own private databases containing records for every user and records for every IP address. A daily transaction log of (nearly) every command verb is also maintained; these logs were scanned daily to load data into an MSSQL database and were also combined and sent to the ITS accounts system for print job billing. The two Pals servers kept their databases synchronized in an awkward way, and schema changes were not easy to do. With the need to implement a new method of granting printing credits and a more robust way for multiple servers to have access to the same information, we decided to keep the dynamic user and IP address data on the SQL server.
In version 3 an "asynchronous distributed disconnected database" model is used. While our SQL Server is on a 2-node server cluster, and is therefore almost always available, we do not want even short outages to affect end users. Database mirroring in SQL 2005 was considered, but even the few seconds it takes to redirect a client to the mirror server is too long for PALS, which is processing up to 40 transactions per second and, because the base code is still in VB6, is not multi-threaded. While we intend to rewrite the whole server in VB.Net, as a transitional stage some parts were developed in VB.Net.
To tolerate a loss of connectivity to the database server, the PALS servers keep local copies of the two essential database tables and keep the data synchronized with background tasks. One table consists of a record for each user who prints (or attempts to print) during the current semester; the other has a record for each IP address for which a login or logout request has been received. There is a separate DLL that manages each of these tables, keeping the corresponding table on the SQL server current and loading changed records from the SQL server.
A login request changes the state of the record in the cached table (an instance of a DataTable object) and queues a task to execute a SQL stored procedure that makes the same change in the corresponding table on the server, adds a session record to the "Logins" table in the Pals database, and updates yet another table in our Inventory database if the IP address is one of our lab computers. A logout does a similar thing. The task queue is processed asynchronously on a timer (2 seconds as of this writing, but adjustable). If the database transaction finishes without error, the task is removed from the queue; otherwise it remains and will be retried the next time the timer fires.
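The queue-and-retry idea is independent of language. Here is a minimal sketch of it in Python (the real code is VB6/VB.Net; the names `DbQueue` and `execute` are invented for illustration, and the timer is replaced by explicit calls to `process`):

```python
import time

class Task:
    """A queued database transaction with just a few attributes."""
    def __init__(self, key, function, args):
        self.key = key            # unique id, numbered from 1 up
        self.function = function  # command verb, e.g. "Login" or "Logout"
        self.args = args
        self.added = time.time()

class DbQueue:
    """Pending database transactions, retried on each timer tick until they succeed."""
    def __init__(self, execute):
        self.execute = execute    # callable(function, args) -> True on success
        self.tasks = {}
        self.next_key = 0

    def add(self, function, args=""):
        self.next_key += 1
        self.tasks[self.next_key] = Task(self.next_key, function, args)

    def process(self):
        """Called on a timer (every 2 seconds in PALS)."""
        for key in list(self.tasks):
            t = self.tasks[key]
            try:
                if self.execute(t.function, t.args):
                    del self.tasks[key]   # success: remove from queue
            except Exception:
                pass                      # leave queued; retried next tick
```

A failed transaction simply stays in the queue, which is all the fault tolerance the design needs.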
Thus far, this has worked very well for tolerating network outages and other (usually unexplained) errors connecting to the SQL server. There were some problems in the first few weeks, which have been resolved.
The Background Processing
This was done as simply as possible. A task queue is a collection of Task objects with just a few attributes, e.g.:
Dim WithEvents tmrDbTrans As New System.Timers.Timer
Dim TaskKey As Integer = 0 'make keys numbers from 1 up
Dim Queue As New Collection
Private Class Task 'for a collection of queued tasks
Public Key As String
Public sFunction As String
Public Args As String
Public dtAdded As Date
End Class
Adding a task to the queue is also simple:
Private Sub AddTask(ByVal sFunction As String, Optional ByVal Args As String = "")
Dim T As New Task
TaskKey += 1
T.Key = CStr(TaskKey)
T.sFunction = sFunction
T.Args = Args
T.dtAdded = Now
Queue.Add(T, T.Key) 'key the collection by the task's key
End Sub
When the timer tmrDbTrans is fired, the queue is processed (if Queue.Count > 0):
For Each T As Task In Queue
Try
Select Case T.sFunction
Case "Login" 'verb names here are illustrative
If db.Login(T.Args) > -1 Then RemoveTask(T)
Case "Logout"
If db.Logout(T.Args) > -1 Then RemoveTask(T)
. . .
End Select
Catch ex As Exception 'task stays queued; retried next cycle
End Try
Next
Notice that a single database connection is open while the queue is processed. The RemoveTask is simply:
Private Sub RemoveTask(ByVal T As Task)
Queue.Remove(T.Key)
End Sub
I'm not showing everything here, but you should get the idea. To help know what is going on, each DLL has a Status property which returns a string with a bunch of labels and counters or date/times. An administrative client can get that information from PALS. For example, here is the output from the one that handles the user table:
PalsUsers.dll Version: 1.2.2277.27661
Record count: 33365
Users added: 1115
Last load time: 1063
Update tasks: 17065
Records from last update: 12
Last update: 04/12/06 17:10:17
Newest record date: 04/12/06 17:10:18
Newest record updated: 04/12/06 17:10:18
Total records updated: 225355
Logging level: 1
Log file path: P:\Pals\AdminLogs\
Queue interval (ms): 2000
Queue size: 1
DB tasks: 206609
Autoupdate interval (seconds): 60
Unique print job list: 202320
Duplicate print jobs: 21
No printing user list: 30455
No printing user list loaded: 04/12/06 16:52:52
Database status: Opens: 133552 Errors: 2315
This information is useful only to the programmer; in fact, it is essential. One interesting feature is that parameters such as the queue processing interval and logging level are defined in local control files but can be changed remotely. Here we see that in about two weeks of running, it has added 1,115 new users, done 206,609 database updates, and made 133,552 connections to the database server. That's not all that much; the load is fairly light and does not impact the server much. All changes to the database, and loading changed records into the local DataTable cache, are done with stored procedures. A method to reset the counters would be advisable.
These numbers bring to mind some other features of this DLL. All the clients retry the backup server and the primary server a few times if they don't get a reply, so with occasional network glitches there are duplicate login, logout, or print requests from clients. While the database table for recording print jobs has a unique constraint to prevent duplicates, we'd rather not process duplicates at all. So there is a list of print jobs, shown above with 202,320 records, which is used to ignore duplicate print jobs (21 at that time). That list is a StringDictionary, found in System.Collections.Specialized, one of my favorite things in the framework.
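The duplicate filter amounts to a remembered set of job ids. A sketch in Python (the real code uses a StringDictionary; a Python set plays the same role, and the class name is invented):

```python
class PrintJobFilter:
    """Remembers print job ids already seen this semester; duplicates caused
    by client retries are recognized and ignored before any database work."""
    def __init__(self):
        self.seen = set()
        self.duplicates = 0   # counter reported in the Status output

    def is_new(self, job_id):
        if job_id in self.seen:
            self.duplicates += 1
            return False      # duplicate: ignore entirely
        self.seen.add(job_id)
        return True
```

The unique constraint on the SQL table remains as a backstop; this just avoids doing the round trip at all.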
Another wrinkle is that the user table has only users who have printed or tried to print, which at 33,000 is much smaller than the total number of user accounts at Penn State, around 150,000. A large number of these (30,000+) are temporary accounts or other accounts that can never print. Because the stored procedure on the database server "decides" who can print and how much, a new user would have to be added to the database before PALS can know whether they can print. But since the database transaction is delayed a second or two, it was simpler to keep a list of the "never can print" accounts and let all other new users print at least the minimum 110 pages than to keep every user's status locally. "Correct" new user records are loaded locally within the "autoupdate interval", which is 60 seconds as of this writing. If a smart student reading this thinks there might be a hole here, guess again: a new user record is added to the local DataTable before it is added to the SQL table, and print jobs cause pages to be subtracted from it. You can get your 110 subsidized pages, but no more until your "real" record is loaded.
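That provisional-credit logic can be sketched as follows (Python for illustration; `UserCache` and `can_print` are invented names, and the 110-page subsidy is the value from the text):

```python
SUBSIDY_PAGES = 110   # minimum semester subsidy from the text

class UserCache:
    """Local user table: a 'never can print' list, plus provisional records
    for new users the SQL server hasn't confirmed yet. The provisional
    record is debited immediately, so there is no free-printing hole."""
    def __init__(self, no_print_accounts):
        self.no_print = set(no_print_accounts)
        self.users = {}   # userid -> pages remaining

    def can_print(self, userid, pages):
        if userid in self.no_print:
            return False
        if userid not in self.users:
            # Unknown user: grant the minimum subsidy now; the "real"
            # record replaces this within the autoupdate interval (~60 s).
            self.users[userid] = SUBSIDY_PAGES
        if self.users[userid] < pages:
            return False
        self.users[userid] -= pages   # subtract pages from the local record
        return True
```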
In the two weeks that this has been running, PALS has survived two network outages. One was about 3 minutes long, during which the SQL server could not be reached at all; that is when the bulk of the 2,315 errors shown above were logged. During that outage Pals did not slow down at all, and the outage did not affect any clients. All the queued database transactions executed without error when network connectivity was restored. How can there be 2,315 errors logged in 3 minutes when the queue is processed only once every 2 seconds (or 90 times in 3 minutes)? Well, the current version will retry the open for every queued transaction; at the beginning of the outage there were 8 tasks queued, and after 3 minutes there were 62. There's no harm in trying.
We ran into a few interesting problems that we could not find documented on the Internet, but for which solutions or circumventions were eventually found.
Framework 2.0 DataTable Load ConstraintException
New in the 2.0 Framework are features for the DataTable class for disconnected database access. These are similar to the features of the DataSet class in 1.1, but simpler and easier to use.
On startup, data is loaded into a new instance of the DataTable with the Load method, which takes a DataReader as a parameter. Simple enough. Updates (changed records) were added with the Load method on the same table and the load option OverwriteChanges. To make a long story short, we added an event handler to determine the date/time of the most recent changed record, so that the next query to load changes could select only records changed since the last update.
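The incremental-sync idea, stripped of the DataTable machinery, looks like this (a Python sketch; `fetch_changed_since` stands in for the stored procedure that returns rows changed after a given timestamp):

```python
def load_changes(cache, fetch_changed_since, last_update):
    """Pull only records changed since the previous sync, upsert them into
    the local cache (like LoadOption.OverwriteChanges), and remember the
    newest change timestamp seen so the next query can start from there."""
    newest = last_update
    for key, data, changed_at in fetch_changed_since(last_update):
        cache[key] = data
        if changed_at > newest:
            newest = changed_at
    return newest
```

Each sync is cheap because it touches only the handful of records that actually changed in the interval.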
This all worked fine in testing, even with artificial loads. But in production with a real load, the DataTable.Load started throwing a ConstraintException for no obvious reason (the table had no constraints and had only a PrimaryKey assigned). Sometimes the problem followed a TCP error, sometimes not. Code added to check DataTable.HasErrors and then use DataTable.GetErrors showed all rows had "errors". Both DLL's did this randomly on both tables. Reloading the data into a new table cleared the errors, but reloading into the same table did not.
After two weeks of many changes but no solution (things that did not help: using SyncLock in the calling class to be sure only one thread accesses the DataTable at a time, adding DataTable.BeginLoadData and EndLoadData calls, getting rid of the DataRowChangeEventHandler, changing to LoadOption.Upsert, and (re)moving the DataTable.AcceptChanges call), the code was changed to load changed records into a new DataTable and then use the DataTable.Merge method to incorporate them. This is about 100ms slower on average, but the updates are still quick enough and work almost all the time, except for infrequent NullReferenceExceptions (which later disappeared, not sure why). My conclusion is that DataTable.Load with existing rows and a PrimaryKey set is a little buggy.
DataTable Load Does Not Load All Records, No Error
Another weird problem was that sometimes, after these constraint exceptions on DataTable.Load, reloading the table into a new DataTable resulted in fewer rows (sometimes a whole lot fewer) and no errors thrown at all. This was always after a connection timeout, a TCP error, or a "Severe internal error". The fix was to query the number of rows in the SQL table, and if the rows in the DataTable were fewer, leave the reload task queued to try again.
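The guard is just a row-count sanity check before accepting a reload. A sketch (Python; the two query callables stand in for the full-table load and a `SELECT COUNT(*)` against the SQL table):

```python
def safe_reload(query_all_rows, query_row_count):
    """Reload guard: if a reload silently returns fewer rows than the SQL
    table reports, treat it as failed and signal the caller to leave the
    reload task queued for another try."""
    rows = query_all_rows()
    expected = query_row_count()
    if len(rows) < expected:
        return None        # short load: keep the task queued, retry later
    return rows
```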
We were getting regular TCP errors at a specific time each morning. It turned out a batch job was updating one of the tables that PALS was trying to update, but it was using a rather inefficient method. That was fixed and the problem went away. It is not known why a more appropriate SQL error, such as a timeout or a deadlock expiration, was not received.
Severe Internal Error
Sometimes we would just get "Severe Internal Error" on a database query or update. A Database class inside these two DLL's was creating and keeping a single SqlConnection object, and reusing it. We found that after any error, the SqlConnection object should be disposed and a new one created. That fixed this problem.
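The fix amounts to: reuse one connection for speed, but discard it on any error instead of trusting its state. A Python sketch of that policy (`Database`, `connect`, and `work` are invented stand-ins for the real class, SqlConnection creation, and a command execution):

```python
class Database:
    """Keeps one reusable connection, but throws it away after any error.
    The next call creates a fresh connection rather than reusing a
    possibly broken object."""
    def __init__(self, connect):
        self.connect = connect   # callable returning a new connection
        self.conn = None

    def execute(self, work):
        if self.conn is None:
            self.conn = self.connect()
        try:
            return work(self.conn)
        except Exception:
            self.conn = None     # dispose on error; recreate next time
            raise
```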
Both DLL's use stored procedures to change or add records on the server side. These functions sometimes result in DataReader errors even though no DataReader is explicitly used. I switched from SqlCommand.ExecuteScalar to ExecuteNonQuery, expecting that the scalar return value was implicitly using a DataReader. That seemed to get rid of some errors, but then those functions started throwing a lot of "There is already an open DataReader associated with this Command which must be closed first." It seemed that the ExecuteNonQuery sometimes doesn't finish even when the function it was created in does. Assigning the return value from ExecuteNonQuery (the number of rows affected) didn't seem to help. Adding a call to the command's Dispose method seems to help.
Here is another amusing error message:
ExecuteNonQuery requires an open and available Connection. The connection's current state is open.
So either the connection was not open at the time ExecuteNonQuery was called but became open after the exception was thrown, or the connection was open but not "available". Not sure what "available" means. I'm not sure whether it helped, but the functions that open and close the connection were changed to wait for the connection state to change. I'm inclined to think that getting rid of the SqlConnection object after an error is what fixed this.
Nevertheless, a little function to wait for a specific connection state looks like:
Function ConnectionWait(ByVal State As ConnectionState) As Boolean
'Wait for this state to be reached, or timeout.
Dim nLoops As Integer = 0
Dim tStart As Integer = timeGetTime()
Do While cnn.State <> State And nLoops < 100
nLoops += 1
System.Threading.Thread.Sleep(10) 'pause between checks; the original delay is not shown, 10ms is assumed
Loop
If nLoops > 1 Then LogError("Connection " & State.ToString & " took " & timeGetTime() - tStart & "ms.")
Return (cnn.State = State)
End Function
With all these background, delayed database transactions, we have a problem running Pals (the VB6 application) as a service or a generic cluster application: we don't want to shut down the process without telling the DLL's to finish processing everything in their queues. A privileged verb "SHUTDOWN" was added to PALS, and a similarly named method was added to each DLL to turn off the timers and immediately process queued transactions.
Then a service called PalsMom was written in VB.Net (which is easy to do, compared to one in VB6). It launches Pals.exe when it is started and keeps the process ID handy. When the service is stopped, it sends PALS the SHUTDOWN command and waits for the process to terminate. This service is installed on each physical server. On a cluster, a "generic service" is defined; this allows the cluster resource to be taken offline, or failed over to another node, without "pulling the rug out from under" PALS. Works great.
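The supervisor's stop sequence is the important part: flush the queues first, then let the process die. A language-neutral sketch in Python (`spawn`, `send`, and `wait` are invented stand-ins for launching Pals.exe, the PALS command channel, and waiting on the process):

```python
class PalsMom:
    """Supervisor sketch: launch the server and keep its pid; on service
    stop, send the privileged SHUTDOWN verb so queued database transactions
    drain, then wait for the process to exit."""
    def __init__(self, spawn, send, wait):
        self.spawn, self.send, self.wait = spawn, send, wait
        self.pid = None

    def start(self):
        self.pid = self.spawn()            # launch Pals.exe, keep pid handy

    def stop(self):
        self.send(self.pid, "SHUTDOWN")    # stop timers, flush the queues
        self.wait(self.pid)                # block until the process exits
```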
Another little "gotcha" is that DLL's written in .NET and called from VB6 via COM Interop need to be registered with RegAsm.exe, and every time a DLL is recompiled it must be registered again. Just prior to production, and anticipating problems that would require updating the production servers on the fly, PalsMom was enhanced to check and update the .NET DLL's and, if they were updated, run RegAsm.exe to register them. It also leaves a bread crumb behind, so that if the DLL's were new and the cluster resource was failed over to another node, PalsMom would know to register the DLL's on that node too. (Defining shared registry keys for the cluster resource would be difficult because the ClassId's change with every recompile; later the ClassId's were statically set, which is fine as long as the interface doesn't change.)
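The check-and-register step can be sketched as a timestamp comparison against a stamp file (the "bread crumb"). Python for illustration; `run_regasm` stands in for invoking RegAsm.exe, and the file layout is invented:

```python
import os

def register_if_updated(dll_path, stamp_path, run_regasm):
    """Re-register a COM-interop DLL only when it has changed: compare the
    DLL's modification time with the time recorded in a stamp file, and if
    the DLL is newer, run RegAsm and update the stamp."""
    mtime = os.path.getmtime(dll_path)
    last = 0.0
    if os.path.exists(stamp_path):
        with open(stamp_path) as f:
            last = float(f.read())
    if mtime > last:
        run_regasm(dll_path)               # register the updated DLL
        with open(stamp_path, "w") as f:
            f.write(str(mtime))            # leave the bread crumb
        return True
    return False
```

Because the stamp file is per-node, a failover to a node that hasn't seen the new DLL triggers registration there too.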
It is relatively simple to take a 10-year-old VB6 application and upgrade parts of it to use VB.Net for high performance and fault tolerance. The DLL's can continue to be used when the base server code is rewritten.
A major change to the back-end databases, user and administrative web apps, and jobs was put in place May 12, 2008. The change was to replace sending billing records to the bursar with pages pre-purchased with LionCash accounts.
Last Updated November 21, 2014