This article describes an ambitious project related to the chess game.
The purpose of the project is not a chess playing engine; it is a chess database and a front-end program to explore it (with a dedicated web site).
Other articles on chess from the same author are available in this wiki.
Building the database
The building of the database should include the loading of MANY games (~106 and more if possible).
Games played by good players are preferred.
The loaded games should include (besides the sequence of moves), the result and the name and elo rating of the players.
Each game will be processed so that the 20 (or more...) first positions generated by the moves are stored in the database.
Note: A PGN file typically takes less than 1Kbyte. Multiple compressed PGN files (in a zip set) would take around 0.2 Kbyte per game.
In a tree (in computing science and in life science), the anastomosis is the reconnection of two streams that previously branched out.
In chess, which is tree game, anastomosis is highly present, but most of the time poorly handled.
Anastomosis brings complexity, because it eliminates several convenient properties of tree management. But it also provides wide benefits.
In order to manage a large number of positions, a good indexing scheme is required. An original and performing scheme is documented here, and provides a way to index any position on a compact bit set.
The number of bits fully defining a position is between 78 and 113, depending on the complexity of the position, but this is optimal in terms of compacity (would be hard to have it more compact).
The position indexing scheme is used to build huge HashSets of positions.
For each position in the database, the following fields at least should be present:
- the number of paths leading to that position (anastomosis factor)
- the various last moves leading to that position (and links to the corresponding previous positions)
- the various next moves played from that position, and for each of them
- the number of times they were played
- the average Elo rating of the players who chose them
- the results of the games (white/draw/black)
- an uncertainty measure for this position, using an entropic measure
- a global white/black balance measure (immediate value)
- the minimax value of the recursive immediate value.
- a global dynamic measure (measure of likelihood of a draw game)
- (the indexing key)
The size of the information stored by position should be around 250 bytes (including linked objects)
An efficient use of storage would then allow to manage up to 4x106 anastomo-positions per terabyte.
Specific data structures
The specific data structures involved in the project include
- back moves (5 bytes + 1 key per object)
- forward moves (20 bytes + 1 key per object)
- anastomo-positions (~80 bytes + 1 key per object)
[ a key is at most 15 bytes ]
The user interface should
- display all the information present in the database (listed above)
- allow backward links to previous positions for all last moves
- allow forward links to next positions for all next moves
Obviously the user interface should be fast, user friendly, visually attractive.
The relationship between the data server and the front-end is probably complex and should inlcude an aefficient usage of caching tools.
It is beyond the perimeter of this article.
Java / GWT / Collections.
The java libraries of PG would provide fast development path.
Between 0.75 and 2.5 year men, depending on programmer efficiency, front-end architecture choices,...
A small anastomotic database (with 50K positions) would be relevant to develop the anastomochess.
Downloader. This program collects a large numbre of PGN files (up to ~106).Most of the work may be achieved with some simple bash script commands (Scripts/icofy)
GameCompress. This program transforms a large number of PGN files into a unique massive file. The games selected have to
- include at least 25 moves for both sides
- have been played by 2 players higher than 2000 elo
The size of the resulting file is around 80 Mbytes per million of games.
PosLoader. Builds and fills the anastomotic tree, loading the compressed game files. Requires significant memory. THIS is the key program!
FrontEnd. A JavaFX animated web aplication, dynamiccaly displaying the positions and moves.
All objects are subclasses of fixBitSet.
Since fixBitSet is a subclass of BitSet, and since BitSet are backed, by arrays of Long objects, the actual memory consumption of these objects will always be mutiple of 64 bits (that's bad).
PosKey. The position identification, using at most 113 bits, and in average 100 bits.
Game. Compressed version of a game. Contains 2x25x4x3 bits (2 sides - 25 moves - 4 data per move, 2 rows and 2 cols - 3 bits for row or col identification) for the game, plus the result, and the ranking of both players. Roughly 620 bits per game, thus 80 bytes.
BackMove. Contains the key for the previous position, plus 12 bits for the back move. 112 bits in average.
ForwMove. Contains the key for the next position, plus 12 bits for the forward move, plus various technical attributes (around 5 integers, using Hufint compression). Globally ~ 160 bits in average.These objects will consume most of the memory when the PosLoader program runs.
Position. A posKey, plus a set of BackMoves, a set of ForwMove, and various technical attributes (~6 float values) . The size of a position object strongly depends on the forking density in the global tree. If 5 BackMove and 5 ForwMove are assumed as average values, then the average Position size should be around 1650 bits (100+5x112+5x160+6x32). A conservative estimation is then 250 bytes.
Sizing global capacity
Assuming that the main program (PosLoader) takes advantage of 4 Gb RAM, it should be able to manipulate 16x106 positions. With 1 million (106) loaded games, 2x25 moves per game and a average repetition factor of 3 (questionable), the number of positions would be 2x25x106 / 3 , which is indeed ~ 16x106.
The computing time will be a surprise. The PosLoader program will mostly work in pure RAM, but anyway the processing of 106 games and building of 16x106 positions might take some time.